DS414

Big Data and Distributed Data Analysis

Barcelona Campus

Aug 01, 2022 - Aug 19, 2022

During this course, students will build and sharpen their knowledge of the core technologies of the modern Big Data landscape: HDFS, MapReduce, Hive, and Spark.

Faculty Profiles

Oleg Ivchenko

BigData system administrator at Yandex-CERN partnership

Julia Ivanova

Machine Learning Software Engineer, Information Analysis Centre of the Ministry of Emergency Situations

Mikhail Anukhin

Practical lecturer at MIPT

Course length: 3 weeks

Duration: 3 hours per day

Total hours: 45 hours

Credits: 6 ECTS

Language: English

Course type: Offline

Fee for single course: €1500

Fee for degree students: €750

Skills you’ll learn

Big Data, Data Analysis, Spark DataFrame, Distributed Filesystem, MapReduce Tasks, Real-time Data Processing Pipeline Building

Overview

During this course, students will build and sharpen their knowledge of the core technologies of the modern Big Data landscape: HDFS, MapReduce, Hive, and Spark (especially real-time Spark Streaming). A particular focus of the course is efficient data warehousing using Hive and Spark.

Under the teachers' supervision, students will study the internals of these systems and their applications, starting with distributed file systems: why they exist and how they are used. They will also practise with the MapReduce framework, a workhorse of many modern Big Data applications. A key element of the course is the opportunity to put this knowledge into practice by processing texts and solving sample business cases. Finally, participants will work with Spark, the next-generation computational framework, from its basic concepts up to advanced techniques for squeezing out maximum performance.
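To give a feel for the MapReduce paradigm applied to text processing, here is a minimal single-machine sketch in Python. It is an illustration only, not course material: the function names (map_phase, shuffle_phase, reduce_phase, word_count) are hypothetical, and a real Hadoop job would run the map and reduce steps in parallel across a cluster, with the framework handling the shuffle.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one line of text."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group emitted values by key (done by the framework in Hadoop)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence on an in-memory list of lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical sample input, standing in for a file stored on HDFS.
    sample = ["big data is big", "data at scale"]
    print(word_count(sample))
```

The point of the split is that map_phase and reduce_phase each look at one line or one key at a time, which is what allows a framework to distribute them across many machines.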

Learning highlights

Graduates will be able to:

Construct their own Big Data service using Big Data frameworks

Optimise a data warehouse for storage and processing

Create their own real-time data processing pipeline

Apply the acquired skills in finance, social networks, telecommunications and many other fields

Course outline

15 classes

Dive into the details of the course and get a sense of what each class will cover.

Monday

Session 1 - online

What is BigData? Working with distributed file systems (HDFS)

Tuesday

Session 2 - online

MapReduce paradigm. Basic knowledge

Wednesday

Session 3 - online

MapReduce paradigm. Advanced elements

Thursday

Session 4 - online

MapReduce paradigm. API knowledge, practical examples

Friday

Session 5

SQL over Big Data: Hive. Basic constructs

Monday

Session 6

SQL over Big Data: Hive extensions (Hive Streaming, UDFs). Not only Hive

Tuesday

Session 7

SQL over Big Data: Different data formats; practical cases

Wednesday

Session 8

Spark. In-memory computational model. RDD API

Thursday

Session 9

Spark. DataFrame API, SQL

Friday

Session 10

Big Data application examples and Spark optimisation

Monday

Session 11

Real-Time computations over Big Data. Spark Streaming

Tuesday

Session 12

Real-Time message processing. Apache Kafka and its connection with Spark

Wednesday

Session 13

Real-Time message processing. Kafka Streams.

Guest speaker Ivan Ponomarev, Staff software engineer at Synthesized.io, senior Java lecturer at MIPT

Thursday

Session 14

NoSQL over Big Data. Apache HBase and how it works with Hadoop. Apache Cassandra.

Friday

Session 15

NoSQL over Big Data. Working with Apache Cassandra and Spark.

Course materials

Books

Hadoop: The Definitive Guide: Storage and Analysis at Internet Scale

Tom White

Learning Spark: Lightning-fast Data Analytics

Jules Damji, Brooke Wenig, Tathagata Das, Denny Lee

MapReduce Design Patterns: Building Effective Algorithms and Analytics for Hadoop and Other Systems

Donald Miner, Adam Shook

Programming Hive: Data Warehouse and Query Language for Hadoop

Edward Capriolo, Dean Wampler, Jason Rutherglen

Prerequisites

Programming experience in Python. Python is required to complete programming assignments.

Basic Java and/or Scala knowledge. Most Hadoop ecosystem frameworks are written in Java or Scala, so basic experience in these languages helps when diving deep into these services.

Unix basics. The Hadoop ecosystem is deployed on computational servers, which usually run Linux, so learners should have at least minimal experience with Linux.

Git. Modern programming is unthinkable without a version control system, and the most popular one is Git. During the course, we will emulate a real team development process (with Git branches, merge requests, etc.).

Methodology

In the modern world of Big Data, employers require practical skills and experience alongside theoretical knowledge. Cloud providers such as Amazon AWS, Google Cloud and Microsoft Azure make it ever easier to spin up your own cluster, but it is still a challenge to gain practical skills without wasting CPU cycles, money and time. This course is therefore built around programming assignments that are evaluated on pre-configured clusters, so students can focus on the assignments themselves instead of cluster sizing and configuration. All materials will remain available to students after the course.

Grading

The final grade will be composed of the following criteria:

80% - Assignments & Quizzes

20% - Activity

Each subject of study (MapReduce, Hive, Spark, ...) will be covered by programming assignments. The grading is based on the points students can get by completing the assignments successfully.

Faculty

Oleg Ivchenko

BigData system administrator at Yandex-CERN partnership

Oleg is a Senior Lecturer at the Department of Algorithms and Programming Technologies. He started working with Big Data in 2015. He is now Head of the BigData course at the department and co-developer of the testing framework for the “Big Data for Data Engineers” Coursera specialisation. He is also a Hadoop and HPC administrator at the Yandex-CERN partnership.

Under the direction of Alexey Dral, he developed HJudge, a testing system for applications in the Hadoop ecosystem (Rospatent num. 2016660616). The next generation of this testing framework is used for autonomous testing of students' applications in this course.

Faculty

Julia Ivanova

Machine Learning Software Engineer, Information Analysis Centre of the Ministry of Emergency Situations

Julia completed her Bachelor's and Master's degrees at the Moscow Institute of Physics and Technology. Her first love at school was physics, but at some point the world of computer science lured her to its side. Now, in parallel with her work in industry, she teaches several CS courses at her university.

Faculty

Mikhail Anukhin

Practical lecturer at MIPT

Mikhail works at the Department of Industrial Data Analysis in Retail. He developed and taught a course on the Fundamentals of Distributed Systems Theory and Designing Data-intensive Applications. He has led many courses at MIPT, including “Theory and Practice of Concurrent Computing”, “Algorithms and Data Structures” and “Foundations of Programming”. He also took part in producing several Big Data online courses for the Higher School of Economics and Innopolis University.

Apply for this course

Snap up your chance to enroll before all spaces fill up.

How to secure your spot

Complete the form below to kickstart your application

Schedule your Harbour.Space interview

If successful, get ready to join us on campus

FAQ

Will I receive a certificate after completion?

Yes. Upon completion of the course, you will receive a certificate signed by the director of the program your course belonged to.

Do I need a visa?

This depends on your case. Please check with the Spanish or Thai consulate in your country of residence about visa requirements. We will do our part to provide you with the necessary documents, such as the Certificate of Enrollment.

Can I get a discount?

Yes. The easiest way to enroll in a course at a discounted price is to register for multiple courses. Registering for multiple courses will reduce the cost per individual course. Please ask the Admissions Office for more information about the other kinds of discounts we offer and what you can do to receive one.