Distributed Data Management and Scalable Analytics

Kyriakos Efthymiadis, Stijn Vansummeren
Course description
Course Content:

The course will cover the following topics:

  • The big data ecosystem and its programming models
    • MapReduce
    • Spark
    • Spark Streaming and Storm
    • Lambda architectures
  • Fundamental concepts of big data processing
    • Consistency and availability
    • Distributed query evaluation
    • Stream processing and sublinear algorithms
  • Distributed supervised machine learning methods
    • Locally weighted linear regression
    • Logistic regression
    • Naive Bayes
    • Random forests
    • Support vector machines
  • Distributed unsupervised machine learning methods
    • Clustering (k-means)
    • PCA/SVD
  • Distributed feature selection
    • Forward feature selection
    • Minimum redundancy and maximum relevance
  • Complexity analysis
    • Single versus multi-core complexities
    • Communication and parallelism tradeoffs
  • Online learning, active learning
    • Online models
    • Concept drift
    • Semi-supervised learning and active learning
  • Graph analytics
    • Recommendation systems, collaborative filtering, SVD
    • PageRank
    • Clustering and community detection
  • Deep learning
    • Deep neural networks
    • Breakthroughs in deep learning
    • Convolution and pooling layers, dropout, rectified linear units
    • Machine learning toolboxes: TensorFlow, Theano, Caffe
Learning Outcomes: 

After successful completion of this course, the student:

  1. Understands the characteristics of big data and the challenges these pose.
  2. Knows the principal architectures of Big Data Management and Analytics Systems (BDMAS), is able to explain the purpose of each of their components, and is able to recognize and explain the key properties, strengths and limitations of each type of BDMAS and its components.
  3. Understands the key bottlenecks in managing and analyzing massive amounts of data and is familiar with modern algorithms for overcoming these bottlenecks using parallel and distributed computation.
  4. Is able to actively use this algorithmic knowledge in the design and implementation of applications that solve common data management and analytics problems using different types of BDMAS.
  5. Is able to build applications using specific instances of each type of BDMAS.
  6. In addition, is able to use established software frameworks for reproducing and sharing her/his results, including virtualization software (Docker), version control systems (Git), and notebooks (Jupyter, Zeppelin).

The final grade is based on the following categories:
Other Exam determines 100% of the final mark.

Within the Other Exam category, the following assignments need to be completed:

  • Other exam with a relative weight of 1, which comprises 100% of the final mark.
