Distributed Data Management and Scalable Analytics | Artificial Intelligence Lab Brussels

Professors

Kyriakos Efthymiadis, Stijn Vansummeren

The course will cover the following topics:

The big data ecosystem and its programming models
- MapReduce
- Spark
- Spark Streaming and Storm
- Lambda architectures
Fundamental concepts of big data processing
- Consistency and availability
- Distributed query evaluation
- Stream processing and sublinear algorithms
Distributed supervised machine learning methods
- Locally weighted linear regression
- Logistic regression
- Naive Bayes
- Random forests
- Support vector machines
Distributed unsupervised machine learning methods
- Clustering (k-means)
- PCA/SVD
Distributed feature selection
- Forward feature selection
- Minimum redundancy and maximum relevance
Complexity analysis
- Single versus multi-core complexities
- Communication and parallelism tradeoffs
Online learning, active learning
- Online models
- Concept drift
- Semi-supervised learning and active learning
Graph analytics
- Recommendation systems, collaborative filtering, SVD
- PageRank
- Clustering and community detection
Deep learning
- Deep neural networks
- Breakthroughs in deep learning
- Convolution and pooling layers, dropout, rectified linear units
- Machine learning toolboxes: TensorFlow, Theano, Caffe

After successful completion of this course, the student:

Understands the characteristics of big data and the challenges these characteristics pose.
Knows the principal architectures of Big Data Management and Analytics Systems (BDMAS), is able to explain the purpose of each of their components, and is able to recognize and explain the key properties, strengths, and limitations of each type of BDMAS and its components.
Understands the key bottlenecks in managing and analyzing massive amounts of data and is familiar with modern algorithms for overcoming these bottlenecks using parallel and distributed computation.
Is able to actively use this algorithmic knowledge in the design and implementation of applications that solve common data management and analytics problems using different types of BDMAS.
Is able to build applications using specific instances of each type of BDMAS.
In addition, is able to use established software frameworks for reproducing and sharing her/his results, including virtualization software (Docker), version control systems (Git), and notebooks (Jupyter, Zeppelin).