Course-ID: CS4830

Big Data Laboratory

Instructor: Prof. B. Ravindran

Description

This course introduces students to the practical aspects of large-scale analytics, i.e., big data. It begins with a basic introduction to big data concepts spanning hardware, systems, and software, and then covers the topics listed below.

Course Content

  • Introduction to Big Data concepts: divide-and-conquer, parallel algorithms, distributed virtualized storage, distributed resource management, orchestration and scheduling, lambda architecture, data flow paradigm, real-time event processing.
  • Big Data Technology: MapReduce using Python, Spark for batch processing, Spark SQL, data flow processing libraries (Beam, Spark Streaming, Flink).
  • Hardware Concepts: Shared-nothing MPP architecture, Cloud architecture, GPU-based acceleration and processing.
  • Analytics at Large Scale: Libraries of algorithms including Spark MLlib and H2O; integrations with TensorFlow and PyTorch; ML on the cloud; use of Zeppelin and Databricks notebooks.
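
As an illustration of the MapReduce and divide-and-conquer ideas listed above, the following is a minimal single-machine sketch of a word count in plain Python. It is not course material: the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical and chosen only for this example, and a real framework would run the map and reduce steps in parallel across many machines.

```python
from collections import defaultdict


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: combine the values of one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence on a list of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "Big lab"]))
```

Because each line is mapped independently and each key is reduced independently, both steps can in principle be split across workers, which is the property that distributed engines exploit.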