Hands-On Data Engineering

Skills you will gain

Big Data Ecosystem Mastery: Gain practical experience with Hadoop, Spark, Kafka, Hive, and NoSQL databases like HBase.
MapReduce & Spark Development: Build and optimize data processing jobs using MapReduce and Apache Spark.
Real-Time Stream Processing: Develop streaming applications with Apache Spark for near real-time analytics.
SQL for Big Data: Use Hive to build ETL pipelines and perform large-scale data analysis with SQL-based tools.
Cloud-Based Architecture Fluency: Understand and apply cloud-friendly architectures for scalable Big Data solutions.

Course Description

Formerly: Data Engineering with Hadoop

Big Data platforms are distributed systems that can process large amounts of data across clusters of servers. They are being used across industries in internet startups and established enterprises. In this comprehensive course, you will get up to speed on the use of current Big Data platforms and gain insights into cloud-based Big Data architectures. We will cover Hadoop, Spark, Kafka and other Big Data platforms based on SQL, such as Hive.

The first half of the course includes an overview of the frameworks for MapReduce, Spark, Kafka, and Hive as well as some aspects of Python programming. You will learn how to write MapReduce/Spark jobs and how to optimize data processing applications and become familiar with SQL based tools for Big Data. We use Hive to build ETL jobs. The course also includes the fundamentals of NoSQL databases like HBase and Kafka.

The second half of the course covers stream processing capability and developing streaming applications with Apache Spark. You'll learn how to process large amounts of data using DataFrame, Apache Spark's structured data processing programming model that provides simple, powerful APIs. In addition to batch and iterative data processing, Apache Spark also supports stream processing, which enables companies to extract interesting and useful business insights at near real-time.

The course consists of interactive lectures, hands-on labs in class, and take home practice exercises. Upon completion of this course, you will possess a strong understanding of the tools used to build Big Data applications using MapReduce, Spark, and Hive.

Topics

Big Data applications architecture
Understanding Hadoop distributed file system (HDFS)
How MapReduce framework works
Introduction to HBase (Hadoop NoSQL database)
Introduction to Apache Kafka
Introduction to Spark and SparkSQL
Developing Spark/SparkSQL and Hive applications
Managing tables and query development in Hive
Introduction to data pipelines

Prerequisites / Skills Needed

Basic SQL skills and the ability to create simple programs in a modern programming language, like Python are required. An understanding of database, parallel or distributed computing is helpful.

Additional Information

This course uses AWS EMR and Databricks for Spark, Hive and HDFS programming. Students are required to have accounts with AWS and Databricks.

AI* - This course has students build AI agents to design and implement data pipelines as part of an introduction to data pipeline concepts.

Syllabus Library

Venkat Mavram

Jan. 13 - Mar. 17, Tuesday, 6:30pm - 9:30pm

Live-Online

Details

DBDA.X424.(16)

Schedule

Date	Start Time	End Time	Meeting Type	Location
Tue, 01-13-2026	6:30pm	9:30pm	Live-Online	REMOTE
Tue, 01-20-2026	6:30pm	9:30pm	Live-Online	REMOTE
Tue, 01-27-2026	6:30pm	9:30pm	Live-Online	REMOTE
Tue, 02-03-2026	6:30pm	9:30pm	Live-Online	REMOTE
Tue, 02-10-2026	6:30pm	9:30pm	Live-Online	REMOTE
Tue, 02-17-2026	6:30pm	9:30pm	Live-Online	REMOTE
Tue, 02-24-2026	6:30pm	9:30pm	Live-Online	REMOTE
Tue, 03-03-2026	6:30pm	9:30pm	Live-Online	REMOTE
Tue, 03-10-2026	6:30pm	9:30pm	Live-Online	REMOTE
Tue, 03-17-2026	6:30pm	9:30pm	Live-Online	REMOTE

This class is offered in an online synchronous format. Students are expected to log into this course via Canvas at the start time of scheduled meetings and participate via Zoom, for the duration of each scheduled class meeting.

To see all meeting dates, click "Full Schedule" below.

You will be granted access in Canvas to your course site and course materials approximately 24 hours prior to the published start date of the course.

Required Tools & Materials: Tools: Students will need to create accounts in AWS and Databricks. Use EMR in AWS and notebooks in Databricks.

Students are required to have computers with 64Bit processors and a minimum of 8GB of memory.

Recommended Texts:
Hadoop: The Definitive Guide, 4th Edition, Tom White, O'Reilly Media, 2015, ISBN-10: 1-4919-0163-2, ISBN-13: 978-1-4919-0163-2

Learning Spark: Lightning-Fast Big Data Analysis, Karau, Konwinski, Wendell and Zaharia, 2015, O'Reilly. ISBN-10: 1449358624.

Programming Hive, Capriolo, Wampler & Rutherglen, 2012, O'Reilly. ISBN-10: 1449319335

Data Algorithms: Recipes for Scaling Up with Hadoop and Spark, Mahmout Parsian, 2015, O'Reilly Media. Print ISBN:978-1-4919-0618-7

Hands-On Data Engineering

Skills you will gain

Course Description

Topics

Prerequisites / Skills Needed

This course applies to these programs:

Computer Programming

Data Science and Data Analytics

Data Engineering

Hands-On Data Engineering

Skills you will gain

Course Description

Topics

Prerequisites / Skills Needed

This course applies to these programs:

Computer Programming

Data Science and Data Analytics

Data Engineering

Ask A Question