ETL and Data Pipelines with Shell, Airflow and Kafka (Coursera)

Offered by IBM

After taking this course, you will be able to describe two different approaches to converting raw data into analytics-ready data. One approach is the Extract, Transform, Load (ETL) process. The other contrasting approach is the Extract, Load, and Transform (ELT) process. ETL processes apply to data warehouses and data marts. ELT processes apply to data lakes, where the data is transformed on demand by the requesting/calling application.

Both ETL and ELT extract data from source systems, move the data through the data pipeline, and store the data in destination systems. During this course, you will experience how ELT and ETL processing differ and identify use cases for both.
You will identify methods and tools used for extracting the data, merging extracted data either logically or physically, and importing data into data repositories. You will also define transformations to apply to source data to make the data credible, contextual, and accessible to data users. You will be able to outline several methods for loading data into the destination system, verifying data quality, monitoring load failures, and using recovery mechanisms in case of failure.
Finally, you will complete a shareable final project that enables you to demonstrate the skills you acquired in each module.

Course 11 of 13 in the IBM Data Engineering Professional Certificate

Syllabus

WEEK 1
Data Processing Techniques
ETL, or Extract, Transform, and Load, processes are used for cases where flexibility, speed, and scalability of data are important. You will explore some key differences between two similar processes, ETL and ELT, which include the place of transformation, flexibility, Big Data support, and time-to-insight.
You will learn that an increasing demand for access to raw data drives the evolution from ETL to ELT. Data extraction involves advanced technologies including database querying, web scraping, and APIs. You will also learn that data transformation is about formatting data to suit the application and that data is loaded in batches or streamed continuously.
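
The difference in the place of transformation can be sketched in a few lines. The following is a minimal illustration only, not material from the course: it assumes Python's built-in sqlite3 module as a stand-in for the destination system, and the table names, field names, and sample rows are all hypothetical.

```python
import sqlite3

# Hypothetical raw records; names and values are invented for illustration.
RAW_ROWS = [("alice", " 42.50 "), ("bob", "17"), ("carol", " 8.25")]


def run_etl(conn):
    """ETL: transform inside the pipeline, then load only the clean result."""
    conn.execute("CREATE TABLE sales_clean (customer TEXT, amount REAL)")
    # float() ignores surrounding whitespace, so the amounts are cleaned here.
    transformed = [(name.title(), float(amount)) for name, amount in RAW_ROWS]
    conn.executemany("INSERT INTO sales_clean VALUES (?, ?)", transformed)
    return conn.execute(
        "SELECT customer, amount FROM sales_clean ORDER BY customer"
    ).fetchall()


def run_elt(conn):
    """ELT: load the raw values untouched, transform when they are queried."""
    conn.execute("CREATE TABLE sales_raw (customer TEXT, amount TEXT)")
    conn.executemany("INSERT INTO sales_raw VALUES (?, ?)", RAW_ROWS)
    # The transformation runs inside the destination, on demand.
    return conn.execute(
        "SELECT upper(substr(customer, 1, 1)) || substr(customer, 2), "
        "CAST(trim(amount) AS REAL) FROM sales_raw ORDER BY customer"
    ).fetchall()


etl_result = run_etl(sqlite3.connect(":memory:"))
elt_result = run_elt(sqlite3.connect(":memory:"))
```

Both functions return the same analytics-ready rows. In the ELT variant the raw table stays available, so a different application could apply a different transformation to it later.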

WEEK 2
ETL & Data Pipelines: Tools and Techniques
Extract, transform, and load (ETL) pipelines can be created with Bash scripts and run on a schedule using cron. Data pipelines move data from one place, or form, to another. Data pipeline processes include scheduling or triggering, monitoring, maintenance, and optimization. Batch pipelines extract and operate on batches of data, whereas streaming data pipelines ingest data packets one by one in rapid succession. In this module, you will learn that streaming pipelines apply when the most current data is needed. You will explore how parallelization and I/O buffers help mitigate bottlenecks. You will also learn how to describe data pipeline performance in terms of latency and throughput.
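
As a rough sketch of how latency and throughput relate to a bottleneck, the snippet below models a pipeline as a list of per-packet stage times. It is an idealized model under stated assumptions, not a measurement tool: stages are assumed to be decoupled by buffers so they overlap in steady state, a stage with n parallel workers is assumed to be n times faster, and the stage times are hypothetical.

```python
def latency(stage_seconds):
    """Time for one packet to pass through every stage in sequence."""
    return sum(stage_seconds)


def throughput(stage_seconds, workers=None):
    """Steady-state packets per second, limited by the slowest stage.

    `workers` optionally gives the number of parallel workers per stage.
    Treating n workers as an n-fold speedup is an idealization.
    """
    workers = workers or [1] * len(stage_seconds)
    effective = [seconds / n for seconds, n in zip(stage_seconds, workers)]
    return 1 / max(effective)


# Hypothetical per-packet times in seconds: extract, transform, load.
stages = [0.5, 2.0, 0.25]

single_latency = latency(stages)                # 2.75 seconds per packet
baseline = throughput(stages)                   # 0.5 packets per second
parallelized = throughput(stages, [1, 4, 1])    # 2.0 packets per second
```

In this model, running four workers on the slow transform stage raises throughput fourfold while the latency of an individual packet is unchanged.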

WEEK 3
Building Data Pipelines using Airflow
The key advantage of Apache Airflow's approach to representing data pipelines as DAGs is that they are expressed as code, which makes your data pipelines more maintainable, testable, and collaborative. Tasks, the nodes in a DAG, are created by implementing Airflow's built-in operators.
In this module, you will learn that Apache Airflow has a rich UI that simplifies working with data pipelines. You will explore how to visualize your DAG in graph or tree mode. You will also learn about the key components of a DAG definition file, and you will learn that Airflow logs are saved to the local file system and can then be sent to cloud storage, search engines, and log analyzers.
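
The idea of a pipeline expressed as a DAG in code can be illustrated without Airflow itself. The sketch below is a toy model and does not use the Airflow API: it assumes Python's standard graphlib module (Python 3.9+) to order the tasks, and the task names, functions, and sample values are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def extract(context):
    context["rows"] = ["3", "4"]  # hypothetical raw values


def transform(context):
    context["values"] = [int(row) for row in context["rows"]]


def load(context):
    context["total"] = sum(context["values"])


# Tasks are the nodes of the DAG.
TASKS = {"extract": extract, "transform": transform, "load": load}

# Each task maps to the set of upstream tasks it depends on (the edges).
DEPENDENCIES = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}


def run_pipeline(dependencies, tasks):
    """Run every task after all of its upstream tasks have finished."""
    order = list(TopologicalSorter(dependencies).static_order())
    context = {}
    for name in order:
        tasks[name](context)
    return order, context


order, context = run_pipeline(DEPENDENCIES, TASKS)
```

Because the dependencies are plain data in a source file, they can be reviewed, versioned, and tested like any other code, which is the maintainability benefit described above.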

WEEK 4
Building Streaming Pipelines using Kafka
Apache Kafka is a very popular open source event streaming platform. An event is a type of data that describes an entity's observable state updates over time. Popular Kafka service providers include Confluent Cloud, IBM Event Streams, and Amazon MSK. Additionally, the Kafka Streams API is a client library that supports data processing in event streaming pipelines.
In this module, you will learn that the core components of Kafka are brokers, topics, partitions, replications, producers, and consumers. You will explore two special types of processors in the Kafka Streams API stream-processing topology: the source processor and the sink processor. You will also learn about building event streaming pipelines using Kafka.
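
To make the relationship between topics, partitions, producers, and consumers concrete, here is a small in-memory toy model. It is not the Kafka client API and leaves out brokers and replication entirely. The class, its methods, and the event keys are hypothetical, and zlib.crc32 is used only as a simple deterministic stand-in for a real partitioner's hash function.

```python
import zlib


class Topic:
    """Toy model of a topic: a named set of append-only partitions."""

    def __init__(self, name, num_partitions):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]

    def produce(self, key, value):
        """Append an event to the partition chosen by hashing the key.

        Returns (partition index, offset of the event in that partition).
        """
        index = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[index].append((key, value))
        return index, len(self.partitions[index]) - 1

    def consume(self, partition, offset):
        """Return every event in one partition from `offset` onward."""
        return self.partitions[partition][offset:]


topic = Topic("toll", num_partitions=3)

# Events with the same key land in the same partition, in order.
first_partition, first_offset = topic.produce("car-1", "entered")
second_partition, second_offset = topic.produce("car-1", "exited")

events = topic.consume(first_partition, 0)
```

A consumer in this model tracks its own offset per partition, so it can resume from where it stopped by calling consume with the next unread offset.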

WEEK 5
Final Assignment
In this final assignment module, you will apply your newly gained knowledge in two hands-on labs: “Creating ETL Data Pipelines using Apache Airflow” and “Creating Streaming Data Pipelines using Kafka”. You will build these pipelines using real-world scenarios.
You will extract, transform, and load data into a CSV file. You will also create a topic named “toll” in Apache Kafka, download and customize a streaming data consumer, and verify that streaming data has been collected in the database table.

Related Courses

Introduction to Big Data with Spark and Hadoop (Coursera)
IBM

Bernard Marr defines Big Data as the digital trace that we are generating in this digital era. In this course, you will learn about the characteristics of Big Data and its application in Big Data Analytics. You will gain an understanding of the features, benefits, limitations, and applications of some of the Big Data processing tools. You’ll explore how Hadoop and Hive help leverage the benefits of Big Data while overcoming some of the challenges it poses.

Jun 8th 2026
5-12 Weeks

Data Engineer (Dataquest)
Dataquest

Get all the skills and knowledge you need to become a data engineer. You’ll learn how to work with data architecture, data processing, and data systems. By the end, you’ll be able to build a unique data infrastructure, manage data pipelines and data processing, and maintain data systems.

Self-Paced

Machine Learning Operations 2 (MLOps2-AML): Data Pipeline Automation & Optimization using Microsoft Azure Machine Learning (AML) (edX)
Statistics.com

Most data science projects fail. There are various reasons why, but one of the primary reasons is the challenge of deployment. One piece to the deployment puzzle is understanding how to automate your pipeline’s functions and continuously optimize its performance, which is why we developed this course, MLOps2: Data Pipeline Automation & Optimization using Microsoft Azure Machine Learning (AML).

Self-Paced

Customising your models with TensorFlow 2 (Coursera)
Imperial College London

Welcome to this course on Customising your models with TensorFlow 2! In this course you will deepen your knowledge and skills with TensorFlow, in order to develop fully customised deep learning models and workflows for any application. You will use lower level APIs in TensorFlow to develop complex model architectures, fully customised layers, and a flexible data workflow. You will also expand your knowledge of the TensorFlow APIs to include sequence models.

Jun 8th 2026
5-12 Weeks

Data Engineering Capstone Project (Coursera)
IBM

In this course you will apply a variety of data engineering skills and techniques you have learned as part of the previous courses in the IBM Data Engineering Professional Certificate. You will assume the role of a Junior Data Engineer who has recently joined the organization and be presented with a real-world use case that requires a data engineering solution.

Jun 1st 2026
5-12 Weeks

Healthcare Data Models (Coursera)
University of California, Davis

Career prospects are bright for those qualified to work in healthcare data analytics. Perhaps you work in data analytics, but are considering a move into healthcare where your work can improve people’s quality of life. If so, this course gives you a glimpse into why this work matters, what you’d be doing in this role, and what takes place on the Path to Value, where data is gathered from patients at the point of care, moves into data warehouses to be prepared for analysis, then moves along the data pipeline to be transformed into valuable insights that can save lives, reduce costs, improve healthcare, and make it more accessible and affordable.

Jun 8th 2026
4 Weeks

Analytical Solutions to Common Healthcare Problems (Coursera)
University of California, Davis

In this course, we’re going to go over analytical solutions to common healthcare problems. I will review these business problems and you’ll build out various data structures to organize your data. We’ll then explore ways to group data and categorize medical codes into analytical categories. You will then be able to extract, transform, and load data into the data structures required for solving medical problems, and to harmonize data from multiple sources.

Jun 8th 2026
4 Weeks

Data Warehouse Concepts, Design, and Data Integration (Coursera)
University of Colorado System

This is the second course in the Data Warehousing for Business Intelligence specialization. Ideally, the courses should be taken in sequence. In this course, you will learn exciting concepts and skills for designing data warehouses and creating data integration workflows. These are fundamental skills for data warehouse developers and administrators. You will gain hands-on experience with data warehouse design and use open source products for manipulating pivot tables and creating data integration workflows.

Jun 8th 2026
5-12 Weeks

Distributed Computing with Spark SQL (Coursera)
University of California, Davis

This course is for students who have SQL experience and now want to take the next step in gaining familiarity with distributed computing using Spark. Students will gain an understanding of when to use Spark and how Spark as an engine uniquely combines Data and AI technologies at scale. The four modules build on one another, and by the end of the course the student will understand Spark architecture, Spark DataFrames, optimizing the reading and writing of data, and how to build a machine learning model.

Jan 13th 2025
4 Weeks