Coursera

ETL and Data Pipelines with Shell, Airflow and Kafka (Coursera)

Offered by IBM

After taking this course, you will be able to describe two different approaches to converting raw data into analytics-ready data. One approach is the Extract, Transform, Load (ETL) process. The other contrasting approach is the Extract, Load, and Transform (ELT) process. ETL processes apply to data warehouses and data marts. ELT processes apply to data lakes, where the data is transformed on demand by the requesting/calling application.


Both ETL and ELT extract data from source systems, move the data through the data pipeline, and store the data in destination systems. During this course, you will experience how ELT and ETL processing differ and identify use cases for both.
You will identify methods and tools used for extracting the data, merging extracted data either logically or physically, and for importing data into data repositories. You will also define transformations to apply to source data to make the data credible, contextual, and accessible to data users. You will be able to outline some of the multiple methods for loading data into the destination system, verifying data quality, monitoring load failures, and the use of recovery mechanisms in case of failure.
Finally, you will complete a shareable final project that enables you to demonstrate the skills you acquired in each module.

Course 11 of 13 in the IBM Data Engineering Professional Certificate

Syllabus

WEEK 1
Data Processing Techniques
ETL, or Extract, Transform, and Load, processes are used for cases where flexibility, speed, and scalability of data are important. You will explore some key differences between two similar processes, ETL and ELT, including the place of transformation, flexibility, Big Data support, and time-to-insight.
You will learn that there is an increasing demand for access to raw data that drives the evolution from ETL to ELT. Data extraction involves advanced technologies including database querying, web scraping, and APIs. You will also learn that data transformation is about formatting data to suit the application and that data is loaded in batches or streamed continuously.
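The "place of transformation" difference can be illustrated with a small sketch. This is a toy model, not course material: the rows, the function names, and the use of plain lists as a "warehouse" and a "lake" are all assumptions made for illustration.

```python
# Toy comparison of ETL and ELT; all names and data here are hypothetical.

RAW_ROWS = [
    {"name": " Alice ", "amount": "10.50"},
    {"name": "bob", "amount": "3"},
]


def extract():
    """Pretend to read rows from a source system."""
    return [dict(row) for row in RAW_ROWS]


def transform(row):
    """Clean one row: trim and title-case the name, parse the amount."""
    return {"name": row["name"].strip().title(), "amount": float(row["amount"])}


def run_etl(warehouse):
    """ETL: transform inside the pipeline, then load the cleaned rows."""
    warehouse.extend(transform(row) for row in extract())


def run_elt(lake):
    """ELT: load the raw rows unchanged; transformation is deferred."""
    lake.extend(extract())


def query_lake(lake):
    """ELT consumers transform on demand, at read time."""
    return [transform(row) for row in lake]


warehouse, lake = [], []
run_etl(warehouse)
run_elt(lake)
```

In this sketch the warehouse holds only cleaned rows, while the lake keeps the raw rows and `query_lake` applies the same transformation when the data is requested.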

WEEK 2
ETL & Data Pipelines: Tools and Techniques
Extract, transform, and load (ETL) pipelines can be created with Bash scripts that are run on a schedule using cron. Data pipelines move data from one place, or form, to another. Data pipeline processes include scheduling or triggering, monitoring, maintenance, and optimization. Batch pipelines extract and operate on batches of data, whereas streaming data pipelines ingest data packets one by one in rapid succession. In this module, you will learn that streaming pipelines apply when the most current data is needed. You will explore how parallelization and I/O buffers help mitigate bottlenecks. You will also learn how to describe data pipeline performance in terms of latency and throughput.
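As a rough illustration of latency, throughput, and the effect of parallelizing a bottleneck, consider the idealized model below. The stage names and timings are hypothetical, and the model assumes fully pipelined stages and perfect scaling across parallel workers, which real pipelines rarely achieve.

```python
# Hypothetical per-record processing times for three pipeline stages, in seconds.
STAGE_SECONDS = {"extract": 0.25, "transform": 1.0, "load": 0.5}


def latency(stage_seconds):
    """Time for one record to pass through every stage in sequence."""
    return sum(stage_seconds.values())


def throughput(stage_seconds, workers=None):
    """Records per second in steady state, limited by the slowest stage.

    `workers` maps a stage name to the number of parallel workers for it;
    stages not listed are assumed to have a single worker.
    """
    workers = workers or {}
    effective = [secs / workers.get(name, 1) for name, secs in stage_seconds.items()]
    return 1.0 / max(effective)


base = throughput(STAGE_SECONDS)
doubled = throughput(STAGE_SECONDS, {"transform": 2})
quadrupled = throughput(STAGE_SECONDS, {"transform": 4})
```

Under these assumptions, doubling the workers on the slowest stage doubles throughput, but adding more workers beyond that point does not help, because the load stage becomes the new bottleneck.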

WEEK 3
Building Data Pipelines using Airflow
The key advantage of Apache Airflow's approach to representing data pipelines as DAGs is that they are expressed as code, which makes your data pipelines more maintainable, testable, and collaborative. Tasks, the nodes in a DAG, are created by implementing Airflow's built-in operators.
In this module, you will learn that Apache Airflow has a rich UI that simplifies working with data pipelines. You will explore how to visualize your DAG in graph or tree mode. You will also learn about the key components of a DAG definition file, and that Airflow logs are saved to the local file system and can then be sent to cloud storage, search engines, and log analyzers.
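The idea of expressing a pipeline as a DAG in code can be sketched without Airflow itself. The example below is not Airflow's API; it uses Python's standard-library `graphlib` to order hypothetical tasks by their dependencies, which is the same underlying concept a DAG definition file encodes.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each key is a task; its value is the set of tasks that must finish first.
# Task names are hypothetical.
DEPENDENCIES = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"load"},
}


def run_pipeline(dependencies, tasks):
    """Run each task callable once, in an order that respects dependencies."""
    executed = []
    for name in TopologicalSorter(dependencies).static_order():
        tasks[name]()
        executed.append(name)
    return executed


log = []
# Each task just records that it ran; real tasks would do actual work.
TASKS = {name: (lambda n=name: log.append(f"ran {n}")) for name in DEPENDENCIES}
order = run_pipeline(DEPENDENCIES, TASKS)
```

Because the dependencies live in ordinary code, they can be reviewed, versioned, and unit-tested like any other module, which is the maintainability benefit the module description refers to.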

WEEK 4
Building Streaming Pipelines using Kafka
Apache Kafka is a very popular open source event streaming platform. An event is a type of data that describes an entity’s observable state updates over time. Popular Kafka service providers include Confluent Cloud, IBM Event Streams, and Amazon MSK. Additionally, the Kafka Streams API is a client library that supports data processing in event streaming pipelines.
In this module, you will learn that the core components of Kafka are brokers, topics, partitions, replications, producers, and consumers. You will explore two special types of processors in the Kafka Streams API stream-processing topology: the source processor and the sink processor. You will also learn about building event streaming pipelines using Kafka.
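How topics, partitions, producers, and consumers relate can be shown with an in-memory toy model. This is not the Kafka client API: the classes are hypothetical, `zlib.crc32` stands in for a real partitioner, and brokers and replication are not modeled at all.

```python
import zlib


class Topic:
    """A named log split into partitions; each partition is append-only."""

    def __init__(self, name, num_partitions):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]

    def partition_for(self, key):
        # Stand-in partitioner: the same key always maps to the same partition.
        return zlib.crc32(key.encode("utf-8")) % len(self.partitions)


class Producer:
    def send(self, topic, key, value):
        """Append an event and return its (partition, offset)."""
        p = topic.partition_for(key)
        topic.partitions[p].append((key, value))
        return p, len(topic.partitions[p]) - 1


class Consumer:
    """Tracks its own read offset for every partition of one topic."""

    def __init__(self, topic):
        self.topic = topic
        self.offsets = [0] * len(topic.partitions)

    def poll(self):
        """Return events not yet seen and advance the offsets."""
        events = []
        for p, log in enumerate(self.topic.partitions):
            events.extend(log[self.offsets[p]:])
            self.offsets[p] = len(log)
        return events


toll = Topic("toll", num_partitions=2)
producer = Producer()
first = producer.send(toll, "vehicle-1", "passed plaza A")
second = producer.send(toll, "vehicle-1", "passed plaza B")
consumer = Consumer(toll)
```

Events with the same key land in the same partition, so their relative order is preserved, and a consumer that polls twice receives each event only once because it remembers its offsets.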

WEEK 5
Final Assignment
In this final assignment module, you will apply your newly gained knowledge in two hands-on labs: “Creating ETL Data Pipelines using Apache Airflow” and “Creating Streaming Data Pipelines using Kafka”. You will build these pipelines using real-world scenarios.
You will extract, transform, and load data into a CSV file. You will also create a topic named “toll” in Apache Kafka, download and customize a streaming data consumer, and verify that streaming data has been collected in the database table.


Related Courses

Coursera

University of California, Davis

Distributed Computing with Spark SQL (Coursera)

Data Science

This course is for students who have SQL experience and now want to take the next step in gaining familiarity with distributed computing using Spark. Students will gain an understanding of when to use Spark and how Spark as an engine uniquely combines Data and AI technologies at scale. The four modules build on one another, and by the end of the course the student will understand Spark architecture, Spark DataFrames, optimizing reading/writing data, and how to build a machine learning model.

Jan 13th 2025

4 Weeks

Computing SQL Machine Learning

Coursera

Icahn School of Medicine at Mount Sinai

Big Data Science with the BD2K-LINCS Data Coordination and Integration Center (Coursera)

Health & Society Science

In this course we briefly introduce the DCIC and the various Centers that collect data for LINCS. We then cover metadata and how metadata is linked to ontologies. We then present data processing and normalization methods to clean and harmonize LINCS data. This is followed by discussions about how data is served through RESTful APIs. Most importantly, the course covers computational methods including data clustering, gene-set enrichment analysis, interactive data visualization, and supervised learning. Finally, we introduce crowdsourcing/citizen-science projects where students can work together in teams to extract expression signatures from public databases and then query such collections of signatures against LINCS data for predicting small molecules as potential therapeutics.

Jul 27th 2026

5-12 Weeks

Network Analysis Big Data

EdX

AI (Pragmatic AI Labs)

Cloud Data Engineering (edX)

Computer Science

Master data engineering for cloud-native applications through distributed systems, big data, and serverless technologies.

Self Paced

Self-Paced

Cloud Computing ETL Cloud Storage

OpenSAP

SAP

Freedom of Data with SAP Data Hub (OpenSAP)

CS: Information & Technology Data Science

Join this free open online course to learn about SAP Data Hub. The course will provide you with an overview of the architecture as well as the installation/deployment options, and is aimed at application developers, data warehouse modelers, data engineers, data scientists, and technical business analysts.

Self Paced

Self-Paced

Data Big Data Data Integration

EdX

IBM

Apache Spark for Data Engineering and Machine Learning (edX)

Computer Science

This short course introduces you to the fundamentals of Data Engineering and Machine Learning with Apache Spark, including Spark Structured Streaming, ETL for Machine Learning (ML) Pipelines, and Spark ML. By the end of the course, you will have hands-on experience applying Spark skills to ETL and ML workflows.

Self Paced

Self-Paced

ML Machine Learning Apache Spark

Coursera

IBM

Análise de dados com Python (Coursera)

Statistics & Data Analysis Data Science

Learn how to analyze data using Python. This course covers everything from the basics of Python to exploring different types of data. You will learn how to prepare data for analysis, perform simple statistical analyses, create meaningful data visualizations, predict future trends from data, and much more!

Oct 2nd 2023

5-12 Weeks

Python Data Analysis Polynomial Regression

Coursera

Whizlabs

AWS Data Processing (Coursera)

Statistics & Data Analysis Data Science

The AWS: Data Processing course is the second course of the AWS Certified Data Analytics Specialty Specialization. This course focuses on providing data processing solutions. The entire course is designed to teach learners the concepts of EMR and Extract, Transform, and Load. This course also puts emphasis on ETL services and data processing solutions in AWS.

Jul 22nd 2024

3 Weeks

Data Analysis ETL Data Processing

Coursera

IBM

Databases and SQL for Data Science with Python (Coursera)

Statistics & Data Analysis Data Science

Much of the world's data resides in databases. SQL (or Structured Query Language) is a powerful language which is used for communicating with and extracting data from databases. A working knowledge of databases and SQL is a must if you want to become a data scientist. The purpose of this course is to introduce relational database concepts and help you learn and apply foundational knowledge of the SQL language. It is also intended to get you started with performing SQL access in a data science environment.

Jul 6th 2026

4 Weeks

Programming Python Databases

Coursera

Imperial College London

Customising your models with TensorFlow 2 (Coursera)

Data Science

Welcome to this course on Customising your models with TensorFlow 2! In this course you will deepen your knowledge and skills with TensorFlow, in order to develop fully customised deep learning models and workflows for any application. You will use lower level APIs in TensorFlow to develop complex model architectures, fully customised layers, and a flexible data workflow. You will also expand your knowledge of the TensorFlow APIs to include sequence models.

Aug 3rd 2026

5-12 Weeks

Machine Learning Modeling APIs

Coursera

Microsoft

Data Integration with Microsoft Azure Data Factory (Coursera)

Statistics & Data Analysis Data Science

In this course, you will learn how to create and manage data pipelines in the cloud using Azure Data Factory. This course is part of a Specialization intended for Data engineers and developers who want to demonstrate their expertise in designing and implementing data solutions that use Microsoft Azure data services. It is ideal for anyone interested in preparing for the DP-203: Data Engineering on Microsoft Azure exam (beta).

Jan 6th 2025

5-12 Weeks

Microsoft Azure Data Integration Azure

Coursera

University of California, Davis

Healthcare Data Models (Coursera)

Health & Society CS: Information & Technology

Career prospects are bright for those qualified to work in healthcare data analytics. Perhaps you work in data analytics but are considering a move into healthcare, where your work can improve people’s quality of life. If so, this course gives you a glimpse into why this work matters, what you’d be doing in this role, and what takes place on the Path to Value, where data is gathered from patients at the point of care, moves into data warehouses to be prepared for analysis, then moves along the data pipeline to be transformed into valuable insights that can save lives, reduce costs, and improve healthcare and make it more accessible and affordable.

Aug 3rd 2026

4 Weeks

Healthcare Healthcare Data Data Models

Coursera

University of Colorado System

Clinical Data Models and Data Quality Assessments (Coursera)

Statistics & Data Analysis Data Science

This course aims to teach the concepts of clinical data models and common data models. Upon completion of this course, learners will be able to interpret and evaluate data model designs using Entity-Relationship Diagrams (ERDs), differentiate between data models and articulate how each is used to support clinical care and data science, and create SQL statements in Google BigQuery to query the MIMIC3 clinical data model and the OMOP common data model.

Jul 27th 2026

5-12 Weeks

Data Models ERD Data Quality