Coursera

Introduction to Big Data with Spark and Hadoop (Coursera)

Offered by IBM

Bernard Marr defines Big Data as the digital trace that we are generating in this digital era. In this course, you will learn about the characteristics of Big Data and its application in Big Data Analytics. You will gain an understanding of the features, benefits, limitations, and applications of some of the Big Data processing tools. You’ll explore how Hadoop and Hive help leverage the benefits of Big Data while overcoming some of the challenges it poses.

Hadoop is an open-source framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. Hive, a data warehouse software, provides an SQL-like interface to efficiently query and manipulate large data sets residing in various databases and file systems that integrate with Hadoop.
Apache Spark is an open-source processing engine built around speed, ease of use, and analytics that gives users new ways to store and make use of big data. In this course, you will discover how to leverage Spark to deliver reliable insights. The course provides an overview of the platform, going into the different components that make up Apache Spark.
In this course, you will also learn about Resilient Distributed Datasets, or RDDs, that enable parallel processing across the nodes of a Spark cluster.

This course is part of multiple programs
This course can be applied to multiple Specializations or Professional Certificate programs. Completing this course will count towards your learning in any of those programs.

What You Will Learn

Deep insight into the impact of Big Data including use cases, tools, and processing methods.
Knowledge of the Apache Hadoop architecture, ecosystem, and practices, and the use of applications including HDFS, HBase, Spark, and MapReduce.
Know-how to apply Spark programming basics, including parallel programming basics for DataFrames, data sets, and Spark SQL.
Proficiency with Spark’s RDDs, data sets, use of Catalyst and Tungsten to optimize SparkSQL, and Spark’s development and runtime environment options.

Syllabus

WEEK 1
What is Big Data?
Begin your acquisition of Big Data knowledge with the most up-to-date definition of Big Data. You’ll explore the impact of Big Data on everyday personal tasks and business transactions with Big Data Use Cases. Learn how Big Data uses Parallel Processing, Scaling, and Data Parallelism. Learn about commonly used Big Data tools. Then, go beyond the hype and explore additional Big Data viewpoints.

WEEK 2
Introduction to the Hadoop Ecosystem
In this module, you'll gain a fundamental understanding of the Apache Hadoop architecture, ecosystem, practices, and commonly used applications, including the Hadoop Distributed File System (HDFS), MapReduce, Hive, and HBase. Gain practical skills in this module's lab when you launch a single-node Hadoop cluster using Docker and run MapReduce jobs.
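
As a rough illustration of the MapReduce model mentioned in this module, the sketch below counts words in plain Python. It is not Hadoop code and is not taken from the course: the function names (map_phase, shuffle_phase, reduce_phase, word_count) are hypothetical, and everything runs in a single process, whereas Hadoop distributes these phases across a cluster.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence on an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["big data with hadoop", "big data with spark"]
    print(word_count(sample))
```

Each phase only looks at its own input, which is what lets a real framework run many mappers and reducers on different machines at once.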

WEEK 3
Apache Spark
Build your skills when you turn your attention to the popular Apache Spark platform. Explore the attributes and benefits of Apache Spark and distributed computing. You'll gain key insights about functional programming and Lambda functions. Explore Resilient Distributed Datasets (RDDs), Parallel Programming, and resilience in Apache Spark, and relate RDDs and Parallel Programming to Apache Spark. Dive into additional Apache Spark components and learn how Apache Spark scales with Big Data. Working with Big Data signals the need for working with queries, including structured queries using SQL. Learn about the functions, parts, and benefits of Spark SQL and DataFrame queries, and discover how DataFrames work with SparkSQL.
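
To make the link between lambda functions and parallel programming more concrete, here is a minimal sketch in plain Python rather than Spark. The helper names (split_into_partitions, parallel_sum_of_squares) are hypothetical, and a thread pool merely stands in for cluster workers; in CPython, threads show the structure of data parallelism but do not speed up CPU-bound work.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def split_into_partitions(data, num_partitions):
    """Split a list into roughly equal chunks, one per worker."""
    size = max(1, -(-len(data) // num_partitions))  # ceiling division, at least 1
    return [data[i:i + size] for i in range(0, len(data), size)]


def parallel_sum_of_squares(data, num_partitions=4):
    """Square every element per partition, then combine the partial sums."""
    square = lambda x: x * x      # pure function applied to each element
    add = lambda a, b: a + b      # associative function used to combine results

    partitions = split_into_partitions(data, num_partitions)
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        # Each worker reduces its own partition independently of the others.
        partial_sums = list(
            pool.map(lambda part: reduce(add, map(square, part), 0), partitions)
        )
    # A final reduce merges the per-partition results into one value.
    return reduce(add, partial_sums, 0)


if __name__ == "__main__":
    print(parallel_sum_of_squares(list(range(1, 11))))
```

Because the lambdas have no side effects and addition is associative, the partitions can be processed in any order, which is the property that distributed engines rely on.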

WEEK 4
DataFrames and SparkSQL
Learn about Resilient Distributed Datasets (RDDs), their uses in Apache Spark, and RDD transformations and actions. You'll compare the use of datasets with Spark's latest data abstraction, DataFrames. You'll learn to identify and apply basic DataFrame operations. Explore Apache Spark SQL optimization. Learn how Spark SQL and memory optimization benefit from using Catalyst and Tungsten. Learn how to create a table view and apply data aggregation techniques. Fortify your skills in the guided hands-on lab.

WEEK 5
Development and Runtime Environment Options
Explore how Spark processes the requests that your application submits. Learn how you can track work using the Spark Application UI. Because Spark application work happens on the cluster, you need to be able to identify Apache Spark cluster managers, their components, and their benefits, and to know how to connect with each cluster manager and how and when you might want to set up a local, standalone Spark instance. Next, learn about Apache Spark application submission, including use of Spark’s unified interface, ‘spark-submit’: describe and apply options for submitting applications, identify external application dependency management techniques, and list Spark Shell benefits. Developers now have the option of AIOps; discover how to use Spark within AIOps. Review recommended practices for Spark's static and dynamic configuration options. Round out your development knowledge with insights about Spark on Kubernetes. This module features hands-on Spark labs using IBM Cloud and Kubernetes.

WEEK 6
Monitoring & Tuning
Platforms and applications require monitoring and tuning to manage issues that inevitably happen. In this module, you'll learn about connecting to the Apache Spark user interface web server and using the same UI web server to manage application processes. Identify common Apache Spark application issues. Learn about debugging issues using the application UI and locating related log files. Discover and gain real-world knowledge about how Spark manages memory and processor resources via videos and in the available hands-on lab.

Related Courses

Coursera

University of Washington

Data Manipulation at Scale: Systems and Algorithms (Coursera)

Statistics & Data Analysis Data Science

Data analysis has replaced data acquisition as the bottleneck to evidence-based decision making --- we are drowning in it. Extracting knowledge from large, heterogeneous, and noisy datasets requires not only powerful computing resources, but the programming abstractions to use them effectively. The abstractions that emerged in the last decade blend ideas from parallel databases, distributed systems, and programming languages to create a new class of scalable data analytics platforms that form the foundation for data science at realistic scales.

Aug 3rd 2026

4 Weeks

Algebra Algorithms Databases

Coursera

Johns Hopkins University

Python for Genomic Data Science (Coursera)

Statistics & Data Analysis Data Science

This class provides an introduction to the Python programming language and the iPython notebook. This is the third course in the Genomic Big Data Science Specialization from Johns Hopkins University.

Aug 17th 2026

4 Weeks

Programming Python Big Data

Coursera

IBM

Introduction to Data Engineering (Coursera)

CS: Information & Technology

This course introduces you to the core concepts, processes, and tools you need to know in order to get a foundational knowledge of data engineering. You will gain an understanding of the modern data ecosystem and the role Data Engineers, Data Scientists, and Data Analysts play in this ecosystem. The Data Engineering Ecosystem includes several different components. It includes disparate data types, formats, and sources of data.

Aug 3rd 2026

4 Weeks

Databases NoSQL SQL

Coursera

University of California, San Diego

Code Free Data Science (Coursera)

Data Science

The Code Free Data Science class is designed for learners seeking to gain or expand their knowledge in the area of Data Science. Participants will receive the basic training in effective predictive analytic approaches accompanying the growing discipline of Data Science without any programming requirements. Machine Learning methods will be presented by utilizing the KNIME Analytics Platform to discover patterns and relationships in data.

Aug 3rd 2026

4 Weeks

Machine Learning Big Data Data Science

Coursera

Yonsei University

Spatial Data Science and Applications (Coursera)

Statistics & Data Analysis Data Science

Spatial (map) data is considered a core infrastructure of the modern IT world, as shown by the business transactions of major IT companies such as Apple, Google, Microsoft, Amazon, Intel, and Uber, and even motor companies such as Audi, BMW, and Mercedes. Consequently, they are bound to hire more and more spatial data scientists. Based on this business trend, this course is designed to give learners who have a basic knowledge of data science and data analysis a firm understanding of spatial data science, and eventually to differentiate their expertise from that of other data scientists and data analysts.

Aug 10th 2026

5-12 Weeks

Hadoop Geographic Information Systems GIS

Coursera

IBM

Introduction to NoSQL Databases (Coursera)

CS: Information & Technology

This course will provide you with technical hands-on knowledge of NoSQL databases and Database-as-a-Service (DaaS) offerings. With the advent of Big Data and agile development methodologies, NoSQL databases have gained a lot of relevance in the database landscape. Their main advantage is the ability to effectively handle scalability and flexibility issues raised by modern applications.

Aug 10th 2026

5-12 Weeks

MongoDB NoSQL NoSQL Databases

Coursera

EIT Digital

Security and Privacy for Big Data - Part 1 (Coursera)

Security & Networking

This course raises awareness of security in Big Data environments. You will discover cryptographic principles and mechanisms to manage access controls in your Big Data system. By the end of the course, you will be ready to plan your next Big Data project successfully, ensuring that all security-related issues are under control. You will look at decent-sized big data projects with security-skilled eyes and be able to recognize dangers. This will allow you to bring your systems to a mature and sustainable level.

Aug 10th 2026

1 Week

Cryptography Security Big Data

Coursera

Alibaba Cloud Academy

Big Data Analysis Deep Dive (Coursera)

CS: Information & Technology

The job market for architects, engineers, and analytics professionals with Big Data expertise continues to grow. The Academy’s Big Data Career path focuses on the fundamental tools and techniques needed to pursue a career in Big Data. This course includes: data processing with Python, writing and reading SQL queries, transmitting data with MaxCompute, analyzing data with Quick BI, using Hive, Hadoop, and Spark on E-MapReduce, and how to visualize data with data dashboards. Work through our course material, learn different aspects of the Big Data field, and get certified as a Big Data Professional!

Aug 3rd 2026

5-12 Weeks

Python SQL Big Data

Coursera

University of California, San Diego

Machine Learning With Big Data (Coursera)

Statistics & Data Analysis Data Science

Want to make sense of the volumes of data you have collected? Need to incorporate data-driven decisions into your process? This course provides an overview of machine learning techniques to explore, analyze, and leverage data. You will be introduced to tools and algorithms you can use to create machine learning models that learn from data, and to scale those models up to big data problems.

Aug 3rd 2026

5-12 Weeks

Algorithms Machine Learning Big Data

Coursera

IBM

Scalable Machine Learning on Big Data using Apache Spark (Coursera)

Data Science

This course will empower you with the skills to scale data science and machine learning (ML) tasks on Big Data sets using Apache Spark. Most real-world machine learning work involves very large data sets that go beyond the CPU, memory, and storage limitations of a single computer. Apache Spark is an open-source framework that leverages cluster computing and distributed storage to process extremely large data sets in an efficient and cost-effective manner. Therefore, an applied knowledge of working with Apache Spark is a great asset and potential differentiator for a Machine Learning engineer.

Aug 3rd 2026

4 Weeks

Artificial Intelligence Machine Learning Big Data

Coursera

École Polytechnique Fédérale de Lausanne

Functional Programming in Scala Capstone (Coursera)

Statistics & Data Analysis Data Science

In the final capstone project you will apply the skills you learned by building a large data-intensive application using real-world data. You will implement a complete application processing several gigabytes of data. This application will show interactive visualizations of the evolution of temperatures over time all over the world.

Aug 10th 2026

5-12 Weeks

Interface Scala Interactive

Coursera

University of Colorado System

Relational Database Support for Data Warehouses (Coursera)

Statistics & Data Analysis Data Science

Relational Database Support for Data Warehouses is the third course in the Data Warehousing for Business Intelligence specialization. In this course, you'll use analytical elements of SQL for answering business intelligence questions. You'll learn features of relational database management systems for managing summary data commonly used in business intelligence reporting. Because of the importance and difficulty of managing implementations of data warehouses, we'll also delve into storage architectures, scalable parallel processing, data governance, and big data impacts. In the assignments in this course, you can use either Oracle or PostgreSQL.

Aug 3rd 2026

5-12 Weeks

Databases SQL Big Data