Introduction to Big Data with Spark and Hadoop (Coursera)

Offered by IBM

Bernard Marr defines Big Data as the digital trace that we are generating in this digital era. In this course, you will learn about the characteristics of Big Data and its application in Big Data Analytics. You will gain an understanding of the features, benefits, limitations, and applications of some of the Big Data processing tools. You’ll explore how Hadoop and Hive help leverage the benefits of Big Data while overcoming some of the challenges it poses.

Hadoop is an open-source framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. Hive, a data warehouse software, provides an SQL-like interface to efficiently query and manipulate large data sets residing in various databases and file systems that integrate with Hadoop.
Apache Spark is an open-source processing engine built around speed, ease of use, and analytics, and it provides users with new ways to store and make use of big data. In this course, you will discover how to leverage Spark to deliver reliable insights. The course provides an overview of the platform, going into the different components that make up Apache Spark.
In this course, you will also learn about Resilient Distributed Datasets, or RDDs, that enable parallel processing across the nodes of a Spark cluster.

This course is part of multiple programs
This course can be applied to multiple Specializations or Professional Certificate programs, and completing it will count towards your learning in any of them.

What You Will Learn

  • Deep insight into the impact of Big Data including use cases, tools, and processing methods.
  • Knowledge of the Apache Hadoop architecture, ecosystem, and practices, and the use of applications including HDFS, HBase, Spark, and MapReduce.
  • Know-how to apply Spark programming basics, including parallel programming basics for DataFrames, data sets, and Spark SQL.
  • Proficiency with Spark’s RDDs, data sets, use of Catalyst and Tungsten to optimize Spark SQL, and Spark’s development and runtime environment options.

Syllabus

WEEK 1
What is Big Data?
Begin your acquisition of Big Data knowledge with the most up-to-date definition of Big Data. You’ll explore the impact of Big Data on everyday personal tasks and business transactions with Big Data Use Cases. Learn how Big Data uses Parallel Processing, Scaling, and Data Parallelism. Learn about commonly used Big Data tools. Then, go beyond the hype and explore additional Big Data viewpoints.

WEEK 2
Introduction to the Hadoop Ecosystem
In this module, you'll gain a fundamental understanding of the Apache Hadoop architecture, ecosystem, practices, and commonly used applications including the Hadoop Distributed File System (HDFS), MapReduce, Hive, and HBase. Gain practical skills in this module's lab when you launch a single-node Hadoop cluster using Docker and run MapReduce jobs.
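
To make the MapReduce model named in this module more concrete, here is a minimal sketch in plain Python rather than Hadoop itself. The function names (map_phase, shuffle_phase, reduce_phase, word_count) and the sample lines are illustrative assumptions; they are not taken from the course material or from any Hadoop API.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Group emitted values by key, the step that sits between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values collected for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence on an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical sample input, chosen only for illustration.
    sample = ["big data with hadoop", "big data with spark"]
    print(word_count(sample))
```

In a real cluster the map and reduce calls would run on different nodes and the shuffle would move data over the network; the sketch only shows the shape of the computation on one machine.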

WEEK 3
Apache Spark
Build your skills when you turn your attention to the popular Apache Spark platform. Explore the attributes and benefits of Apache Spark and distributed computing. You'll gain key insights about functional programming and lambda functions. Explore Resilient Distributed Datasets (RDDs), parallel programming, and resilience in Apache Spark, and see how RDDs and parallel programming relate to each other in Spark. Dive into additional Apache Spark components and learn how Apache Spark scales with Big Data. Working with Big Data signals the need for working with queries, including structured queries using SQL. Learn about the functions, parts, and benefits of Spark SQL and DataFrame queries, and discover how DataFrames work with Spark SQL.
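
As a rough illustration of how lambda functions and data parallelism fit together, the sketch below splits a list into partitions, applies the same function to each partition in a worker, and then combines the partial results. It uses only the Python standard library, not Spark; threads on one machine stand in for cluster nodes, and the names partition and parallel_sum_of_squares are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def partition(data, num_partitions):
    """Split a list into num_partitions chunks, one per worker."""
    return [data[i::num_partitions] for i in range(num_partitions)]


def parallel_sum_of_squares(data, num_partitions=4):
    # Lambda functions describe the per-element and the combining operations.
    square = lambda x: x * x
    add = lambda a, b: a + b

    def process(chunk):
        # Every worker applies the same functions to its own chunk.
        return reduce(add, map(square, chunk), 0)

    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partial_sums = list(pool.map(process, partition(data, num_partitions)))

    # Combine the per-partition results into one final answer.
    return reduce(add, partial_sums, 0)


if __name__ == "__main__":
    print(parallel_sum_of_squares(list(range(1, 11))))  # prints 385
```

Because addition is associative, the order in which partial sums are combined does not affect the result, which is what makes this kind of computation easy to spread across workers.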

WEEK 4
DataFrames and SparkSQL
Learn about Resilient Distributed Datasets (RDDs), their uses in Apache Spark, and RDD transformations and actions. You'll compare the use of datasets with Spark's latest data abstraction, DataFrames. You'll learn to identify and apply basic DataFrame operations. Explore Apache Spark SQL optimization. Learn how Spark SQL and memory optimization benefit from using Catalyst and Tungsten. Learn how to create a table view and apply data aggregation techniques. Fortify your skills in the guided hands-on lab.
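
The difference between transformations and actions can be sketched without Spark. In Spark, transformations such as map and filter are lazy and only describe a computation, while actions such as collect and count trigger it. The toy class below imitates that behaviour in plain Python; LazyDataset is a hypothetical name and is not part of the Spark API.

```python
class LazyDataset:
    """Toy dataset that records transformations and runs them only on an action."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)

    # Transformations: return a new dataset and compute nothing yet.
    def map(self, fn):
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, predicate):
        return LazyDataset(self._source, self._steps + (("filter", predicate),))

    def _run(self):
        # Build the pipeline of iterators from the recorded steps.
        items = iter(self._source)
        for kind, fn in self._steps:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return items

    # Actions: execute the recorded pipeline and return a result.
    def collect(self):
        return list(self._run())

    def count(self):
        return sum(1 for _ in self._run())


if __name__ == "__main__":
    ds = LazyDataset(range(6)).map(lambda x: x * 10).filter(lambda x: x >= 30)
    print(ds.collect())  # prints [30, 40, 50]
    print(ds.count())    # prints 3
```

Nothing is computed when map and filter are called; the work happens only inside collect and count, which is the property that lets an engine plan and optimize the whole pipeline before running it.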

WEEK 5
Development and Runtime Environment Options
Explore how Spark processes the requests that your application submits. Learn how you can track work using the Spark Application UI. Because Spark application work happens on the cluster, you need to be able to identify the cluster managers that Apache Spark supports, along with their components and benefits, and to know how to connect with each cluster manager and how and when you might want to set up a local, standalone Spark instance. Next, learn about Apache Spark application submission, including use of Spark’s unified interface, ‘spark-submit’: describe and apply options for submitting applications, identify external application dependency management techniques, and list Spark Shell benefits. Developers now also have the option of AIOps, so discover how to use Spark within AIOps. Review recommended practices for Spark's static and dynamic configuration options. Round out your development knowledge with insights about Spark on Kubernetes. This module features hands-on Spark labs using IBM Cloud and Kubernetes.

WEEK 6
Monitoring & Tuning
Platforms and applications require monitoring and tuning to manage issues that inevitably happen. In this module you'll learn about connecting to the Apache Spark user interface web server and using the same UI web server to manage application processes. Identify common Apache Spark application issues. Learn about debugging issues using the application UI and locating related log files. Discover and gain real-world knowledge about how Spark manages memory and processor resources via videos and in the available hands-on lab.

Related Courses

Big Data Analysis Deep Dive (Coursera)
Alibaba Cloud Academy

The job market for architects, engineers, and analytics professionals with Big Data expertise continues to grow. The Academy’s Big Data Career path focuses on the fundamental tools and techniques needed to pursue a career in Big Data. This course includes: data processing with Python, writing and reading SQL queries, transmitting data with MaxCompute, analyzing data with Quick BI, using Hive, Hadoop, and Spark on E-MapReduce, and how to visualize data with data dashboards. Work through our course material, learn different aspects of the Big Data field, and get certified as a Big Data Professional!

Jun 8th 2026
5-12 Weeks

Graph Analytics for Big Data (Coursera)
University of California, San Diego

Want to understand your data network structure and how it changes under different conditions? Curious to know how to identify closely interacting clusters within a graph? Have you heard of the fast-growing area of graph analytics and want to learn more? This course gives you a broad overview of the field of graph analytics so you can learn new ways to model, store, retrieve and analyze graph-structured data.

Jun 8th 2026
5-12 Weeks

Deploying Machine Learning Models (Coursera)
University of California, San Diego

In this course we will learn about Recommender Systems (which we will study for the Capstone project), and also look at deployment issues for data products. By the end of this course, you should be able to implement a working recommender system (e.g. to predict ratings, or generate lists of related products), and you should understand the tools and techniques required to deploy such a working system on real-world, large-scale datasets.

Jun 8th 2026
4 Weeks

Introduction and Programming with IoT Boards (Coursera)
Pohang University of Science and Technology - POSTECH

Internet of Things (IoT) is an emerging area of information and communications technology (ICT) involving many disciplines of computer science and engineering including sensors/actuators, communications networking, server platforms, data analytics and smart applications. IoT is considered to be an essential part of the 4th Industrial Revolution along with AI and Big Data. This course will be very useful to senior undergraduate and graduate students as well as engineers who are working in the industry.

Jun 8th 2026
5-12 Weeks

The Importance of Listening (Coursera)
Northwestern University

In this second MOOC in the Social Marketing Specialization - "The Importance of Listening" - you will go deep into the Big Data of social and gain a more complete picture of what can be learned from interactions on social sites. You will be amazed at just how much information can be extracted from a single post, picture, or video.

Jun 8th 2026
4 Weeks

Applications of Software Architecture for Big Data (Coursera)
University of Colorado Boulder

The course is intended for individuals who want to build a production-quality software system that leverages big data. You will apply the basics of software engineering and architecture to create a production-ready distributed system that handles big data. You will build a data-intensive distributed system composed of loosely coupled, highly cohesive applications.

Jun 8th 2026
4 Weeks

Arquitecturas de Big Data (Coursera)
Universidad de los Andes

The Arquitecturas de Big Data course aims to help you identify the characteristics of a Big Data solution, the data associated with these solutions, the required infrastructure, and scalable processing techniques. We will develop examples using infrastructures based on Hadoop and on Spark, keeping in mind the relevance of public-cloud platforms for supporting the scalability of these solutions.

Jun 8th 2026
4 Weeks

Getting Started with CyberGIS (Coursera)
University of Illinois at Urbana-Champaign

This course is intended to introduce students to CyberGIS—Geospatial Information Science and Systems (GIS)—based on advanced cyberinfrastructure as well as the state of the art in high-performance computing, big data, and cloud computing in the context of geospatial data science. Emphasis is placed on learning the cutting-edge advances of cyberGIS and its underlying geospatial data science principles.

Jun 8th 2026
4 Weeks

Healthcare Data Quality and Governance (Coursera)
University of California, Davis

Career prospects are bright for those qualified to work with healthcare data or as Health Information Management (HIM) professionals. Perhaps you work in data analytics but are considering a move into healthcare, or you work in healthcare but are considering a transition into a new role. In either case, Healthcare Data Quality and Governance will provide insight into how valuable data assets are protected to maintain data quality. This serves care providers, patients, doctors, clinicians, and those who carry out the business of improving health outcomes.

Jun 8th 2026
4 Weeks