Scalable Machine Learning on Big Data using Apache Spark (Coursera)

Offered by IBM

This course will empower you with the skills to scale data science and machine learning (ML) tasks on Big Data sets using Apache Spark. Most real-world machine learning work involves very large data sets that go beyond the CPU, memory and storage limitations of a single computer. Apache Spark is an open-source framework that leverages cluster computing and distributed storage to process extremely large data sets in an efficient and cost-effective manner. An applied knowledge of working with Apache Spark is therefore a great asset and potential differentiator for a Machine Learning engineer.

After completing this course, you will be able to:

  • gain a practical understanding of Apache Spark and apply it to machine learning problems involving both small and big data
  • understand how to write parallel code capable of running on thousands of CPUs
  • use large-scale compute clusters to apply machine learning algorithms to petabytes of data using Apache SparkML Pipelines
  • eliminate the out-of-memory errors that traditional machine learning frameworks generate when data doesn’t fit in a computer's main memory
  • test thousands of different ML models in parallel to find the best-performing one, a technique used by many successful Kagglers
  • (Optional) run SQL statements on very large data sets using Apache SparkSQL and the Apache Spark DataFrame API
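
The idea of testing many models in parallel can be sketched in plain Python. This is a minimal illustration, not course material and not Spark code: the toy data, the one-parameter model `y = slope * x`, and the helper `squared_error` are all hypothetical, and a thread pool stands in for the workers of a cluster.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical toy data, roughly following y = 3 * x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.1, 5.9, 9.2, 11.8]

def squared_error(slope):
    """Score one candidate model y = slope * x on the toy data."""
    return sum((slope * x - y) ** 2 for x, y in zip(xs, ys))

# Candidate models: slopes 0.0, 0.5, ..., 5.0.
candidates = [i * 0.5 for i in range(11)]

# Each candidate is scored independently of the others, so the work
# can be handed out to separate workers (threads, in this sketch).
with ThreadPoolExecutor(max_workers=4) as pool:
    errors = list(pool.map(squared_error, candidates))

# Keep the candidate with the lowest error.
best_slope, best_error = min(zip(candidates, errors), key=lambda pair: pair[1])
print(best_slope, best_error)
```

Because no candidate depends on another, the same pattern scales from a handful of threads to many machines; only the mechanism that distributes the work changes.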

Enrol now to learn the machine learning techniques for working with Big Data that have been successfully applied by companies like Alibaba, Apple, Amazon, Baidu, eBay, IBM, NASA, Samsung, SAP, TripAdvisor, Yahoo!, Zalando and many others.
NOTE: During the course you will practice running machine learning tasks hands-on on an Apache Spark cluster that IBM provides at no charge and that you can continue to use afterwards.
Course 2 of 6 in the IBM AI Engineering Professional Certificate.

Prerequisites:

  • basic python programming
  • basic machine learning (optional introduction videos are provided in this course as well)
  • basic SQL skills for optional content

Syllabus

WEEK 1
Introduction
This is an introduction to Apache Spark. You'll learn how Apache Spark works internally and how to use it for data processing. The RDD, the low-level API, is introduced in conjunction with parallel and functional programming. Then different types of data storage solutions are contrasted. Finally, Apache Spark SQL and its Catalyst optimizer and Tungsten execution engine are explained.

WEEK 2
Scaling Math for Statistics on Apache Spark
Apply basic statistical calculations using the Apache Spark RDD API to experience how parallelization in Apache Spark works.
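
As a rough illustration of why such calculations parallelize well, here is a plain-Python sketch (not Spark code, and not taken from the course) that computes a mean and standard deviation from per-partition summaries. The data, the `partitions` list, and the helpers `summarize` and `combine` are hypothetical; on an RDD the same shape would typically be expressed with `map` and `reduce`.

```python
from functools import reduce
import math

# Hypothetical sample data, split into "partitions" the way a cluster
# would hold slices of one large data set on different workers.
partitions = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0, 7.0, 8.0, 9.0]]

def summarize(partition):
    """Map step: reduce one partition to (count, sum, sum of squares)."""
    return (len(partition), sum(partition), sum(x * x for x in partition))

def combine(a, b):
    """Reduce step: merge two partial summaries element-wise."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])

# Each partition is summarized independently, then the small summaries
# are merged; the full data set never has to sit in one place.
count, total, total_sq = reduce(combine, map(summarize, partitions))

mean = total / count
# Population variance via E[x^2] - (E[x])^2.
variance = total_sq / count - mean ** 2
std_dev = math.sqrt(variance)
print(count, mean, std_dev)
```

The key property is that `combine` is associative and commutative, so partial results can be merged in any order. Note that the sum-of-squares formula is convenient for a sketch but can lose precision on data with a large mean and small spread.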

WEEK 3
Introduction to Apache SparkML
Learn the concept of machine learning pipelines to understand how Apache SparkML works programmatically.
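
The pipeline concept can be sketched in plain Python. This is a simplified analogy, assuming the usual split between estimators (which learn something in `fit`) and transformers (which only apply `transform`); every class below is hypothetical and far simpler than the SparkML classes of similar names.

```python
class Scale:
    """Hypothetical transformer: multiplies every value by a fixed factor."""
    def __init__(self, factor):
        self.factor = factor

    def transform(self, rows):
        return [x * self.factor for x in rows]


class Shift:
    """Hypothetical transformer: adds a fixed offset to every value."""
    def __init__(self, offset):
        self.offset = offset

    def transform(self, rows):
        return [x + self.offset for x in rows]


class MeanCenterer:
    """Hypothetical estimator: learns the mean, returns a fitted Shift."""
    def fit(self, rows):
        mean = sum(rows) / len(rows)
        return Shift(-mean)


class PipelineModel:
    """A fitted pipeline: applies each fitted stage in order."""
    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class Pipeline:
    """An unfitted pipeline: a list of transformers and estimators."""
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            # Estimators are fitted on the data as transformed so far;
            # plain transformers are used as they are.
            model = stage.fit(rows) if hasattr(stage, "fit") else stage
            rows = model.transform(rows)
            fitted.append(model)
        return PipelineModel(fitted)


model = Pipeline([Scale(2.0), MeanCenterer()]).fit([1.0, 2.0, 3.0])
print(model.transform([1.0, 2.0, 3.0]))
print(model.transform([4.0]))
```

Fitting returns a separate fitted object, so the parameters learned from the training data (here, the mean) are reused unchanged when new data is transformed.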

WEEK 4
Supervised and Unsupervised learning with SparkML
Apply supervised and unsupervised machine learning techniques using SparkML.


Related Courses

Process Mining: Data science in Action (Coursera)
Eindhoven University of Technology

Process mining is the missing link between model-based process analysis and data-oriented analysis techniques. Through concrete data sets and easy to use software the course provides data science knowledge that can be applied directly to analyze and improve processes in a variety of domains. Data science is the profession of the future, because organizations that are unable to use (big) data in a smart way will not survive. It is not sufficient to focus on data storage and data analysis. The data scientist also needs to relate data to process analysis.

Jun 1st 2026
5-12 Weeks

Machine Learning: Regression (Coursera)
University of Washington

Case Study - Predicting Housing Prices. In our first case study, predicting house prices, you will create models that predict a continuous value (price) from input features (square footage, number of bedrooms and bathrooms, ...). This is just one of the many places where regression can be applied. Other applications range from predicting health outcomes in medicine, stock prices in finance, and power usage in high-performance computing, to analyzing which regulators are important for gene expression.

Jun 1st 2026
5-12 Weeks

Learn to code with AI (Coursera)
Scrimba

Imagine waking up tomorrow as a web developer. What would you want to build? With AI tools like ChatGPT, you're already a developer, regardless of your experience, if you know how to work with them. So in this course, you'll build functional, interactive front-end projects while learning how to write effective prompts and debug and refine your code with the help of AI.

Jun 3rd 2026
2 Weeks

The Economics of AI (Coursera)
University of Virginia

The course introduces you to cutting-edge research in the economics of AI and the implications for economic growth and labor markets. We start by analyzing the nature of intelligence and information theory. Then we connect our analysis to modeling production and technological change in economics, and how these processes are affected by AI. Next we turn to how technological change drives aggregate economic growth, covering a range of scenarios including a potential growth singularity.

Jun 2nd 2026
5-12 Weeks

Data-driven Decision Making (Coursera)
PwC

Welcome to Data-driven Decision Making. In this course, you'll get an introduction to Data Analytics and its role in business decisions. You'll learn why data is important and how it has evolved. You'll be introduced to “Big Data” and how it is used. You'll also be introduced to a framework for conducting Data Analysis and what tools and techniques are commonly used. Finally, you'll have a chance to put your knowledge to work in a simulated business setting. This course was created by PricewaterhouseCoopers LLP with an address at 300 Madison Avenue, New York, New York, 10017.

Jun 1st 2026
4 Weeks

Advanced Algorithms and Complexity (Coursera)
University of California, San Diego; Higher School of Economics - HSE University

You've learned the basic algorithms now and are ready to step into the area of more complex problems and algorithms to solve them. Advanced algorithms build upon basic ones and use new ideas. We will start with network flows, which are used in more typical applications such as optimal matchings, finding disjoint paths and flight scheduling as well as more surprising ones like image segmentation in computer vision.

Jun 1st 2026
5-12 Weeks

Foundations of marketing analytics (Coursera)
ESSEC Business School

Who is this course for? This course is designed for students, business analysts, and data scientists who want to apply statistical knowledge and techniques to business contexts. For example, it may be suited to experienced statisticians, analysts, and engineers who want to move more into a business role, in particular in marketing. You will find this course exciting and rewarding if you already have a background in statistics, can use R or another programming language and are familiar with databases and data analysis techniques such as regression, classification, and clustering. However, it contains a number of recitals and R Studio tutorials which will consolidate your competences, enable you to play more freely with data and explore new features and statistical functions in R.

Jun 1st 2026
5-12 Weeks

Regression Models (Coursera)
Johns Hopkins University

Linear models, as their name implies, relate an outcome to a set of predictors of interest using linear assumptions. Regression models, a subset of linear models, are the most important statistical analysis tool in a data scientist’s toolkit. This course covers regression analysis, least squares and inference using regression models.

Jun 1st 2026
4 Weeks

A Crash Course in Data Science (Coursera)
Johns Hopkins University

By now you have definitely heard about data science and big data. In this one-week class, we will provide a crash course in what these terms mean and how they play a role in successful organizations. This class is for anyone who wants to learn what all the data science action is about, including those who will eventually need to manage data scientists. The goal is to get you up to speed as quickly as possible on data science without all the fluff. We've designed this course to be as convenient as possible without sacrificing any of the essentials.

Jun 1st 2026
1 Week

Teaching Impacts of Technology: Data Collection, Use, and Privacy (Coursera)
University of California, San Diego

In this course you’ll focus on how constant data collection and big data analysis have impacted us, exploring the interplay between using your data and protecting it, as well as thinking about what it could do for you in the future. This will be done through a series of paired teaching sections, exploring a specific “Impact of Computing” in your typical day and the “Technologies and Computing Concepts” that enable that impact, all at a K12-appropriate level.

Jun 3rd 2026
4 Weeks

Generative AI for Everyone (Coursera)
DeepLearning.AI

Instructed by AI pioneer Andrew Ng, Generative AI for Everyone offers his unique perspective on empowering you and your work with generative AI. Andrew will guide you through how generative AI works and what it can (and can’t) do. It includes hands-on exercises where you'll learn to use generative AI to help in day-to-day work and receive tips on effective prompt engineering, as well as learning how to go beyond prompting for more advanced uses of AI.

Jun 2nd 2026
3 Weeks

Data Science Companion (Coursera)
MathWorks

The Data Science Companion provides an introduction to data science. You will gain a quick background in data science and core machine learning concepts, such as regression and classification. You’ll be introduced to the practical knowledge of data processing and visualization using low-code solutions, as well as an overview of the ways to integrate multiple tools effectively to solve data science problems.

Jun 5th 2026
4 Weeks