Big Data Analysis with Scala and Spark (Coursera)

Big Data Analysis with Scala and Spark (Coursera)

Manipulating big data distributed over a cluster using functional concepts is rampant in industry, and is arguably one of the first widespread industrial uses of functional ideas. This is evidenced by the popularity of MapReduce and Hadoop, and most recently Apache Spark, a fast, in-memory distributed collections framework written in Scala. In this course, we'll see how the data parallel paradigm can be extended to the distributed case, using Spark throughout.

Class Deals by MOOC List - Click here and see Coursera's Active Discounts, Deals, and Promo Codes.

We'll cover Spark's programming model in detail, being careful to understand how and when it differs from familiar programming models, like shared-memory parallel collections or sequential Scala collections. Through hands-on examples in Spark and Scala, we'll learn when important issues related to distribution like latency and network communication should be considered and how they can be addressed effectively for improved performance.
Learning Outcomes. By the end of this course you will be able to:

  • read data from persistent storage and load it into Apache Spark,
  • manipulate data with Spark and Scala,
  • express algorithms for data analysis in a functional style,
  • recognize how to avoid shuffles and recomputation in Spark,

Recommended background: You should have at least one year programming experience. Proficiency with Java or C# is ideal, but experience with other languages such as C/C++, Python, Javascript or Ruby is also sufficient. You should have some familiarity using the command line. This course is intended to be taken after Parallel programming.
Course 4 of 5 in the Functional Programming in Scala Specialization.

Syllabus

WEEK 1
Getting Started + Spark Basics
Get up and running with Scala on your computer. Complete an example assignment to familiarize yourself with our unique way of submitting assignments. In this week, we'll bridge the gap between data parallelism in the shared memory scenario (learned in the Parallel Programming course, prerequisite) and the distributed scenario. We'll look at important concerns that arise in distributed systems, like latency and failure. We'll go on to cover the basics of Spark, a functionally-oriented framework for big data processing in Scala. We'll end the first week by exercising what we learned about Spark by immediately getting our hands dirty analyzing a real-world data set.

WEEK 2
Reduction Operations & Distributed Key-Value Pairs
This week, we'll look at a special kind of RDD called pair RDDs. With this specialized kind of RDD in hand, we'll cover essential operations on large data sets, such as reductions and joins.

WEEK 3
Partitioning and Shuffling
This week we'll look at some of the performance implications of using operations like joins. Is it possible to get the same result without having to pay for the overhead of moving data over the network? We'll answer this question by delving into how we can partition our data to achieve better data locality, in turn optimizing some of our Spark jobs.

WEEK 4
Structured data: SQL, Dataframes, and Datasets
With our newfound understanding of the cost of data movement in a Spark job, and some experience optimizing jobs for data locality last week, this week we'll focus on how we can more easily achieve similar optimizations. Can structured data help us? We'll look at Spark SQL and its powerful optimizer which uses structure to apply impressive optimizations. We'll move on to cover DataFrames and Datasets, which give us a way to mix RDDs with the powerful automatic optimizations behind Spark SQL.

Go to Class
MOOC List is learner-supported. When you buy through links on our site, we may earn an affiliate commission.

Related Courses

An Introduction to Interactive Programming in Python (Part 1) (Coursera) Coursera
Rice University

An Introduction to Interactive Programming in Python (Part 1) (Coursera)

This two-part course is designed to help students with very little or no computing background learn the basics of building simple interactive applications. Our language of choice, Python, is an easy-to learn, high-level computer language that is used in many of the computational courses offered on Coursera. To make learning Python easy, we have developed a new browser-based programming environment that makes developing interactive applications in Python simple.

Jun 15th 2026
5-12 Weeks
Text Mining and Analytics (Coursera) Coursera
University of Illinois at Urbana-Champaign

Text Mining and Analytics (Coursera)

This course will cover the major techniques for mining and analyzing text data to discover interesting patterns, extract useful knowledge, and support decision making, with an emphasis on statistical approaches that can be generally applied to arbitrary text data in any natural language with no or minimum human effort.

Jun 15th 2026
5-12 Weeks
Analytic Combinatorics (Coursera) Coursera
Princeton University

Analytic Combinatorics (Coursera)

Analytic Combinatorics teaches a calculus that enables precise quantitative predictions of large combinatorial structures. This course introduces the symbolic method to derive functional relations among ordinary, exponential, and multivariate generating functions, and methods in complex analysis for deriving accurate asymptotics from the GF equations. All the features of this course are available for free. It does not offer a certificate upon completion.

Jun 15th 2026
5-12 Weeks
HTML, CSS, and Javascript for Web Developers (Coursera) Coursera
Johns Hopkins University

HTML, CSS, and Javascript for Web Developers (Coursera)

Do you realize that the only functionality of a web application that the user directly interacts with is through the web page? Implement it poorly and, to the user, the server-side becomes irrelevant! Today’s user expects a lot out of the web page: it has to load fast, expose the desired service, and be comfortable to view on all devices: from a desktop computers to tablets and mobile phones. In this course, we will learn the basic tools that every web page coder needs to know. We will start from the ground up by learning how to implement modern web pages with HTML and CSS.

Jun 16th 2026
5-12 Weeks
Algorithmic Thinking (Part 2) (Coursera) Coursera
Rice University

Algorithmic Thinking (Part 2) (Coursera)

Experienced Computer Scientists analyze and solve computational problems at a level of abstraction that is beyond that of any particular programming language. This two-part class is designed to train students in the mathematical concepts and process of "Algorithmic Thinking", allowing them to build simpler, more efficient solutions to computational problems.

Jun 15th 2026
4 Weeks
Principles of Computing (Part 2) (Coursera) Coursera
Rice University

Principles of Computing (Part 2) (Coursera)

This two-part course introduces the basic mathematical and programming principles that underlie much of Computer Science. Understanding these principles is crucial to the process of creating efficient and well-structured solutions for computational problems. To get hands-on experience working with these concepts, we will use the Python programming language. The main focus of the class will be weekly mini-projects that build upon the mathematical and programming principles that are taught in the class.

Jun 15th 2026
4 Weeks
Process Data from Dirty to Clean (Coursera) Coursera
Google

Process Data from Dirty to Clean (Coursera)

This is the fourth course in the Google Data Analytics Certificate. These courses will equip you with the skills needed to apply to introductory-level data analyst jobs. In this course, you’ll continue to build your understanding of data analytics and the concepts and tools that data analysts use in their work. You’ll learn how to check and clean your data using spreadsheets and SQL as well as how to verify and report your data cleaning results. Current Google data analysts will continue to instruct and provide you with hands-on ways to accomplish common data analyst tasks with the best tools and resources.

Jun 16th 2026
5-12 Weeks
Cluster Analysis in Data Mining (Coursera) Coursera
University of Illinois at Urbana-Champaign

Cluster Analysis in Data Mining (Coursera)

Discover the basic concepts of cluster analysis, and then study a set of typical clustering methodologies, algorithms, and applications. This includes partitioning methods such as k-means, hierarchical methods such as BIRCH, and density-based methods such as DBSCAN/OPTICS. Moreover, learn methods for clustering validation and evaluation of clustering quality. Finally, see examples of cluster analysis in applications.

Jun 15th 2026
4 Weeks
Programming Languages, Part A (Coursera) Coursera
University of Washington

Programming Languages, Part A (Coursera)

This course is an introduction to the basic concepts of programming languages, with a strong emphasis on functional programming. The course uses the languages ML, Racket, and Ruby as vehicles for teaching the concepts, but the real intent is to teach enough about how any language “fits together” to make you more effective programming in any language -- and in learning new ones.

Jun 15th 2026
5-12 Weeks
Java for Android (Coursera) Coursera
Vanderbilt University

Java for Android (Coursera)

This MOOC teaches you how to program core features and classes from the Java programming language that are used in Android, which is the dominant platform for developing and deploying mobile device apps. In particular, this MOOC covers key Java programming language features that control the flow of execution through an app (such as Java’s various looping constructs and conditional statements), enable access to structured data (such as Java's built-in arrays and common classes in the Java Collections Framework, such as ArrayList and HashMap), group related operations and data into classes and interfaces (such as Java's primitive and user-defined types, fields, methods, generic parameters, and exceptions), customize the behavior of existing classes via inheritance and polymorphism (such as subclassing and overriding virtual methods).

Jun 16th 2026
4 Weeks
Share Data Through the Art of Visualization (Coursera) Coursera
Google

Share Data Through the Art of Visualization (Coursera)

This is the sixth course in the Google Data Analytics Certificate. These courses will equip you with the skills needed to apply to introductory-level data analyst jobs. You’ll learn how to visualize and present your data findings as you complete the data analysis process. This course will show you how data visualizations, such as visual dashboards, can help bring your data to life. You’ll also explore Tableau, a data visualization platform that will help you create effective visualizations for your presentations.

Jun 16th 2026
4 Weeks
Regression Modeling in Practice (Coursera) Coursera
Wesleyan University

Regression Modeling in Practice (Coursera)

This course focuses on one of the most important tools in your data analysis arsenal: regression analysis. Using either SAS or Python, you will begin with linear regression and then learn how to adapt when two variables do not present a clear linear relationship. You will examine multiple predictors of your outcome and be able to identify confounding variables, which can tell a more compelling story about your results. You will learn the assumptions underlying regression analysis, how to interpret regression coefficients, and how to use regression diagnostic plots and other tools to evaluate the quality of your regression model. Throughout the course, you will share with others the regression models you have developed and the stories they tell you.

Jun 19th 2026
4 Weeks