Spark, Hadoop, and Snowflake for Data Engineering (Coursera)

Offered by Duke University,
Spark, Hadoop, and Snowflake for Data Engineering (Coursera)

This is primarily aimed at first- and second-year undergraduates interested in engineering or science, along with high school students and professionals with an interest in programming. Gain the skills for building efficient and scalable data pipelines.

Class Deals by MOOC List - Click here and see Coursera's Active Discounts, Deals, and Promo Codes.

Explore essential data engineering platforms (Hadoop, Spark, and Snowflake) as well as learn how to optimize and manage them. Delve into Databricks, a powerful platform for executing data analytics and machine learning tasks, while honing your Python data science skills with PySpark. Finally, discover the key concepts of MLflow, an open-source platform for managing the end-to-end machine learning lifecycle, and learn how to integrate it with Databricks.
This course is designed for learners who want to pursue or advance their career in data science or data engineering, or for software developers or engineers who want to grow their data management skill set. In addition to the technologies you will learn, you will also gain methodologies to help you hone your project management and workflow skills for data engineering, including applying Kaizen, DevOps, and Data Ops methodologies and best practices.
With quizzes to test your knowledge throughout, this comprehensive course will help guide your learning journey to become a proficient data engineer, ready to tackle the challenges of today's data-driven world.

What you'll learn

  • Create scalable data pipelines (Hadoop, Spark, Snowflake, Databricks) for efficient data handling.
  • Optimize data engineering with clustering and scaling to boost performance and resource use.
  • Build ML solutions (PySpark, MLFlow) on Databricks for seamless model development and deployment.
  • Implement DataOps and DevOps practices for continuous integration and deployment (CI/CD) of data-driven applications, including automating processes.

Syllabus

Overview and Introduction to PySpark
This week, you will learn how to work with different data engineering platforms, such as Hadoop and Spark, and apply their concepts to real-world scenarios. First, you will explore the fundamentals of Hadoop to store and process big data. Next, you will delve into Spark concepts, distributed computing, deferred execution, and Spark SQL. By the end of the week, you will gain hands-on experience with PySpark DataFrames, DataFrame methods, and deferred execution strategies.

Snowflake
This week, you will explore the Snowflake platform, gaining insights into its architecture and key concepts. Through hands-on practice in the Snowflake Web UI, you'll learn to create tables, manage warehouses, and use the Snowflake Python Connector to interact with tables. By the end of this week, you'll solidify your understanding of Snowflake's architecture and practical applications, emerging with the ability to effectively navigate and leverage the platform for data management and analysis.

Azure Databricks and MLFLow
This week, you will practice the essential skills for seamlessly managing machine learning workflows using Databricks and MLFlow. First, you will create a Databricks workspace and configure a cluster, setting the stage for efficient data analysis. Next, you will load a sample dataset into the Databricks workspace using the power of PySpark, enabling data manipulation and exploration. Finally, you will install MLFlow either locally or within the Databricks environment, gaining the ability to orchestrate the entire machine learning lifecycle. By the end of this week, you will be able to craft, track, and manage machine learning experiments within Databricks, ensuring precision, reproducibility, and optimal decision-making throughout your data-driven journey.

DataOps and Operations Methodologies
This week, you will explore the concepts of Kaizen, DevOps, and DataOps and how these methodologies synergistically contribute to efficient and seamless data engineering workflows. Through practical examples, you will learn how Kaizen's continuous improvement philosophy, DevOps' collaborative practices, and DataOps' focus on data quality and integration converge to enhance the development, deployment, and management of data engineering platforms. By the end of this week, you will have the knowledge and perspective needed to optimize data engineering processes and deliver scalable, reliable, and high-quality solutions.

Go to Class
MOOC List is learner-supported. When you buy through links on our site, we may earn an affiliate commission.

Related Courses

Machine Learning with Apache Spark (Coursera) Coursera
IBM

Machine Learning with Apache Spark (Coursera)

Explore the exciting world of machine learning with this IBM course. Start by learning ML fundamentals before unlocking the power of Apache Spark to build and deploy ML models for data engineering applications. Dive into supervised and unsupervised learning techniques and discover the revolutionary possibilities of Generative AI through instructional readings and videos.

Jun 29th 2026
4 Weeks
Explore Core Data Concepts in Microsoft Azure (Coursera) Coursera
Microsoft

Explore Core Data Concepts in Microsoft Azure (Coursera)

In this course, you will learn the fundamentals of database concepts in a cloud environment, get basic skilling in cloud data services, and build your foundational knowledge of cloud data services within Microsoft Azure. You will identify and describe core data concepts such as relational, non-relational, big data, and analytics, and explore how this technology is implemented with Microsoft Azure. You will explore the roles, tasks, and responsibilities in the world of data.

Jun 29th 2026
2 Weeks
AI Workflow: AI in Production (Coursera) Coursera
IBM

AI Workflow: AI in Production (Coursera)

This is the sixth course in the IBM AI Enterprise Workflow Certification specialization. You are STRONGLY encouraged to complete these courses in order as they are not individual independent courses, but part of a workflow where each course builds on the previous ones. This course focuses on models in production at a hypothetical streaming media company. There is an introduction to IBM Watson Machine Learning.

Jun 29th 2026
4 Weeks
Hadoop Platform and Application Framework (Coursera) Coursera
University of California, San Diego

Hadoop Platform and Application Framework (Coursera)

This course is for novice programmers or business people who'd like to understand the core tools used to wrangle and analyze big data. With no prior experience, you'll have the opportunity to walk through hands-on examples with Hadoop and Spark frameworks, two of the most common in the industry. You will be comfortable explaining the specific components and basic processes of the Hadoop architecture, software stack, and execution environment.

Jun 29th 2026
5-12 Weeks
Big Data Modeling and Management Systems (Coursera) Coursera
University of California, San Diego

Big Data Modeling and Management Systems (Coursera)

Once you’ve identified a big data issue to analyze, how do you collect, store and organize your data using Big Data solutions? In this course, you will experience various data genres and management tools appropriate for each. You will be able to describe the reasons behind the evolving plethora of new big data platforms from the perspective of big data management systems and analytical tools.

Jun 22nd 2026
5-12 Weeks
Data-driven Decision Making (Coursera) Coursera
PwC

Data-driven Decision Making (Coursera)

Welcome to Data-driven Decision Making. In this course, you'll get an introduction to Data Analytics and its role in business decisions. You'll learn why data is important and how it has evolved. You'll be introduced to “Big Data” and how it is used. You'll also be introduced to a framework for conducting Data Analysis and what tools and techniques are commonly used. Finally, you'll have a chance to put your knowledge to work in a simulated business setting. This course was created by PricewaterhouseCoopers LLP with an address at 300 Madison Avenue, New York, New York, 10017.

Jun 29th 2026
4 Weeks
Foundations of marketing analytics (Coursera) Coursera
ESSEC Business School

Foundations of marketing analytics (Coursera)

Who is this course for? This course is designed for students, business analysts, and data scientists who want to apply statistical knowledge and techniques to business contexts. For example, it may be suited to experienced statisticians, analysts, engineers who want to move more into a business role, in particular in marketing. You will find this course exciting and rewarding if you already have a background in statistics, can use R or another programming language and are familiar with databases and data analysis techniques such as regression, classification, and clustering. However, it contains a number of recitals and R Studio tutorials which will consolidate your competences, enable you to play more freely with data and explore new features and statistical functions in R.

Jun 29th 2026
5-12 Weeks
Data Science for Business Innovation (Coursera) Coursera
Politecnico di Milano,EIT Digital

Data Science for Business Innovation (Coursera)

The course is a compendium of the must-have expertise in data science for executive and middle-management to foster data-driven innovation. It consists of introductory lectures spanning big data, machine learning, data valorization and communication. Topics cover the essential concepts and intuitions on data needs, data analysis, machine learning methods, respective pros and cons, and practical applicability issues.

Jun 29th 2026
4 Weeks
Big Data, Artificial Intelligence, and Ethics (Coursera) Coursera
University of California, Davis

Big Data, Artificial Intelligence, and Ethics (Coursera)

This course gives you context and first-hand experience with the two major catalyzers of the computational science revolution: big data and artificial intelligence. With more than 99% of all mediated information in digital format and with 98% of the world population using digital technology, humanity produces an impressive digital footprint.

Jun 29th 2026
4 Weeks
Big Data Integration and Processing (Coursera) Coursera
University of California, San Diego

Big Data Integration and Processing (Coursera)

At the end of the course, you will be able to: Retrieve data from example database and big data management systems; Describe the connections between data management operations and the big data processing patterns needed to utilize them in large-scale analytical applications; Identify when a big data problem needs data integration; Execute simple big data integration and processing on Hadoop and Spark platforms.

Jun 29th 2026
5-12 Weeks