
Big Data, Hadoop, and Spark Basics (edX)

Offered by IBM

This course provides foundational big data practitioner knowledge and analytical skills using popular big data tools, including Hadoop and Spark. Learn and practice your big data skills hands-on. Organizations need skilled, forward-thinking Big Data practitioners who can apply their business and technical skills to unstructured data such as tweets, posts, pictures, audio files, videos, sensor data, satellite imagery, and more, to identify the behaviors and preferences of prospects, clients, competitors, and others.


This course introduces you to Big Data concepts and practices. You will understand the characteristics, features, benefits, and limitations of Big Data and explore some of the Big Data processing tools. You'll explore how Hadoop, Hive, and Spark can help organizations overcome Big Data challenges and reap the rewards of its acquisition.
Hadoop, an open-source framework, enables distributed processing of large datasets across clusters of computers using simple programming models. Each computer, or node, offers local computation and storage, allowing datasets to be processed faster and more efficiently. Hive, a data warehouse system, provides an SQL-like interface to efficiently query and manipulate large datasets in various databases and file systems that integrate with Hadoop.
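To make the idea of a simple programming model for distributed processing more concrete, here is a minimal single-machine sketch of the MapReduce pattern in plain Python. It is not Hadoop code. The function names (map_phase, shuffle_phase, reduce_phase, word_count) and the sample documents are assumptions made up for this illustration, and a real Hadoop job would run each phase across the nodes of a cluster rather than in one process.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Group emitted values by key, as a framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}


def word_count(documents):
    """Run map, shuffle, and reduce in sequence over a list of documents."""
    pairs = []
    for doc in documents:
        # On a cluster, each document (or block) could be mapped on a different node.
        pairs.extend(map_phase(doc))
    return reduce_phase(shuffle_phase(pairs))


if __name__ == "__main__":
    # Hypothetical sample input, chosen only for illustration.
    docs = ["big data tools", "big data skills", "spark and hadoop"]
    print(word_count(docs))
```

Because each call to map_phase depends only on its own document, the map step is the part that a framework can spread across many machines. The shuffle step is where data has to move between them.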
Open-source Apache Spark is a processing engine built around speed, ease of use, and analytics that provides users with newer ways to store and use big data.
You will discover how to leverage Spark to deliver reliable insights. The course provides an overview of the platform, going into the different components that make up Apache Spark. In this course, you will also learn how Resilient Distributed Datasets, known as RDDs, enable parallel processing across the nodes of a Spark cluster.
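The partition-then-combine idea behind RDDs can be sketched without Spark. The example below is a rough analogy in plain Python, not the Spark API. It splits a list into partitions, processes each partition independently with a thread pool from the standard library, and then combines the partial results. The helper names (split_into_partitions, sum_of_squares, parallel_sum_of_squares) are hypothetical. Threads stand in for cluster nodes here only to keep the sketch self-contained.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(data, num_partitions):
    """Split a list into contiguous partitions of nearly equal size."""
    size, remainder = divmod(len(data), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        # The first `remainder` partitions each take one extra element.
        end = start + size + (1 if i < remainder else 0)
        partitions.append(data[start:end])
        start = end
    return partitions


def sum_of_squares(partition):
    """Work done independently on a single partition."""
    return sum(x * x for x in partition)


def parallel_sum_of_squares(data, num_partitions=4):
    """Process each partition in its own worker, then combine the partial results."""
    partitions = split_into_partitions(data, num_partitions)
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partial_results = list(pool.map(sum_of_squares, partitions))
    return sum(partial_results)


if __name__ == "__main__":
    print(parallel_sum_of_squares(list(range(10))))
```

In Spark itself, the partitions would live on different nodes of the cluster and the engine would schedule the per-partition work. The sketch only mirrors the shape of the computation.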
You'll gain practical skills as you learn how to analyze data in Spark using PySpark and Spark SQL, how to create a streaming analytics application using Spark Streaming, and more.

What you'll learn
After completing this course, a learner will be able to:

  • Describe Big Data, its impact, processing methods and tools, and use cases.
  • Describe Hadoop architecture, ecosystem, practices, and applications, including Distributed File - -
  • Describe Spark programming basics, including parallel programming basics, for DataFrames, data sets, and SparkSQL.
  • Describe how Spark uses RDDs, creates data sets, and uses Catalyst and Tungsten to optimize SparkSQL.
  • Apply Apache Spark development and runtime environment options.

This course is part of the NoSQL, Big Data and Spark Fundamentals Professional Certificate

Syllabus

Module 1 – What is Big Data?
o What is Big Data?
o Impact of Big Data
o Parallel Processing, Scaling, and Data Parallelism
o Tools of Big Data
o Beyond the Hype
o Big Data Use Cases
o Viewpoints about Big Data

Module 2 – Introduction to the Hadoop Ecosystem
o What is Hadoop?
o An introduction to MapReduce
o The Hadoop Ecosystem and common components: introducing HDFS, Hive, HBase, Spark, and other modules
o Working with HDFS
o Working with HBase
o Lab: MapReduce

Module 3 – Introduction to Apache Spark
o Why use Apache Spark?
o Functional Programming Basics
o Parallel Programming using Resilient Distributed Datasets
o Scale-out / Data Parallelism in Apache Spark
o DataFrames and SparkSQL
o Lab: Practical examples with PySpark

Module 4 – DataFrames and SparkSQL
o Introduction to DataFrames & SparkSQL
o RDDs in Parallel Programming and Spark
o DataFrames and Datasets
o Catalyst and Tungsten
o ETL with DataFrames
o Lab: ETL with DataFrames
o Real-world usage of SparkSQL
o Lab: SparkSQL

Module 5 – Development and Runtime Environment options
o Apache Spark architecture
o Overview of Apache Spark Cluster Modes
o How to Run an Apache Spark Application
o Using Apache Spark on IBM Cloud
o Lab: Scale-out on IBM Spark Environment in Watson Studio
o Setting Apache Spark Configuration
o Running Spark on Kubernetes
o Lab: Spark on Kube


Related Courses

Introduction to Computer Science and Programming (edX)
Tokyo Institute of Technology, TokyoTechX

The term “Computation” refers to the action performed by a computer. A computation can be a basic operation, or it can be a sophisticated computer simulation requiring a large amount of data and substantial resources. This course introduces learners with no prior knowledge to the basics and key concepts of computer science. By following the lectures and exercises in this course, you will gain an understanding of algorithms and real programming experience using the Ruby language.

Self-Paced

Data Analytics and Visualization in Health Care (edX)
Rochester Institute of Technology, RITx

Learn best practices in data analytics, informatics, and visualization to gain literacy in data-driven, strategic imperatives that affect all facets of health care. Big data is transforming the health care industry relative to improving quality of care and reducing costs—key objectives for most organizations. Employers are desperately searching for professionals who have the ability to extract, analyze, and interpret data from patient health records, insurance claims, financial records, and more to tell a compelling and actionable story using health care data analytics.

Self-Paced

Big Data Strategies to Transform Your Business (edX)
Delft University of Technology, DelftX

Make your organization’s business strategy and model, as well as your own career path, future-proof by using big data’s disruptive power. While big data infiltrates all walks of life, most firms have not changed sufficiently to meet the challenges that come with it. In this course, you will learn how to develop a big data strategy, transform your business model and your organization. This course will enable professionals to take their organization and their own career to the next level, regardless of their background and position.

Self-Paced

Value Co-Creation in Sport Management – A New Logic in a Changing Society (edX)
University of Bayreuth, BayreuthX

Do you want to become a successful sport management expert? Learn the importance of value co-creation and gather new insights that will make you more competitive in the field of sport management. Are you a passionate sports lover interested in exploring a new innovative logic in sport management? Enroll in this online course.

Self-Paced

Biostatistics for Big Data Applications (edX)
University of Texas Medical Branch

Learn data analysis basics for working with biomedical big data with practical hands-on examples using R. This course provides a broad foundation of statistical terms and concepts as well as an introduction to the R statistical software package. The topics covered are fundamental components of biostatistical methods used in both omics and population health research.

No sessions available
5-12 Weeks

Industry 4.0: How to Revolutionize your Business (edX)
The Hong Kong Polytechnic University, HKPolyUx

An introduction to the fourth industrial revolution, its major systems and technologies, and how new products and services will impact business and society. We witnessed the power of mechanization in the early nineteenth century, automation in the seventies, and information and the internet in the last decades. Now, the adoption of connected intelligence into the business and social fabric is advancing at an astonishing speed, which will completely change the way we conduct business.

Self-Paced

Distributed Machine Learning with Apache Spark (edX)
University of California, Berkeley, BerkeleyX

Learn the underlying principles required to develop scalable machine learning pipelines and gain hands-on experience using Apache Spark. Machine learning aims to extract knowledge from data, relying on fundamental concepts in computer science, statistics, probability and optimization.

No sessions available
4 Weeks

Data Storage and Processing (edX)
ITMO University, ITMOx

Master the culture of data representation, interpretation and outcomes evaluation. Learn the fundamentals of relational and NoSQL database management systems. Want to learn how to process data and interpret the results? This course is for you! Get acquainted with preparing and analyzing large amounts of data, as well as data storage fundamentals.

No sessions available
5-12 Weeks

Analytics in Python (edX)
Columbia University, ColumbiaX

Learn the fundamentals of programming in Python and develop the ability to analyze data and make data-driven decisions. Data is the lifeblood of an organization. Competency in programming is an essential skill for successfully extracting information and knowledge from data. The goal of this course is to introduce learners to the basics of programming in Python and to give a working knowledge of how to use programs to deal with data.

This course is archived
5-12 Weeks

Computational Thinking and Big Data (edX)
University of Adelaide, AdelaideX

Learn the core concepts of computational thinking and how to collect, clean and consolidate large-scale datasets. Computational thinking is an invaluable skill that can be used across every industry, as it allows you to formulate a problem and express a solution in such a way that a computer can effectively carry it out.

Self-Paced