Introduction to Parallel Programming with CUDA (Coursera)

This course helps prepare students to develop code that can process large amounts of data in parallel on Graphics Processing Units (GPUs). Students will learn how to implement software that solves complex problems on leading consumer to enterprise-grade GPUs using Nvidia CUDA. The course focuses on hardware and software capabilities, including the use of hundreds to thousands of threads and various forms of memory.

Course 2 of 4 in the GPU Programming Specialization.

What You Will Learn

  • Students will learn how to utilize the CUDA framework to write C/C++ software that runs on CPUs and Nvidia GPUs.
  • Students will transform sequential CPU algorithms and programs into CUDA kernels that execute 100s to 1000s of times simultaneously on GPU hardware.

Syllabus

WEEK 1
Course Overview
The purpose of this module is for students to understand how the course will be run, which topics it covers, how they will be assessed, and what is expected of them.

WEEK 2
Threads, Blocks and Grids
The single most important concept in using GPUs to solve complex, large-scale problems is the management of threads. CUDA organizes threads into blocks and blocks into grids, each of which can be indexed in one, two, or three dimensions. Students will develop programs that use threads, blocks, and grids to process large two- and three-dimensional data sets.
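
The indexing idea behind this module can be sketched without a GPU. The example below is a CPU-only emulation in plain C++, not CUDA code and not taken from the course: it shows how a thread's block index, block size, and thread index typically combine into a global 2D position, and why a bounds check is needed when the grid is rounded up to whole blocks. The names Dim2, blocksNeeded, scaleElement, and scaleImage are hypothetical and chosen only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the x/y parts of CUDA's dim3, blockIdx, and threadIdx.
struct Dim2 {
    unsigned x;
    unsigned y;
};

// Ceiling division: how many blocks of blockSize threads are needed to cover n elements.
unsigned blocksNeeded(unsigned n, unsigned blockSize) {
    assert(blockSize > 0);
    return (n + blockSize - 1) / blockSize;
}

// Work done by one logical thread: derive its global column and row from its
// block and thread indices, then scale the single element it owns.
void scaleElement(Dim2 blockIdx, Dim2 threadIdx, Dim2 blockDim,
                  const std::vector<float>& in, std::vector<float>& out,
                  unsigned width, unsigned height, float factor) {
    const unsigned col = blockIdx.x * blockDim.x + threadIdx.x;
    const unsigned row = blockIdx.y * blockDim.y + threadIdx.y;
    // The grid is rounded up to whole blocks, so threads past the edge do nothing.
    if (col < width && row < height) {
        const std::size_t i = static_cast<std::size_t>(row) * width + col;
        out[i] = in[i] * factor;
    }
}

// CPU emulation of a 2D launch: these loops visit every (block, thread) pair
// one after another, whereas a GPU would run the threads in parallel.
std::vector<float> scaleImage(const std::vector<float>& in, unsigned width,
                              unsigned height, float factor, Dim2 blockDim) {
    assert(in.size() == static_cast<std::size_t>(width) * height);
    const Dim2 gridDim{blocksNeeded(width, blockDim.x),
                       blocksNeeded(height, blockDim.y)};
    std::vector<float> out(in.size(), 0.0f);
    for (unsigned by = 0; by < gridDim.y; ++by)
        for (unsigned bx = 0; bx < gridDim.x; ++bx)
            for (unsigned ty = 0; ty < blockDim.y; ++ty)
                for (unsigned tx = 0; tx < blockDim.x; ++tx)
                    scaleElement(Dim2{bx, by}, Dim2{tx, ty}, blockDim,
                                 in, out, width, height, factor);
    return out;
}
```

For a 3 x 2 input and 2 x 2 blocks, blocksNeeded yields a 2 x 1 grid, so eight logical threads run and the bounds check idles the two that fall outside the data. In real CUDA code the four nested loops would be replaced by a single kernel launch.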

WEEK 3
Host and Global Memory
To manage the access and modification of data in physical memory effectively, students will need to load data into CPU (host) and GPU (global) general-purpose memory. Students will create software that allocates host memory and transfers its contents into global memory for use by threads. Students will also learn the capabilities and speeds of these types of memory.

WEEK 4
Shared and Constant Memory
To improve performance in GPU software, students will need to use mutable (shared) and static (constant) memory. They will use these memories to apply masks to all items of a data set, to manage communication between threads, and to cache data in complex programs.

WEEK 5
Register Memory
In this module, students will learn the benefits and constraints of the GPU's most localized memory: registers. While using this type of memory will come naturally to students, gaining the largest performance boost from it, as with all forms of memory, requires thoughtful software design. Students will implement algorithms using each type of memory and produce a performance analysis.

Related Courses

Parallel programming (Coursera) Coursera
École Polytechnique Fédérale de Lausanne

With every smartphone and computer now boasting multiple processors, the use of functional ideas to facilitate parallel programming is becoming increasingly widespread. In this course, you'll learn the fundamentals of parallel programming, from task parallelism to data parallelism. In particular, you'll see how many familiar ideas from functional programming map perfectly to the data parallel paradigm.

Jun 22nd 2026
4 Weeks
The Arduino Platform and C Programming (Coursera) Coursera
University of California, Irvine

The Arduino is an open-source computer hardware/software platform for building digital devices and interactive objects that can sense and control the physical world around them. In this class you will learn how the Arduino platform works in terms of the physical board and libraries and the IDE (integrated development environment). You will also learn about shields, which are smaller boards that plug into the main Arduino board to perform other functions such as sensing light, heat, GPS tracking, or providing a user interface display. The course will also cover programming the Arduino using C code and accessing the pins on the board via the software to control external devices.

Jun 22nd 2026
4 Weeks
Shell Programming - A necessity for all Programmers (edX) EdX
IIT Bombay, IITBombayX

Unleash your Linux scripting skills and amaze others with your productivity level. Various programming languages have gained popularity since 1970. Starting with Assembly, C, and C++, and moving towards Java, Python, and finally backend and frontend frameworks, all of these became popular and were, or are being, replaced by some other language or framework. Shell programming (scripting) is the only programming language that has remained popular and the choice of programmers, testers, system administrators, etc., from 1970 to date (21st century).

Self-Paced
High Performance Computing (Udacity) Udacity
Georgia Institute of Technology, Udacity

The goal of this course is to give you solid foundations for developing, analyzing, and implementing parallel and locality-efficient algorithms. This course focuses on theoretical underpinnings. To give a practical feeling for how algorithms map to and behave on real systems, we will supplement algorithmic theory with hands-on exercises on modern HPC systems, such as Cilk Plus or OpenMP on shared memory nodes, CUDA for graphics co-processors (GPUs), and MPI and PGAS models for distributed memory systems.

Self-Paced
Parallel Programming Concepts (openHPI) OpenHPI
Hasso-Plattner-Institut

The openHPI online course “Parallel Programming Concepts” presents relevant theoretical and practical foundations for parallel programming. We show crucial theoretical ideas such as semaphores and actors, the architecture of modern parallel hardware, different programming models such as task parallelism, message passing and functional programming, and several patterns and best practices.

Self-Paced
Using GPUs to Scale and Speed-up Deep Learning (edX) EdX
IBM

Training complex deep learning models with large datasets takes a long time. In this course, you will learn how to use accelerated GPU hardware to overcome the scalability problem in deep learning. Training a complex deep learning model with a very large dataset can take hours, days and occasionally weeks to train. So, what is the solution? Accelerated hardware.

No sessions available
5-12 Weeks
MPI: A Short Introduction to One-sided Communication (FutureLearn) FutureLearn
Partnership for Advanced Computing in Europe - PRACE

Learn the details of one-sided communication in MPI programming. Discover the advantages to one-sided communication in parallel programming. Message Passing Interface (MPI) is a key standard for parallel computing architectures. On this course, you’ll learn the essential concepts of one-sided communication in MPI, as well as the advantages of the MPI communication model.

No sessions available
2 Weeks
Identifying Security Vulnerabilities in C/C++Programming (Coursera) Coursera
University of California, Davis

This course builds upon the skills and coding practices learned in both Principles of Secure Coding and Identifying Security Vulnerabilities, courses one and two, in this specialization. This course uses the focusing technique that asks you to think about: “what to watch out for” and “where to look” to evaluate and ultimately remediate fragile C++ library code.

Jun 22nd 2026
4 Weeks
The Raspberry Pi Platform and Python Programming for the Raspberry Pi (Coursera) Coursera
University of California, Irvine

The Raspberry Pi is a small, affordable single-board computer that you will use to design and develop fun and practical IoT devices while learning programming and computer hardware. In addition, you will learn how to set up the Raspberry Pi environment, get a Linux operating system running, and write and execute some basic Python code on the Raspberry Pi. You will also learn how to use Python-based IDEs (integrated development environments) for the Raspberry Pi and how to trace and debug Python code on the device.

Jun 22nd 2026
4 Weeks
Hands-on Machine Learning with AWS and NVIDIA (Coursera) Coursera
AWS, NVIDIA

Machine learning (ML) projects can be complex, tedious, and time-consuming. AWS and NVIDIA address this challenge with fast, effective, and easy-to-use capabilities for your ML project. This course is designed for ML practitioners, including data scientists and developers, who have a working knowledge of machine learning workflows. In this course, you will gain hands-on experience in building, training, and deploying scalable machine learning models with Amazon SageMaker and Amazon EC2 instances powered by NVIDIA GPUs.

Jul 22nd 2024
4 Weeks
Ordered Data Structures (Coursera) Coursera
University of Illinois at Urbana-Champaign

In this course, you will learn new data structures for efficiently storing and retrieving data that is structured in an ordered sequence. Such data includes an alphabetical list of names, a family tree, a calendar of events or an inventory organized by part numbers. The specific data structures covered by this course include arrays, linked lists, queues, stacks, trees, binary trees, AVL trees, B-trees and heaps. This course also shows, through algorithm complexity analysis, how these structures enable the fastest algorithms to search and sort data.

Jun 24th 2026
4 Weeks