Day | Code | Notes | Progress |
---|---|---|---|
043 | triton puzzles: matmul + relu | triton | matrix multiplication and relu |
042 | triton puzzles: fused matmul + relu | triton | fused matrix multiplication and relu |
041 | triton puzzles: vector addition, row to col | triton | vector addition row and column vectors |
040 | triton puzzles: vector addition | triton | vector addition |
039 | triton puzzles: constant addition with varying block sizes | triton | constant addition to vector |
038 | triton puzzles: constant addition | triton | constant addition |
037 | triton puzzles: blocks and loading | triton | 2d tensor loading as blocks |
036 | triton puzzles: loading 2d tensors | triton | 2d tensor and tl.store |
035 | triton puzzles | triton | tritonviz + debugging meson + puzzles environment setup |
034 | benchmarking in triton | triton | triton benchmarking + plots |
033 | vector addition in triton | triton | triton setup + vector addition |
032 | knn with vectorized distance computation | float4 | knn + vectorized distance computation + float4 operations |
031 | knn with batch distance computation | knn | knn + batch distance computation |
030 | knn with thrust for sorting | knn | knn + thrust sorting |
029 | knn with tiled distance computation | knn | knn + tiling |
028 | knn with shared memory distance calculation | knn | knn + shared memory |
027 | baseline gpu knn | knn | knn |
026 | bitonic sort with shared memory | sorting | bitonic sort with shared memory |
025 | bitonic sort | sorting | bitonic sort |
024 | histogram with shared memory and atomic add | using shared memory and atomic adds | atomic operations, race conditions |
023 | histogram with atomicAdd | using atomic adds | atomic operations, race conditions |
022 | register pressure and spilling | reducing register pressure | spilling, high and low register pressure |
021 | optimizing warp divergence | optimized warp divergence | warp divergence and optimizing for it |
020 | hillis steele prefix sum (optimized) | optimized parallel prefix sum | hillis steele with shared memory |
019 | prefix sum (naive) | parallel prefix sum | prefix sum, parallel scanning |
018 | parallel reduction (optimized) | optimized parallel reduction | shuffle sync with mask and warps (see the sketch below this table) |
017 | parallel reduction (naive) | naive parallel reduction | parallel reduction with shared memory |
016 | l1 and l2 cache | read about l1, l2 cache and how to write cache friendly code | l1, l2 cache |
015 | matrix multiplication with block tiling | optimizing mat mul using block tiling | block tiling |
014 | matrix multiplication sgemm shared memory | optimizing mat mul using memory blocking | shared memory, memory blocking |
013 | optimizing matrix multiplication | optimizing mat mul using coalescing | coalescing memory and warp scheduling |
012 | shared memory | matrix multiplication and shared memory | read about shared memory, registers and warps, bank conflicts, reading matrix multiplication blog by siboehm |
011 | optimizing matrix multiplication | matrix multiplication and profiling | using nsys and nvprof, reading matrix multiplication blog by siboehm |
010 | face blur | read matrix multiplication blog | reading matrix multiplication blog by siboehm, using a compiled kernel in python |
009 | matrix transpose | matrix transpose and matrix multiplication blog | started reading matrix multiplication blog by siboehm, started chapter 4 of PMPP |
008 | matrix multiply and helpers | matrix multiplication, pinned memory and BLAS | read about pinned memory, pageable memory and cudaHostAlloc(). finished chapter 3 of PMPP |
007 | vector multiply and helpers | internal structure of blocks | set up gpu env on new server. studied hierarchy of execution within the streaming multiprocessor. created helpers file. |
006 | gaussianBlurSharedMemory with event times | event times and performance measurement | added perf measurement code to gaussian blur with shared memory kernel |
005 | gaussianBlurSharedMemory | PMPP Chapter 3 & exploration | built on top of gaussian blur; learnt about shared memory and implemented it; |
004 | gaussianBlur | PMPP Chapter 3 | built on top of image blur; struggling to understand multidimensionality; |
003 | imageBlur | PMPP Chapter 3 | read parts of image blur and about better ways to handle errors, image blurring logic |
002 | colorToGrayScaleConversion | PMPP Chapter 3 | read half of chapter 3 of pmpp, implemented color to grayscale conversion |
001 | vecAbsDiff | PMPP Chapter 2 | read chapter 2 of pmpp, implemented vector absolute difference kernel |
000 | - | PMPP | setup environment, lecture 1 of ECE 408, chapter 1 of PMPP |
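
The day 017–018 entries above cover the reduction pattern in the most detail (shared memory first, then shuffle sync with mask and warps). As a reference for that pattern, here is a minimal sketch of a warp-shuffle sum reduction; it is illustrative rather than the exact kernel in this repo. The kernel and helper names, the 256-thread block size, and the final atomicAdd accumulation are assumptions, and the sketch assumes the block size is a multiple of the 32-thread warp.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Reduce a value across the 32 lanes of a warp using shuffle-down with a full mask.
__inline__ __device__ float warpReduceSum(float val) {
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xffffffff, val, offset);  // lane i adds lane i+offset
    return val;
}

__global__ void reduceSum(const float *in, float *out, int n) {
    __shared__ float warpSums[32];                 // one partial sum per warp in the block
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    int lane = threadIdx.x % 32;
    int warp = threadIdx.x / 32;

    float val = (tid < n) ? in[tid] : 0.0f;
    val = warpReduceSum(val);                      // reduce within each warp (registers only)

    if (lane == 0) warpSums[warp] = val;           // lane 0 of each warp stores its partial
    __syncthreads();

    if (warp == 0) {                               // first warp reduces the per-warp partials
        val = (lane < blockDim.x / 32) ? warpSums[lane] : 0.0f;
        val = warpReduceSum(val);
        if (lane == 0) atomicAdd(out, val);        // accumulate block totals into one result
    }
}

int main() {
    const int n = 1 << 20;
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = 1.0f;
    *out = 0.0f;
    reduceSum<<<(n + 255) / 256, 256>>>(in, out, n);
    cudaDeviceSynchronize();
    printf("sum = %.0f (expected %d)\n", *out, n);
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

Compared with the day-017 shared-memory version, the shuffle variant keeps partial sums in registers and touches shared memory only once per warp, which cuts shared-memory traffic and __syncthreads calls.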
- Programming Massively Parallel Processors
- CUDA 120 Days Challenge
- ECE 408
- LLMs
Objective | Topic | Task/Implementation | Status |
---|---|---|---|
Phase 1: Foundations | Goal: Understand CUDA fundamentals, memory hierarchy, and write basic optimized kernels. | ||
1 | CUDA Setup & First Kernel | Install CUDA, write a vector addition kernel | ✅ |
2 | Thread Hierarchy | Grids, blocks, threads, experimenting with configurations | ✅ |
3 | Memory Model Basics | Global, shared, local memory overview | ✅ |
4 | Memory Coalescing | Optimize vector addition using shared memory | ✅ |
5 | Matrix Multiplication (Naïve) | Implement basic matrix multiplication | ✅ |
6 | Matrix Multiplication (Optimized) | Use shared memory to optimize | |
7 | Profiling Basics | Use nvprof and nsys to analyze kernels | ✅
8 | L1/L2 Cache Effects | Study cache behavior and memory bandwidth | ✅ |
9 | Tiled Matrix Multiplication | Further optimize matrix multiplication | |
10 | Register Pressure | Optimize register usage and reduce spilling | ✅ |
11 | Warp Execution Model | Avoiding warp divergence | ✅ |
12 | Parallel Reduction (Naïve) | Implement sum/max reductions | ✅ |
13 | Parallel Reduction (Optimized) | Optimize with warp shuffle (`__shfl_sync`) | ✅
14 | Code Review & Optimization | Refine and benchmark previous work | |
15 | Parallel Scan (Prefix Sum) | Implement parallel scan algorithm | ✅ |
16 | Histogram (Naïve) | Implement histogram using global memory atomics | ✅ |
17 | Histogram (Optimized) | Use shared memory to optimize histogram | ✅ |
18 | Parallel Sorting | Implement bitonic or bucket sort | ✅ |
19 | k-Nearest Neighbors | Implement kNN search using CUDA | |
20 | Code Review & Benchmarking | Optimize and compare previous implementations | |
Phase 2: ML Operators | Goal: Implement and optimize core ML kernels. | ||
21 | Dense Matrix-Vector Multiplication | Implement `y = Wx + b` in CUDA (see the sketch after this table) |
22 | Fully Connected Layer | Implement dense forward pass | |
23 | ReLU & Softmax | Implement activation functions | |
24 | Backpropagation | Implement BP for a single layer | |
25 | 1D Convolution (Naïve) | Implement 1D convolution | |
26 | 1D Convolution (Optimized) | Optimize with shared memory | |
27 | Profiling DL Kernels | Compare CUDA vs. PyTorch performance | |
28 | 2D Convolution (Naïve) | Implement 2D convolution | |
29 | 2D Convolution (Optimized) | Use shared memory for optimization | |
30 | Im2Col + GEMM Conv | Implement im2col approach | |
31 | Depthwise Separable Conv | Optimize CNN inference workloads | |
32 | Batch Norm & Activation Fusion | Optimize BN + activation | |
33 | Code Review & Optimization | Refine previous work | |
34 | Benchmarking ML Kernels | Compare different CNN implementations | |
35 | LayerNorm in CUDA | Implement LayerNorm from scratch | |
36 | Efficient Dropout | Optimize dropout for training speed | |
37 | Fused MLP Block | Implement fused MLP (GEMM + activation + dropout) |
38 | Transformer Attention (Naïve) | Implement self-attention kernel | |
39 | Optimized Self-Attention | Optimize self-attention with shared memory | |
40 | Benchmark Transformer Layers | Compare against `torch.nn.MultiheadAttention` |
41 | Tensor Cores & FP16 | Implement FP16 computation | |
42 | Gradient Accumulation | Optimize training with gradient accumulation | |
43 | Mixed Precision Training (AMP) | Implement AMP optimizations | |
44 | Optimized Attention (FlashAttention) | Implement FlashAttention concepts | |
45 | Fused LayerNorm + Dropout | Optimize memory and performance | |
46 | Large-Scale Training Profiling | Analyze memory bottlenecks | |
Phase 3: Advanced CUDA & Large-Scale ML | Goal: Optimize LLMs, multi-GPU training, and memory-efficient kernels. | ||
47 | Multi-GPU Data Parallelism | Implement data parallel training | |
48 | Multi-GPU Model Parallelism | Implement model parallel training | |
49 | Efficient Multi-GPU Communication | Study NCCL and all-reduce ops | |
50 | Large Model Optimization | Optimize large-scale deep learning models | |
51 | Rotary Embeddings | Implement rotary embeddings in CUDA | |
52 | Fused Transformer Block | Implement fused transformer kernel | |
53 | LLM Batch Processing | Optimize inference for large batch sizes | |
54 | FlashAttention-Like Kernels | Implement memory-efficient attention | |
55 | Memory Optimization for LLMs | Optimize LLM inference footprint | |
56 | GPU Benchmarking | Compare performance across GPUs | |
57 | Architecture-Specific Optimizations | Tune for Ampere/Hopper GPUs | |
58 | CUDA Graphs | Implement CUDA Graphs for execution efficiency | |
59 | Memory Fragmentation Optimization | Optimize dynamic allocations | |
60 | Benchmarking | Compare PyTorch/TensorFlow vs. your CUDA implementations | |
61 | Optimize a Real-World Model | Pick a model (BERT/GPT) and optimize | |
62 | Custom CUDA Model Acceleration | Implement a custom CUDA-based model optimization |
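
As a concrete starting point for Phase 2, the first task above (21, `y = Wx + b`) can be written as a naïve one-thread-per-output-row kernel. The sketch below is illustrative, not the planned implementation: the kernel name, the row-major layout of W, and the launch configuration are assumptions, and the later objectives (22 onward) would add coalesced access, tiling, and fusion.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Naive dense forward pass: y = Wx + b, with W stored row-major (rows x cols).
// One thread computes one output element y[row].
__global__ void denseForward(const float *W, const float *x, const float *b,
                             float *y, int rows, int cols) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= rows) return;
    float acc = b[row];
    for (int k = 0; k < cols; ++k)
        acc += W[row * cols + k] * x[k];           // dot product of W's row with x
    y[row] = acc;
}

int main() {
    const int rows = 4, cols = 3;
    float *W, *x, *b, *y;
    cudaMallocManaged(&W, rows * cols * sizeof(float));
    cudaMallocManaged(&x, cols * sizeof(float));
    cudaMallocManaged(&b, rows * sizeof(float));
    cudaMallocManaged(&y, rows * sizeof(float));
    for (int i = 0; i < rows * cols; ++i) W[i] = 1.0f;   // toy data: every y[i] should be 6.5
    for (int i = 0; i < cols; ++i) x[i] = 2.0f;
    for (int i = 0; i < rows; ++i) b[i] = 0.5f;

    denseForward<<<(rows + 127) / 128, 128>>>(W, x, b, y, rows, cols);
    cudaDeviceSynchronize();
    for (int i = 0; i < rows; ++i) printf("y[%d] = %.2f\n", i, y[i]);

    cudaFree(W); cudaFree(x); cudaFree(b); cudaFree(y);
    return 0;
}
```

With this layout, neighbouring threads read W with a stride of cols elements, so the loads are uncoalesced; staging x in shared memory or assigning a warp per row are the usual first optimizations.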