Day | Code | Notes | Progress |
---|---|---|---|
043 | triton puzzles: matmul + relu | triton | matrix multiplication and relu |
042 | triton puzzles: fused matmul + relu | triton | fused matrix multiplication and relu |
041 | triton puzzles: vector addition, row to col | triton | vector addition row and column vectors |
040 | triton puzzles: vector addition | triton | vector addition |
039 | triton puzzles: constant addition with varying block sizes | triton | constant addition to vector |
038 | triton puzzles: constant addition | triton | constant addition |
037 | triton puzzles: blocks and loading | triton | 2d tensor loading as blocks |
036 | triton puzzles: loading 2d tensors | triton | 2d tensor and tl.store |
035 | triton puzzles | triton | tritonviz + debugging meson + puzzles environment setup |
034 | benchmarking in triton | triton | triton benchmarking + plots |
033 | vector addition in triton | triton | triton setup + vector addition |
032 | knn with vectorized distance computation | float4 | knn + vectorized distance computation + float4 operations |
031 | knn with batch distance computation | knn | knn + batch distance computation |
030 | knn with thrust for sorting | knn | knn + thrust sorting |
029 | knn with tiled distance computation | knn | knn + tiling |
028 | knn with shared memory distance calculation | knn | knn + shared memory |
027 | baseline gpu knn | knn | knn |
026 | bitonic sort with shared memory | sorting | bitonic sort with shared memory |
025 | bitonic sort | sorting | bitonic sort |
024 | histogram with shared memory and atomic add | using shared memory and atomic adds | atomic operations, race conditions |
023 | histogram with atomicAdd | using atomic adds | atomic operations, race conditions |
022 | register pressure and spilling | reducing register pressure | spilling, high and low register pressure |
021 | optimizing warp divergence | optimized warp divergence | warp divergence and optimizing for it |
020 | hillis steele prefix sum (optimized) | optimized parallel prefix sum | hillis steele with shared memory |
019 | prefix sum (naive) | parallel prefix sum | prefix sum, parallel scanning |
018 | parallel reduction (optimized) | optimized parallel reduction | shuffle sync with mask and warps (see the sketch below this table) |
017 | parallel reduction (naive) | naive parallel reduction | parallel reduction with shared memory |
016 | l1 and l2 cache | read about l1, l2 cache and how to write cache friendly code | l1, l2 cache |
015 | matrix multiplication with block tiling | optimizing mat mul using block tiling | block tiling |
014 | matrix multiplication sgemm shared memory | optimizing mat mul using memory blocking | shared memory, memory blocking |
013 | optimizing matrix multiplication | optimizing mat mul using coalescing | coalescing memory and warp scheduling |
012 | shared memory | matrix multiplication and shared memory | read about shared memory, registers and warps, bank conflicts, reading matrix multiplication blog by siboehm |
011 | optimizing matrix multiplication | matrix multiplication and profiling | using nsys and nvprof, reading matrix multiplication blog by siboehm |
010 | face blur | read matrix multiplication blog | reading matrix multiplication blog by siboehm, using a compiled kernel in python |
009 | matrix transpose | matrix transpose and matrix multiplication blog | started reading matrix multiplication blog by siboehm, started chapter 4 of PMPP |
008 | matrix multiply and helpers | matrix multiplication, pinned memory and BLAS | read about pinned memory, pageable memory and cudaHostAlloc(). finished chapter 3 of PMPP |
007 | vector multiply and helpers | internal structure of blocks | set up gpu env on new server. studied hierarchy of execution within the streaming multiprocessor. created helpers file. |
006 | gaussianBlurSharedMemory with event times | event times and performance measurement | added perf measurement code to gaussian blur with shared memory kernel |
005 | gaussianBlurSharedMemory | PMPP Chapter 3 & exploration | built on top of gaussian blur; learnt about shared memory and implemented it; |
004 | gaussianBlur | PMPP Chapter 3 | built on top of image blur; struggling to understand multidimensionality; |
003 | imageBlur | PMPP Chapter 3 | read parts of image blur and about better ways to handle errors, image blurring logic |
002 | colorToGrayScaleConversion | PMPP Chapter 3 | read half of chapter 3 of pmpp, implemented color to grayscale conversion |
001 | vecAbsDiff | PMPP Chapter 2 | read chapter 2 of pmpp, implemented vector absolute difference kernel |
000 | - | PMPP | setup environment, lecture 1 of ECE 408, chapter 1 of PMPP |
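
The day 017–018 entries above cover the reduction pattern in the most detail (shared memory first, then shuffle sync with mask and warps). As a reference for that pattern, here is a minimal sketch of a warp-shuffle sum reduction; it is illustrative rather than the exact kernel in this repo. The kernel and helper names, the 256-thread block size, and the final atomicAdd accumulation are assumptions, and the sketch assumes the block size is a multiple of the 32-thread warp.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Reduce a value across the 32 lanes of a warp using shuffle-down with a full mask.
__inline__ __device__ float warpReduceSum(float val) {
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xffffffff, val, offset);  // lane i adds lane i+offset
    return val;
}

__global__ void reduceSum(const float *in, float *out, int n) {
    __shared__ float warpSums[32];                 // one partial sum per warp in the block
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    int lane = threadIdx.x % 32;
    int warp = threadIdx.x / 32;

    float val = (tid < n) ? in[tid] : 0.0f;
    val = warpReduceSum(val);                      // reduce within each warp (registers only)

    if (lane == 0) warpSums[warp] = val;           // lane 0 of each warp stores its partial
    __syncthreads();

    if (warp == 0) {                               // first warp reduces the per-warp partials
        val = (lane < blockDim.x / 32) ? warpSums[lane] : 0.0f;
        val = warpReduceSum(val);
        if (lane == 0) atomicAdd(out, val);        // accumulate block totals into one result
    }
}

int main() {
    const int n = 1 << 20;
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = 1.0f;
    *out = 0.0f;
    reduceSum<<<(n + 255) / 256, 256>>>(in, out, n);
    cudaDeviceSynchronize();
    printf("sum = %.0f (expected %d)\n", *out, n);
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

Compared with the day-017 shared-memory version, the shuffle variant keeps partial sums in registers and touches shared memory only once per warp, which cuts shared-memory traffic and __syncthreads calls.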
- Programming Massively Parallel Processors
- CUDA 120 Days Challenge
- ECE 408
- LLMs
Objective | Topic | Task/Implementation | Status |
---|---|---|---|
Phase 1: Foundations | Goal: Understand CUDA fundamentals, memory hierarchy, and write basic optimized kernels. | ||
1 | CUDA Setup & First Kernel | Install CUDA, write a vector addition kernel | ✅ |
2 | Thread Hierarchy | Grids, blocks, threads, experimenting with configurations | ✅ |
3 | Memory Model Basics | Global, shared, local memory overview | ✅ |
4 | Memory Coalescing | Optimize vector addition using shared memory | ✅ |
5 | Matrix Multiplication (Naïve) | Implement basic matrix multiplication | ✅ |
6 | Matrix Multiplication (Optimized) | Use shared memory to optimize | |
7 | Profiling Basics | Use nvprof and nsys to analyze kernels | ✅
8 | L1/L2 Cache Effects | Study cache behavior and memory bandwidth | ✅ |
9 | Tiled Matrix Multiplication | Further optimize matrix multiplication | |
10 | Register Pressure | Optimize register usage and reduce spilling | ✅ |
11 | Warp Execution Model | Avoiding warp divergence | ✅ |
12 | Parallel Reduction (Naïve) | Implement sum/max reductions | ✅ |
13 | Parallel Reduction (Optimized) | Optimize with warp shuffle (`__shfl_sync`) | ✅
14 | Code Review & Optimization | Refine and benchmark previous work | |
15 | Parallel Scan (Prefix Sum) | Implement parallel scan algorithm | ✅ |
16 | Histogram (Naïve) | Implement histogram using global memory atomics | ✅ |
17 | Histogram (Optimized) | Use shared memory to optimize histogram | ✅ |
18 | Parallel Sorting | Implement bitonic or bucket sort | ✅ |
19 | k-Nearest Neighbors | Implement kNN search using CUDA | |
20 | Code Review & Benchmarking | Optimize and compare previous implementations | |
Phase 2: ML Operators | Goal: Implement and optimize core ML kernels. | ||
21 | Dense Matrix-Vector Multiplication | Implement `y = Wx + b` in CUDA (see the sketch after this table) |
22 | Fully Connected Layer | Implement dense forward pass | |
23 | ReLU & Softmax | Implement activation functions | |
24 | Backpropagation | Implement BP for a single layer | |
25 | 1D Convolution (Naïve) | Implement 1D convolution | |
26 | 1D Convolution (Optimized) | Optimize with shared memory | |
27 | Profiling DL Kernels | Compare CUDA vs. PyTorch performance | |
28 | 2D Convolution (Naïve) | Implement 2D convolution | |
29 | 2D Convolution (Optimized) | Use shared memory for optimization | |
30 | Im2Col + GEMM Conv | Implement im2col approach | |
31 | Depthwise Separable Conv | Optimize CNN inference workloads | |
32 | Batch Norm & Activation Fusion | Optimize BN + activation | |
33 | Code Review & Optimization | Refine previous work | |
34 | Benchmarking ML Kernels | Compare different CNN implementations | |
35 | LayerNorm in CUDA | Implement LayerNorm from scratch | |
36 | Efficient Dropout | Optimize dropout for training speed | |
37 | Fused MLP Block | Implement fused MLP (GEMM + activation + dropout) |
38 | Transformer Attention (Naïve) | Implement self-attention kernel | |
39 | Optimized Self-Attention | Optimize self-attention with shared memory | |
40 | Benchmark Transformer Layers | Compare against `torch.nn.MultiheadAttention` |
41 | Tensor Cores & FP16 | Implement FP16 computation | |
42 | Gradient Accumulation | Optimize training with gradient accumulation | |
43 | Mixed Precision Training (AMP) | Implement AMP optimizations | |
44 | Optimized Attention (FlashAttention) | Implement FlashAttention concepts | |
45 | Fused LayerNorm + Dropout | Optimize memory and performance | |
46 | Large-Scale Training Profiling | Analyze memory bottlenecks | |
Phase 3: Advanced CUDA & Large-Scale ML | Goal: Optimize LLMs, multi-GPU training, and memory-efficient kernels. | ||
47 | Multi-GPU Data Parallelism | Implement data parallel training | |
48 | Multi-GPU Model Parallelism | Implement model parallel training | |
49 | Efficient Multi-GPU Communication | Study NCCL and all-reduce ops | |
50 | Large Model Optimization | Optimize large-scale deep learning models | |
51 | Rotary Embeddings | Implement rotary embeddings in CUDA | |
52 | Fused Transformer Block | Implement fused transformer kernel | |
53 | LLM Batch Processing | Optimize inference for large batch sizes | |
54 | FlashAttention-Like Kernels | Implement memory-efficient attention | |
55 | Memory Optimization for LLMs | Optimize LLM inference footprint | |
56 | GPU Benchmarking | Compare performance across GPUs | |
57 | Architecture-Specific Optimizations | Tune for Ampere/Hopper GPUs | |
58 | CUDA Graphs | Implement CUDA Graphs for execution efficiency | |
59 | Memory Fragmentation Optimization | Optimize dynamic allocations | |
60 | Benchmarking | Compare PyTorch/TensorFlow vs. your CUDA implementations | |
61 | Optimize a Real-World Model | Pick a model (BERT/GPT) and optimize | |
62 | Custom CUDA Model Acceleration | Implement a custom CUDA-based model optimization |
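
As a concrete starting point for Phase 2, the first task above (21, `y = Wx + b`) can be written as a naïve one-thread-per-output-row kernel. The sketch below is illustrative, not the planned implementation: the kernel name, the row-major layout of W, and the launch configuration are assumptions, and the later objectives (22 onward) would add coalesced access, tiling, and fusion.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Naive dense forward pass: y = Wx + b, with W stored row-major (rows x cols).
// One thread computes one output element y[row].
__global__ void denseForward(const float *W, const float *x, const float *b,
                             float *y, int rows, int cols) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= rows) return;
    float acc = b[row];
    for (int k = 0; k < cols; ++k)
        acc += W[row * cols + k] * x[k];           // dot product of W's row with x
    y[row] = acc;
}

int main() {
    const int rows = 4, cols = 3;
    float *W, *x, *b, *y;
    cudaMallocManaged(&W, rows * cols * sizeof(float));
    cudaMallocManaged(&x, cols * sizeof(float));
    cudaMallocManaged(&b, rows * sizeof(float));
    cudaMallocManaged(&y, rows * sizeof(float));
    for (int i = 0; i < rows * cols; ++i) W[i] = 1.0f;   // toy data: every y[i] should be 6.5
    for (int i = 0; i < cols; ++i) x[i] = 2.0f;
    for (int i = 0; i < rows; ++i) b[i] = 0.5f;

    denseForward<<<(rows + 127) / 128, 128>>>(W, x, b, y, rows, cols);
    cudaDeviceSynchronize();
    for (int i = 0; i < rows; ++i) printf("y[%d] = %.2f\n", i, y[i]);

    cudaFree(W); cudaFree(x); cudaFree(b); cudaFree(y);
    return 0;
}
```

With this layout, neighbouring threads read W with a stride of cols elements, so the loads are uncoalesced; staging x in shared memory or assigning a warp per row are the usual first optimizations.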