References
- https://github.com/ModelTC/LightLLM/tree/main/lightllm/common/basemodel/triton_kernel 
- https://github.com/unslothai/unsloth/tree/main/unsloth/kernels 
- GPU Teaching Kit - Accelerated Computing - http://gputeachingkit.hwu.crhc.illinois.edu/ - Includes videos and slides
- Simplifying CUDA kernels with Triton: A Pythonic Approach to GPU Programming 
- The Deep Learning Compiler: A Comprehensive Survey 
- https://github.com/tugot17/pmpp - Complete solutions to Programming Massively Parallel Processors, 4th edition
- Lecture 1: How to profile CUDA kernels in PyTorch - https://www.youtube.com/watch?v=LuhJEEJQgUM&t=2200s&ab_channel=GPUMODE
- Programming Massively Parallel Processors (PMPP) - https://stevengong.co/notes/Programming-Massively-Parallel-Processors - Extensive notes