References
https://github.com/ModelTC/LightLLM/tree/main/lightllm/common/basemodel/triton_kernel
https://github.com/unslothai/unsloth/tree/main/unsloth/kernels
GPU Teaching Kit - Accelerated Computing
http://gputeachingkit.hwu.crhc.illinois.edu/
Includes videos and slides
Simplifying CUDA kernels with Triton: A Pythonic Approach to GPU Programming
The Deep Learning Compiler: A Comprehensive Survey
https://github.com/tugot17/pmpp
Complete solutions to Programming Massively Parallel Processors, 4th Edition
Lecture 1: How to profile CUDA kernels in PyTorch
https://www.youtube.com/watch?v=LuhJEEJQgUM&t=2200s&ab_channel=GPUMODE
Programming Massively Parallel Processors (PMPP)
https://stevengong.co/notes/Programming-Massively-Parallel-Processors
Has extensive notes