Eliminating HBM Traffic and Optimizing Kernels Through Fused MLP Design
Profiling in PyTorch (Part 2): From nn.Linear to a Fused MLP
Full-Stack LLM Implementation with Triton-Based FlashAttention2 and Distributed Training
CS336: Language Modeling from Scratch
GPU Hardware, VRAM Optimization & Next-Gen Driver Updates
Make LLM Fine-tuning 2x faster with Unsloth and 🤗 TRL