Snowflake AI Research achieves a 3.7× throughput increase at 64K tokens by distributing attention heads across GPUs with Ulysses Sequence Parallelism
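The core idea behind Ulysses-style sequence parallelism can be sketched in a few lines: each rank starts with a shard of the sequence for all attention heads, and an all-to-all swaps the sharding axis so each rank holds the full sequence for a subset of heads. The following is a minimal single-process NumPy simulation of that redistribution (assumed mechanics for illustration, not the DeepSpeed implementation; all sizes are made up):

```python
import numpy as np

# Hypothetical sizes: P ranks, sequence length S, H heads, head dim D.
P, S, H, D = 4, 8, 8, 16
rng = np.random.default_rng(0)
q = rng.normal(size=(S, H, D))    # full query tensor, kept for reference

# Before the all-to-all: rank r holds a sequence shard of ALL heads,
# i.e. q[r*S//P:(r+1)*S//P, :, :].
seq_shards = np.split(q, P, axis=0)

# Simulated all-to-all: each rank r sends head-block h of its sequence
# shard to rank h. Afterwards rank h holds the FULL sequence for heads
# h*H//P:(h+1)*H//P, so attention for those heads runs locally with no
# further communication.
head_shards = [
    np.concatenate([np.split(s, P, axis=1)[h] for s in seq_shards], axis=0)
    for h in range(P)
]

# Each head shard now covers the whole sequence for H//P heads.
assert head_shards[0].shape == (S, H // P, D)
# Concatenating the head shards back along the head axis recovers q.
assert np.allclose(np.concatenate(head_shards, axis=1), q)
```

After attention, a second all-to-all restores the original sequence sharding; communication volume stays constant per rank as the sequence length grows, which is what makes million-token contexts feasible.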
Ulysses Sequence Parallelism: Training with Million-Token Contexts
From DeepSpeed to FSDP and Back Again with Hugging Face Accelerate
Fast Inference on Large Language Models: BLOOMZ on Habana Gaudi2 Accelerator
Optimum+ONNX Runtime - Easier, Faster training for your Hugging Face models
Incredibly Fast BLOOM Inference with DeepSpeed and Accelerate
The Technology Behind BLOOM Training
Accelerate Large Model Training using DeepSpeed
Fit More and Train Faster With ZeRO via DeepSpeed and FairScale