1.69x Training Speedup via Local Gradient Accumulation

Local Gradient Accumulation Speeds Training 1.7

Papers Mache · June 21, 2026 · 2 min read · advanced

AI Summary

Context

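In Pipeline Parallelism, the 1F1B-flush schedule produces Pipeline Bubbles that leave GPUs idle and waste compute. Asynchronous schedules raise throughput, but they depend on complex techniques such as Weight Stashing, and that added complexity comes with a trade-off: reduced Training Stability.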
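To put a rough number on the bubble cost, here is a small sketch. For a flush-based schedule with p stages and m micro-batches per step, the idle fraction is commonly modeled as (p - 1) / (m + p - 1). The function name is hypothetical, and the model assumes every stage takes the same time per micro-batch and that communication is free; neither assumption comes from the paper.

```python
def bubble_fraction(num_stages: int, num_microbatches: int) -> float:
    """Idle fraction of a flush-based pipeline schedule (e.g. 1F1B-flush).

    Simplified model (an assumption of this sketch): equal per-stage time
    for every micro-batch and zero communication cost.
    """
    if num_stages < 1 or num_microbatches < 1:
        raise ValueError("num_stages and num_microbatches must be >= 1")
    # One step spans (m + p - 1) time slots, of which (p - 1) are idle per stage.
    return (num_stages - 1) / (num_microbatches + num_stages - 1)


if __name__ == "__main__":
    for p, m in [(4, 4), (4, 16), (8, 16), (8, 64)]:
        print(f"stages={p} microbatches={m} bubble={bubble_fraction(p, m):.1%}")
```

Under this model the bubble shrinks only when the number of micro-batches grows relative to the pipeline depth, which is why removing the flush itself is attractive.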

Technical Solution

Introduces Local Gradient Accumulation to remove Global Synchronization and maximize pipeline utilization
Limits the Weight Version Drift seen by each micro-batch to a fixed range, resolving the instability of asynchronous schedules
Uses a Bounded Inconsistency design to reach convergence stability and Perplexity on par with synchronous schedules
Keeps the same Peak Memory Footprint as 1F1B-flush, with no Weight Stashing or Parameter Copy
Structurally replaces the existing Flush Synchronizer with a local-accumulation wrapper
Tunes the Accumulation Window to the Pipeline Depth and micro-batch size to set the best Drift Bound
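The points above can be illustrated with a toy, single-stage model. This summary does not give the paper's actual algorithm, so everything here is an assumption made for illustration: the class and function names, the rule "update locally once per `window` backward passes", and the drift bound ceil((in_flight - 1) / window). The sketch stores only an integer version per in-flight micro-batch rather than a copy of the weights, in the spirit of the no-Weight-Stashing claim.

```python
from dataclasses import dataclass, field
from math import ceil


def worst_case_drift(in_flight: int, window: int) -> int:
    """Bound on local updates between one micro-batch's forward and backward.

    Assumption of this sketch: at most `in_flight - 1` other backward passes
    run on the stage in between, with one update per `window` backward passes.
    """
    return ceil(max(in_flight - 1, 0) / window)


@dataclass
class LocalAccumulationStage:
    """Hypothetical per-stage wrapper: accumulate locally, update without a global flush."""
    weights: list
    window: int          # micro-batches accumulated before a local update
    max_drift: int       # allowed weight-version gap between forward and backward
    lr: float = 0.1
    version: int = 0
    _grad_sum: list = field(default_factory=list, init=False)
    _pending: int = field(default=0, init=False)
    _fwd_version: dict = field(default_factory=dict, init=False)

    def forward(self, mb_id: int) -> None:
        # Remember only the version number, not a copy of the weights.
        self._fwd_version[mb_id] = self.version

    def backward(self, mb_id: int, grad: list) -> int:
        drift = self.version - self._fwd_version.pop(mb_id)
        if drift > self.max_drift:
            raise RuntimeError(f"drift {drift} exceeds bound {self.max_drift}")
        if not self._grad_sum:
            self._grad_sum = [0.0] * len(self.weights)
        self._grad_sum = [a + g for a, g in zip(self._grad_sum, grad)]
        self._pending += 1
        if self._pending == self.window:
            # Local update with the averaged gradient; no other stage is consulted.
            self.weights = [w - self.lr * g / self.window
                            for w, g in zip(self.weights, self._grad_sum)]
            self._grad_sum, self._pending = [], 0
            self.version += 1
        return drift


def simulate(in_flight: int, window: int, num_microbatches: int) -> int:
    """Run a 1F1B-like steady state on one stage and return the largest drift seen."""
    stage = LocalAccumulationStage(weights=[0.0], window=window,
                                   max_drift=worst_case_drift(in_flight, window))
    next_fwd, max_seen = 0, 0
    while next_fwd < min(in_flight, num_microbatches):  # warm-up fills the stage
        stage.forward(next_fwd)
        next_fwd += 1
    for mb in range(num_microbatches):                  # one backward, one forward
        max_seen = max(max_seen, stage.backward(mb, grad=[1.0]))
        if next_fwd < num_microbatches:
            stage.forward(next_fwd)
            next_fwd += 1
    return max_seen


if __name__ == "__main__":
    for in_flight, window in [(4, 1), (4, 2), (4, 4), (8, 4)]:
        print(in_flight, window, simulate(in_flight, window, 64))
```

In this toy model a larger `window` tightens the drift bound for a given number of in-flight micro-batches, at the price of less frequent updates; that is one way to read the Accumulation Window tuning described above, not a statement of the paper's tuning rule.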

Action Points

Design for Bounded Inconsistency to find the best balance between Throughput and Stability at no additional hardware cost.

Tags

#Distributed Training #Gradient Accumulation #Pipeline Parallelism #Weight Inconsistency #Throughput
