ํ”ผ๋“œ๋กœ ๋Œ์•„๊ฐ€๊ธฐ
Tricks from OpenAI gpt-oss YOU ๐Ÿซต can use with transformers
Hugging Face Blog
Backend

Hugging Face Transformers integrates custom kernels downloadable from the Hub together with MXFP4 quantization, improving GPT-OSS loading, inference, and fine-tuning performance by 2-10x


September 11, 2025 · 10 min · intermediate

Context

์ปค๋ฎค๋‹ˆํ‹ฐ์—์„œ ๊ฐœ๋ฐœ๋œ Flash Attention, Liger RMSNorm, MegaBlocks MoE ๋“ฑ์˜ ์ปค์Šคํ…€ ์ปค๋„๋“ค์ด ์„œ๋กœ ๋‹ค๋ฅธ ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ์— ์‚ฐ์žฌ๋˜์–ด ์žˆ์–ด ์˜์กด์„ฑ ์ฆ๊ฐ€์™€ CUDA/C++ ์ปดํŒŒ์ผ ์š”๊ตฌ์‚ฌํ•ญ์ด ๋ฐœ์ƒํ–ˆ๋‹ค. ๊ฐ ๋ชจ๋ธ ํ†ตํ•ฉ ์‹œ๋งˆ๋‹ค ์ƒˆ๋กœ์šด ์ปค๋„ ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๋ฅผ ์ถ”๊ฐ€ํ•ด์•ผ ํ•˜๋Š” ๊ตฌ์กฐ๋กœ ์ธํ•ด ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ ๋ณต์žก๋„๊ฐ€ ์ฆ๊ฐ€ํ•˜๊ณ  ์žˆ์—ˆ๋‹ค.

Technical Solution

  • Zero-build kernels package: pre-compiled kernel binaries are downloaded from the Hub and selected automatically via the @use_kernel_forward_from_hub() decorator
  • Liger RMSNorm kernel integration: normalization is optimized through the @use_kernel_forward_from_hub("RMSNorm") decorator
  • MegaBlocks MoE kernel integration: Mixture of Experts computation is accelerated through the @use_kernel_forward_from_hub("MegaBlocksMoeMLP") decorator
  • Flash Attention 3 integration: a Flash Attention 3 kernel with Attention Sinks support was added, targeting the Hopper architecture
  • MXFP4 quantization kernel: Triton-based MXFP4 quantization ops are shipped as a custom kernel
  • Optimized device loading: multi-GPU loading is faster with device_map="auto" or Tensor Parallel execution
  • Automatic community-kernel selection: compatible kernels are chosen automatically based on CUDA/ROCm availability and training vs. inference mode
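The kernel-selection mechanism behind these decorators can be illustrated with a toy, self-contained sketch. This is not the real Hugging Face `kernels` package: the `_KERNEL_REGISTRY` dict, `register_kernel` helper, and the plain-Python `RMSNorm` class are hypothetical stand-ins for the Hub download and compiled binaries.

```python
from typing import Callable, Dict, Type

# Hypothetical registry: kernel name -> optimized forward implementation.
# (In the real package, this role is played by kernels fetched from the Hub.)
_KERNEL_REGISTRY: Dict[str, Callable] = {}

def register_kernel(name: str):
    def deco(fn):
        _KERNEL_REGISTRY[name] = fn
        return fn
    return deco

def use_kernel_forward_from_hub(name: str):
    """Class decorator: swap in an optimized forward if one is available."""
    def deco(cls: Type) -> Type:
        optimized = _KERNEL_REGISTRY.get(name)
        if optimized is not None:
            cls.forward = optimized  # otherwise the reference forward stays
        return cls
    return deco

# A stand-in for a pre-compiled kernel binary downloaded from the Hub.
@register_kernel("RMSNorm")
def fast_rmsnorm_forward(self, xs):
    mean_sq = sum(x * x for x in xs) / len(xs)
    scale = (mean_sq + self.eps) ** -0.5
    return [x * scale for x in xs]

@use_kernel_forward_from_hub("RMSNorm")
class RMSNorm:
    def __init__(self, eps: float = 1e-6):
        self.eps = eps

    def forward(self, xs):  # reference path, replaced when a kernel exists
        mean_sq = sum(x * x for x in xs) / len(xs)
        return [x / (mean_sq + self.eps) ** 0.5 for x in xs]

norm = RMSNorm()
print(norm.forward([3.0, 4.0]))  # served by the registered fast kernel
```

The key design point mirrors the article: the reference `forward` remains as a fallback, so models work everywhere, while a compatible pre-compiled kernel transparently replaces it when one is registered.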

Impact

  • PyTorch 2.0's torch.compile with the TorchInductor backend delivers a 2-10x performance improvement
  • Custom kernels reach their best performance at larger batch sizes (Figure 1 benchmark results)

Key Takeaway

์ปค์Šคํ…€ ์ปค๋„์„ ์ค‘์•™ ๋ฆฌํฌ์ง€ํ† ๋ฆฌ(Hub)์—์„œ ์‚ฌ์ „ ์ปดํŒŒ์ผ ๋ฐ”์ด๋„ˆ๋ฆฌ๋กœ ๋ฐฐํฌํ•˜๊ณ  ๋ฐ์ฝ”๋ ˆ์ดํ„ฐ ํŒจํ„ด์œผ๋กœ ์ถ”์ƒํ™”ํ•˜๋ฉด, ์˜์กด์„ฑ ์ฆ๊ฐ€์™€ ์ปดํŒŒ์ผ ์˜ค๋ฒ„ํ—ค๋“œ๋ฅผ ์ œ๊ฑฐํ•˜๋ฉด์„œ๋„ ์—ฌ๋Ÿฌ ๋ชจ๋ธ์—์„œ ์žฌ์‚ฌ์šฉ ๊ฐ€๋Šฅํ•œ ์ตœ์ ํ™” ๊ธฐ๋ฒ•์„ ํ™•์‚ฐํ•  ์ˆ˜ ์žˆ๋‹ค. ์ด๋Š” ์ปค๋ฎค๋‹ˆํ‹ฐ ๊ธฐ์—ฌ ์ปค๋„์„ ์ฐธ์กฐ ๊ตฌํ˜„์œผ๋กœ ์ œ๊ณตํ•จ์œผ๋กœ์จ MLX, llama.cpp, vLLM ๊ฐ™์€ ๋‹ค๋ฅธ ํ”„๋ ˆ์ž„์›Œํฌ์˜ ํ•™์Šต ์ž๋ฃŒ๋กœ๋„ ํ™œ์šฉ๋œ๋‹ค.


GPT-OSS ๊ฐ™์€ ๋Œ€๊ทœ๋ชจ ์–ธ์–ด๋ชจ๋ธ์„ ์šด์˜ํ•˜๋Š” ํŒ€์—์„œ `AutoModelForCausalLM.from_pretrained(model_id, use_kernels=True)`๋กœ ๋กœ๋”ฉํ•˜๋ฉด ์ถ”๊ฐ€ ์˜์กด์„ฑ ์„ค์น˜ ์—†์ด Liger RMSNorm, MegaBlocks MoE, Flash Attention 3 ๋“ฑ์˜ ์ปค์Šคํ…€ ์ปค๋„์ด ์ž๋™ ๋‹ค์šด๋กœ๋“œยท์ ์šฉ๋˜์–ด ๋ฐฐ์น˜ ํฌ๊ธฐ์— ๋”ฐ๋ผ ์ถ”๋ก  ์„ฑ๋Šฅ์„ ํ–ฅ์ƒ์‹œํ‚ฌ ์ˆ˜ ์žˆ๋‹ค. ๋‹ค๋งŒ MXFP4 ์–‘์žํ™” ์ปค๋„ ์‚ฌ์šฉ ์‹œ์—๋Š” bfloat16 ํƒ€์ž… ์ถ”๋ก ์œผ๋กœ ์ „ํ™˜๋˜๋ฏ€๋กœ ๋ฉ”๋ชจ๋ฆฌ์™€ ์ฒ˜๋ฆฌ๋Ÿ‰ ํŠธ๋ ˆ์ด๋“œ์˜คํ”„๋ฅผ ๋ฒค์น˜๋งˆํฌํ•ด์•ผ ํ•œ๋‹ค.

์›๋ฌธ ์ฝ๊ธฐ