🚀 Accelerating LLM Inference with TGI on Intel Gaudi
Hugging Face Blog
Backend

Hugging Face has integrated Intel Gaudi hardware support natively into Text Generation Inference (TGI), removing the need for a separate fork and making the latest TGI features available on Gaudi at the same time.

March 28, 2025 · 6 min read · intermediate

Context

Text Generation Inference (TGI) previously had to maintain a separate fork repository (tgi-gaudi) to support Intel Gaudi hardware. As a result, users had to manage a custom repository, and new TGI features could not be brought to Gaudi quickly.

Technical Solution

  • Gaudi support integrated directly into the main TGI codebase (PR #3091): the separate fork is removed and everything is maintained in a single repository
  • New TGI multi-backend architecture introduced: modularized so that a variety of hardware can be supported
  • More than 15 LLMs optimized: Llama 3.1 (8B, 70B), Mixtral (8x7B), Mistral (7B), Falcon (180B), and others, in both single-card and multi-card configurations
  • Full Intel Gaudi hardware line supported: Gaudi 1, Gaudi 2, and Gaudi 3
  • FP8 quantization added: further performance optimization through Intel Neural Compressor (INC)

Impact

The article does not give quantitative performance figures (latency, throughput, cost savings, etc.).

Key Takeaway

With a multi-backend architecture, new accelerator hardware can be integrated quickly while existing features and the user experience stay intact. Moving from a maintained fork to a single codebase improves both the pace of feature updates and accessibility for users.


Teams building LLM inference infrastructure on Intel Gaudi hardware can run the official TGI Docker image (ghcr.io/huggingface/text-generation-inference:3.2.1-gaudi) with the Habana runtime and deploy pre-optimized models such as Llama 3.1, Mixtral, and Mistral right away, with no extra customization. Multi-card inference sharding and FP8 quantization are also supported out of the box, so the same level of production features as a GPU-based deployment is available.
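
As a rough sketch of what such a launch could look like, the helper below assembles a `docker run` command for the image named above. Only the image tag and the use of the Habana runtime come from this summary. The remaining options (`--cap-add=sys_nice`, `--ipc=host`, the port and volume mappings, `HABANA_VISIBLE_DEVICES=all`, the `HF_TOKEN` pass-through), the example model ID, and the function name `build_tgi_gaudi_command` are assumptions to check against the TGI and Intel Gaudi documentation before use. The function only builds the command; it does not run it.

```python
import shlex
from pathlib import Path

# Image tag quoted in the summary above.
IMAGE = "ghcr.io/huggingface/text-generation-inference:3.2.1-gaudi"


def build_tgi_gaudi_command(model_id, host_port=8080, data_dir="data", num_shard=1):
    """Return an argv list for launching TGI on Gaudi (hypothetical helper)."""
    cmd = [
        "docker", "run",
        "--runtime=habana",                   # Habana container runtime, per the text
        "--cap-add=sys_nice",                 # assumption
        "--ipc=host",                         # assumption
        "-p", f"{host_port}:80",              # assumption: server listens on port 80
        "-v", f"{Path(data_dir).resolve()}:/data",  # assumption: weight cache at /data
        "-e", "HF_TOKEN",                     # forward the host's token for gated models
        "-e", "HABANA_VISIBLE_DEVICES=all",   # assumption: expose all Gaudi cards
        IMAGE,
        "--model-id", model_id,
    ]
    if num_shard > 1:
        # Multi-card sharding through the TGI launcher flags.
        cmd += ["--sharded", "true", "--num-shard", str(num_shard)]
    return cmd


if __name__ == "__main__":
    # Example model ID chosen for illustration only.
    print(shlex.join(build_tgi_gaudi_command("meta-llama/Meta-Llama-3.1-8B-Instruct")))
```

Passing `num_shard=8` appends the sharding flags for a multi-card setup. The printed command can be reviewed and then executed manually on a host that has the Habana container runtime installed.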
