Accelerate StarCoder with 🤗 Optimum Intel on Xeon: Q8/Q4 and Speculative Decoding
Hugging Face Blog
AI/ML

Using Optimum Intel, Intel applied INT8/INT4 quantization and speculative decoding to the StarCoder 15B model and achieved more than 7x faster inference on Xeon.

January 30, 2024 · 9 min · intermediate

Context

Because LLMs generate tokens autoregressively, the entire model must be read from DRAM into the CPU for every generated token, so the bandwidth between off-chip memory and the CPU becomes the main bottleneck for token generation.
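A back-of-the-envelope sketch of this bottleneck, assuming decoding is purely bandwidth-bound: each token needs at least one full read of the weights, so the time per token cannot drop below model size divided by memory bandwidth. The parameter count is rounded from the "15B" in the model name, the 16-bit baseline is an assumption, and the 300 GB/s bandwidth figure is a hypothetical placeholder, not a measurement from the source.

```python
def min_seconds_per_token(n_params: float, bits_per_weight: int,
                          bandwidth_gb_per_s: float) -> float:
    """Lower bound on per-token latency when every weight is read once per token."""
    model_bytes = n_params * bits_per_weight / 8
    return model_bytes / (bandwidth_gb_per_s * 1e9)


N_PARAMS = 15e9          # assumption: rounded from "StarCoder 15B"
BANDWIDTH_GB_S = 300.0   # hypothetical memory bandwidth, for illustration only

for bits in (16, 8, 4):  # assumed 16-bit baseline, then INT8 and INT4 weights
    bound_ms = 1e3 * min_seconds_per_token(N_PARAMS, bits, BANDWIDTH_GB_S)
    print(f"{bits:>2}-bit weights: at least {bound_ms:.1f} ms per token")
```

Halving the bits per weight halves this bound, which is why quantization attacks the bottleneck directly. The bound ignores compute, caches, and activations, so it is not a prediction of the measured speedups.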

Technical Solution

  • INT8 static quantization: SmoothQuant smooths outliers in the activations so the quantization levels are used more effectively, speeding up TTFT by 2.19x and TPOT by 2.20x.
  • INT4 weight-only quantization: round-to-nearest (RTN) quantization shrinks the model further, reaching a 3.35x TPOT speedup.
  • Speculative decoding: the target model verifies the K tokens proposed by a draft model in parallel, which shifts the bottleneck from memory bandwidth to compute.
  • INT8-quantized target model: with speculative decoding, an INT8 target model outperforms an INT4 one because of INT4's dequantization overhead.
  • 🤗 Optimum Intel library: replacing the AutoModelForCausalLM class with IPEXModelForCausalLM loads the optimized model and runs inference.
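To make the RTN step concrete, here is a minimal sketch of round-to-nearest weight-only quantization to 4 bits. The symmetric scheme, the per-group scales, and the group size of 4 are assumptions chosen for readability; the source does not state which RTN variant or group size was used, and the function names are hypothetical.

```python
from typing import List, Tuple


def rtn_quantize_int4(weights: List[float],
                      group_size: int = 4) -> Tuple[List[int], List[float]]:
    """Symmetric per-group round-to-nearest quantization to the range [-8, 7]."""
    q: List[int] = []
    scales: List[float] = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        # Map the largest magnitude in the group to 7; avoid dividing by zero.
        scale = max_abs / 7 if max_abs > 0 else 1.0
        scales.append(scale)
        for w in group:
            q.append(max(-8, min(7, round(w / scale))))
    return q, scales


def dequantize(q: List[int], scales: List[float],
               group_size: int = 4) -> List[float]:
    """Recover approximate weights; this is the extra work INT4 adds at inference."""
    return [v * scales[i // group_size] for i, v in enumerate(q)]


weights = [0.7, -0.35, 0.1, 0.0, 1.4, -1.4, 0.2, 0.6]
q, scales = rtn_quantize_int4(weights)
restored = dequantize(q, scales)
print(q)
print([round(w, 3) for w in restored])
```

The `dequantize` step runs every time the weights are used, which is the overhead the list above refers to when it compares INT4 and INT8 target models.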

Impact

  • TTFT: 2.19x with INT8 quantization alone, 1.95x with INT8 + speculative decoding
  • TPOT: 2.20x with INT8 quantization alone, 3.35x with INT4 quantization, 7.30x with INT8 + speculative decoding
  • Accuracy preserved: HumanEval pass@1 goes from 33.54% to 33.96% with INT8 quantization (a marginal increase) and is 32.80% with INT4
  • Combined optimization (INT8 + speculative decoding): more than 7x faster inference than the baseline, measured by TPOT
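The TTFT and TPOT speedups above combine into an end-to-end figure that depends on how many tokens are generated. The sketch below assumes the common approximation latency = TTFT + (n - 1) x TPOT. The baseline values of 1.0 s and 0.1 s are hypothetical placeholders, since the summary reports only speedup ratios.

```python
def end_to_end_speedup(base_ttft: float, base_tpot: float,
                       ttft_speedup: float, tpot_speedup: float,
                       n_tokens: int) -> float:
    """Overall speedup for generating n_tokens, given per-phase speedups."""
    decode_steps = n_tokens - 1  # the first token is covered by TTFT
    baseline = base_ttft + decode_steps * base_tpot
    optimized = (base_ttft / ttft_speedup
                 + decode_steps * base_tpot / tpot_speedup)
    return baseline / optimized


# Speedups for INT8 + speculative decoding as reported in the list above;
# the baseline TTFT and TPOT are hypothetical.
for n in (1, 32, 256):
    ratio = end_to_end_speedup(1.0, 0.1, 1.95, 7.30, n)
    print(f"{n:>3} tokens: {ratio:.2f}x")
```

Under these assumptions the overall speedup starts at the TTFT figure for a single token and moves toward the TPOT figure as generations get longer.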

Key Takeaway

Quantization relieves the memory-bandwidth bottleneck of LLM inference, but the precision should match how tokens are generated: INT4 weight-only quantization suits sequential, TPOT-bound decoding, while in compute-bound parallel settings such as speculative decoding, INT8 is more effective because its dequantization overhead is lower.
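A toy sketch of greedy speculative decoding shows why the target model ends up processing several tokens per step. The "models" here are hypothetical stand-in functions that map a token sequence to the next token. A real target model would score all K draft positions in one batched forward pass, which this sketch only simulates with a loop.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]


def greedy_decode(model: NextToken, prompt: List[int], n_new: int) -> List[int]:
    """Plain autoregressive decoding: one model call per generated token."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(model(seq))
    return seq


def speculative_decode(target: NextToken, draft: NextToken,
                       prompt: List[int], n_new: int, k: int = 4) -> List[int]:
    """Greedy speculative decoding; the output matches greedy_decode(target, ...)."""
    seq = list(prompt)
    goal = len(prompt) + n_new
    while len(seq) < goal:
        # 1. The cheap draft model proposes k tokens one at a time.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(seq + proposal))
        # 2. The target scores all k + 1 positions (one parallel pass in practice).
        verified = [target(seq + proposal[:i]) for i in range(k + 1)]
        # 3. Keep the longest prefix on which draft and target agree.
        n_ok = 0
        while n_ok < k and proposal[n_ok] == verified[n_ok]:
            n_ok += 1
        # 4. Append the target's own token at the first mismatch (or a bonus token).
        seq.extend(proposal[:n_ok])
        seq.append(verified[n_ok])
    return seq[:goal]


def toy_target(seq: List[int]) -> int:
    return (3 * seq[-1] + len(seq)) % 11


def toy_draft(seq: List[int]) -> int:
    # Agrees with the target except when the sequence length is a multiple of 5.
    guess = toy_target(seq)
    return guess if len(seq) % 5 else (guess + 1) % 11


print(speculative_decode(toy_target, toy_draft, [1, 2], n_new=12))
```

Each loop iteration yields between 1 and k + 1 tokens for a single target pass, so the target's weights are read less often per generated token and the work shifts toward compute.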


For CPU-based LLM inference services on Intel Xeon, applying Optimum Intel's IPEXModelForCausalLM with SmoothQuant INT8 quantization as the default yields more than a 2x speedup, and adding a parallel token-processing technique such as speculative decoding can bring it above 7x.
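The class swap might look like the sketch below, assuming `optimum-intel` (with its IPEX extra) and `transformers` are installed. The helper name `load_and_generate` is hypothetical, the model identifier is left to the caller because the summary does not name a checkpoint, and whether a given quantized checkpoint loads this way depends on the installed library versions.

```python
def load_and_generate(model_id: str, prompt: str, max_new_tokens: int = 32) -> str:
    """Load a causal LM through Optimum Intel's IPEX class and generate text."""
    # Imported lazily so this sketch can be read and imported
    # without optimum-intel installed.
    from transformers import AutoTokenizer
    from optimum.intel import IPEXModelForCausalLM  # in place of AutoModelForCausalLM

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = IPEXModelForCausalLM.from_pretrained(model_id)

    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Apart from the import and the class name, the call sequence is the same as with `AutoModelForCausalLM`.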
