🚦 Meet Kueue: Smart Job Queueing for Kubernetes 🧠⚙️
Dev.to
Infrastructure

Kueue-based Job quota management and priority control that overcome the limits of Pod-centric scheduling

Hamdi (KHELIL) LION · June 30, 2026 · 16 min · intermediate

Context

The default Kubernetes scheduler places workloads one Pod at a time, so it cannot control when a Batch or AI/ML workload as a whole starts. The result is resource contention, and a single team can monopolize resources. This is a structural limitation that lowers the efficiency of the entire cluster.

Technical Solution

  • Place an admission control layer in front of the scheduler so that the decision of when to start is made per Job
  • Use ResourceFlavor to group and abstract resources by node characteristics (x86, arm, GPU)
  • Use the ClusterQueue and LocalQueue hierarchy to govern quota across the cluster and allocate it per namespace
  • Track Job state through Workload objects, and apply admission logic that allows Pod creation only after available quota has been confirmed
  • Introduce a Fair Sharing mechanism that lets teams share idle quota through Cohort settings
  • Use priority-based admission so that production training jobs are guaranteed to run first
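The admission idea in the list above can be illustrated with a small toy model. This is only a sketch under simplifying assumptions, not Kueue's actual algorithm: it tracks a single resource (GPUs), admits a workload only when its whole request fits, lets queues in the same cohort use each other's idle quota, and considers higher-priority workloads first. The class names, field names, and the `admit` function are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ClusterQueue:
    name: str
    nominal_gpus: int              # quota this queue owns
    cohort: Optional[str] = None   # queues in one cohort share idle quota
    used_gpus: int = 0


@dataclass
class Workload:
    name: str
    queue: str         # ClusterQueue the workload is submitted to
    gpus: int          # total GPU request of the whole job
    priority: int = 0


def admit(workloads, queues):
    """Return the names of admitted workloads, highest priority first.

    Toy rule: a workload is admitted only if its full request fits in the
    combined nominal quota of its cohort (or of its own queue when the
    queue has no cohort). Anything else stays queued.
    """
    by_name = {q.name: q for q in queues}
    admitted = []
    # sorted() is stable, so equal priorities keep submission order.
    for wl in sorted(workloads, key=lambda w: -w.priority):
        cq = by_name[wl.queue]
        if cq.cohort is None:
            members = [cq]
        else:
            members = [q for q in queues if q.cohort == cq.cohort]
        capacity = sum(q.nominal_gpus for q in members)
        in_use = sum(q.used_gpus for q in members)
        if in_use + wl.gpus <= capacity:
            cq.used_gpus += wl.gpus   # may exceed nominal: borrowed quota
            admitted.append(wl.name)
    return admitted


queues = [
    ClusterQueue("team-a", nominal_gpus=4, cohort="research"),
    ClusterQueue("team-b", nominal_gpus=4, cohort="research"),
]
workloads = [
    Workload("dev-experiment", "team-a", gpus=4, priority=0),
    Workload("prod-training", "team-a", gpus=6, priority=100),
    Workload("team-b-batch", "team-b", gpus=2, priority=50),
]
result = admit(workloads, queues)
print(result)
```

In this example `prod-training` asks for 6 GPUs while `team-a` owns only 4, so it is admitted by borrowing 2 idle GPUs from `team-b`. The low-priority `dev-experiment` stays queued because the cohort is full by the time it is considered.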

1. Require resource requests in every Pod definition so that no usage is missed in quota calculation

2. Check that each Job definition carries a queue-name label that matches a LocalQueue

3. When using expensive resources such as GPUs, review whether Partial Admission could waste resources

4. When adopting Elastic Jobs, confirm that the Feature Gate is enabled

5. Verify that the flavor names in the ClusterQueue match the ResourceFlavor names
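Items 1, 2, and 5 of the checklist above can be checked mechanically before manifests are applied. The following is a minimal sketch that assumes the manifests have already been loaded into Python dicts. The helper names `check_job` and `check_flavors` are hypothetical, and the set of LocalQueue names passed in is assumed to be the queues in the Job's own namespace. Items 3 and 4 depend on cluster configuration and are not covered.

```python
QUEUE_LABEL = "kueue.x-k8s.io/queue-name"


def check_job(job, local_queue_names):
    """Return a list of problems found in a Job manifest (as a dict)."""
    problems = []
    labels = job.get("metadata", {}).get("labels", {})
    queue = labels.get(QUEUE_LABEL)
    if queue is None:
        problems.append(f"missing label {QUEUE_LABEL}")
    elif queue not in local_queue_names:
        problems.append(f"label points to unknown LocalQueue {queue!r}")
    # Deliberately strict: a container that sets only limits is flagged too.
    for c in job["spec"]["template"]["spec"]["containers"]:
        if not c.get("resources", {}).get("requests"):
            problems.append(f"container {c['name']!r} has no resource requests")
    return problems


def check_flavors(cluster_queue, resource_flavor_names):
    """Return flavor names a ClusterQueue references that do not exist."""
    missing = []
    for group in cluster_queue["spec"].get("resourceGroups", []):
        for flavor in group.get("flavors", []):
            if flavor["name"] not in resource_flavor_names:
                missing.append(flavor["name"])
    return missing


# Illustrative manifests; all names below are made up for the example.
job = {
    "metadata": {"name": "train", "labels": {QUEUE_LABEL: "team-a-queue"}},
    "spec": {"template": {"spec": {"containers": [
        {"name": "trainer", "resources": {"requests": {"nvidia.com/gpu": 1}}},
        {"name": "sidecar"},
    ]}}},
}
cluster_queue = {
    "spec": {"resourceGroups": [{
        "coveredResources": ["nvidia.com/gpu"],
        "flavors": [{"name": "gpu-a100"}, {"name": "gpu-t4"}],
    }]},
}

job_problems = check_job(job, {"team-a-queue"})
missing_flavors = check_flavors(cluster_queue, {"gpu-a100"})
print(job_problems)
print(missing_flavors)
```

Here the Job passes the label check but the `sidecar` container is reported for lacking requests, and `gpu-t4` is reported because no ResourceFlavor with that name was supplied.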
