The State of Computer Vision at Hugging Face 🤗
Hugging Face Blog
AI/ML

Starting from the introduction of Vision Transformer, Hugging Face has brought 8 core computer vision tasks, more than 3,000 models, and more than 100 datasets to the Hub.

January 30, 2023 · 10 min · intermediate

Context

Under its mission of democratizing AI, Hugging Face needed to expand beyond natural language processing into computer vision. It previously offered only limited support for a single architecture, Vision Transformer (ViT), and had to cover the wide range of vision tasks that come up in industry.

Technical Solution

  • Support for 8 core vision tasks: image classification, image segmentation, zero-shot object detection, video classification, depth estimation, image-to-image synthesis, unconditional image generation, and zero-shot image classification
  • Vision-language tasks: image-to-text (image captioning, OCR), text-to-image, document question answering, and visual question answering (VQA)
  • Broader architecture support: pure convolutional networks (ConvNeXt, ResNet, RegNet) in addition to Transformer-based models (ViT, Swin, DETR)
  • Simpler inference with the Pipelines API: a unified interface that runs inference for 7 vision tasks in 3 to 5 lines of code
  • Fine-tuning with the Trainer API: unified training through Trainer for image classification, image segmentation, video classification, object detection, and depth estimation
  • Datasets and augmentation library integration: access to more than 100 datasets, including ImageNet-1k, Scene Parsing, NYU Depth V2, COYO-700M, and LAION-400M, plus interoperability with albumentations and Kornia
  • Zero-shot models: CLIP (zero-shot image classification), OWL-ViT (zero-shot object detection), CLIPSeg (zero-shot segmentation), GroupViT (zero-shot segmentation), and X-CLIP (zero-shot video classification)
  • Deployment through Inference Endpoints: image classification, object detection, and image segmentation are integrated directly; other tasks are supported through custom handlers
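
As a rough illustration of the Pipelines bullet above, the following sketch wraps an image-classification pipeline in a small helper. The helper name `classify_image`, its `classifier` and `top_k` parameters, and the choice to let the library pick its default checkpoint are assumptions made for this example, not something the original post prescribes. The `transformers` import is deferred so that a stand-in classifier can be passed in when the library or network access is unavailable.

```python
from typing import Any, Callable, Dict, List, Optional

Prediction = Dict[str, Any]


def classify_image(
    image: str,
    classifier: Optional[Callable[[str], List[Prediction]]] = None,
    top_k: int = 3,
) -> List[Prediction]:
    """Return the top_k predictions for an image path or URL, best first.

    `classifier` is a hypothetical injection point for this sketch; when it
    is omitted, a transformers image-classification pipeline is created.
    """
    if classifier is None:
        # Lazy import: only needed when no classifier is injected.
        from transformers import pipeline

        # With no model argument, the library selects its default checkpoint
        # for the task and downloads it from the Hub on first use.
        classifier = pipeline("image-classification")

    predictions = classifier(image)
    # Each prediction is a dict with "label" and "score" keys.
    ranked = sorted(predictions, key=lambda p: p["score"], reverse=True)
    return ranked[:top_k]
```

Calling `classify_image("path/to/image.jpg")` (a placeholder path) would build the pipeline and return up to three label/score dictionaries. In practice one would probably pass a specific Hub checkpoint through the pipeline's `model` argument rather than rely on the default.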

Impact

More than 3,000 models are available on the Hub, and more than 100 computer vision datasets have been integrated.

Key Takeaway

Democratizing the open-source ML ecosystem is achieved less by consolidating features inside a single library than by offering standardized interfaces (Pipeline, Trainer, Hub) that span Transformers, PyTorch, and third-party libraries alike. Freedom in choosing an architecture and a simple path to production deployment are what drive community adoption.


Teams that need to deploy computer vision models to production can follow one integrated workflow: first check the Hub for pretrained models and Datasets that match the task (classification, detection, segmentation, and so on), build a quick prototype with Pipelines, fine-tune with Trainer if enough data is available, and finally deploy through Inference Endpoints or a custom handler.
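
The last step of that workflow, a custom handler, might look roughly like the sketch below. It assumes the `handler.py` convention used by Inference Endpoints (a class named `EndpointHandler` with `__init__(path)` and `__call__(data)`), and it uses zero-shot image classification as an example of a task outside the directly integrated ones. The optional `pipe` argument and the request shape, with an image URL under `"inputs"` and labels under `"parameters"`, are assumptions made for this illustration.

```python
from typing import Any, Dict, List


class EndpointHandler:
    """Sketch of a custom handler for zero-shot image classification."""

    def __init__(self, path: str = "", pipe: Any = None):
        if pipe is None:
            # Lazy import so the class can be exercised with a stub pipeline.
            from transformers import pipeline

            # `path` is expected to point at the model repository contents.
            pipe = pipeline("zero-shot-image-classification", model=path)
        self.pipe = pipe

    def __call__(self, data: Dict[str, Any]) -> List[Dict[str, Any]]:
        # Assumed payload:
        # {"inputs": <image URL or path>,
        #  "parameters": {"candidate_labels": [...]}}
        image = data["inputs"]
        labels = data.get("parameters", {}).get("candidate_labels")
        if not labels:
            raise ValueError("parameters.candidate_labels must be a non-empty list")
        # The pipeline returns a list of {"label": ..., "score": ...} dicts.
        return self.pipe(image, candidate_labels=labels)
```

Making the pipeline injectable allows the request parsing to be checked locally with a stub before any model is downloaded. A real deployment would also need to decide how images are sent (URL, base64, or raw bytes), which this sketch leaves open.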