Fine-Tune ViT for Image Classification with 🤗 Transformers
Hugging Face Blog
AI/ML

Fine-tuning a Vision Transformer (ViT) model on the beans dataset with Hugging Face Transformers, reaching 98.5% evaluation accuracy

February 11, 2022 · 12 min read · beginner

Context

The Transformer architecture transformed NLP, but there was no established method for applying it to computer vision. Vision Transformer (ViT) addressed this by splitting an image into patches and processing them much as NLP models process tokens. What remained missing was a concrete fine-tuning procedure for real image classification tasks.
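The patch idea can be illustrated with a minimal pure-Python sketch. It assumes a 224x224 RGB input and 16x16 patches, the sizes named under Technical Solution; the helper name split_into_patches is hypothetical and is not part of any library. The real model additionally applies a learned linear projection to each flattened patch, which is omitted here.

```python
def split_into_patches(image, patch_size):
    """Split an H x W x C nested-list image into flattened, non-overlapping patches."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")
    patches = []
    # Walk the image in row-major order, one patch-sized tile at a time.
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [
                channel
                for row in image[top:top + patch_size]
                for pixel in row[left:left + patch_size]
                for channel in pixel
            ]
            patches.append(patch)
    return patches


# A blank 224x224 RGB image stands in for a real photo.
image = [[[0, 0, 0] for _ in range(224)] for _ in range(224)]
patches = split_into_patches(image, 16)
print(len(patches), len(patches[0]))  # 196 768
```

Each of the 196 patches plays the role of one token, and its 768 raw values (16 x 16 pixels x 3 channels) are what the embedding layer would project.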

Technical Solution

  • Split each image into 16x16-pixel patches and embed them with a linear projection: the ViTImageProcessor for the google/vit-base-patch16-224-in21k model resizes inputs to 224x224 and normalizes them with mean=[0.5, 0.5, 0.5] and std=[0.5, 0.5, 0.5]
  • Load and preprocess the beans dataset (image samples in 3 classes) with the Hugging Face datasets library: the ClassLabel feature maps label ids to the classes 'angular_leaf_spot', 'bean_rust', and 'healthy'
  • Automate fine-tuning with the Trainer API: set the number of epochs, batch size, and learning rate in TrainingArguments and let Trainer run the training loop
  • Push the trained model to the Hugging Face Hub: set the push_to_hub parameter to True to publish it as 'nateraw/vit-base-beans'
  • Check samples of each class with an image-grid visualization function: use PIL and ImageDraw to display 3 examples per class
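The steps above could be wired together roughly as follows. This is a sketch under stated assumptions, not the article's exact code: the hyperparameter values and output directory are illustrative placeholders, the helper names are hypothetical, and the caller is assumed to supply the datasets, the collate function, and the metrics function. The transformers import sits inside build_trainer so that the small helpers can be used without the library installed.

```python
LABELS = ["angular_leaf_spot", "bean_rust", "healthy"]
CHECKPOINT = "google/vit-base-patch16-224-in21k"


def build_label_maps(labels):
    """Return (id2label, label2id) dictionaries for a list of class names."""
    id2label = {i: name for i, name in enumerate(labels)}
    label2id = {name: i for i, name in id2label.items()}
    return id2label, label2id


def training_kwargs(output_dir="./vit-base-beans", push_to_hub=False):
    """Keyword arguments for TrainingArguments; the values are assumptions."""
    return {
        "output_dir": output_dir,
        "per_device_train_batch_size": 16,  # illustrative value
        "num_train_epochs": 4,              # illustrative value
        "learning_rate": 2e-4,              # illustrative value
        # Assumes images are transformed on the fly, so the raw image
        # column must not be dropped before the transform sees it.
        "remove_unused_columns": False,
        "push_to_hub": push_to_hub,
    }


def build_trainer(train_dataset, eval_dataset, collate_fn, compute_metrics):
    """Assemble a Trainer for a 3-class ViT classifier (requires transformers)."""
    from transformers import Trainer, TrainingArguments, ViTForImageClassification

    id2label, label2id = build_label_maps(LABELS)
    model = ViTForImageClassification.from_pretrained(
        CHECKPOINT,
        num_labels=len(LABELS),
        id2label=id2label,
        label2id=label2id,
    )
    return Trainer(
        model=model,
        args=TrainingArguments(**training_kwargs()),
        data_collator=collate_fn,
        compute_metrics=compute_metrics,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
    )
```

Calling build_trainer(...).train() would start fine-tuning. Passing push_to_hub=True to training_kwargs corresponds to the Hub publishing step and assumes the user is already authenticated with the Hub.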

Impact

  • Evaluation accuracy (eval_accuracy): 0.985 (98.5%)
  • Evaluation loss (eval_loss): 0.0637
  • Evaluation throughput: 62.356 samples per second, 7.97 steps per second
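As a small worked example of how these numbers relate, the sketch below computes accuracy as the fraction of correct predictions and divides the two reported throughput figures. The accuracy helper is a hypothetical stand-in for whatever metric function is passed to Trainer, and the batch-size reading is an inference from the reported figures, not something the source states.

```python
def accuracy(predictions, references):
    """Fraction of predictions that match their reference labels."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("at least one example is required")
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


# Toy labels for illustration only, not taken from the beans dataset.
print(accuracy([0, 1, 2, 2], [0, 1, 2, 1]))  # 0.75

# Throughput figures reported above.
samples_per_second = 62.356
steps_per_second = 7.97
samples_per_step = samples_per_second / steps_per_second
print(round(samples_per_step, 2))  # 7.82
```

Roughly 7.8 samples per step would be consistent with an evaluation batch size of about 8 whose final batch is only partly filled, though that is only a guess from the arithmetic.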

Key Takeaway

With Hugging Face's integrated tools (datasets, transformers, Trainer), image classification with a Vision Transformer is as simple as fine-tuning an NLP model. Always use the ViTImageProcessor that belongs to the pretrained checkpoint: the model only behaves correctly when images are preprocessed the same way they were during pretraining.
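To make the preprocessing point concrete, here is a minimal sketch of what per-channel normalization with mean 0.5 and std 0.5 does to a pixel value. It assumes 8-bit inputs are first rescaled to [0, 1] and then standardized; normalize_pixel is a hypothetical helper, not a library function. The second call uses commonly cited ImageNet red-channel statistics purely as a contrasting example.

```python
def normalize_pixel(value, mean=0.5, std=0.5):
    """Rescale an 8-bit channel value to [0, 1], then standardize it."""
    return (value / 255.0 - mean) / std


# With mean=0.5 and std=0.5 the 8-bit range maps onto [-1, 1].
print(normalize_pixel(0), normalize_pixel(255))  # -1.0 1.0

# The same white pixel under different statistics lands on a different
# value, which is the kind of mismatch the matching processor avoids.
print(normalize_pixel(255, mean=0.485, std=0.229))
```

A model pretrained on inputs in [-1, 1] sees out-of-range values if a pipeline swaps in other statistics, which is why loading the processor from the same checkpoint as the model is the safer default.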


Engineers working on image classification can integrate the official ViTImageProcessor of a pretrained ViT model such as google/vit-base-patch16-224 into their pipeline and fine-tune on a public dataset (beans, CIFAR-10, and so on) with the Trainer API. On beans, this approach quickly reached over 98% evaluation accuracy.
