📄 Paper: RORA-VLM: Robust Retrieval Augmentation for Vision Language Models
Dev.to
AI/ML

Stabilizing external-knowledge reasoning in VLMs with a noise-resilient RAG architecture

Mercy · May 29, 2026 · 1 min read · advanced

Context

VQA tasks that cannot be answered from the image alone increasingly require external knowledge. In existing VLMs, noisy data that enters during retrieval degrades reasoning performance and becomes a bottleneck.

Technical Solution

  • Two-stage retrieval (image-to-entity, then entity-to-text) to improve the accuracy of knowledge retrieval
  • Entity anchor matching against the 37 million images in the WIT database, plus query expansion through a Google API
  • Query-oriented visual token refinement to remove image background noise unrelated to the question
  • Attention-score-based patch selection that rebuilds the input sequence from only the key visual tokens
  • Noise-resilient training that deliberately injects irrelevant knowledge during training, strengthening the model's ability to filter information
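
The attention-based patch selection can be illustrated with a minimal sketch. It assumes a scaled dot-product score between each patch embedding and a single pooled query vector. The summary does not give the paper's exact scoring rule, so the function name `refine_visual_tokens`, the `keep` parameter, and the toy 2-dimensional embeddings are all hypothetical.

```python
import math


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def refine_visual_tokens(patch_tokens, query_vec, keep):
    """Keep the `keep` patches that attend most strongly to the query.

    Assumption: attention is a scaled dot product against one pooled
    query vector. Kept patches are returned in their original order so
    the downstream sequence preserves spatial ordering.
    """
    scale = math.sqrt(len(query_vec))
    scores = softmax([dot(p, query_vec) / scale for p in patch_tokens])
    ranked = sorted(range(len(patch_tokens)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:keep])
    return kept, [patch_tokens[i] for i in kept]


# Toy data: patches 0 and 2 align with the query, patches 1 and 3 act as background.
patches = [[0.9, 0.1], [0.0, 1.0], [0.8, 0.3], [0.1, 0.9]]
query = [1.0, 0.0]
kept_idx, kept_tokens = refine_visual_tokens(patches, query, keep=2)
```

Restoring the original patch order after ranking is a design choice made for this sketch; a real model might instead rely on positional embeddings carried by the tokens themselves.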

  • When designing a RAG system, if noise in the retrieved results hurts model performance, consider training with deliberately injected noise (an approach comparable to adversarial training)
  • When processing multimodal input, add a query-based token refinement step instead of feeding in all the data, so the context window is used more efficiently
  • Instead of plain keyword search, use a hierarchical retrieval pipeline such as image-to-entity-to-text to improve retrieval precision
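
As a rough illustration of the noise-injection idea, the sketch below builds one training prompt that mixes a relevant passage with randomly sampled distractors. The helper names `build_noisy_context` and `format_prompt`, the prompt layout, and the example passages are assumptions made for illustration. They are not taken from the paper.

```python
import random


def build_noisy_context(relevant, distractor_pool, num_distractors, rng):
    """Mix relevant passages with sampled distractors and shuffle them.

    Each item is (text, is_relevant). The flag is kept only for analysis;
    the model sees the text alone and must learn to ignore the noise.
    """
    distractors = rng.sample(distractor_pool, num_distractors)
    passages = [(p, True) for p in relevant] + [(d, False) for d in distractors]
    rng.shuffle(passages)
    return passages


def format_prompt(question, passages):
    """Render numbered passages followed by the question (hypothetical layout)."""
    lines = [f"[{i + 1}] {text}" for i, (text, _) in enumerate(passages)]
    return "Knowledge:\n" + "\n".join(lines) + f"\nQuestion: {question}\nAnswer:"


rng = random.Random(0)  # fixed seed so the sketch is reproducible
relevant = ["The Eiffel Tower was completed in 1889."]
pool = [
    "Mount Fuji is the tallest mountain in Japan.",
    "The Nile flows north into the Mediterranean Sea.",
    "Honey bees communicate with a waggle dance.",
]
passages = build_noisy_context(relevant, pool, num_distractors=2, rng=rng)
prompt = format_prompt("When was this tower completed?", passages)
```

Shuffling matters here: if the relevant passage always appeared first, the model could learn a positional shortcut instead of learning to judge relevance.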
