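UCLA researchers developed the ConTextual dataset and leaderboard to evaluate how well large multimodal models (LMMs) perform context-sensitive reasoning over text and images.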
Introducing ConTextual: How well can your Multimodal model jointly reason over text and image in text-rich scenes?
Zero-shot image-to-text generation with BLIP-2