#syntheticdata article collection

Hugging Face Blog

Hugging Face used Mixtral-8x7B to generate Cosmopedia, a synthetic dataset of more than 30 million files and 25 billion tokens, and open-sourced it as part of an effort to reproduce Phi-1.5.

Cosmopedia: how to create large-scale synthetic data for pre-training Large Language Models

AI/ML · intermediate · 36 min read · March 20, 2024