Hugging Face introduces the RLOO algorithm, which uses 50-70% less GPU memory than PPO and trains 2-3x faster
Putting RL back in RLHF
Finetune Stable Diffusion Models with DDPO via TRL
Fine-tune Llama 2 with DPO
Llama 2 is here - get it on Hugging Face
Can foundation models label data like humans?
StackLLaMA: A hands-on guide to train LLaMA with RLHF
Fine-tuning 20B LLMs with RLHF on a 24GB consumer GPU
What Makes a Dialog Agent Useful?
Illustrating Reinforcement Learning from Human Feedback (RLHF)