Intel AI Labs combines Qwen3-4B with a Python sandbox executor and GRPO fine-tuning, cutting math-reasoning output length by 66% while improving accuracy.
DeepMath: A lightweight math reasoning Agent with smolagents
Kimina-Prover-RL
Vision Language Model Alignment in TRL ⚡️
No GPU left behind: Unlocking Efficiency with Co-located vLLM in TRL
Open-R1: Update #1
Mini-R1: Reproduce Deepseek R1 „aha moment“ a RL tutorial