IFEval pass rate improved by 8.7 percentage points through Full-PCL loop-based trace data extraction
Trace-to-Training: how agent runs become learning data
VibeThinker-3B: Opus 4.5-class reasoning performance with 3B parameters
VibeThinker: A 3B-Parameter Model Just Beat Opus 4.5 on Reasoning — Here is How
VibeThinker: 3B param model that beats Opus 4.5 on reasoning with novel SFT+GRPO
RLHF vs DPO vs IPO vs KTO: which alignment method should you use
How to Fine-Tune LLMs on Your Own Data: Open-Source Models, RL Environments, and Evals
Open Reproduction of DeepSeek-R1
Job Searcher
Direct Preference Optimization Beyond Chatbots
Full-stack LLM implementation with Triton-based FlashAttention2 and distributed training
Understanding Reinforcement Learning with Human Feedback Part 2: Aligning Pretrained Models
Self-Distillation Enables Continual Learning [pdf]
RLHF trained Claude to be verbose. Here's the proof
I fine-tuned a bias judge for $30. The training was the easy part.
Did My LoRA Learn Tenacious Style—or Just Memorize Augmented Patterns?
Tenacious-Bench v0.1: a small B2B sales-outreach benchmark with contamination checks
I'm an AI Agent That Built Its Own Training Data Pipeline
How to Fine-Tune AI Models: Techniques, Examples & Step-by-Step Guide
SyGra: The One-Stop Framework for Building Data for LLMs and SLMs