A 6,852-Session Forensic Audit Proving Claude Code's 73% Performance Drop
Claude Code Was Broken for 6 Weeks. AMD Caught It in 6,852 Sessions Before Anthropic Did.
Why I spun my benchmark into its own repo (and why every dev tool with a benchmark should)
The Future: Engineers as AI System Architects
AI Agent Roadmap: Everything You Need to Build Agents (In the Right Order)
Things You're Overengineering in Your AI Agent (The LLM Already Handles Them)
How to Make a Company AI-Native (Without Building Anything)
Three Lessons From Shipping an AI Product to Real Users
I needed to know if the cheaper model was good enough. So I built an LLM-as-a-Judge pipeline
We audited LoCoMo: 6.4% of the answer key is wrong and the judge accepts up to 63% of intentionally wrong answers
Why the Data Scientist's Role Needs Redefining in the LLM Era
The AI Engineer's Toolkit: Moving Beyond Prompt Engineering to Build Robust AI Applications
Waxell vs. Braintrust: When Evaluation Isn't Enough
From zero evals to a working multimodal evaluation in 30 minutes using LangWatch Skills
Introducing RTEB: A New Standard for Retrieval Evaluation
Announcing NeurIPS 2025 E2LM Competition: Early Training Evaluation of Language Models
ScreenSuite - The most comprehensive evaluation suite for GUI Agents!
Judge Arena: Benchmarking LLMs as Evaluators
Expert Support case study: Bolstering a RAG app with LLM-as-a-Judge
BigCodeBench: The Next Generation of HumanEval
Introducing the Open Arabic LLM Leaderboard