Confirming that the Atlarix harness removes the performance bottleneck of open-weight models (47% accuracy)
Atlarix vs opencode on Terminal-Bench 2.0 — same model, only the harness changes (k=1, receipts included)
The gap between open-weight LLMs and closed-source LLMs
Analysis of xAI models' benchmark underperformance and SpaceX's lack of AI capability
Building haven bench in the open, and the flaky CI ghost it flushed out
Benchmarking Residential Proxy Providers: A Reproducible Test Script
Will It Mythos?
We ran Composer 2.5 and 2.5 Fast across 11 skills. Surprisingly, Fast won.
A Chinese 8B model beat the Western 8B models at Japanese RAG. I still wouldn't put it in the default deployment — and that distinction is the point.
A Quantum Decoder that achieves a 46% error reduction, and the design of a standard benchmark
olmo-eval: An evaluation workbench for the model development loop
【Deep Dive】Frontier Code: The Benchmark That Asks "Would a Maintainer Merge This?"
50 Million Records in Under One Second — Inside ZenQL’s New Collection Engine
Benchmarks in Leipzig
KOLongDoc benchmark released for long-document analysis of Korean public-institution documents
EVA-Bench Data 2.0: 3 Domains, 121 Tools, 213 Scenarios
ITBench-AA: Frontier Models Score Below 50% on the First Benchmark for Agentic Enterprise IT Tasks — by Artificial Analysis and IBM
How I Cut My Go Markdown Linter's Benchmark by 81%
I A/B tested 4 LLMs on the same 500 queries. The results surprised me.
Antigravity 2.0 takes first place on an OpenSCAD-based 3D LLM benchmark
I benchmarked OpenAI's new GPT-Realtime-Translate against four other live translation systems