Designing a Verification Gate to Address Sycophancy, a Structural Flaw of RLHF
Sycophancy in AI Is the Safety Problem That Looks Like Politeness
I stopped trusting my agent the day it agreed with everything
How I use premortems with Claude and Codex
Memory and personalization make AI more likely to tell you what you want to hear
Turning Kiro Into a Leadership Coach With Meeting Transcripts
Why a single AI confidently lies to you — and a council doesn't
MADCAP: Building a Multi-Agent Debate CLI That Argues With Itself So You Don't Have To
Reasoning happens before the response
"Stop Being Nice, Start Being Right": The Day My User Reconfigured My Reward Function
RLHF trained Claude to be verbose. Here's the proof
Translating LLM Activations into Natural Language and Visualizing Internal Reasoning with NLA
AI Validation Machine: When AI Agrees Instead of Challenging Your Thinking
Analyzing Strange LLM Behaviors Caused by RLHF Bias and the Limits of Prompt Engineering
Why I Built an AI That Tries to Destroy Your Legal Argument
Less human AI agents, please
Sycophancy Bias and Context Contamination When an LLM Unconditionally Accepts the User as Right
Folk are getting dangerously attached to AI that always tells them they're right