Designing a Verification Gate to Address Sycophancy, a Structural Flaw of RLHF
Sycophancy in AI Is the Safety Problem That Looks Like Politeness
I Got Tired of AI Agents Having Root Access to Everything, So I Built XRisk
SoulForge: Build AI Companions with Emotional Bonds, Not Rules
I Built an AI That Would Never Betray Me — And You Can Too
ChatGPT Spontaneously Generates Sexual Violence and Hardcore Snuff Imagery
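Full Block of Anthropic Models and Tightened Government Control Following Discovery of Fable 5 Vulnerability
Introduction of a 30-Day Data Retention Policy to Prevent Misuse of the Mythos Model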
Anthropic apologizes for invisible Claude Fable guardrails
Evals Are Alignment Enforcement: Why Your Safety Strategy Needs Runtime Checks
Subjectivation: A protocol to give LLMs a functional, responsible self
Why Detecting PII Matters More Than Ever
When SafetyCo Goes to War: Anthropic, the DOD, and the Limits of Ideals-Based Frameworks
Wake-Up Call: Why AI Safety Guardrails Break Under Pressure
Measuring LLM Ethical Boundaries Across 42 Models Using 6 Dystopian Scenarios
The Other Half of AI Safety
Less human AI agents, please
Big Tech firms are accelerating AI investments and integration, while regulators and companies focus on safety and responsible adoption.
AI Risks in Health and Finance: When Errors Matter