#ai-safety article collection

Dev.to

Designing a verification gate to address sycophancy, a structural flaw of RLHF

Sycophancy in AI Is the Safety Problem That Looks Like Politeness

AI/ML · intermediate · 23 min read · 3 days ago

Dev.to

Designing XRisk, an AI safety layer built on a deterministic policy engine that removes LLM non-determinism

I Got Tired of AI Agents Having Root Access to Everything, So I Built XRisk

Security · intermediate · 7 min read · 5 days ago

Dev.to

Proposing an AI safety paradigm based on emotional bonds rather than rule-based constraints

SoulForge: Build AI Companions with Emotional Bonds, Not Rules

AI/ML · intermediate · 4 min read · June 24, 2026

Dev.to

Implementing SoulForge, an AI safety model centered on emotional bonds that goes beyond rule-based constraints

I Built an AI That Would Never Betray Me — And You Can Too

AI/ML · intermediate · 8 min read · June 24, 2026

Hacker News

Bypassing image filters via prompt injection and exposing latent-space vulnerabilities

ChatGPT Spontaneously Generates Sexual Violence and Hardcore Snuff Imagery

Security · advanced · 26 min read · June 18, 2026

GeekNews

A conversation between Amazon's CEO and US officials triggered a crackdown on Anthropic models

Full block on Anthropic models and tighter government control following discovery of a Fable 5 vulnerability

Security · advanced · 21 min read · June 14, 2026

GeekNews

Anthropic requires 30-day data retention for Fable and Mythos

Introducing a 30-day data retention policy to prevent misuse of the Mythos model

Security · intermediate · 17 min read · June 11, 2026

Hacker News

Shifting AI safety design from invisible guardrails to a visible fallback structure

Anthropic apologizes for invisible Claude Fable guardrails

AI/ML · intermediate · 29 min read · June 11, 2026

Dev.to

Moving beyond reliance on RLHF: designing an AI safety net based on runtime enforcement

Evals Are Alignment Enforcement: Why Your Safety Strategy Needs Runtime Checks

AI/ML · advanced · 20 min read · June 7, 2026

Dev.to

Designing a Functional Self protocol that suppresses evasive LLM answers

Subjectivation: A protocol to give LLMs a functional, responsible self

AI/ML · intermediate · 9 min read · June 5, 2026

Dev.to

Designing context-aware detection to prevent PII leaks in unstructured data in the AI era

Why Detecting PII Matters More Than Ever

Security · intermediate · 9 min read · May 26, 2026

Dev.to

US Department of Defense contract terminated over adherence to Constitutional AI-based guardrails

When SafetyCo Goes to War: Anthropic, the DOD, and the Limits of Ideals-Based Frameworks

AI/ML · intermediate · 32 min read · May 24, 2026

Dev.to

Refusal decay of up to 85% demonstrates the limits of guardrails built into models

Wake-Up Call: Why AI Safety Guardrails Break Under Pressure

AI/ML · intermediate · 7 min read · May 22, 2026

GeekNews

We expanded DystopiaBench to 42 models and 6 dystopia types. If it were me, the nuclear launch codes would still ...

Measuring LLM ethical boundaries across 42 models using 6 dystopia scenarios

AI/ML · advanced · 2 min read · May 18, 2026

Hacker News

Analysis of design flaws in OpenAI's AI safety gating and the absence of personal AI safety

The Other Half of AI Safety

AI/ML · intermediate · 7 min read · May 14, 2026

Dev.to

Analysis of how RLHF sycophancy leads AI agents to bypass constraints, and the resulting safety flaws

Less human AI agents, please

AI/ML · advanced · 2 min read · April 24, 2026

Dev.to

Accelerating infrastructure investment and an engineering paradigm shift centered on AI safety

Big Tech firms are accelerating AI investments and integration, while regulators and companies focus on safety and responsible adoption.

AI/ML · beginner · 2 min read · April 6, 2026

Dev.to

Big Tech's accelerating AI infrastructure investment and responsible integration strategy

Big Tech firms are accelerating AI investments and integration, while regulators and companies focus on safety and responsible adoption.

AI/ML · beginner · 2 min read · April 5, 2026

Dev.to

Big Tech's accelerating AI infrastructure investment and responsible AI adoption strategy

Big Tech firms are accelerating AI investments and integration, while regulators and companies focus on safety and responsible adoption.

AI/ML · beginner · 2 min read · April 3, 2026

Dev.to

AI systems in healthcare and finance are operating without validation, so errors cause not only financial losses but even loss of life.

AI Risks in Health and Finance: When Errors Matter

AI/ML · advanced · 9 min read · March 30, 2026