Preventing Recurrence and Strengthening System Reliability by Building a Blameless Postmortem Process
How to Write an Incident Postmortem That Actually Prevents Future Outages
PR Search Outage Caused by ElasticSearch Index Inconsistency, and Recovery via Reindex
The Incident Commander Role: Running Incidents Without Chaos
The Oracle MOS Shortcut: A Life-Saver for P1 Issues
Bringing more transparency to GitHub’s status page
On-Call Wellness: Protecting Your Engineers from Burnout
Post-Mortem Best Practices That Actually Drive Change
Runbook Automation: From 45-Minute Fixes to 90-Second Recoveries
Incident communication, status visibility, and SOC 2
Using Graphify to turn Incident Data into a Knowledge Graph
FireHydrant Alternative: Open Source AI Incident Management
Insufficient Automation of Manual Deployment Steps Caused a 12-Hour Claude Service Outage
Open Source Incident Management: Why It Matters
A hard-earned rule from incident retrospectives:
Sentry Has a Free API: Here's How to Use It for Error Tracking Automation
Deploy Safety: Reducing customer impact from change
Olive Young's QA Team Uses AWS Lambda with CloudWatch Logs Triggers to Automate Slack Channel Creation, Jira Ticket Creation, and On-Call Webhook Calls When an Incident Occurs
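Olive Young Builds a Fast Response Process by Documenting All Use Cases and Defining CSP So Anyone Can Declare an Incident Without Hesitation, with Communication Unified in Slack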