Designing a Post-Mortem Process That Cut the Repeat Incident Rate from 45% to 12%
Post-Mortem Best Practices That Actually Drive Change
Presentation: The Time It Wasn't DNS
Structured Logging That Actually Helps Debugging at 3 AM
Best Status Page Software in 2026: Honest Comparison for Engineering Teams
Open-source SRE methodology skills an AI agent can load. Apache-2.0, runnable offline against fixtures, no credentials.
How I Built an AI Agent That Fixes Production Errors Using Memory — And Why Memory Changes Everything
Building Trust with Product Teams as an SRE
Beyond Vibe-Coding
A hard-earned rule from incident retrospectives:
How to Write an Incident Postmortem That Actually Prevents Future Outages
PR Search Outage Caused by ElasticSearch Index Inconsistency, and Recovery via Reindex
The Incident Commander Role: Running Incidents Without Chaos
The Oracle MOS Shortcut: A Life-Saver for P1 Issues
Bringing more transparency to GitHub’s status page
On-Call Wellness: Protecting Your Engineers from Burnout
Runbook Automation: From 45-Minute Fixes to 90-Second Recoveries
Incident communication, status visibility, and SOC 2
Using Graphify to turn Incident Data into a Knowledge Graph