A Primer on the History and Evolution of Incident Management to Today

Many of the concepts SREs take for granted about incident management originated with efforts to fight fires in California in the 1970s.

JJ Tang

January 21, 2022
4 min read
A Site Reliability Engineer’s Guide to the Holiday Season

SREs face special challenges during the holidays. Here’s how to manage them.

JJ Tang

December 17, 2021
4 min read
Who Needs Site Reliability Engineers (SREs)?

Although every company can benefit from SREs, some need SREs more than others.

JJ Tang

December 3, 2021
4 min read
History of SRE: Why Google Invented the SRE Role

A history of Site Reliability Engineering from its origins at Google in 2003 to the present.

JJ Tang

November 19, 2021
5 min read
An Introduction to Incident Response Roles

Learn about the key roles within an incident response team, as well as optional incident roles you may not have thought about.

JJ Tang

October 22, 2021
5 min read
What SREs Can Learn from Facebook’s Largest Outage

An SRE’s analysis of the October 2021 Facebook outage.

JJ Tang

October 8, 2021
5 min read
What is an SRE?

A comprehensive definition of SREs and Site Reliability Engineering, including what SREs do and what makes SREs different from other roles.

JJ Tang

September 9, 2021
5 min read
Making Your On-call and Incident Management Program Stick

Maintaining your incident management practice is as important as creating it. Find out what you can do to keep your engineering organization strong and consistent year over year.

JJ Tang

August 20, 2021
5 min read
How to Improve Upon Google’s Four Golden Signals of Monitoring

The Four Golden Signals of monitoring and observability get a lot of things right. But they could be even better.

JJ Tang

August 13, 2021
5 min read
The Unique Reliability Engineering Requirements of Microservices

Although the fundamental concepts of site reliability engineering are the same in any environment, SREs must adapt practices to different technologies, like microservices.

JJ Tang

July 30, 2021
5 min read
De-Siloing Incident Management: How to Make Reliability Engineering Everyone’s Job

Four best practices for breaking down silos and establishing a culture of shared responsibility for reliability.

JJ Tang

July 15, 2021
5 min read
The Incident Review: 4 Incidents in Outer Space

From network problems to computer failures, a variety of incidents can disrupt operations for systems in outer space.

JJ Tang

July 6, 2021
4 min read