12 votes

~comp book club: Site Reliability Engineering, chapter 1

Posted September 6, 2018 by spit-evil-olive-tips

Tags: book club, site reliability engineering

For the first book of this book club, I chose Google's book on Site Reliability Engineering, because it's freely available online, so we can get started without waiting for any dead-tree books to arrive. For the next book, I'll plan on a few weeks of lead time so we can decide on a book and then everyone can get it from whatever bookstore or library they use.

I'm going to aim to do this every Thursday, 1 chapter per week (or maybe 2, if they're short?)

Today's reading: Preface and chapter 1 (full list of chapters here)

1 comment

spit-evil-olive-tips (OP)
September 6, 2018

I love the analogy they give in the preface:

Software engineering has this in common with having children: the labor before the birth is painful and difficult, but the labor after the birth is where you actually spend most of your effort. Yet software engineering as a discipline spends much more time talking about the first period as opposed to the second

I've worked with a lot of developers in my career who will work on a project up until the "birth", then get bored and move to a different company, or a different team at a large company. Lather, rinse, and repeat. I find they very often have blind spots related to how the system will actually work when run in production long-term - things like how the system will be monitored to alert someone if it's broken in some way.

The other concept I love is that of the "error budget" from chapter 1. If you have a target of 99.9% uptime, you can flip that around and say you have a budget for errors/downtime of 0.1%. Then you can decide how you want to spend that budget:

The use of an error budget resolves the structural conflict of incentives between development and SRE. SRE’s goal is no longer "zero outages"; rather, SREs and product developers aim to spend the error budget getting maximum feature velocity.
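
To make the arithmetic concrete, here's a rough sketch of turning an uptime target into a downtime allowance. The function name and the 30-day window are my own assumptions for illustration, not something the book prescribes; it only handles time-based availability, not request-based error rates.

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Return the allowed downtime, in minutes, for an availability target.

    slo_percent: availability target as a percentage, e.g. 99.9
    window_days: length of the measurement window (assumed 30 days here)
    """
    if not 0.0 <= slo_percent <= 100.0:
        raise ValueError("slo_percent must be between 0 and 100")
    # Flip the target around: whatever is not uptime is the budget.
    budget_fraction = (100.0 - slo_percent) / 100.0
    minutes_in_window = window_days * 24 * 60
    return budget_fraction * minutes_in_window


if __name__ == "__main__":
    for slo in (99.0, 99.9, 99.99):
        minutes = error_budget_minutes(slo)
        print(f"{slo}% uptime over 30 days -> {minutes:.1f} minutes of budget")
```

So a 99.9% target over a 30-day window works out to roughly 43 minutes of downtime you're allowed to "spend" on risky launches, experiments, or plain old outages.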

2 votes