~comp book club: Site Reliability Engineering, chapter 1
For the first book of this book club, I chose Google's book on Site Reliability Engineering, because it's freely available online, so we can get started without waiting for any dead-tree books to arrive. For the next book, I'll plan on a few weeks of lead time so we can decide on a book and then acquire it from whatever bookstore or library you use.
I'm going to aim to do this every Thursday, one chapter per week (or maybe two, if they're short?).
Today's reading: Preface and chapter 1 (full list of chapters here)
I love the analogy they give in the preface:
I've worked with a lot of developers in my career who will work on a project up until the "birth", then get bored and move to a different company, or a different team at a large company. Lather, rinse, and repeat. I find they very often have blind spots related to how the system will actually work when run in production long-term - things like how the system will be monitored to alert someone if it's broken in some way.
The other concept I love is that of the "error budget" from chapter 1. If you have a target of 99.9% uptime, you can flip that around and say you have a budget for errors/downtime of 0.1%. Then you can decide how you want to spend that budget: