Big ideas
- SLOs should be customer focused!
- Eg Loading the home page is fast
- It’s nice to be able to tell how individual customers are doing vs system-level aggregate metrics
- The tradeoff between reliability work and product work is based on an agreement among stakeholders about how important a service is
Todos
- Read about burn rates in the Google SRE book
Notes
- SLOs for the new dog owner
- Objectives were set for a timid little new friend
- Bit of an iterative process … had to start small and be realistic given where Misha was (couldn’t do what owner ultimately wanted just yet)
- Had to be comfortable with a starting baseline that included some freak-out moments (punching through the set SLO), so the alert threshold was set higher than the ideal
- Set up a plan to improve the baseline in small increments
- Had to step back from experiments along the way because Misha wasn’t ready for the step size
- Some SLOs are temporary
- Sometimes SLOs change over time as the system changes
- Speaker: Isobel Redelmeier, SRE @ Discord
- Evaluating SLOs at scale
- SLOs are customer focused, eg
- Home page loads quickly
- Customer data ingests quickly
- User-run queries are fast
- Talked about a couple of different re-architecture decisions that help them deal with fast scaling
- Sometimes you make a choice that helps you today but has a short shelf life (eg 6 or 12 months) while you work on a longer term fix
- Short term fixes are often fine. Building for a system size too far in the future can be wasteful
- Speaker: Liz Fong-Jones, Dev Advocate @ Honeycomb
- Reliability lessons from the back of a bicycle
- Good SLIs are specific and can be evaluated for an event with a simple true / false answer
- eg All trips 5 miles or less from their home should be made on a bicycle
- eg the SLO for the above is that 80% of trips in the last 7 days were made on a bicycle
- Good SLOs are attainable / achievable. They’re comfortably within the realm of the possible
- You shouldn’t have a lot of SLOs. A few is usually enough for the system
- This was a fun talk that started with covid and a desire to limit the carbon footprint of their household
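To make the bicycle SLI / SLO above concrete for myself, here is a minimal Python sketch. The `Trip` record, the `bicycle_slo_compliance` helper, and the sample trips are all my own made-up illustration, not something shown in the talk. The SLI is the true / false question per trip ("was this eligible trip made on a bicycle?"), and the SLO check is whether at least 80% of eligible trips in the last 7 days pass.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Trip:
    """Hypothetical trip record, invented for this sketch."""
    day: date
    miles: float
    by_bicycle: bool


def bicycle_slo_compliance(trips, today, window_days=7, max_miles=5.0):
    """Fraction of eligible trips in the window that were made by bicycle.

    A trip is eligible if it happened in the last `window_days` days and was
    `max_miles` or shorter. Returns None when there are no eligible trips.
    """
    window_start = today - timedelta(days=window_days)
    eligible = [
        t for t in trips
        if window_start < t.day <= today and t.miles <= max_miles
    ]
    if not eligible:
        return None
    good = sum(1 for t in eligible if t.by_bicycle)  # SLI: true / false per trip
    return good / len(eligible)


SLO_TARGET = 0.80  # 80% of eligible trips, from the talk

today = date(2021, 5, 17)  # arbitrary date for the example
trips = [
    Trip(today - timedelta(days=1), 2.0, True),
    Trip(today - timedelta(days=2), 4.5, True),
    Trip(today - timedelta(days=3), 3.0, False),
    Trip(today - timedelta(days=4), 1.0, True),
    Trip(today - timedelta(days=5), 12.0, False),  # over 5 miles, not eligible
    Trip(today - timedelta(days=9), 2.0, False),   # outside the 7 day window
]

compliance = bicycle_slo_compliance(trips, today)
meets_slo = compliance is not None and compliance >= SLO_TARGET
print(f"compliance={compliance}, meets_slo={meets_slo}")
```

With this sample data 3 of the 4 eligible trips were by bicycle, so the week comes in at 75% and misses the 80% objective.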
- Michael Ericksen @ Intelligent Medical Objects
- I’m pretty sure I’ve heard this fellow speak before. I like him muchly :)
- Customer centric SLOs
- System wide SLOs hide how customers are actually experiencing your system
- Aggregates can look fine when there are customers in pain
- Had customer facing SLOs similar to Honeycomb’s as described by Liz
- Fast ingest (within a minute)
- No metrics / logs dropped
- Collectors able to register quickly
- Responsiveness of the UI and querying
- Every week they get together and look at the 10 customers with the worst experience on the platform and think about how things can be improved for them
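A rough Python sketch of why the aggregate can hide customers in pain, and of the "worst 10 customers" review described above. The event shape (`(customer_id, ok)` pairs), the function names, and the sample numbers are assumptions of mine for illustration; the talk did not show an implementation.

```python
from collections import defaultdict


def success_rates(events):
    """Compute the aggregate and per-customer success rates.

    `events` is an iterable of (customer_id, ok) pairs, where `ok` is True
    when the request met the SLI. This shape is an assumption for the sketch.
    """
    good = defaultdict(int)
    total = defaultdict(int)
    for customer, ok in events:
        total[customer] += 1
        if ok:
            good[customer] += 1
    if not total:
        raise ValueError("no events to evaluate")
    per_customer = {c: good[c] / total[c] for c in total}
    aggregate = sum(good.values()) / sum(total.values())
    return aggregate, per_customer


def worst_customers(per_customer, n=10):
    """Return the n customers with the lowest success rate, worst first."""
    return sorted(per_customer.items(), key=lambda item: item[1])[:n]


# Made-up traffic: one large healthy customer and one small unhappy one.
events = (
    [("big-co", True)] * 9900 + [("big-co", False)] * 10
    + [("small-co", True)] * 50 + [("small-co", False)] * 40
)

aggregate, per_customer = success_rates(events)
print(f"aggregate success rate: {aggregate:.3f}")
for customer, rate in worst_customers(per_customer, n=10):
    print(f"{customer}: {rate:.3f}")
```

In this sample the aggregate sits above 99% while the small customer is succeeding on only a little over half of their requests, which is exactly the kind of thing a weekly worst-customers review would surface.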
- Stefan Zier @ Sumo Logic
- SLOs for everyone
- If you’re measuring your customer experience and trying to make that better you’re basically doing SLOs
- He comes from a service industry background (restaurant server I think) and relates customer happiness, or the lack of it, to how well the business is doing
- Told a story about a time when the different roles in the restaurant (service, kitchen, bussing, etc) stopped working together well and how that impacted the overall customer experience
- Alex Hidalgo @ Nobl9
- How to answer tricky questions the SRE way
- An SLO is an agreement among stakeholders about the availability / reliability needed for a service
- Jakub Warczarek @ ?
- Ensuring reliability with burn rates
- I need to review my burn rate math :)
- There’s apparently more on this topic in the Google SRE book
- When do you alert when you’re working with error budgets? Static thresholds? Maybe there’s a place for those, but burn rates seem like the more interesting approach
- Want to know when the error rate in the system is going to put you at risk of burning through your error budget inside a rolling SLO calculation window
- 2 kinds of burn rate window
- Long (5-7 hours): looks for long, slow burns in recent time that will eventually knock you out of your error budget tolerance
- Short (5 minutes): catches really fast drop-offs in what users consider good service
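A small Python sketch to help with reviewing the burn rate math. It uses the common definition of burn rate as the observed error rate divided by the error budget (1 minus the SLO target), so a burn rate of 1 uses up the budget exactly over the SLO window. The 99.9% target, the 30 day window, the event counts, and both alert thresholds are illustrative assumptions of mine, not numbers from the talk.

```python
def burn_rate(bad_events, total_events, slo_target):
    """How fast the error budget is being consumed in a measurement window.

    1.0 means the budget would be used up exactly at the end of the SLO
    window; higher values mean it runs out sooner.
    """
    if total_events == 0:
        return 0.0
    error_rate = bad_events / total_events
    error_budget = 1.0 - slo_target
    return error_rate / error_budget


def hours_until_budget_exhausted(rate, slo_window_hours, budget_remaining=1.0):
    """Hours left before the remaining budget is gone at the given burn rate."""
    if rate <= 0:
        return float("inf")
    return budget_remaining * slo_window_hours / rate


# Assumed values for the sketch.
SLO_TARGET = 0.999            # 99.9% of requests succeed
SLO_WINDOW_HOURS = 30 * 24    # rolling 30 day SLO window
LONG_THRESHOLD = 6.0          # slow-burn alert threshold (assumed)
SHORT_THRESHOLD = 14.4        # fast-burn alert threshold (assumed)

# Long window (about 6 hours of traffic): 1,600 bad out of 200,000 requests.
long_rate = burn_rate(1_600, 200_000, SLO_TARGET)
# Short window (5 minutes of traffic): 30 bad out of 1,500 requests.
short_rate = burn_rate(30, 1_500, SLO_TARGET)

slow_burn_alert = long_rate >= LONG_THRESHOLD
fast_burn_alert = short_rate >= SHORT_THRESHOLD
hours_left = hours_until_budget_exhausted(long_rate, SLO_WINDOW_HOURS)

print(f"long window burn rate:  {long_rate:.1f} (alert={slow_burn_alert})")
print(f"short window burn rate: {short_rate:.1f} (alert={fast_burn_alert})")
print(f"hours until budget is gone at the long-window rate: {hours_left:.0f}")
```

The long window catches the slow leak that would drain a full budget in a few days, and the short window catches the sudden drop. Picking real thresholds and window lengths is the part I still need to read up on.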
- Ajuna Kyaruzi @ Datadog