Big ideas

  • SLOs should be customer focused!
    • Eg Loading the home page is fast
    • It’s nice to be able to tell how individual customers are doing vs system-level aggregate metrics
  • Tradeoff between reliability work vs product work based on agreement between stakeholders about how important a service is thought to be
    • slo is set based on this

Todos

  • Read about burn rates in the google sre book

Notes

  • SLOs for the new dog owner
    • Objectives were set for a timid little new friend
    • Bit of an iterative process … had to start small and be realistic given where Misha was (couldn’t do what owner ultimately wanted just yet)
    • Had to be a bit comfortable with a starting baseline that included some freak out moments (punching through the set SLO). So alert threshold set higher than the ideal
    • Setup a plan to improve the baseline in small increments
    • Had to step back from experiments along the way because Misha wasn’t ready for the step size
    • Some SLOs are temporary
    • Sometimes SLOs change over time as the system changes
    • Speaker: Isobel Redelmeier, SRE @ Dischord
  • Evaluating SLOs at scale
    • SLOs are customer focused, eg
      • Home page loads quickly
      • Customer data ingests quickly
      • User run queries are fast
    • Talked about a couple of different re-architecture decisions that help them deal with fast scaling
    • Sometimes you make a choice that helps you today but has a not-long shelf life (eg 6 or 12 months) while you work on a longer term fix
    • Short term fixes are often fine. Building for system size too far out can be waste
    • Speaker: Liz-Fong Jones, Dev Advocate @ Honeycomb
  • Reliability lessons from the back of a bicycle
    • Good slis are specific and can be evalated for an event with a simple true / false type answer
      • eg All trips 5 miles or less from their home should be made on a bicycle
      • eg slo For above is 80% of trips in the last 7 days were made on a bicycle
    • Good slos are attainanble / achievable. There comfortably within the realm of the possible
    • You shouldn’t have a lot of slos. A few is usually enough for the system
    • This was a fun talk that started with covid and a desire to limit the carbon footprint of their household
    • Michael Ericksen @ Intelligent Medical Objects
    • I’m pretty sure I’ve heard this fellow speak before. I like him muchly :)
  • Customer centric slos
    • System wide slos hide how customers are actually experiencing your system
      • aggregates
      • can look fine when there are customers in pain
    • Had customer facing slos similar to honeycomb’s as described by Liz
      • Fast ingest (within a minute)
      • No metrics / logs dropped
      • Collectors able to register quickly
      • Responsiveness of the ui and querying
    • Every week they get together and look at the 10 customers with the worst experience on the platform and think about how things can be improved for them
    • Stefan Zier @ Sumo logic
  • SLOs for everyone
    • If you’re measuring your customer experience and trying to make that better you’re basically doing slos
    • He comes from a service industry background (restaurant server I think) and relates that lack or not of customer happiness to how well the business is doing
    • Told a story about a time where the different roles in the restaurant (service, kitchen, bussing, etc) stopped working together well and how that impacted the overall customer experience
    • Alex Hidalgo @ Nobl9
  • How to answer tricky questions the sre way
    • An slo is an agreement among stakeholders about the availability / reliability needed for a service
    • Jakub Warczarek @ ?
  • Ensuring reliability with burn rates
    • I need to review my burn rate math :)
    • There’s apparently more on this topic in the google sre book
    • When do you alert when you’re working with error budgets? Static thresholds? Maybe there’s something like that in there somewhere but it seems like burn rates are really interesting
    • Want to know when the error rate in the system is going to put you at risk of burning through your error budget inside a rolling slo calc window
    • 2 kinds
      • long 5-7 hours looks for long, slow burns in recent time that will eventually knock you our of your error budget tolerance
      • short 5 minutes for really fast drop-offs in what is considered good service by users
    • Ajuna Kyaruzi @ Datadog