• Building the future of resilient tech: Lessons from two decades in SRE: Test with tools like JMeter, ApacheBench, and fault injectors. Also think about timeouts: the first general rule-of-thumb guidance I’ve ever seen is to set the timeout to 150% of the 95th percentile latency. (The 95th percentile lives again.)
  • Achieving relentless Kafka reliability at scale with the Streaming Platform: Kafka scale at Datadog is ludicrous. They’ve had to customize stream processing and modify constraints inside Kafka to handle the large volumes of data flowing through their Kafka fleet.
  • Value chains – A method for creating and balancing faucet-and-drain game economies: Brilliant piece about player motivation in games. Probably has broader applications. If a person understands that there’s a high-value end goal in sight (e.g., self-expression, in the article’s example), they will tolerate some drudgery to get there.
  • Dividing Python Code Into Functions: 5 guidelines:
    • Code that appears in more than one place should be in a function
    • A function should be less than 30 lines long
    • A function should have 3 or fewer levels of conditional control
    • Create a function when it improves readability
    • Do one thing
  • Populating the page: how browsers work: What happens in the browser after you type a URL into the address bar and hit enter. I love this particular story. Touches on so many interesting things.
  • Active Benchmarking: When you are trying to understand how a system behaves under load, it is good to apply load over an extended period of time while observing the constituent parts of the system. Brendan calls this active benchmarking. Passive benchmarking involves kicking off a test, waiting for results, and then looking only at the final outputs. You often miss a lot of context in between.
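
The timeout rule of thumb from the first item above (150% of the 95th percentile) is easy to try against your own latency samples. This is a minimal sketch, not anything from the talk: the function names, the nearest-rank percentile method, and the sample numbers are all my own assumptions.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers (assumed method)."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    # Nearest-rank: the smallest value with at least pct% of samples at or below it.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def suggested_timeout(latencies_ms, pct=95, factor=1.5):
    """Rule of thumb: timeout = 150% of the 95th percentile latency."""
    return factor * percentile(latencies_ms, pct)


if __name__ == "__main__":
    # Hypothetical latency samples in milliseconds.
    observed = [120, 80, 100, 95, 300, 110, 105, 90, 85, 130]
    print(suggested_timeout(observed))
```

With only a handful of samples the 95th percentile collapses to the maximum, so this is only meaningful when fed a realistic volume of measurements, ideally gathered while the system is under load.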

srecon2025

  • SREcon25 Americas - One Million Builds per Year, Only One Page - Operating Internal Services…: Liked this talk a lot. How to get to a small, sustainable ops team that is empathetic to the teams it serves.
    • Leaned into the cloud platform’s managed services (e.g., the database)
    • Talked to customers and the business about their needs for escalation and about when work should happen
    • Automation
      • The ops team started out doing things outside work hours, but that team churned because having to work really early or late isn’t what people signed up for
      • Auto upgrades: where it was reasonable, just take the latest version and let the regular processes / tools verify that there are no problems simply by running. (They protected the processes / tools that were more sensitive to failure, but most things are not like that. I’d lump our release mgmt process stuff in there. Ansible and Terraform are things that could probably work nicely in that model.)
  • SREcon25 Americas - OLTP SQL Database Query Tracing and Linting: They have a tool that scores queries during development, looking for potentially concerning ones. Covers both proactive (detect and review bad queries) and reactive (monitoring) practices.
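
To make the idea of scoring queries during development concrete, here is a toy sketch. It is not the tool from the talk: the rule names, weights, and regex heuristics are hypothetical and only illustrate the general shape of a query linter.

```python
import re

# Hypothetical rules: (name, weight, pattern). Higher weight = more concerning.
RULES = [
    ("select-star", 1, re.compile(r"\bselect\s+\*", re.I)),
    ("leading-wildcard-like", 2, re.compile(r"\blike\s+'%", re.I)),
    # UPDATE or DELETE with no WHERE clause anywhere in the statement.
    ("write-without-where", 5,
     re.compile(r"^\s*(update|delete)\b(?!.*\bwhere\b)", re.I | re.S)),
]


def score_query(sql):
    """Return (total_score, names_of_matched_rules) for one SQL statement."""
    hits = [(name, weight) for name, weight, pattern in RULES
            if pattern.search(sql)]
    return sum(weight for _, weight in hits), [name for name, _ in hits]


if __name__ == "__main__":
    for sql in (
        "SELECT * FROM orders WHERE note LIKE '%late%'",
        "DELETE FROM orders",
        "SELECT id FROM orders WHERE id = 1",
    ):
        print(score_query(sql), sql)
```

Regexes are a crude stand-in for real SQL parsing, but a score like this could gate a review step, which is the proactive half; the reactive half would still come from monitoring.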