• On-call with Jérôme Petazzoni: Most recent memory I guess of how Jerome would do on-call:
    • He’d be the primary on-call 2 days every week or so
    • Pair rotation. There’d be a junior and senior team member on-call at once
    • Junior would try to tackle problems during the day, Senior would handle everything else
    • Pages would be acknowledged within 15-30m
    • If a fix would take upwards of an hour, he’d get other team members involved otherwise he’s try to handle an incident on his own (This sounds high. I probably missed something here.)
    • Table top exercises: reboot a server, fill up a vm disk, change a public ip (Nasty lol)
  • Dev tools time: Suz Hinton: Loved this one. Tools Suz uses a lot. I should check them out:
  • Excellent interview with Neil Gaiman from August 2022: Talked about Sandman the Netflix series and Good Omens 2 (The first series on Amazon Prime Video was so good :))
  • NYT platform team sends alerts to users of a multi-tenant kafka service: When a team needing kafka onboarded they would choose slack channels to send alerts to, criticality of alerts for a particular consumer, the threshold they’d want to be alerted / paged at (consumer lag in terms of the number of messages on a kafka topic not yet processed as a key metric), there was also a neat flow diagram:
    • Google form
    • Gapp script that pushes form data to a backend service on submit
    • Create a pull request on infra repo
    • Platform team member reviews pr
    • Accepts and merge to a prod branch in the infra repo
    • Terraform pushes out the change (In this case the change is an auto-generated description of datadog monitor config setting up slack notification rules based on gform data)