Weekly notes: Htmx looks interesting to me as a challenge to our trend towards complexity in the web frontend
- How we tracked down a Go 1.24 memory regression across hundreds of pods: Datadog investigates a memory usage change between Go 1.23 and 1.24 for a particular workload of theirs and discovers a regression in Go's memory allocator introduced by a refactor. Neat bit of sleuthing: they noticed a ~20% difference between virtual and RSS memory across releases, which adds up to a lot in a large service with many task / pod instances. (A quick way to check those numbers yourself is sketched after this list.)
- Terraform Stacks, explained: This seems like something we've done manually in the past: bringing together different layers in our tf code (e.g. base, network, app1, app2) and operating on them together. Also includes concepts like cell and deployment, I think. Worth understanding better since it may lessen the amount of boilerplate code we need.
- How to create software quality: I liked this framing of quality: the software does what users expect, can be reasonably changed, and has good stories around performance, security, operability, and other non-functional requirements. Also interesting ideas about essential, accidental, and scale complexity as drivers of practices that improve safety and speed of delivery.
- Bookmarkable by design: url-driven state in htmx: Packing state into the URL where possible for htmx apps (see the second sketch after this list).
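
On the memory note above: a quick way to eyeball the virtual-vs-RSS gap on Linux is to read /proc/self/status from inside the process. A minimal sketch (Linux-only; not from the Datadog post, just a way to see the two numbers they compared):

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// Print the process's virtual size (VmSize) and resident set (VmRSS)
// from /proc/self/status. A widening gap between the two across Go
// releases was the signal in the Datadog investigation.
func main() {
	f, err := os.Open("/proc/self/status")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	s := bufio.NewScanner(f)
	for s.Scan() {
		line := s.Text()
		if strings.HasPrefix(line, "VmSize:") || strings.HasPrefix(line, "VmRSS:") {
			fmt.Println(line)
		}
	}
}
```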
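
And on the htmx note: a minimal sketch of URL-driven state, with a Go handler re-deriving the view from the query string and htmx's hx-push-url keeping the address bar bookmarkable. The routes and the "filter" parameter here are made up for illustration:

```go
package main

import (
	"fmt"
	"net/http"
)

// State (the current filter) lives in the URL's query string, so every
// view is bookmarkable. hx-push-url="true" tells htmx to write the
// request URL into the address bar on each swap.
func main() {
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, `<html><body>
  <button hx-get="/items?filter=open"   hx-target="#list" hx-push-url="true">Open</button>
  <button hx-get="/items?filter=closed" hx-target="#list" hx-push-url="true">Closed</button>
  <div id="list"></div>
  <script src="https://unpkg.com/htmx.org@1.9.12"></script>
</body></html>`)
	})
	http.HandleFunc("/items", func(w http.ResponseWriter, r *http.Request) {
		// Re-derive the view from the URL, not from server-side session state.
		// A fuller version would also render the full page when the pushed
		// URL is loaded directly from a bookmark.
		fmt.Fprintf(w, "<ul><li>%s items</li></ul>", r.URL.Query().Get("filter"))
	})
	http.ListenAndServe(":8080", nil)
}
```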
Alert game days
This is a neat idea …
One successful practice involves setting aside time for scenario-based alert testing, often referred to as Game Days. During these sessions, teams gather to brainstorm three plausible failure scenarios for their service. For each one, they define which alerts should fire, simulate the failure (like network interruptions or service crashes), and then observe what actually happens.
The results are rarely perfect. Sometimes expected alerts don’t fire at all; other times, too many alerts go off. These sessions help teams identify which signals are actually useful and which need tuning or removal. The emphasis is on clarity and actionability: alerts should be easy to notice and tied to a clear response. Over time, the exercise helps engineers sharpen their intuition around what makes a good alert and surfaces blind spots in existing coverage.
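
One way to make the "observe what actually happens" step concrete is to poll Alertmanager's v2 API after injecting the failure and diff the firing alerts against the team's predictions. A rough sketch, assuming a reachable Alertmanager; the URL and expected alert names are placeholders:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// After simulating the failure, fetch currently active alerts from
// Alertmanager and report which of the predicted alerts actually fired.
func main() {
	expected := []string{"HighErrorRate", "PodCrashLooping", "UpstreamTimeout"}

	resp, err := http.Get("http://alertmanager:9093/api/v2/alerts?active=true")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var alerts []struct {
		Labels map[string]string `json:"labels"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&alerts); err != nil {
		panic(err)
	}

	firing := map[string]bool{}
	for _, a := range alerts {
		firing[a.Labels["alertname"]] = true
	}
	for _, name := range expected {
		fmt.Printf("%-20s fired=%v\n", name, firing[name])
	}
	// The gap in either direction is the interesting output: predicted
	// alerts that stayed silent, and firing alerts nobody predicted.
}
```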