- Dyninno’s Incident Management: an Introduction: Improving incident management is a cultural and business problem. You have to get leaders to agree it’s worth it and to invest the team’s energy in it. Somebody should own both the improvement work and the ongoing care it needs. Dyninno has a dedicated team for this. Incident response is about two things: detecting and fixing issues effectively (turn the lights back on), and keeping the process healthy and followed so that learnings actually turn into changes (transparency and improvement).
- Streamlining and Implementing Incident Management at Dyninno: Their ticketing system became the central hub that drives incident response. Severity is determined by the number of impacted customers and is captured in a ticket when an issue is detected (by customer complaint or by the monitoring platform). Other automation looks for new tickets and their severities and triggers internal instant messaging flows according to the severity scale. Still more tools watch for updates the commander makes to the incident ticket while it is being worked. I have never thought of the ticketing system as being the hub, but it’s a neat idea. The issue tracker does need a reasonable API, I guess, but this seems to be less of an issue.
- SLA vs. SLO vs. SLI: What’s the Difference?: Nice treatment of SLIs, SLOs and SLAs.
- Bloom Filters: Loved this article about how bloom filters work! The structure and pacing were great as new ideas were introduced. A bloom filter is a collection of bits you hash content into, which you can then ask whether an item is present in a set or not. It can say no with confidence, but the yes side is actually a maybe. False positives can be minimized to a point where they are acceptable given the savings in storage (e.g. 1 in 1,000, 10,000 or 1,000,000 wrong answers in exchange for saving megabytes or gigabytes of RAM). Chrome had a malicious website db built this way until 2012. They’re not always appropriate but what a neat idea!
- Example 1: Akamai, who use them to avoid caching web pages that are accessed once and never again. They do this by storing all page accesses in a bloom filter, and only writing them into cache if the bloom filter says they’ve been seen before. This does result in some pages being cached on the first access, but that’s fine because it’s still an improvement. It would be impractical for them to store all page accesses in a Set, so they accept the small false-positive rate in favour of the significantly smaller bloom filter. Akamai released a paper about this that goes into the full details if you’re interested.
- Example 2: Google’s BigTable is a distributed key-value store, and uses bloom filters internally to know what keys are stored within. When a read request for a key comes in, a bloom filter in memory is first checked to see if the key is in the database. If not, BigTable can respond with “not found” without ever needing to read from disk. Sometimes the bloom filter will say a key might be in the database when it isn’t, but this is fine because when that happens a disk access will confirm the key in fact isn’t in the database.
- Periodic Face-to-Face: The debate over remote work, how people work together, and the kinds of things we need as humans rages on.
- All you need is Wide Events, not “Metrics, Logs and Traces”: Structured logging and wide events are cool. If you collect data like this there are analysis tools that just light up!
- UML 2 Tutorial - State Machine Diagram: Nice tutorial about state diagrams.
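
To make the bloom filter notes above concrete, here is a minimal Python sketch of the idea: hash each item to a few bit positions, set those bits on insert, and check them on lookup. The class name `BloomFilter`, the default sizes, and the use of salted SHA-256 for the hash positions are all assumptions for illustration, not details from the article or from any of the systems mentioned; production implementations likely use faster non-cryptographic hashes and size the bit array from a target false-positive rate.

```python
import hashlib


class BloomFilter:
    """Tiny bloom filter: a fixed bit array plus k hash positions per item."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        # num_bits and num_hashes are illustrative defaults, not tuned values.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive k positions by salting one hash function with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        # Set every bit this item hashes to.
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # False means "definitely never added"; True only means "maybe".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


seen = BloomFilter()
seen.add("example.com/a")
print(seen.might_contain("example.com/a"))  # True: added items are never reported absent
print(seen.might_contain("example.com/b"))  # usually False, but a false positive is possible
```

The whole filter here is 128 bytes no matter how many items are added; the cost of adding more items is a rising false-positive rate rather than more memory, which is the trade-off both examples above rely on.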
videos