To observability lasanga from dashboard soup

  • How the should platform
    • Here we get together and talk about invariants in terms of system behaviour
  • Prove it does that by applying load
    • Ask questions about how the system is working
  • Measure that it is doing what you think it should
    • Dashboards are important but many are 1*off ones built after an incident
    • They should be thought of as something that provides a feeling to help debug an issue in production

Obervability Lasagna

Overview dashboard has high level metrics of infra, services, alerts, alert routing, etc. They’re a bunch of jumping off points to more detailed dashboards.

Structured Events

An example of a structured event.

Tips

  • Key metric observed user response time for a request
  • Connect metrics and logs through structured events
  • Always log event during panics / crashes
  • Visualize service limits
  • Do game days that use the observability tooling to look into an issue (they do quarterly)

Source