Building production-ready prototypes
- Questions to ask when prototyping
- Who is the customer?
- What is the problem?
- What is the most important customer benefit?
- How do we know what the customer needs?
- What do we want the experience to look like?
- These questions feed into 2 docs: a press release (written up front, as if they’ve already done the thing …) and an FAQ
- Don’t skimp on security!!
- Big perf bump going from Graviton -> Graviton2
- Much better perf/watt (Better for environment + sustainability)
- This is a really neat Arm processor. Goes through stats
- 4x as many cores (64 vCPUs)
- Coherent caches
- Per-vCPU caches (64 KB L1, 1 MB L2)
- Fast interconnect within the processor (2 TB/s bisection bandwidth)
- 2nd half was a case study from a team at Amazon (the e-commerce side) that migrated to Graviton2.
- Bit of pain here, I think, because they were going from Intel -> Arm on an older stack with dependencies that weren’t Arm compatible
- Had to choose between upgrading dependencies and dropping/replacing them, all while preserving backwards compatibility! (They were a platform team with services used by many other teams @ Amazon, i.e. you can’t just go deleting things …)
Beyond five 9s: Lessons from our highest available data planes
- “Quality is a habit not an act” - Aristotle
- “Culture eats strategy for breakfast” - Peter Drucker
- Teams that do this best care about details. They figure out how to test as many different aspects of the system as they can.
- “New code means risk”. Paranoid about testing before it hits production
- AWS does continuous delivery (NOT deployment. There’s a manual deploy step as of 2021.)
- Progressive deployment (rough sketch below). First deploy to:
- Pre-prod
- One box
- One AZ
- One region
- etc
- Everyone at AWS deploys this way!
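- A rough sketch of the wave ordering and bake-check idea (not AWS’s actual pipeline tooling; all helper names are hypothetical placeholders):

```python
# Hypothetical sketch of progressive deployment; real pipelines model this
# declaratively with bake times, alarm-driven rollback, per-region waves, etc.
WAVES = ["pre-prod", "one-box", "one-az", "one-region", "remaining-regions"]

def deploy(build: str, wave: str) -> None:
    print(f"deploying {build} to {wave}")       # placeholder for the real deploy step

def alarms_healthy(wave: str) -> bool:
    return True                                 # placeholder for a bake-period alarm check

def roll_back(build: str, wave: str) -> None:
    print(f"rolling back {build} in {wave}")    # placeholder for automated rollback

def progressive_deploy(build: str) -> None:
    for wave in WAVES:
        deploy(build, wave)
        if not alarms_healthy(wave):            # stop early so later waves never see a bad build
            roll_back(build, wave)
            raise RuntimeError(f"deployment halted at wave {wave!r}")

progressive_deploy("build-1234")
```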
- Before deploy
- Code review
- Check in / merge to prod
- Cattle not pets
- Very little system-level access to a VM (shell access is very limited, if it exists at all)
- Access via bastions where available (logged, with notifications), but some systems don’t even have that
- No root. Changes can’t be made to infra outside tooling and the standard SDLC
- Blast radius: region, AZ, cell (if there is trouble in a cell, only certain customers experience the pain)
- Shuffle sharding
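- A minimal sketch of the shuffle sharding idea (hypothetical names, not AWS’s implementation): each customer is deterministically hashed to a small pseudo-random subset of workers, so two customers rarely land on the identical subset and a poison workload only hurts the workers it overlaps.

```python
import hashlib
import random

def shard_for_customer(customer_id: str, workers: list[str], shard_size: int = 2) -> list[str]:
    """Deterministically pick a small pseudo-random subset of workers for a customer."""
    # Seed a PRNG from the customer id so the assignment is stable across calls.
    seed = int.from_bytes(hashlib.sha256(customer_id.encode()).digest()[:8], "big")
    return random.Random(seed).sample(workers, shard_size)

workers = [f"worker-{i}" for i in range(8)]
print(shard_for_customer("customer-a", workers))  # e.g. ['worker-3', 'worker-6']
print(shard_for_customer("customer-b", workers))  # very likely a different subset
```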
- 3 strategies for defending a system under high load
- Disable non-critical functionality
- Shed load
- This kicks in when a system is receiving more work than it’s spec’d to handle
- Limits are predefined thresholds
- “Systems operate in known and safe limits”
- Admission control (sketch after this list)
- Serve stale results
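- A minimal sketch of the shed load / admission control idea above, assuming the predefined limit is a simple concurrency cap; the handler and fallback functions are hypothetical placeholders:

```python
import threading

class AdmissionController:
    """Shed work beyond a predefined concurrency limit so the system stays in known, safe limits."""

    def __init__(self, max_in_flight: int):
        self.max_in_flight = max_in_flight
        self.in_flight = 0
        self.lock = threading.Lock()

    def try_admit(self) -> bool:
        with self.lock:
            if self.in_flight >= self.max_in_flight:
                return False          # shed: caller should degrade instead of queuing without bound
            self.in_flight += 1
            return True

    def release(self) -> None:
        with self.lock:
            self.in_flight -= 1

def do_real_work(request):
    return f"handled {request}"                  # placeholder for the real handler

def serve_stale_or_503(request):
    return "stale cached answer (or a 503)"      # placeholder degraded response

controller = AdmissionController(max_in_flight=100)

def handle_request(request):
    if not controller.try_admit():
        return serve_stale_or_503(request)
    try:
        return do_real_work(request)
    finally:
        controller.release()

print(handle_request("req-1"))
```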
- Testing
- For critical systems, more time is spent testing code than writing it
- Unit tests
- End-to-end tests
- Formal verification
- Deployment (rolling forward, rolling back)
- Static stability
- “A system’s availability can only be as good as its least available dependency”
- Muddies a bit when you consider partial failure and the criticality of individual services
- The point remains: downstream dependencies factor into your ability to stay up!
- Trick: you can turn an unreliable synchronous dependency into an asynchronous one, e.g. fetch a relied-upon source of truth asynchronously and cache the data in my system. If I can tolerate a bit of delay before seeing up-to-date data, this can make my system safer
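- A minimal sketch of that trick, assuming the source of truth can be fetched with a plain function and that slightly stale data is acceptable: a background thread refreshes a local cache, and the request path only ever reads the cache.

```python
import threading
import time

class CachedDependency:
    """Refresh a dependency asynchronously; serve requests from the local cache only.

    A slow or failing dependency then costs freshness, not availability.
    """

    def __init__(self, fetch, refresh_seconds: float = 30.0):
        self.fetch = fetch                  # hypothetical call to the source of truth
        self.refresh_seconds = refresh_seconds
        self.value = None                   # may stay None until the first refresh completes
        threading.Thread(target=self._refresh_loop, daemon=True).start()

    def _refresh_loop(self):
        while True:
            try:
                self.value = self.fetch()   # a failure leaves the last good value in place
            except Exception:
                pass                        # log in a real system; keep serving stale data
            time.sleep(self.refresh_seconds)

    def get(self):
        return self.value                   # possibly stale, but always fast and local

dep = CachedDependency(fetch=lambda: {"feature_on": True}, refresh_seconds=5.0)
```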
- “Constant work” pattern
- Push vs pull
- e.g. One approach to making a config change to a system is to have a tool push it to the target on demand. In this case, if the number of change requests goes up, there will be delays and queuing
- e.g. Another approach is to have the system being managed pull its config and apply it. If the config state is stored on S3 and the target’s update strategy is to pull all of its config from S3 periodically, it’s never asked to do more work at any given moment. It always does the exact same thing to update itself. This kind of system is probably easier to reason about
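- A minimal sketch of that pull-based, constant-work loop, assuming the full desired config lives in a single S3 object (the bucket, key, and apply step are hypothetical):

```python
import json
import time
import boto3

s3 = boto3.client("s3")
BUCKET, KEY = "my-config-bucket", "service/config.json"   # hypothetical names

def apply_config(config: dict) -> None:
    print(f"applying {len(config)} config keys")          # placeholder; should be idempotent

while True:
    # Constant work: fetch and apply the entire config every cycle, regardless of
    # whether zero or a thousand changes happened since the last pull.
    body = s3.get_object(Bucket=BUCKET, Key=KEY)["Body"].read()
    apply_config(json.loads(body))
    time.sleep(60)
```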
- Retries with backoff and jitter
- Client throttling. Clients notice when backends are struggling and stop retries until they detect health
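- A minimal sketch combining the two bullets above: exponential backoff with full jitter, plus a crude client-side throttle that fails fast for a while once the backend looks unhealthy (the attempt counts and cool-off period are made-up values):

```python
import random
import time

class RetryingClient:
    def __init__(self, max_attempts: int = 5, base: float = 0.1, cap: float = 5.0):
        self.max_attempts = max_attempts
        self.base = base
        self.cap = cap
        self.unhealthy_until = 0.0          # client-throttling state

    def call(self, operation):
        # Client throttling: if the backend recently looked unhealthy, fail fast
        # instead of piling retries onto a struggling service.
        if time.monotonic() < self.unhealthy_until:
            raise RuntimeError("backend marked unhealthy; skipping call")

        for attempt in range(self.max_attempts):
            try:
                return operation()
            except Exception:
                if attempt == self.max_attempts - 1:
                    self.unhealthy_until = time.monotonic() + 30.0   # cool off before trying again
                    raise
                # Exponential backoff with full jitter: sleep a random amount
                # in [0, min(cap, base * 2**attempt)].
                time.sleep(random.uniform(0, min(self.cap, self.base * 2 ** attempt)))

client = RetryingClient()
# result = client.call(lambda: call_backend())   # call_backend is hypothetical
```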
- video