Weekly notes: GitLab database design processes, guides
- Big Data is Dead: Most problems are not big data problems, even in analytics. Usually we want to report over a near-term batch of data, and only more rarely produce a broader report. None of these use cases have low-latency constraints, and there’s potential legal liability in keeping data around too long.
- No Time to Do It All! Approaching Overload on DevOps Teams: How do you know you’re experiencing overload, versus there just being a lot of stuff going on?
- Ways of addressing overload
- Shed load
- Reduce thoroughness of results
- Recruit resources
- Shift work in time
- Google threat intelligence group report 2024: Data about who’s attacking end users and enterprise users of the internet, and how. Interesting trend away from end users and towards enterprise, as the former are getting better at adopting security controls and better development practices / memory-safe languages. Enterprise attacks seem to target networking and security appliances with code bases that are not based on modern tooling. (Use-after-free type memory errors stood out, which makes me think of C/C++.)
GitLab public engineering guides
I have stumbled upon a treasure trove of docs related to how GitLab thinks about engineering their database for performance, operability, and availability. It’s all just so well written, and based on hard-won learnings from working with data for a large, very popular website / service that hosts git repositories.
- Database Scalability: Limit on-disk table size to < 100 GB for GitLab.com
- Big tables are a pain to deal with for a number of reasons:
- Queries can be slow
- Adding indexes is hard
- Inserting / updating data is hard (worse if there are lots of indexes)
- Migrations are slow
- In my work life we’ve had trouble syncing replicas and running migrations as part of new releases in certain cases
- They have a table that’s 1.5 TB!! :scream:
- Strategies for addressing
- Data purge policy
- Don’t store blobs in db
- Sharding (horizontal, vertical)
- Drop, merge indexes
- Review data model and improve
- Splitting a table into a primary, frequently accessed table plus a related table with a 1-1 relationship is something they talk about (a rough sketch follows this list)
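The “split a wide table” idea above maps to something like the following sketch, assuming a hypothetical wide `issues` table whose rarely-read `description` column moves into a 1-1 side table. Table and column names are my own illustration, not GitLab’s schema.

```python
# Sketch: move a rarely-read column out of a hot table into a 1-1 side table.
# Assumes a Postgres connection via psycopg2; all names are illustrative.
import psycopg2

conn = psycopg2.connect("dbname=app")
conn.autocommit = True  # each DDL/DML statement commits on its own here
cur = conn.cursor()

# 1. New narrow side table keyed by the parent's primary key (enforces the 1-1 relationship).
cur.execute("""
    CREATE TABLE issue_details (
        issue_id    bigint PRIMARY KEY REFERENCES issues (id) ON DELETE CASCADE,
        description text
    )
""")

# 2. Backfill from the wide table (in a real migration this would be batched).
cur.execute("""
    INSERT INTO issue_details (issue_id, description)
    SELECT id, description FROM issues
""")

# 3. Drop the wide column once application code reads from issue_details instead.
cur.execute("ALTER TABLE issues DROP COLUMN description")
```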
- Migration Style Guide
- They group migrations into 3 kinds (a rough sketch of each follows this list)
- Run with a regular deploy, before the new code is activated
- Run after the code is deployed (post-deployment migrations)
- Run in the background by a batch job triggered separately from a deployment (scheduled, or kicked off by a person)
- They differ in the maximum time each is expected to run and in criticality (roughly <3m, <10m, and >10m respectively)
- Special rules for migrations around big tables where extra scrutiny is required
- Lots of advice around common migration tasks
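A rough illustration of the flavour of work that tends to land in each bucket. The specific statements are my own examples, not GitLab’s migrations; the time limits are the ones from the notes above.

```python
# Illustrative only: example statements I chose for each migration bucket.
MIGRATION_BUCKETS = {
    # 1. Regular migration: runs with the deploy, before new code is active (< ~3 min).
    #    Fast, metadata-only DDL.
    "regular": "ALTER TABLE issues ADD COLUMN closed_at timestamptz",

    # 2. Post-deployment migration: runs after the code is out (< ~10 min).
    #    Heavier but still bounded, e.g. building an index without blocking writers.
    "post_deploy": "CREATE INDEX CONCURRENTLY index_issues_on_closed_at ON issues (closed_at)",

    # 3. Background / batched migration: triggered outside the deploy (> 10 min).
    #    Long-running data backfills, processed in small ranges (see the batching notes below).
    "background": "UPDATE issues SET closed_at = updated_at WHERE id BETWEEN %s AND %s",
}
```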
- Query performance guidelines: Sets service-level objectives for query response times:
- Not hard limits but they are guidelines
- Queries <100ms
- Queries in migration <100ms
- Concurrent operations in a migration <5m
- Concurrent operations in a post deploy migration <20m
- Cold vs. warm cache (the OS file cache is considered cold, the in-memory Postgres cache is warm); see the EXPLAIN sketch after this list
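One way to check a query against those targets, and to see the cold vs. warm distinction, is Postgres’s EXPLAIN (ANALYZE, BUFFERS). A minimal sketch, assuming a psycopg2 connection and an illustrative query:

```python
# Sketch: inspect a query's runtime and cache behaviour with EXPLAIN (ANALYZE, BUFFERS).
# "shared hit" blocks came from Postgres's in-memory buffer cache (warm);
# "read" blocks had to come from the OS / disk (cold, per the notes above).
import psycopg2

conn = psycopg2.connect("dbname=app")
cur = conn.cursor()

cur.execute("""
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT * FROM issues WHERE project_id = %s ORDER BY created_at DESC LIMIT 20
""", (42,))

for (line,) in cur.fetchall():
    # Look for "Execution Time" (< 100 ms target) and the "Buffers: shared hit/read" lines.
    print(line)
```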
- Draft: Add Database Sharding blueprint: Technical design doc describing the need for sharding. It’s not a small project to take on, and a business driver must exist: performance, availability, scalability, or operability. When data gets so big that vertical scaling no longer works and read replicas no longer help, lots of things can start to slow down (e.g. maintenance and new feature work).
- Example sharding conversation trigger criteria (a quick size check follows this list):
- If the feature has tables with over 100M rows
- If the feature has tables with over 100 GB of data (excluding indexes)
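A quick way to see which tables would trigger that conversation, as a sketch against Postgres’s statistics views. The thresholds come from the criteria above; the query itself is my own.

```python
# Sketch: flag tables over the example thresholds (>100M rows, or >100 GB excluding indexes).
# pg_table_size() excludes indexes; n_live_tup is Postgres's live-row estimate.
import psycopg2

conn = psycopg2.connect("dbname=app")
cur = conn.cursor()
cur.execute("""
    SELECT relname,
           n_live_tup           AS approx_rows,
           pg_table_size(relid) AS table_bytes
    FROM pg_stat_user_tables
    WHERE n_live_tup > 100000000
       OR pg_table_size(relid) > 100::bigint * 1024 * 1024 * 1024
    ORDER BY table_bytes DESC
""")
for name, rows, size in cur.fetchall():
    print(f"{name}: ~{rows} rows, {size / 1024**3:.1f} GiB")
```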
- Database Review Guidelines
- Responsibilities of author, reviewer
- Diffs of schema changes should be provided
- Include migration plan with forward and reverse migrations, tested
- Include explain plan output
- Include raw sql
- Ensure migrations are put in the right phases (long running ones shouldn’t go out with deploy)
- Best practices for data layout and access patterns
- Avoid wide tables
- Avoid hot rows that are updated by lots of transactions at the same time
- Instead, you can insert rows without bothering other writers and handle aggregation and other operations separately later (see the sketch after this list)
- Split wide tables
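The “insert now, aggregate later” point is the usual way around a hot counter row. A minimal sketch with hypothetical table names (assumes project_view_counts has project_id as its primary key):

```python
# Sketch: avoid a hot counter row (e.g. UPDATE ... SET view_count = view_count + 1)
# by appending events and rolling them up later. All table names are illustrative.
import psycopg2

conn = psycopg2.connect("dbname=app")
cur = conn.cursor()

# Hot path: a plain INSERT, so concurrent requests never contend on a single row's lock.
cur.execute(
    "INSERT INTO project_view_events (project_id, viewed_at) VALUES (%s, now())",
    (42,),
)
conn.commit()

# Later, a periodic job folds the events into a summary table and clears them.
cur.execute("""
    WITH moved AS (
        DELETE FROM project_view_events RETURNING project_id
    )
    INSERT INTO project_view_counts (project_id, views)
    SELECT project_id, count(*) FROM moved GROUP BY project_id
    ON CONFLICT (project_id) DO UPDATE
        SET views = project_view_counts.views + EXCLUDED.views
""")
conn.commit()
```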
- Batching best practices: Batching is necessary when you’re doing something that touches many rows. How to ensure batch processing makes progress, is idempotent, and can be restarted.
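A minimal sketch of a restartable, keyset-paginated batch loop in that spirit. The backfill, the progress table, and all names are hypothetical:

```python
# Sketch: a restartable batch backfill that walks the primary key in fixed-size ranges
# and records progress, so it can be stopped and resumed. Each batch is idempotent.
import psycopg2

BATCH_SIZE = 1_000

conn = psycopg2.connect("dbname=app")
cur = conn.cursor()

# Resume from wherever the last run stopped (0 if this is the first run).
cur.execute("SELECT coalesce(max(last_id), 0) FROM backfill_progress")
last_id = cur.fetchone()[0]

while True:
    # Keyset pagination: grab the next slice of ids above the high-water mark.
    cur.execute(
        "SELECT id FROM issues WHERE id > %s ORDER BY id LIMIT %s",
        (last_id, BATCH_SIZE),
    )
    ids = [row[0] for row in cur.fetchall()]
    if not ids:
        break  # done, no more rows to process

    # The per-batch work is idempotent: re-running it on the same ids is harmless.
    cur.execute(
        "UPDATE issues SET closed_at = updated_at "
        "WHERE id = ANY(%s) AND closed_at IS NULL",
        (ids,),
    )

    last_id = ids[-1]
    cur.execute("INSERT INTO backfill_progress (last_id) VALUES (%s)", (last_id,))
    conn.commit()  # commit the work and the progress marker together, so restarts pick up cleanly
```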
- OAuth 2.1 explained: Explainer of how OAuth 2.1 works. The JWT method of identifying a user to the backend seems neat (could possibly be done without a DB). The JWT includes the username, and potentially permissions, and it’s tamper-proof.
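A minimal sketch of that idea using the PyJWT library with an HS256 shared secret; the secret and claim names are illustrative, not anything the article prescribes:

```python
# Sketch: a signed JWT carries the username and permissions, so the backend can
# verify it without a database lookup. Uses PyJWT; secret and claims are illustrative.
import time
import jwt

SECRET = "change-me"  # in practice, a key the auth server and backend share (HS256)

# Auth server side: issue a token after the user logs in.
token = jwt.encode(
    {
        "sub": "alice",
        "permissions": ["read:repo", "write:issues"],
        "exp": int(time.time()) + 3600,  # expires in an hour
    },
    SECRET,
    algorithm="HS256",
)

# Backend side: verifying the signature proves the claims weren't tampered with.
claims = jwt.decode(token, SECRET, algorithms=["HS256"])
print(claims["sub"], claims["permissions"])
```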
- API Tokens: A Tedious Survey: Talks about several different ways a client can identify the user it acts on behalf of to a server. Very nicely put together by somebody who knows what they’re talking about.
- JSON Web Token (JWT) Profile for OAuth 2.0 Client Authentication and Authorization Grants: The RFC for using JSON Web Tokens (pronounced ‘jot’) for user identity in requests to servers. (Spec / standard / whatever lol)
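A sketch of the authorization-grant side of that RFC (the JWT bearer grant type), using PyJWT and requests. The endpoint URL, client id, and key file are placeholders:

```python
# Sketch of the RFC 7523 JWT bearer grant: the client sends a signed JWT as its
# credential to the token endpoint. URLs, client id, and key file are placeholders.
import time
import jwt       # PyJWT (RS256 signing needs the cryptography package installed)
import requests

assertion = jwt.encode(
    {
        "iss": "my-client-id",
        "sub": "my-client-id",
        "aud": "https://auth.example.com/oauth/token",
        "exp": int(time.time()) + 300,
    },
    open("client_private_key.pem").read(),
    algorithm="RS256",
)

resp = requests.post(
    "https://auth.example.com/oauth/token",
    data={
        "grant_type": "urn:ietf:params:oauth:grant-type:jwt-bearer",
        "assertion": assertion,
    },
)
print(resp.json())
```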