Vivcre Learn learn it · write it · retain it

← All posts

Database #b-tree#indexes#storage-engines#pages#query-cost

Why an Index Turns 200,000 Reads Into 4

25 Jun 2026

A table is not a magic cloud of rows. On disk it is a file split into fixed-size blocks called pages (Postgres uses 8 KB). Each page holds a handful of rows packed together, and the database can only read a whole page at a time. The single expensive operation in all of this is reading a page from disk. So hold onto one cost model:

The cost of a query ≈ how many pages it had to read.

The problem: sequential scan

Ask for email = 'ana@co' on a table with no index. The database reads page 0 — nothing. Page 1 — nothing. It cannot know the match isn’t there until it has looked, so it grinds through every page to the end. That is a sequential scan (Seq Scan in EXPLAIN), and its cost grows linearly with the table. Ten million rows across ~200,000 pages means ~200,000 reads to answer one lookup — and every row you add makes every un-indexed lookup a little slower, forever.

The fix: a sorted, nested directory

Think of a dictionary. Because the words are sorted, seeing “systems” lets you throw away the entire first half of the book without reading a word. Sorted keys let you discard most of the data without looking at it — the one move a scan can’t make. But one sorted list only gets you a binary search. The leap is nesting the narrowing into levels: a small top entry points to a medium entry points to the exact spot. That nesting is a B-tree.

Each B-tree node is itself a page, and a single page holds hundreds of these range entries. So each level you descend multiplies your reach by a few hundred: one level covers hundreds of rows, two ≈ tens of thousands, three ≈ tens of millions. That is why a B-tree finds any row in a 10-million-row table in about 3–4 page reads, and adding rows barely moves that number — it grows logarithmically, not linearly. The gap between “200,000 reads, getting worse forever” and “4 reads, basically flat” is the biggest performance lever in everyday backend work.

One refinement that bites people later: the index is the sorted thing, not the table. The rows (the heap) can sit in any physical order the database likes. The index just keeps a sorted list of key → pointer, and the leaf doesn’t contain the row — it points to the heap page that does. Sorted index, unsorted heap.

The price: writes scatter

Keeping everything sorted in place has a cost. Insert into a leaf page that’s already full and it can’t stretch — it splits into two half-full pages and pushes a separator key up to the parent, which may split too, cascading toward the root. Worse than splits, though: because the index is sorted by key, rows arriving in random key order (ann@, then zoe@, then max@) land in pages scattered all over the disk. That is random I/O, one to two orders of magnitude slower than appending sequentially.

So a B-tree is read-optimized: cheap, predictable lookups, paid for with writes that scatter. The villain is specifically random writes, not writes in general — keys that arrive already ordered (an auto-increment integer, a timestamp) all land at the rightmost leaf, which is sequential and fast. That’s exactly why experienced engineers reach for time-ordered IDs (ULID, UUIDv7) over random UUIDv4 for primary keys: they restore the locality that keeps a B-tree happy. And when writes genuinely dominate, you reach for a different storage engine entirely — the LSM tree.


Practise these questions →

Spaced-repetition MCQs for this post, on practise.vivcre.com.