Things you didn't know about indexes

jon.chrt.dev

Let’s start with things you probably did know.

A database index is similar to the index you found at the back of your science textbooks in school. You want to find the pages that talk about Phosphorus? Head to the back and you’ll find an alphabetical list of topics alongside page numbers that reference them.

...
Oxygen .................. 42, 88, 103
Periodic table .......... 12–15
Phosphorus .............. 67, 91
Photosynthesis .......... 54, 72, 110
Potassium ............... 33, 78
...

Let’s imagine we have a database table of Pokémon, like so¹:

id   | name       | type_1   | type_2   | generation | is_legendary | base_attack
-----+------------+----------+----------+------------+--------------+-------------
1    | Bulbasaur  | Grass    | Poison   | 1          | false        | 49
4    | Charmander | Fire     | NULL     | 1          | false        | 52
...
25   | Pikachu    | Electric | NULL     | 1          | false        | 55
...
150  | Mewtwo     | Psychic  | NULL     | 1          | true         | 110

Without an index, finding Pikachu means the database reads every row, one by one, checking the name column for each row. On four rows, that would be nearly instant. On ten million, it could be problematic. That kind of read is called a full table scan, and it’s about as fast as it sounds (not very²).

But, if we add an index on name we get something that conceptually looks a bit like our book index above:

name          row
------------+-----
Bulbasaur   → 1
Charmander  → 4
Mewtwo      → 150
Pikachu     → 25

It’s sorted, and so now our database can binary-search for the name and find the given row, rather than scanning the entire table.

Postgres stores this as a B-tree under the hood, but the idea is the same as the textbook: sorted data you can search quickly.

So, let’s index all the things, right?

The cost of indexing

As tempting as it may be to index everything, now we’ve seen what they can do for our queries, there are trade-offs to consider. With indexes, always keep in mind:

Reads get faster, writes get slower.

With our shiny new index, every INSERT, UPDATE or DELETE now has to update the index too. When we add a new Pokémon, the database has to find the right spot in the sorted name index and slot it in. And we can have multiple indexes, so you can multiply that by each one.

Additionally, indexes are real data structures and we must store them. They live on disk and are pulled into memory. A table with eight indexes has nine things to keep warm in cache, not just one.

And let’s not forget the query planner. When you run a query, the planner’s job is to pick the cheapest way to answer it. The more indexes you have, the more options it weighs, so planning time grows and could even exceed execution time on fast lookups.

Why your index isn’t working

You’ve got as far as weighing the trade-offs and deciding an index is right for the job. Excellent. But it’s not working as you’d hoped. You’re not seeing an improvement, or worse, you’re seeing speed diminish.

Let’s cover some common gotchas.

Composite indexes care about order

Looking back at our Pokémon table, you may have decided a good index is one on type_1 and type_2. After all, “show me all the Pokémon that are Water and Flying types” is a perfectly reasonable query.

CREATE INDEX ON pokemon (type_1, type_2);

This index will definitely help with queries like these:

SELECT * FROM pokemon WHERE type_1 = 'Water';
SELECT * FROM pokemon WHERE type_1 = 'Water' AND type_2 = 'Flying';

But this one? Not so much:

SELECT * FROM pokemon WHERE type_2 = 'Flying';

Do you find that surprising?

When you create a composite index (type_1, type_2) you are asking the database to create a structure that looks something like this:

Bug      → Flying   → [Butterfree, Beedrill, ...]
         → Poison   → [Venonat, Spinarak, ...]
         → Water    → [Surskit, ...]
Electric → NULL     → [Pikachu, Raichu, ...]
         → Flying   → [Zapdos, ...]
         → Steel    → [Magnemite, ...]
Fire     → NULL     → [Charmander, Vulpix, ...]
         → Flying   → [Charizard, Moltres, ...]
         → Ground   → [Numel, ...]
Grass    → Poison   → [Bulbasaur, Oddish, ...]
         → ...
Water    → NULL     → [Squirtle, Psyduck, ...]
         → Flying   → [Wingull, Pelipper, ...]
         → Ground   → [Wooper, ...]

It sorts first by type_1, and then by type_2 within each group. That makes the index great for queries on type_1, and even better for queries on type_1 AND type_2 together, but won’t be used the way you hope for queries on type_2 alone. Look at Flying: it’s scattered under Bug, Electric, Fire and Water with no single place for the database to jump to.

What we must ask ourselves is: What are the common queries going to be? If we will query on type_2 alone as frequently as type_1, then we should create a second index on type_2.

Functions defeat your index

Case-insensitive search is a feature we often add without thinking. Users don’t care whether they type “Pikachu”, “pikachu” or “PiKaChU” - they just want the result. So we might reach for something like this:

SELECT * FROM pokemon WHERE lower(name) = 'pikachu';

I don’t know about you, but I wouldn’t look twice at a query like that. It seems absolutely fine. We’ve got our index on name from earlier, so this query should fly. But, of course, it doesn’t.

Why? The index is on name, not lower(name).

As far as the database is concerned, those are two completely different things. Your index is a sorted list of values like Bulbasaur, Charmander, Squirtle, Pikachu not bulbasaur, charmander, squirtle, pikachu. So when you ask for rows where lower(name) = 'pikachu', the database has no sorted structure to consult and falls back to scanning the whole table, lowercasing each name as it goes.

This applies to any function wrapping the column. If the database can’t see the raw indexed column on the left side of the comparison, the index is off the table.

And here’s a real gotcha: Implicit conversions count too. When you compare some text against an integer, or a timestamp against a string, it can trigger a silent conversion that has the same effect as wrapping a column in a function.

How can we get around this? Well, fortunately, we can build an index on an expression. An index on lower(name) is perfectly valid, and the query above would use it without complaint. We’ll come back to this.

How can I avoid these pitfalls?

The short answer here is, don’t guess. Measure. How? By asking the database, of course!

Postgres ships with a tool called EXPLAIN, which tells you exactly how it plans to run a query. Slap it in front of any SELECT and it will return a report of what the database is about to do:

EXPLAIN SELECT * FROM pokemon WHERE name = 'Pikachu';

Index Scan using pokemon_name_idx on pokemon
  Index Cond: (name = 'Pikachu'::text)

That Index Scan is what you want to see. It means the database is using your index. Compare that with our problem query from earlier:

EXPLAIN SELECT * FROM pokemon WHERE lower(name) = 'pikachu';

Seq Scan on pokemon
  Filter: (lower(name) = 'pikachu'::text)

Seq Scan means it’s reading every row. No index involved.

If you want real timings to go with that rather than the planner’s estimates, use EXPLAIN ANALYZE. It actually runs the query and reports what happened. Try it on a query you think you understand. What comes back can often be surprising.

Things nobody told you existed

We’ve covered the basic single column index and its composite sibling. Between them, they cover the majority of cases. But there are a few other kinds of index that we can field, and they’re sometimes just what you need.

Functional indexes

We touched on this earlier, with this query:

SELECT * FROM pokemon WHERE lower(name) = 'pikachu';

We established that an index on name won’t help, because our database is looking for rows where lower(name) equals something, and it has no sorted structure for lower(name). The fix is to give it one:

CREATE INDEX ON pokemon (lower(name));

That is a functional index (also called an expression index). Instead of indexing the raw column, you index the result of an expression applied to the column.

This is not limited to lower(), of course. You can index any deterministic expression:

CREATE INDEX ON pokemon (lower(name));
CREATE INDEX ON users ((created_at::date));
CREATE INDEX ON pokemon ((base_attack * 2));

You can use any deterministic and immutable function.

Watch out though. Reaching for functional indexes as your first resort can often be a smell. If we’re querying on the lowercase form of the name so often, we should ask ourselves why we haven’t stored the name in lowercase. We can always apply lower to the input.

Partial indexes

Most indexes cover every row in the table. But sometimes you only ever query a small slice of it, and indexing the entirety is wasteful. Remember, indexes are not free.

In our Pokémon table, we have a column called is_legendary. For those unfamiliar, at the time of writing, approximately 1000 species of Pokémon exist and of those, only 80 or so are considered legendary. Less than 10%.

You can imagine an application with an option to “show me all legendaries”. Initially, one might consider creating a compound index like so:

CREATE INDEX ON pokemon (is_legendary, name);

This works, but it’s overkill. The index now contains an entry for every pokemon, of which the vast majority are not legendary. We’ll be paying the storage and write cost for 1000 rows to serve queries that only care about 80. In fact, because is_legendary is a boolean, we end up with a structure where we initially sort into two groups of False and True, where the former half is 920 entries of deadweight, paying their share of every write to the table for queries that ignore them.

A partial index covers only the rows that match a condition:

CREATE INDEX ON pokemon (name) WHERE is_legendary = true;

Now the index has just 80 entries instead of 1000. It’s smaller, faster to query; and queries like WHERE is_legendary = true use it happily. Queries that don’t match the condition (WHERE is_legendary = false) fall back to a full table scan, which is exactly what you’d want. Those queries match almost every row anyway, so an index isn’t helping much.

Any time you have a column where you mostly query a single value is a time you might consider a partial index. Imagine indexing the email column of your users table, where you’ve implemented soft-delete:

CREATE INDEX ON users (email) WHERE deleted_at IS NULL;

Soft-deleted rows are going to be almost never queried, but would still bloat your indexes if you don’t filter them out.

Covering indexes

When a database uses an index to find a row, that’s only half the work. The index points to where the row lives, but then the database must go to the table itself and fetch the rest of the columns the query asked for. Two trips.

But what if I told you it doesn’t have to?

If the index already contains every column the query needs, the database can answer the query from the index alone, never touching the table. This is a covering index. When viewed with EXPLAIN, you’ll see it as an Index Only Scan, three of the sweetest words EXPLAIN can output.

Here’s a query:

SELECT name FROM pokemon WHERE is_legendary = true;

Our partial index from the previous section happens to cover this query perfectly. The index is on name, filtered to legendaries and the query asks for name from legendaries. Everything the query needs is already in the index. No trip to the table required.

We can also build covering indexes deliberately. Postgres lets you tack extra columns onto an index using INCLUDE:

CREATE INDEX ON pokemon (name) INCLUDE (base_attack);

Now a query like this can be answered entirely from the index:

SELECT name, base_attack FROM pokemon WHERE name = 'Pikachu';

Why not just put base_attack in the indexed columns themselves? Because the database treats indexed columns as things to sort and search by. If you add base_attack to the index proper, you’re telling the database to sort the index by name and then by base_attack, work it has to do on every write. Work that has no benefit when you only ever search by name. INCLUDE is a way to say “carry this column along for the ride, but don’t bother sorting by it”.

Indexes can feel like an afterthought to the novice database developer. Even seasoned engineers often create them without much thought, or leave them off entirely. But intelligent indexing will make or break your database performance.

Want to go deeper? Use The Index, Luke will teach you everything you could ever want to know.

Yes, this table is not winning any normalization awards.
To be fair to the full table scan, it’s actually extremely fast in absolute terms. Modern databases can rip through millions of rows a second. The problem isn’t that it’s slow, it’s that it’s linear. Double the rows, double the time. An index lookup, by comparison, barely notices the difference.

Spotted a typo?