Postgres Indexes for ActiveRecord Join Tables in Rails Apps

Indices in join tables are represented by two ducks. Photo by Amir-abbas Abdolali on Unsplash

Join tables are a common citizen in Ruby on Rails apps. Their use case is to provide many to many relation between database models. Adding correct Postgres indexes on join tables is not obvious. I’ve noticed that some tutorials, Stack Overflow posts, and even Rails itself provide incorrect advice on how to do it. In this tutorial, we’ll look into how compound PostgreSQL indexes work and how to correctly use them with join tables.

We will start by explaining the basics of join models and later deep dive into analyzing the output of the PostgreSQL EXPLAIN query plan with different database indexes.

What is a join table and model in Rails?

Many to many relation is often necessary to model the business logic of a web app. Let’s illustrate it with a flagship scenario: a user that has many favorite products.

If you already have a Product and User models, you can well join them using a join table by writing the following migration:

class CreateUsersProducts < ActiveRecord::Migration
  def change
    create_table :users_products do |t|
      t.integer :user_id, null: false
      t.integer :product_id, null: false
      t.timestamps
    end
  end
end

Let’s look at the model files now:

app/models/product.rb

class Product < ApplicationRecord
  has_many :user_products, dependent: :destroy
  has_many :users, through: :user_products
end

app/models/user.rb

class User < ApplicationRecord
  has_many :user_products, dependent: :destroy
  has_many :products, through: :user_products
end

app/models/user_product.rb

class UserProduct < ApplicationRecord
  self.table_name = "users_products"

  belongs_to :user
  belongs_to :product
end

With this structure in place, you can assign multiple users to products and vice versa:

product = Product.last
user = User.last

product.users << user
product.users #=> [user]
user.products #=> [product]

If you need more fine tuned control you can use the join model directly:

product = Product.last
user = User.last

UserProduct.create!(
  product_id: product.id,
  user_id: user.id
)

product.users #=> [user]
user.products #=> [product]

Avoid using `has_and_belongs_to_many`!

You could use has_and_belongs_to_many relation with create_join_table migration helper method and omit to declare the UserProduct model. I don’t recommend this approach. It might be quicker to start with but much more constrained in the long run. E.g., create_join_table generates a table without a primary key, so you couldn’t easily add a customizable join model afterward.

How to add proper indexes to the join table in Rails?

You might have noticed that we did not add any indexes to the users_products table yet. For larger dataset, this setup could result in slow queries killing the performance of a web app. Let’s analyze how Postgres handles fetching objects through the join table without indexes.

Query EXPLAIN plan when indexes are missing

Unfortunately, ActiveRecord does not support EXPLAIN ANALYZE out of the box. You can use this gem if you’d like to measure the actual performance of the queries. In the below examples, we’ll use the built-in explain method, which displays the detailed PostgreSQL plan for executing the query.

user.products.to_sql

"SELECT products.* FROM products
  INNER JOIN users_products
    ON products.id = users_products.product_id
  WHERE users_products.user_id = 5"

user.products.explain


# Nested Loop
# ->  Seq Scan on user_products
#       Filter: (user_id = 2392)
# ->  Index Scan using products_pkey on products
#       Index Cond: (id = user_products.product_id)

user.products.where(id: 1).to_sql

"SELECT products.* FROM products
  INNER JOIN users_products
    ON products.id = users_products.product_id
  WHERE users_products.user_id = 5 AND products.id = 1"

user.products.where(id: 1).explain

# Nested Loop
# ->  Index Scan using products_pkey on products
#       Index Cond: (id = '1')
# ->  Seq Scan on user_products
#       Filter: ((product_id = '1') AND (user_id = 2392))

Output truncated for brevity

As you can see in both cases, it is performing a Seq Scan on users_products table. Seq Scan means that Postgres has to loop through all the rows of a table to run the query. For a larger dataset, it is usually terribly inefficient.

Let’s add indexes to see how it will change the behavior.

Query EXPLAIN plan with two compound indexes

We’ll start with the approach I often see mentioned in tutorials. Even Rails suggests adding two compound indexes when generating migration using create_join_table helper.

class AddIndexesToUsersProducts < ActiveRecord::Migration
  disable_ddl_transaction!

  def change
    add_index :users_products, [:product_id, :user_id],
      unique: true, algorithm: :concurrently

    add_index :users_products, [:user_id, :product_id],
      unique: true, algorithm: :concurrently
  end
end

Unless you have a non-standard use case, you should add unique indexes to validate join objects on the database level.

Using disable_ddl_transaction! and algorithm: :concurrently is the best practice that allows you to add indexes even to large tables without acquiring a full table lock. Otherwise, migration could easily bring down your production.

You can check out my other blog post if you want to learn more exciting tricks on how to crash your production servers.

Let’s start by replying our previous queries:

user.products.explain

# Nested Loop
# ->  Bitmap Heap Scan on users_products
#   Recheck Cond: (user_id = 2392)
#     ->  Bitmap Index Scan on
#         index_users_products_on_user_id_and_product_id
#           Index Cond: (user_id = 2392)
#     ->  Index Scan using products_pkey on products
#         Index Cond: (id = users_products.product_id)

user.products.where(id: 1).explain

# Nested Loop
# ->  Index Scan using products_pkey on products
#   Index Cond: (id = '1')
#   ->  Index Only Scan using
#      index_users_products_on_user_id_and_product_id
#      on users_products
#      Index Cond: ((user_id = 2392) AND (product_id = '1'))

As you can see Seq Scan has been replaced by Index Scan. But if you analyze Index Cond, you’ll see that only the second query uses both fields from the compound index_users_products_on_user_id_and_product_id index.

Let’s see what will happen if we query the other side of the relation:

product.users.explain

# Nested Loop
# ->  Bitmap Heap Scan on users_products
#     Recheck Cond: (product_id = 74190)
#       ->  Bitmap Index Scan on
#           index_users_products_on_product_id_and_user_id
#           Index Cond: (product_id = 74190)
# ->  Index Scan using users_pkey on users
#     Index Cond: (id = users_products.user_id)

product.users.where(id: 1).explain

# Nested Loop
# ->  Index Scan using users_pkey on users
#       Index Cond: (id = '1')
# ->  Index Only Scan using
#     index_users_products_on_user_id_and_product_id
#     on users_products
#     Index Cond: ((user_id = '1') AND (product_id = 74190))

The first query behaves as expected. It uses product_id Index Cond from index_users_products_on_product_id_and_user_id index. But if you look at the second output, you’ll notice that it is still using the same index as the query executed in another direction!

It means that duplicating the compound indexes for join tables is unnecessary. Postgres is smart enough to use both keys of a compound index if they are matching, regardless what’s the order of columns in a query.

I’ve validated this behavior with PostgreSQL from version 9 up to 12.

Simpler SQL query

The above examples might be a bit convoluted. Here’s a more straightforward query you can use to replicate this behavior:

query_a = UserProduct.where(user_id: 1, product_id: 2)

query_a.to_sql

"SELECT users_products.* FROM users_products
  WHERE users_products.user_id = 1
  AND users_products.product_id = 2"

query_a.explain

# Index Scan using
#   index_users_products_on_user_id_and_product_id
#   on users_products
#     Index Cond: ((user_id = 1) AND (product_id = 2))

query_b = UserProduct.where(product_id: 2, user_id: 1)

query_b.to_sql

"SELECT users_products.* FROM users_products
  WHERE users_products.product_id = 2
  AND users_products.user_id = 1"

query_b.explain

# Index Scan using
#   index_users_products_on_user_id_and_product_id
#   on users_products
#     Index Cond: ((user_id = 1) AND (product_id = 2))

Both compound indexes were present when running those queries. One was working hard looking up the query results. The other one was sitting there, useless, doing nothing.

Usually, you can track unused indexes with tools that display those that are getting few hits e.g., Rails PG Extras. Unfortunately, detecting a partially used compound index is not that simple using automated tooling.

Correct indexes migration for join tables

To optimize usage of database resources, the migration adding a join table to your Rails app could look like that:

class CreateUsersProducts < ActiveRecord::Migration
  def change
    create_table :users_products do |t|
      t.integer :user_id, null: false
      t.integer :product_id, null: false
      t.timestamps
    end

    add_index :users_products, [:product_id, :user_id], unique: true
    add_index :users_products, :user_id
  end
end

With those indexes in place accessing any side of relation will generate to following EXPLAIN query plan output:

product.users.explain

# Nested Loop
# ->  Bitmap Heap Scan on users_products
#     Recheck Cond: (product_id = 74655)
#      ->  Bitmap Index Scan on
#       index_users_products_on_product_id_and_user_id
#         Index Cond: (product_id = 74655)
# ->  Index Scan using users_pkey on users
#     Index Cond: (id = users_products.user_id)

product.users.where(id: 1).explain

# Nested Loop
# ->  Index Scan using users_pkey on users
#     Index Cond: (id = '1')
# ->  Index Only Scan using
#     index_users_products_on_product_id_and_user_id
#     on users_products
#       Index Cond: ((product_id = 74655) AND (user_id = '1'))

user.products.explain

# Nested Loop
# ->  Bitmap Heap Scan on users_products
#     Recheck Cond: (user_id = 2395)
#       ->  Bitmap Index Scan on
#           index_users_products_on_user_id
#           Index Cond: (user_id = 2395)
# ->  Index Scan using products_pkey on products
#     Index Cond: (id = users_products.product_id)

user.products.where(id: 1).explain

# Nested Loop
# ->  Index Scan using products_pkey on products
#         Index Cond: (id = '1')
# ->  Index Only Scan using
#     index_users_products_on_product_id_and_user_id
#     on users_products
#       Index Cond: ((product_id = '1') AND (user_id = 2395))

Even with only a single compound index in place, none of the queries has to resort to performing an inefficient Seq Scan operation. A single-column index index_users_products_on_user_id handles finding products assigned to users. product_id column from the compound index index_users_products_on_product_id_and_user_id takes care of finding users assigned to a product.

The same compound index is used on both user_id and product_id columns interchangeably for queries in both directions.

Summary

Keeping unnecessarily bloated and unused indexes not only uses up the disk space of a database but also slows down the write performance.

This might seem like a micro-optimization hardly worth the effort. Like with many PostgreSQL performance tips, they only make some actual difference when the dataset starts to grow. The best approach is to follow good practices from the start of designing your database schema.