PostgreSQL database locks usually work seamlessly until they don't. Before your Rails app's dataset and traffic reach a certain scale, you're unlikely to face any lock-related issues. But if your app suddenly slows down to a crawl, database locks are likely to blame. In this blog post, I'll describe how to monitor database locks and propose best practices to prevent them from causing issues.
PostgreSQL database locks 101
Locks in PostgreSQL are a mechanism for ensuring data consistency under concurrent access. This blog post is not trying to be a comprehensive introduction to how locks and the related MVCC (Multi-Version Concurrency Control) system work in PG databases. Instead, I'll focus on the basics needed to understand how locks affect the performance and stability of Ruby on Rails apps.
It's important to understand that every single write to the database acquires a lock. A well-behaved database usually commits changes in a few milliseconds, so observing the related locks is not straightforward. As a workaround, you can simulate a long-lasting lock by running code similar to this in the Rails console:
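The snippet below is a minimal sketch; it assumes a User model with an email column:

```ruby
# Update a row inside a transaction and keep the transaction
# (and its locks) open for 60 seconds
User.transaction do
  User.first.update(email: "locks-test@example.com")
  sleep 60
end
```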
Now, in a separate Rails console process, run the locks method from the rails-pg-extras gem:
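Something like this (the exact columns in the output depend on your rails-pg-extras version):

```ruby
RailsPgExtras.locks
# While the transaction above is still open, the output includes
# RowExclusiveLock entries for the users table and its indexes
```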
As you can see, even a straightforward single-column update operation acquires an array of RowExclusiveLock locks on the corresponding table and all of its indexes. Usually, this process works seamlessly behind the scenes. A lock affecting a single row is not critical (apart from slowing down a single request or job) as long as it's not blocking other queries.
But let’s see what happens when we try to update the same row concurrently.
In the first Rails console process run:
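(A sketch, reusing the same hypothetical User model as before.)

```ruby
# First console: update a row and hold the transaction open
User.transaction do
  User.first.update(email: "first@example.com")
  sleep 120
end
```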
Then in the second one:
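(Again a sketch of the concurrent update.)

```ruby
# Second console: try to update the same row; this call will hang
# until the first transaction commits or rolls back
User.first.update(email: "second@example.com")
```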
and now, in the third Rails console process, we can investigate the collision of concurrent update operations:
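(A sketch using the two rails-pg-extras methods discussed in this post.)

```ruby
# Third console: check which queries hold locks and which are blocked
RailsPgExtras.locks
RailsPgExtras.blocking
```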
As you can see, the acquired lock prevents the second SQL statement from finishing, which is reported by the blocking method from rails-pg-extras. In practice, a web server request trying to execute this SQL statement would be stuck until the first lock is released. This is how misbehaving locks can kill the performance of your application or even cause downtime by exhausting the available web server connections.
Monitoring PostgreSQL locks in production
You're unlikely to sprinkle sleep calls all over your codebase just to make database locks easier to investigate. Here's a sample implementation of a Sidekiq job that can help you monitor long-lasting locks and blocked SQL statements in production:
config/initializers/pg_monitoring.rb
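A minimal sketch of such an initializer; it only enqueues the first job when the PG_MONITOR_LOCKS ENV variable is set:

```ruby
# config/initializers/pg_monitoring.rb

# Enqueue the first monitoring job on boot, only if monitoring is enabled
Rails.application.config.after_initialize do
  LocksMonitoringJob.perform_async if ENV["PG_MONITOR_LOCKS"].present?
end
```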
app/jobs/locks_monitoring_job.rb
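And a sketch of the job itself. The ENV variable names match the description below; the SLACK_WEBHOOK_URL variable, the in_format: :hash option, and the age column used to measure lock duration are assumptions, so double-check them against your rails-pg-extras version:

```ruby
# app/jobs/locks_monitoring_job.rb
class LocksMonitoringJob
  include Sidekiq::Job # use Sidekiq::Worker with older Sidekiq versions

  def perform
    return unless ENV["PG_MONITOR_LOCKS"].present?

    begin
      blocked_queries = RailsPgExtras.blocking(in_format: :hash)
      long_locks = RailsPgExtras.locks(in_format: :hash).select do |lock|
        interval_to_ms(lock["age"]) >= min_lock_duration_ms
      end

      if blocked_queries.any? || long_locks.any?
        notifier = Slack::Notifier.new(ENV.fetch("SLACK_WEBHOOK_URL"))
        notifier.ping(
          "Blocked queries:\n#{blocked_queries.inspect}\n\n" \
          "Long-lasting locks:\n#{long_locks.inspect}"
        )
      end
    ensure
      # Reschedule itself so monitoring keeps running
      self.class.perform_in(monitoring_interval_seconds)
    end
  end

  private

  # "age" is assumed to be a Postgres interval string like "00:00:01.5"
  def interval_to_ms(interval)
    return 0 if interval.blank?

    hours, minutes, seconds = interval.to_s.split(":").map(&:to_f)
    ((hours * 3600 + minutes * 60 + seconds) * 1000).round
  end

  def min_lock_duration_ms
    ENV.fetch("PG_MIN_LOCK_DURATION_MS", 200).to_i
  end

  def monitoring_interval_seconds
    ENV.fetch("PG_MONITORING_INTERVAL_SECONDS", 300).to_i
  end
end
```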
We use an initializer file to schedule the first task; monitoring can be disabled by removing the PG_MONITOR_LOCKS variable. The job reschedules itself every PG_MONITORING_INTERVAL_SECONDS and reports locks and blocked queries that cross the PG_MIN_LOCK_DURATION_MS threshold using the slack-notifier gem (check out the Slack docs for info on how to configure incoming webhooks). You can tweak the ENV variables to adjust monitoring sensitivity based on your app's requirements.
If your app is running heavyweight background jobs, you'll likely have to manually exclude some queries from being reported. And if your current database performance is far from perfect, you should bump up PG_MIN_LOCK_DURATION_MS and gradually lower it once you manage to fix the causes of long-lasting locks. Excluding background jobs and data migrations, a healthy database should hold exclusive locks for a maximum of ~100-200ms.
Running a similar monitoring task can help you spot many lock-related issues early on before they start affecting end users.
Update: I've wrapped this implementation in an easy-to-use gem: pg-locks-monitor.
How to resolve lock-related issues in Rails apps
So your Slack alerts channel is getting spammed with notifications about long-lasting locks and blocked queries. What now?
Overusing ActiveRecord::Base.transaction scope
I've seen misbehaving locks with a duration of a few seconds. Surprisingly, the underlying issue didn't even originate in the database layer. ActiveRecord makes it too easy to wrap an arbitrary chunk of code in a database transaction. Have a look at the following example:
app/models/user.rb
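Something along these lines (EmailApiClient is a hypothetical stand-in for any synchronous third-party API client):

```ruby
# app/models/user.rb
class User < ApplicationRecord
  after_create :send_welcome_email

  private

  def send_welcome_email
    # Synchronous HTTP call to a third-party email API;
    # EmailApiClient is a hypothetical client class
    EmailApiClient.deliver_welcome_email(email)
  end
end
```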
ActiveRecord callbacks are a common way to trigger side effects based on the model object's lifecycle. For example, sending a welcome email to newly registered users is a typical use case. Let's set aside the discussion of whether such logic belongs in a model at all. Instead, let's focus on how the database layer reacts to an implementation like this.
Something to watch out for is that, with the exception of after_commit, all ActiveRecord callbacks are implicitly wrapped in a database transaction. So, the above implementation would keep a RowExclusiveLock on the user row for the whole time needed to connect to the third-party API and send an email. Instead of the expected ~200ms maximum, the lock could last up to 60 seconds if the API times out.
This example might seem contrived. But I've seen multiple Rails apps suffering from long-lasting locks caused by third-party HTTP calls executed inside implicit database transactions.
The hidden complexity of ActiveRecord makes it too easy to commit similar mistakes, even for experienced devs. That's why proper locks monitoring can help you spot such issues, even if they slip through code review.
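In the example above, one way to stop holding the lock is to move the side effect to an after_commit callback, which runs outside the transaction (offloading the API call to a background job is even better):

```ruby
class User < ApplicationRecord
  # Runs after the transaction has been committed,
  # so no row lock is held while waiting for the API
  after_commit :send_welcome_email, on: :create

  private

  def send_welcome_email
    EmailApiClient.deliver_welcome_email(email)
  end
end
```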
Other ways to fix long-lasting locks
Apart from limiting the scope of ActiveRecord transactions, the "simplest" way to decrease lock duration is to optimize the performance of the database layer. Check out this blog post for an in-depth guide on analyzing and fixing PostgreSQL performance with the help of the rails-pg-extras gem.
But fine-tuning SQL queries is often a time-consuming process. So, here's a list of "quick & easy" ad hoc solutions to try before you manage to deploy proper fixes. These are "hacks", but they might help if your app is currently unstable because of lock-related issues.
1. Throttle Sidekiq jobs
One way to limit the occurrence of locks is to reduce the concurrency of database commits. It will limit the scalability and throughput of your app, but the tradeoff might be worth it if you're experiencing lock-related downtime.
I usually recommend structuring the Sidekiq config file in a similar way:
config/sidekiq.yml
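A sketch of such a config; the SIDEKIQ_CONCURRENCY variable name is just an example, and Sidekiq evaluates this file with ERB, so the ENV lookup works out of the box:

```yaml
# config/sidekiq.yml
:concurrency: <%= ENV.fetch("SIDEKIQ_CONCURRENCY", 10) %>
:queues:
  - default
```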
Even if you're not currently planning to tweak the concurrency setting, using an ENV variable allows you to change this value without a code release. A panic fix could be to throttle concurrency down to 1 to minimize the impact of concurrent locks before you apply a proper solution.
Alternatively, if you know which Sidekiq job is causing the issue, you could prevent it from running concurrently with the help of the sidekiq-unique-jobs gem, as shown below. This is also a good way to limit the maximum memory usage of Sidekiq for more heavyweight jobs.
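A sketch of what that might look like; the lock option reflects recent sidekiq-unique-jobs versions, so check the gem's README for the exact syntax your version supports:

```ruby
class HeavyDataSyncJob
  include Sidekiq::Job

  # Prevent two instances of this job from running at the same time
  sidekiq_options lock: :until_executed

  def perform(account_id)
    # ... heavyweight work that acquires database locks ...
  end
end
```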
2. Limit the app’s concurrency
Similar throttling can be done for web app processes. Limiting your app's concurrency could reduce performance by increasing request queue time. But it's also possible that the throttled app will end up more performant overall because fewer SQL queries block each other.
You can implement your Puma web server config in a similar way, so you can tweak the settings without the need for a code release:
config/puma.rb
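A sketch of the relevant part of such a config; the PUMA_* variable names are illustrative:

```ruby
# config/puma.rb (excerpt)
workers Integer(ENV.fetch("PUMA_WORKERS", 2))

max_threads = Integer(ENV.fetch("PUMA_MAX_THREADS", 5))
min_threads = Integer(ENV.fetch("PUMA_MIN_THREADS", max_threads))
threads min_threads, max_threads
```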
3. Kill stuck connections
RailsPgExtras.locks and RailsPgExtras.blocking could report that a single long-running query is blocking dozens of other database commits. Depending on the scale of the issue, such a traffic jam could cause many of your web server connections to get stuck and your app to become unresponsive. A "panic button" solution that could restore your app's responsiveness is to kill the offending connection. The rails-pg-extras blocking method provides a blocking_pid value that you can use to cherry-pick and terminate the stuck SQL query. Additionally, you can use RailsPgExtras.connections to check the name of the process that generated the problematic query, which can be helpful for debugging.
You can terminate an SQL query/connection based on its PID value by running the following code:
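(A sketch using PostgreSQL's built-in functions; replace the hardcoded PID with the blocking_pid value reported by blocking.)

```ruby
pid = 12345 # blocking_pid reported by RailsPgExtras.blocking

# First, try cancelling just the running query
ActiveRecord::Base.connection.execute("SELECT pg_cancel_backend(#{pid})")

# If that's not enough, terminate the whole connection
ActiveRecord::Base.connection.execute("SELECT pg_terminate_backend(#{pid})")
```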
Optionally, if your database is in bad enough shape that killing a single connection is not enough, you can run:
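(A drastic measure; rails-pg-extras ships a kill_all method for this scenario.)

```ruby
# Terminate all active connections to the database
RailsPgExtras.kill_all
```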
Remember to restart all the web and background worker processes to prevent ghost connections from hanging around.
As I’ve mentioned, these are not proper fixes, but rather “band-aids” that might help in case your app is unstable.
Summary
PostgreSQL database locks usually work seamlessly behind the scenes. But following the best practices mentioned above and configuring proper monitoring can save you a lot of headaches. A proactive approach to lock-related issues can help you resolve most problems before they start affecting your users.