Ensuring thread safety is critical if you want to build a performant Rails app. Unfortunately, threading-related bugs are often sneaky and only manifest in highly concurrent production environments. In this blog post, we’ll discuss code examples that are not thread-safe. I’ll also describe a toolkit for debugging and discuss possible solutions. Developing an eye for spotting these errors before shipping to production can save you a lot of headaches.
100% thread safety guarantee…
I’d risk saying that just by looking at a piece of Ruby code, you can never tell whether it hides multithreading-related bugs. A typical Ruby project is gems all the way down, with the usual excess of external dependencies. So even a simple + method could be monkey-patched to introduce a thread safety issue.
However, there’s one way to ensure that your Rails app is bulletproof against any concurrency bugs. But it comes with a terrible tradeoff. Let’s explain it with an example:
This endpoint renders an array of numbers based on a per_page value received as a param. We add an error path that, in theory, should never happen. You would not expect a variable to change its value a few lines after it was assigned.
sleep 0.1 is here to simulate blocking I/O. CRuby uses a Global Interpreter Lock (GIL), which works like a global mutex. It ensures that two Ruby threads never execute Ruby code in parallel. But blocking I/O means that the actual work is delegated to an external process, e.g., an SQL database or an external HTTP API. So while thread A is waiting for its blocking I/O to complete, thread B can take over. I’ve covered Ruby threading, GIL, and blocking I/O in much more depth in my other blog post, so please refer to it if you need a recap.
If you’re familiar with threading issues in Ruby, you’ll notice that we’re using a class variable. It is shared between all the threads in a process. So while thread A is sleeping, thread B will kick in and overwrite the @@per_page value, triggering the buggy path. As a result, a user would receive a different number of objects than requested. This is the most straightforward thread-safety bug I could think of. The point of analyzing it in detail is to describe a toolkit for tackling more complex cases.
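If you want to see the same race without booting Rails, here’s a minimal plain Ruby sketch. PerPageHandler and run_concurrent_requests are hypothetical names I made up for illustration; the class only imitates the controller action described above.

```ruby
# Plain Ruby stand-in for the controller; PerPageHandler is a hypothetical name.
class PerPageHandler
  @@per_page = nil # shared by every thread in the process

  def call(requested)
    @@per_page = requested
    sleep 0.1 # simulated blocking I/O; other threads run in the meantime
    # Another thread may have overwritten @@per_page while we slept.
    @@per_page == requested ? :ok : :mismatch
  end
end

# Fires several "requests" at once, each asking for a different per_page.
def run_concurrent_requests(thread_count = 5)
  handler = PerPageHandler.new
  threads = thread_count.times.map do |i|
    Thread.new { handler.call(10 + i) }
  end
  threads.map(&:value)
end

puts run_concurrent_requests.tally.inspect
```

With a single thread every call returns :ok. With several threads, most of them should report :mismatch, although the exact split depends on scheduling.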
I’ve performed all the tests with Rails 7 and Ruby 3.1.0 on a Puma server running locally in single mode, with both min and max threads set to 5, in the production environment.
Our goal is to trigger the "It should never happen!" error. We’ll need to simulate a highly concurrent environment to achieve that. You’re unlikely to hammer CMD+R fast enough, so instead, let’s use Siege, a load testing tool.
Now create a urls.txt file with the following contents:
Now you can start your first test by running this command:
You should get a similar result:
You can see that we’ve managed to trigger the error in over 50% of requests. You can confirm that the presence of multiple concurrent threads is causing the error by running:
Now you should get a 100% success rate.
So what about making our app bulletproof against threading bugs that I’ve mentioned before? You can do it by adding a line to
Don’t forget to restart your server after modifying config files.
Let’s rerun our test:
and you should expect a similar result:
So we’ve got a 100% success rate for the price of decreasing our throughput by over 50%. We’re down to 41 hits from 84, and the longest transaction now takes over 1 second instead of 0.26 seconds. To understand why adding Rack::Lock had this effect, let’s have a quick look at its source code:
You can see that this middleware wraps each incoming request in a mutex that’s initialized once per process. All the Puma worker threads share the same mutex. It means that two threads can never process requests at the same time, even while one of them is waiting on blocking I/O.
Rack::Lock throttles your app to only use one thread per process.
You can see that, while effective, Rack::Lock is not a viable solution for an app that relies on multiple threads for throughput.
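To get a feel for the cost of a process-wide mutex, here’s a rough plain Ruby sketch. SerializingMiddleware, SLOW_APP and elapsed_for are hypothetical names, and the middleware is a heavily simplified imitation of the idea, not the actual Rack::Lock implementation.

```ruby
# SerializingMiddleware is a hypothetical, simplified stand-in for Rack::Lock.
class SerializingMiddleware
  def initialize(app)
    @app = app
    @mutex = Mutex.new # created once, shared by every thread that calls #call
  end

  def call(env)
    # Only one thread at a time gets past this point.
    @mutex.synchronize { @app.call(env) }
  end
end

# A fake Rack-style app whose only work is simulated blocking I/O.
SLOW_APP = ->(_env) { sleep 0.05; [200, {}, ["ok"]] }

# Calls the app from several threads at once and returns the wall-clock time.
def elapsed_for(app, thread_count = 5)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  thread_count.times.map { Thread.new { app.call({}) } }.each(&:join)
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
end

unlocked = elapsed_for(SLOW_APP)
locked = elapsed_for(SerializingMiddleware.new(SLOW_APP))
puts format("without lock: %.2fs, with lock: %.2fs", unlocked, locked)
```

Without the lock, the five sleeps overlap, so the total is close to a single sleep. With the lock, they run one after another, so the total is roughly five times longer.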
We’ve verified that you need to simulate a concurrent environment to trigger even the simplest thread safety bug. Siege is a perfect tool to test your app locally if you’re hunting these weird “sometimes” bugs spotted in production.
SQL database and thread safety
Database interactions add a whole new dimension to the scope of possible threading bugs. Consider this example:
This endpoint counts the number of user interactions. You can test it with the following Siege command:
You should see a similar output:
You’d expect the counter user attribute value to increase by the same number. But (unless you still have Rack::Lock enabled), you’ll notice that it increased by only around one-third of the API hits counted.
Despite the lack of obvious blocking I/O (like sleep in the previous example), we’ve managed to trigger a thread safety bug. The database round trip is itself blocking I/O. Our threads fetch the current value of the counter attribute from the database and increment it concurrently, i.e., several of them read the same value and commit the same change multiple times. That’s why we lose some values.
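The same read-modify-write problem can be reproduced without a database. In this minimal sketch, FakeUserRow, hit_endpoint and lost_update_total are hypothetical names, and the in-memory object is only a stand-in for the user record.

```ruby
# FakeUserRow is a hypothetical in-memory stand-in for a database row.
class FakeUserRow
  attr_accessor :counter

  def initialize
    @counter = 0
  end
end

def hit_endpoint(row)
  current = row.counter     # read (the SELECT)
  sleep 0.01                # simulated round trip to the database
  row.counter = current + 1 # write back a value computed from a stale read
end

# Simulates concurrent requests and returns the final counter value.
def lost_update_total(requests = 10)
  row = FakeUserRow.new
  requests.times.map { Thread.new { hit_endpoint(row) } }.each(&:join)
  row.counter
end

puts "counter after 10 concurrent hits: #{lost_update_total}"
```

The final value should end up well below 10, because several threads read the same value before any of them writes.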
You can fix this issue by wrapping the read and update operations into a database transaction with a correct isolation level.
The details of SQL transaction isolation levels are out of the scope of this tutorial. But long story short, repeatable_read stops a transaction from silently overwriting a row that another transaction changed after the first one read it. In PostgreSQL, for example, the conflicting transaction is aborted with a serialization error instead of writing its stale value. Other databases implement this level differently, so check the behavior of yours. As a result, stale values are never written, and the value of the counter attribute is now equal to the number of successful HTTP API calls. Please keep in mind that using database transactions with high isolation levels can significantly reduce performance. Always measure the impact before applying this change to production bottlenecks.
Another solution to this particular example is to leverage an API method that’s thread-safe by design:
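Such methods work by making the read and the write a single indivisible step. ActiveRecord’s update_counters, for example, issues one UPDATE statement that increments the column inside the database. As a rough in-memory analogy (AtomicCounterStore and atomic_total are hypothetical names, and the mutex plays the role of the database’s row lock):

```ruby
# AtomicCounterStore is a hypothetical in-memory stand-in for the database.
class AtomicCounterStore
  def initialize
    @counter = 0
    @lock = Mutex.new
  end

  # Read and write happen as one indivisible step, similar in spirit to
  # UPDATE users SET counter = counter + 1.
  def increment!
    @lock.synchronize do
      current = @counter
      sleep 0.001 # even with a delay between read and write, no update is lost
      @counter = current + 1
    end
  end

  def counter
    @lock.synchronize { @counter }
  end
end

# Simulates concurrent requests and returns the final counter value.
def atomic_total(requests = 10)
  store = AtomicCounterStore.new
  requests.times.map { Thread.new { store.increment! } }.each(&:join)
  store.counter
end

puts atomic_total # => 10
```

Callers never see the intermediate value, so there is nothing stale to write back.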
Finding an API method immune to threading bugs is not always possible. But searching the docs of an unfamiliar library for keywords like “thread-safe” or “concurrent” can sometimes save you from unnecessarily convoluted solutions. For example, the concurrent-ruby gem is an awesome collection of thread-safe programming primitives. So, if you find yourself juggling mutexes to patch a threading bug, getting familiar with the tools offered by this gem could be a lifesaver. BTW I’m currently working on a blog post about it, so please subscribe if you want to be notified when it’s out.
It is a commonly repeated mantra that “Globals are bad for thread safety!”. I’d say it all comes down to this tradeoff: the less global a variable, the more cumbersome it is to pass it around between scopes. I recommend this screencast by DHH where he discusses potential use cases for globals “when the price is right”.
There are no simple answers to where and how to use globals. So instead, let me describe a technique for making global values safe in multithreaded environments.
Ruby offers built-in support for so-called thread-local variables. Each thread can work as a kind of hash for storing values that are accessible globally in the app, but only from this single thread.
The best way to explain it is by running the following code example:
These “thread-local globals” could sometimes be helpful in passing data across the application.
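Here’s a minimal sketch of that isolation. The :request_id key and the thread_local_demo method are hypothetical names chosen for illustration. Strictly speaking, Thread.current[:key] is fiber-local, while thread_variable_set and thread_variable_get are the truly thread-local variants, but for this example the difference does not matter.

```ruby
def thread_local_demo
  Thread.current[:request_id] = "main-thread"

  worker = Thread.new do
    before = Thread.current[:request_id] # nothing has been set in this thread yet
    Thread.current[:request_id] = "worker-thread"
    [before, Thread.current[:request_id]]
  end
  before, after = worker.value # waits for the worker to finish

  {
    worker_before: before,            # nil
    worker_after: after,              # "worker-thread"
    main: Thread.current[:request_id] # still "main-thread"
  }
end

p thread_local_demo
```

The value set in the main thread is invisible to the worker, and the worker’s assignment does not touch the main thread’s value.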
However, there’s one critical issue when using barebones Thread.current for storing values. Puma keeps a pool of threads and reuses them across requests. It means that unless you’re careful, you will leak data between requests. You can confirm this behavior by implementing the following endpoint:
We use the @@assignment_lock class variable, a mutex that’s global per process, to prevent a so-called “Time-of-check to time-of-use” error. In a highly concurrent environment, it could theoretically happen that thread A checks the value of the $assigned global variable, then thread B kicks in and sees that it’s still nil before thread A sets it to true. This scenario would cause two different threads to enter the if condition, and we want to prevent it.
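As a sketch of why the lock matters, here’s the check-then-set pattern in plain Ruby. OneTimeAssigner and winners_count are hypothetical names. The check and the assignment sit inside the same synchronize block, so no other thread can slip in between them.

```ruby
class OneTimeAssigner
  def initialize
    @assigned = nil
    @lock = Mutex.new
  end

  # Returns true only for the single caller that performs the assignment.
  def claim
    @lock.synchronize do
      return false if @assigned # time of check
      sleep 0.01                # widen the window a race would need
      @assigned = true          # time of use
      true
    end
  end
end

# Counts how many of the concurrent callers managed to claim the assignment.
def winners_count(thread_count = 10)
  assigner = OneTimeAssigner.new
  thread_count.times.map { Thread.new { assigner.claim } }.map(&:value).count(true)
end

puts winners_count # => 1
```

If you remove the synchronize call, several threads can pass the check before the first one sets the flag.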
The first response from this endpoint will display the text assigned. When you request it a few more times, you’ll see that the same text is returned again every few hits, seemingly at random. It means that the initial thread is being reused. Overlooking this feature of Thread.current could introduce critical security bugs because sensitive data could be shared between different user sessions.
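You can reproduce the leak without Puma. In this minimal sketch, run_jobs_on_reused_thread is a hypothetical helper that imitates a pool with a single worker thread, and each lambda stands in for one request.

```ruby
# Runs every job on the same long-lived thread, loosely imitating a thread pool.
def run_jobs_on_reused_thread(jobs)
  queue = Queue.new
  results = Queue.new
  worker = Thread.new do
    while (job = queue.pop) != :stop
      results << job.call
    end
  end
  jobs.each { |job| queue << job }
  queue << :stop
  worker.join
  Array.new(jobs.size) { results.pop }
end

def leaked_value
  first_request = -> { Thread.current[:message] = "assigned"; :done }
  second_request = -> { Thread.current[:message] } # never sets anything itself
  run_jobs_on_reused_thread([first_request, second_request]).last
end

puts leaked_value.inspect # => "assigned"
```

The second request never assigned anything, yet it can read what the first one left behind.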
Rails’ built-in ActiveSupport::CurrentAttributes class serves the same purpose with much better security guarantees. Let’s reimplement our example:
and the endpoint:
You’ll now see that the text assigned is only displayed once, on the first request. Each subsequent request gets a fresh copy of the Current object, so there’s no more risk of a data leak.
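The reason this works is that CurrentAttributes resets itself automatically before and after each request. Here’s a rough plain Ruby analogy of that reset. RequestState and run_jobs_with_reset are hypothetical names, and this is not how Rails implements it internally.

```ruby
# RequestState is a hypothetical, heavily simplified analog of a Current class.
module RequestState
  KEY = :request_state

  def self.message=(value)
    (Thread.current[KEY] ||= {})[:message] = value
  end

  def self.message
    (Thread.current[KEY] || {})[:message]
  end

  def self.reset
    Thread.current[KEY] = nil
  end
end

# Runs every job on the same thread, resetting the state around each one.
def run_jobs_with_reset(jobs)
  Thread.new do
    jobs.map do |job|
      RequestState.reset
      begin
        job.call
      ensure
        RequestState.reset # runs even if the job raises
      end
    end
  end.value
end

def messages_seen_by_requests
  first = -> { RequestState.message = "assigned"; RequestState.message }
  second = -> { RequestState.message } # same thread, but fresh state
  run_jobs_with_reset([first, second])
end

p messages_seen_by_requests # => ["assigned", nil]
```

The thread is still reused, but the second request starts with empty state.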
Thread safety in Rails is a topic for a hefty eBook instead of a single blog post. But I hope that the above info covers the basics that will help you spot and debug potential issues in your codebase. Let me know in the comments if you know more interesting examples of thread unsafe code in Ruby so that I can include them in this post.