
Ensuring thread safety is critical if you want to build a performant Rails app. Unfortunately, threading-related bugs are often sneaky and only manifest in highly concurrent production environments. In this blog post, we’ll discuss code examples that are not thread-safe. I’ll also describe a toolkit for debugging and discuss possible solutions. Developing an eye for spotting these errors before shipping to production can save you a lot of headaches.
100% thread safety guarantee…
I’d risk saying that you can never tell just by looking at a piece of Ruby code whether it hides a multithreading-related bug. A typical Ruby project is gems all the way down, with the usual excess of external dependencies. So even a simple + method could be monkey-patched to introduce a thread safety issue.
However, there’s one way to ensure that your Rails app is bulletproof against any concurrency bugs. But it comes with a terrible tradeoff. Let’s explain it with an example:
app/controllers/numbers_controller.rb
class NumbersController < ApplicationController
  def index
    @@per_page = params[:per_page]
    sleep 0.1
    if @@per_page == params[:per_page]
      render json: [1,2,3,4,5].first(@@per_page.to_i)
    else
      render plain: "It should never happen!", status: 500
    end
  end
end
This endpoint renders an array of numbers based on a per_page value received as a param. We add an error path that, in theory, should never happen. You don’t expect a variable to change its value a few lines after it was assigned.
The sleep 0.1 call is here to simulate blocking IO. Ruby (the default CRuby implementation) uses a Global Interpreter Lock (GIL), which works like a global mutex. It ensures that two threads can never execute Ruby code in parallel. But during blocking IO the actual work happens outside the Ruby process, e.g., in a SQL database or a remote HTTP API, and the waiting thread releases the lock. So while thread A is waiting for its blocking IO to complete, thread B can take over. I’ve covered Ruby threading, GIL, and blocking I/O in much more depth in my other blog post, so please refer to it if you need a recap.
If you’re familiar with threading issues in Ruby, you’ll notice that we’re using a class variable. It is shared between all the threads in a process. So while thread A is sleeping, thread B will kick in and overwrite the @@per_page value, sending thread A down the buggy path. As a result, a user would receive a different number of objects than requested. This is the most straightforward thread-safety bug I could think of. The point of analyzing it in detail is to describe a toolkit for tackling more complex cases.
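The race is easy to reproduce outside of Rails. The following is a minimal sketch in plain Ruby, not the controller itself: UnsafeHandler, SafeHandler, and simulate are hypothetical names made up for this illustration, and sleep again plays the role of blocking IO. The only difference between the two handlers is that the safe one keeps the value in a local variable, which is private to each call.

```ruby
# Hypothetical stand-ins for the controller action, runnable without Rails.
class UnsafeHandler
  @@per_page = nil # shared by every thread in the process

  def call(per_page)
    @@per_page = per_page
    sleep 0.1 # simulated blocking IO; other threads run in the meantime
    @@per_page == per_page ? :ok : :mismatch
  end
end

class SafeHandler
  def call(per_page)
    requested = per_page # local variable, private to this call
    sleep 0.1
    requested == per_page ? :ok : :mismatch
  end
end

# Runs five concurrent "requests", each with a different per_page value.
def simulate(handler_class)
  threads = (1..5).map do |per_page|
    Thread.new { handler_class.new.call(per_page) }
  end
  threads.map(&:value) # Thread#value joins and returns the block's result
end

puts "unsafe: #{simulate(UnsafeHandler).tally}"
puts "safe:   #{simulate(SafeHandler).tally}"
```

With the class variable, only the thread that happened to write last is likely to see its own value after waking up. With the local variable, every call reports :ok.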
I’ve performed all the tests with Rails 7 and Ruby 3.1.0 on a Puma server running locally in single mode, with both min and max threads set to 5, in the production environment.
Our goal is to trigger the "It should never happen!" error. We’ll need to simulate a highly concurrent environment to achieve that. You’re unlikely to hammer CMD+R fast enough, so instead, let’s use Siege, a load testing tool.
brew install siege
Now create a urls.txt file with the following contents:
http://localhost:3000/numbers?per_page=1
http://localhost:3000/numbers?per_page=2
http://localhost:3000/numbers?per_page=3
Now you can start your first test by running this command:
siege --time=5s --concurrent=10 -f urls.txt -i
You should get a similar result:
Lifting the server siege...
Transactions: 84 hits
Availability: 45.65 %
Elapsed time: 4.30 secs
Data transferred: 0.00 MB
Response time: 0.49 secs
Transaction rate: 19.53 trans/sec
Throughput: 0.00 MB/sec
Concurrency: 9.65
Successful transactions: 84
Failed transactions: 100
Longest transaction: 0.26
Shortest transaction: 0.14
You can see that we’ve managed to trigger the error in over 50% of requests (100 failed out of 184). You can confirm that concurrent threads are causing the error by running:
siege --time=5s --concurrent=1 -f urls.txt -i
Now you should get a 100% success rate.
So what about the way to make our app bulletproof against threading bugs that I mentioned before? You can do it by adding a line to config/environments/production.rb and config/environments/development.rb:
config.middleware.insert_before 0, Rack::Lock
Don’t forget to restart your server after modifying config files.
Let’s rerun our test:
siege --time=5s --concurrent=10 -f urls.txt -i
You should see a similar result:
Lifting the server siege...
Transactions: 41 hits
Availability: 100.00 %
Elapsed time: 4.48 secs
Data transferred: 0.00 MB
Response time: 0.97 secs
Transaction rate: 9.15 trans/sec
Throughput: 0.00 MB/sec
Concurrency: 8.86
Successful transactions: 41
Failed transactions: 0
Longest transaction: 1.11
Shortest transaction: 0.15
So we’ve got a 100% success rate for the price of decreasing our throughput by over 50%. We’re down to 41 hits from 84, and the longest transaction now takes over 1 second instead of 0.26 seconds. To understand why adding Rack::Lock had this effect, let’s have a quick look at its source code:
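def initialize(app, mutex = Mutex.new)
  @app, @mutex = app, mutex
end

def call(env)
  @mutex.lock # only one request at a time gets past this line
  begin
    response = @app.call(env)
    # Release the lock only once the response body has been closed.
    returned = response << BodyProxy.new(response.pop) { unlock }
  ensure
    # If anything above raised, release the lock right away.
    unlock unless returned
  end
end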
You can see that this middleware wraps every incoming request in a mutex that’s initialized once per process. All the Puma worker threads share the same mutex. It means that two threads can never process requests simultaneously, not even while one of them waits on blocking I/O. Rack::Lock effectively throttles your app to one active thread per process.
You can see that, while effective, Rack::Lock is not a viable solution.
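To get a feel for the cost of one process-wide lock, here is a small plain-Ruby sketch. It does not use Rack at all: serve_all is a hypothetical helper, and each simulated request is just a sleep 0.1 standing in for blocking IO. When a Mutex is passed in, every request holds it for its whole duration, which is roughly what Rack::Lock does around @app.call.

```ruby
# Runs `count` simulated requests concurrently and returns the elapsed seconds.
# When a lock is given, each request holds it from start to finish.
def serve_all(count, lock: nil)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  threads = Array.new(count) do
    Thread.new do
      if lock
        lock.synchronize { sleep 0.1 } # blocking IO while holding the lock
      else
        sleep 0.1 # blocking IO; other threads are free to run
      end
    end
  end
  threads.each(&:join)
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
end

puts format("without lock: %.2fs", serve_all(5))
puts format("with lock:    %.2fs", serve_all(5, lock: Mutex.new))
```

Without the lock, the five waits overlap and the whole batch takes about as long as a single request. With the lock, the waits happen one after another, so the total time grows with the number of requests.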
We’ve verified that you need to simulate a concurrent environment to trigger even the simplest thread safety bug. Siege is a perfect tool to test your app locally if you’re hunting these weird “sometimes” bugs spotted in production.
SQL database and thread safety
Database interactions add a whole new dimension to the scope of possible threading bugs. Consider this example:
class NumbersController < ApplicationController
  def counter
    current_user.update!(
      counter: current_user.counter + 1
    )
  end

  def current_user
    @current_user ||= User.find(session[:user_id])
  end
end
This endpoint counts the number of user interactions. You can test it with the following Siege command:
siege --time=5s --concurrent=10 http://localhost:3000/counter
You should see a similar output:
Transactions: 341 hits
Availability: 100.00 %
You’d expect the value of the user’s counter attribute to increase by the same number. But (unless you still have Rack::Lock enabled), you’ll notice that it increased by only around one-third of the API hits counted.
Despite the lack of obvious blocking IO (like sleep in the previous example), we’ve managed to trigger a thread safety bug. The database queries themselves are blocking IO. Each thread fetches the current value of the counter attribute from the database, increments it in Ruby, and writes the result back. When several threads read the same value before any of them writes, they all commit the same new value. That’s why we lose some increments.
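The lost update can be imitated without a database. In the sketch below, FakeRow is a hypothetical in-memory stand-in for the user row, and the short sleep calls model the round trip to the database. It compares the read-then-write pattern from the endpoint with an increment performed in a single step inside the "database".

```ruby
# A hypothetical in-memory stand-in for one row of the users table.
class FakeRow
  def initialize
    @counter = 0
    @lock = Mutex.new
  end

  # Simulated SELECT: the sleep models the network round trip.
  def read
    sleep 0.01
    @counter
  end

  # Simulated UPDATE that stores a value computed earlier in Ruby.
  def write(value)
    sleep 0.01
    @counter = value
  end

  # Simulated in-place increment, performed as one indivisible step.
  def increment_in_place
    @lock.synchronize { @counter += 1 }
  end

  def value
    @counter
  end
end

# Runs the given block once in each of `threads` concurrent threads
# and returns the final counter value.
def hammer(row, threads: 10)
  Array.new(threads) { Thread.new { yield row } }.each(&:join)
  row.value
end

racy   = hammer(FakeRow.new) { |row| row.write(row.read + 1) }
atomic = hammer(FakeRow.new) { |row| row.increment_in_place }

puts "read-then-write: #{racy} after 10 requests"
puts "in-place:        #{atomic} after 10 requests"
```

The read-then-write variant ends up well below 10 because most threads read the same stale value. The in-place variant always reaches 10.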
You can fix this issue by wrapping the read and update operations into a database transaction with a correct isolation level.
def counter
  User.transaction(isolation: :repeatable_read) do
    current_user.update!(
      counter: current_user.counter + 1
    )
  end
end
The details of SQL transaction isolation levels are out of scope for this tutorial, and the exact behavior depends on your database. But long story short, on PostgreSQL, for example, a repeatable_read transaction is not allowed to overwrite a row that another transaction changed after this one took its snapshot. The conflicting update fails with a serialization error (ActiveRecord::SerializationFailure) instead of silently overwriting the newer value, so the request fails unless you retry it. As a result, no increments are lost, and the value of the counter attribute is now equal to the number of successful HTTP API calls. Please keep in mind that transactions with high isolation levels can significantly reduce performance. Always measure the impact before applying this change to production bottlenecks.
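Depending on the database, a conflicting update inside such a transaction may be rejected rather than queued (PostgreSQL, for example, raises a serialization error). The plain-Ruby sketch below imitates that idea with a version number. VersionedRow, ConflictError, and increment_with_retry are hypothetical names, and the retry loop is an addition of this sketch; the endpoint above does not retry and would simply return an error for the rejected request.

```ruby
# Hypothetical stand-in for the error a database raises on a write conflict.
class ConflictError < StandardError; end

# In-memory row that rejects a write based on a stale read.
class VersionedRow
  def initialize
    @value = 0
    @version = 0
    @lock = Mutex.new
  end

  # Returns the current value together with the version it belongs to.
  def read
    @lock.synchronize { [@value, @version] }
  end

  # Stores new_value only if nobody wrote since expected_version was read.
  def write(new_value, expected_version)
    @lock.synchronize do
      raise ConflictError, "row changed since it was read" if @version != expected_version

      @value = new_value
      @version += 1
    end
  end
end

# Reads, increments in Ruby, and writes back; starts over after a conflict.
# Returns the number of attempts this "request" needed.
def increment_with_retry(row)
  attempts = 0
  begin
    attempts += 1
    value, version = row.read
    sleep 0.001 # widen the window between read and write
    row.write(value + 1, version)
  rescue ConflictError
    retry
  end
  attempts
end

# Runs `count` concurrent "requests" and returns their attempt counts.
def run_requests(row, count: 10)
  Array.new(count) { Thread.new { increment_with_retry(row) } }.map(&:value)
end

row = VersionedRow.new
attempts = run_requests(row)
final_value, _version = row.read
puts "counter: #{final_value}, attempts per request: #{attempts.inspect}"
```

No increment is lost, but most requests need more than one attempt. That extra work is one reason to measure before relying on this approach for a hot code path.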
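Another solution to this particular example is to leverage an API method that’s thread-safe by design. In current Rails versions, increment! issues a single SQL UPDATE that adds 1 to the column inside the database, so there is no separate read step in Ruby that could go stale: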
def counter
  current_user.increment!(:counter)
end
Finding an API method immune to threading bugs is not always possible. But searching the docs of an unfamiliar library for keywords like “thread-safe” or “concurrent” can sometimes save you from unnecessarily convoluted solutions. For example, the concurrent-ruby gem is an awesome collection of thread-safe programming primitives. So, if you find yourself juggling mutexes to patch a threading bug, getting familiar with the tools offered by this gem could be a lifesaver. BTW I’m currently working on a blog post about it, so please subscribe if you want to be notified when it’s out.
Bad globals
It is a commonly repeated mantra that “Globals are bad for thread safety!”. I’d say it all comes down to this tradeoff: the less global a variable, the more cumbersome it is to pass it around between scopes. I recommend this screencast by DHH where he discusses potential use cases for globals “when the price is right”.
There are no simple answers to where and how to use globals. So instead, let me describe a technique for making global values safe in multithreaded environments.
Thread-safe globals
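Ruby offers built-in support for so-called thread-local variables. Each thread can work as a kind of hash for storing values that are accessible globally in the app but only from this single thread. (Strictly speaking, values stored with Thread.current[...] are fiber-local; they behave like thread-locals as long as each thread runs a single fiber.)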
The best way to explain it is by running the following code example:
Thread.current[:value] = "parent"
child = Thread.new do
  puts "Initial value in child: #{Thread.current[:value].inspect}"
  Thread.current[:value] = "child"
end
child.join
puts "Value in main: #{Thread.current[:value]}"
puts "Value in child: #{child[:value]}"
# Output:
# Initial value in child: nil
# Value in main: parent
# Value in child: child
These “thread-local globals” could sometimes be helpful in passing data across the application.
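One detail worth knowing: the Ruby documentation describes Thread#[] and Thread#[]= as fiber-local rather than strictly thread-local, and offers thread_variable_get and thread_variable_set for values that should be shared by all fibers of a thread. The sketch below shows the difference. The helper name visible_in_new_fiber is made up for this illustration.

```ruby
# Sets one value of each kind, then reports what a fresh Fiber
# running on the same thread is able to see.
def visible_in_new_fiber
  Thread.current[:value] = "fiber-local"                       # Thread#[]=
  Thread.current.thread_variable_set(:value, "thread-local")   # real thread-local

  Fiber.new do
    {
      via_brackets: Thread.current[:value],
      via_thread_variable: Thread.current.thread_variable_get(:value)
    }
  end.resume
ensure
  # Clean up so the example leaves no state behind on this thread.
  Thread.current[:value] = nil
  Thread.current.thread_variable_set(:value, nil)
end

p visible_in_new_fiber
```

Inside the new fiber, the value stored with the brackets is nil, while the one stored with thread_variable_set is still visible. This rarely matters on a thread-based server, but it can matter if your code or your server uses fibers.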
However, there’s one critical issue when using bare-bones Thread.current for storing values. Puma serves requests from a pool of threads and reuses them between requests. It means that unless you’re careful, you will leak data between requests. You can confirm this behavior by implementing the following endpoint:
class ExampleController < ApplicationController
  @@assignment_lock = Mutex.new

  def index
    @@assignment_lock.synchronize do
      if $assigned.nil?
        Thread.current[:value] = "assigned"
        $assigned = true
      end
    end

    render plain: Thread.current[:value].to_s
  end
end
We use the @@assignment_lock class variable, a mutex that’s global per process, to prevent a so-called “time-of-check to time-of-use” error. In a highly concurrent environment, thread A could check the value of the $assigned global variable, and then thread B could kick in and see that it’s still nil before thread A sets it to true. This scenario would cause two different threads to enter the if condition, and we want to prevent it.
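Here is a small plain-Ruby sketch of that check-then-set race, with and without a mutex. The count_winners helper is hypothetical, and the sleep between the check and the assignment is there only to make the race show up reliably.

```ruby
# Lets `count` threads race through a check-then-set and returns how many
# of them got past the check. Pass a Mutex to guard the whole sequence.
def count_winners(count: 10, lock: nil)
  assigned = nil
  winners = Queue.new # thread-safe collector

  check_then_set = lambda do
    if assigned.nil?  # time of check
      sleep 0.01      # widen the gap between the check and the assignment
      assigned = true # time of use
      winners << Thread.current
    end
  end

  threads = Array.new(count) do
    Thread.new do
      if lock
        lock.synchronize { check_then_set.call }
      else
        check_then_set.call
      end
    end
  end
  threads.each(&:join)
  winners.size
end

puts "without mutex: #{count_winners} threads entered the if"
puts "with mutex:    #{count_winners(lock: Mutex.new)} thread entered the if"
```

Without the mutex, several threads see nil and enter the branch. With the mutex around both the check and the assignment, exactly one does.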
The first response from this endpoint will display the text assigned. When you request it a few more times, you’ll see the same text returned again every few hits, seemingly at random. It means that the thread that handled the first request is being reused. Overlooking this behavior of Thread.current could introduce critical security bugs because sensitive data could leak between different user sessions.
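The leak does not need a web server to show up. The sketch below runs two "requests" one after another on a single reused thread, the way a thread pool would. run_on_one_worker and the :user key are made up for this illustration.

```ruby
# Runs the given jobs one after another on a single reused worker thread,
# the way a thread pool handles consecutive requests.
def run_on_one_worker(jobs)
  queue = Queue.new
  results = []
  worker = Thread.new do
    while (job = queue.pop) # nil is the stop signal
      results << job.call
    end
  end
  jobs.each { |job| queue << job }
  queue << nil
  worker.join
  results
end

# The first "request" stores a value; the second one never sets anything.
first_request  = -> { Thread.current[:user] = "alice"; "hello alice" }
second_request = -> { "current user: #{Thread.current[:user].inspect}" }

puts run_on_one_worker([first_request, second_request])
```

The second request reports the user stored by the first one, because nothing cleared Thread.current in between.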
The built-in Rails ActiveSupport::CurrentAttributes class serves the same purpose with much better safety guarantees. Let’s reimplement our example:
app/models/current.rb
class Current < ActiveSupport::CurrentAttributes
  attribute :value
end
and the endpoint:
class ExampleController < ApplicationController
  @@assignment_lock = Mutex.new

  def index
    @@assignment_lock.synchronize do
      if $assigned.nil?
        Current.value = "assigned"
        $assigned = true
      end
    end

    render plain: Current.value.to_s
  end
end
You’ll now see that the text assigned is displayed only once, on the first request. Rails resets the Current attributes before and after each request, so every subsequent request starts with a clean Current object and there’s no more risk of a data leak.
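The idea of resetting state around every request can be sketched in a few lines of plain Ruby. This is not how Rails implements CurrentAttributes. PerRequest and handle_on_one_thread are hypothetical names, and the module is a heavily simplified illustration of the reset-around-each-request idea only.

```ruby
# Hypothetical, heavily simplified request-scoped store.
module PerRequest
  KEY = :per_request_store

  def self.[](name)
    (Thread.current[KEY] ||= {})[name]
  end

  def self.[]=(name, value)
    (Thread.current[KEY] ||= {})[name] = value
  end

  # Clears the store before and after the block, even if the block raises.
  def self.around_request
    Thread.current[KEY] = nil
    yield
  ensure
    Thread.current[KEY] = nil
  end
end

# Handles the given "requests" one after another on a single reused thread.
def handle_on_one_thread(requests)
  Thread.new do
    requests.map { |request| PerRequest.around_request { request.call } }
  end.value
end

first_request  = -> { PerRequest[:user] = "alice"; PerRequest[:user] }
second_request = -> { PerRequest[:user] }

p handle_on_one_thread([first_request, second_request])
```

The first request sees its own value, and the second one sees nil even though it runs on the same thread, because the store is wiped between them.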
Summary
Thread safety in Rails is a topic for a hefty eBook rather than a single blog post. But I hope that the above info covers the basics that will help you spot and debug potential issues in your codebase. Let me know in the comments if you know more interesting examples of thread-unsafe code in Ruby so that I can include them in this post.