Intro to Thread-Safety in Ruby on Rails

Photo by Stephane Gagnon on Unsplash

Ensuring thread safety is critical if you want to build a performant Rails app. Unfortunately, threading-related bugs are often sneaky and only manifest in highly concurrent production environments. In this blog post, we’ll discuss code examples that are not thread-safe. I’ll also describe a toolkit for debugging and discuss possible solutions. Developing an eye for spotting these errors before shipping to production can save you a lot of headaches.

100% thread safety guarantee…

I’d risk saying that just by looking at a piece of Ruby code, you can never be sure that it doesn’t hide a multithreading-related bug. A typical Ruby project is gems all the way down, with the usual excess of external dependencies. So even a simple + method could be monkey-patched to introduce a thread-safety issue.

However, there’s one way to ensure that your Rails app is bulletproof against any concurrency bugs. But it comes with a terrible tradeoff. Let’s explain it with an example:

app/controllers/numbers_controller.rb

class NumbersController < ApplicationController
  def index
    @@per_page = params[:per_page]

    sleep 0.1

    if @@per_page == params[:per_page]
      render json: [1,2,3,4,5].first(@@per_page.to_i)
    else
      render plain: "It should never happen!", status: 500
    end
  end
end

This endpoint renders an array of numbers based on a per_page value received as a param. We add an error path that, in theory, should never happen. You don’t expect a variable to change its value when assigned a few lines before.

A sleep 0.1 is here to simulate blocking I/O. CRuby uses a Global Interpreter Lock (GIL), which works like a global mutex: it ensures that two threads never execute Ruby code in parallel. But blocking I/O means that the thread is waiting on an external system, e.g., an SQL database or a remote HTTP API, and it releases the lock in the meantime. So while thread A is waiting for its blocking I/O to complete, thread B can take over. I’ve covered Ruby threading, GIL, and blocking I/O in much more depth in my other blog post, so please refer to it if you need a recap.

If you’re familiar with threading issues in Ruby, you’ll notice that we’re using a class variable. It is shared between all the threads in a process. So while thread A is sleeping, thread B will kick in and overwrite the @@per_page value causing the buggy path. As a result, a user would receive a different number of objects than requested. This is the most straightforward thread-safety bug I could think of. The point of analyzing it in detail is to describe a toolkit for tackling more complex cases.
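The interleaving described above can be reproduced without Rails or a load-testing tool. Here's a minimal sketch, not the controller itself: PageSize and its handle method are hypothetical stand-ins, and two queues replace sleep so that the threads always interleave in the problematic order.

```ruby
# Hypothetical stand-in for the controller action; not Rails code.
class PageSize
  @@per_page = nil

  # Stores the value in a class variable, yields (this is where the
  # blocking I/O would happen), then checks whether the value survived.
  def self.handle(per_page)
    @@per_page = per_page
    yield
    @@per_page == per_page ? :ok : :overwritten
  end
end

def run_race
  a_assigned = Queue.new
  b_assigned = Queue.new

  thread_a = Thread.new do
    PageSize.handle(1) do
      a_assigned << true # tell B that A has written its value
      b_assigned.pop     # "blocking I/O": wait until B has written too
    end
  end

  thread_b = Thread.new do
    a_assigned.pop       # start only after A has assigned
    PageSize.handle(2) { b_assigned << true }
  end

  [thread_a.value, thread_b.value]
end

p run_race # => [:overwritten, :ok]
```

Thread A finds a value it never assigned, which is exactly the "It should never happen!" branch. Replacing the class variable with a local variable inside the method would make both threads return :ok.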

I’ve performed all the tests with Rails 7 and Ruby 3.1.0 on a Puma server running locally in single mode, with min and max threads both set to 5, in the production environment.

Our goal is to trigger the "It should never happen!" error. We’ll need to simulate a highly concurrent environment to achieve that. You’re unlikely to hammer CMD+R fast enough, so instead, let’s use Siege, a load testing tool.

brew install siege

Now create a urls.txt file with the following contents:

http://localhost:3000/numbers?per_page=1
http://localhost:3000/numbers?per_page=2
http://localhost:3000/numbers?per_page=3

and now you can start your first test by running this command:

siege --time=5s --concurrent=10 -f urls.txt -i

you should get a similar result:

Lifting the server siege...
Transactions:               84 hits
Availability:               45.65 %
Elapsed time:               4.30 secs
Data transferred:           0.00 MB
Response time:              0.49 secs
Transaction rate:           19.53 trans/sec
Throughput:                 0.00 MB/sec
Concurrency:                9.65
Successful transactions:    84
Failed transactions:        100
Longest transaction:        0.26
Shortest transaction:       0.14

You can see that we’ve managed to trigger the error in over 50% of requests. You can confirm that the presence of multiple concurrent threads is causing the error by running:

siege --time=5s --concurrent=1 -f urls.txt -i

and now you should get a 100% success rate.

So what about making our app bulletproof against threading bugs that I’ve mentioned before? You can do it by adding a line to config/environments/production.rb and config/environments/development.rb:

config.middleware.insert_before 0, Rack::Lock

Don’t forget to restart your server after modifying config files.

Let’s rerun our test:

siege --time=5s --concurrent=10 -f urls.txt -i

and you should expect a similar result:

Lifting the server siege...
Transactions:               41 hits
Availability:               100.00 %
Elapsed time:               4.48 secs
Data transferred:           0.00 MB
Response time:              0.97 secs
Transaction rate:           9.15 trans/sec
Throughput:                 0.00 MB/sec
Concurrency:                8.86
Successful transactions:    41
Failed transactions:        0
Longest transaction:        1.11
Shortest transaction:       0.15

So we’ve got a 100% success rate at the price of decreasing our throughput by over 50%. We’re down to 41 hits from 84, and the longest transaction is now over 1 second instead of 0.26 seconds. To understand why adding Rack::Lock had this effect, let’s have a quick look at its source code:

def initialize(app, mutex = Mutex.new)
  @app, @mutex = app, mutex
end

def call(env)
  @mutex.lock
  begin
    response = @app.call(env)
    returned = response << BodyProxy.new(response.pop) { unlock }
  ensure
    unlock unless returned
  end
end

You can see that this middleware wraps each incoming request in a mutex that’s initialized once per process. All the Puma worker threads share the same mutex. It means that two threads can never process requests simultaneously, even while one of them is waiting on blocking I/O. Rack::Lock effectively throttles your app to one thread per process.
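To get a feel for the cost of a shared mutex, here's a minimal plain-Ruby sketch. It is a simplification: it wraps a sleep in Mutex#synchronize instead of using the lock/unlock pair from the middleware, and the elapsed_for method is made up for this example.

```ruby
# Runs `count` simulated requests, each "blocked on I/O" for `io_time`
# seconds, and returns the total elapsed time in seconds.
def elapsed_for(count:, io_time:, lock: nil)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)

  threads = Array.new(count) do
    Thread.new do
      if lock
        lock.synchronize { sleep io_time } # one request at a time
      else
        sleep io_time                      # requests overlap while waiting
      end
    end
  end
  threads.each(&:join)

  Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
end

without_lock = elapsed_for(count: 5, io_time: 0.05)
with_lock    = elapsed_for(count: 5, io_time: 0.05, lock: Mutex.new)

puts format("without lock: %.2fs, with lock: %.2fs", without_lock, with_lock)
```

Without the lock, the five waits overlap and the total stays close to a single io_time. With the lock, they add up to roughly five times that.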


You can see that, while effective, Rack::Lock is not a viable solution.

We’ve verified that you need to simulate a concurrent environment to trigger even the simplest thread-safety bug. Siege is a perfect tool to test your app locally if you’re hunting these weird “sometimes” bugs spotted in production.

SQL database and thread safety

Database interactions add a whole new dimension to the scope of possible threading bugs. Consider this example:

class NumbersController < ApplicationController
  def counter
    current_user.update!(
      counter: current_user.counter + 1
    )
  end

  def current_user
    @current_user ||= User.find(session[:user_id])
  end
end

This endpoint counts the number of user interactions. You can test it with the following Siege command:

siege --time=5s --concurrent=10 http://localhost:3000/counter

You should see a similar output:

Transactions:         341 hits
Availability:         100.00 %

You’d expect the counter attribute of the user to increase by the same number. But (unless you still have Rack::Lock enabled) you’ll notice that it increased by only around one-third of the hits counted.

Despite the lack of obvious blocking I/O (like sleep in the previous example), we’ve managed to trigger a thread-safety bug. The database queries themselves are blocking I/O, so several threads can fetch the same value of the counter attribute, increment it, and write back the same result. Each such overlap commits the same change more than once, and that’s why we lose some increments.
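This kind of lost update can be replayed deterministically in plain Ruby. The sketch below is only an illustration: FakeUser is a hypothetical in-memory stand-in for the database row, and the queues force every "request" to read before any of them writes.

```ruby
# Hypothetical in-memory stand-in for the user's database row.
FakeUser = Struct.new(:counter)

# Every "request" reads the counter, waits until all other requests have
# read it too, and only then writes back its own stale value plus one.
def concurrent_read_then_write(user, requests:)
  has_read = Queue.new
  may_write = Queue.new

  threads = Array.new(requests) do
    Thread.new do
      stale = user.counter     # like the SELECT done by User.find
      has_read << true
      may_write.pop
      user.counter = stale + 1 # like the UPDATE done by update!
    end
  end

  requests.times { has_read.pop }      # wait until every request has read
  requests.times { may_write << true } # then let all of them write
  threads.each(&:join)
  user.counter
end

user = FakeUser.new(0)
puts concurrent_read_then_write(user, requests: 3) # => 1, not 3
```

Three requests were handled, but all of them wrote the same value, so the counter moved by just one.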

You can fix this issue by wrapping the read and update operations into a database transaction with a correct isolation level.

def counter
  User.transaction(isolation: :repeatable_read) do
    current_user.update!(
      counter: current_user.counter + 1
    )
  end
end

The details of SQL transaction isolation levels are out of scope for this tutorial, and the exact behavior depends on your database. In PostgreSQL, for example, repeatable_read does not block concurrent reads; instead, a transaction that tries to update a row changed by another transaction after its own snapshot was taken is aborted with a serialization error. Conflicting updates are rejected rather than silently lost, so the value of the counter attribute is now equal to the number of successful HTTP API calls. Please keep in mind that transactions with high isolation levels can significantly reduce performance. Always measure the impact before applying this change to production bottlenecks.

Another solution to this particular example is to leverage an API method that’s thread-safe by design:

def counter
  current_user.increment!(:counter)
end
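As far as I can tell, increment! is safe here because recent Rails versions let the database do the addition in a single UPDATE statement, instead of reading the value into Ruby and writing it back. The same idea can be sketched in plain Ruby, with a Mutex playing the role of the database's row lock. AtomicCounter and hammer are hypothetical names used only for this illustration.

```ruby
# Hypothetical illustration: the read and the write happen as one
# indivisible step, so concurrent callers cannot overwrite each other.
class AtomicCounter
  def initialize(value = 0)
    @value = value
    @lock = Mutex.new
  end

  def increment!
    @lock.synchronize { @value += 1 }
  end

  def value
    @lock.synchronize { @value }
  end
end

# Increments the counter from several threads and returns the final value.
def hammer(counter, threads:, hits_per_thread:)
  Array.new(threads) do
    Thread.new { hits_per_thread.times { counter.increment! } }
  end.each(&:join)
  counter.value
end

puts hammer(AtomicCounter.new, threads: 10, hits_per_thread: 100) # => 1000
```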

Finding an API method immune to threading bugs is not always possible. But searching the docs of an unfamiliar library for keywords like “thread-safe” or “concurrent” can sometimes save you from unnecessarily convoluted solutions. For example, the concurrent-ruby gem is an awesome collection of thread-safe programming primitives. So, if you find yourself juggling mutexes to patch a threading bug, getting familiar with the tools offered by this gem could be a lifesaver. BTW I’m currently working on a blog post about it, so please subscribe if you want to be notified when it’s out.

Bad globals

It is a commonly repeated mantra that “Globals are bad for thread safety!”. I’d say it all comes down to this tradeoff: the less global a variable, the more cumbersome it is to pass it around between scopes. I recommend this screencast by DHH where he discusses potential use cases for globals “when the price is right”.

There are no simple answers to where and how to use globals. So instead, let me describe a technique for making global values safe in multithreaded environments.

Thread-safe globals

Ruby offers built-in support for so-called thread-local variables. Each thread can work as a kind of hash for storing values that are accessible globally in the app, but only from that single thread.

The best way to explain it is by running the following code example:

Thread.current[:value] = "parent"

child = Thread.new do
  puts "Initial value in child: #{Thread.current[:value].inspect}"
  Thread.current[:value] = "child"
end
child.join

puts "Value in main: #{Thread.current[:value]}"
puts "Value in child: #{child[:value]}"

# Output:
# Initial value in child: nil
# Value in main: parent
# Value in child: child

These “thread-local globals” could sometimes be helpful in passing data across the application.
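One detail worth knowing: values stored with Thread.current[:key] are actually local to the current fiber, not to the whole thread. Ruby also provides thread_variable_set and thread_variable_get for values shared by all fibers running on a thread. Here's a short sketch of the difference; the locals_seen_from_fiber method and the :value key are arbitrary names picked for this example.

```ruby
# Returns what a fresh fiber sees for both kinds of "local" storage.
def locals_seen_from_fiber
  Thread.new do
    Thread.current[:value] = "fiber-local"
    Thread.current.thread_variable_set(:value, "thread-local")

    # A new fiber runs on the same thread but has its own fiber-local storage.
    Fiber.new do
      [Thread.current[:value], Thread.current.thread_variable_get(:value)]
    end.resume
  end.value
end

p locals_seen_from_fiber # => [nil, "thread-local"]
```

This matters mostly when your code, or a library you depend on, runs work inside fibers.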

However, there’s one critical issue with using bare Thread.current for storing values. Puma keeps a pool of threads and reuses them across requests. It means that unless you’re careful, you will leak data between requests. You can confirm this behavior by implementing the following endpoint:

class ExampleController < ApplicationController
  @@assignment_lock = Mutex.new

  def index
    @@assignment_lock.synchronize {
      if $assigned == nil
        Thread.current[:value] = "assigned"
        $assigned = true
      end
    }

    render plain: Thread.current[:value].to_s
  end
end

We use the @@assignment_lock class variable, a mutex that’s global per process, to prevent a so-called “time-of-check to time-of-use” error. In a highly concurrent environment, it is theoretically possible that thread A checks the value of the $assigned global variable, and then thread B kicks in and sees that it’s still nil before thread A sets it to true. This scenario would cause two different threads to enter the if branch, and we want to prevent it.

The first response from this endpoint will display the text assigned. When you request it a few more times, you’ll see that the same text is returned randomly every few hits. It means that the initial thread is being reused. Overlooking this feature of Thread.current could introduce critical security bugs because sensitive data could be shared between different user sessions.
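The leak can also be shown outside of Rails. The sketch below is a deliberately tiny imitation of a thread pool, not Puma's actual implementation: the hypothetical serve method runs every "request" (a lambda) on one long-lived worker thread.

```ruby
# A tiny "server": one worker thread handles every request, the way a
# pooled server thread handles many requests over its lifetime.
def serve(requests)
  inbox = Queue.new
  responses = []

  worker = Thread.new do
    while (request = inbox.pop) != :shutdown
      responses << request.call
    end
  end

  requests.each { |request| inbox << request }
  inbox << :shutdown
  worker.join
  responses
end

# The first request stores a value, the second one only reads it.
login   = -> { Thread.current[:user] = "alice"; "logged in" }
profile = -> { "profile of #{Thread.current[:user]}" }

p serve([login, profile]) # => ["logged in", "profile of alice"]
```

The second request never set :user, yet it sees the value left behind by the first one, because both ran on the same thread.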

A built-in Rails ActiveSupport::CurrentAttributes class serves the same purpose with much better security guarantees. Let’s reimplement our example:

app/models/current.rb

class Current < ActiveSupport::CurrentAttributes
  attribute :value
end

and the endpoint:

class ExampleController < ApplicationController
  @@assignment_lock = Mutex.new

  def index
    @@assignment_lock.synchronize {
      if $assigned == nil
        Current.value = "assigned"
        $assigned = true
      end
    }

    render plain: Current.value.to_s
  end
end

You’ll now see that the text assigned is displayed only once, on the first request. Rails resets the attributes of Current before and after each request, so there’s no more risk of a data leak.
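Conceptually, the safety comes from clearing per-request state around every request. The following plain-Ruby sketch imitates that idea with a hypothetical ResetThreadLocals Rack-style wrapper. It is not how Rails implements CurrentAttributes, only an illustration of the cleanup step.

```ruby
# Hypothetical Rack-style wrapper that clears the given thread-local keys
# once the wrapped app has finished handling a request.
class ResetThreadLocals
  def initialize(app, keys)
    @app = app
    @keys = keys
  end

  def call(env)
    @app.call(env)
  ensure
    @keys.each { |key| Thread.current[key] = nil }
  end
end

# The wrapped "app" stores a value on first use and returns whatever is stored.
app = lambda do |env|
  Thread.current[:value] ||= env[:value]
  Thread.current[:value].to_s
end
safe_app = ResetThreadLocals.new(app, [:value])

p safe_app.call({ value: "assigned" }) # => "assigned"
p safe_app.call({})                    # => ""
```

Both calls run on the same thread, but the second one starts with a clean slate because the wrapper cleared the value in its ensure block.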

Summary

Thread safety in Rails is a topic for a hefty eBook rather than a single blog post. But I hope that the above info covers the basics that will help you spot and debug potential issues in your codebase. Let me know in the comments if you know more interesting examples of thread-unsafe code in Ruby so that I can include them in this post.