Marc's Blog

About Me

My name is Marc Brooker. I've been writing code, reading code, and living vicariously through computers for as long as I can remember. I like to build things that work. I also dabble in machining, welding, cooking and skiing.

I'm currently an engineer at Amazon Web Services (AWS) in Seattle, where I work on databases, serverless, and serverless databases. Before that, I worked on EC2 and EBS.
All opinions are my own.


My Publications and Videos
@marcbrooker on Mastodon @MarcJBrooker on Twitter

Make Your Program Slower With Threads

How much do context switches matter?

Years ago, while taking a numerical methods course, I wrote some code to calculate the expected number of shared birthdays in a group. The code is very simple: each attempt constructs a vector of N birthdays, then counts the duplicates. The outer loop runs millions of attempts, and calculates the mean number of shared birthdays across all the samples. It’s little more than a tight loop around a pseudo-random number generator.

I was also learning about threading at the time, and decided that I could speed up my program by running it on the lab’s shiny dual-core machine. I knew that communicating between threads was expensive, so I had each of my threads calculate their attempts in parallel, and merge the results right at the end. I was expecting a great speedup. Much to my disappointment, though, the multi-threaded version was slower. Much, much, slower.

Much like the birthday paradox runs counter to our intuition about statistics, the behavior of bad multi-threaded programs runs counter to our intuition about computer performance. We’re used to computers being much faster than they used to be, and single-threaded efficiency mattering less than it used to in most cases. Counter to that intuition, the gap between good and bad multithreaded programs has gotten worse over time.

To illustrate just how bad it can be, I replicated my program from back then. It’s not much more than a multi-threaded tight loop around random(3). It’s nice and quick single-threaded: running 10 million attempts in under 7 seconds. Going up to two threads makes it a bit faster, down to less than 6 seconds. When we hit three threads (on my four core Haswell E3-1240), it all goes horribly wrong:

To figure out what’s wrong, we can turn to Linux’s excellent perf tool. Running the 1-thread and 4-thread versions with perf stat make it obvious that something’s going on. For 1 thread:

     3,788,352 L1-dcache-load-misses #0.03% of all L1-dcache hits
43,399,424,441 instructions  #1.46  insns per cycle
           734 context-switches

and for four threads:

  4,110,904,396 L1-dcache-load-misses #6.88% of all L1-dcache hits
248,853,610,160 instructions # 0.51  insns per cycle
     15,993,647 context-switches

Two things are going wrong here. One is that we’re seeing a more L1 cache misses with more threads, but the bigger issue is that we’re seeing a whole lot more context switches. The effect of both of these is visible in the much lower instructions per cycle of the second version. There’s no nice constant for the cost of a context switch, but a good modern estimate is around 3μs. Multiplying 3μs by 16 million context switches gives 48 seconds, which is a good hint that we’re headed in the right direction. So, what’s causing the context switches?

Back to perf, this time running perf record on the processes, followed by perf report. First, the top few rows for the single-threaded version:

# Overhead   Command  Shared Object    Symbol
# ........  ........  ..............   ........................
62.01%  birthday         [.] msort_with_tmp.part.0
11.40%  birthday         [.] __memcpy_sse2        
10.19%  birthday  birthday             [.] simulate

We’re spending 62% of the time sorting the array, which is used to find the duplicates. That’s about what I would have guessed. What about the version with four threads?

# Overhead   Command  Shared Object  Symbol
# ........  ........  .............  ............
46.80%  birthday  [kernel.kallsyms]  [k] _raw_spin_lock   
 8.86%  birthday       [.] __random           
 3.42%  birthday       [.] __lll_lock_wait_private
 3.23%  birthday  [kernel.kallsyms]  [k] try_to_wake_up       
 2.95%  birthday       [.] __random_r        
 2.79%  birthday       [.] msort_with_tmp.part.0
 2.10%  birthday  [kernel.kallsyms]  [k] futex_wake 
 1.46%  birthday  [kernel.kallsyms]  [k] system_call  
 1.35%  birthday  [kernel.kallsyms]  [k] get_futex_value_locked 
 1.15%  birthday  [kernel.kallsyms]  [k] futex_wait_setup  
 1.14%  birthday  [kernel.kallsyms]  [k] futex_wait 

Well, that’s suspicious. There aren’t any locks in my code, but there are a whole lot of references to locks in the trace. raw_spin_lock is obviously a candidate, and it’s suspicious to see so many futex-related calls. Something’s taking locks, and the fact that random is near the top of the list makes it a likely candidate. Before we dive in there, though, let’s confirm that we’re doing a lot of syscalls:

sudo perf stat -e 'syscalls:sys_e*' ./birthday

Which spits out a long list of system calls, most (like mmap) with just a handful of hits. There are two huge outliers:

46,889,267 syscalls:sys_enter_futex
46,889,267 syscalls:sys_exit_futex

That confirms it, something’s taking a lot of futexes. Knowing whether it’s random or not requires a dive into the glibc source, which nearly instantly reveals something suspicious:

 /* POSIX.1c requires that there is mutual exclusion for the `rand' and
  `srand' functions to prevent concurrent calls from modifying common
   data.  */

And, just a little bit further down:

 __libc_lock_lock (lock);
 (void) __random_r (&unsafe_state, &retval);
 __libc_lock_unlock (lock);

Getting rid of the locks means getting rid of one of two things: shared state, or the necessity to prevent concurrent modification to that state. It seems like the former is easier: reasoning about a data-race-safe PRNG is tricky. There are a many good ways to get rid of shared state in the PRNG. Linux has one particularly convenient way: the C library exposes a reentrant random number generator called random_r (which is used by random, as you can see from the snippet above). Dropping random_r in place of random has an amazing effect:

As expected, the context switches are way down and instructions per cycle is nicely improved:

     4,166,540 L1-dcache-load-misses  # 0.04% of all L1-dcache hits
40,201,461,769 instructions # 1.43  insns per cycle
           572 context-switches

I recognize that spinning on a tight loop on random is a contrived example, but it’s not too far away from reality. Many programs that multi-thread for performance end up with library or system calls inside relatively tight loops. Our intuition about these things tends to follow Amdahl’s law. At worst, it’s tempting to think, these things count as a non-parallel portion of code and lower the maximum achievable parallel speedup. In the real world, though, that’s not the case. Multi-threaded programs can, and very often do, run much more slowly than the equivalent single-threaded program.

It’s just another thing that makes writing multi-threaded code difficult.