In my last post, I looked at the performance of HLE on Intel's Haswell processor, and found that while it offered a nice speedup in some cases, it can cost performance in others. Still, Intel's TSX is an extremely exciting technology. In this post, I look at the other half of TSX, which Intel calls Restricted Transactional Memory.
If you've never heard of transactional memory before, it's worth reading up. As usual, the Wikipedia article isn't a bad place to start. Some languages, like Clojure, offer software transactional memory out of the box. When it fits, STM can be an extremely nice way to write concurrent programs. The programming model can be simpler, and some classes of bugs (correctness, mostly, rather than liveness) are easier to avoid.
Unlike all the great STM libraries, the current interfaces available to RTM, at least in the form of compiler builtins, don't offer much in the way of a simpler programming model. They do, however, offer us a great way to taste some of the performance benefits that Intel promises for RTM. First, let's take a look at what Intel says about RTM. Starting with the Intel® 64 and IA-32 Architectures Optimization Reference Manual:
Intel® Transactional Synchronization Extensions (Intel TSX) aim to improve the performance of lock-protected critical sections while maintaining the lock-based programming model
OK, so a simpler programming model isn't really Intel's aim here. I'm still pretty sure that there are great opportunities for TM libraries, some of which are already starting to appear, like xsync. Some more from the manual:
Intel TSX allows the processor to determine dynamically whether threads need to serialize through lock-protected critical sections, and to perform serialization only when required. This lets hardware expose and exploit concurrency hidden in an application due to dynamically unnecessary synchronization through a technique known as lock elision.
That's the high-level view. When the CPU detects that the lock isn't held, it tries to run without it. If that all goes horribly wrong, because another core tried to do the same thing, the processor undoes its misdeeds and tries again with the lock. It's clever stuff, and with inter-core synchronization becoming more and more of a bottleneck in multicore systems, it's not surprising Intel's investing in it.
Let's take a look at how RTM performs. For today's test, I wrote a simple separately-chained hash table, and inserted random longs into it until it reached 1000% occupancy. I also made insertions a little artificially slow by doing tail, rather than head, insertions when chaining. For synchronization, I followed Example 12-3 in the manual, which shows a pattern for the use of RTM. For the fallback lock, I used the simple spinlock from Example 12-4 (without HLE, because you can't use both together on Haswell). In unoptimized assembly, the lock function ended up looking like this:
movq %rdi, -8(%rbp) movl $-1, %eax xbegin .L10 .L10: cmpl $-1, %eax jne .L12 movq -8(%rbp), %rax movl (%rax), %eax testl %eax, %eax jne .L13 jmp .L9 .L13: xabort $255 .L12: movq -8(%rbp), %rax movq %rax, %rdi call orig_spinlock
.L9: leave ret
There you can see the two of the three new operations that make up RTM. Interestingly, the xbegin operation can jump (hence the label argument), but the patterns in the manual when used with the builtins in GCC don't use that functionality. Next, we see that we test the lock, and if it's free we return right away. Finally, if somebody else has taken the lock, we xabort to end our transaction and fall back on the lock path (in this case by calling orig_spinlock, which is my spinlock implementation). The unlock side tests the lock again to differentiate between the elided path and the locked path, and calls xend on the elided path. Nothing very complex at all code-wise.
First, let's look at the results of doing a million inserts into a 100k bucket hash table by number of threads. I ran each test 100 times, and inter-test variance was very low (less than 5%). The RTM implementation is in red, and the straight spinlock-based one (with no HLE) is in blue:
The RTM version is repeatably very marginally slower with a single thread (about 0.2% on average), but otherwise faster across the board. The spinlock implementation succumbs to increasing contention costs and is slower than single-threaded, while RTM is faster until 4 threads. The really impressive performance here is at 2 threads, where the HLE version is much, much faster. It shows true parallel speedup on this task, which is impressive considering it's nearly entirely dominated by lock contention. Comparing performance directly:
RTM is nicely quicker across the board, and runs in only 63% of the time with two threads. It's a really great little performance gain, with very little programmer effort.
What happens if we throw HLE into the mix on this program? I added an HLE version of the code (following Example 12-4) to the two I already had. This was the result:
That's extremely interesting. The HLE implementation is a solid 18% slower than baseline on the single-threaded version, then shows a massive performance advantage until 8 threads. It doesn't improve total parallel speedup beyond three threads, but does very effectively prevent slowdown.
I started looking at this hoping to find some good answers, and it just left me with more questions. Clearly, HLE and RTM have great potential to improve multi-threaded performance on contended data structures, but it's not clear-cut when they should be used. So far, my experiments have shown RTM is better across the board than nothing, while HLE can be even better than that at a potential cost with one thread.
I suspect it's going to take a while for compiler and library writers to untangle all of this. We're not going to be seeing the full performance benefits of these features any time soon.
I'll release the source for these tests as soon as I can.