A couple of months ago, I bought myself a new home PC, upgrading from my old Core2 Q6600 to a shiny new Haswell-based Xeon E3-1240v3. Honestly, I don't use my home PC that much, so the biggest draw for upgrading was trying out some of the features in Haswell, and getting loads of ECC RAM to support another project.
The biggest thing I was excited about with Haswell is Intel's new TSX, a step towards true hardware transactional memory on commodity processors. Transactional memory is a very exciting idea, and seeing better support for it in hardware is really great. TSX provides two broad sets of functionality: restricted transactional memory (RTM), and hardware lock elision (HLE). HLE can be seen as a subset of RTM, offering backward compatibility with pre-Haswell processors. I started my investigations by looking at HLE.
Intel describes Haswell's HLE like this:
If multiple threads execute critical sections protected by the same lock but they do not perform any conflicting operations on eachother's data, then the threads can execute concurrently and without serialization. Even though the software uses lock acquisition operations on a common lock, the hardware is allowed to recognize this, elide the lock, and execute the critical sections on the two threads without requiring any communication through the lock if such communication was dynamically unnecessary.
Based on this description, I was expecting HLE to work best on low-contention locks, possibly significantly increasing performance. Intel's backward-compatible HLE is based on two new instruction prefixes (rather than new instructions): XACQUIRE (F2) and XRELEASE (F3). You basically put the XACQUIRE prefix on the instruction that starts your critical section, and XRELEASE on the instruction that ends it. There are a bunch of good ways to implement locks on x86, but most commonly the start instruction will be an xcgh or cmpxchg, and the end will be one of those, or just a mov.
Luckily, to try this out I didn't need to write any assembly, because GCC 4.8 supports HLE through atomic builtins, thanks to the work of Andi Kleen. Taking advantage of HLE is as simple as passing an additional ATOMIC_HLE_ACQUIRE flag to atomic_exchange_n in your lock implementation, and ATOMIC_HLE_RELEASE to your unlock. GCC then emits the prefixes on the lock instructions.
Implementing a spinlock with atomic_exchange_n the difference in the emitted assembly is very simple:
- xchgl (%rdi), %eax + xacquire xchgl (%rdi), %eax
To test all of this out, I implemented a multithreaded count-sort of 100MB of random integers into 10000 buckets in C. The count buckets were protected by striped spinlocks, with some number of buckets sharing a single spinlock. Each thread looped over its unique data, and for each item took the spinlock, increased the bucket count and released the spinlock. Obviously a very short critical section, and probably better implemented without locks (directly with an atomic cmpxchg, for example), but I'm only starting out here.
I then ran two versions of the program for various thread counts and lock counts: a version with the HLE prefixes and a version without them and measured the difference in performance:
Here's a cut through the number of locks, for 3 threads (blue) and 10 threads (red). With 3 threads, HLE is faster across the board, and with 10 threads, HLE is a wash with more locks and a big loss with fewer. Both win big with a single lock:
The results are actually very interesting. For this (admittedly extremely lock-intensive) program, the version with HLE takes nearly twice as long as the one without when run with a single thread. However, when run with one lock and 10 threads, a massive amount of contention, the HLE version more than 6 times faster. That's pretty amazing.
If we peek under the hood a bit, we should be able to see what's going on inside the processor to make this big performance differences. As always with low-level CPU performance stuff, perf stat is a great place to start. First, let's look at a case where HLE is much faster, with 3 threads and 10 locks. The HLE version is more than 2x faster in this case.
The first step of perf stat for the two versions shows this for the HLE version: 1,333,417,638 instructions # 0.20 insns per cycle and this for the non-HLE version: 1,333,549,171 instructions # 0.09 insns per cycle
The number of executed instructions is nearly the same, as expected, but instructions per cycle is very different. Obviously, the CPU is doing something other than running instructions, and that something is most likely waiting for memory. Looking at some of the cache counters makes this more obvious. First, the version with HLE:
27,804,277 LLC-stores 25,302,175 LLC-loads 412,030,786 L1-dcache-loads 65,940,443 L1-dcache-load-misses
and the version without:
67,958,384 LLC-stores 32,624,733 LLC-loads 400,883,047 L1-dcache-loads 89,777,892 L1-dcache-load-misses
Look at the big difference between LLC-stores (writes to Last Level Cache, L3 in our case) explains a big part of the issue here. The difference between L1-dcache-load-misses is also significant. The non-HLE version is making a lot more trips to memory. HLE is clearly doing its job. In the single-lock case, with HLE's 6x performance improvement, the difference in LLC-stores is huge (6x, actually).
So what about the single thread case? Why does HLE make that slower? I don't understand enough about Intel's architecture to make a good guess, but the 2x slower HLE-enabled version is doing about 2x the LLC-stores. I suspect this is an implementation issue with HLE in TSX, and I hope it's an artifact of this benchmark rather than a general performance issue with TSX. More testing, especially more testing with significant critical sections, should tell.
I'll release the source for these tests as soon as I can. I'm hoping to find some time to play with RTM over the holidays, too.