Litmus Tests for Multithreaded Behavior
Or How Processors Don't Do What You Think
Modern multicore processors are far weirder than almost anyone thinks possible. They are somewhat weirder than chip makers were willing to admit until fairly recently. They are sufficiently weird that almost all multi-threaded programs, and many lock-free algorithms, had bugs because the hardware simply does not work the way anyone would reasonably expect. On top of that, optimizing compilers did not actually know how to avoid breaking your code. [boehm2005threads]

I'm in the (slow) process of writing some test cases for multi-threaded code. I eventually want some confidence that some code is executed once, and only once, in an efficient manner. It's the 'efficient' part that worries me: efficient code also tends to be clever, and I'm learning that clever MT code is often subtly broken. [bacon2000double] If smarter people than me make mistakes in MT code, I need tests to compensate, including tests that cover behavior that is allowed but unexpected. That means the test framework has to be able to detect such behavior.

Fortunately, the RPi is a computer that exhibits some of this odd behavior, since it is an ARM system. X86 has a much stronger memory model, although even x86 is allowed to perform in odd ways. Starting in 2007, Intel began publishing short snippets of assembly together with the allowed and disallowed results of running them in parallel. [IWPAug2007] These snippets have come to be called litmus tests, and they are used to confirm the behavior of hardware and to validate models of that behavior. A particularly important model for C++ programmers is x86 Total Store Order (x86-TSO) [owens2009better], which provides a way of mapping the C++11 memory model to x86 hardware. X86 hardware provides a strongly consistent memory model. Power and ARM provide fewer guarantees, and mapping the C++ memory model to those architectures is more challenging. [maranget2012tutorial]

Message Passing
The tests outlined in the Intel paper are short pieces of assembly to be run on different processors, together with guarantees about which results will not happen. The first one essentially promises that message passing will work, and is now known as the MP test.
Processor 0 | Processor 1 |
---|---|
mov [ _x], 1 // M1 | mov r1, [ _y] // M3 |
mov [ _y], 1 // M2 | mov r2, [ _x] // M4 |

Initially x = y = 0

r1 = 1 and r2 = 0 is not allowed
That says we can't read the writes to x and y out of order: if we're waiting to see the write to the flag y, we are guaranteed to also see the payload in x. If we change it slightly to spin until the write to _y is visible, we can pass a message from one thread to another in _x. This is also known as Dekker's Algorithm.
ARM and Power do not provide that guarantee without additional synchronization instructions.
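As an aside, the cure in portable C++ is to strengthen the orderings rather than drop to assembly. The sketch below is mine, not part of the test framework, and the names payload and flag are placeholders; the release store paired with the acquire load is the C++ construct that makes the compiler emit whatever barriers ARM and Power need, so the r1 = 1, r2 = 0 outcome is forbidden there as well.

#include <atomic>

// Sketch only: a standalone message-passing pair, not the MP class used
// in the tests below. Once the reader sees flag == 1, the release/acquire
// pairing guarantees it also sees payload == 42, on x86, ARM, and Power.
std::atomic<int> payload{0};
std::atomic<int> flag{0};

void writer() {
    payload.store(42, std::memory_order_relaxed); // the message
    flag.store(1, std::memory_order_release);     // publish it
}

int reader() {
    while (flag.load(std::memory_order_acquire) == 0) {} // wait for the flag
    return payload.load(std::memory_order_relaxed);      // guaranteed to be 42
}

The tests that follow deliberately do not do this; they stay with relaxed so the hardware is free to show its reordering.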
In C++ that looks something like the following, using the test framework I'm writing.
Here, x_ and y_ are atomic<int>s, and we're using the weakest atomic ordering C++ provides, relaxed. Relaxed guarantees only that the operation happens atomically; there are no synchronization properties with anything else. This usually corresponds to a plain int, unless you're on a really insane processor that lets an int be partially written and observed, so that you might see the top half of the int, or the middle byte. The commercial processors that allowed this have pretty much died out.

The test spins until it sees the store to y_, then loads the value of x_ into a result tuple. The tuple is used as the key to a map that accumulates how many times each result has been seen.
MP::MP() : x_(0), y_(0) {}

void MP::t1() {
    x_.store(1, std::memory_order_relaxed);
    y_.store(1, std::memory_order_relaxed);
}

void MP::t2(Result& read) {
    while (!y_.load(std::memory_order_relaxed)) {}
    std::get<0>(read) = x_.load(std::memory_order_relaxed);
}

Running the above on my x86 laptop:

[ RUN ] ExperimentTest.MPTest1
(1) : 2000000

And on my Raspberry Pi 3:
[ RUN ] ExperimentTest.MPTest1
(0) : 483
(1) : 1999517

Using objdump to check the generated assembly:
00000088 <litmus::MP::MP()>:
  88: mov r2, #0
  8c: str r2, [r0]
  90: str r2, [r0, #64] ; 0x40
  94: bx lr

00000098 <litmus::MP::t1()>:
  98: mov r3, #1
  9c: str r3, [r0]
  a0: str r3, [r0, #64] ; 0x40
  a4: bx lr

000000a8 <litmus::MP::t2(std::tuple<int>&)>:
  a8: ldr r3, [r0, #64] ; 0x40
  ac: cmp r3, #0
  b0: beq a8 <litmus::MP::t2(std::tuple<int>&)>
  b4: ldr r3, [r0]
  b8: str r3, [r1]
  bc: bx lr

So, out of the 2,000,000 times that I ran the experiment, there were 483 times that reading x_ resulted in 0, even though y_ was 1. ARM has a weaker memory model than x86. This has some advantages in processor implementation, and distinct disadvantages for how our brains work. X86 tries to preserve the model that there is a single shared memory that everyone sees and works with. That's not strictly true, even for x86, but ARM and Power don't even come close. On the other hand, that is also part of why it's easier to add more cores to Power and ARM chips and systems. I routinely work with Power systems with 512 physical cores.
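For completeness, the counting harness itself is nothing exotic. The real framework reuses its threads and synchronizes the start of each run; the sketch below is a simplified approximation with names of my own choosing (runExperiment, iterations), shown only to make the tuple-keyed map concrete.

#include <map>
#include <thread>
#include <tuple>

// Simplified sketch of the measurement loop: run the litmus test many
// times and tally how often each result tuple shows up. Test stands for
// one of the classes above, e.g. litmus::MP with Result = std::tuple<int>.
template <typename Test, typename Result>
std::map<Result, int> runExperiment(int iterations) {
    std::map<Result, int> counts;
    for (int i = 0; i < iterations; ++i) {
        Test test;                     // fresh x_ = y_ = 0 for each run
        Result read{};
        std::thread a([&] { test.t1(); });
        std::thread b([&] { test.t2(read); });
        a.join();
        b.join();
        ++counts[read];                // accumulate per observed result
    }
    return counts;
}

Spawning a thread pair per iteration is far slower than reusing threads, which is one reason the real harness has to be more clever than this.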
Store Buffering
Store buffering is the odd case that is explicitly allowed in the Intel memory model. When two threads each store to a different location and then read the location the other thread wrote, both threads are allowed to read the older value: the stores get buffered.
From the Intel White Paper:
Processor 0 | Processor 1 |
---|---|
mov [ _x], 1 // M1 | mov [ _y], 1 // M3 |
mov r1, [ _y] // M2 | mov r2, [ _x] // M4 |

Initially x = y = 0

r1 = 0 and r2 = 0 is allowed
Note, in particular, that there is no interleaving of M1 through M4 that could result in both r1 and r2 being 0, not without interrupting an instruction in the middle. But the instructions themselves are atomic and indivisible. If they were actually operating on a single shared memory, this result would not be possible. However, it does happen.
The C++ version of the test is:

SB::SB() : x_(0), y_(0) {}

void SB::t1(Result& read) {
    y_.store(1, std::memory_order_relaxed);
    std::get<0>(read) = x_.load(std::memory_order_relaxed);
}

void SB::t2(Result& read) {
    x_.store(1, std::memory_order_relaxed);
    std::get<1>(read) = y_.load(std::memory_order_relaxed);
}

That generates the x86 code:

00000000000000f0 <litmus::SB::t1(std::__1::tuple<int, int>&)>:
  f0: mov DWORD PTR [rdi+0x40],0x1
  f7: mov eax,DWORD PTR [rdi]
  f9: mov DWORD PTR [rsi],eax
  fb: ret

0000000000000100 <litmus::SB::t2(std::__1::tuple<int, int>&)>:
 100: mov DWORD PTR [rdi],0x1
 106: mov eax,DWORD PTR [rdi+0x40]
 109: mov DWORD PTR [rsi+0x4],eax
 10c: ret

And on my x86 machine:
[ RUN ] ExperimentTest.SBTest1
(0, 0) : 559
(0, 1) : 999858
(1, 0) : 999576
(1, 1) : 7

So 559 times neither core saw the other core's store.
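Store buffering is also the best argument for the default, memory_order_seq_cst. If every store and load in the test is sequentially consistent, the C++ model promises the (0, 0) result cannot occur, and on x86 the compiler pays for that with an mfence or a locked instruction after each store. Here is a sketch of that variant, mirroring the SB code above; SBSeqCst is my name, not part of the test suite.

#include <atomic>
#include <tuple>

// Sketch only: the seq_cst variant of store buffering. All four operations
// take part in a single total order, so at least one load must see a 1,
// and the (0, 0) outcome is forbidden.
struct SBSeqCst {
    std::atomic<int> x_{0};
    std::atomic<int> y_{0};
    void t1(std::tuple<int, int>& read) {
        y_.store(1, std::memory_order_seq_cst);
        std::get<0>(read) = x_.load(std::memory_order_seq_cst);
    }
    void t2(std::tuple<int, int>& read) {
        x_.store(1, std::memory_order_seq_cst);
        std::get<1>(read) = y_.load(std::memory_order_seq_cst);
    }
};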
Load Buffering
Load Buffering is the dual of store buffering. Loads into registers might be delayed, or buffered, and actually performed after following instructions. It's not allowed in the Intel architecture.
From the Intel White Paper
Initially x = y = 0
r1 = 1 and r2 = 1 is not allowed
This is the x86 asm code
Processor 0 | Processor 1 |
---|---|
mov r1, [ _x] // M1 | mov r2, [ _y] // M3 |
mov [ _y], 1 // M2 | mov [ _x], 1 // M4 |
In C++:

LB::LB() : x_(0), y_(0) {}

void LB::t1(Result& read) {
    std::get<0>(read) = x_.load(std::memory_order_relaxed);
    y_.store(1, std::memory_order_relaxed);
}

void LB::t2(Result& read) {
    std::get<1>(read) = y_.load(std::memory_order_relaxed);
    x_.store(1, std::memory_order_relaxed);
}
On x86 this compiles to:

00000000000000c0 <litmus::LB::t1(std::__1::tuple<int, int>&)>:
  c0: mov eax,DWORD PTR [rdi]
  c2: mov DWORD PTR [rsi],eax
  c4: mov DWORD PTR [rdi+0x40],0x1
  cb: ret
  cc: nop DWORD PTR [rax+0x0]

00000000000000d0 <litmus::LB::t2(std::__1::tuple<int, int>&)>:
  d0: mov eax,DWORD PTR [rdi+0x40]
  d3: mov DWORD PTR [rsi+0x4],eax
  d6: mov DWORD PTR [rdi],0x1
  dc: ret
  dd: nop DWORD PTR [rax]

And the ARM code, at -O1:
000000d0 <litmus::LB::t1(std::tuple<int, int>&)>:
  d0: ldr r3, [r0]
  d4: str r3, [r1, #4]
  d8: mov r3, #1
  dc: str r3, [r0, #64] ; 0x40
  e0: bx lr

000000e4 <litmus::LB::t2(std::tuple<int, int>&)>:
  e4: ldr r3, [r0, #64] ; 0x40
  e8: str r3, [r1]
  ec: mov r3, #1
  f0: str r3, [r0]
  f4: bx lr

ARM generally allows load buffering, but per [maranget2012tutorial] the test is very sensitive, and dependencies will make it not appear. In my tests, I did not observe an instance of load buffering, which may be due to the store the compiler inserts between the load and the flag store in order to get the loaded value into the result tuple. That it's documented as possible at all is still exceedingly strange.
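The sensitivity to dependencies is easy to demonstrate. The variant below is my own sketch (LBDep is not one of the tests in the suite): each thread stores the value it just loaded instead of the constant 1. With that data dependency in place, observing (1, 1) would require the value 1 to appear out of thin air, which neither the hardware nor the C++ model is supposed to produce.

#include <atomic>
#include <tuple>

// Sketch of the dependent load-buffering variant: the stored value is the
// value that was just loaded, so the store cannot be hoisted above the load.
struct LBDep {
    std::atomic<int> x_{0};
    std::atomic<int> y_{0};
    void t1(std::tuple<int, int>& read) {
        int r1 = x_.load(std::memory_order_relaxed);
        y_.store(r1, std::memory_order_relaxed);   // depends on the load of x_
        std::get<0>(read) = r1;
    }
    void t2(std::tuple<int, int>& read) {
        int r2 = y_.load(std::memory_order_relaxed);
        x_.store(r2, std::memory_order_relaxed);   // depends on the load of y_
        std::get<1>(read) = r2;
    }
};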
Independent Reads of Independent Writes
IRIW is a generalization of store buffering, where two reader threads each read different apparent orderings of writes from two distinct writer threads.
Initially X=Y=0
r1 = 1, r2 = 0, r3 = 1, r4 = 0 is allowed on ARM, but not on x86 [maranget2012tutorial, owens2009better]
This result is not observed on x86 processors, but it is on some ARM and POWER implementations, more often on POWER. X86 hardware maintains a consistent view of memory, whereas other hardware can see memory writes in different orders on different threads. On my rPi, I didn't observe any incidents of the writes to X and Y being seen in different orders, over 40 million runs.
T1 | T2 | T3 | T4 |
---|---|---|---|
X = 1 | Y = 1 | R1 = X | R3 = Y |
  |   | R2 = Y | R4 = X |
IRIW::IRIW() : x_(0), y_(0) {}

void IRIW::t1() {
    x_.store(1, std::memory_order_relaxed);
}

void IRIW::t2() {
    y_.store(1, std::memory_order_relaxed);
}

void IRIW::t3(Result& read) {
    std::get<0>(read) = x_.load(std::memory_order_relaxed);
    std::get<1>(read) = y_.load(std::memory_order_relaxed);
}

void IRIW::t4(Result& read) {
    std::get<2>(read) = y_.load(std::memory_order_relaxed);
    std::get<3>(read) = x_.load(std::memory_order_relaxed);
}
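One thing worth spelling out: upgrading these loads and stores to release/acquire is not enough to forbid the split result, because nothing orders the two independent writers with respect to each other. In C++ the r1 = 1, r2 = 0, r3 = 1, r4 = 0 outcome is only excluded when all of the operations are memory_order_seq_cst, which is the default. A sketch of that variant, with IRIWSeqCst as my placeholder name:

#include <atomic>
#include <tuple>

// Sketch only: with seq_cst, every store and load participates in one
// total order, so the two readers cannot disagree about whether x_ or y_
// was written first.
struct IRIWSeqCst {
    std::atomic<int> x_{0};
    std::atomic<int> y_{0};
    void t1() { x_.store(1, std::memory_order_seq_cst); }
    void t2() { y_.store(1, std::memory_order_seq_cst); }
    void t3(std::tuple<int, int, int, int>& read) {
        std::get<0>(read) = x_.load(std::memory_order_seq_cst);
        std::get<1>(read) = y_.load(std::memory_order_seq_cst);
    }
    void t4(std::tuple<int, int, int, int>& read) {
        std::get<2>(read) = y_.load(std::memory_order_seq_cst);
        std::get<3>(read) = x_.load(std::memory_order_seq_cst);
    }
};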
Summary
The allowed behavior of modern processors is very different from our mental model of a Von Neumann architecture computer. Each core can have a different view of memory, and without additional controls, writes and reads can break the illusion of a single unified memory. The C++ memory model provides the controls and guarantees that govern what happens when different threads read and write memory, and here I've deliberately used the weakest version available, relaxed, in order to allow the processors the widest latitude in behavior. On processors that support it natively, a relaxed atomic is often just an ordinary unconstrained int, which means you will get the same odd behavior if you run shared-state multithreaded code that uses plain native types. It is a particular problem for code that was originally written and tested on an x86 architecture, because the native model there is fairly strong, and it frequently shows up when porting to a mobile platform, where ARM is a very popular hardware choice.
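To make that porting hazard concrete, here is the kind of code the summary is warning about. This is my illustration, not something from the test suite: a plain flag and payload, with no atomics at all. It is a data race, and therefore undefined behavior, regardless of the hardware, and on ARM the stores can additionally become visible out of order.

// Broken sketch: plain shared variables. This often appears to work on
// x86, and is still a data race (undefined behavior) in C++.
int  data  = 0;
bool ready = false;

void producer() {
    data  = 42;
    ready = true;          // nothing orders this store after the data store
}

void consumer() {
    while (!ready) {}      // the compiler may hoist this load and spin forever
    // use data: may still observe 0 on a weakly ordered machine
}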
Org-mode source and git repo
Exported from an org-mode doc. All of the source is available on GitHub at SpinGate.
References
- [boehm2005threads] Hans-J. Boehm, "Threads Cannot Be Implemented as a Library", ACM SIGPLAN Notices (PLDI), pp. 261-268, 2005.
- [bacon2000double] David Bacon, Joshua Bloch, Jeff Bogda, Cliff Click, Paul Haahr, Doug Lea, Tom May, Jan-Willem Maessen, J. D. Mitchell, Kelvin Nilsen, et al., "The 'Double-Checked Locking is Broken' Declaration", 2000.
- [IWPAug2007] Intel, "Intel® 64 Architecture Memory Ordering White Paper", http://www.cs.cmu.edu/~410-f10/doc/Intel_Reordering_318147.pdf, accessed 2017-04-30.
- [owens2009better] Scott Owens, Susmit Sarkar, and Peter Sewell, "A Better x86 Memory Model: x86-TSO", in Theorem Proving in Higher Order Logics (TPHOLs), pp. 391-407, 2009.
- [maranget2012tutorial] Luc Maranget, Susmit Sarkar, and Peter Sewell, "A Tutorial Introduction to the ARM and POWER Relaxed Memory Models", draft, http://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test7.pdf, 2012.