1 Building a simple spin gate in C++
The simplest atomic boolean is the std::atomic_flag. It’s guaranteed by the standard to be lock-free. Unfortunately, to provide that guarantee it offers no way to do an atomic read or write of the flag, just a clear, and a test-and-set that returns the prior value. This isn’t rich enough for multiple threads to wait on, although it is enough to implement a spinlock.
std::atomic&lt;bool&gt; isn’t guaranteed to be lock-free, but in practice any architecture I’m interested in has lock-free atomics for at least 32-bit aligned values. Older hardware, such as ARMv5, SparcV8, and the 80386, lacks cmpxchg, so loads are generally implemented with a lock in order to maintain the guarantees if a load and an exchange were to happen simultaneously. See, for example, the LLVM Atomic Instructions and Concurrency Guide. Modern ARM, x86, Sparc, and Power chips are fine.
When the spin gate is constructed, we’ll mark the gate as closed. Threads will then wait on the flag, spinning until the gate is opened. For this we use Release-Acquire ordering between the open and wait. This will ensure any stores done before the gate is opened will be visible to the thread waiting.
// spingate.h                                                  -*-C++-*-
#ifndef INCLUDED_SPINGATE
#define INCLUDED_SPINGATE

#include <atomic>
#include <thread>

class SpinGate {
    std::atomic_bool flag_;

  public:
    SpinGate();
    void wait();
    void open();
};

inline SpinGate::SpinGate()
{
    // Close the gate
    flag_.store(true, std::memory_order_release);
}

inline void SpinGate::wait()
{
    while (flag_.load(std::memory_order_acquire))
        ; // spin
}

inline void SpinGate::open()
{
    flag_.store(false, std::memory_order_release);
    std::this_thread::yield();
}

#endif
#include "spingate.h" // Test that header is complete by including
Using a SpinGate is fairly straightforward. Create an instance of SpinGate and wait() on it in each of the worker threads. Once all of the threads are created, open the gate to let them run. In this example, I sleep for one second in order to check that none of the worker threads gets past the gate before it is opened.
The synchronization is on the SpinGate’s std::atomic_bool, flag_. The flag_ is set to true in the constructor, with release memory ordering. The function wait() spins on loading the flag_ with acquire memory ordering until open() is called, which sets the flag_ to false with release semantics. The threads that were spinning may then proceed. The release-acquire ordering ensures that writes made by the thread setting up the work, before it calls open(), will be visible to the threads that were spin-waiting.
1.1 Update:
#include "spingate.h"

#include <vector>
#include <chrono>
#include <thread>
#include <iostream>

int main()
{
    std::vector<std::thread> workers;
    SpinGate gate;
    using time_point = std::chrono::time_point<std::chrono::high_resolution_clock>;
    time_point t1;
    auto threadCount = std::thread::hardware_concurrency();
    std::vector<time_point> times;
    times.resize(threadCount);

    for (size_t n = 0; n < threadCount; ++n) {
        workers.emplace_back([&gate, &t1, &times, n]{
            gate.wait();
            time_point t2 = std::chrono::high_resolution_clock::now();
            times[n] = t2;
        });
    }

    std::cout << "Open the gate in 1 second: " << std::endl;
    using namespace std::chrono_literals;
    std::this_thread::sleep_for(1s);
    t1 = std::chrono::high_resolution_clock::now();
    gate.open();

    for (auto& thr : workers) {
        thr.join();
    }

    int threadNum = 0;
    for (auto& time : times) {
        auto diff = std::chrono::duration_cast<std::chrono::nanoseconds>(time - t1);
        std::cout << "Thread " << threadNum++ << " waited " << diff.count() << "ns\n";
    }
}
I’d originally had the bodies of the threads just print to std::cout that they were running; the lack of output before the gate opened, plus the overlapping output afterwards, was evidence of the gate working. That looked like:
for (std::size_t n = 0; n < std::thread::hardware_concurrency(); ++n) {
    workers.emplace_back([&gate, n]{
        gate.wait();
        std::cout << "Output from gated thread " << n << std::endl;
    });
}
The gate is captured in the thread lambda by reference, the thread number by value, and when run, overlapping gibberish is printed to the console as soon as open() is called.
But then I became curious about how long the spin actually lasted. The guarantees for atomics with release-acquire semantics, or even sequentially consistent ones, are that once a change is visible, the changes before it are also visible. How fast the change becomes visible, and what the costs are of making the happened-before writes available, is really a function of the underlying hardware. I’d already observed better overlapping execution using the gate, as opposed to just letting the threads run, so for my initial purpose of making contention more likely, I was satisfied. Visibility, on my lightly loaded system, seems to be in the range of a few hundred to a couple thousand nanoseconds, which is fairly good.
Checking how long it took to start let me do two things. First, play with the new-ish chrono library. Second, check that the release-acquire sync is working the way I expect. The lambdas that the threads run capture the start time by reference. The start time is set just before the gate is opened, and well after the threads have started running. The spin gate’s synchronization ensures that if the state change caused by open() is visible, the setting of the start time is also visible.
Here are one set of results from running a spingate:
Open the gate in 1 second: 
Thread 0 waited 821ns
Thread 1 waited 14490ns
Thread 2 waited 521ns
Thread 3 waited 817ns
2 Comparison with Condition Variable gate
// cvgate.h                                                    -*-C++-*-
#ifndef INCLUDED_CVGATE
#define INCLUDED_CVGATE

#include <mutex>
#include <condition_variable>
#include <atomic>
#include <thread>

class CVGate {
    std::mutex lock_;
    std::condition_variable cv_;
    bool flag_;

  public:
    CVGate();
    void wait();
    void open();
};

inline CVGate::CVGate()
    : lock_(),
      cv_(),
      flag_(true)
{}

inline void CVGate::wait()
{
    std::unique_lock<std::mutex> lk(lock_);
    cv_.wait(lk, [&](){ return !flag_; });
}

inline void CVGate::open()
{
    std::unique_lock<std::mutex> lk(lock_);
    flag_ = false;
    cv_.notify_all();
    std::this_thread::yield();
}

#endif
#include "cvgate.h" // Test that header is complete by including
This has the same interface as SpinGate, and is used exactly the same way.
Running it shows:
Open the gate in 1 second: 
Thread 0 waited 54845ns
Thread 1 waited 76125ns
Thread 2 waited 91977ns
Thread 3 waited 128816ns
The overhead of the mutex and condition variable is significant. On the other hand, the system load while it’s waiting is much lower. SpinGate will use all available CPU, while CVGate yields, so useful work can be done by the rest of the system.
However, for the use I was originally looking at, releasing threads for maximal overlap, spinning is clearly better. There is much less overlap as the cv-blocked threads are woken up.
3 Building and Running
cmake_minimum_required(VERSION 3.5)
set(CMAKE_LEGACY_CYGWIN_WIN32 0)

project(SpinGate C CXX)

set(THREADS_PREFER_PTHREAD_FLAG ON)
find_package(Threads REQUIRED)

set(CMAKE_EXPORT_COMPILE_COMMANDS ON)

set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -std=c++14 -ftemplate-backtrace-limit=0")
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -Wall -Wextra -march=native")
set(CMAKE_CXX_FLAGS_DEBUG "-O0 -fno-inline -g3")
set(CMAKE_CXX_FLAGS_RELEASE "-Ofast -g0 -DNDEBUG")

include_directories(${CMAKE_CURRENT_SOURCE_DIR})

add_executable(spingate main.cpp spingate.cpp)
add_executable(cvgate cvmain.cpp cvgate.cpp)

target_link_libraries(spingate Threads::Threads)
target_link_libraries(cvgate Threads::Threads)
And here we build a release version of the test executable:
mkdir -p build
cd build
cmake -DCMAKE_BUILD_TYPE=Release ../
make
-- Configuring done
-- Generating done
-- Build files have been written to: /home/sdowney/src/spingate/build
[ 50%] Built target cvgate
[100%] Built target spingate