## An Experiment Collects Samples

I’m modelling this in order to run bits of code like the various litmus tests used to describe multi-core architectures. A set of functions to be run in parallel that may or may not write to a result, which type is a property of the Test being run. The Experiment will run the Test collecting Samples. The Test type will provide a tuple of functions to run. They will be run under a spingate in all permutations in order to remove scheduling bias.

## What a Test looks like

```class MP { // Message Passing
int x_;
int y_;

public:
typedef std::tuple<int> Result;
MP();
void t1();

auto actions() {
return std::make_tuple([this]() { t1(); },
[this](Result& result) { t2(result); });
}
};
```

The Test interface must provide a Result type, and an actions() member that will produce a tuple of functions to run which either take no arguments or a reference to a result.

The test being defined here is the basic Message Passing litmus test.

```MP::MP() : x_(0), y_(0) {}

void MP::t1() {
x_ = 1;
y_ = 1;
}

while (!y) {
}
}
```

Two variables are initialized to 0. One thread stores 1 to x first, then to 1 to y. The other thread loops until it reads a non-zero in y, and then reads x. The value in x is the message being passed between threads.

In an actual test, the variables would be atomics, specifiying load and store strength, and the variables might have constraints on layout to help sharing cache line updates.

## An Experiment

An Experiment samples a test a number of times. It takes the result of each sample, and puts in a map of the results to count, incrementing the count for each distinct result. The actions to run are permuted each time, to help remove bias about which action is loaded behind the spingate first.

```void Experiment::run(size_t count) {
using Actions = decltype(std::declval<Test>().actions());
auto getters = tupleutil::tuple_getters<Actions>();
for (size_t i = 0; i < count; ++i) {
Sample<Test> sample;
sample.run(getters);
resultMap_[sample.result_]++;
std::next_permutation(getters.begin(), getters.end());
}
}

```

`tupleutil::tuple_getters` returns an array of getters each of which returns a std::variant<Types…> with the same parameter pack as the tuple.

Sample runs all of the actions in a batch that locks them behind a spingate, and collects the results for each action.

```template <class Test> class Sample {
public:
Batch                 batch_;
Test                  test_;
typename Test::Result result_;

template <typename V, size_t I> void run(std::array<V, I> const& getters) {
auto const& actions = test_.actions();
batch_.run();
}
};
```

Add is a templated member function that loops over the array, uses the getter to pull a function out of the tuple of actions and visits that with a lambda that will add either the function with no arguments, or that function with a reference to the results, to the batch.

```    template <typename Tuple, typename Variant, size_t I>
void add(Tuple const& actions, std::array<Variant, I> const& getters) {
auto adder = [this](auto&& f) {
using F = std::remove_cv_t<std::remove_reference_t<decltype(f)>>;
if constexpr (std::is_invocable_v<F>) {
} else {
}
};
for (auto&& get_n : getters) {
}
return;
}
```

I am a bit dissatisfied with the else case not being constexpr if followed by a static assert, but getting the condition right didn’t work the obvious way, so I punted. There will be a compiler error if f(result_) can’t actually be called by the batch.

## Batch recapped:

The key bit of code is

```template <class Function, class... Args>
void Batch::add(Function&& f, Args&&... args) {
workers_.emplace_back([ this, f = std::forward<Function>(f), args... ]() {
gate_.wait();
f(args...);
});
}

```

Batch has a spingate and runs all of the functions that are added sitting behind it. The `run()` function opens the gate and joins all the worker threads.

```void Batch::run() {
gate_.open();
for (auto& thr : workers_) {
thr.join();
}
}

```

## Summary

With all the machinery in place, the test infrascructure can aggressively run multi-threaded tests, giving the thread scheduler the best opportunity to run all of the actions in any order. This allows multi thread bugs to be shaken out by looking for surprising results from the experiment.

## Source Code

Exported from an org-mode doc, experiment.org, which is available, with all of the source on github at SpinGate.

# Why std::bind can’t be (formally) deprecated

## Yes: std::bind should be replaced by lambda

For almost all cases, `std::bind` should be replaced by a lambda expression. It’s idiomatic, and results in better code. There is almost no reason post C++11 to use `std::bind`.

Doing so is quite straightforward, capture each bind argument by value in the lambda capture list, and provide auto parameters for each of the placeholders, then call the bound callable using `std::invoke()`. That will handle the cases of member function pointers, as well as regular functions. Now, this is how to do it mechanically, if you were doing this as part of a manual refactoring, the lambda can be made even clearer.

```#include <functional>
#include <iostream>

void f(int n1, int n2, int n3) {
std::cout << n1 << ' ' << n2 << ' ' << n3 << '\n';
}

int main() {
using namespace std::placeholders;
int n = 5;
auto f1 = std::bind(f, 2, n, _1);
f1(10); // calls f(2, 5, 10);

auto l1 = [ p1 = 2, p2 = n ](auto _1) { return std::invoke(f, p1, p2, _1); };
l1(10);

// idiomatically
auto l1a = [=](auto _1){return f(2, n, _1);};
l1a(10);

auto f2 = std::bind(f, 2, std::cref(n), _1);
auto l2 = [ p1 = 2, p2 = std::cref(n) ](auto _1) {
return std::invoke(f, p1, p2, _1);
};
// or
auto l2a = [ p1 = 2, &p2 = n ](auto _1) {
return std::invoke(f, p1, p2, _1);
};
// more idiomatically
auto l2b = [&](auto _1){f(2, n, _1);};

n = 7;
f2(10); // calls f(2, 7, 10);

l2(10);
l2a(10);
l2b(10);
}
```

Which results in:

```2 5 10
2 5 10
2 5 10
2 7 10
2 7 10
2 7 10
2 7 10

```

## No: std::bind provides one thing lambda doesn’t

The expression `std::bind` evaluates flattens `std::bind` sub-expressions, and passes the same placeholder parameters down. A nested bind is evaluated with the given parameters, and the result is passed in to the outer bind. So you can have a bind that does something like `g( _1, f(_1))`, and when you call it with a parameter, that same value will be passed to both g and f. The function g will receive `f(_1)` as its second parameter.

Now, you could rewrite the whole thing as a lambda, but `auto` potentially makes this a little more difficult. The result of `std::bind` is an unutterable type. They weren’t supposed to be naked. However, `auto` means the expression could be broken down into parts, meaning that the translation from a `std::bind` expression to a lambda expression is potentially not mechanical. Or, the bind could be part of a template, where the subexpression is a template parameter, which is likely working by accident, rather than design.

In any case, `std::bind` does not treat its arguments uniformly. It treats a bind expression distinctly differently. At the time, it made some sense. But it makes reasoning about bind expressions difficult.

Don’t do this. But it is why formally deprecating `std::bind` is difficult. They can be replaced, but not purely mechanically.

There isn’t a simple translation that works, unlike converting from `std::auto_ptr` to `std::unique_ptr`, or putting a space after a string where it now looks like a conversion. And, `std::bind` isn’t broken. It’s sub-optimal because of the complicated machinery to support all of the flexibility, where a lambda allows the compiler to do much better. Also, since the type isn’t utterable, it often ends up in a std::function, which erases the type, removing optimization options.

## Example of fail code

```#include <functional>
#include <iostream>

void f(int n1, int n2, int n3)
{
std::cout << n1 << ' ' << n2 << ' ' << n3 << '\n';
}

int g(int n1) { return n1; }

int main()
{
using namespace std::placeholders;

auto g1 = std::bind(g, _1);
auto f2 = std::bind(f, _1, g1, 4);
f2(10); // calls f(10, g(10), 4);

// THIS DOES NOT WORK
// auto l2 = [p1 = g1, p2 = 4](auto _1) {std::invoke(f, _1, p1, p2);};
// l2(10);

// The bind translation needs to be composed:
auto l1 = [](auto _1){return g(_1);};
auto l2 = [p1 = l1, p2 = 4](auto _1){f(_1, p1(_1), p2); };
// idiomatically
auto l2a = [](auto _1) { return f(_1, g(_1), 4);};
l2(10);
l2a(10);
}

```
```10 10 4
10 10 4
10 10 4

```

## TODO

If someone can figure out a fixit recommendation that could be safely applied, transforming the old bind to a lambda, then `std::bind` could be deprecated in C++Next, and removed as soon as C++(Next++). But that right now is non-trivial in some cases.

• Fix incorrect statement about type-erasure in std::bind. I was thinking std::function
• Add more idiomatic transliterations of the std::bind lambdas

## Building and running the examples

### Makefile

```clean:
-rm example1
-rm example2

example1: example1.cpp
clang++ --std=c++1z example1.cpp -o example1

example2: example2.cpp
clang++ --std=c++1z example2.cpp -o example2

example3: example3.cpp
clang++ --std=c++1z example3.cpp -o example3 2>&1

all: example1 example2

```

### Build

```rm example1
rm example2
clang++ --std=c++1z example1.cpp -o example1
clang++ --std=c++1z example2.cpp -o example2

```

### Original source

Original document is available on Github

# Accessing the elements of a tuple as variant

A further digression, because it turns out I want to be able to permute a tuple at run time. That means treating the element of a tuple generically. And I can just barely do this, for some tuples, in c++17.

So a slight digression into ADTs. Which in this case means Algebraic Data Types, not Abstract Data Types. But just algebra. No calculus, or differentiation, of data types. Not today.

## 1 Tuple is Product, Variant is Sum

### 1.1 Products

In algebra, we usually start out with addition. It’s simpler. But for types, multiplication, or product, is in many ways much more natural. Your basic struct, record, etc is a natural product of types. A type is some kind of collection of things. And I’m being a bit vague here because this is right in the area where set seems like a good idea, and then we get into sets of sets, sets that might contain themselves, and barbers who shave all the people who don’t shave themselves. There is rigour, but I don’t really want to have to go there.

But, if we start with the idea that a type is a collection of things, and that we don’t look to closely at the infinities, we are not going to be terribly wrong. So a type is a way of describing if a thing is in or out of the collection.

Now, I could pretend we don’t know what a struct is. Start with pairs, where there are no names of the components of the struct, and build that up. But we all have a notion of struct. It’s an ordered collection of types. The instances of the struct are all of the elements of each type contained in the struct, matched up with all of the other elements of all the other types in the struct. Known as the Cartesion product. So if you have a type A, and a type B, the collection of things in struct {A a; B b;} is the cross of As and Bs. That is {{a1, b1}, {a1, b2}, {a1, b3}, … , {a2, b1}, {a2, b2}, … {an, b1}, … {an, bm}} is all of the elements that are part of the type struct {A a; B b;}. The cardinality of {A, B} is the product of the cardinalities of A and B.

Structs are very natural in C++, but hard to deal with generically, so there’s a type that does it all for you, std::tuple. Getting at the parts of the tuple is a little more difficult that with a struct. You have to say std::get<0>(tuple), or std::get<int>(tuple). And the second might not even compile, if the tuple has more than one int. But you get tools for composing and decomposing tuples at compile time. And std::tuple lets you put pretty much any C++ type into the tuple, only restricting you when you try to, e.g. move a tuple that has an element that can’t be moved.

There should also be a type that acts as a unit for the product, the equivalent of 1 for multiplication. The empty tuple can work as a unit. It contains any of the list of no types. This implies that all empty tuples are equivalent, so its cardinality is 1. There can be only one. The product of a type with the empty tuple is entirely equivalent to the the type itself. There are no additional elements in the type, and you can convert back and forth between them. They are isomorphic, having the same shape.

Isomorphisms are important in talking about types, because most of the time we can’t actually distinguish between isomorphic types, at least for proving things. The phrase “up to isomorphism” shows up a lot. To be isomorphic means that we can write a transformation $X$ from type A to type B, and a reverse transformation $Y$ from type B to type A, such that $Y(X(a)) == a$ for all a, and that for any function from a1 to a2, there is an equivalent function from b1 to b2. We could mechanically replace instances of a with the appropriate b and add calls to X and Y without changing the behavior of a program.

### 1.2 Sums

The other basic algebraic type is the sum type. The corresponding primitive in C++ is a union, with one difference. In most type systems, the sum type automatically remembers which of the allowed types is in it. A union doesn’t, so the standard technique is to embed the union in a struct that carries a tag saying which type in the union was most recently written, and can be read from. I’ll be ignoring type-punning schemes allowing a read of a different type than was written.

So a Sum type of type A and type B is the union of all of the things in A and all of the things in B. {a1, a2, a3, … , an, b1, b2, … , bm}. The cardinality of is the sum of the cardinalities of A and B.

The unit type of the sum is equivalent to zero. The empty sum type, although a valid type, has no elements in the type. It’s like the empty set. It’s often known as Void, where the unit for product is often called Unit. It may also be known as Bottom, where that is a computation that never completes. Since there are no elements of the type Void, it can’t be instantiated. And a product of Void and any other type is equivalent to Void. The c++ type `void` is related, but not exactly the same, because it also represents an empty argument list, a function that returns, but does not return any value (a subroutine), and is also functions as the universal pointer.

C++17 recently standardized a sum type to go with the long standardized std::tuple, std::variant. Std::variant remembers which of the alternative types was last set. It is almost never empty, only so if a write into one of the alternatives threw an exception. It is not allowed to hold `void`, references, arrays, or to contain no types. This is a bit unfortunate, because except for `void` std::tuple can do all of those things.

There were several competing models for what std::variant should be, with various tradeoffs being made. It was always clear that std::tuple had to be able to represent everything a struct can, and in fact there are now language features to destructure a struct into a tuple. There is no equivalent model for sum types. Union can’t hold anything but trivial types because there is no mechanism to track what to do on destruction, since there is no built-in mechanism to determine what the union was last written as.

One of the popular models for variant rises out of database-like interfaces. Even though databases are internally strongly typed, SQL queries are not. And the model of sending text over and getting some kind of response back makes it difficult to expose that to a host language. Particularly when the database schema may change, the query still be perfectly valid, but no longer return the same types. However, since we do know there is a relatively short list of permitted types in the database, a variant that allows just those types and the ability to query what type was returned can be quite useful, and not terribly hard to implement. There are JSON parsers taking similar approaches, only with the addition that a JSON type may have JSON objects contained in them recursively, and those have to be outside the object somehow, or the size of the object is unbounded.

From the implementors point of view, supporting pointers and arrays is a huge amout of extra work. Not allowing an array to decay to a pointer is quite difficult. References have issues when treated generically. Not to mention that references have decidely odd semantics in the differences between construction and assignment. And the degenerate case of an empty variant was also difficult. If that needs to be represented, the type std::monostate has been introduced, which is a type designed to have exactly one item in it, so that all instances of std::monostate are identical. This is also the same as the unit type for product types. It’s not an accident that it’s represented in Haskell as (), which is the empty tuple. All empty lists are equivalent. It could have been `std::tuple<>`, but no one in the room happened to think of that.

## 2 Tuple is a Heterogenous Container, what is the iterator?

The C++ standard says “tuples are heterogeneous, fixed-size collections of values” – [tuple.general]. Collections generally have iterator types associated with them, but that’s a bit of a challenge since the iterator model in C++ assumes that for a collection, the type of *(Collection<T>::iterator) is T. But if the collection isn’t on T, but on Types…, you doesn’t quite work to say *(Collection<typename… Types>) is of type …Types. You need something to hold that. But in many cases, std::variant can work. It doesn’t quiet work, since we’d really need a variant of references to the elements of the tuple, so that they could be written to. However, for many purposes we can come close. For the case I was looking at, making copies is perfectly fine. What I’m looking for is something roughly with the signature

```template <typename... Types
auto getElement(size_t i, std::tuple<Types...> tuple) -> std::variant<Types...>;
```

That is, something that will get me the ith element of a tuple, as a variant with the same typelist as the tuple, with the index determined at runtime. All of the normal accessors are compile time. So need to do something that will make the compile time information available at runtime.

Start with something I do know how to do, idiomatically printing a tuple.

```template <typename Func, typename Tuple, std::size_t... I>
void tuple_for_each_impl(Tuple&& tuple, Func&& f, std::index_sequence<I...>)
{
auto swallow = {0,
(std::forward<Func>(f)(
I, std::get<I>(std::forward<Tuple>(tuple))))...};
(void)swallow;
}

template <typename Func, typename... Args>
void tuple_for_each(std::tuple<Args...> const& tuple, Func&& f)
{
tuple_for_each_impl(tuple, f, std::index_sequence_for<Args...>{});
}

template <typename... Args>
void print(std::ostream& os, std::tuple<Args...> const& tuple)
{
auto printer = [&os](auto i, auto el) {
os << (i == 0 ? "" : ", ") << el;
return 0;
};
return tuple_for_each(tuple, printer);
}
```

Actually, a bit more complicated than the totally standard idiom, since it factors out the printer into a application across the tuple, but it’s not much more compilcated. The tuple_for_each constructs an index sequence based on the argument list, and delegates that to the impl, which uses it to apply the function to each element of the tuple. The _impl ought to be in a nested detail namespace, so as not to leak out. Swallow is the typical name for using an otherwise unnamed, and uninteresting, type to apply something to each element of the tuple for a side-effect. The void cast is to make sure the variable is used, and is evaluated.

The next step is, instead of an application of a function for its side-effect, instead a mapping of the tuple, returning the transformed tuple.

```template <typename Func, typename Tuple, std::size_t... I>
auto tuple_transform_impl(Tuple&& tuple, Func&& f, std::index_sequence<I...>)
{
return std::make_tuple(
std::forward<Func>(f)(std::get<I>(std::forward<Tuple>(tuple)))...);
}

template <typename Func, typename... Args>
auto tuple_transform(std::tuple<Args...>&& tuple, Func&& f)
{
return tuple_transform_impl(tuple, f, std::index_sequence_for<Args...>{});
}

template <typename Func, typename... Args>
auto tuple_transform(std::tuple<Args...> const& tuple, Func&& f)
{
return tuple_transform_impl(tuple, f, std::index_sequence_for<Args...>{});
}
```

Because the std::tuple is not a template parameter, I have to supply a const& and a forwarding-reference form to cover both cases. And I’m ignoring volatile quals. The _impl function uses forwarding-reference parameters, which will decay or forward properly using std::forward. Using it is straightforward.

```std::tuple<int, double, long> t = std::make_tuple(1, 2.3, 1l);
auto transform = tupleutil::tuple_transform(t,
[](auto i) { return i + 1; });

EXPECT_EQ(3.3, std::get<1>(transform));

auto t2 = tupleutil::tuple_transform(std::make_tuple(4, 5.0),
[](auto i) { return i + 1; });
EXPECT_EQ(6, std::get<1>(t2));
```

So, for functions over all the types in a tuple, tuple is a Functor. That is, we can apply the function to all elements in the tuple, and it’s just like making a tuple out of applying the functions to elements before making the tuple. If this sounds like a trivial distinction, you are mostly right. Almost all container-ish things are Functors, and a few non-containerish things are also. Plus Functor sounds more impressive.

The transform also suggests a way of solving the problem I was originally looking at. An array of the elements of the tuple each as a variant will let me permute them with std tools.

```template <typename... Args, std::size_t... I>
constexpr std::array<std::variant<Args...>, sizeof...(Args)>
tuple_to_array_impl(std::tuple<Args...> const& tuple,
std::index_sequence<I...>)
{
using V = std::variant<Args...>;
std::array<V, sizeof...(Args)> array = {
{V(std::in_place_index_t<I>{}, std::get<I>(tuple))...}};
return array;
}

template <typename... Args>
constexpr std::array<std::variant<Args...>, sizeof...(Args)>
tuple_to_array(std::tuple<Args...> const& tuple)

{
return tuple_to_array_impl(tuple, std::index_sequence_for<Args...>{});
}
```

And that can be used something like:

```TEST(TupleTest, to_array)
{
constexpr std::tuple<int, double, long> t = std::make_tuple(1, 2.3, 1l);
auto arr = tupleutil::tuple_to_array(t);
int  i   = std::get<int>(arr[0]);
EXPECT_EQ(1, i);
}

TEST(TupleTest, to_array_repeated)
{
constexpr std::tuple<int, int, int> t = std::make_tuple(1, 2, 3);
auto arr = tupleutil::tuple_to_array(t);
int  i   = std::get<2>(arr[2]);
EXPECT_EQ(3, i);
}
```

The second test is there because I was about to write, “as you can see, we can tell the differece between variants holding the same type”, except that wasn’t true. The original version of to_ar

```    constexpr std::tuple<int, double, long> t = std::make_tuple(1, 2.3, 1l);
std::variant<int, double, long> v0{1};
auto v = tupleutil::get(0, t);
EXPECT_EQ(v0, v);
```

ray didn’t use the constructor form with std::in_place_index_t. The code I ended up with did, but not at this point. There’s nothing like writing out what something is supposed to do to make you look and keep you honest.

So here, we’re constructing an array of std::variant<Args…> and constructing each member with the argument pack expansion into the std::variant constructor using the Ith index value to get that element of the tuple, and recording that we’re constructing the ith alternative of the variant. The second test checks that. The 2nd element of the array must be the 2nd variant of the tuple, and can be retrieved only by std::get<2>().

This would allow me to permutate the elements of a tuple, but I’m fairly close now to being able to writing a version that allows choice of the element at runtime, rather than at compile time.

```    constexpr std::tuple<int, double, long> t = std::make_tuple(1, 2.3, 1l);
std::variant<int, double, long> v0{1};
auto v = tupleutil::get(0, t);
EXPECT_EQ(v0, v);
```

What I’m going to do is construct an array of the getters for the tuple, each of which will return the element wrapped in a variant. The signature of the array will be of function pointer type, because, quite conveniently, a non-capturing lambda can decay to a function pointer.

First getting the array of getters for the tuple

```template <typename V, typename T, size_t I> auto get_getter()
{
return [](T const& t) {
return V{std::in_place_index_t<I>{}, std::get<I>(t)};
};
}

template <typename... Args, std::size_t... I>
auto tuple_getters_impl(std::index_sequence<I...>)
{
using V = std::variant<Args...>;
using T = std::tuple<Args...>;
using F = V (*)(T const&);
std::array<F, sizeof...(Args)> array
//        = {{[](T const& tuple){return V{std::get<I>(tuple)};}...}};
= {{get_getter<V, T, I>()...}};
return array;
}

template <typename... Args> auto tuple_getters(std::tuple<Args...>)
{
return tuple_getters_impl<Args...>(std::index_sequence_for<Args...>{});
}
```

So first a function that returns a function that constructs a variant around the value of what’s returned from std::get<I>. Well, it could return anything that happens to have a constructor that takes a an in_place_index_t, take as the thing to be converted something that std::get<I> can extract from. This is actually a separate function because GCC was unhappy doing the template parameter pack expansion inline in the _impl function. Clang was happy with the expansion noted in the comment. I really have no idea who is wrong here, and the workaround was straight forward. The array is one of function pointers, which the returned lambdas can decay to.

Now the only remaining trick is to use this array as a table to dispatch to the appropriate getter for the tuple.

```const auto get = [](size_t i, auto t) {
static auto tbl = tupleutil::tuple_getters(t);
return tbl[i](t);
};
```

Get the array as a static, so we only need to computer it once, and simply return `tbl[i](t)`

```TEST(TupleTest, gettersStatic)
{
constexpr std::tuple<int, double, long> t = std::make_tuple(1, 2.3, 1l);
std::variant<int, double, long> v0{1};
auto v = tupleutil::get(0, t);
EXPECT_EQ(v0, v);

int  i = std::get<0>(v);
EXPECT_EQ(1, i);

auto v2 = tupleutil::get(1, t);

EXPECT_EQ(1ul, v2.index());
double d = std::get<double>(v2);

EXPECT_EQ(2.3, d);

constexpr auto t2 = std::make_tuple(2.4, 1l);
auto           v3 = tupleutil::get(0, t2);
double         d2 = std::get<double>(v3);

EXPECT_EQ(2.4, d2);
}
```

## 3 Source

All source is available at TupleUtil on GitHub, including org source for this post.

# Cross Compiling

## 1 Setting up Cross Compiling

In order to test out some of these multi-threaded tool properly, I really need to run them on a less strict platform than x86_64. X86_64 provides a lot of guarantees about sequential consistency and atomicity that hides problems that will happen on architectures that are not as strong, like power, sparc, and arm. Fortunately, one of the toys I have is a recent Raspberry Pi 3, which is based on a recent arm chip. Unfortunately, Raspbian, the normal linux distro for the Raspberry Pi is also based on a fairly old debian distro, with a fairly old compiler. Linaro is back porting their arm code genaration fixes to the old releases, but I’m more interested in the recent C++ language features. So I could attempt to compile GCC 6 on the RPi, or I can cross compile from my normal machine. I decided to cross compile, since if that worked, it would be considerably easier. It turnd out to be pretty straightfoward.

```sudo apt-get install g++-6-arm-linux-gnueabihf
```

This is mostly because I’m already doing software development on the box, so I didn’t need any of the other parts of the compiler ecosystem, just the right c++ toolchain. The hardest part is determining the right one. There are a few flavors for arm development. The RPi is the gnu extended abi, with hardware float. The Ubuntu repositories only supply linux variants, which is sensible. Since that top level package ends up installing not just the compilers, but a libstdc++ and libc for arm-linux-gnueabihf, which need to know much more about the OS in order to interface with it.

This does lead to one snag, though. The versions of the libraries are not the ones available on the RPi. Which is a problem, since I want to use modern, or maybe even post-modern C++. There are two ways of dealing with this, and I’ve ended up using both.

## 2 Sysroot

When cross compiling, a sysroot is a system that looks just like the root file system of the target platform. It will have /lib, /usr/lib, etc, with the versions of the libraries that you want. You can either use a disk image, mounted somewhere convienent, or you can just mount the target computer’s root filesystem somewhere convienent. If you do that, you’ll have access to all of the libraries available, not just the minimal set typically available on a prepackaged sysroot. So that’s what I did.

```sshfs sdowney@cobweb.local:/ /home/sdowney/mnt/rpi/ -o transform_symlinks -o allow_other
```

Cobweb is my Raspberry Pi box, and zeroconf makes the current ip address available as cobweb.local. I’m mounting that into ~/mnt/rpi, transforming symlinks so that they actually work, and allowing others to access the mounted fs.

With that I can specify the sysroot, and have the compiler look there for libraries:

```arm-linux-gnueabihf-g++-6 -v --sysroot ~/mnt/rpi/ -o hello hw.cpp
```

That spits out all of what the compiler driver invokes, and as a byproduct, a bunch of what is needed to set up cross compiling with other compilers, like clang. The key things to look for are the include directories called out by “#include <…> search starts here”, and the LIBRARY_PATH variable that helps define what the linker does. I’ll be pulling those out for the clang cross compile cmake toolchain file.

```Using built-in specs.
COLLECT_GCC=arm-linux-gnueabihf-g++-6
COLLECT_LTO_WRAPPER=/usr/lib/gcc-cross/arm-linux-gnueabihf/6/lto-wrapper
Target: arm-linux-gnueabihf
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 6.2.0-5ubuntu12' --with-bugurl=file:///usr/share/doc/gcc-6/README.Bugs --enable-languages=c,ada,c++,java,go,d,fortran,objc,obj-c++ --prefix=/usr --program-suffix=-6 --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-libitm --disable-libquadmath --enable-plugin --with-system-zlib --disable-browser-plugin --enable-java-awt=gtk --enable-gtk-cairo --with-java-home=/usr/lib/jvm/java-1.5.0-gcj-6-armhf-cross/jre --enable-java-home --with-jvm-root-dir=/usr/lib/jvm/java-1.5.0-gcj-6-armhf-cross --with-jvm-jar-dir=/usr/lib/jvm-exports/java-1.5.0-gcj-6-armhf-cross --with-arch-directory=arm --with-ecj-jar=/usr/share/java/eclipse-ecj.jar --disable-libgcj --enable-objc-gc --enable-multiarch --enable-multilib --disable-sjlj-exceptions --with-arch=armv7-a --with-fpu=vfpv3-d16 --with-float=hard --with-mode=thumb --disable-werror --enable-multilib --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=arm-linux-gnueabihf --program-prefix=arm-linux-gnueabihf- --includedir=/usr/arm-linux-gnueabihf/include
gcc version 6.2.0 20161005 (Ubuntu 6.2.0-5ubuntu12)
COLLECT_GCC_OPTIONS='-v' '-o' 'hello' '-shared-libgcc' '-march=armv7-a' '-mfloat-abi=hard' '-mfpu=vfpv3-d16' '-mthumb' '-mtls-dialect=gnu'
/usr/lib/gcc-cross/arm-linux-gnueabihf/6/cc1plus -quiet -v -imultiarch arm-linux-gnueabihf -isysroot /home/sdowney/mnt/rpi/ -D_GNU_SOURCE hw.cpp -quiet -dumpbase hw.cpp -march=armv7-a -mfloat-abi=hard -mfpu=vfpv3-d16 -mthumb -mtls-dialect=gnu -auxbase hw -version -fstack-protector-strong -Wformat -Wformat-security -o /tmp/ccUwr5Jd.s
GNU C++14 (Ubuntu 6.2.0-5ubuntu12) version 6.2.0 20161005 (arm-linux-gnueabihf)
compiled by GNU C version 6.2.0 20161005, GMP version 6.1.1, MPFR version 3.1.5, MPC version 1.0.3, isl version 0.15
GGC heuristics: --param ggc-min-expand=100 --param ggc-min-heapsize=131072
ignoring nonexistent directory "/home/sdowney/mnt/rpi/usr/local/include/arm-linux-gnueabihf"
#include "..." search starts here:
#include <...> search starts here:
/usr/lib/gcc-cross/arm-linux-gnueabihf/6/../../../../arm-linux-gnueabihf/include/c++/6
/usr/lib/gcc-cross/arm-linux-gnueabihf/6/../../../../arm-linux-gnueabihf/include/c++/6/arm-linux-gnueabihf
/usr/lib/gcc-cross/arm-linux-gnueabihf/6/../../../../arm-linux-gnueabihf/include/c++/6/backward
/usr/lib/gcc-cross/arm-linux-gnueabihf/6/include
/usr/lib/gcc-cross/arm-linux-gnueabihf/6/include-fixed
/usr/lib/gcc-cross/arm-linux-gnueabihf/6/../../../../arm-linux-gnueabihf/include
/home/sdowney/mnt/rpi/usr/include/arm-linux-gnueabihf
/home/sdowney/mnt/rpi/usr/include
End of search list.
GNU C++14 (Ubuntu 6.2.0-5ubuntu12) version 6.2.0 20161005 (arm-linux-gnueabihf)
compiled by GNU C version 6.2.0 20161005, GMP version 6.1.1, MPFR version 3.1.5, MPC version 1.0.3, isl version 0.15
GGC heuristics: --param ggc-min-expand=100 --param ggc-min-heapsize=131072
Compiler executable checksum: 8867fa57a9cbba18ebd7880e42ca78ba
COLLECT_GCC_OPTIONS='-v' '-o' 'hello' '-shared-libgcc' '-march=armv7-a' '-mfloat-abi=hard' '-mfpu=vfpv3-d16' '-mthumb' '-mtls-dialect=gnu'
/usr/lib/gcc-cross/arm-linux-gnueabihf/6/../../../../arm-linux-gnueabihf/bin/as -v -march=armv7-a -mfloat-abi=hard -mfpu=vfpv3-d16 -meabi=5 -o /tmp/ccJH2IA5.o /tmp/ccUwr5Jd.s
GNU assembler version 2.27 (arm-linux-gnueabihf) using BFD version (GNU Binutils for Ubuntu) 2.27
COMPILER_PATH=/usr/lib/gcc-cross/arm-linux-gnueabihf/6/:/usr/lib/gcc-cross/arm-linux-gnueabihf/6/:/usr/lib/gcc-cross/arm-linux-gnueabihf/:/usr/lib/gcc-cross/arm-linux-gnueabihf/6/:/usr/lib/gcc-cross/arm-linux-gnueabihf/:/usr/lib/gcc-cross/arm-linux-gnueabihf/6/../../../../arm-linux-gnueabihf/bin/
LIBRARY_PATH=/usr/lib/gcc-cross/arm-linux-gnueabihf/6/:/usr/lib/gcc-cross/arm-linux-gnueabihf/6/../../../../arm-linux-gnueabihf/lib/../lib/:/home/sdowney/mnt/rpi/lib/arm-linux-gnueabihf/:/home/sdowney/mnt/rpi/lib/../lib/:/home/sdowney/mnt/rpi/usr/lib/arm-linux-gnueabihf/:/home/sdowney/mnt/rpi/usr/lib/../lib/:/usr/lib/gcc-cross/arm-linux-gnueabihf/6/../../../../arm-linux-gnueabihf/lib/:/home/sdowney/mnt/rpi/lib/:/home/sdowney/mnt/rpi/usr/lib/
COLLECT_GCC_OPTIONS='-v' '-o' 'hello' '-shared-libgcc' '-march=armv7-a' '-mfloat-abi=hard' '-mfpu=vfpv3-d16' '-mthumb' '-mtls-dialect=gnu'
/usr/lib/gcc-cross/arm-linux-gnueabihf/6/collect2 -plugin /usr/lib/gcc-cross/arm-linux-gnueabihf/6/liblto_plugin.so -plugin-opt=/usr/lib/gcc-cross/arm-linux-gnueabihf/6/lto-wrapper -plugin-opt=-fresolution=/tmp/cctgBCzX.res -plugin-opt=-pass-through=-lgcc_s -plugin-opt=-pass-through=-lgcc -plugin-opt=-pass-through=-lc -plugin-opt=-pass-through=-lgcc_s -plugin-opt=-pass-through=-lgcc --sysroot=/home/sdowney/mnt/rpi/ --build-id --eh-frame-hdr -dynamic-linker /lib/ld-linux-armhf.so.3 -X --hash-style=gnu --as-needed -m armelf_linux_eabi -z relro -o hello /usr/lib/gcc-cross/arm-linux-gnueabihf/6/../../../../arm-linux-gnueabihf/lib/../lib/crt1.o /usr/lib/gcc-cross/arm-linux-gnueabihf/6/../../../../arm-linux-gnueabihf/lib/../lib/crti.o /usr/lib/gcc-cross/arm-linux-gnueabihf/6/crtbegin.o -L/usr/lib/gcc-cross/arm-linux-gnueabihf/6 -L/usr/lib/gcc-cross/arm-linux-gnueabihf/6/../../../../arm-linux-gnueabihf/lib/../lib -L/home/sdowney/mnt/rpi/lib/arm-linux-gnueabihf -L/home/sdowney/mnt/rpi/lib/../lib -L/home/sdowney/mnt/rpi/usr/lib/arm-linux-gnueabihf -L/home/sdowney/mnt/rpi/usr/lib/../lib -L/usr/lib/gcc-cross/arm-linux-gnueabihf/6/../../../../arm-linux-gnueabihf/lib -L/home/sdowney/mnt/rpi/lib -L/home/sdowney/mnt/rpi/usr/lib /tmp/ccJH2IA5.o -lstdc++ -lm -lgcc_s -lgcc -lc -lgcc_s -lgcc /usr/lib/gcc-cross/arm-linux-gnueabihf/6/crtend.o /usr/lib/gcc-cross/arm-linux-gnueabihf/6/../../../../arm-linux-gnueabihf/lib/../lib/crtn.o
COLLECT_GCC_OPTIONS='-v' '-o' 'hello' '-shared-libgcc' '-march=armv7-a' '-mfloat-abi=hard' '-mfpu=vfpv3-d16' '-mthumb' '-mtls-dialect=gnu'
```

Now, note that the compiler will prefer the locally installed versions before using the ones in the sysroot. This is fine, until I need to install something. Then I’ll get an error because the library on the RPi is too old. Particularly libstdc++. This works well for the non-core language libraries, though. Or at least ones that don’t have C++ in their interface. Mixing C++ versions is a horrible minefield. The easiest way to deal with it is to avoid it.

Recent versions of gcc allow libstdc++ to be linked statically. It increases the size of the resulting executable, but with less worries about deployment issues.

```-static-libstdc++
```

That will cause the compiler driver to direct the linker to prefer the static version of libstdc++, rather than the shared version. And I don’t have to worry about deploying or upgrading the system libraries on the target box.

Note, this isn’t really a supported deployment configuration. So any bugs are going to be my problem.

## 4 CMake

I’ve been using CMake to generate the build system, so I need to explain to it how to use the cross compiler instead of one for the host system. CMake has support for supplying definitions for these in Toolchain files. This is what I have so far

```SET(CMAKE_SYSTEM_NAME Linux)
SET(CMAKE_SYSTEM_VERSION 1)
SET(CMAKE_SYSROOT \$ENV{HOME}/mnt/rpi)

SET(CMAKE_C_COMPILER arm-linux-gnueabihf-gcc)
SET(CMAKE_CXX_COMPILER arm-linux-gnueabihf-g++)

SET(CMAKE_FIND_ROOT_PATH \$ENV{HOME}/mnt/rpi)
SET(CMAKE_FIND_ROOT_PATH_MODE_PROGRAM NEVER)
SET(CMAKE_FIND_ROOT_PATH_MODE_LIBRARY ONLY)
SET(CMAKE_FIND_ROOT_PATH_MODE_INCLUDE ONLY)

SET(CMAKE_CXX_FLAGS "-static-libgcc -static-libstdc++" CACHE STRING "CXX_FLAGS" FORCE)

"0"
CACHE STRING "Result from TRY_RUN" FORCE)
```

That, in addition to setting the compiler to use, forces a few CMake options that are otherwise problems. The first is setting the static link flag for libstdc++. The second is overriding the search for pthreads, because trying to run programs built with a cross compiler doesn’t work very well. This lies and forces the option.

Used like so

```cmake  -D CMAKE_TOOLCHAIN_FILE=~/src/toolchain/pi.cmake -DCMAKE_BUILD_TYPE=Release ..
```

A toolchain file for clang is a little more complicated, because it doesn’t really understand the gcc multilib layout, so it needs to be told where all the include and lib directories are for the target system, for both the C and C++ compiler.

```SET(CMAKE_SYSTEM_NAME Linux)
SET(CMAKE_SYSTEM_VERSION 1)
SET(CMAKE_SYSROOT \$ENV{HOME}/mnt/rpi)

set(triple arm-linux-gnueabihf)

set(CMAKE_C_COMPILER clang)
set(CMAKE_C_COMPILER_TARGET \${triple})
set(CMAKE_CXX_COMPILER clang++)
set(CMAKE_CXX_COMPILER_TARGET \${triple})

SET(CMAKE_FIND_ROOT_PATH \$ENV{HOME}/mnt/rpi)
SET(CMAKE_FIND_ROOT_PATH_MODE_PROGRAM NEVER)
SET(CMAKE_FIND_ROOT_PATH_MODE_LIBRARY ONLY)
SET(CMAKE_FIND_ROOT_PATH_MODE_INCLUDE ONLY)

SET(CMAKE_CXX_FLAGS "\
-isystem /usr/lib/gcc-cross/arm-linux-gnueabihf/6/../../../../arm-linux-gnueabihf/include/c++/6 \
-isystem /usr/lib/gcc-cross/arm-linux-gnueabihf/6/../../../../arm-linux-gnueabihf/include/c++/6/arm-linux-gnueabihf \
-isystem /usr/lib/gcc-cross/arm-linux-gnueabihf/6/../../../../arm-linux-gnueabihf/include/c++/6/backward \
-isystem /usr/lib/gcc-cross/arm-linux-gnueabihf/6/include \
-isystem /usr/lib/gcc-cross/arm-linux-gnueabihf/6/include-fixed \
-isystem /usr/lib/gcc-cross/arm-linux-gnueabihf/6/../../../../arm-linux-gnueabihf/include"
CACHE STRING "CXX_FLAGS" FORCE)

SET(CMAKE_C_FLAGS "\
-isystem /usr/lib/gcc-cross/arm-linux-gnueabihf/6/include \
-isystem /usr/lib/gcc-cross/arm-linux-gnueabihf/6/include-fixed \
-isystem /usr/lib/gcc-cross/arm-linux-gnueabihf/6/../../../../arm-linux-gnueabihf/include"
CACHE STRING "C_FLAGS" FORCE)

-L /usr/lib/gcc-cross/arm-linux-gnueabihf/6 \
-L /usr/lib/gcc-cross/arm-linux-gnueabihf/6/../../../../arm-linux-gnueabihf/lib/../lib \
-L /usr/lib/gcc-cross/arm-linux-gnueabihf/6/../../../../arm-linux-gnueabihf/lib \
-static-libgcc -static-libstdc++"

"0"
CACHE STRING "Result from TRY_RUN" FORCE)
```

## 5 Sources

Toolchain files are on Github next to the spingate sources, that now includes the org file that is the source for this entry, crosscompile.org.

# batch: running functions under a spingate

## 1 A batch of tasks to run

This adds a rather simple component to spingate orchestrating a batch of tasks to be run, gated by the spingate. The tasks are added one at a time, a thread is created for the task, and the thread waits on the spingate to open before calling the task.

Or at least that’s how it started. Task was originally a std::function<void()>, which is essentially the interface of the thread pool I use. I realized, however, that I don’t actually need to restrict the interface quite that much. Thread takes a much wider range of things to run, and I can do the same thing. I have to forward the supplied callable and arguments into the lambda that the thread is running.

The key bit of code is

```template <class Function, class... Args>
void Batch::add(Function&& f, Args&&... args) {
workers_.emplace_back([ this, f = std::forward<Function>(f), args... ]() {
gate_.wait();
f(args...);
});
}
```

There’s a lot of line noise in there, and it really looked simpler when it was just taking a std::function<void()>, but it’s not terrible. We take an object of type Function and a parameter pack of type Args by forwarding reference. That gets captured by the lambda, where we forward the function to the lambda, and capture the parameter pack. Inside the lambda we call the function with the pack, f(args). It’s probable that I should have used std::invoke there, which handles some of the more interesting cases of calling a thing with arguments. But this was sufficient unto the day. The captured this allows access to the gate_ variable the we’re waiting on. The workers_ are a vector of threads that we’ll later run run through and join() on, after open()ing the gate_.

```void Batch::run() {
gate_.open();
for (auto& thr : workers_) {
thr.join();
}
}
```

That’s really all there is to Batch. It’s a middle connective glue component. Does one thing, and tries to do it obviously well. That is important since I’m trying to build up test infrastructure, and testing the test infrastrucure is a hard problem.

I have reorganized the code repo in order to do some light testing, though.

## 2 GTest

I’ve pushed things about in the source repo, moving the code into a library directory, which means I can link it into the existing mains, as well as into new gtests. In the CMake system, I’ve conditioned building tests on the existence of the googletest project being available as a subdirectory. I use enough different compilers and build options that trying to use a system build of gtest just doesn’t work. The best, and recommended, choice, is to build googletest as part of your project. That way any ABI impacting subtlety, like using a different C++ standard library, is take care of automatically. The bit of cmake magic is in the top level CMakeLists.txt :

```# A directory to find Google Test sources.
else()
endif()
```

This looks for googletest to be available, and if it is, add it to the project, and my tests subdirectory, otherwise issue a message. I prefer this to attempting to fix up the missing gtest automatically. That always seems to cause me problems, such as when I’m operating disconnected, on a train, like right now.

The tests I have are pretty simple, not much more than primitive breathing tests.

```TEST_F(BatchTest, run1Test)
{
Batch batch;

EXPECT_EQ(0u, called);

batch.run();

EXPECT_EQ(1u, called);
}
```

or, to make sure that passing arguments worked

```TEST_F(BatchTest, runArgTest)
{
Batch batch;
int i = 0;
batch.add([&i](int k){ i = k;}, 1);

EXPECT_EQ(0, i);

batch.run();

EXPECT_EQ(1, i);
}
```

I don’t actually expect to find runtime errors with these tests. They exercise ths component just enough that I’m not generating compile errors in expected use cases. Template code can be tricky that way. Templates that aren’t instantiated can have horrible errors, but the compiler is willing to let them pass, if they mostly parse.

SFINAE may not be your friend.

## 3 Clang builds with current libc++

Building clang and libc++ locally is getting easier and easier. Using that is still a bit difficult. But there are some reasons to do so. One is just being able to cross check your code for sanity. I won’t reproduce building clang and libc++ here. It’s really at this point just checking out the repos in the right places and running cmake with something like:

```cmake  -DCMAKE_INSTALL_PREFIX=~/install/llvm-master/ -DLLVM_ENABLE_LIBCXX=yes  -DCMAKE_BUILD_TYPE=Release   ../llvm/
```

Using that, at least from within cmake, is more complicated. Cmake has a strong bias towards using the system compiler. It also has a distinct problem with repeating builds.

NEVER edit your CMakeCache.txt. You can’t do anything with it. All the paths are hard coded. Always start over. Either keep the command line around, or create a cmake initial cache file, which isn’t the same thing at all as the CMakeCache.txt file.

Right now, I’m cargo-culting around code in my cmake files that checks if I’ve defined an LLVM_ROOT, and if I have supply the flags to ignore all the system files, and use the ones from the installed LLVM_ROOT, including some rpath fixup. There might be some way to convince cmake to do it, but there’s also only so much I will fight my metabuild system.

```  if(LLVM_ROOT)
message(STATUS "LLVM Root: \${LLVM_ROOT}")
set(CMAKE_CXX_FLAGS "\${CMAKE_CXX_FLAGS} -nostdinc++ -isystem \${LLVM_ROOT}/include/c++/v1")
else()
```

I only check for that if the compiler I’ve chosen is a clang compiler, and it’s not normally part of my environment.

## 4 Direction

Overall, what I want out of this library is to be able to stress test some nominally mt-safe code, and check that the conditions that I think hold are true. It’s heavily influenced by jcstress, but, because this is C+++, it will be rendered quite differently.

For what I’m considering, look at Close Encounters of The Java Memory Model Kind

I want to be able to specify a state, with operations that mutate and observe the state. I want to be able to collect those observations in a deterministic way, which may require cooperation from the observers. I want to be able to collect the observations and report how many times each set of observations was obtained.

Something like:

```class State {
int x_;
int y_;

public:
typedef std::tuple<int, int, int, int> Result;
State() : x_(0), y_(0) {}
void writer1() {
y_ = 1;
x_ = 1;
}
}
}
};
```

Running the writers and readers over different threads and observing the possible results. On some architectures, reader1 and reader2 can see entirely different orders, even though y_ will happen before x_, you might see x_ written and not y_.

What I’d eventually like to be able to do is say things like, “This function will only be evaluated once”, and have some evidence to back that up.

So the next step is something that will take a State and schedule all of the actions with appropriate parameters in a Batch, and produce the overall Result. Then something that will do that many many times, accumulating all of the results. And since this isn’t java, so we don’t have reflection techniques, the State class is going to have to cooperate a bit. The Result typedef is one way. It will also have to produce all of the actions that need to be batched, in some heterogenous form that I can then run.

## 5 Source Code

Exported from an org-mode doc, batch.org, which is available, with all of the source on github at SpinGate.

# spingate

## 1 Building a simple spin gate in C++

This is a very simplified latch which allows several threads to block and busywait until they are released to begin work in parallel. I’m using this to do some multi-thread stress testing, where I want to maximize the overlapping work in order to check for suprising non-deterministic behavior. There’s no count associated with the gate, nor is it reuseable. All we need is the one bit of information, weather the gate is open or closed.

The simplest atomic boolean is the std::atomic_flag. It’s guaranteed by standard to be lock free. Unfortunately, to provide that guarantee, it provides no way to do an atomic read or write to the flag, just a clear, and a set and return prior value. This isn’t rich enough for multiple threads to wait, although it is enough to implement a spinlock.

std::atomic<bool> isn’t guaranteed to be lock free, but in practice, any architecture I’m interested has at least 32bit aligned atomics that are lock free. Older hardware, such as ARMv5, SparcV8, and 80386 are missing cmpxchg, so loads are generally implemented with a lock in order to maintain the guarantees if there were a simultaneous load and exchange. See, for example, LLVM Atomic Instructions and Concurrency Guide. Modern ARM, x86, Sparc, and Power chips are fine.

When the spin gate is constructed, we’ll mark the gate as closed. Threads will then wait on the flag, spinning until the gate is opened. For this we use Release-Acquire ordering between the open and wait. This will ensure any stores done before the gate is opened will be visible to the thread waiting.

```// spingate.h                                                       -*-C++-*-
#ifndef INCLUDED_SPINGATE
#define INCLUDED_SPINGATE

#include <atomic>

class SpinGate
{
std::atomic_bool flag_;

public:
SpinGate();
void wait();
void open();
};

inline
SpinGate::SpinGate() {
// Close the gate
flag_.store(true, std::memory_order_release);
}

inline
void SpinGate::wait() {
; // spin
}

inline
void SpinGate::open() {
flag_.store(false, std::memory_order_release);
}

#endif
```
```#include "spingate.h"
// Test that header is complete by including
```

Using a SpinGate is fairly straightfoward. Create an instance of SpinGate and wait() on it in each of the worker threads. Once all of the threads are created, open the gate to let them run. In this example, I sleep for one second in order to check that none of the worker threads get past the gate before it is opened.

The synchronization is on the SpingGate’s std::atomic_bool, flag_. The flag_ is set to true in the constructor, with release memory ordering. The function wait() spins on loading the flag_ with acquire memory ordering, until open() is called, which sets the flag_ to false with release semantics. The other threads that were spinning may now proceed. The release-acquires ordering ensures that happens-before writes by the thread setting up the work and calling open will be read by the threads that were spin waiting.

### 1.1 Update:

I’ve added a yield() call after the open, hinting that the thread opening the gate may be rescheduled and allow the other threads to run. On the system I’m testing on, it doesn’t seem to make a difference, but it does seem like the right thing to do, and it does not seem to hurt.

```#include "spingate.h"

#include <vector>
#include <chrono>
#include <iostream>

int main()
{
SpinGate gate;
using time_point = std::chrono::time_point<std::chrono::high_resolution_clock>;
time_point t1;
std::vector<time_point> times;

for (size_t n = 0; n < threadCount; ++n) {
workers.emplace_back([&gate, t1, &times, n]{
gate.wait();
time_point t2 = std::chrono::high_resolution_clock::now();
times[n] = t2;
});
}

std::cout << "Open the gate in 1 second: " << std::endl;
using namespace std::chrono_literals;
t1 = std::chrono::high_resolution_clock::now();
gate.open();

for (auto& thr : workers) {
thr.join();
}

for (auto& time: times) {
auto diff = std::chrono::duration_cast<std::chrono::nanoseconds>(time - t1);
std::cout << "Thread " << threadNum++ << " waited " << diff.count() << "ns\n";
}
}
```

I’d originally had the body of the threads just spitting out that they were running on std::cout, and the lack of execution before the gate, plus the overlapping output, being evidence of the gate working. That looked like:

```for (std::size_t n = 0; n < std::thread::hardware_concurrency(); ++n) {
workers.emplace_back([&gate, n]{
gate.wait();
std::cout << "Output from gated thread " << n << std::endl;
});
}
```

The gate is captured in the thread lambda by reference, the thread number by value, and when run, overlapping gibberish is printed to the console as soon as open() is called.

But then I became curious about how long the spin actually lasted. Particularly since the guarantees for atomics with release-acquire semantics, or really even sequentially consistent, are about once a change is visible, that changes before are also visible. It’s really a function of the underlying hardware how fast the change is visible, and what are the costs of making the happened-before writes available. I’d already observed better overlapping execution using the gate, as opposed to just letting the threads run, so for my initial purposes of making contention more likely, I was satisfied. Visibility, on my lightly loaded system, seems to be in the range of a few hundred to a couple thousand nanoseconds, which is fairly good.

Checking how long it took to start let me do two thing. First, play with the new-ish chrono library. Second, check that the release-acquire sync is working the way I expect. The lambdas that the threads are running capture the start time value by reference. The start time is set just before the gate is opened, and well after the threads have started running. The spin gate’s synchronization ensures that if the state change caused by open is visible, the setting of the start time is also visible.

Here are one set of results from running a spingate:

```Open the gate in 1 second:
```

## 2 Comparison with Condition Variable gate

The normal way of implementing a gate like this is with a condition variable, associated mutex, and a plain bool. The mutex guarantees synchronization between the wait and open, rather than the atomic variable in the SpinGate. The unlock/lock pair in mutex have release-acquire semantics. The actual lock and unlock are done by the unique_lock guard.

```// cvgate.h                                                           -*-C++-*-
#ifndef INCLUDED_CVGATE
#define INCLUDED_CVGATE

#include <mutex>
#include <condition_variable>
#include <atomic>

class CVGate
{
std::mutex lock_;
std::condition_variable cv_;
bool flag_;

public:
CVGate();

void wait();
void open();
};

inline
CVGate::CVGate()
: lock_(),
cv_(),
flag_(true)
{}

inline
void CVGate::wait() {
std::unique_lock<std::mutex> lk(lock_);
cv_.wait(lk, [&](){return !flag_;});
}

inline
void CVGate::open() {
std::unique_lock<std::mutex> lk(lock_);
flag_ = false;
cv_.notify_all();
}
#endif
```
```#include "cvgate.h"
// Test that header is complete by including
```

This has the same interface as SpinGate, and is used exactly the same way.

Running it shows:

```Open the gate in 1 second:
```

That the overhead of the mutex and condition variable is significant. On the other hand, the system load while it’s waiting is much lower. Spingate will use all available CPU, while CVGate yields, so useful work can be done byu the rest of the system.

However, for the use I was originally looking at, releasing threads for maximal overlap, spinning is clearly better. There is much less overlap as the cv blocked threads are woken up.

## 3 Building and Running

This is a minimal CMake file for building with the system compiler.

```cmake_minimum_required(VERSION 3.5)
set(CMAKE_LEGACY_CYGWIN_WIN32 0)

project(SpinGate C CXX)

set(CMAKE_EXPORT_COMPILE_COMMANDS ON)

set(CMAKE_CXX_FLAGS "\${CMAKE_CXX_FLAGS} -std=c++14 -ftemplate-backtrace-limit=0")
set(CMAKE_CXX_FLAGS "\${CMAKE_CXX_FLAGS} -Wall -Wextra -march=native")
set(CMAKE_CXX_FLAGS_DEBUG "-O0 -fno-inline -g3")
set(CMAKE_CXX_FLAGS_RELEASE "-Ofast -g0 -DNDEBUG")

include_directories(\${CMAKE_CURRENT_SOURCE_DIR})

```

And here we build a release version of the test executable:

```mkdir -p build
cd build
cmake -DCMAKE_BUILD_TYPE=Release ../
make
```
```-- Configuring done
-- Generating done
-- Build files have been written to: /home/sdowney/src/spingate/build
[ 50%] Built target cvgate
[100%] Built target spingate
```

## 4 Org-mode source and git repo

Exported from an org-mode doc. All of the source is available on github at SpinGate

# Building Emacs 25.1 on Ubuntu 16.10

## 1 Why notes

Making notes so I don’t forget, although the key problem is fixed upstream.

Ubuntu 16.10 (Yakkety Yak) has made a critical change to the system compiler, and everything is by default built with position independent executable support (PIE), in order to get better support for address space layout randomization. Here are the security notes for PIE. Emacs does not dump successfully with this. The compiler option -no-pie needs to be added into CLFAGS.

The option also means that static libraries you’ve built before will probably need to be rebuilt. See the link above for typical errors.

First get dpkg-dev, g++, gcc, libc, make:

```sudo apt-get install build-essentials
```

Then get the full set of build dependencies for last emacs, emacs24:

```sudo apt-get build-dep emacs24
```

Decide if you want to build just this version, or track emacs. I track from git, because. So I have a directory for emacs where I have master, emacs25, and build directories. I try to avoid building in src dirs. It makes it easier to try out different options without polluting the src.

```mkdir -p ~/bld/emacs
cd ~/bld/emacs
git clone git://git.savannah.gnu.org/emacs.git
cd emacs.git
cd ..
mkdir bld-25.1
```

## 3 Configure and build with magic option

Now configure in the build directory:

```cd bld-25.1
../emacs-25.1/configure \
--prefix=~/install/emacs-25.1 \
--with-x-toolkit=gtk3 \
--with-xwidgets \
CFLAGS=-no-pie
```

I built with xwidget support to play with the embedded webkit widget. It’s not really useable as a browser, but has uses for rendering. I also install into a local program directory, under my homedir.

Build and install:

```make
make install
```

I have a bin directory early in \$PATH so that I can select versions of local software ahead of system software.

```cd ~/bin
ln -s ~/install/emacs-25.1/bin/emacs
ln -s ~/install/emacs-25.1/bin/emacsclient
```

Now you should have a working emacs 25.1 available.