Multithread Experiments

An Experiment Collects Samples

I'm modelling this in order to run bits of code like the various litmus tests used to describe multi-core architectures: a set of functions to be run in parallel that may or may not write to a result, where the type of the result is a property of the Test being run. The Experiment runs the Test, collecting Samples. The Test type provides a tuple of functions to run, and they are run under a spingate, in all permutations, in order to remove scheduling bias.

What a Test looks like

class MP { // Message Passing
    int x_;
    int y_;

  public:
    typedef std::tuple<int> Result;
    MP();
    void t1();
    void t2(Result& read);

    auto actions() {
        return std::make_tuple([this]() { t1(); },
                               [this](Result& result) { t2(result); });
    }
};
The Test interface must provide a Result type and an actions() member that produces a tuple of functions to run, each of which takes either no arguments or a reference to a result. The test defined here is the basic Message Passing litmus test.
MP::MP() : x_(0), y_(0) {}

void MP::t1() {
    x_ = 1;
    y_ = 1;
}

void MP::t2(Result& read) {
    while (!y_) {
    }
    std::get<0>(read) = x_;
}
Two variables are initialized to 0. One thread stores 1 to x first, then 1 to y. The other thread loops until it reads a non-zero value in y, and then reads x. The value in x is the message being passed between threads. In an actual test, the variables would be atomics, specifying load and store strength, and the variables might have constraints on their layout to control how cache-line updates are shared.
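For illustration, an atomic version of MP might look something like this (my sketch; the post keeps plain ints for exposition, and the repo may differ):

#include <atomic>
#include <tuple>

class MPAtomic { // hypothetical atomic variant of MP
    std::atomic<int> x_{0};
    std::atomic<int> y_{0};

  public:
    typedef std::tuple<int> Result;

    void t1() {
        x_.store(1, std::memory_order_relaxed);
        y_.store(1, std::memory_order_release); // publish
    }

    void t2(Result& read) {
        while (!y_.load(std::memory_order_acquire)) {
        }
        // with the acquire/release pair this must read 1;
        // relaxed stores in t1 could allow a 0 here
        std::get<0>(read) = x_.load(std::memory_order_relaxed);
    }
};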

An Experiment

An Experiment samples a test a number of times. It takes the result of each sample and puts it in a map from result to count, incrementing the count for each distinct result. The actions to run are permuted each time, to help remove bias about which action is queued up behind the spingate first.
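The post doesn't show the Experiment class itself. A plausible shape, consistent with the run() definition below (the using-declaration binding Test is my assumption):

#include <cstddef>
#include <map>

using Test = MP; // bind the test to run; an assumption on my part

class Experiment {
    std::map<Test::Result, int> resultMap_; // distinct result -> count

  public:
    void run(size_t count);
};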
void Experiment::run(size_t count) {
    using Actions = decltype(std::declval<Test>().actions());
    auto getters = tupleutil::tuple_getters<Actions>();
    for (size_t i = 0; i < count; ++i) {
        Sample<Test> sample;
        sample.run(getters);
        resultMap_[sample.result_]++;
        std::next_permutation(getters.begin(), getters.end());
    }
}

tupleutil::tuple_getters returns an array of getters, each of which returns a std::variant<Types…> with the same parameter pack as the tuple. Sample runs all of the actions in a batch that gates them behind a spingate, and collects the results for each action.
template <class Test> class Sample {
  public:
    Batch                 batch_;
    Test                  test_;
    typename Test::Result result_;

    template <typename V, size_t I> void run(std::array<V, I> const& getters) {
        auto const& actions = test_.actions();
        add(actions, getters);
        batch_.run();
    }
};
Add is a templated member function that loops over the array, uses each getter to pull a function out of the tuple of actions, and visits it with a lambda that adds to the batch either the function with no arguments, or the function together with a reference to the results.
    template <typename Tuple, typename Variant, size_t I>
    void add(Tuple const& actions, std::array<Variant, I> const& getters) {
        auto adder = [this](auto&& f) {
            using F = std::remove_cv_t<std::remove_reference_t<decltype(f)>>;
            if constexpr (std::is_invocable_v<F>) {
                batch_.add(f);
            } else {
                batch_.add(f, std::ref(result_));
            }
        };
        for (auto&& get_n : getters) {
            std::visit(adder, get_n(actions));
        }
    }
I am a bit dissatisfied with the else case not being a constexpr if followed by a static_assert, but getting the condition right didn't work the obvious way, so I punted. There will be a compiler error if f(result_) can't actually be called by the batch.
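For reference, one formulation that does get a static_assert in (not the code in the repo; the extra trait and the second condition are my additions):

template <typename>
struct always_false : std::false_type {};

auto adder = [this](auto&& f) {
    using F = std::remove_cv_t<std::remove_reference_t<decltype(f)>>;
    if constexpr (std::is_invocable_v<F>) {
        batch_.add(f);
    } else if constexpr (std::is_invocable_v<F, typename Test::Result&>) {
        batch_.add(f, std::ref(result_));
    } else {
        // dependent on F, so it only fires for a bad instantiation
        static_assert(always_false<F>::value, "unsupported action signature");
    }
};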

Batch recapped:

The key bit of code is
template <class Function, class... Args>
void Batch::add(Function&& f, Args&&... args) {
    workers_.emplace_back([ this, f = std::forward<Function>(f), args... ]() {
            gate_.wait();
            f(args...);
        });
}

Batch has a spingate and runs all of the functions that are added sitting behind it. The run() function opens the gate and joins all the worker threads.
void Batch::run() {
    gate_.open();
    for (auto& thr : workers_) {
        thr.join();
    }
}

Summary

With all the machinery in place, the test infrastructure can aggressively run multi-threaded tests, giving the thread scheduler the best opportunity to run all of the actions in any order. This allows multithreading bugs to be shaken out by looking for surprising results from the experiment.

Source Code

Exported from an org-mode doc, experiment.org, which is available, with all of the source, on GitHub at SpinGate.

The Identity Monad

A short exposition on the simplest monad possible, the Identity monad, which does nothing. It is a pure context that provides no extra capabilities. The only reason to consider it is that it provides a fairly pure implementation of the monad interface, with no other distractions. The only known use for the Identity monad, other than exposition, is to model Taint, where the context, such as user input, can't be removed.

Identity Monad in Haskell

This is a definition of the Identity monad in Haskell, for reference. It uses the relatively new Applicative typeclass that sits between Functor and Monad.

import Control.Monad

data Id a = Id a deriving Show

instance Functor Id where
    fmap f (Id x)   = Id (f x)

instance Applicative Id where
    pure x          = Id x
    Id f <*> Id x   = Id (f x)

instance Monad Id where
    (Id x) >>= f    = f x

Id wraps a single value. There is no empty state. The mapping function, fmap, simply applies a function to the value being wrapped. The function (<*>), also known as ap, takes a wrapped function, applies it to a wrapped value, and returns the result wrapped up. The function pure, which is the same as return, just constructs an Id with the value in it. The bind function (>>=) does the same thing, only with a function that itself returns a wrapped value.

Identity Monad in C++

The basic data structure for the Identity monad is fairly trivial. It's a generic type that holds a single element, privately, of that type. I am ignoring any issues about reference types, non-copyable types, non-movable types, etc. as distractions. I'm also assuming it's default constructible, for simplicity.

template <typename T>
class Identity {
    T t_;

  public:
    Identity() : t_(){};
    Identity(T const& t) : t_(t){};

    // ...
};

In particular, there are no accessors for the T t_ being held. Values go in, but they can't get out. At least not yet, and not without side-effecting functions. This is why it's sometimes used as Taint, where you cannot lose track of the data coming from outside and being untrusted.

Identity Functor

To make a data structure a Functor, an fmap function needs to be provided that satisfies certain requirements. It also turns out that if there is such a function for a particular data structure, all of the implementations are isomorphic, so there's essentially only one for a particular data structure. Fmap for a functor F takes two arguments, a function from T -> U and an F<T>, and it returns an F<U>. For C++ this is a bit over-restrictive, though, and we probably want to be able to apply any function that can be called with the template type of the Identity. Fortunately, C++17 added some traits for working with invocables, and we can use std::invoke_result_t both to detect if a call can be made and to determine the type of the result.

template <typename U, typename Func>
auto fmap(Identity<U> const& i, Func const& f)
    -> Identity<std::invoke_result_t<Func, U>> {
    using V = std::invoke_result_t<Func, U>;
    return Identity<V>{std::invoke(f, i.t_)};
}

It's also declared as a friend, so that it can have access to the internally held t_.
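Inside Identity, the friend declaration would look something like this (my sketch, mirroring the one shown later for ap):

template <typename T>
class Identity {
    T t_;

  public:
    // ...
    template <typename U, typename Func>
    friend auto fmap(Identity<U> const& i, Func const& f)
        -> Identity<std::invoke_result_t<Func, U>>;
};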

There are two "laws" for how fmap must behave

-- identity preserving
fmap id = id

-- Composition
fmap (f . g) = (fmap f) . (fmap g)

The first one says that fmapping the identity function won't change the object, so it's equivalent to just the identity function. For Identity this is straightforward to show: since the held value isn't going to be changed, the result won't change the value of the Identity. We can write an equality test for Identity:

template <typename T, typename U>
bool operator==(Identity<T> t, Identity<U> u) {
    return t.t_ == u.t_;
}

template <typename T, typename U>
bool operator!=(Identity<T> t, Identity<U> u) {
    return !(t == u);
}

It's expressed as a non-member in order to support allowed conversions between T and U, so we can compare Identity<int> and Identity<long> the same way we can compare int and long.

The second law is more interesting. It's expressing that you can distribute fmap over function composition: chaining functions works transparently, with appropriate adjustment of the types of the functions. So what does applying fmap to a function, without a Functor object like Identity, do? It's a higher-order function, one that takes and returns a function.

template <typename Func>
auto fmap(Func const& f) {
    // maps (T -> U) -> (Identity<T> -> Identity<U>)
    return [f](auto t) { return fmap(t, f); };
}

In Haskell this is natural, and does not need a separate definition, because of currying. In Haskell, if you have a function of type (T -> U) -> F<T> -> F<U>, it is both a function taking one argument and returning a function, and a function taking two arguments and returning a value. Both (T -> U) -> (F<T> -> F<U>) and ((T -> U), F<T>) -> F<U> are true, depending on how many arguments you supply.

The operator (.) in Haskell has type (.) :: (b -> c) -> (a -> b) -> (a -> c), so (f . g)(x) is f(g(x)): more function chaining. We can either lift the composed function (a -> c) into F<a> -> F<c>, or we can chain the lifted F<a> -> F<b> into the lifted F<b> -> F<c>, and get the same result. This is a natural property of container-like Functors, like Identity, an optional, or a std::vector.

Identity function digression

It's surprisingly tricky to get a good identity function, one that really returns the same object in the C++ sense. This is the one I use:

constexpr auto identity = [](auto &&v) -> decltype(auto)
{
    return std::forward<decltype(v)>(v);
};
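With the pieces in place, both laws can be spot-checked in the gtest style used elsewhere in this post. A by-value identity sidesteps the reference issues being set aside (my example, not the post's tests):

auto id = [](int x) { return x; };
Identity<int> a{42};

// identity preserving: fmap id == id
ASSERT_EQ(a, fmap(a, id));

auto f = [](int x) { return x + 1; };
auto g = [](int x) { return x * 2; };
auto f_after_g = [f, g](int x) { return f(g(x)); }; // f . g

// composition: fmap (f . g) == (fmap f) . (fmap g)
ASSERT_EQ(fmap(a, f_after_g), fmap(fmap(a, g), f));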

Using fmap on Identity

A short example, from a test case of using the two argument fmap:

Identity<int>  i;
Identity<long> l;
Identity<char> c;

auto twice = [](auto z) { return 2 * z; };
auto i2    = fmap(i, twice);
auto l2    = fmap(l, twice);
auto c2    = fmap(c, twice);

ASSERT_EQ(Identity{0}, i2);
ASSERT_EQ(Identity{0L}, l2);
ASSERT_EQ(Identity{'\0'}, c2);

Identity<int>  i3(3);
Identity<long> l3(3);
Identity<char> c3(3);

auto i6 = fmap(i3, twice);
auto l6 = fmap(l3, twice);
auto c6 = fmap(c3, twice);

ASSERT_EQ(Identity{6}, i6);
ASSERT_EQ(Identity{6L}, l6);
ASSERT_EQ(Identity{'\6'}, c6);

And the one argument form, where the only difference is we can use the fmapped lambda on all of the Identity instances:

Identity<int>  i;
Identity<long> l;
Identity<char> c;

auto twice     = [](auto i) { return 2 * i; };
auto fmapTwice = fmap(twice);

auto i2 = fmapTwice(i);
auto l2 = fmapTwice(l);
auto c2 = fmapTwice(c);

ASSERT_EQ(Identity{0}, i2);
ASSERT_EQ(Identity{0L}, l2);
ASSERT_EQ(Identity{'\0'}, c2);

Identity<int>  i3(3);
Identity<long> l3(3);
Identity<char> c3(3);

auto i6 = fmapTwice(i3);
auto l6 = fmapTwice(l3);
auto c6 = fmapTwice(c3);

ASSERT_EQ(Identity{6}, i6);
ASSERT_EQ(Identity{6L}, l6);
ASSERT_EQ(Identity{'\6'}, c6);


Fmap, and the Functor laws, give a way of applying chains of functions to a Functor object while knowing that the application will behave sensibly. The equivalent STL algorithm is std::transform; however, that requires the Functor to have iterators, which means it cannot be used for Functors like std::optional. If std::optional were iterable, we could write code like

std::optional<int> mult(std::optional<int> o1, std::optional<int> o2)
{
    for (auto const& a : o1)
    {
        for (auto const& b : o2)
        {
            return a*b;
        }
    }
    return {};
}

The core function (a * b) is also something fmap can't deal with directly. Fmap deals with functions of one argument, and this one has two: it's int -> int -> int, not int -> int. That's what Applicative is about.

Applicative and ap

Applicative is a little unnatural in C++. It arises via partial application, which functional languages usually have direct syntax for, while in C++ we have at best lambda capture, and at worst std::bind, or the late, and unlamented, std::bind1st. The idea, though, is that you can treat a function like (*), multiply, with the type int -> int -> int, as a function that takes an int and returns a function of type int -> int. And it is useful to be able to do this with Functors.
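A minimal illustration of partial application via capture (my example):

auto multiply = [](int a, int b) { return a * b; };
auto triple   = [multiply](int b) { return multiply(3, b); }; // int -> int
// triple(7) == 21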

Applicative introduces two operators in Haskell, one is an infix version of fmap, and the other is new, and exists to deal with chaining multi-argument functions.

(<$>) :: Functor f => (a -> b) -> f a -> f b
(<*>) :: Applicative f => f (a -> b) -> f a -> f b

Infix notation, putting the operation in the middle of the arguments, can improve readability and ergonomics. It's one of the reasons the Uniform Function Call Syntax proposal gets traction. It's useful and natural to be able to say things in noun-verb order, which is one of the attractions of member functions in the first place. There are libraries, such as Hana, that provide mechanisms for making functions infix, with syntax like f <fmap> x, which would translate to fmap(f, x). Note this doesn't match the argument order of the fmap I've used so far. The ergonomics of C++ dictate that functions that might take lambdas put that argument last, unless there is a compelling reason otherwise. API design work is hard.

The new function (<*>), also known as ap, takes a function wrapped in a Functor, applies it to an object wrapped in the Functor, and returns the result wrapped in the Functor. That function usually comes from fmapping, so you will see a chain of <$> <*> <*> … providing arguments to the function. For Identity this is just calling the function with the contents. Often the pure function is part of the chain; it is the other piece that Applicative introduces, with a terribly misleading name. The pure function takes a value and lifts it into the Functor in the most natural way possible. For container-like Applicatives, that is putting a single value into the container.

pure :: Applicative f => a -> f a

Pure is also known as return for Monad, and also known as unit. All of these are somewhat misleading names; for C++ I tend to prefer make. This is generally a constructor, but we want to disambiguate from constructor calls that take control parameters. That is, we want to be sure we're calling the constructor of std::vector<size_t> that makes a vector holding the one element we pass, not the one that makes a vector with that many zero-valued elements. In the case of the Identity monad, make is pretty simple to implement.

template <typename T>
Identity<T> make(T t) {
    return Identity<T>{t};
}
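The std::vector ambiguity mentioned above is the classic case (my example, not the post's):

#include <vector>

std::vector<size_t> zeros(3); // three elements, all zero
std::vector<size_t> three{3}; // one element, with value 3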

If I wanted a more fully general make that handled other Applicative generic types, it would be a bit more complicated. If I want something like:

    Identity<int>           i  = make<Identity>(1);
    Identity<Identity<int>> ii = make<Identity>(i);

where I can parameterize the make function on the template type parameter, it's a bit more complicated, because we can't partially specialize a function, and we want to specialize on the Value type of the Identity template. The standard trick is to delegate to a class template that is partially specialized. The base template looks like

template <template <typename> typename Ap, typename Value>
struct Applicative {
    Ap<Value> make(Value const& v);
    Ap<Value> make(Value&& v);
};

specialized as

template <typename Value>
struct Applicative<Identity, Value> {
    typedef typename std::decay<Value>::type V;
    Identity<V> make(V const& v) { return Identity<V>{v}; }
    Identity<V> make(V&& v) { return Identity<V>{std::move(v)}; }
};

so that a make function looks like

template <template <typename> typename Ap, typename Value>
Ap<typename std::decay<Value>::type> make(Value const& v) {
    Applicative<Ap, Value> a;
    return a.make(v);
}

The base Applicative type allows writing generic code against different Applicative types. We'll come back to that in a bit.

The ap function takes a function lifted into the applicative and a value lifted into the applicative, and invokes the function on the value. For Identity, this is just function invocation. For other Applicatives it can be more complicated, but it is always some variation on calling functions. Since there's no way of getting a value out of the Identity, ap needs to be a friend or a member.

template <typename T>
class Identity {
// ...
    template <typename Func, typename U>
    friend auto ap(Identity<Func> const& f, Identity<U> t)
        -> Identity<std::invoke_result_t<Func, U>>;
// ...
};

template <typename Func, typename U>
auto ap(Identity<Func> const& f, Identity<U> t)
    -> Identity<std::invoke_result_t<Func, U>> {
    return std::invoke(f.t_, t.t_);
}

A function wrapped in an applicative context shows up when you partially apply a function to an applicative. That is, if you had a function like add

auto add(int a, int b) -> int {return a + b;}

which takes two ints and returns an int, and you bind the first argument, returning a function from int to int

auto bind_add = [](int a) {return [a](int b){return add(a, b);};};

so instead of a function (int, int) -> int, we have a function int -> (int -> int), and we can fmap that function. Fmap takes a function (a -> b) and gives you a function (F<a> -> F<b>), so fmapping bind_add gives a function F<int> -> F<(int -> int)>. And the result of applying that is something we can use with ap

    Identity<int> i3(3);
    auto partial = fmap(i3, bind_add);
    // partial is Identity<int -> int> (roughly)
    Identity<int> i4(4);
    Identity<int> k = ap(partial, i4);
    ASSERT_EQ(Identity<int>(7), k);

In Haskell, where partial application is built into the syntax of the language, this is all much more natural. The expressive power comes in cases where ap is more than simple function application, as it is for the Identity applicative. Applicatives like Maybe or Expected can short-circuit evaluation, for example. For multivalued functions, like List, the eventual result is the Cartesian product of all the parameters, in a list.

Another way of looking at ap involves the curried form of a multiparameter function, named after Haskell Curry. Currying converts a function called like f(a, b, c) into one called like f(a)(b)(c), where each call returns a function taking a single parameter, and the final result is the same.

    auto curry1 = [](auto func) {
        return [func](int a) {
            return [func, a](int b) {
                return func(a, b);
            };
        };
    };
    auto g = curry1(three); // three: a two-int function like add, defined elsewhere
    auto h = g(3);
    ASSERT_EQ(7, h(4));

Here curry1 unrolls a func(int, int) -> int into a function that takes an int and returns a function taking an int and returning an int. It's a straightforward generalization of bind_add from above. We can take the curried function g, fmap it over an Identity<int>, and then apply that partially bound function to another Identity<int>, like so:

    Identity<int> id3(3);
    Identity<int> id4(4);
    auto partial2 = fmap(id3, g);
    Identity<int> id7 = ap(partial2, id4);
    ASSERT_EQ(Identity<int>(7), id7);

Why std::bind can't be (formally) deprecated

Yes: std::bind should be replaced by lambda

For almost all cases, std::bind should be replaced by a lambda expression. It's idiomatic, and results in better code. There is almost no reason post-C++11 to use std::bind. The conversion is quite straightforward: capture each bind argument by value in the lambda capture list, provide auto parameters for each of the placeholders, then call the bound callable using std::invoke(). That handles member function pointers as well as regular functions. That's the mechanical translation; if you were doing this as part of a manual refactoring, the lambda can be made even clearer.
#include <functional>
#include <iostream>

void f(int n1, int n2, int n3) {
  std::cout << n1 << ' ' << n2 << ' ' << n3 << '\n';
}

int main() {
  using namespace std::placeholders;
  int n = 5;
  auto f1 = std::bind(f, 2, n, _1);
  f1(10); // calls f(2, 5, 10);

  auto l1 = [ p1 = 2, p2 = n ](auto _1) { return std::invoke(f, p1, p2, _1); };
  l1(10);

  // idiomatically
  auto l1a = [=](auto _1){return f(2, n, _1);};
  l1a(10);

  auto f2 = std::bind(f, 2, std::cref(n), _1);
  auto l2 = [ p1 = 2, p2 = std::cref(n) ](auto _1) {
      return std::invoke(f, p1, p2, _1);
  };
  // or
  auto l2a = [ p1 = 2, &p2 = n ](auto _1) {
      return std::invoke(f, p1, p2, _1);
  };
  // more idiomatically
  auto l2b = [&](auto _1){f(2, n, _1);};

  n = 7;
  f2(10); // calls f(2, 7, 10);

  l2(10);
  l2a(10);
  l2b(10);
}
Which results in:
2 5 10
2 5 10
2 5 10
2 7 10
2 7 10
2 7 10
2 7 10

No: std::bind provides one thing lambda doesn't

When a std::bind expression evaluates, it flattens std::bind sub-expressions and passes the same placeholder parameters down. A nested bind is evaluated with the given parameters, and the result is passed in to the outer bind. So you can have a bind that does something like g(_1, f(_1)), and when you call it with a parameter, that same value will be passed to both g and f. The function g will receive f(_1) as its second parameter.

Now, you could rewrite the whole thing as a lambda, but auto potentially makes this more difficult. The result of std::bind is an unutterable type; they weren't supposed to be naked. However, auto means the expression could be broken down into parts, so the translation from a std::bind expression to a lambda expression is potentially not mechanical. Or the bind could be part of a template, where the sub-expression is a template parameter, which is likely working by accident rather than design. In any case, std::bind does not treat its arguments uniformly: it treats a bind expression distinctly differently. At the time, that made some sense, but it makes reasoning about bind expressions difficult. Don't do this.

But it is why formally deprecating std::bind is difficult. Bind expressions can be replaced, but not purely mechanically. There isn't a simple translation that works, unlike converting from std::auto_ptr to std::unique_ptr, or putting a space after a string literal where it now looks like a conversion. And std::bind isn't broken. It's sub-optimal because of the complicated machinery needed to support all of the flexibility, where a lambda allows the compiler to do much better. Also, since the type isn't utterable, the result often ends up in a std::function, which erases the type, removing optimization options.

Example of fail code

#include <functional>
#include <iostream>

void f(int n1, int n2, int n3)
{
    std::cout << n1 << ' ' << n2 << ' ' << n3 << '\n';
}

int g(int n1) { return n1; }

int main()
{
    using namespace std::placeholders;

    auto g1 = std::bind(g, _1);
    auto f2 = std::bind(f, _1, g1, 4);
    f2(10); // calls f(10, g(10), 4);

    // THIS DOES NOT WORK
    // auto l2 = [p1 = g1, p2 = 4](auto _1) {std::invoke(f, _1, p1, p2);};
    // l2(10);

    // The bind translation needs to be composed:
    auto l1 = [](auto _1){return g(_1);};
    auto l2 = [p1 = l1, p2 = 4](auto _1){f(_1, p1(_1), p2); };
    // idiomatically
    auto l2a = [](auto _1) { return f(_1, g(_1), 4);};
    l2(10);
    l2a(10);
}


10 10 4
10 10 4
10 10 4

TODO

If someone can figure out a fixit recommendation that could be safely applied, transforming the old bind to a lambda, then std::bind could be deprecated in C++Next, and removed as soon as C++(Next++). But right now that is non-trivial in some cases.

Updates:

  • Fix incorrect statement about type-erasure in std::bind. I was thinking std::function
  • Add more idiomatic transliterations of the std::bind lambdas

Building and running the examples

Makefile

clean:
    -rm example1
    -rm example2

example1: example1.cpp
    clang++ --std=c++1z example1.cpp -o example1

example2: example2.cpp
    clang++ --std=c++1z example2.cpp -o example2

example3: example3.cpp
    clang++ --std=c++1z example3.cpp -o example3 2>&1

all: example1 example2

Build

rm example1
rm example2
clang++ --std=c++1z example1.cpp -o example1
clang++ --std=c++1z example2.cpp -o example2

Original source

Original document is available on Github

Accessing the elements of a tuple as variant

A further digression, because it turns out I want to be able to permute a tuple at run time. That means treating the elements of a tuple generically. And I can just barely do this, for some tuples, in C++17. So a slight digression into ADTs. Which in this case means Algebraic Data Types, not Abstract Data Types. But just algebra: no calculus, or differentiation, of data types. Not today.

1 Tuple is Product, Variant is Sum

1.1 Products

In algebra, we usually start out with addition. It's simpler. But for types, multiplication, or product, is in many ways much more natural. Your basic struct, record, etc. is a natural product of types. A type is some kind of collection of things, and I'm being a bit vague here because this is right in the area where set seems like a good idea, and then we get into sets of sets, sets that might contain themselves, and barbers who shave all the people who don't shave themselves. There is rigour, but I don't really want to have to go there. If we start with the idea that a type is a collection of things, and we don't look too closely at the infinities, we are not going to be terribly wrong. So a type is a way of describing if a thing is in or out of the collection.

Now, I could pretend we don't know what a struct is, start with pairs, where there are no names for the components, and build that up. But we all have a notion of struct: it's an ordered collection of types. The instances of the struct are all of the elements of each type contained in the struct, matched up with all of the elements of all the other types in the struct, known as the Cartesian product. So if you have a type A and a type B, the collection of things in struct {A a; B b;} is the cross of As and Bs. That is, {{a1, b1}, {a1, b2}, {a1, b3}, … , {a2, b1}, {a2, b2}, … {an, b1}, … {an, bm}} is all of the elements that are part of the type struct {A a; B b;}. The cardinality of {A, B} is the product of the cardinalities of A and B.

Structs are very natural in C++, but hard to deal with generically, so there's a type that does it all for you, std::tuple. Getting at the parts of the tuple is a little more difficult than with a struct. You have to say std::get<0>(tuple), or std::get<int>(tuple), and the second might not even compile if the tuple has more than one int. But you get tools for composing and decomposing tuples at compile time. And std::tuple lets you put pretty much any C++ type into the tuple, only restricting you when you try to, e.g., move a tuple that has an element that can't be moved.

There should also be a type that acts as a unit for the product, the equivalent of 1 for multiplication. The empty tuple can work as a unit. It contains any of the list of no types. This implies that all empty tuples are equivalent, so its cardinality is 1. There can be only one. The product of a type with the empty tuple is entirely equivalent to the type itself: there are no additional elements in the type, and you can convert back and forth between them. They are isomorphic, having the same shape.

Isomorphisms are important in talking about types, because most of the time we can't actually distinguish between isomorphic types, at least for proving things. The phrase "up to isomorphism" shows up a lot. To be isomorphic means that we can write a transformation X from type A to type B, and a reverse transformation Y from type B to type A, such that Y(X(a)) == a for all a, and that for any function from a1 to a2 there is an equivalent function from b1 to b2. We could mechanically replace instances of a with the appropriate b, adding calls to X and Y, without changing the behavior of a program.
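As a concrete illustration of that isomorphism (my sketch): a struct and the corresponding std::tuple carry exactly the same information, and the X and Y transformations can be written directly.

#include <tuple>

struct AB {
    int  a;
    char b;
};

// X : AB -> std::tuple<int, char>, and Y back again.
// Y(X(v)) == v for every v, which is what makes the two types isomorphic.
std::tuple<int, char> X(AB const& v) { return std::make_tuple(v.a, v.b); }
AB Y(std::tuple<int, char> const& t) {
    return {std::get<0>(t), std::get<1>(t)};
}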

1.2 Sums

The other basic algebraic type is the sum type. The corresponding primitive in C++ is a union, with one difference: in most type systems, the sum type automatically remembers which of the allowed types is in it. A union doesn't, so the standard technique is to embed the union in a struct that carries a tag saying which type in the union was most recently written, and can be read from. I'll be ignoring type-punning schemes that allow a read of a different type than was written. A sum type of type A and type B is the union of all of the things in A and all of the things in B: {a1, a2, a3, … , an, b1, b2, … , bm}. The cardinality of the sum is the sum of the cardinalities of A and B.

The unit type of the sum is equivalent to zero. The empty sum type, although a valid type, has no elements in it. It's like the empty set. It's often known as Void, where the unit for product is often called Unit. It may also be known as Bottom, a computation that never completes. Since there are no elements of the type Void, it can't be instantiated, and a product of Void and any other type is equivalent to Void. The C++ type void is related, but not exactly the same, because it also represents an empty argument list and a function that returns but does not return any value (a subroutine), while void* functions as the universal pointer.

C++17 recently standardized a sum type to go with the long-standardized std::tuple: std::variant. std::variant remembers which of the alternative types was last set. It is almost never empty, and is so only if a write into one of the alternatives threw an exception. It is not allowed to hold void, references, or arrays, or to contain no types. This is a bit unfortunate, because, except for void, std::tuple can do all of those things. There were several competing models for what std::variant should be, with various tradeoffs being made. It was always clear that std::tuple had to be able to represent everything a struct can, and in fact there are now language features to destructure a struct into a tuple. There is no equivalent model for sum types. Union can't hold anything but trivial types, because there is no mechanism to track what to do on destruction, since there is no built-in way to determine what the union was last written as.

One of the popular models for variant rises out of database-like interfaces. Even though databases are internally strongly typed, SQL queries are not, and the model of sending text over and getting some kind of response back makes it difficult to expose that to a host language, particularly when the database schema may change and the query, still perfectly valid, no longer returns the same types. However, since there is a relatively short list of permitted types in the database, a variant that allows just those types and the ability to query what type was returned can be quite useful, and not terribly hard to implement. There are JSON parsers taking similar approaches, only with the addition that a JSON value may contain JSON objects recursively, and those have to live outside the object somehow, or the size of the object is unbounded.

From the implementor's point of view, supporting pointers and arrays is a huge amount of extra work. Not allowing an array to decay to a pointer is quite difficult. References have issues when treated generically, not to mention that references have decidedly odd semantics in the differences between construction and assignment. And the degenerate case of an empty variant was also difficult.
If that needs to be represented, the type std::monostate has been introduced, a type designed to have exactly one item in it, so that all instances of std::monostate are identical. This is the same as the unit type for product types. It's not an accident that it's represented in Haskell as (), which is the empty tuple. All empty tuples are equivalent. It could have been std::tuple<>, but no one in the room happened to think of that.
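A sketch of the hand-rolled tagged union described above, next to its std::variant equivalent (my example):

#include <variant>

// The manual pattern: a union plus a tag recording what was last written.
struct Number {
    enum class Tag { Int, Double } tag;
    union {
        int    i;
        double d;
    };
};

// std::variant tracks the alternative for you.
std::variant<int, double> n = 2.5; // n.index() == 1, the double alternative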

2 Tuple is a Heterogenous Container, what is the iterator?

The C++ standard says "tuples are heterogeneous, fixed-size collections of values" ([tuple.general]). Collections generally have iterator types associated with them, but that's a bit of a challenge, since the iterator model in C++ assumes that for a collection, the type of *(Collection<T>::iterator) is T. If the collection isn't over a single T, but over Types…, it doesn't quite work to say that *(Collection<Types...>::iterator) is of type Types.... You need something to hold that. In many cases, std::variant can work. It doesn't entirely work, since we'd really need a variant of references to the elements of the tuple, so that they could be written to. However, for many purposes we can come close, and for the case I was looking at, making copies is perfectly fine. What I'm looking for is something roughly with the signature
template <typename... Types>
auto getElement(size_t i, std::tuple<Types...> tuple) -> std::variant<Types...>;
That is, something that will get me the ith element of a tuple, as a variant with the same typelist as the tuple, with the index determined at runtime. All of the normal accessors are compile time, so I need to do something that makes the compile-time information available at runtime. Start with something I do know how to do, idiomatically printing a tuple.
template <typename Func, typename Tuple, std::size_t... I>
void tuple_for_each_impl(Tuple&& tuple, Func&& f, std::index_sequence<I...>)
{
    auto swallow = {0,
                    (std::forward<Func>(f)(
                        I, std::get<I>(std::forward<Tuple>(tuple))))...};
    (void)swallow;
}

template <typename Func, typename... Args>
void tuple_for_each(std::tuple<Args...> const& tuple, Func&& f)
{
    tuple_for_each_impl(tuple, f, std::index_sequence_for<Args...>{});
}

template <typename... Args>
void print(std::ostream& os, std::tuple<Args...> const& tuple)
{
    auto printer = [&os](auto i, auto el) {
        os << (i == 0 ? "" : ", ") << el;
        return 0;
    };
    return tuple_for_each(tuple, printer);
}
Actually, this is a bit more complicated than the totally standard idiom, since it factors out the printer into an application across the tuple, but it's not much more complicated. The tuple_for_each constructs an index sequence based on the argument list and delegates to the impl, which uses it to apply the function to each element of the tuple. The _impl ought to be in a nested detail namespace, so as not to leak out. Swallow is the typical name for using an otherwise unnamed, and uninteresting, object to apply something to each element of the tuple for a side effect. The void cast makes sure the variable is used, and is evaluated.
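A quick use of print (my example, not from the post's tests):
std::tuple<int, double, char const*> t{1, 2.5, "hello"};
print(std::cout, t); // prints: 1, 2.5, hello
The next step is, instead of applying a function for its side effects, mapping the tuple, returning the transformed tuple.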
template <typename Func, typename Tuple, std::size_t... I>
auto tuple_transform_impl(Tuple&& tuple, Func&& f, std::index_sequence<I...>)
{
    return std::make_tuple(
        std::forward<Func>(f)(std::get<I>(std::forward<Tuple>(tuple)))...);
}

template <typename Func, typename... Args>
auto tuple_transform(std::tuple<Args...>&& tuple, Func&& f)
{
    return tuple_transform_impl(tuple, f, std::index_sequence_for<Args...>{});
}

template <typename Func, typename... Args>
auto tuple_transform(std::tuple<Args...> const& tuple, Func&& f)
{
    return tuple_transform_impl(tuple, f, std::index_sequence_for<Args...>{});
}
Because the std::tuple is not a template parameter, I have to supply a const& and an rvalue-reference form to cover both cases, and I'm ignoring volatile qualifiers. The _impl function uses forwarding-reference parameters, which will decay or forward properly using std::forward. Using it is straightforward.
std::tuple<int, double, long> t = std::make_tuple(1, 2.3, 1l);
auto transform = tupleutil::tuple_transform(t,
                                            [](auto i) { return i + 1; });

EXPECT_EQ(3.3, std::get<1>(transform));

auto t2 = tupleutil::tuple_transform(std::make_tuple(4, 5.0),
                                     [](auto i) { return i + 1; });
EXPECT_EQ(6, std::get<1>(t2));
So, for functions over all the types in a tuple, tuple is a Functor. That is, we can apply the function to all elements in the tuple, and it's just as if we had applied the function to the elements before making the tuple. If this sounds like a trivial distinction, you are mostly right. Almost all container-ish things are Functors, and a few non-containerish things are also. Plus, Functor sounds more impressive. The transform also suggests a way of solving the problem I was originally looking at: an array of the elements of the tuple, each as a variant, will let me permute them with std tools.
template <typename... Args, std::size_t... I>
constexpr std::array<std::variant<Args...>, sizeof...(Args)>
tuple_to_array_impl(std::tuple<Args...> const& tuple,
                    std::index_sequence<I...>)
{
    using V = std::variant<Args...>;
    std::array<V, sizeof...(Args)> array = {
        {V(std::in_place_index_t<I>{}, std::get<I>(tuple))...}};
    return array;
}

template <typename... Args>
constexpr std::array<std::variant<Args...>, sizeof...(Args)>
tuple_to_array(std::tuple<Args...> const& tuple)
{
    return tuple_to_array_impl(tuple, std::index_sequence_for<Args...>{});
}
And that can be used something like:
TEST(TupleTest, to_array)
{
    constexpr std::tuple<int, double, long> t = std::make_tuple(1, 2.3, 1l);
    auto arr = tupleutil::tuple_to_array(t);
    int  i   = std::get<int>(arr[0]);
    EXPECT_EQ(1, i);
}

TEST(TupleTest, to_array_repeated)
{
    constexpr std::tuple<int, int, int> t = std::make_tuple(1, 2, 3);
    auto arr = tupleutil::tuple_to_array(t);
    int  i   = std::get<2>(arr[2]);
    EXPECT_EQ(3, i);
}
The second test is there because I was about to write, "as you can see, we can tell the difference between variants holding the same type", except that wasn't true. The original version of to_array didn't use the constructor form with std::in_place_index_t. The code I ended up with did, but not at this point. There's nothing like writing out what something is supposed to do to make you look, and keep you honest. So here, we're constructing an array of std::variant<Args…>, constructing each member with the argument pack expansion into the std::variant constructor, using the Ith index value to get that element of the tuple, and recording that we're constructing the Ith alternative of the variant. The second test checks that. The 2nd element of the array must hold the 2nd alternative of the variant, and can be retrieved only by std::get<2>(). This would allow me to permute the elements of a tuple, but I'm fairly close now to being able to write a version that allows choice of the element at runtime, rather than at compile time.
    constexpr std::tuple<int, double, long> t = std::make_tuple(1, 2.3, 1l);
    std::variant<int, double, long> v0{1};
    auto v = tupleutil::get(0, t);
    EXPECT_EQ(v0, v);
What I'm going to do is construct an array of the getters for the tuple, each of which will return the element wrapped in a variant. The element type of the array will be a function pointer, because, quite conveniently, a non-capturing lambda can decay to a function pointer. First, getting the array of getters for the tuple
template <typename V, typename T, size_t I> auto get_getter()
{
    return [](T const& t) {
        return V{std::in_place_index_t<I>{}, std::get<I>(t)};
    };
}

template <typename... Args, std::size_t... I>
auto tuple_getters_impl(std::index_sequence<I...>)
{
    using V = std::variant<Args...>;
    using T = std::tuple<Args...>;
    using F = V (*)(T const&);
    std::array<F, sizeof...(Args)> array
        //        = {{[](T const& tuple){return V{std::get<I>(tuple)};}...}};
        = {{get_getter<V, T, I>()...}};
    return array;
}

template <typename... Args> auto tuple_getters(std::tuple<Args...>)
{
    return tuple_getters_impl<Args...>(std::index_sequence_for<Args...>{});
}
So, first a function that returns a function that constructs a variant around the value returned from std::get<I>. Well, it could return anything that happens to have a constructor taking an in_place_index_t together with something that std::get<I> can extract from the tuple. This is a separate function because GCC was unhappy doing the template parameter pack expansion inline in the _impl function. Clang was happy with the expansion noted in the comment. I really have no idea who is wrong here, and the workaround was straightforward. The array is one of function pointers, which the returned lambdas can decay to. Now the only remaining trick is to use this array as a table to dispatch to the appropriate getter for the tuple.
const auto get = [](size_t i, auto t) {
    static auto tbl = tupleutil::tuple_getters(t);
    return tbl[i](t);
};
Get the array as a static, so we only need to compute it once, and simply return tbl[i](t).
TEST(TupleTest, gettersStatic)
{
    constexpr std::tuple<int, double, long> t = std::make_tuple(1, 2.3, 1l);
    std::variant<int, double, long> v0{1};
    auto v = tupleutil::get(0, t);
    EXPECT_EQ(v0, v);

    int  i = std::get<0>(v);
    EXPECT_EQ(1, i);

    auto v2 = tupleutil::get(1, t);

    EXPECT_EQ(1ul, v2.index());
    double d = std::get<double>(v2);

    EXPECT_EQ(2.3, d);

    constexpr auto t2 = std::make_tuple(2.4, 1l);
    auto           v3 = tupleutil::get(0, t2);
    double         d2 = std::get<double>(v3);

    EXPECT_EQ(2.4, d2);
}

3 Source

All source is available at TupleUtil on GitHub, including org source for this post.

Cross Compiling

1 Setting up Cross Compiling

In order to test out some of these multi-threaded tools properly, I really need to run them on a less strict platform than x86_64. X86_64 provides a lot of guarantees about sequential consistency and atomicity that hide problems that will happen on architectures that are not as strong, like POWER, SPARC, and ARM. Fortunately, one of the toys I have is a recent Raspberry Pi 3, which is based on a recent ARM chip. Unfortunately, Raspbian, the normal Linux distro for the Raspberry Pi, is based on a fairly old Debian distro, with a fairly old compiler. Linaro is back-porting their ARM code generation fixes to the old releases, but I'm more interested in the recent C++ language features. So I could attempt to compile GCC 6 on the RPi, or I could cross compile from my normal machine. I decided to cross compile, since if that worked, it would be considerably easier. It turned out to be pretty straightforward.
sudo apt-get install g++-6-arm-linux-gnueabihf
This is mostly because I'm already doing software development on the box, so I didn't need any of the other parts of the compiler ecosystem, just the right C++ toolchain. The hardest part is determining the right one; there are a few flavors for ARM development. The RPi is the GNU extended ABI, with hardware float. The Ubuntu repositories only supply Linux variants, which is sensible, since that top-level package ends up installing not just the compilers, but a libstdc++ and libc for arm-linux-gnueabihf, which need to know much more about the OS in order to interface with it. This does lead to one snag, though: the versions of the libraries are not the ones available on the RPi. Which is a problem, since I want to use modern, or maybe even post-modern, C++. There are two ways of dealing with this, and I've ended up using both.

2 Sysroot

When cross compiling, a sysroot is a file system that looks just like the root file system of the target platform. It will have /lib, /usr/lib, etc., with the versions of the libraries that you want. You can either use a disk image mounted somewhere convenient, or you can just mount the target computer's root filesystem somewhere convenient. If you do that, you'll have access to all of the libraries available, not just the minimal set typically found in a prepackaged sysroot. So that's what I did.
sshfs sdowney@cobweb.local:/ /home/sdowney/mnt/rpi/ -o transform_symlinks -o allow_other
Cobweb is my Raspberry Pi box, and zeroconf makes its current IP address available as cobweb.local. I'm mounting that into ~/mnt/rpi, transforming symlinks so that they actually work, and allowing others to access the mounted fs. With that, I can specify the sysroot and have the compiler look there for libraries:
arm-linux-gnueabihf-g++-6 -v --sysroot ~/mnt/rpi/ -o hello hw.cpp
That spits out all of what the compiler driver invokes and, as a byproduct, a bunch of what is needed to set up cross compiling with other compilers, like clang. The key things to look for are the include directories called out by "#include <…> search starts here", and the LIBRARY_PATH variable that helps define what the linker does. I'll be pulling those out for the clang cross-compile CMake toolchain file.
Using built-in specs.
COLLECT_GCC=arm-linux-gnueabihf-g++-6
COLLECT_LTO_WRAPPER=/usr/lib/gcc-cross/arm-linux-gnueabihf/6/lto-wrapper
Target: arm-linux-gnueabihf
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 6.2.0-5ubuntu12' --with-bugurl=file:///usr/share/doc/gcc-6/README.Bugs --enable-languages=c,ada,c++,java,go,d,fortran,objc,obj-c++ --prefix=/usr --program-suffix=-6 --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-libitm --disable-libquadmath --enable-plugin --with-system-zlib --disable-browser-plugin --enable-java-awt=gtk --enable-gtk-cairo --with-java-home=/usr/lib/jvm/java-1.5.0-gcj-6-armhf-cross/jre --enable-java-home --with-jvm-root-dir=/usr/lib/jvm/java-1.5.0-gcj-6-armhf-cross --with-jvm-jar-dir=/usr/lib/jvm-exports/java-1.5.0-gcj-6-armhf-cross --with-arch-directory=arm --with-ecj-jar=/usr/share/java/eclipse-ecj.jar --disable-libgcj --enable-objc-gc --enable-multiarch --enable-multilib --disable-sjlj-exceptions --with-arch=armv7-a --with-fpu=vfpv3-d16 --with-float=hard --with-mode=thumb --disable-werror --enable-multilib --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=arm-linux-gnueabihf --program-prefix=arm-linux-gnueabihf- --includedir=/usr/arm-linux-gnueabihf/include
Thread model: posix
gcc version 6.2.0 20161005 (Ubuntu 6.2.0-5ubuntu12)
COLLECT_GCC_OPTIONS='-v' '-o' 'hello' '-shared-libgcc' '-march=armv7-a' '-mfloat-abi=hard' '-mfpu=vfpv3-d16' '-mthumb' '-mtls-dialect=gnu'
 /usr/lib/gcc-cross/arm-linux-gnueabihf/6/cc1plus -quiet -v -imultiarch arm-linux-gnueabihf -isysroot /home/sdowney/mnt/rpi/ -D_GNU_SOURCE hw.cpp -quiet -dumpbase hw.cpp -march=armv7-a -mfloat-abi=hard -mfpu=vfpv3-d16 -mthumb -mtls-dialect=gnu -auxbase hw -version -fstack-protector-strong -Wformat -Wformat-security -o /tmp/ccUwr5Jd.s
GNU C++14 (Ubuntu 6.2.0-5ubuntu12) version 6.2.0 20161005 (arm-linux-gnueabihf)
    compiled by GNU C version 6.2.0 20161005, GMP version 6.1.1, MPFR version 3.1.5, MPC version 1.0.3, isl version 0.15
GGC heuristics: --param ggc-min-expand=100 --param ggc-min-heapsize=131072
ignoring nonexistent directory "/home/sdowney/mnt/rpi/usr/local/include/arm-linux-gnueabihf"
#include "..." search starts here:
#include <...> search starts here:
 /usr/lib/gcc-cross/arm-linux-gnueabihf/6/../../../../arm-linux-gnueabihf/include/c++/6
 /usr/lib/gcc-cross/arm-linux-gnueabihf/6/../../../../arm-linux-gnueabihf/include/c++/6/arm-linux-gnueabihf
 /usr/lib/gcc-cross/arm-linux-gnueabihf/6/../../../../arm-linux-gnueabihf/include/c++/6/backward
 /usr/lib/gcc-cross/arm-linux-gnueabihf/6/include
 /usr/lib/gcc-cross/arm-linux-gnueabihf/6/include-fixed
 /usr/lib/gcc-cross/arm-linux-gnueabihf/6/../../../../arm-linux-gnueabihf/include
 /home/sdowney/mnt/rpi/usr/include/arm-linux-gnueabihf
 /home/sdowney/mnt/rpi/usr/include
End of search list.
GNU C++14 (Ubuntu 6.2.0-5ubuntu12) version 6.2.0 20161005 (arm-linux-gnueabihf)
    compiled by GNU C version 6.2.0 20161005, GMP version 6.1.1, MPFR version 3.1.5, MPC version 1.0.3, isl version 0.15
GGC heuristics: --param ggc-min-expand=100 --param ggc-min-heapsize=131072
Compiler executable checksum: 8867fa57a9cbba18ebd7880e42ca78ba
COLLECT_GCC_OPTIONS='-v' '-o' 'hello' '-shared-libgcc' '-march=armv7-a' '-mfloat-abi=hard' '-mfpu=vfpv3-d16' '-mthumb' '-mtls-dialect=gnu'
 /usr/lib/gcc-cross/arm-linux-gnueabihf/6/../../../../arm-linux-gnueabihf/bin/as -v -march=armv7-a -mfloat-abi=hard -mfpu=vfpv3-d16 -meabi=5 -o /tmp/ccJH2IA5.o /tmp/ccUwr5Jd.s
GNU assembler version 2.27 (arm-linux-gnueabihf) using BFD version (GNU Binutils for Ubuntu) 2.27
COMPILER_PATH=/usr/lib/gcc-cross/arm-linux-gnueabihf/6/:/usr/lib/gcc-cross/arm-linux-gnueabihf/6/:/usr/lib/gcc-cross/arm-linux-gnueabihf/:/usr/lib/gcc-cross/arm-linux-gnueabihf/6/:/usr/lib/gcc-cross/arm-linux-gnueabihf/:/usr/lib/gcc-cross/arm-linux-gnueabihf/6/../../../../arm-linux-gnueabihf/bin/
LIBRARY_PATH=/usr/lib/gcc-cross/arm-linux-gnueabihf/6/:/usr/lib/gcc-cross/arm-linux-gnueabihf/6/../../../../arm-linux-gnueabihf/lib/../lib/:/home/sdowney/mnt/rpi/lib/arm-linux-gnueabihf/:/home/sdowney/mnt/rpi/lib/../lib/:/home/sdowney/mnt/rpi/usr/lib/arm-linux-gnueabihf/:/home/sdowney/mnt/rpi/usr/lib/../lib/:/usr/lib/gcc-cross/arm-linux-gnueabihf/6/../../../../arm-linux-gnueabihf/lib/:/home/sdowney/mnt/rpi/lib/:/home/sdowney/mnt/rpi/usr/lib/
COLLECT_GCC_OPTIONS='-v' '-o' 'hello' '-shared-libgcc' '-march=armv7-a' '-mfloat-abi=hard' '-mfpu=vfpv3-d16' '-mthumb' '-mtls-dialect=gnu'
 /usr/lib/gcc-cross/arm-linux-gnueabihf/6/collect2 -plugin /usr/lib/gcc-cross/arm-linux-gnueabihf/6/liblto_plugin.so -plugin-opt=/usr/lib/gcc-cross/arm-linux-gnueabihf/6/lto-wrapper -plugin-opt=-fresolution=/tmp/cctgBCzX.res -plugin-opt=-pass-through=-lgcc_s -plugin-opt=-pass-through=-lgcc -plugin-opt=-pass-through=-lc -plugin-opt=-pass-through=-lgcc_s -plugin-opt=-pass-through=-lgcc --sysroot=/home/sdowney/mnt/rpi/ --build-id --eh-frame-hdr -dynamic-linker /lib/ld-linux-armhf.so.3 -X --hash-style=gnu --as-needed -m armelf_linux_eabi -z relro -o hello /usr/lib/gcc-cross/arm-linux-gnueabihf/6/../../../../arm-linux-gnueabihf/lib/../lib/crt1.o /usr/lib/gcc-cross/arm-linux-gnueabihf/6/../../../../arm-linux-gnueabihf/lib/../lib/crti.o /usr/lib/gcc-cross/arm-linux-gnueabihf/6/crtbegin.o -L/usr/lib/gcc-cross/arm-linux-gnueabihf/6 -L/usr/lib/gcc-cross/arm-linux-gnueabihf/6/../../../../arm-linux-gnueabihf/lib/../lib -L/home/sdowney/mnt/rpi/lib/arm-linux-gnueabihf -L/home/sdowney/mnt/rpi/lib/../lib -L/home/sdowney/mnt/rpi/usr/lib/arm-linux-gnueabihf -L/home/sdowney/mnt/rpi/usr/lib/../lib -L/usr/lib/gcc-cross/arm-linux-gnueabihf/6/../../../../arm-linux-gnueabihf/lib -L/home/sdowney/mnt/rpi/lib -L/home/sdowney/mnt/rpi/usr/lib /tmp/ccJH2IA5.o -lstdc++ -lm -lgcc_s -lgcc -lc -lgcc_s -lgcc /usr/lib/gcc-cross/arm-linux-gnueabihf/6/crtend.o /usr/lib/gcc-cross/arm-linux-gnueabihf/6/../../../../arm-linux-gnueabihf/lib/../lib/crtn.o
COLLECT_GCC_OPTIONS='-v' '-o' 'hello' '-shared-libgcc' '-march=armv7-a' '-mfloat-abi=hard' '-mfpu=vfpv3-d16' '-mthumb' '-mtls-dialect=gnu'
Now, note that the compiler will prefer the locally installed versions of libraries before using the ones in the sysroot. This is fine, until I go to install something on the RPi; then I'll get an error because the library on the RPi is too old, particularly libstdc++. This works well for the non-core language libraries, though, or at least the ones that don't have C++ in their interface. Mixing C++ versions is a horrible minefield. The easiest way to deal with it is to avoid it.

3 Static linking

Recent versions of gcc allow libstdc++ to be linked statically. It increases the size of the resulting executable, but with fewer worries about deployment issues.
-static-libstdc++
That will cause the compiler driver to direct the linker to prefer the static version of libstdc++ rather than the shared version, and I don't have to worry about deploying or upgrading the system libraries on the target box. Note, this isn't really a supported deployment configuration, so any bugs are going to be my problem.
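Combined with the sysroot from earlier, the whole command line looks something like:
arm-linux-gnueabihf-g++-6 --sysroot ~/mnt/rpi/ -static-libstdc++ -o hello hw.cpp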

4 CMake

I've been using CMake to generate the build system, so I need to explain to it how to use the cross compiler instead of one for the host system. CMake supports supplying definitions for these in toolchain files. This is what I have so far:
SET(CMAKE_SYSTEM_NAME Linux)
SET(CMAKE_SYSTEM_VERSION 1)
SET(CMAKE_SYSROOT $ENV{HOME}/mnt/rpi)

SET(CMAKE_C_COMPILER arm-linux-gnueabihf-gcc)
SET(CMAKE_CXX_COMPILER arm-linux-gnueabihf-g++)

SET(CMAKE_FIND_ROOT_PATH $ENV{HOME}/mnt/rpi)
SET(CMAKE_FIND_ROOT_PATH_MODE_PROGRAM NEVER)
SET(CMAKE_FIND_ROOT_PATH_MODE_LIBRARY ONLY)
SET(CMAKE_FIND_ROOT_PATH_MODE_INCLUDE ONLY)

SET(CMAKE_CXX_FLAGS "-static-libgcc -static-libstdc++" CACHE STRING "CXX_FLAGS" FORCE)

SET( THREADS_PTHREAD_ARG
     "0"
     CACHE STRING "Result from TRY_RUN" FORCE)
That, in addition to setting the compiler to use, forces a few CMake options that are otherwise problems. The first is setting the static link flag for libstdc++. The second is overriding the search for pthreads, because trying to run programs built with a cross compiler doesn't work very well; this lies and forces the option. Used like so:
cmake  -D CMAKE_TOOLCHAIN_FILE=~/src/toolchain/pi.cmake -DCMAKE_BUILD_TYPE=Release ..
A toolchain file for clang is a little more complicated, because clang doesn't really understand the gcc multilib layout, so it needs to be told where all the include and lib directories are for the target system, for both the C and C++ compilers.
SET(CMAKE_SYSTEM_NAME Linux)
SET(CMAKE_SYSTEM_VERSION 1)
SET(CMAKE_SYSROOT $ENV{HOME}/mnt/rpi)

set(triple arm-linux-gnueabihf)

set(CMAKE_C_COMPILER clang)
set(CMAKE_C_COMPILER_TARGET ${triple})
set(CMAKE_CXX_COMPILER clang++)
set(CMAKE_CXX_COMPILER_TARGET ${triple})

SET(CMAKE_FIND_ROOT_PATH $ENV{HOME}/mnt/rpi)
SET(CMAKE_FIND_ROOT_PATH_MODE_PROGRAM NEVER)
SET(CMAKE_FIND_ROOT_PATH_MODE_LIBRARY ONLY)
SET(CMAKE_FIND_ROOT_PATH_MODE_INCLUDE ONLY)

SET(CMAKE_CXX_FLAGS "\
 -isystem /usr/lib/gcc-cross/arm-linux-gnueabihf/6/../../../../arm-linux-gnueabihf/include/c++/6 \
 -isystem /usr/lib/gcc-cross/arm-linux-gnueabihf/6/../../../../arm-linux-gnueabihf/include/c++/6/arm-linux-gnueabihf \
 -isystem /usr/lib/gcc-cross/arm-linux-gnueabihf/6/../../../../arm-linux-gnueabihf/include/c++/6/backward \
 -isystem /usr/lib/gcc-cross/arm-linux-gnueabihf/6/include \
 -isystem /usr/lib/gcc-cross/arm-linux-gnueabihf/6/include-fixed \
 -isystem /usr/lib/gcc-cross/arm-linux-gnueabihf/6/../../../../arm-linux-gnueabihf/include"
  CACHE STRING "CXX_FLAGS" FORCE)

SET(CMAKE_C_FLAGS "\
 -isystem /usr/lib/gcc-cross/arm-linux-gnueabihf/6/include \
 -isystem /usr/lib/gcc-cross/arm-linux-gnueabihf/6/include-fixed \
 -isystem /usr/lib/gcc-cross/arm-linux-gnueabihf/6/../../../../arm-linux-gnueabihf/include"
  CACHE STRING "C_FLAGS" FORCE)

SET(CMAKE_EXE_LINKER_FLAGS "\
 -L /usr/lib/gcc-cross/arm-linux-gnueabihf/6 \
 -L /usr/lib/gcc-cross/arm-linux-gnueabihf/6/../../../../arm-linux-gnueabihf/lib/../lib \
 -L /usr/lib/gcc-cross/arm-linux-gnueabihf/6/../../../../arm-linux-gnueabihf/lib \
 -static-libgcc -static-libstdc++"
  CACHE STRING "LINKER FLAGS" FORCE)

SET( THREADS_PTHREAD_ARG
     "0"
     CACHE STRING "Result from TRY_RUN" FORCE)

5 Sources

Toolchain files are on GitHub next to the spingate sources, which now include the org file that is the source for this entry, crosscompile.org.

batch: running functions under a spingate

1 A batch of tasks to run

This adds a rather simple component to spingate, orchestrating a batch of tasks to be run, gated by the spingate. The tasks are added one at a time, a thread is created for each task, and the thread waits on the spingate to open before calling the task. Or at least that's how it started. Task was originally a std::function<void()>, which is essentially the interface of the thread pool I use. I realized, however, that I don't actually need to restrict the interface quite that much. std::thread takes a much wider range of things to run, and I can do the same thing: forward the supplied callable and arguments into the lambda that the thread is running. The key bit of code is
template <class Function, class... Args>
void Batch::add(Function&& f, Args&&... args) {
    workers_.emplace_back([ this, f = std::forward<Function>(f), args... ]() {
            gate_.wait();
            f(args...);
        });
}
There's a lot of line noise in there, and it really looked simpler when it was just taking a std::function<void()>, but it's not terrible. We take an object of type Function and a parameter pack of type Args by forwarding reference. That gets captured by the lambda, where we forward the function into the capture, and copy the parameter pack. Inside the lambda we call the function with the pack, f(args...). It's probable that I should have used std::invoke there, which handles some of the more interesting cases of calling a thing with arguments, but this was sufficient unto the day. The captured this allows access to the gate_ variable that we're waiting on. The workers_ are a vector of threads that we'll later run through and join() on, after open()ing the gate_.
void Batch::run() {
    gate_.open();
    for (auto& thr : workers_) {
        thr.join();
    }
}
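Since std::invoke came up: here is a sketch of what add could look like with the arguments forwarded into a tuple and called through std::apply (C++17, from <tuple>), which routes through std::invoke and so also copes with member function pointers. This is an aside under those assumptions, not the code in the repo:
template <class Function, class... Args>
void Batch::add(Function&& f, Args&&... args) {
    workers_.emplace_back(
        [this, f = std::forward<Function>(f),
         tup = std::make_tuple(std::forward<Args>(args)...)]() mutable {
            gate_.wait();
            std::apply(f, tup); // std::apply calls std::invoke underneath
        });
}
std::make_tuple also unwraps std::reference_wrapper, so passing std::ref(result_) through still yields a reference at the call site.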
That's really all there is to Batch. It's a middle connective glue component. Does one thing, and tries to do it obviously well. That is important since I'm trying to build up test infrastructure, and testing the test infrastructure is a hard problem. I have reorganized the code repo in order to do some light testing, though.

2 GTest

I've pushed things about in the source repo, moving the code into a library directory, which means I can link it into the existing mains, as well as into new gtests. In the CMake system, I've conditioned building tests on the googletest project being available as a subdirectory. I use enough different compilers and build options that trying to use a system build of gtest just doesn't work. The best, and recommended, choice is to build googletest as part of your project. That way any ABI-impacting subtlety, like using a different C++ standard library, is taken care of automatically. The bit of cmake magic is in the top-level CMakeLists.txt:
# A directory to find Google Test sources.
if (EXISTS "${CMAKE_CURRENT_SOURCE_DIR}/googletest/CMakeLists.txt")
  add_subdirectory(googletest EXCLUDE_FROM_ALL)
  add_subdirectory(tests)
else()
  message("GTEST Not Found at ${CMAKE_CURRENT_SOURCE_DIR}/googletest/CMakeLists.txt")
endif()
This looks for googletest to be available, and if it is, adds it and my tests subdirectory to the project; otherwise it issues a message. I prefer this to attempting to fix up the missing gtest automatically. That always seems to cause me problems, such as when I'm operating disconnected, on a train, like right now. The tests I have are pretty simple, not much more than primitive breathing tests.
TEST_F(BatchTest, run1Test)
{
    Batch batch;
    batch.add([this](){incrementCalled();});

    EXPECT_EQ(0u, called);

    batch.run();

    EXPECT_EQ(1u, called);
}
or, to make sure that passing arguments worked
TEST_F(BatchTest, runArgTest)
{
    Batch batch;
    int i = 0;
    batch.add([&i](int k){ i = k;}, 1);

    EXPECT_EQ(0, i);

    batch.run();

    EXPECT_EQ(1, i);
}
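These tests rely on a small fixture that isn't shown in the post. A minimal sketch of what it could look like, inferred from the names used above (the batch.h header name is my assumption):
#include "batch.h"       // assumed name of the header declaring Batch
#include <gtest/gtest.h>

class BatchTest : public ::testing::Test {
  protected:
    unsigned called = 0;                  // checked by the EXPECT_EQs above
    void incrementCalled() { ++called; }  // bumped from inside a gated task
};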
I don't actually expect to find runtime errors with these tests. They exercise the component just enough that I'm not generating compile errors in expected use cases. Template code can be tricky that way. Templates that aren't instantiated can have horrible errors, but the compiler is willing to let them pass if they mostly parse. SFINAE may not be your friend.
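For instance, a member call that depends on the template parameter isn't checked until the template is instantiated, so a broken template can sail through a build that never uses it. A tiny illustration:
template <class T>
void neverCalled(T t) {
    // Ill-formed for any T we'd actually use, but the compiler only
    // needs this to parse until someone instantiates it.
    t.noSuchMember();
}

int main() {} // builds cleanly; neverCalled is never instantiated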

3 Clang builds with current libc++

Building clang and libc++ locally is getting easier and easier. Using that is still a bit difficult. But there are some reasons to do so. One is just being able to cross check your code for sanity. I won't reproduce building clang and libc++ here. It's really at this point just checking out the repos in the right places and running cmake with something like:
cmake  -DCMAKE_INSTALL_PREFIX=~/install/llvm-master/ -DLLVM_ENABLE_LIBCXX=yes  -DCMAKE_BUILD_TYPE=Release   ../llvm/
Using that, at least from within cmake, is more complicated. CMake has a strong bias towards using the system compiler. It also has a distinct problem with repeating builds. NEVER edit your CMakeCache.txt. You can't do anything with it. All the paths are hard coded. Always start over. Either keep the command line around, or create a cmake initial cache file, which isn't the same thing at all as the CMakeCache.txt file. Right now, I'm cargo-culting around code in my cmake files that checks if I've defined an LLVM_ROOT, and if I have, supplies the flags to ignore all the system files and use the ones from the installed LLVM_ROOT, including some rpath fixup. There might be some way to convince cmake to do it, but there's also only so much I will fight my metabuild system.
  if(LLVM_ROOT)
    message(STATUS "LLVM Root: ${LLVM_ROOT}")
    set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -nostdinc++ -isystem ${LLVM_ROOT}/include/c++/v1")
    set(CMAKE_EXE_LINKER_FLAGS "${CMAKE_EXE_LINKER_FLAGS} -L ${LLVM_ROOT}/lib -l c++ -l c++abi")
    set(CMAKE_EXE_LINKER_FLAGS "${CMAKE_EXE_LINKER_FLAGS} -Wl,-rpath,${LLVM_ROOT}/lib")
  else()
I only check for that if the compiler I've chosen is a clang compiler, and it's not normally part of my environment.

4 Direction

Overall, what I want out of this library is to be able to stress test some nominally mt-safe code, and check that the conditions that I think hold are true. It's heavily influenced by jcstress, but, because this is C++, it will be rendered quite differently. For what I'm considering, look at Close Encounters of The Java Memory Model Kind. I want to be able to specify a state, with operations that mutate and observe the state. I want to be able to collect those observations in a deterministic way, which may require cooperation from the observers. I want to be able to collect the observations and report how many times each set of observations was obtained. Something like:
class State {
    int x_;
    int y_;

  public:
    typedef std::tuple<int, int, int, int> Result;
    State() : x_(0), y_(0) {}
    void writer1() {
        y_ = 1;
        x_ = 1;
    }
    void reader1(Result& read) {
        std::get<0>(read) = x_;
        std::get<1>(read) = y_;
    }
    void reader2(Result& read) {
        std::get<2>(read) = x_;
        std::get<3>(read) = y_;
    }
};
The experiment is to run the writers and readers on different threads and observe the possible results. On some architectures, reader1 and reader2 can see entirely different orders: even though y_ is written before x_, you might see x_ written and not y_. What I'd eventually like to be able to do is say things like, "This function will only be evaluated once", and have some evidence to back that up. So the next step is something that will take a State and schedule all of the actions with appropriate parameters in a Batch, and produce the overall Result. Then something that will do that many, many times, accumulating all of the results. And since this isn't Java, we don't have reflection techniques, so the State class is going to have to cooperate a bit. The Result typedef is one way. It will also have to produce all of the actions that need to be batched, in some heterogeneous form that I can then run.
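The accumulation side is just a histogram keyed on the Result tuple. Here is a sketch of the reporting I have in mind, nothing more than a possible shape (report and its layout are mine, not settled API):
#include <iostream>
#include <map>
#include <tuple>

using Result = std::tuple<int, int, int, int>;

// Hypothetical reporting step: tally each distinct observation.
// std::tuple's operator< makes Result usable as a map key directly.
void report(std::map<Result, long> const& counts) {
    for (auto const& entry : counts) {
        auto const& r = entry.first;
        std::cout << std::get<0>(r) << ' ' << std::get<1>(r) << ' '
                  << std::get<2>(r) << ' ' << std::get<3>(r)
                  << " seen " << entry.second << " times\n";
    }
}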

5 Source Code

Exported from an org-mode doc, batch.org, which is available, with all of the source, on GitHub at SpinGate.

spingate

1 Building a simple spin gate in C++

This is a very simplified latch which allows several threads to block and busywait until they are released to begin work in parallel. I'm using this to do some multi-thread stress testing, where I want to maximize the overlapping work in order to check for surprising non-deterministic behavior. There's no count associated with the gate, nor is it reusable. All we need is one bit of information: whether the gate is open or closed.

The simplest atomic boolean is std::atomic_flag. It's guaranteed by the standard to be lock free. Unfortunately, to provide that guarantee, it provides no way to do an atomic read or write to the flag, just a clear, and a set-and-return-prior-value. This isn't rich enough for multiple threads to wait on, although it is enough to implement a spinlock. std::atomic<bool> isn't guaranteed to be lock free, but in practice, any architecture I'm interested in has at least 32-bit aligned atomics that are lock free. Older hardware, such as ARMv5, SparcV8, and 80386, is missing cmpxchg, so loads are generally implemented with a lock in order to maintain the guarantees if there were a simultaneous load and exchange. See, for example, the LLVM Atomic Instructions and Concurrency Guide. Modern ARM, x86, Sparc, and Power chips are fine.

When the spin gate is constructed, we'll mark the gate as closed. Threads will then wait on the flag, spinning until the gate is opened. For this we use Release-Acquire ordering between the open and wait. This will ensure any stores done before the gate is opened will be visible to the thread waiting.
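As an aside, std::atomic_flag really is enough for that spinlock; a minimal sketch:
#include <atomic>

class SpinLock {
    std::atomic_flag flag_ = ATOMIC_FLAG_INIT;

  public:
    void lock() {
        // test_and_set returns the prior value; spin until we see it
        // clear, having set it ourselves.
        while (flag_.test_and_set(std::memory_order_acquire)) {
        }
    }
    void unlock() { flag_.clear(std::memory_order_release); }
};
The gate, though, needs a flag that any number of waiting threads can read, so it's built on std::atomic<bool>: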
// spingate.h                                                       -*-C++-*-
#ifndef INCLUDED_SPINGATE
#define INCLUDED_SPINGATE

#include <atomic>
#include <thread>

class SpinGate
{
    std::atomic_bool flag_;

  public:
    SpinGate();
    void wait();
    void open();
};

inline
SpinGate::SpinGate() {
    // Close the gate
    flag_.store(true, std::memory_order_release);
}

inline
void SpinGate::wait() {
    while (flag_.load(std::memory_order_acquire))
        ; // spin
}

inline
void SpinGate::open() {
    flag_.store(false, std::memory_order_release);
    std::this_thread::yield();
}


#endif
#include "spingate.h"
// Test that header is complete by including
Using a SpinGate is fairly straightforward. Create an instance of SpinGate and wait() on it in each of the worker threads. Once all of the threads are created, open the gate to let them run. In this example, I sleep for one second in order to check that none of the worker threads get past the gate before it is opened. The synchronization is on the SpinGate's std::atomic_bool, flag_. The flag_ is set to true in the constructor, with release memory ordering. The function wait() spins on loading the flag_ with acquire memory ordering, until open() is called, which sets the flag_ to false with release semantics. The other threads that were spinning may now proceed. The release-acquire ordering ensures that writes made before the gate was opened will be seen by the threads that were spin waiting.

1.1 Update:

I've added a yield() call after the open, hinting that the thread opening the gate may be rescheduled and allow the other threads to run. On the system I'm testing on, it doesn't seem to make a difference, but it does seem like the right thing to do, and it does not seem to hurt.
#include "spingate.h"

#include <vector>
#include <chrono>
#include <thread>
#include <iostream>


int main()
{
    std::vector<std::thread> workers;
    SpinGate gate;
    using time_point = std::chrono::time_point<std::chrono::high_resolution_clock>;
    time_point t1;
    auto threadCount = std::thread::hardware_concurrency();
    std::vector<time_point> times;
    times.resize(threadCount);

    for (size_t n = 0; n < threadCount; ++n) {
        workers.emplace_back([&gate, &t1, &times, n]{
                gate.wait();
                time_point t2 = std::chrono::high_resolution_clock::now();
                times[n] = t2;
            });
    }

    std::cout << "Open the gate in 1 second: " << std::endl;
    using namespace std::chrono_literals;
    std::this_thread::sleep_for(1s);
    t1 = std::chrono::high_resolution_clock::now();
    gate.open();

    for (auto& thr : workers) {
        thr.join();
    }

    int threadNum = 0;
    for (auto& time: times) {
        auto diff = std::chrono::duration_cast<std::chrono::nanoseconds>(time - t1);
        std::cout << "Thread " << threadNum++ << " waited " << diff.count() << "ns\n";
    }
}
I'd originally had the body of the threads just announcing themselves on std::cout; the lack of output before the gate opened, plus the overlapping output afterwards, was evidence of the gate working. That looked like:
for (std::size_t n = 0; n < std::thread::hardware_concurrency(); ++n) {
    workers.emplace_back([&gate, n]{
            gate.wait();
            std::cout << "Output from gated thread " << n << std::endl;
        });
}
The gate is captured in the thread lambda by reference, the thread number by value, and when run, overlapping gibberish is printed to the console as soon as open() is called. But then I became curious about how long the spin actually lasted. Particularly since the guarantees for atomics with release-acquire semantics, or even sequentially consistent ones, are that once a change is visible, the changes made before it are also visible. How fast the change becomes visible, and what the costs are of making the happened-before writes available, is really a function of the underlying hardware. I'd already observed better overlapping execution using the gate, as opposed to just letting the threads run, so for my initial purposes of making contention more likely, I was satisfied. Visibility, on my lightly loaded system, seems to be in the range of a few hundred to a couple thousand nanoseconds, which is fairly good. Checking how long it took to start let me do two things: first, play with the new-ish chrono library; second, check that the release-acquire sync is working the way I expect. The lambdas that the threads are running capture the start time by reference. The start time is set just before the gate is opened, and well after the threads have started running. The spin gate's synchronization ensures that if the state change caused by open is visible, the setting of the start time is also visible. Here is one set of results from running spingate:
Open the gate in 1 second: 
Thread 0 waited 821ns
Thread 1 waited 14490ns
Thread 2 waited 521ns
Thread 3 waited 817ns

2 Comparison with Condition Variable gate

The normal way of implementing a gate like this is with a condition variable, associated mutex, and a plain bool. The mutex guarantees synchronization between the wait and open, rather than the atomic variable in the SpinGate. The unlock/lock pair on the mutex has release-acquire semantics. The actual lock and unlock are done by the unique_lock guard.
// cvgate.h                                                           -*-C++-*-
#ifndef INCLUDED_CVGATE
#define INCLUDED_CVGATE

#include <mutex>
#include <condition_variable>
#include <atomic>
#include <thread>

class CVGate
{
    std::mutex lock_;
    std::condition_variable cv_;
    bool flag_;

  public:
    CVGate();

    void wait();
    void open();
};

inline
CVGate::CVGate()
: lock_(),
  cv_(),
  flag_(true)
{}

inline
void CVGate::wait() {
    std::unique_lock<std::mutex> lk(lock_);
    cv_.wait(lk, [&](){return !flag_;});
}

inline
void CVGate::open() {
    std::unique_lock<std::mutex> lk(lock_);
    flag_ = false;
    cv_.notify_all();
    std::this_thread::yield();
}
#endif
#include "cvgate.h"
// Test that header is complete by including
This has the same interface as SpinGate, and is used exactly the same way. Running it shows:
Open the gate in 1 second: 
Thread 0 waited 54845ns
Thread 1 waited 76125ns
Thread 2 waited 91977ns
Thread 3 waited 128816ns
The overhead of the mutex and condition variable is significant. On the other hand, the system load while it's waiting is much lower. SpinGate will use all available CPU, while CVGate's waiting threads block, so useful work can be done by the rest of the system. However, for the use I was originally looking at, releasing threads for maximal overlap, spinning is clearly better. There is much less overlap as the cv-blocked threads are woken up.

3 Building and Running

This is a minimal CMake file for building with the system compiler.
cmake_minimum_required(VERSION 3.5)
set(CMAKE_LEGACY_CYGWIN_WIN32 0)

project(SpinGate C CXX)

set(THREADS_PREFER_PTHREAD_FLAG ON)
find_package(Threads REQUIRED)

set(CMAKE_EXPORT_COMPILE_COMMANDS ON)

set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -std=c++14 -ftemplate-backtrace-limit=0")
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -Wall -Wextra -march=native")
set(CMAKE_CXX_FLAGS_DEBUG "-O0 -fno-inline -g3")
set(CMAKE_CXX_FLAGS_RELEASE "-Ofast -g0 -DNDEBUG")

include_directories(${CMAKE_CURRENT_SOURCE_DIR})

add_executable(spingate main.cpp spingate.cpp)
add_executable(cvgate cvmain.cpp cvgate.cpp)
target_link_libraries(spingate Threads::Threads)
target_link_libraries(cvgate Threads::Threads)
And here we build a release version of the test executable:
mkdir -p build
cd build
cmake -DCMAKE_BUILD_TYPE=Release ../
make
-- Configuring done
-- Generating done
-- Build files have been written to: /home/sdowney/src/spingate/build
[ 50%] Built target cvgate
[100%] Built target spingate

4 Org-mode source and git repo

Exported from an org-mode doc. All of the source is available on GitHub at SpinGate.

C++ code in Org-mode

1 C++

This is a simple example with one C++ file, all in one src code block.
#include <iostream>

int main()
{
    std::cout << "Hello World++! 2.9";
}
The src block looks like:
    #+HEADERS: :tangle hello.c++ :exports code :eval never
    #+BEGIN_SRC C++
    // source code
    #+END_SRC
The HEADERS block is on a separate line because when the buffer is evaluated, code will get run, and the SRC blocks will get rewritten, as well as the RESULTS blocks. Since we want the headers to be preserved, we can't make them part of the SRC block. You can also tangle a Makefile.
clean:
    -rm hello

hello:  hello.c++
    clang++ hello.c++ -o hello
The org-mode headers export the code and never evaluate this block, just like the C++ code. We'll evaluate the Makefile, and run the program, a bit further down.
    #+HEADERS: :tangle Makefile :exports code :eval never
    #+BEGIN_SRC makefile
    # Makefile
    #+END_SRC
Now, we tangle the code out to the files.
    #+NAME: tangle-buffer
    #+HEADERS: :exports both :results value
    #+BEGIN_SRC emacs-lisp
    (org-babel-tangle)
    #+END_SRC
(org-babel-tangle)
That will write out two files when the buffer is evaluated using org-babel-execute-buffer, bound to C-c C-v b. Give this block a name, so that where the results go can be controlled. Do that by giving the RESULTS block the name of the SRC block. Org will then produce a table of the results of executing the elisp, which is the two files produced, and put the results here:
hello.c++ Makefile
Next, we run make with the target to compile the code. You could also simply write the compiler command here.
    #+NAME: make-clean-hello
    #+BEGIN_SRC sh :exports both :results output
    make clean
    make hello
    #+END_SRC
make clean
make hello
And make will run our compilation as spec'd in the Makefile we just tangled out.
rm hello
clang++ hello.c++ -o hello
And now get the output by running the program.
    #+NAME: run-hello
    #+BEGIN_SRC sh :exports results
    ./hello
    #+END_SRC
Which prints out our hello, world text, which has a version number to convince myself it gets updated.
Hello World++! 2.9

Building Emacs 25.1 on Ubuntu 16.10

1 Why notes

Making notes so I don't forget, although the key problem is fixed upstream. Ubuntu 16.10 (Yakkety Yak) has made a critical change to the system compiler: everything is by default built with position independent executable support (PIE), in order to get better support for address space layout randomization. Here are the security notes for PIE. Emacs does not dump successfully with this. The compiler option -no-pie needs to be added into CFLAGS. The option also means that static libraries you've built before will probably need to be rebuilt. See the link above for typical errors.

2 Getting Ready

First get dpkg-dev, g++, gcc, libc, make:
sudo apt-get install build-essential
Then get the full set of build dependencies for the previous emacs, emacs24:
sudo apt-get build-dep emacs24
Decide if you want to build just this version, or track emacs. I track from git, because. So I have a directory for emacs where I have master, emacs25, and build directories. I try to avoid building in src dirs. It makes it easier to try out different options without polluting the src.
mkdir -p ~/bld/emacs
cd ~/bld/emacs
git clone git://git.savannah.gnu.org/emacs.git emacs.git
cd emacs.git
git worktree add ../emacs-25.1 emacs-25.1
cd ..
mkdir bld-25.1

3 Configure and build with magic option

Now configure in the build directory:
cd bld-25.1
../emacs-25.1/configure \
  --prefix=~/install/emacs-25.1 \
  --with-x-toolkit=gtk3 \
  --with-xwidgets \
  CFLAGS=-no-pie
I built with xwidget support to play with the embedded webkit widget. It's not really usable as a browser, but has uses for rendering. I also install into a local program directory, under my homedir. Build and install:
make
make install
I have a bin directory early in $PATH so that I can select versions of local software ahead of system software.
cd ~/bin
ln -s ~/install/emacs-25.1/bin/emacs
ln -s ~/install/emacs-25.1/bin/emacsclient
Now you should have a working emacs 25.1 available.