std::execution, Sender/Receiver, and the Continuation Monad

Some thoughts on the std::execution proposal and my understanding of the underlying theory.

What’s proposed

From the paper’s Introduction

This paper proposes a self-contained design for a Standard C++ framework for managing asynchronous execution on generic execution contexts. It is based on the ideas in [P0443R14] and its companion papers.

Which doesn’t tell you much.

It proposes a framework where the principle abstractions are Senders, Receivers, and Schedulers.

A composable unit of work.
Delimits work, handling completion, exceptions, or cancellation.
Arranges for the context work is done in.

The primary user facing concept is the sender. Values and functions can be lifted directly into senders. Senders can be stacked together, with a sender passing its value on to another function. Or stacking exception or cancellation handling the same way.

Receivers handle the three ways a sender can complete, by returning a value, throwing an exception, or being canceled. As described, receivers are most likely to be implemented within particular algorithms that combine senders, such as `then` or `retry`.

Schedulers provide access to execution contexts. Like inline, single thread, a thread pool, a GPU, and so on, would all have schedulers that provide for putting a sender into the context they manage.

There’s a fairly large API surface being proposed. But there’s an underlying theory about this, governing what algorithms need to be there and how the pieces fit together.

Continuation Passing Style and the Continuation Monad

Continuation passing style is a transformation from a normal function and a call stack to a direction to send the result to the “continuation” without returning. This means the functions context can be cleaned up. Delimited continuations are a slight variation, where instead of an unbounded “rest of the program”, the continuation has an end point and a value. It’s essentially a function, and can be handled as such. There is a purely mechanical method for converting all of the lambda calculus transforms into CPS form, and this can be profitable for compilers based on lambda, or related logics, like system F.

The mechanical transformation also means that all the control structures, like loops, gotos, coroutines, exceptions, have CPS equivalents.

CPS is tedious, though. Having to explicitly add a continuation to everything is complicated.

However, there’s also a typeclass, or concept, that allows you to convert regular functions into continuation passing style, automatically. It’s then rather straightforward to involve concerns like where work is being run for something that wraps up the entire work. Even being able to switch back and forth between contexts. That’s the continuation monad.

And unfortunately monads became an organizing principle in programming language theory one or two decades after most CS programs were standardized. So it’s all complicated and involves things we weren’t trained on. Fitting it into C++ has been an ongoing challenge, and until we had generic lambda was neither reasonbly concise nor idiomatic.

See, however, the new monadic interface additions for std::optional for why you want this. Or Ranges, which are solidly based in the ‘list’ or non-deterministic monad.

We have a poor relationship with Theory

There is no satisfactory PL theory for object oriented programming. There’s lots of work, but it mostly ends up describing something that OO programmers don’t think is quite the same as what they do. Even the ones who spend a lot of time doing theory.

Yet OO was, and is, a successful discipline. Working with identity, behavior, and state has produced remarkable results. Temporal calculi, not so much.

For a long while, we as a discipline thought that multi-threading was similar. There was poor theory, but we had hardware, and libraries that let us use that hardware to do concurrent work correctly.

That turned out not to be the case.

Concurrency can’t be just a library, unfortunately. Concurrency models that hardware vendors will commit to won’t promise not to violate causality. That makes producing a programming model programmers can use frighteningly difficult.

Which is why having a sound theory for std::execution is a good thing, even if the theory is unfamiliar.

But as a group, we learned the wrong lessons from the 80s and thought it was a researcher’s job to take the successes of practitioners and put a sound basis to them. Ignoring that it is a feedback loop. In the 60s and 70s, those researchers were also the practitioners. It’s not wrong to get out ahead of theory, but we do need to check back.

p2300 std::execution

Senders, via the Decorator pattern, lift ordinary functions into the continuation passing style. People writing functions only need to be concerned with handling the arguments they are passed, without concern for execution context or continuations. Functions used by senders act like, and are, normal functions.

Senders manage a bundle of channels, representing normal return of a value, throwing an exception, or an error channel to handle cancellation, or other errors not within the bound of ordinary functions. All of these channels can be composed taking the result to another function, or monadically with a function returning a sender, where that function can determine the kind of sender based on the values of the arguments. The channels can be combined or rerouted, connecting one to another, or presenting a variant containing either result, exception, and/or error to the continuation function.

Although senders form a logical graph of units of work, the physical type model is containment, much like expression templates. The result of binding senders together via an algorithm is a sender that contains the bound together senders. There are no nodes or allocations inherent to the model, just function calls.

C++ coroutines fit into this model. C++ coroutines are, from the outside, functions with rules about the interaction patterns with the returned value. Making a coroutine owning type a sender, and a sender co_awaitable, is possible and has been demonstrated.

std::execution takes the Continuation Monad and fits it to C++ control flow, return or exception, and adds cancellation, which incidentally allows a channel for failures from execution contexts. The thread pool can potentially signal failure via the error channel, without aliasing problems from application function code. However, for advanced users, these can be folded back into the normal function arguments and handled by application code. Policy decisions are not burned into the ROM of std::execution, but there are defaults that can be provided by application infrastructure authors.

Those infrastructure authors do not have to be std library vendors. The protocols, rendered as concepts, are available to normal users.

Network TS

Eppur si muove
And yet it moves

I do not believe ASIO’s model is a firm foundation for all async programming. However, it is well proven, and exists. It works.

And …

I have confidence that a networking library can and will be built using p2300. I am less confident that can be done in the timeframe for C++26. I do not believe for a moment we could have one for C++23, even with an existence proof a networking library appearing now. It’s simply too late to review and agree. We’re in the same place as coroutines. We can have the machinery, but without all of the application user facing infrastructure we should have.

I think this was the right choice with coroutines, and I think providing the machinery for general continuation based async in the standard library so that we can build on top of it is the right choice. The authors have committed to making sure all the facilities are available for programmers, in particular the pipe syntax (an issue for ranges) as well as providing bases or adapters for coroutine promises and typed senders. We can experiment and add existing practice as we go.


This is all my personal opinion, based on my own understanding. I’ve been in the meetings, I’ve been in discussions, asked questions. But if I’m wrong about some aspect of the proposal, that’s on me. Certainly not a formal opinion of Bloomberg, where I work. While we do lots of network services, and async programming, this isn’t what our tech looks like at all. Getting from here to there is an open question, but it would be for ASIO, too.

At least it isn’t CORBA.

Source For Blog

Standard Vocabulary for Algorithms

This is feedback after considering A Plan for C++23 Ranges

Disclosure: I voted in favor of this. It does not suggest work on views::maybe [P1255R6]​. I’m fine with that priority.

Vocabulary is not just Types between Components

There’s broad agreement that ‘vocabulary types’ belong in the standard. Domain independent types that can be used between otherwise uncoupled facilities. They each depend on facilities provided by Standard C++ and can thus be developed independently. Each component can rely on the API of a core vocabulary type not changing. This allows facilities to communicate effectively.

A component can have std::string or std::vector<int> in its interface with little concern. The STL parts of the standard library have always relied on concepts for constraining template parameters implicitly, requiring an Iterator of particular category in places, or specializing algorithms based on such. Composition of container types has also been natural. No one is confused by a std::unordered_map<std::string, std::vector<my::person>>.

The new C++20 range facilities allow composition of algorithms, which means that the names of algorithms will, and must, become part of the vocabulary of C++ programmers. Algorithms, and the names for them, are far broader than the linear containers supported by C++ today. It is a particular problem because C++ has named core algorithms badly, as they were not, until recently, first class entities.

Algorithms are now first class entities

Functions are a terribly overloaded term in C++. Overloaded is an overloaded term.

With respect to ranges and views, however, it should be clear that the adaptors, transformers, and other entities we apply to ranges and views to produce new ranges, views, and values, are now first class entities in their own right.

The important thing, for programmers, is that moving up the ladder of abstraction should not be conceptually more difficult than dealing with existing libraries of functions. We have algorithms that deal with generic types rather than functions that deal with concrete types, but keeping them in mind should not be fundamentally more difficult.

Naming is therefore critical.

As are chosing the algorithms that are primitive, that is, the ones that compose into others in useful ways.

I have some concerns about providing fused algorithms that a sufficiently smart compiler could optimize. On the other hand, I doubt how imminent sufficiently smart compilers are.

Why Haskell keeps coming up

Python was the answer to “Why can’t I just write pseudo-code?”

I have text books where the authors seriously discussed writing out your algorithm into pseudo code, and then translating it into actual programming language. Programming languages were too high ceremony to discuss what you wanted to do, while normal English was not structured enough. “Scripting” languages gave precise higher level semantics that were implicit in the pseudo code, or it would have been impossible to translate by hand to other computer languages.

Haskell was designed to do that for algorithm research. The primary goal was to be able to give direct executable form to a research paper. This was early replicable research. The paper provided, on its own, the mechanism to test and reproduce the result. Haskell source code can be embedded in LaTeX and given directly to a Haskell compiler. Every interesting algorithms paper for the last 30 years or so has implicitly included the prelude.

Haskell itself is not a terribly complex language. Much of it is specified as sugar where a form is translated into a more primitive expression, as range for loops are in C++. Its power for researchers is in its inherent laziness, that an expression is not evaluated unless it is required, and its purity, that values are immutable. These properties make it straightforward to reason with and about. The core property of lazy evaluation has forced it to be pure, not allowing side-effects. This has forced it to be exemplary, and evolutionary, in supporting value oriented programming.

These factors are why we end up often looking to Haskell.

Value oriented programming has been an emerging style in modern C++. The new Range facilities embody the style. Being able to crib proveably correct facilities from modern programming language research is a benefit.

There was a long period, in which C++ grew up, that academic research lagged behind the pragmatic craft of programming. This has, in retrospect, not been an unqualified success. Parallelism and concurrency were widely implemented without theoretic underpnnings. We are still trying to put C++’s memory model on firm footing. Many constructs regularly used in the past are now understood to be unsound, in explicable ways. Multi-threading should not be an experimental science.

Object orientation is an area which still does not have satisfying theoretic backing. There are various theories that provide many of the same benefits, allowing abstraction and encapsulation. In OO, however, objects have identity and state, which implies that equational reasoning can fail, as two equivalent object may have distinct identity and then different behavior. Temporal calculus formalisms are still open research.

The manifest success of OO{A,D,P} with limited underlying theory has allowed many practitioners to believe that academic research is irrelevant. To the extent of dismissing rigor. Jokes about the “M-word” are emblematic.

Sidebar: Monads are boring

A monad is a generic type with a few operations, some of which can be expressed in terms of the others, that follow a few fundamental and fairly trivial rules. They allow predictable composition of entities that model monad.

It is not about `then`. Although `then` does tend to fall out.

The interesting thing is that algorithms can be built that take generic types that follow some “Laws”, and they will do, something, correctly. For some kinds of generic type, what happens may be surprising, even interesting. The surprises will, however, not break anything. That’s the point of the semantic guarantees surrounding the operations.

C++ has philosophically adopted semantics surrounding Concepts. A Concept is entitled to demand behavior and laws in addition to the checkable syntax that can be enforced.

The utility for an algorithm writer is that reasoning can be applied at a much higher level. Even beyond the abstraction that this is a Container, or Range, or View, being able reason about the operations for anything that provides a set of well defined operations means that the algorithm is correct, even if the details when applied to a new kind are obscure.

Iterators provide a way for algorithms to work with many containers. Those same, correct, algorithms should be able to work with things that aren’t Iterable, or Containers. It turns out having a named way of converting a Type<Type<Value>> to a Type<Value> opens up many algorithms to broader use. Flattening a vector of vectors into a linear vector is not the only use of a `concat`.

Fold: In Which I Become Cranky About Names

One of the most primitive and yet powerful algorithms is `fold`. Fold takes a binary operation and interposes it between elements in an iterable data type. It replaces the commas {a, b, c, d, …} with `op`, resulting in {a op b op c op d … }

Except that is slightly ambiguous. Is that (a op (b op (c op (d)))) or ((((a) op b) op c) op d)

Right or left evaluation of the terms.

Folding right is strictly more general, at least in the lambda calculus. Left fold can be implemented as a right fold, but not vice-versa.

However, in C++, the normal implementation of folds is not as general as it might be, because C++ is strict in evaluating function arguments. This means that even though a fold whose operation might not need to evaluate one of its arguments in order to return a value, will still do so, because the argument is evaluated before being passed to the operation. This mean, in particular, that we can’t simply undo fold by using a `cons` operation, reconstructing the list. The entire, indeterminate and possibly infinite, list must be evaluated even though all that is necesary is the first element.

Left folding is tail recursive, and so can be converted to a loop. This is why it seems natural in C++.

T fold_left(Range&& r, T init, BinaryOperation op) {
    range_iterator_t<Range> b = begin(r);
    range_sentinel_t<Range> e = end(r);
    for (; b != e; ++b) {
        init = op(std::move(init), *b);
    return init;

Right fold is not generally tail recursive, although for finite sequences in C++, using reverse iterators, foldr is also a loop.

auto fold_right_loop(Range&& r, T init, BinaryOperation op) -> result_of<>{
    range_iterator_t<Range> b = begin(r);
    range_sentinel_t<Range> e = end(r);
    for (; b != e; ++b) {
        init = op(std::move(init), *b);
    return init;

To make the diagrams below pretty, we assume that foldl over a sequence of T takes a single value of type Z and an op of Z -> T -> Z, while foldr over a sequence of T takes an op of type T -> Z -> Z and a value of type Z. For a left fold, the value is given to the op that takes the first element in the sequence, while for a right fold, it is given to the op that takes the last element in the sequence.



result = op(op(op(op(op(op(op(op(z, a), b), c), d), e), f), g), h);


result = op(a, op(b, op(c, op(d, op(e, op(f, op(g, op(h, z))))))));

So, left fold piles up the ops on the left and right fold piles them up on the right. The ‘zero’ or init value is with the last value for a right fold, and the first on a left fold. In converting to iteration, which is important, left fold is in the natural direction, and right fold is reversed, which means it can’t be used that way on some sequences.

So which one should be the fold. It’s an important question because short names are implicitly privileged. Programmers assume that the short name is the best one, and longer names indicate more specialized usage. This isn’t true for C++, but not really on purpose. We did think that `map` was a good choice. The programmers should probably reach for `unordered_map` instead is unfortunate, but at this point unfixable.

Since C++ is strict, it would seem that left fold is the fold you mostly want, so it would be OK to call it fold.

Haskell, as lazy as it is, has issues with folds on ranges with indeterminate bounds. foldl vs foldr vs foldl' is a constant question. Foldr Foldl Foldl’ Memory consumption of either stack or unevaluated expressions will surprise programmers.

But in the literature, fold is right fold. In the pure lambda calculus, right fold is strictly more expressive. Tail recursion is a subset of primitive recursion, and left is tail recursive while right is not. If we call our left fold, fold, we will confuse people.

Further, the forward iteration is somewhat balanced by the fact that the normal addition to a sequence is emplace_back or push_back, which means that left fold will tend to reverse lists if the operation is accumulation into a sequence, if for example the op were [](auto t){v.emplace_back(t);}.

More, we have started to introduce lazy features into the language, with coroutines. Although we won’t get them soon, the C++ standard library should eventually have good lazy algorithms. It would be unfortunate if we have to teach that you can’t use co_fold in place of fold because they are opposite, or that co_fold_left is how you spell the coroutine fold. [Note co_ prefixes are mildly sarcastic.]

C++ has a long history of setting defaults and realizing later that they are a problem.

I would much prefer we not make a choice about which fold is the fold, and just spell them as fold_left and fold_right.

We should all be able to get behind not calling it accumulate, even if we must live with transform.


Some other time I’ll vent about the various fold operations and how they get extended if your type is a monoid. See Chris Di Bella’s work such as A Concept Design for the Numeric Algorithms, although this rant suggests that algorithms that operate on different Concepts should have different names. Overloading on concepts with semantic differences can be confusing.

Value Oriented Programming: A Manifesto

Object Oriented Programming

The primitive entities are objects which have certain properties

  • Objects have Identity
  • Objects have State
  • Objects have Behavior

Values are not Objects

The primitive entities are values which have none of the properties of Objects

  • Values have no identity
  • Values have no State
  • Values have no Behavior

A positive definition of VOP

  • Values are Equivalent
  • Values are Immutable
  • Values are Side-effect Free

Values are Equivalent

One entity with a value will give the same results as a different entity with the same value. All ‘fives’ are the same. This allows equational reasoning. If `a` and `b` have the same value, `f(a)` and `f(b)` have the same value.

Values are Immutable

‘five’ never becomes ‘six’. Values do not change.

Values are Side-effect free

No operation on a value has an effect on other values. Temporal changes are explicit.

Value Oriented Programming does not deny Objects

  • There are things with identity
  • There are things that vary over time
  • There are things that affect the state of other things

Values are are a better basis

We can, and do, build objects out of values, but objects in the OO sense are impossible to reason about.

Values are.

AnyDuck : A Value Type Erased Type

A Constrained Duck Typed Value Type

For yak shaving reasons, I need a type able to hold any type conforming to a particular interface. I’d like this to act as a (Semi)Regular value type. That is, I’d like it to be copyable, assignable, and so forth, and not be sliced or otherwise mangeled in the process. I want it to be reasonably efficient, not significantly worse than traditional virtual function overhead. I also don’t want to be terribly creative in implementing, using existing std library types.

The overall pattern I’ve settled on for now is to hold the type in a std::any and dispatch to the held type through function pointers referring to lambdas. The lambda allows me to capture the type being stored into the AnyDuck and safely recover it. There’s some boilerplate to write to dispatch to the lambda. Perhaps one day, when we have reflection, that can be automated.

For purposes of this paper, I’ll assume I have an interface Duck that I want to model:

class Duck {
    void quack(int length) const;

Ducks are defined as things that quack, and quack is a const function. I want to be able to put any type that models Duck into an AnyDuck, and pass AnyDuck into any generic function expecting a Duck. I also want to be able to extend AnyDuck to unrelated types, as long as they model Duck. Mallard, for example:

class Mallard {
    void quack(int length) const;

The core of the idea, is to capture the Duck type in a templated constructor where I know the exact type, and create the appropriate lambda:

auto quack_ = [](std::any const& d, int i) {
    return std::any_cast<std::remove_reference_t<Duck>>(&d)->quack(i);

And then wrap the public facing call so that quackfn can be stored as a function pointer

void AnyDuck::quack(int length) const { return quack_(this->duck_, length); }

Here’s the whole thing:

class AnyDuck {
    std::any duck_;
    using quackfn = void (*)(std::any const&, int);
    quackfn quack_;

    AnyDuck(AnyDuck const&) = default;
    AnyDuck(AnyDuck&)       = default;

    template <typename Duck>
    AnyDuck(Duck&& duck)
        : duck_(std::forward<Duck>(duck)),
          quack_([](std::any const& d, int i) {
              return std::any_cast<std::remove_reference_t<Duck>>(&d)->quack(
          }) {}

    void quack(int length) const { return quack_(this->duck_, length); }

The copy constructors are there to be a better match than the templated constructor for copy by value. Codegen is surprisingly good. If the types are all present, the functions are inlined well, except for the overhead of storage into the any. For any unknown AnyDuck, there’s a dispatch via pointer indirection:

void test(AnyDuck a) {

results in something like

0000000000000050 <test(scratch::AnyDuck)>:
  50:   48 8b 47 10             mov    0x10(%rdi),%rax
  54:   be 01 00 00 00          mov    $0x1,%esi
  59:   ff e0                   jmpq   *%rax
  5b:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)

and the any_cast<> from the address of the passed in std::any is noexcept, but does in general have to check if the any has a value. Not as cheap as pure an interface type, but not terribly more expensive.

For the case where the quack is known, codegen is something like

scratch::AnyDuck::AnyDuck<Duck&>(Duck&)::{lambda(std::any const&, int)#1}::__invoke(std::any const&, int): # @scratch::AnyDuck::AnyDuck<Duck&>(Duck&)::{lambda(std::any const&, int)#1}::__invoke(std::any const&, int)
        movl    %esi, %edi
        jmp     bell(int)                # TAILCALL

If the implementation of the underlying quack is not available there’s a little more work

scratch::AnyDuck::AnyDuck<Mallard&>(Mallard&)::{lambda(std::any const&, int)#1}::__invoke(std::any const&, int): # @scratch::AnyDuck::AnyDuck<Mallard&>(Mallard&)::{lambda(std::any const&, int)#1}::__invoke(std::any const&, int)
        movl    $_ZNSt3any17_Manager_internalI7MallardE9_S_manageENS_3_OpEPKS_PNS_4_ArgE, %ecx
        xorl    %eax, %eax
        cmpq    %rcx, (%rdi)
        leaq    8(%rdi), %rcx
        cmoveq  %rcx, %rax
        movq    %rax, %rdi
        jmp     Mallard::quack(int) const    # TAILCALL

But far less than I would have expected. std::any is less awful than I thought.

You can take a look at the code so far here Compiler Explorer Link to see the results

I’ll clean up my scratch project and push it at some point.

Steve Downey’s Birthday (Observed)


Blooms Tavern
208 East 58th Street
Between 2nd & 3rd Ave
New York, NY 10022


6:00 PM ->


This is the sound
I bought a ticket to the world
But now I’ve come back again
Why do I find it hard to write the next line?
Oh, I want the truth to be said
I know this much is true


From: 110 W 14th St

Walk Walk

About 2 min , 335 ft

6:03 PM

14 Street Station

L Canarsie – Rockaway Pkwy

2 min (non-stop)

6:05 PM

14 Street – Union Sq Station

Walk Walk

About 1 min

6:09 PM

14 Street – Union Sq Station

5X Eastchester – Dyre Av

7 min (2 stops)

6:16 PM

59 St-Lexington Av Station

Walk Walk

About 4 min , 0.2 mi

6:20 PM

Blooms Tavern

208 E 58th St, New York, NY 10022

Walk Walk

About 2 min , 0.1 mi

6:03 PM

14 Street Station

M Forest Hills – 71 Av

13 min (6 stops)

6:16 PM

Lexington Av-53 St

Walk Walk

About 6 min , 0.3 mi

6:22 PM

Blooms Tavern

208 E 58th St, New York, NY 10022

From: 731 Lexington Avenue

Head southwest on Beacon Ct toward E 58th St
Restricted usage road
184 ft

Turn left onto E 58th St
Pass by Wells Fargo Bank (on the left in 115 ft)
Destination will be on the right
420 ft

Bloom’s Tavern

208 E 58th St, New York, NY 10022

Building Saar Raz’s clang concepts branch

A Recipe for building Saar Raz’s clang concepts branch

Saar Raz has been working on a Concepts implementation, available at It’s not much harder to build it than clang usually is, it’s just a matter of getting things checked out into the right places before configuring the build. Just like LLVM and clang normally.

In order to double check how, I peeked at the shell script used by the compiler explorer image to build the clang-concepts compiler:

The really important bit is getting exactly the right commit from LLVM, 893a41656b527af1b00a1f9e5c8fcecfff62e4b6.

To get a working directory something like: Starting from where you want your working tree and build tree, e.g ~/bld/llvm-concepts

git clone

pushd llvm
git reset --hard 893a41656b527af1b00a1f9e5c8fcecfff62e4b6

pushd llvm/tools
git clone clang

pushd llvm/projects
git clone
git clone
# The sanitizers: this is optional but you want them
git clone

Then to build and install

mkdir build && cd build

cmake \
    -DCMAKE_INSTALL_PREFIX=~/install/llvm-concepts/ \
    -DCMAKE_BUILD_TYPE=Release  \
    -G Ninja  \


ninja check

ninja install

Note that I install into a separate prefix, keeping it isolated from everything else. The compiler can be invoked as ~/install/llvm-concepts/bin/clang++. There’s no particular reason to put the compiler on your PATH.

Should Unicode literals be guaranteed to be well-formed?

TL;DR Betteridge’s law applies: No.

Are you still here?

Unicode Literals

In C++ 20 there are 2 kinds and 6 forms of Unicode literals. Character literals and string literals, in UTF-8, UTF-16, and UTF-32 encodings. Each of them uses a distinct char type to signal in the type system what the encoding is for the data. For plain char literals, the encoding is the execution character set, which is implementation defined. It might be something useful, like UTF-8, or old, like UCS-2, or something distinctly unhelpful, like EBCDIC. Whatever it is, it was fixed at compilation time, and is not affected by things like LOCALE settings. The source character set is the encoding the compiler believes your source code is written in. Including the octets that happen to be between " and ' characters.

char32_t s1[] = U"\u0073tring";
char16_t s2[] = U"\u0073tring";
char8_t s2[] = U"\u0073tring";

char32_t c1 = U'\u0073';
char16_t c1 = u'\u0073';
char8_t c1 = u8'\u0073';

Unicode codepoint U+0073 is ‘s’. So all of the strings are “string”, and all of the characters are ‘s’, however the rendering of each in memory is different. For the string types, each unit in the string is 32, 16 or 8 bits respectively, and for the character literals, each is one code unit, again of 32, 16, or 8 bits.


This is due to Zach Laine, an actual expert, and sorting out what happened took several other experts, so the rest of us have little hope.

char8_t suprise[] = u8"ς";
assert(strlen(surprise) == 5);

This comes down to your editor and compiler having a minor disagreement about encoding. The compiler was under the impression that the encoding was cp-1252, where the editor believed it to be UTF-8. The underlying octets for ς are 0xCF 0x82, each of which is a character in cp-1252. All octets are valid characters in cp-1252. So each was converted to UTF-8, resulting in 0xC3 0x8F and 0xE2 0x80 0x9A.

Of course.

The contents of the string are still in the source character set, not in any of the Unicode forms. Unless that happens to be the source character set.

But at least it’s well formed.


The \u escapes, which identify Unicode codepoints by number, will produce well formed Unicode encoding in strings, and in character literals if they fit. That is, in a char32_t, you can put any code point. In a char16_t you can put any character from the basic multilingual plane. In a char8_t you can put 7 bit ASCII characters.

Or you can use hex or octal escapes, which will be widened to code unit size and the value placed into the resultant character or string. And no current compiler checks if that makes sense, although they will warn you if, for example you try to write something like:

char16_t w4 = u'\x0001f9ff'; // NAZAR AMULET - Unicode 11.0 (June 2018)
char16_t sw4[] = u"\x0001f9ff";

where you’re trying to put a value that won’t fit into a code unit into the result.

warning: hex escape sequence out of range

The \xnn and \nnn hex and octal escapes are currently a hole that lets you construct ill-formed string literals. For example

char8_t oops = u8"\xfe\xed";

But there are lots of ways of constructing ill-formed arrays of char8_t or char16_t. Just spell them out as arrays:

char8_t ill = {0xfe, 0xed};

The type system doesn’t provide that char8_t means well formed UTF-8. All it does is tell you that the intended encoding is UTF-8. Which is a huge improvement over char.

But it does not provide any guarantee to an API taking a char8_t*.

\0 is an octal escape sequence

You can spell it \u0000. But that’s weird, since we spell it as \0 everywhere, and that it’s an octal escape is a C++ trivia question.

I want to be told if I’m forming ill-formed Unicode.

Then don’t use escape sequences. If you use either well encoded source code encodings or universal character names you will get well formed Unicode.

The primary reason for wanting ill-formed Unicode is for testing. It’s a convenience. And there are straightforward workarounds.

But disallowing hex and octal escapes in Unicode strings makes the language less regular while preventing an error that you had to go out of your way to create, and does not actually produce more runtime safety.

Litmus Tests for Multithreaded Behavior

Litmus Tests for Multithreaded Behavior

Or How Processors Don’t Do What You Think

Modern multicore processors are entirely weirder than almost anyone thinks possible. They are somewhat weirder than chip makers were willing to admit until fairly recently. They are sufficiently weird enough that almost all multi-threaded programs, and many lock-free algorithms, had bugs because the hardware just does not work the way anyone would reasonably expect. And, in addition, optimizing compilers did not actually know how to not break your code. [boehm2005threads]

I’m in the (slow) process of writing some test cases for multi-threaded code. I eventually want to have some confidence that some code is executed once, and only once, in an efficient manner. It’s the ‘efficient’ part that worries me, because for efficient, it also has a tendency to be clever, and I’m learning that clever MT code is often subtly broken. [bacon2000double] So if smarter people than me make mistakes about MT code, I need tests to compensate. And ones that will cover occasional allowed but unexpected behavior. Which means the test framework should be able to detect them.

Also, fortunately, the RPi is a computer that exhibits some odd behavior, as it is an ARM system. X86 has a much stronger model. However, even the x86 model is allowed to perform in odd ways.

Starting in 2007, Intel has started publishing short snippets of assembly and documenting what are the allowed and disallowed results running them in parallel. [IWPAug2007] These snippets have come to be called litmus tests, and are used to confirm the behavior of hardware, and confirm models of the hardware behavior. A particularly important model for C++ programmers is the x86 Total Store Order model [owens2009better] which provides a way of mapping the C++11 memory model to X86 hardware. X86 hardware provides a strongly consistent memory model. Power and ARM provide fewer guarantees, and mapping the C++ memory model to these architectures is more challenging. [maranget2012tutorial]

Message Passing

The tests outlined in the Intel paper are short pieces of assembly to be exercised on different processors, with guarantees about behavior that will not happen. The first one essentially promises that message passing will work, and is now known as the MP test.

Processor 0 Processor 1
mov [ _x], 1 // M1 mov r1, [ _y] // M3
mov [ _y], 1 // M2 mov r2, [ _x] // M4

Initially x = y = 0

r1 = 1 and r2 = 0 is not allowed

That says that we can’t read the writes of x and y out of order, which ensures that if we’re waiting to see the write to the flag y, we can be guaranteed to see the payload in x. If we change it slightly to wait for the write to _y to be visible, we can pass a message from one thread to anothher in _x. This is also known as Dekards Algorithm.

ARM and Power do not provide that guarantee without additional synchronization instructions.

In C++ that looks something like the following, using the test framework I’m writing.

MP::MP() : x_(0), y_(0) {}
void MP::t1() {, std::memory_order_relaxed);, std::memory_order_relaxed);
void MP::t2(Result& read) {
    while (!y_.load(std::memory_order_relaxed)){}
    std::get<0>(read) = x_.load(std::memory_order_relaxed);

Here, x_ and y_ are atomic<int>s, and we’re using the lowest possible atomic guarantee in C++, relaxed. Relaxed guarantees that the operation happens atomically. but there are no synchronization properties with anything else. This usally corresponds to the basic int type. Unless you’re using a really insane processor that might let an int be partially written and observable. Like you might get the top half of the int, or the middle byte. The commercial processors that allowed this have pretty much died out.

The test spins on seeing the load of y to be complete. It loads the value of x_ into a result tuple. The tuple is used as the key to a map which accumulates how many times each result has been seen.
Running the above on my x86 laptop:

[ RUN      ] ExperimentTest.MPTest1
(1) : 2000000

on my Raspberry Pi 3:

[ RUN      ] ExperimentTest.MPTest1
(0) : 483
(1) : 1999517

Using objdump to check the generated assembly

00000088 <litmus::MP::MP()>:
  88:   mov     r2, #0
  8c:   str     r2, [r0]
  90:   str     r2, [r0, #64]   ; 0x40
  94:   bx      lr

00000098 <litmus::MP::t1()>:
  98:   mov     r3, #1
  9c:   str     r3, [r0]
  a0:   str     r3, [r0, #64]   ; 0x40
  a4:   bx      lr

000000a8 <litmus::MP::t2(std::tuple<int>&)>:
  a8:   ldr     r3, [r0, #64]   ; 0x40
  ac:   cmp     r3, #0
  b0:   beq     a8 <litmus::MP::t2(std::tuple<int>&)>
  b4:   ldr     r3, [r0]
  b8:   str     r3, [r1]
  bc:   bx      lr

So, out of the 2,000,000 times that I ran the experiment, there were 483 times that reading x_ resulted in 0, even though y_ was 1. ARM has a weaker memory model than x86. This has some advantages in processor implementation. It has distinct disadvantages in how our brains work. X86 tries to preserve the model that there is shared memory that everyone sees and works with. That’s not strictly true, even for X86, but ARM and Power don’t even come close. On the other hand, it’s also why it’s easier to add more cores to Power and ARM chips and systems. I routinely work with Power systems with 512 physical cores.

Store Buffering

Store buffering is the odd case that is allowed in the Intel memory model. When assigning locations in two threads, and then reading them on opposite threads, both threads are allowed to read the older state. The stores get buffered.
From the Intel White Paper:

Processor 0 Processor 1
mov [ _x], 1 // M1 mov [ _y], 1 // M3
mov r1, [ _y] // M2 mov r2, [ _x] // M4

Initially x = y = 0

r1 = 0 and r2 ==0 is allowed

Note, in particular, there is no interleaving of M1 – 4 that could result in r1 and r2 being 0. Not without interupting an instruction in the middle. But the instructions themselves are atomic, and indivisible. If they were actually operating on shared memory, this would not be possible. However, it does happen.

SB::SB() : x_(0), y_(0) {}
void SB::t1(Result& read) {, std::memory_order_relaxed);
    std::get<0>(read) = x_.load(std::memory_order_relaxed);
void SB::t2(Result& read) {, std::memory_order_relaxed);
    std::get<1>(read) = y_.load(std::memory_order_relaxed);

That generates the x86 code

00000000000000f0 <litmus::SB::t1(std::__1::tuple<int, int>&)>:
  f0:   mov    DWORD PTR [rdi+0x40],0x1
  f7:   mov    eax,DWORD PTR [rdi]
  f9:   mov    DWORD PTR [rsi],eax
  fb:   ret

0000000000000100 <litmus::SB::t2(std::__1::tuple<int, int>&)>:
 100:   mov    DWORD PTR [rdi],0x1
 106:   mov    eax,DWORD PTR [rdi+0x40]
 109:   mov    DWORD PTR [rsi+0x4],eax
 10c:   ret

And on my x86 machine:

[ RUN      ] ExperimentTest.SBTest1
(0, 0) : 559
(0, 1) : 999858
(1, 0) : 999576
(1, 1) : 7

So 559 times neither core saw the other core’s store.

Load Buffering

Load Buffering is the dual of store buffering. Loads into registers might be delayed, or buffered, and actually performed after following instructions. It’s not allowed in the Intel architecture.

From the Intel White Paper

Processor 0 Processor 1
mov r1, [ _x] // M1 mov r2, [ _y] // M3
mov [ _y], 1 // M2 mov [ _x], 1 // M4

Initially x = y = 0

r1 = 1 and r2 = 1 is not allowed

LB::LB() : x_(0), y_(0) {}
void LB::t1(Result& read) {
    std::get<0>(read) = x_.load(std::memory_order_relaxed);, std::memory_order_relaxed);
void LB::t2(Result& read) {
    std::get<1>(read) = y_.load(std::memory_order_relaxed);, std::memory_order_relaxed);

This is the x86 asm code

00000000000000c0 <litmus::LB::t1(std::__1::tuple<int, int>&)>:
  c0:   mov    eax,DWORD PTR [rdi]
  c2:   mov    DWORD PTR [rsi],eax
  c4:   mov    DWORD PTR [rdi+0x40],0x1
  cb:   ret
  cc:   nop    DWORD PTR [rax+0x0]

00000000000000d0 <litmus::LB::t2(std::__1::tuple<int, int>&)>:
  d0:   mov    eax,DWORD PTR [rdi+0x40]
  d3:   mov    DWORD PTR [rsi+0x4],eax
  d6:   mov    DWORD PTR [rdi],0x1
  dc:   ret
  dd:   nop    DWORD PTR [rax]

And the ARM code, at -O1

000000d0 <litmus::LB::t1(std::tuple<int, int>&)>:
  d0:   ldr     r3, [r0]
  d4:   str     r3, [r1, #4]
  d8:   mov     r3, #1
  dc:   str     r3, [r0, #64]   ; 0x40
  e0:   bx      lr

000000e4 <litmus::LB::t2(std::tuple<int, int>&)>:
  e4:   ldr     r3, [r0, #64]   ; 0x40
  e8:   str     r3, [r1]
  ec:   mov     r3, #1
  f0:   str     r3, [r0]
  f4:   bx      lr

ARM generally allows it, but per [maranget2012tutorial] it’s very sensitive, and dependencies will make it not appear. In my tests, I did not observe an instance of a buffering, but it may be due to the first store the compiler introduces, in order to actually get the data into the tuple. That it’s documented as possible is still exceedingly strange.

Independent Reads of Independent Writes

IRIW is a generalization of store buffering, where two reader threads each read different apparent orderings of writes from two distinct writer threads.

T1 T2 T3 T4
X = 1 Y = 1 R1 = X R3 = y
R2 = Y R4 = X

Initially X=Y=0
Allowed in ARM, not in x86 r1=1, r2=0, r3=1, r4=0 [maranget2012tutorial,owens2009better]

This is not observed in x86 processors, but is in some ARM and POWER, more often in POWER. X86 hardware has a consistent view of memory where other hardware can see memory writes in different orders on different threads. On my rPi, I didn’t observe any incidents of X and Y being read out of order, over 40 million runs.

IRIW::IRIW() : x_(0), y_(0) {}
void IRIW::t1() {, std::memory_order_relaxed);

void IRIW::t2() {, std::memory_order_relaxed);

void IRIW::t3(Result& read) {
    std::get<0>(read) = x_.load(std::memory_order_relaxed);
    std::get<1>(read) = y_.load(std::memory_order_relaxed);

void IRIW::t4(Result& read) {
    std::get<2>(read) = y_.load(std::memory_order_relaxed);
    std::get<3>(read) = x_.load(std::memory_order_relaxed);


The allowed behavior of modern processors is very different than our mental model of a Von Neumann architecture computer. Each core can have a different view of memory, and without additional controls, writes and reads can break the illusion of a single unified memory. The C++ memory model gives the controls and guarantees about what happens when different threads read and write memory, and here I’ve deliberately used the weakest version available, relaxed, in order to allow the processors the wideest latitude in behavior. Relaxed is, for processors that have it, often just an unconstrained int, which means that you will get odd behavior if you are running shared state multithreaded code that uses plain native types. It is a particular problem with code that was originally written and tested on a x86 architecture because the native model is fairly strong. This frequently causes problems when porting to a mobile platform, where ARM is a very popular hardware choice.

Org-mode source and git repo

Exported from an org-mode doc. All of the source is available on github at SpinGate



  • [boehm2005threads] Boehm, Threads cannot be implemented as a library, 261-268, in in: ACM Sigplan Notices, edited by (2005)
  • [bacon2000double] @miscbacon2000double,
    title=The “double-checked locking is broken” declaration,
    author=Bacon, David and Bloch, Joshua and Bogda, Jeff and Click, Cliff and Haahr, Paul and Lea, Doug and May, Tom and Maessen, Jan-Willem and Mitchell, JD and Nilsen, Kelvin and others,
  • [IWPAug2007] @miscIWPAug2007,
    howpublished =
    note = Accessed: 2017-04-30,
    title = Intel® 64 Architecture Memory Ordering
    White Paper,
  • [owens2009better] Owens, Sarkar & Sewell, A better x86 memory model: x86-TSO, 391-407, in in: International Conference on Theorem Proving in Higher Order Logics, edited by (2009)
  • [maranget2012tutorial] Maranget, Sarkar & Sewell, A tutorial introduction to the ARM and POWER relaxed memory models, Draft available from http://www. cl. cam. ac. uk/\~ pes20/ppc-supplemental/test7. pdf, (2012).

Date: 2017-04-30 Sun 00:00

Author: Steve Downey

Created: 2018-06-17 Sun 14:07



Multithread Experiments

An Experiment Collects Samples

I’m modelling this in order to run bits of code like the various litmus tests used to describe multi-core architectures. A set of functions to be run in parallel that may or may not write to a result, which type is a property of the Test being run. The Experiment will run the Test collecting Samples. The Test type will provide a tuple of functions to run. They will be run under a spingate in all permutations in order to remove scheduling bias.

What a Test looks like

class MP { // Message Passing
    int x_;
    int y_;

    typedef std::tuple<int> Result;
    void t1();
    void t2(Result& read);

    auto actions() {
        return std::make_tuple([this]() { t1(); },
                               [this](Result& result) { t2(result); });

The Test interface must provide a Result type, and an actions() member that will produce a tuple of functions to run which either take no arguments or a reference to a result.

The test being defined here is the basic Message Passing litmus test.

MP::MP() : x_(0), y_(0) {}

void MP::t1() {
    x_ = 1;
    y_ = 1;

void MP::t2(Result& read) {
    while (!y) {
    std::get<0>(read) = x_;

Two variables are initialized to 0. One thread stores 1 to x first, then to 1 to y. The other thread loops until it reads a non-zero in y, and then reads x. The value in x is the message being passed between threads.

In an actual test, the variables would be atomics, specifiying load and store strength, and the variables might have constraints on layout to help sharing cache line updates.

An Experiment

An Experiment samples a test a number of times. It takes the result of each sample, and puts in a map of the results to count, incrementing the count for each distinct result. The actions to run are permuted each time, to help remove bias about which action is loaded behind the spingate first.

void Experiment::run(size_t count) {
    using Actions = decltype(std::declval<Test>().actions());
    auto getters = tupleutil::tuple_getters<Actions>();
    for (size_t i = 0; i < count; ++i) {
        Sample<Test> sample;;
        std::next_permutation(getters.begin(), getters.end());

tupleutil::tuple_getters returns an array of getters each of which returns a std::variant<Types…> with the same parameter pack as the tuple.

Sample runs all of the actions in a batch that locks them behind a spingate, and collects the results for each action.

template <class Test> class Sample {
    Batch                 batch_;
    Test                  test_;
    typename Test::Result result_;

    template <typename V, size_t I> void run(std::array<V, I> const& getters) {
        auto const& actions = test_.actions();
        add(actions, getters);;

Add is a templated member function that loops over the array, uses the getter to pull a function out of the tuple of actions and visits that with a lambda that will add either the function with no arguments, or that function with a reference to the results, to the batch.

    template <typename Tuple, typename Variant, size_t I>
    void add(Tuple const& actions, std::array<Variant, I> const& getters) {
        auto adder = [this](auto&& f) {
            using F = std::remove_cv_t<std::remove_reference_t<decltype(f)>>;
            if constexpr (std::is_invocable_v<F>) {
            } else {
                batch_.add(f, std::ref(result_));
        for (auto&& get_n : getters) {
            std::visit(adder, get_n(actions));

I am a bit dissatisfied with the else case not being constexpr if followed by a static assert, but getting the condition right didn’t work the obvious way, so I punted. There will be a compiler error if f(result_) can’t actually be called by the batch.

Batch recapped:

The key bit of code is

template <class Function, class... Args>
void Batch::add(Function&& f, Args&&... args) {
    workers_.emplace_back([ this, f = std::forward<Function>(f), args... ]() {

Batch has a spingate and runs all of the functions that are added sitting behind it. The run() function opens the gate and joins all the worker threads.

void Batch::run() {;
    for (auto& thr : workers_) {


With all the machinery in place, the test infrascructure can aggressively run multi-threaded tests, giving the thread scheduler the best opportunity to run all of the actions in any order. This allows multi thread bugs to be shaken out by looking for surprising results from the experiment.

Source Code

Exported from an org-mode doc,, which is available, with all of the source on github at SpinGate.

Why std::bind can’t be (formally) deprecated

Yes: std::bind should be replaced by lambda

For almost all cases, std::bind should be replaced by a lambda expression. It’s idiomatic, and results in better code. There is almost no reason post C++11 to use std::bind.

Doing so is quite straightforward, capture each bind argument by value in the lambda capture list, and provide auto parameters for each of the placeholders, then call the bound callable using std::invoke(). That will handle the cases of member function pointers, as well as regular functions. Now, this is how to do it mechanically, if you were doing this as part of a manual refactoring, the lambda can be made even clearer.

#include <functional>
#include <iostream>

void f(int n1, int n2, int n3) {
  std::cout << n1 << ' ' << n2 << ' ' << n3 << '\n';

int main() {
  using namespace std::placeholders;
  int n = 5;
  auto f1 = std::bind(f, 2, n, _1);
  f1(10); // calls f(2, 5, 10);

  auto l1 = [ p1 = 2, p2 = n ](auto _1) { return std::invoke(f, p1, p2, _1); };

  // idiomatically
  auto l1a = [=](auto _1){return f(2, n, _1);};

  auto f2 = std::bind(f, 2, std::cref(n), _1);
  auto l2 = [ p1 = 2, p2 = std::cref(n) ](auto _1) {
      return std::invoke(f, p1, p2, _1);
  // or
  auto l2a = [ p1 = 2, &p2 = n ](auto _1) {
      return std::invoke(f, p1, p2, _1);
  // more idiomatically
  auto l2b = [&](auto _1){f(2, n, _1);};

  n = 7;
  f2(10); // calls f(2, 7, 10);


Which results in:

2 5 10
2 5 10
2 5 10
2 7 10
2 7 10
2 7 10
2 7 10

No: std::bind provides one thing lambda doesn’t

The expression std::bind evaluates flattens std::bind sub-expressions, and passes the same placeholder parameters down. A nested bind is evaluated with the given parameters, and the result is passed in to the outer bind. So you can have a bind that does something like g( _1, f(_1)), and when you call it with a parameter, that same value will be passed to both g and f. The function g will receive f(_1) as its second parameter.

Now, you could rewrite the whole thing as a lambda, but auto potentially makes this a little more difficult. The result of std::bind is an unutterable type. They weren’t supposed to be naked. However, auto means the expression could be broken down into parts, meaning that the translation from a std::bind expression to a lambda expression is potentially not mechanical. Or, the bind could be part of a template, where the subexpression is a template parameter, which is likely working by accident, rather than design.

In any case, std::bind does not treat its arguments uniformly. It treats a bind expression distinctly differently. At the time, it made some sense. But it makes reasoning about bind expressions difficult.

Don’t do this. But it is why formally deprecating std::bind is difficult. They can be replaced, but not purely mechanically.

There isn’t a simple translation that works, unlike converting from std::auto_ptr to std::unique_ptr, or putting a space after a string where it now looks like a conversion. And, std::bind isn’t broken. It’s sub-optimal because of the complicated machinery to support all of the flexibility, where a lambda allows the compiler to do much better. Also, since the type isn’t utterable, it often ends up in a std::function, which erases the type, removing optimization options.

Example of fail code

#include <functional>
#include <iostream>

void f(int n1, int n2, int n3)
    std::cout << n1 << ' ' << n2 << ' ' << n3 << '\n';

int g(int n1) { return n1; }

int main()
    using namespace std::placeholders;

    auto g1 = std::bind(g, _1);
    auto f2 = std::bind(f, _1, g1, 4);
    f2(10); // calls f(10, g(10), 4);

    // auto l2 = [p1 = g1, p2 = 4](auto _1) {std::invoke(f, _1, p1, p2);};
    // l2(10);

    // The bind translation needs to be composed:
    auto l1 = [](auto _1){return g(_1);};
    auto l2 = [p1 = l1, p2 = 4](auto _1){f(_1, p1(_1), p2); };
    // idiomatically
    auto l2a = [](auto _1) { return f(_1, g(_1), 4);};

10 10 4
10 10 4
10 10 4


If someone can figure out a fixit recommendation that could be safely applied, transforming the old bind to a lambda, then std::bind could be deprecated in C++Next, and removed as soon as C++(Next++). But that right now is non-trivial in some cases.


  • Fix incorrect statement about type-erasure in std::bind. I was thinking std::function
  • Add more idiomatic transliterations of the std::bind lambdas

Building and running the examples


    -rm example1
    -rm example2

example1: example1.cpp
    clang++ --std=c++1z example1.cpp -o example1

example2: example2.cpp
    clang++ --std=c++1z example2.cpp -o example2

example3: example3.cpp
    clang++ --std=c++1z example3.cpp -o example3 2>&1

all: example1 example2


rm example1
rm example2
clang++ --std=c++1z example1.cpp -o example1
clang++ --std=c++1z example2.cpp -o example2

Original source

Original document is available on Github