A local CMake workflow with Docker

l#+BLOG: sdowney

An outline of a template that provides an automated workflow driving a CMake project in a docker container.

This post must be read in concert with https://github.com/steve-downey/scratch of which it is part.

Routine process should be automated

Building a project that uses cmake runs through a predictable lifecycle that you should be able to pick up where you left off without remembering, and for which you should be able to state your goal, not the step you are on. make is designed for this, and can drive the processs.

The workflow

  • Update any submodules
  • Create a build area specific to the toolchain
  • Run cmake with that toolchain in the build area
  • Run the build in the build area
  • Run tests, either dependent or independent of rebuild
  • Run the intall
  • (optionally) Clean the build
  • (optionally) Clean the configuration

All of the steps have recognizable filesystem artifacts, need to be run in order, and if there are any failures, the process should stop.

So we want a make system on top of our meta-make build system.

The one thing this system does, that plain cmake doesn’t automate, is making sure that any changes are incorporated into a build before tests are run. CMake splits these, in order to provide the option of running tests without a recompile. I think that is a mistake to automate, but I do provide a target that just runs ctest.

My normal target is test

make test

This will run through all of the steps, but only those, necessary to compile and run tests. The core commands for the build driver are

Compile all out of date source
Install into the INSTALL_PREFIX
Run the currently build test suite
Build and run the test suite
run cmake again in the build area
Clean the build area
Delete the build area

There are several makefile variables controlling the details of what toolchain is used and where things are located. By default the build and install are completely out of the source tree, in ../cmake.bld and ../install respectively, with the build directory further refined by the project name, which defaults to the current directory name, and the toolchain if that is overridden.

The build is a Ninja Multi-config build, supporting RelWithDebInfo, Debug, Tsan, and Asan, with the particular flavor being selectable by the CONFIG variable. See targets.mk for the variables and details, such as they are.

Because other tools, unfortunately, rely on a compile_commands.json this system symlinks one from the build area when reconfiguration is done.

default: compile

    mkdir -p $(_build_path)

$(_build_path)/CMakeCache.txt: | $(_build_path) .gitmodules
    cd $(_build_path) && $(run_cmake)
    -rm compile_commands.json
    ln -s $(_build_path)/compile_commands.json

compile: $(_build_path)/CMakeCache.txt ## Compile the project
    cmake --build $(_build_path)  --config $(CONFIG) --target all -- -k 0

install: $(_build_path)/CMakeCache.txt ## Install the project
    DESTDIR=$(abspath $(DEST)) ninja -C $(_build_path) -k 0  install

ctest: $(_build_path)/CMakeCache.txt ## Run CTest on current build
    cd $(_build_path) && ctest

ctest_ : compile
    cd $(_build_path) && ctest

test: ctest_ ## Rebuild and run tests

cmake: |  $(_build_path)
    cd $(_build_path) && ${run_cmake}

clean: $(_build_path)/CMakeCache.txt ## Clean the build artifacts
    cmake --build $(_build_path)  --config $(CONFIG) --target clean

realclean: ## Delete the build directory
    rm -rf $(_build_path)

To Docker or Not to Docker

The reason for the separate targets.mk file is to simplify running the build purely locally in the host, or in a docker containter. The structure of the build is the same either way. In fact, before I dockerized this template project, there was a single makefile which was mostly the current contents of targets.mk. However, splitting does make the template easier, as project specific targets can naturally be placed in targets.

Tha outer Makefile is responsible for checking if Docker has been requested and for making sure the container is ready. The makefile has a handful of targets of its own, but otherwide defers everything to targets.mk.

set a flag file, USE_DOCKER_FILE, indicating to forward to docker
remove the flag file
rebuild the docker image
Clean volumes and rebuild image
Shell in the docker container

The docker container is build via docker-compose with the configuration docker-compose.yml. It uses the Dockerfile which uses steve-downey/cxx-dev:latest as the base image, and mounts the current source directory as a bind mount and a volume for ../cmake.bld.

I don’t publish steve-downey/cxx-dev:latest and you should build your own BASE. I do provide the recipe for the base image as a subprojct in docker-inf/docker-cxx-dev.

You running unknown things as root scares me.

The image is assumed to provide current version of gcc and clang as c++ or gcc, or clang++ respectively.

The intent of the image is to provide compilation services and operate as an lsp server using clangd. Mine doesn’t provide X, editors, IDEs, etc. The intent isn’t a VM, it’s a controlled compiler installation.

Compiler installations bleed in to each other. Mutliple compilers installed onto the same base system can’t be assumed to behave the same way as a compier installed as the only compiler. The ABI libraries vary, as do the standard libaries. Deployment just makes this all an even worse problem. As a Rule I use for production Red Hat’s DTS compilers and only deploy on later OSs than I’ve built on, with strict controls on OS deployments and statically linking everything I possibly can.

The base image I am using here, steve-downey/cxx-dev, works for me, and is avaiable at https://github.com/steve-downey/docker-cxx-dev as a definition as well.

It is based on current Ubuntu (jammy), installs gcc-12 from the ubuntu repositories, adds the LLVM repos and installs clang-14 from them based on how https://apt.llvm.org/llvm.sh does.

It then installs the current release of cmake from https://apt.kitware.com/ubuntu/ because using out of date build tools is a bad idea all around.

I also configure it to run as USER 1000, because running everything as root is strictly worse, and 1000 is a 99.99 percent solution/

    git submodule update --init --recursive
    touch .update-submodules

.gitmodules: .update-submodules

.PHONY: use-docker
use-docker: ## Create docker switch file so that subsequent `make` commands run inside docker container.
    touch $(USE_DOCKER_FILE)

.PHONY: remove-docker
remove-docker: ## Remove docker switch file so that subsequent `make` commands run locally.

.PHONY: docker-rebuild
docker-rebuild: ## Rebuilds the docker file using the latest base image.
    docker-compose build

.PHONY: docker-clean
docker-clean: ## Clean up the docker volumes and rebuilds the image from scratch.
    docker-compose down -v
    docker-compose build

.PHONY: docker-shell
docker-shell: ## Shell in container
    docker-compose run --rm dev

Work In Progress

I expect I will make many changes to all of this. I’m providing no facilities for you to pick them up. Sorry.

Please consider this as an exhibition of techniques rather than as a solution.

std::execution, Sender/Receiver, and the Continuation Monad

Some thoughts on the std::execution proposal and my understanding of the underlying theory.

What’s proposed

From the paper’s Introduction

This paper proposes a self-contained design for a Standard C++ framework for managing asynchronous execution on generic execution contexts. It is based on the ideas in [P0443R14] and its companion papers.

Which doesn’t tell you much.

It proposes a framework where the principle abstractions are Senders, Receivers, and Schedulers.

A composable unit of work.
Delimits work, handling completion, exceptions, or cancellation.
Arranges for the context work is done in.

The primary user facing concept is the sender. Values and functions can be lifted directly into senders. Senders can be stacked together, with a sender passing its value on to another function. Or stacking exception or cancellation handling the same way.

Receivers handle the three ways a sender can complete, by returning a value, throwing an exception, or being canceled. As described, receivers are most likely to be implemented within particular algorithms that combine senders, such as `then` or `retry`.

Schedulers provide access to execution contexts. Like inline, single thread, a thread pool, a GPU, and so on, would all have schedulers that provide for putting a sender into the context they manage.

There’s a fairly large API surface being proposed. But there’s an underlying theory about this, governing what algorithms need to be there and how the pieces fit together.

Continuation Passing Style and the Continuation Monad

Continuation passing style is a transformation from a normal function and a call stack to a direction to send the result to the “continuation” without returning. This means the functions context can be cleaned up. Delimited continuations are a slight variation, where instead of an unbounded “rest of the program”, the continuation has an end point and a value. It’s essentially a function, and can be handled as such. There is a purely mechanical method for converting all of the lambda calculus transforms into CPS form, and this can be profitable for compilers based on lambda, or related logics, like system F.

The mechanical transformation also means that all the control structures, like loops, gotos, coroutines, exceptions, have CPS equivalents.

CPS is tedious, though. Having to explicitly add a continuation to everything is complicated.

However, there’s also a typeclass, or concept, that allows you to convert regular functions into continuation passing style, automatically. It’s then rather straightforward to involve concerns like where work is being run for something that wraps up the entire work. Even being able to switch back and forth between contexts. That’s the continuation monad.

And unfortunately monads became an organizing principle in programming language theory one or two decades after most CS programs were standardized. So it’s all complicated and involves things we weren’t trained on. Fitting it into C++ has been an ongoing challenge, and until we had generic lambda was neither reasonbly concise nor idiomatic.

See, however, the new monadic interface additions for std::optional for why you want this. Or Ranges, which are solidly based in the ‘list’ or non-deterministic monad.

We have a poor relationship with Theory

There is no satisfactory PL theory for object oriented programming. There’s lots of work, but it mostly ends up describing something that OO programmers don’t think is quite the same as what they do. Even the ones who spend a lot of time doing theory.

Yet OO was, and is, a successful discipline. Working with identity, behavior, and state has produced remarkable results. Temporal calculi, not so much.

For a long while, we as a discipline thought that multi-threading was similar. There was poor theory, but we had hardware, and libraries that let us use that hardware to do concurrent work correctly.

That turned out not to be the case.

Concurrency can’t be just a library, unfortunately. Concurrency models that hardware vendors will commit to won’t promise not to violate causality. That makes producing a programming model programmers can use frighteningly difficult.

Which is why having a sound theory for std::execution is a good thing, even if the theory is unfamiliar.

But as a group, we learned the wrong lessons from the 80s and thought it was a researcher’s job to take the successes of practitioners and put a sound basis to them. Ignoring that it is a feedback loop. In the 60s and 70s, those researchers were also the practitioners. It’s not wrong to get out ahead of theory, but we do need to check back.

p2300 std::execution

Senders, via the Decorator pattern, lift ordinary functions into the continuation passing style. People writing functions only need to be concerned with handling the arguments they are passed, without concern for execution context or continuations. Functions used by senders act like, and are, normal functions.

Senders manage a bundle of channels, representing normal return of a value, throwing an exception, or an error channel to handle cancellation, or other errors not within the bound of ordinary functions. All of these channels can be composed taking the result to another function, or monadically with a function returning a sender, where that function can determine the kind of sender based on the values of the arguments. The channels can be combined or rerouted, connecting one to another, or presenting a variant containing either result, exception, and/or error to the continuation function.

Although senders form a logical graph of units of work, the physical type model is containment, much like expression templates. The result of binding senders together via an algorithm is a sender that contains the bound together senders. There are no nodes or allocations inherent to the model, just function calls.

C++ coroutines fit into this model. C++ coroutines are, from the outside, functions with rules about the interaction patterns with the returned value. Making a coroutine owning type a sender, and a sender co_awaitable, is possible and has been demonstrated.

std::execution takes the Continuation Monad and fits it to C++ control flow, return or exception, and adds cancellation, which incidentally allows a channel for failures from execution contexts. The thread pool can potentially signal failure via the error channel, without aliasing problems from application function code. However, for advanced users, these can be folded back into the normal function arguments and handled by application code. Policy decisions are not burned into the ROM of std::execution, but there are defaults that can be provided by application infrastructure authors.

Those infrastructure authors do not have to be std library vendors. The protocols, rendered as concepts, are available to normal users.

Network TS

Eppur si muove
And yet it moves

I do not believe ASIO’s model is a firm foundation for all async programming. However, it is well proven, and exists. It works.

And …

I have confidence that a networking library can and will be built using p2300. I am less confident that can be done in the timeframe for C++26. I do not believe for a moment we could have one for C++23, even with an existence proof a networking library appearing now. It’s simply too late to review and agree. We’re in the same place as coroutines. We can have the machinery, but without all of the application user facing infrastructure we should have.

I think this was the right choice with coroutines, and I think providing the machinery for general continuation based async in the standard library so that we can build on top of it is the right choice. The authors have committed to making sure all the facilities are available for programmers, in particular the pipe syntax (an issue for ranges) as well as providing bases or adapters for coroutine promises and typed senders. We can experiment and add existing practice as we go.


This is all my personal opinion, based on my own understanding. I’ve been in the meetings, I’ve been in discussions, asked questions. But if I’m wrong about some aspect of the proposal, that’s on me. Certainly not a formal opinion of Bloomberg, where I work. While we do lots of network services, and async programming, this isn’t what our tech looks like at all. Getting from here to there is an open question, but it would be for ASIO, too.

At least it isn’t CORBA.

Source For Blog

Standard Vocabulary for Algorithms

This is feedback after considering A Plan for C++23 Ranges

Disclosure: I voted in favor of this. It does not suggest work on views::maybe [P1255R6]​. I’m fine with that priority.

Vocabulary is not just Types between Components

There’s broad agreement that ‘vocabulary types’ belong in the standard. Domain independent types that can be used between otherwise uncoupled facilities. They each depend on facilities provided by Standard C++ and can thus be developed independently. Each component can rely on the API of a core vocabulary type not changing. This allows facilities to communicate effectively.

A component can have std::string or std::vector<int> in its interface with little concern. The STL parts of the standard library have always relied on concepts for constraining template parameters implicitly, requiring an Iterator of particular category in places, or specializing algorithms based on such. Composition of container types has also been natural. No one is confused by a std::unordered_map<std::string, std::vector<my::person>>.

The new C++20 range facilities allow composition of algorithms, which means that the names of algorithms will, and must, become part of the vocabulary of C++ programmers. Algorithms, and the names for them, are far broader than the linear containers supported by C++ today. It is a particular problem because C++ has named core algorithms badly, as they were not, until recently, first class entities.

Algorithms are now first class entities

Functions are a terribly overloaded term in C++. Overloaded is an overloaded term.

With respect to ranges and views, however, it should be clear that the adaptors, transformers, and other entities we apply to ranges and views to produce new ranges, views, and values, are now first class entities in their own right.

The important thing, for programmers, is that moving up the ladder of abstraction should not be conceptually more difficult than dealing with existing libraries of functions. We have algorithms that deal with generic types rather than functions that deal with concrete types, but keeping them in mind should not be fundamentally more difficult.

Naming is therefore critical.

As are chosing the algorithms that are primitive, that is, the ones that compose into others in useful ways.

I have some concerns about providing fused algorithms that a sufficiently smart compiler could optimize. On the other hand, I doubt how imminent sufficiently smart compilers are.

Why Haskell keeps coming up

Python was the answer to “Why can’t I just write pseudo-code?”

I have text books where the authors seriously discussed writing out your algorithm into pseudo code, and then translating it into actual programming language. Programming languages were too high ceremony to discuss what you wanted to do, while normal English was not structured enough. “Scripting” languages gave precise higher level semantics that were implicit in the pseudo code, or it would have been impossible to translate by hand to other computer languages.

Haskell was designed to do that for algorithm research. The primary goal was to be able to give direct executable form to a research paper. This was early replicable research. The paper provided, on its own, the mechanism to test and reproduce the result. Haskell source code can be embedded in LaTeX and given directly to a Haskell compiler. Every interesting algorithms paper for the last 30 years or so has implicitly included the prelude.

Haskell itself is not a terribly complex language. Much of it is specified as sugar where a form is translated into a more primitive expression, as range for loops are in C++. Its power for researchers is in its inherent laziness, that an expression is not evaluated unless it is required, and its purity, that values are immutable. These properties make it straightforward to reason with and about. The core property of lazy evaluation has forced it to be pure, not allowing side-effects. This has forced it to be exemplary, and evolutionary, in supporting value oriented programming.

These factors are why we end up often looking to Haskell.

Value oriented programming has been an emerging style in modern C++. The new Range facilities embody the style. Being able to crib proveably correct facilities from modern programming language research is a benefit.

There was a long period, in which C++ grew up, that academic research lagged behind the pragmatic craft of programming. This has, in retrospect, not been an unqualified success. Parallelism and concurrency were widely implemented without theoretic underpnnings. We are still trying to put C++’s memory model on firm footing. Many constructs regularly used in the past are now understood to be unsound, in explicable ways. Multi-threading should not be an experimental science.

Object orientation is an area which still does not have satisfying theoretic backing. There are various theories that provide many of the same benefits, allowing abstraction and encapsulation. In OO, however, objects have identity and state, which implies that equational reasoning can fail, as two equivalent object may have distinct identity and then different behavior. Temporal calculus formalisms are still open research.

The manifest success of OO{A,D,P} with limited underlying theory has allowed many practitioners to believe that academic research is irrelevant. To the extent of dismissing rigor. Jokes about the “M-word” are emblematic.

Sidebar: Monads are boring

A monad is a generic type with a few operations, some of which can be expressed in terms of the others, that follow a few fundamental and fairly trivial rules. They allow predictable composition of entities that model monad.

It is not about `then`. Although `then` does tend to fall out.

The interesting thing is that algorithms can be built that take generic types that follow some “Laws”, and they will do, something, correctly. For some kinds of generic type, what happens may be surprising, even interesting. The surprises will, however, not break anything. That’s the point of the semantic guarantees surrounding the operations.

C++ has philosophically adopted semantics surrounding Concepts. A Concept is entitled to demand behavior and laws in addition to the checkable syntax that can be enforced.

The utility for an algorithm writer is that reasoning can be applied at a much higher level. Even beyond the abstraction that this is a Container, or Range, or View, being able reason about the operations for anything that provides a set of well defined operations means that the algorithm is correct, even if the details when applied to a new kind are obscure.

Iterators provide a way for algorithms to work with many containers. Those same, correct, algorithms should be able to work with things that aren’t Iterable, or Containers. It turns out having a named way of converting a Type<Type<Value>> to a Type<Value> opens up many algorithms to broader use. Flattening a vector of vectors into a linear vector is not the only use of a `concat`.

Fold: In Which I Become Cranky About Names

One of the most primitive and yet powerful algorithms is `fold`. Fold takes a binary operation and interposes it between elements in an iterable data type. It replaces the commas {a, b, c, d, …} with `op`, resulting in {a op b op c op d … }

Except that is slightly ambiguous. Is that (a op (b op (c op (d)))) or ((((a) op b) op c) op d)

Right or left evaluation of the terms.

Folding right is strictly more general, at least in the lambda calculus. Left fold can be implemented as a right fold, but not vice-versa.

However, in C++, the normal implementation of folds is not as general as it might be, because C++ is strict in evaluating function arguments. This means that even though a fold whose operation might not need to evaluate one of its arguments in order to return a value, will still do so, because the argument is evaluated before being passed to the operation. This mean, in particular, that we can’t simply undo fold by using a `cons` operation, reconstructing the list. The entire, indeterminate and possibly infinite, list must be evaluated even though all that is necesary is the first element.

Left folding is tail recursive, and so can be converted to a loop. This is why it seems natural in C++.

T fold_left(Range&& r, T init, BinaryOperation op) {
    range_iterator_t<Range> b = begin(r);
    range_sentinel_t<Range> e = end(r);
    for (; b != e; ++b) {
        init = op(std::move(init), *b);
    return init;

Right fold is not generally tail recursive, although for finite sequences in C++, using reverse iterators, foldr is also a loop.

auto fold_right_loop(Range&& r, T init, BinaryOperation op) -> result_of<>{
    range_iterator_t<Range> b = begin(r);
    range_sentinel_t<Range> e = end(r);
    for (; b != e; ++b) {
        init = op(std::move(init), *b);
    return init;

To make the diagrams below pretty, we assume that foldl over a sequence of T takes a single value of type Z and an op of Z -> T -> Z, while foldr over a sequence of T takes an op of type T -> Z -> Z and a value of type Z. For a left fold, the value is given to the op that takes the first element in the sequence, while for a right fold, it is given to the op that takes the last element in the sequence.



result = op(op(op(op(op(op(op(op(z, a), b), c), d), e), f), g), h);


result = op(a, op(b, op(c, op(d, op(e, op(f, op(g, op(h, z))))))));

So, left fold piles up the ops on the left and right fold piles them up on the right. The ‘zero’ or init value is with the last value for a right fold, and the first on a left fold. In converting to iteration, which is important, left fold is in the natural direction, and right fold is reversed, which means it can’t be used that way on some sequences.

So which one should be the fold. It’s an important question because short names are implicitly privileged. Programmers assume that the short name is the best one, and longer names indicate more specialized usage. This isn’t true for C++, but not really on purpose. We did think that `map` was a good choice. The programmers should probably reach for `unordered_map` instead is unfortunate, but at this point unfixable.

Since C++ is strict, it would seem that left fold is the fold you mostly want, so it would be OK to call it fold.

Haskell, as lazy as it is, has issues with folds on ranges with indeterminate bounds. foldl vs foldr vs foldl' is a constant question. Foldr Foldl Foldl’ Memory consumption of either stack or unevaluated expressions will surprise programmers.

But in the literature, fold is right fold. In the pure lambda calculus, right fold is strictly more expressive. Tail recursion is a subset of primitive recursion, and left is tail recursive while right is not. If we call our left fold, fold, we will confuse people.

Further, the forward iteration is somewhat balanced by the fact that the normal addition to a sequence is emplace_back or push_back, which means that left fold will tend to reverse lists if the operation is accumulation into a sequence, if for example the op were [](auto t){v.emplace_back(t);}.

More, we have started to introduce lazy features into the language, with coroutines. Although we won’t get them soon, the C++ standard library should eventually have good lazy algorithms. It would be unfortunate if we have to teach that you can’t use co_fold in place of fold because they are opposite, or that co_fold_left is how you spell the coroutine fold. [Note co_ prefixes are mildly sarcastic.]

C++ has a long history of setting defaults and realizing later that they are a problem.

I would much prefer we not make a choice about which fold is the fold, and just spell them as fold_left and fold_right.

We should all be able to get behind not calling it accumulate, even if we must live with transform.


Some other time I’ll vent about the various fold operations and how they get extended if your type is a monoid. See Chris Di Bella’s work such as A Concept Design for the Numeric Algorithms, although this rant suggests that algorithms that operate on different Concepts should have different names. Overloading on concepts with semantic differences can be confusing.

Blog Infrastructure Work

What feels like actual productive work, but is only work adjacent? Fixing up the blog.

I managed to gradually destroy the whole thing after a decade or more of platform migrations and upgrades. This is a new and from (mostly) scratch install, into which I’ve imported the contents of the old one. Good backups and some hints from the excellent staff at panix.com got me back in shape. I hope.

Notes to my future self:

Start with a clean and empty database after backing up. In this case moved to the new db server at mysql3, which runs a non-antique version of mysql.

Shell permissions are tight, so need to chmod anything I create by hand to a+r or it can’t be read.

Installed into root of ~/public_html so as to stop the weird rewrite rules.

Still have rewrite rules to handle ~sdowney vs sdowney.org.

Using the wp Twenty Nineteen theme because it’s not a narrow column format. Created a child theme, and pasted the custom css into the child/style.css.

CSS is generated from M-x org-html-htmlize-generate-css in order to get my defaults in code blocks and such from org export.

Tweaked the .pre.src block to use better fonts. Consider fixing the trendy fonts used in the main theme.

Latex is on via Jetpack plugin.

Testing out posting to blog from emacs

Using org2blog.

Test LaTex

  • The word LaTeX
    • \LaTeX
  • Inline
    • \sum_{i=0}^n i^2 = \frac{(n^2+n)(2n+1)}{6}
  • Equation
    • \sum_{i=0}^n i^2 = \frac{(n^2+n)(2n+1)}{6}

Src blocks


class windows_1252_convert_view
    : public rng::view_facade<windows_1252_convert_view<Range>, rng::unknown>;


print ('Hello, world!')

Value Oriented Programming: A Manifesto

Object Oriented Programming

The primitive entities are objects which have certain properties

  • Objects have Identity
  • Objects have State
  • Objects have Behavior

Values are not Objects

The primitive entities are values which have none of the properties of Objects

  • Values have no identity
  • Values have no State
  • Values have no Behavior

A positive definition of VOP

  • Values are Equivalent
  • Values are Immutable
  • Values are Side-effect Free

Values are Equivalent

One entity with a value will give the same results as a different entity with the same value. All ‘fives’ are the same. This allows equational reasoning. If `a` and `b` have the same value, `f(a)` and `f(b)` have the same value.

Values are Immutable

‘five’ never becomes ‘six’. Values do not change.

Values are Side-effect free

No operation on a value has an effect on other values. Temporal changes are explicit.

Value Oriented Programming does not deny Objects

  • There are things with identity
  • There are things that vary over time
  • There are things that affect the state of other things

Values are are a better basis

We can, and do, build objects out of values, but objects in the OO sense are impossible to reason about.

Values are.

AnyDuck : A Value Type Erased Type

A Constrained Duck Typed Value Type

For yak shaving reasons, I need a type able to hold any type conforming to a particular interface. I’d like this to act as a (Semi)Regular value type. That is, I’d like it to be copyable, assignable, and so forth, and not be sliced or otherwise mangeled in the process. I want it to be reasonably efficient, not significantly worse than traditional virtual function overhead. I also don’t want to be terribly creative in implementing, using existing std library types.

The overall pattern I’ve settled on for now is to hold the type in a std::any and dispatch to the held type through function pointers referring to lambdas. The lambda allows me to capture the type being stored into the AnyDuck and safely recover it. There’s some boilerplate to write to dispatch to the lambda. Perhaps one day, when we have reflection, that can be automated.

For purposes of this paper, I’ll assume I have an interface Duck that I want to model:

class Duck {
    void quack(int length) const;

Ducks are defined as things that quack, and quack is a const function. I want to be able to put any type that models Duck into an AnyDuck, and pass AnyDuck into any generic function expecting a Duck. I also want to be able to extend AnyDuck to unrelated types, as long as they model Duck. Mallard, for example:

class Mallard {
    void quack(int length) const;

The core of the idea, is to capture the Duck type in a templated constructor where I know the exact type, and create the appropriate lambda:

auto quack_ = [](std::any const& d, int i) {
    return std::any_cast<std::remove_reference_t<Duck>>(&d)->quack(i);

And then wrap the public facing call so that quackfn can be stored as a function pointer

void AnyDuck::quack(int length) const { return quack_(this->duck_, length); }

Here’s the whole thing:

class AnyDuck {
    std::any duck_;
    using quackfn = void (*)(std::any const&, int);
    quackfn quack_;

    AnyDuck(AnyDuck const&) = default;
    AnyDuck(AnyDuck&)       = default;

    template <typename Duck>
    AnyDuck(Duck&& duck)
        : duck_(std::forward<Duck>(duck)),
          quack_([](std::any const& d, int i) {
              return std::any_cast<std::remove_reference_t<Duck>>(&d)->quack(
          }) {}

    void quack(int length) const { return quack_(this->duck_, length); }

The copy constructors are there to be a better match than the templated constructor for copy by value. Codegen is surprisingly good. If the types are all present, the functions are inlined well, except for the overhead of storage into the any. For any unknown AnyDuck, there’s a dispatch via pointer indirection:

void test(AnyDuck a) {

results in something like

0000000000000050 <test(scratch::AnyDuck)>:
  50:   48 8b 47 10             mov    0x10(%rdi),%rax
  54:   be 01 00 00 00          mov    $0x1,%esi
  59:   ff e0                   jmpq   *%rax
  5b:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)

and the any_cast<> from the address of the passed in std::any is noexcept, but does in general have to check if the any has a value. Not as cheap as pure an interface type, but not terribly more expensive.

For the case where the quack is known, codegen is something like

scratch::AnyDuck::AnyDuck<Duck&>(Duck&)::{lambda(std::any const&, int)#1}::__invoke(std::any const&, int): # @scratch::AnyDuck::AnyDuck<Duck&>(Duck&)::{lambda(std::any const&, int)#1}::__invoke(std::any const&, int)
        movl    %esi, %edi
        jmp     bell(int)                # TAILCALL

If the implementation of the underlying quack is not available there’s a little more work

scratch::AnyDuck::AnyDuck<Mallard&>(Mallard&)::{lambda(std::any const&, int)#1}::__invoke(std::any const&, int): # @scratch::AnyDuck::AnyDuck<Mallard&>(Mallard&)::{lambda(std::any const&, int)#1}::__invoke(std::any const&, int)
        movl    $_ZNSt3any17_Manager_internalI7MallardE9_S_manageENS_3_OpEPKS_PNS_4_ArgE, %ecx
        xorl    %eax, %eax
        cmpq    %rcx, (%rdi)
        leaq    8(%rdi), %rcx
        cmoveq  %rcx, %rax
        movq    %rax, %rdi
        jmp     Mallard::quack(int) const    # TAILCALL

But far less than I would have expected. std::any is less awful than I thought.

You can take a look at the code so far here Compiler Explorer Link to see the results

I’ll clean up my scratch project and push it at some point.

Steve Downey’s Birthday (Observed)


Blooms Tavern
208 East 58th Street
Between 2nd & 3rd Ave
New York, NY 10022


6:00 PM ->


This is the sound
I bought a ticket to the world
But now I’ve come back again
Why do I find it hard to write the next line?
Oh, I want the truth to be said
I know this much is true


From: 110 W 14th St

Walk Walk

About 2 min , 335 ft

6:03 PM

14 Street Station

L Canarsie – Rockaway Pkwy

2 min (non-stop)

6:05 PM

14 Street – Union Sq Station

Walk Walk

About 1 min

6:09 PM

14 Street – Union Sq Station

5X Eastchester – Dyre Av

7 min (2 stops)

6:16 PM

59 St-Lexington Av Station

Walk Walk

About 4 min , 0.2 mi

6:20 PM

Blooms Tavern

208 E 58th St, New York, NY 10022

Walk Walk

About 2 min , 0.1 mi

6:03 PM

14 Street Station

M Forest Hills – 71 Av

13 min (6 stops)

6:16 PM

Lexington Av-53 St

Walk Walk

About 6 min , 0.3 mi

6:22 PM

Blooms Tavern

208 E 58th St, New York, NY 10022

From: 731 Lexington Avenue

Head southwest on Beacon Ct toward E 58th St
Restricted usage road
184 ft

Turn left onto E 58th St
Pass by Wells Fargo Bank (on the left in 115 ft)
Destination will be on the right
420 ft

Bloom’s Tavern

208 E 58th St, New York, NY 10022

Building Saar Raz’s clang concepts branch

A Recipe for building Saar Raz’s clang concepts branch

Saar Raz has been working on a Concepts implementation, available at https://github.com/saarraz/clang-concepts It’s not much harder to build it than clang usually is, it’s just a matter of getting things checked out into the right places before configuring the build. Just like LLVM and clang normally.

In order to double check how, I peeked at the shell script used by the compiler explorer image to build the clang-concepts compiler: https://github.com/mattgodbolt/compiler-explorer-image/blob/master/clang/build/build-concepts.sh

The really important bit is getting exactly the right commit from LLVM, 893a41656b527af1b00a1f9e5c8fcecfff62e4b6.

To get a working directory something like: Starting from where you want your working tree and build tree, e.g ~/bld/llvm-concepts

git clone https://github.com/llvm-mirror/llvm.git

pushd llvm
git reset --hard 893a41656b527af1b00a1f9e5c8fcecfff62e4b6

pushd llvm/tools
git clone https://github.com/saarraz/clang-concepts.git clang

pushd llvm/projects
git clone https://github.com/llvm-mirror/libcxx.git
git clone https://github.com/llvm-mirror/libcxxabi.git
# The sanitizers: this is optional but you want them
git clone https://github.com/llvm-mirror/compiler-rt.git

Then to build and install

mkdir build && cd build

cmake \
    -DCMAKE_INSTALL_PREFIX=~/install/llvm-concepts/ \
    -DCMAKE_BUILD_TYPE=Release  \
    -G Ninja  \


ninja check

ninja install

Note that I install into a separate prefix, keeping it isolated from everything else. The compiler can be invoked as ~/install/llvm-concepts/bin/clang++. There’s no particular reason to put the compiler on your PATH.

Should Unicode literals be guaranteed to be well-formed?

TL;DR Betteridge’s law applies: No.

Are you still here?

Unicode Literals

In C++ 20 there are 2 kinds and 6 forms of Unicode literals. Character literals and string literals, in UTF-8, UTF-16, and UTF-32 encodings. Each of them uses a distinct char type to signal in the type system what the encoding is for the data. For plain char literals, the encoding is the execution character set, which is implementation defined. It might be something useful, like UTF-8, or old, like UCS-2, or something distinctly unhelpful, like EBCDIC. Whatever it is, it was fixed at compilation time, and is not affected by things like LOCALE settings. The source character set is the encoding the compiler believes your source code is written in. Including the octets that happen to be between " and ' characters.

char32_t s1[] = U"\u0073tring";
char16_t s2[] = U"\u0073tring";
char8_t s2[] = U"\u0073tring";

char32_t c1 = U'\u0073';
char16_t c1 = u'\u0073';
char8_t c1 = u8'\u0073';

Unicode codepoint U+0073 is ‘s’. So all of the strings are “string”, and all of the characters are ‘s’, however the rendering of each in memory is different. For the string types, each unit in the string is 32, 16 or 8 bits respectively, and for the character literals, each is one code unit, again of 32, 16, or 8 bits.


This is due to Zach Laine, an actual expert, and sorting out what happened took several other experts, so the rest of us have little hope.

char8_t suprise[] = u8"ς";
assert(strlen(surprise) == 5);

This comes down to your editor and compiler having a minor disagreement about encoding. The compiler was under the impression that the encoding was cp-1252, where the editor believed it to be UTF-8. The underlying octets for ς are 0xCF 0x82, each of which is a character in cp-1252. All octets are valid characters in cp-1252. So each was converted to UTF-8, resulting in 0xC3 0x8F and 0xE2 0x80 0x9A.

Of course.

The contents of the string are still in the source character set, not in any of the Unicode forms. Unless that happens to be the source character set.

But at least it’s well formed.


The \u escapes, which identify Unicode codepoints by number, will produce well formed Unicode encoding in strings, and in character literals if they fit. That is, in a char32_t, you can put any code point. In a char16_t you can put any character from the basic multilingual plane. In a char8_t you can put 7 bit ASCII characters.

Or you can use hex or octal escapes, which will be widened to code unit size and the value placed into the resultant character or string. And no current compiler checks if that makes sense, although they will warn you if, for example you try to write something like:

char16_t w4 = u'\x0001f9ff'; // NAZAR AMULET - Unicode 11.0 (June 2018)
char16_t sw4[] = u"\x0001f9ff";

where you’re trying to put a value that won’t fit into a code unit into the result.

warning: hex escape sequence out of range

The \xnn and \nnn hex and octal escapes are currently a hole that lets you construct ill-formed string literals. For example

char8_t oops = u8"\xfe\xed";

But there are lots of ways of constructing ill-formed arrays of char8_t or char16_t. Just spell them out as arrays:

char8_t ill = {0xfe, 0xed};

The type system doesn’t provide that char8_t means well formed UTF-8. All it does is tell you that the intended encoding is UTF-8. Which is a huge improvement over char.

But it does not provide any guarantee to an API taking a char8_t*.

\0 is an octal escape sequence

You can spell it \u0000. But that’s weird, since we spell it as \0 everywhere, and that it’s an octal escape is a C++ trivia question.

I want to be told if I’m forming ill-formed Unicode.

Then don’t use escape sequences. If you use either well encoded source code encodings or universal character names you will get well formed Unicode.

The primary reason for wanting ill-formed Unicode is for testing. It’s a convenience. And there are straightforward workarounds.

But disallowing hex and octal escapes in Unicode strings makes the language less regular while preventing an error that you had to go out of your way to create, and does not actually produce more runtime safety.

Litmus Tests for Multithreaded Behavior

Litmus Tests for Multithreaded Behavior

Or How Processors Don’t Do What You Think

Modern multicore processors are entirely weirder than almost anyone thinks possible. They are somewhat weirder than chip makers were willing to admit until fairly recently. They are sufficiently weird enough that almost all multi-threaded programs, and many lock-free algorithms, had bugs because the hardware just does not work the way anyone would reasonably expect. And, in addition, optimizing compilers did not actually know how to not break your code. [boehm2005threads]

I’m in the (slow) process of writing some test cases for multi-threaded code. I eventually want to have some confidence that some code is executed once, and only once, in an efficient manner. It’s the ‘efficient’ part that worries me, because for efficient, it also has a tendency to be clever, and I’m learning that clever MT code is often subtly broken. [bacon2000double] So if smarter people than me make mistakes about MT code, I need tests to compensate. And ones that will cover occasional allowed but unexpected behavior. Which means the test framework should be able to detect them.

Also, fortunately, the RPi is a computer that exhibits some odd behavior, as it is an ARM system. X86 has a much stronger model. However, even the x86 model is allowed to perform in odd ways.

Starting in 2007, Intel has started publishing short snippets of assembly and documenting what are the allowed and disallowed results running them in parallel. [IWPAug2007] These snippets have come to be called litmus tests, and are used to confirm the behavior of hardware, and confirm models of the hardware behavior. A particularly important model for C++ programmers is the x86 Total Store Order model [owens2009better] which provides a way of mapping the C++11 memory model to X86 hardware. X86 hardware provides a strongly consistent memory model. Power and ARM provide fewer guarantees, and mapping the C++ memory model to these architectures is more challenging. [maranget2012tutorial]

Message Passing

The tests outlined in the Intel paper are short pieces of assembly to be exercised on different processors, with guarantees about behavior that will not happen. The first one essentially promises that message passing will work, and is now known as the MP test.

Processor 0 Processor 1
mov [ _x], 1 // M1 mov r1, [ _y] // M3
mov [ _y], 1 // M2 mov r2, [ _x] // M4

Initially x = y = 0

r1 = 1 and r2 = 0 is not allowed

That says that we can’t read the writes of x and y out of order, which ensures that if we’re waiting to see the write to the flag y, we can be guaranteed to see the payload in x. If we change it slightly to wait for the write to _y to be visible, we can pass a message from one thread to anothher in _x. This is also known as Dekards Algorithm.

ARM and Power do not provide that guarantee without additional synchronization instructions.

In C++ that looks something like the following, using the test framework I’m writing.

MP::MP() : x_(0), y_(0) {}
void MP::t1() {
    x_.store(1, std::memory_order_relaxed);
    y_.store(1, std::memory_order_relaxed);
void MP::t2(Result& read) {
    while (!y_.load(std::memory_order_relaxed)){}
    std::get<0>(read) = x_.load(std::memory_order_relaxed);

Here, x_ and y_ are atomic<int>s, and we’re using the lowest possible atomic guarantee in C++, relaxed. Relaxed guarantees that the operation happens atomically. but there are no synchronization properties with anything else. This usally corresponds to the basic int type. Unless you’re using a really insane processor that might let an int be partially written and observable. Like you might get the top half of the int, or the middle byte. The commercial processors that allowed this have pretty much died out.

The test spins on seeing the load of y to be complete. It loads the value of x_ into a result tuple. The tuple is used as the key to a map which accumulates how many times each result has been seen.
Running the above on my x86 laptop:

[ RUN      ] ExperimentTest.MPTest1
(1) : 2000000

on my Raspberry Pi 3:

[ RUN      ] ExperimentTest.MPTest1
(0) : 483
(1) : 1999517

Using objdump to check the generated assembly

00000088 <litmus::MP::MP()>:
  88:   mov     r2, #0
  8c:   str     r2, [r0]
  90:   str     r2, [r0, #64]   ; 0x40
  94:   bx      lr

00000098 <litmus::MP::t1()>:
  98:   mov     r3, #1
  9c:   str     r3, [r0]
  a0:   str     r3, [r0, #64]   ; 0x40
  a4:   bx      lr

000000a8 <litmus::MP::t2(std::tuple<int>&)>:
  a8:   ldr     r3, [r0, #64]   ; 0x40
  ac:   cmp     r3, #0
  b0:   beq     a8 <litmus::MP::t2(std::tuple<int>&)>
  b4:   ldr     r3, [r0]
  b8:   str     r3, [r1]
  bc:   bx      lr

So, out of the 2,000,000 times that I ran the experiment, there were 483 times that reading x_ resulted in 0, even though y_ was 1. ARM has a weaker memory model than x86. This has some advantages in processor implementation. It has distinct disadvantages in how our brains work. X86 tries to preserve the model that there is shared memory that everyone sees and works with. That’s not strictly true, even for X86, but ARM and Power don’t even come close. On the other hand, it’s also why it’s easier to add more cores to Power and ARM chips and systems. I routinely work with Power systems with 512 physical cores.

Store Buffering

Store buffering is the odd case that is allowed in the Intel memory model. When assigning locations in two threads, and then reading them on opposite threads, both threads are allowed to read the older state. The stores get buffered.
From the Intel White Paper:

Processor 0 Processor 1
mov [ _x], 1 // M1 mov [ _y], 1 // M3
mov r1, [ _y] // M2 mov r2, [ _x] // M4

Initially x = y = 0

r1 = 0 and r2 ==0 is allowed

Note, in particular, there is no interleaving of M1 – 4 that could result in r1 and r2 being 0. Not without interupting an instruction in the middle. But the instructions themselves are atomic, and indivisible. If they were actually operating on shared memory, this would not be possible. However, it does happen.

SB::SB() : x_(0), y_(0) {}
void SB::t1(Result& read) {
    y_.store(1, std::memory_order_relaxed);
    std::get<0>(read) = x_.load(std::memory_order_relaxed);
void SB::t2(Result& read) {
    x_.store(1, std::memory_order_relaxed);
    std::get<1>(read) = y_.load(std::memory_order_relaxed);

That generates the x86 code

00000000000000f0 <litmus::SB::t1(std::__1::tuple<int, int>&)>:
  f0:   mov    DWORD PTR [rdi+0x40],0x1
  f7:   mov    eax,DWORD PTR [rdi]
  f9:   mov    DWORD PTR [rsi],eax
  fb:   ret

0000000000000100 <litmus::SB::t2(std::__1::tuple<int, int>&)>:
 100:   mov    DWORD PTR [rdi],0x1
 106:   mov    eax,DWORD PTR [rdi+0x40]
 109:   mov    DWORD PTR [rsi+0x4],eax
 10c:   ret

And on my x86 machine:

[ RUN      ] ExperimentTest.SBTest1
(0, 0) : 559
(0, 1) : 999858
(1, 0) : 999576
(1, 1) : 7

So 559 times neither core saw the other core’s store.

Load Buffering

Load Buffering is the dual of store buffering. Loads into registers might be delayed, or buffered, and actually performed after following instructions. It’s not allowed in the Intel architecture.

From the Intel White Paper

Processor 0 Processor 1
mov r1, [ _x] // M1 mov r2, [ _y] // M3
mov [ _y], 1 // M2 mov [ _x], 1 // M4

Initially x = y = 0

r1 = 1 and r2 = 1 is not allowed

LB::LB() : x_(0), y_(0) {}
void LB::t1(Result& read) {
    std::get<0>(read) = x_.load(std::memory_order_relaxed);
    y_.store(1, std::memory_order_relaxed);
void LB::t2(Result& read) {
    std::get<1>(read) = y_.load(std::memory_order_relaxed);
    x_.store(1, std::memory_order_relaxed);

This is the x86 asm code

00000000000000c0 <litmus::LB::t1(std::__1::tuple<int, int>&)>:
  c0:   mov    eax,DWORD PTR [rdi]
  c2:   mov    DWORD PTR [rsi],eax
  c4:   mov    DWORD PTR [rdi+0x40],0x1
  cb:   ret
  cc:   nop    DWORD PTR [rax+0x0]

00000000000000d0 <litmus::LB::t2(std::__1::tuple<int, int>&)>:
  d0:   mov    eax,DWORD PTR [rdi+0x40]
  d3:   mov    DWORD PTR [rsi+0x4],eax
  d6:   mov    DWORD PTR [rdi],0x1
  dc:   ret
  dd:   nop    DWORD PTR [rax]

And the ARM code, at -O1

000000d0 <litmus::LB::t1(std::tuple<int, int>&)>:
  d0:   ldr     r3, [r0]
  d4:   str     r3, [r1, #4]
  d8:   mov     r3, #1
  dc:   str     r3, [r0, #64]   ; 0x40
  e0:   bx      lr

000000e4 <litmus::LB::t2(std::tuple<int, int>&)>:
  e4:   ldr     r3, [r0, #64]   ; 0x40
  e8:   str     r3, [r1]
  ec:   mov     r3, #1
  f0:   str     r3, [r0]
  f4:   bx      lr

ARM generally allows it, but per [maranget2012tutorial] it’s very sensitive, and dependencies will make it not appear. In my tests, I did not observe an instance of a buffering, but it may be due to the first store the compiler introduces, in order to actually get the data into the tuple. That it’s documented as possible is still exceedingly strange.

Independent Reads of Independent Writes

IRIW is a generalization of store buffering, where two reader threads each read different apparent orderings of writes from two distinct writer threads.

T1 T2 T3 T4
X = 1 Y = 1 R1 = X R3 = y
R2 = Y R4 = X

Initially X=Y=0
Allowed in ARM, not in x86 r1=1, r2=0, r3=1, r4=0 [maranget2012tutorial,owens2009better]

This is not observed in x86 processors, but is in some ARM and POWER, more often in POWER. X86 hardware has a consistent view of memory where other hardware can see memory writes in different orders on different threads. On my rPi, I didn’t observe any incidents of X and Y being read out of order, over 40 million runs.

IRIW::IRIW() : x_(0), y_(0) {}
void IRIW::t1() {
    x_.store(1, std::memory_order_relaxed);

void IRIW::t2() {
    y_.store(1, std::memory_order_relaxed);

void IRIW::t3(Result& read) {
    std::get<0>(read) = x_.load(std::memory_order_relaxed);
    std::get<1>(read) = y_.load(std::memory_order_relaxed);

void IRIW::t4(Result& read) {
    std::get<2>(read) = y_.load(std::memory_order_relaxed);
    std::get<3>(read) = x_.load(std::memory_order_relaxed);


The allowed behavior of modern processors is very different than our mental model of a Von Neumann architecture computer. Each core can have a different view of memory, and without additional controls, writes and reads can break the illusion of a single unified memory. The C++ memory model gives the controls and guarantees about what happens when different threads read and write memory, and here I’ve deliberately used the weakest version available, relaxed, in order to allow the processors the wideest latitude in behavior. Relaxed is, for processors that have it, often just an unconstrained int, which means that you will get odd behavior if you are running shared state multithreaded code that uses plain native types. It is a particular problem with code that was originally written and tested on a x86 architecture because the native model is fairly strong. This frequently causes problems when porting to a mobile platform, where ARM is a very popular hardware choice.

Org-mode source and git repo

Exported from an org-mode doc. All of the source is available on github at SpinGate



  • [boehm2005threads] Boehm, Threads cannot be implemented as a library, 261-268, in in: ACM Sigplan Notices, edited by (2005)
  • [bacon2000double] @miscbacon2000double,
    title=The “double-checked locking is broken” declaration,
    author=Bacon, David and Bloch, Joshua and Bogda, Jeff and Click, Cliff and Haahr, Paul and Lea, Doug and May, Tom and Maessen, Jan-Willem and Mitchell, JD and Nilsen, Kelvin and others,
  • [IWPAug2007] @miscIWPAug2007,
    howpublished =
    note = Accessed: 2017-04-30,
    title = Intel® 64 Architecture Memory Ordering
    White Paper,
  • [owens2009better] Owens, Sarkar & Sewell, A better x86 memory model: x86-TSO, 391-407, in in: International Conference on Theorem Proving in Higher Order Logics, edited by (2009)
  • [maranget2012tutorial] Maranget, Sarkar & Sewell, A tutorial introduction to the ARM and POWER relaxed memory models, Draft available from http://www. cl. cam. ac. uk/\~ pes20/ppc-supplemental/test7. pdf, (2012).

Date: 2017-04-30 Sun 00:00

Author: Steve Downey

Created: 2018-06-17 Sun 14:07