AnyDuck : A Value Type Erased Type

A Constrained Duck Typed Value Type

For yak shaving reasons, I need a type able to hold any type conforming to a particular interface. I'd like this to act as a (Semi)Regular value type. That is, I'd like it to be copyable, assignable, and so forth, and not be sliced or otherwise mangled in the process. I want it to be reasonably efficient, not significantly worse than traditional virtual function overhead. I also don't want to be terribly creative in implementing it, preferring existing std library types.

The overall pattern I've settled on for now is to hold the type in a std::any and dispatch to the held type through function pointers referring to lambdas. The lambda allows me to capture the type being stored into the AnyDuck and safely recover it. There's some boilerplate to write to dispatch to the lambda. Perhaps one day, when we have reflection, that can be automated.

For the purposes of this post, I'll assume I have an interface Duck that I want to model:
class Duck {
  public:
    void quack(int length) const;
};
Ducks are defined as things that quack, and quack is a const function. I want to be able to put any type that models Duck into an AnyDuck, and pass AnyDuck into any generic function expecting a Duck. I also want to be able to extend AnyDuck to unrelated types, as long as they model Duck. Mallard, for example:
class Mallard {
  public:
    void quack(int length) const;
};

The core of the idea is to capture the Duck type in a templated constructor, where I know the exact type, and create the appropriate lambda:
auto quack_ = [](std::any const& d, int i) {
    return std::any_cast<std::remove_reference_t<Duck>>(&d)->quack(i);
};
And then wrap the public-facing call, so that the lambda can be stored as a quackfn function pointer:
void AnyDuck::quack(int length) const { return quack_(this->duck_, length); }
Here's the whole thing:
class AnyDuck {
    std::any duck_;
    using quackfn = void (*)(std::any const&, int);
    quackfn quack_;

  public:
    AnyDuck(AnyDuck const&) = default;
    AnyDuck(AnyDuck&)       = default;

    template <typename Duck>
    AnyDuck(Duck&& duck)
        : duck_(std::forward<Duck>(duck)),
          quack_([](std::any const& d, int i) {
              return std::any_cast<std::remove_reference_t<Duck>>(&d)->quack(
                  i);
          }) {}

    void quack(int length) const { return quack_(this->duck_, length); }
};
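Before looking at codegen, a usage sketch. This assumes C++17 for std::any; the quack bodies and the `last` members are stand-ins I've added for illustration, since the originals aren't shown:

```cpp
#include <any>
#include <type_traits>
#include <utility>

class AnyDuck {
    std::any duck_;
    using quackfn = void (*)(std::any const&, int);
    quackfn quack_;

  public:
    AnyDuck(AnyDuck const&) = default;
    AnyDuck(AnyDuck&)       = default;

    template <typename Duck>
    AnyDuck(Duck&& duck)
        : duck_(std::forward<Duck>(duck)),
          quack_([](std::any const& d, int i) {
              return std::any_cast<std::remove_reference_t<Duck>>(&d)->quack(i);
          }) {}

    void quack(int length) const { return quack_(this->duck_, length); }
};

// Two unrelated types that both model Duck. The bodies are stand-ins
// with observable side effects, so the dispatch can be checked.
struct Duck {
    static inline int last = 0;
    void quack(int length) const { last = length; }
};
struct Mallard {
    static inline int last = 0;
    void quack(int length) const { last = 2 * length; }
};
```

Both a Duck and a Mallard can be stored in an AnyDuck, and copying an AnyDuck copies the held value, with no slicing.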
The copy constructors are there to be a better match than the templated constructor when copying an AnyDuck, so a copy doesn't wrap an AnyDuck inside another AnyDuck. Codegen is surprisingly good. If all the definitions are visible, the calls are inlined well, except for the overhead of storage into the any. For an unknown AnyDuck, there's a dispatch via pointer indirection:
void test(AnyDuck a) {
    a.quack(1);
}
results in something like
0000000000000050 <test(scratch::AnyDuck)>:
  50:   48 8b 47 10             mov    0x10(%rdi),%rax
  54:   be 01 00 00 00          mov    $0x1,%esi
  59:   ff e0                   jmpq   *%rax
  5b:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)

and the any_cast<> from the address of the passed-in std::any is noexcept, but does in general have to check whether the any holds a value. Not as cheap as a pure interface type, but not terribly more expensive. For the case where the quack implementation is known, codegen is something like
scratch::AnyDuck::AnyDuck<Duck&>(Duck&)::{lambda(std::any const&, int)#1}::__invoke(std::any const&, int): # @scratch::AnyDuck::AnyDuck<Duck&>(Duck&)::{lambda(std::any const&, int)#1}::__invoke(std::any const&, int)
        movl    %esi, %edi
        jmp     bell(int)                # TAILCALL
If the implementation of the underlying quack is not available there's a little more work
scratch::AnyDuck::AnyDuck<Mallard&>(Mallard&)::{lambda(std::any const&, int)#1}::__invoke(std::any const&, int): # @scratch::AnyDuck::AnyDuck<Mallard&>(Mallard&)::{lambda(std::any const&, int)#1}::__invoke(std::any const&, int)
        movl    $_ZNSt3any17_Manager_internalI7MallardE9_S_manageENS_3_OpEPKS_PNS_4_ArgE, %ecx
        xorl    %eax, %eax
        cmpq    %rcx, (%rdi)
        leaq    8(%rdi), %rcx
        cmoveq  %rcx, %rax
        movq    %rax, %rdi
        jmp     Mallard::quack(int) const    # TAILCALL
But far less than I would have expected. std::any is less awful than I thought. You can take a look at the code so far at the Compiler Explorer link to see the results. I'll clean up my scratch project and push it at some point.

Steve Downey's Birthday (Observed)

LOCATION:

Blooms Tavern
208 East 58th Street
Between 2nd & 3rd Ave
New York, NY 10022

TIME:

6:00 PM ->

Description:

This is the sound
I bought a ticket to the world
But now I've come back again
Why do I find it hard to write the next line?
Oh, I want the truth to be said
I know this much is true

DIRECTIONS:

From: 110 W 14th St

Walk

About 2 min , 335 ft

6:03 PM

14 Street Station

L Canarsie - Rockaway Pkwy

2 min (non-stop)

6:05 PM

14 Street - Union Sq Station

Walk

About 1 min

6:09 PM

14 Street - Union Sq Station

5X Eastchester - Dyre Av

7 min (2 stops)

6:16 PM

59 St-Lexington Av Station

Walk

About 4 min , 0.2 mi

6:20 PM

Blooms Tavern

208 E 58th St, New York, NY 10022

Walk

About 2 min , 0.1 mi

6:03 PM

14 Street Station

M Forest Hills - 71 Av

13 min (6 stops)

6:16 PM

Lexington Av-53 St

Walk

About 6 min , 0.3 mi

6:22 PM

Blooms Tavern

208 E 58th St, New York, NY 10022

From: 731 Lexington Avenue

Head southwest on Beacon Ct toward E 58th St
Restricted usage road
184 ft

Turn left onto E 58th St
Pass by Wells Fargo Bank (on the left in 115 ft)
Destination will be on the right
420 ft

Bloom's Tavern

208 E 58th St, New York, NY 10022

Building Saar Raz's clang concepts branch

A Recipe for building Saar Raz's clang concepts branch

Saar Raz has been working on a Concepts implementation, available at https://github.com/saarraz/clang-concepts It's not much harder to build than clang usually is; it's just a matter of getting things checked out into the right places before configuring the build, just like LLVM and clang normally. In order to double-check how, I peeked at the shell script used by the Compiler Explorer image to build the clang-concepts compiler: https://github.com/mattgodbolt/compiler-explorer-image/blob/master/clang/build/build-concepts.sh The really important bit is getting exactly the right commit of LLVM, 893a41656b527af1b00a1f9e5c8fcecfff62e4b6. Starting from where you want your working tree and build tree, e.g. ~/bld/llvm-concepts, check everything out:
git clone https://github.com/llvm-mirror/llvm.git

pushd llvm
git reset --hard 893a41656b527af1b00a1f9e5c8fcecfff62e4b6
popd

pushd llvm/tools
git clone https://github.com/saarraz/clang-concepts.git clang
popd

pushd llvm/projects
git clone https://github.com/llvm-mirror/libcxx.git
git clone https://github.com/llvm-mirror/libcxxabi.git
# The sanitizers: this is optional but you want them
git clone https://github.com/llvm-mirror/compiler-rt.git
popd

Then to build and install
mkdir build && cd build

cmake \
    -DCMAKE_INSTALL_PREFIX=~/install/llvm-concepts/ \
    -DLLVM_ENABLE_LIBCXX=yes  \
    -DCMAKE_BUILD_TYPE=Release  \
    -DLLVM_ENABLE_ASSERTIONS=yes \
    -DLLVM_PARALLEL_LINK_JOBS=1 \
    -G Ninja  \
    ../llvm/

ninja

ninja check

ninja install
Note that I install into a separate prefix, keeping it isolated from everything else. The compiler can be invoked as ~/install/llvm-concepts/bin/clang++. There's no particular reason to put the compiler on your PATH.

Should Unicode literals be guaranteed to be well-formed?

TL;DR Betteridge's law applies: No. Are you still here?

Unicode Literals

In C++20 there are 2 kinds and 6 forms of Unicode literals: character literals and string literals, in UTF-8, UTF-16, and UTF-32 encodings. Each of them uses a distinct char type to signal in the type system what the encoding of the data is. For plain char literals, the encoding is the execution character set, which is implementation defined. It might be something useful, like UTF-8, or old, like UCS-2, or something distinctly unhelpful, like EBCDIC. Whatever it is, it was fixed at compile time, and is not affected by things like LOCALE settings. The source character set is the encoding the compiler believes your source code is written in, including the octets that happen to be between " and ' characters.
char32_t s1[] = U"\u0073tring";
char16_t s2[] = u"\u0073tring";
char8_t  s3[] = u8"\u0073tring";

char32_t c1 = U'\u0073';
char16_t c2 = u'\u0073';
char8_t  c3 = u8'\u0073';
Unicode codepoint U+0073 is 's', so all of the strings are "string" and all of the characters are 's'; however, the rendering of each in memory is different. For the string types, each unit in the string is 32, 16, or 8 bits respectively, and each character literal is one code unit, again of 32, 16, or 8 bits.
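The difference in layout is easy to check with sizeof (char8_t omitted here, since it requires C++20):

```cpp
char32_t s1[] = U"\u0073tring";
char16_t s2[] = u"\u0073tring";

// Seven code units each: 's', 't', 'r', 'i', 'n', 'g', and the
// terminating null; the byte counts differ with the code unit width.
static_assert(sizeof(s1) == 7 * sizeof(char32_t), "");
static_assert(sizeof(s2) == 7 * sizeof(char16_t), "");
```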

Surprises

This example is due to Zach Laine, an actual expert, and sorting out what happened took several other experts, so the rest of us have little hope.
char8_t surprise[] = u8"ς";
assert(strlen(surprise) == 5);
This comes down to your editor and compiler having a minor disagreement about encoding. The compiler was under the impression that the encoding was cp-1252, while the editor believed it to be UTF-8. The underlying octets for ς are 0xCF 0x82, each of which is a character in cp-1252 (all octets are valid characters in cp-1252). So each was converted to UTF-8, resulting in 0xC3 0x8F and 0xE2 0x80 0x9A. Of course. The contents of the string are still in the source character set, not in any of the Unicode forms, unless the source character set happens to be one of them. But at least it's well formed.

Escapes

The \u escapes, which identify Unicode codepoints by number, will produce well-formed Unicode encoding in strings, and in character literals if they fit. That is, in a char32_t you can put any code point; in a char16_t, any character from the basic multilingual plane; in a char8_t, 7-bit ASCII characters. Or you can use hex or octal escapes, which will be widened to code unit size and the value placed into the resultant character or string. And no current compiler checks whether that makes sense, although it will warn you if, for example, you try to write something like:
char16_t w4 = u'\x0001f9ff'; // NAZAR AMULET - Unicode 11.0 (June 2018)
char16_t sw4[] = u"\x0001f9ff";
where you're trying to put a value that won't fit into a code unit into the result.
warning: hex escape sequence out of range
The \xnn and \nnn hex and octal escapes are currently a hole that lets you construct ill-formed string literals. For example
char8_t oops = u8"\xfe\xed";

But there are lots of ways of constructing ill-formed arrays of char8_t or char16_t. Just spell them out as arrays:
char8_t ill[] = {0xfe, 0xed};
The type system doesn't promise that char8_t means well-formed UTF-8. All it does is tell you that the intended encoding is UTF-8, which is a huge improvement over char. But it does not provide any guarantee to an API taking a char8_t*.
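So an API that needs the guarantee has to check at run time. A minimal sketch of such a check, following the Unicode table of well-formed byte sequences (written against unsigned char so it compiles without C++20's char8_t; illustrative, not a hardened validator):

```cpp
#include <cstddef>

// Returns true if [s, s + n) is well-formed UTF-8: correct sequence
// lengths and continuation bytes, no overlongs, no surrogates, and
// nothing above U+10FFFF.
bool is_well_formed_utf8(unsigned char const* s, std::size_t n) {
    for (std::size_t i = 0; i < n;) {
        unsigned char b = s[i];
        std::size_t len;
        if (b < 0x80) { i += 1; continue; }          // ASCII
        else if (b >= 0xC2 && b <= 0xDF) len = 2;
        else if (b >= 0xE0 && b <= 0xEF) len = 3;
        else if (b >= 0xF0 && b <= 0xF4) len = 4;
        else return false;  // 0x80-0xC1 and 0xF5-0xFF never start a sequence
        if (i + len > n) return false;               // truncated sequence
        // The second byte's range tightens for E0/ED/F0/F4 to exclude
        // overlong encodings, surrogates, and values past U+10FFFF.
        unsigned char lo = 0x80, hi = 0xBF;
        if      (b == 0xE0) lo = 0xA0;
        else if (b == 0xED) hi = 0x9F;
        else if (b == 0xF0) lo = 0x90;
        else if (b == 0xF4) hi = 0x8F;
        if (s[i + 1] < lo || s[i + 1] > hi) return false;
        for (std::size_t j = 2; j < len; ++j)
            if (s[i + j] < 0x80 || s[i + j] > 0xBF) return false;
        i += len;
    }
    return true;
}
```

The u8"\xfe\xed" literal above fails this check, as do overlong encodings, even though the type system was perfectly happy to build them.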

\0 is an octal escape sequence

You can spell it \u0000. But that's weird, since we spell it as \0 everywhere, and the fact that it's an octal escape is a C++ trivia question.

I want to be told if I'm forming ill-formed Unicode.

Then don't use escape sequences. If you use either a well-encoded source encoding or universal character names, you will get well-formed Unicode. The primary reason for wanting ill-formed Unicode is testing. It's a convenience, and there are straightforward workarounds. But disallowing hex and octal escapes in Unicode strings makes the language less regular, prevents only an error that you had to go out of your way to create, and does not actually produce more runtime safety.

Litmus Tests for Multithreaded Behavior

Litmus Tests for Multithreaded Behavior Or How Processors Don't Do What You Think

Modern multicore processors are entirely weirder than almost anyone thinks possible. They are somewhat weirder than chip makers were willing to admit until fairly recently. They are sufficiently weird that almost all multi-threaded programs, and many lock-free algorithms, had bugs because the hardware just does not work the way anyone would reasonably expect. And, in addition, optimizing compilers did not actually know how not to break your code. [boehm2005threads]

I'm in the (slow) process of writing some test cases for multi-threaded code. I eventually want to have some confidence that some code is executed once, and only once, in an efficient manner. It's the 'efficient' part that worries me, because efficient code also tends to be clever, and I'm learning that clever MT code is often subtly broken. [bacon2000double] So if smarter people than me make mistakes in MT code, I need tests to compensate, and ones that will cover occasional allowed but unexpected behavior. Which means the test framework should be able to detect them. Fortunately, the RPi is a computer that exhibits some odd behavior, as it is an ARM system. X86 has a much stronger model. However, even the x86 model is allowed to perform in odd ways.

Starting in 2007, Intel began publishing short snippets of assembly, documenting the allowed and disallowed results of running them in parallel. [IWPAug2007] These snippets have come to be called litmus tests, and are used to confirm the behavior of hardware, and to confirm models of the hardware behavior. A particularly important model for C++ programmers is the x86 Total Store Order model [owens2009better], which provides a way of mapping the C++11 memory model to x86 hardware. X86 hardware provides a strongly consistent memory model. Power and ARM provide fewer guarantees, and mapping the C++ memory model to these architectures is more challenging. [maranget2012tutorial]

Message Passing

The tests outlined in the Intel paper are short pieces of assembly to be exercised on different processors, with guarantees about behavior that will not happen. The first one essentially promises that message passing will work, and is now known as the MP test.
Processor 0            Processor 1
mov [_x], 1 // M1      mov r1, [_y] // M3
mov [_y], 1 // M2      mov r2, [_x] // M4

Initially x = y = 0. The result r1 = 1 and r2 = 0 is not allowed.

That says that we can't read the writes of x and y out of order, which ensures that if we're waiting to see the write to the flag y, we are guaranteed to see the payload in x. If we change it slightly to wait for the write to _y to be visible, we can pass a message from one thread to another in _x. This is also known as Dekker's algorithm. ARM and Power do not provide that guarantee without additional synchronization instructions. In C++ it looks something like the following, using the test framework I'm writing.
MP::MP() : x_(0), y_(0) {}
void MP::t1() {
    x_.store(1, std::memory_order_relaxed);
    y_.store(1, std::memory_order_relaxed);
}
void MP::t2(Result& read) {
    while (!y_.load(std::memory_order_relaxed)){}
    std::get<0>(read) = x_.load(std::memory_order_relaxed);
}

Here, x_ and y_ are atomic<int>s, and we're using the lowest possible atomic guarantee in C++, relaxed. Relaxed guarantees that the operation happens atomically, but there are no synchronization properties with anything else. This usually corresponds to the basic int type, unless you're using a really insane processor that might let an int be partially written and observed, so that you might see the top half of the int, or the middle byte. The commercial processors that allowed this have pretty much died out. The test spins until it sees the store to y_ complete, then loads the value of x_ into a result tuple. The tuple is used as the key to a map that accumulates how many times each result has been seen. Running the above on my x86 laptop:
[ RUN      ] ExperimentTest.MPTest1
(1) : 2000000

on my Raspberry Pi 3:
[ RUN      ] ExperimentTest.MPTest1
(0) : 483
(1) : 1999517

Using objdump to check the generated assembly
00000088 <litmus::MP::MP()>:
  88:   mov     r2, #0
  8c:   str     r2, [r0]
  90:   str     r2, [r0, #64]   ; 0x40
  94:   bx      lr

00000098 <litmus::MP::t1()>:
  98:   mov     r3, #1
  9c:   str     r3, [r0]
  a0:   str     r3, [r0, #64]   ; 0x40
  a4:   bx      lr

000000a8 <litmus::MP::t2(std::tuple<int>&)>:
  a8:   ldr     r3, [r0, #64]   ; 0x40
  ac:   cmp     r3, #0
  b0:   beq     a8 <litmus::MP::t2(std::tuple<int>&)>
  b4:   ldr     r3, [r0]
  b8:   str     r3, [r1]
  bc:   bx      lr
So, out of the 2,000,000 times that I ran the experiment, there were 483 times that reading x_ resulted in 0, even though y_ was 1. ARM has a weaker memory model than x86. This has some advantages in processor implementation; it has distinct disadvantages for how our brains work. X86 tries to preserve the model that there is shared memory that everyone sees and works with. That's not strictly true, even for x86, but ARM and Power don't even come close. On the other hand, it's also why it's easier to add more cores to Power and ARM chips and systems. I routinely work with Power systems with 512 physical cores.
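The framework driving these runs isn't shown in the post; a stripped-down equivalent (a sketch I've written for illustration, not the actual framework) just launches the two member functions on separate threads, joins them, and tallies each observed result:

```cpp
#include <atomic>
#include <functional>
#include <map>
#include <thread>

// The MP litmus test, as in the post, with the tuple replaced by a
// plain int result for brevity.
struct MP {
    std::atomic<int> x_{0};
    std::atomic<int> y_{0};
    void t1() {
        x_.store(1, std::memory_order_relaxed);
        y_.store(1, std::memory_order_relaxed);
    }
    void t2(int& read) {
        while (!y_.load(std::memory_order_relaxed)) {}
        read = x_.load(std::memory_order_relaxed);
    }
};

// Run the experiment `runs` times; count how often each result appears.
std::map<int, long> run_mp(int runs) {
    std::map<int, long> results;
    for (int i = 0; i < runs; ++i) {
        MP mp;
        int read = -1;
        std::thread a(&MP::t1, &mp);
        std::thread b(&MP::t2, &mp, std::ref(read));
        a.join();
        b.join();
        ++results[read];
    }
    return results;
}
```

On x86, TSO forbids the result 0; on ARM or Power, a handful of 0s can show up in the tally, as in the rPi runs above.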

Store Buffering

Store buffering is the odd case that is allowed in the Intel memory model. When assigning locations in two threads, and then reading them on opposite threads, both threads are allowed to read the older state. The stores get buffered. From the Intel White Paper:
Processor 0            Processor 1
mov [_x], 1 // M1      mov [_y], 1 // M3
mov r1, [_y] // M2     mov r2, [_x] // M4

Initially x = y = 0. The result r1 = 0 and r2 = 0 is allowed.

Note, in particular, that there is no interleaving of M1-M4 that could result in both r1 and r2 being 0, not without interrupting an instruction in the middle. But the instructions themselves are atomic and indivisible. If they were actually operating on shared memory, this would not be possible. However, it does happen.
SB::SB() : x_(0), y_(0) {}
void SB::t1(Result& read) {
    y_.store(1, std::memory_order_relaxed);
    std::get<0>(read) = x_.load(std::memory_order_relaxed);
}
void SB::t2(Result& read) {
    x_.store(1, std::memory_order_relaxed);
    std::get<1>(read) = y_.load(std::memory_order_relaxed);
}

That generates the x86 code
00000000000000f0 <litmus::SB::t1(std::__1::tuple<int, int>&)>:
  f0:   mov    DWORD PTR [rdi+0x40],0x1
  f7:   mov    eax,DWORD PTR [rdi]
  f9:   mov    DWORD PTR [rsi],eax
  fb:   ret

0000000000000100 <litmus::SB::t2(std::__1::tuple<int, int>&)>:
 100:   mov    DWORD PTR [rdi],0x1
 106:   mov    eax,DWORD PTR [rdi+0x40]
 109:   mov    DWORD PTR [rsi+0x4],eax
 10c:   ret

And on my x86 machine:
[ RUN      ] ExperimentTest.SBTest1
(0, 0) : 559
(0, 1) : 999858
(1, 0) : 999576
(1, 1) : 7

So 559 times, neither core saw the other core's store.
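The C++ memory model gives us the control to forbid that outcome: with seq_cst instead of relaxed, the four operations fall into a single total order, so at least one thread must see the other's store. On x86 the compiler pays for this with an mfence or xchg after the store. A sketch (my illustration, not from the post):

```cpp
#include <atomic>
#include <thread>

// Store buffering with seq_cst: the (0, 0) outcome is now impossible,
// because all four operations are part of one total order.
struct SBSeqCst {
    std::atomic<int> x_{0};
    std::atomic<int> y_{0};
    int r1 = -1;
    int r2 = -1;
    void t1() {
        y_.store(1, std::memory_order_seq_cst);
        r1 = x_.load(std::memory_order_seq_cst);
    }
    void t2() {
        x_.store(1, std::memory_order_seq_cst);
        r2 = y_.load(std::memory_order_seq_cst);
    }
    void run() {
        std::thread a(&SBSeqCst::t1, this);
        std::thread b(&SBSeqCst::t2, this);
        a.join();
        b.join();
    }
};
```

Whichever store is later in the total order has the earlier store visible before its load, so at least one of r1, r2 must be 1.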

Load Buffering

Load Buffering is the dual of store buffering. Loads into registers might be delayed, or buffered, and actually performed after subsequent instructions. It's not allowed in the Intel architecture. From the Intel White Paper:
Processor 0            Processor 1
mov r1, [_x] // M1     mov r2, [_y] // M3
mov [_y], 1 // M2      mov [_x], 1 // M4

Initially x = y = 0. The result r1 = 1 and r2 = 1 is not allowed.
LB::LB() : x_(0), y_(0) {}
void LB::t1(Result& read) {
    std::get<0>(read) = x_.load(std::memory_order_relaxed);
    y_.store(1, std::memory_order_relaxed);
}
void LB::t2(Result& read) {
    std::get<1>(read) = y_.load(std::memory_order_relaxed);
    x_.store(1, std::memory_order_relaxed);
}
This is the x86 asm code
00000000000000c0 <litmus::LB::t1(std::__1::tuple<int, int>&)>:
  c0:   mov    eax,DWORD PTR [rdi]
  c2:   mov    DWORD PTR [rsi],eax
  c4:   mov    DWORD PTR [rdi+0x40],0x1
  cb:   ret
  cc:   nop    DWORD PTR [rax+0x0]

00000000000000d0 <litmus::LB::t2(std::__1::tuple<int, int>&)>:
  d0:   mov    eax,DWORD PTR [rdi+0x40]
  d3:   mov    DWORD PTR [rsi+0x4],eax
  d6:   mov    DWORD PTR [rdi],0x1
  dc:   ret
  dd:   nop    DWORD PTR [rax]

And the ARM code, at -O1
000000d0 <litmus::LB::t1(std::tuple<int, int>&)>:
  d0:   ldr     r3, [r0]
  d4:   str     r3, [r1, #4]
  d8:   mov     r3, #1
  dc:   str     r3, [r0, #64]   ; 0x40
  e0:   bx      lr

000000e4 <litmus::LB::t2(std::tuple<int, int>&)>:
  e4:   ldr     r3, [r0, #64]   ; 0x40
  e8:   str     r3, [r1]
  ec:   mov     r3, #1
  f0:   str     r3, [r0]
  f4:   bx      lr

ARM generally allows it, but per [maranget2012tutorial] it's very sensitive, and dependencies will make it not appear. In my tests, I did not observe an instance of load buffering, though that may be due to the first store the compiler introduces in order to get the data into the tuple. That it's documented as possible is still exceedingly strange.

Independent Reads of Independent Writes

IRIW is a generalization of store buffering, where two reader threads each read different apparent orderings of writes from two distinct writer threads.
T1         T2         T3           T4
X = 1      Y = 1      R1 = X       R3 = Y
                      R2 = Y       R4 = X

Initially X = Y = 0. The result r1 = 1, r2 = 0, r3 = 1, r4 = 0 is allowed on ARM, but not on x86. [maranget2012tutorial, owens2009better]

This is not observed on x86 processors, but is on some ARM and POWER implementations, more often on POWER. X86 hardware maintains a consistent view of memory, while other hardware can see memory writes in different orders on different threads. On my rPi, I didn't observe any incidents of X and Y being read out of order, over 40 million runs.
IRIW::IRIW() : x_(0), y_(0) {}
void IRIW::t1() {
    x_.store(1, std::memory_order_relaxed);
}

void IRIW::t2() {
    y_.store(1, std::memory_order_relaxed);
}

void IRIW::t3(Result& read) {
    std::get<0>(read) = x_.load(std::memory_order_relaxed);
    std::get<1>(read) = y_.load(std::memory_order_relaxed);
}

void IRIW::t4(Result& read) {
    std::get<2>(read) = y_.load(std::memory_order_relaxed);
    std::get<3>(read) = x_.load(std::memory_order_relaxed);
}

Summary

The allowed behavior of modern processors is very different from our mental model of a Von Neumann architecture computer. Each core can have a different view of memory, and without additional controls, writes and reads can break the illusion of a single unified memory. The C++ memory model gives us the controls and guarantees about what happens when different threads read and write memory, and here I've deliberately used the weakest version available, relaxed, in order to allow the processors the widest latitude in behavior. Relaxed often compiles to the same code as an unconstrained int access, which means that you will get odd behavior if you are running shared-state multithreaded code that uses plain native types. It is a particular problem for code that was originally written and tested on an x86 architecture, because the native model is fairly strong. This frequently causes problems when porting to a mobile platform, where ARM is a very popular hardware choice.

Org-mode source and git repo

Exported from an org-mode doc. All of the source is available on github at SpinGate

References

Bibliography

  • [boehm2005threads] Boehm, Hans-J. "Threads cannot be implemented as a library." ACM SIGPLAN Notices 40(6), 2005, pp. 261-268.
  • [bacon2000double] Bacon, David, Joshua Bloch, Jeff Bogda, Cliff Click, Paul Haahr, Doug Lea, Tom May, Jan-Willem Maessen, J.D. Mitchell, Kelvin Nilsen, et al. "The 'Double-Checked Locking is Broken' Declaration." 2000.
  • [IWPAug2007] "Intel® 64 Architecture Memory Ordering White Paper." http://www.cs.cmu.edu/~410-f10/doc/Intel_Reordering_318147.pdf. Accessed 2017-04-30.
  • [owens2009better] Owens, Scott, Susmit Sarkar, and Peter Sewell. "A better x86 memory model: x86-TSO." International Conference on Theorem Proving in Higher Order Logics, 2009, pp. 391-407.
  • [maranget2012tutorial] Maranget, Luc, Susmit Sarkar, and Peter Sewell. "A tutorial introduction to the ARM and POWER relaxed memory models." Draft, 2012. http://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test7.pdf

Date: 2017-04-30 Sun 00:00

Author: Steve Downey

Created: 2018-06-17 Sun 14:07
