I don’t really think out-of-order execution in hardware causes programmers any trouble that wasn’t already there once you compile with -O3.  Compilers will already promote memory to registers, do interprocedural optimization, and reorder memory references.  You have to sprinkle
asm volatile("" ::: "memory");
around like pixie dust to make sure the compiler does things in the order you want, never mind the hardware.
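
For anyone who hasn't run into this, here is a minimal sketch of what that barrier looks like in use, assuming GCC or Clang. The names COMPILER_BARRIER, payload, ready and publish are made up for illustration. The barrier emits no instruction at all; it only stops the compiler from moving memory accesses across it or keeping memory values in registers across it.

```c
/* Compiler-only barrier, GCC/Clang syntax. It generates no machine code;
   it only tells the compiler that memory may have changed, so it must not
   reorder or cache memory accesses across this point. */
#define COMPILER_BARRIER() __asm__ __volatile__("" ::: "memory")

int payload;  /* hypothetical shared data */
int ready;    /* hypothetical "data is valid" flag */

/* Write the payload, then the flag, in that order in the generated code. */
void publish(int value)
{
    payload = value;
    COMPILER_BARRIER();  /* without this the optimizer is free to swap the two stores */
    ready = 1;
}
```

This constrains the compiler only. It says nothing to the hardware, and it does not make plain int accesses safe to share between threads in the C language's own terms.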

x86 has a wildly complex microarchitecture, but the model for a single thread is perfectly sensible.  Everything behaves as if it ran in order.  OOO doesn’t change that model of execution at all.  I mean, you care about it when performance tuning, but not for correctness.

Other architectures, such as ARM, IBM, and Alpha, are far worse in this respect.

The real problems come when you have multiple cores in a shared memory space and you have to read all the fine print in the memory ordering chapter and understand the subtle differences between LFENCE and SFENCE and MFENCE.  I get that, but I also think shared memory is a failed experiment and we should have gone with distributed memory and clusters and message passing.  It is possible for mortals to program those.  Regular folks can make 1000-rank MPI programs work, whereas almost no one other than Maurice Herlihy or someone like him can reason about locks and threads and shared data structures.
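
To make the fine print concrete, here is a rough sketch of the simplest shared-memory handoff written with C11 <stdatomic.h> rather than raw fence instructions. The names data, data_ready, send_data and try_receive are hypothetical. The release store and acquire load are what let the compiler and the hardware agree that the payload is visible before the flag; which instructions that turns into depends on the target and the compiler.

```c
#include <stdatomic.h>
#include <stdbool.h>

int data;                /* hypothetical payload, plain non-atomic variable */
atomic_bool data_ready;  /* hypothetical flag; static storage, so it starts false */

/* Producer side: write the payload, then publish the flag with release
   ordering so the payload write cannot become visible after the flag. */
void send_data(int value)
{
    data = value;
    atomic_store_explicit(&data_ready, true, memory_order_release);
}

/* Consumer side: the acquire load pairs with the release store above.
   If the flag is seen as true, the payload write is guaranteed visible. */
bool try_receive(int *out)
{
    if (!atomic_load_explicit(&data_ready, memory_order_acquire))
        return false;
    *out = data;
    return true;
}
```

Even this one-shot, single-producer, single-consumer case needs the two orderings to match up, and it only gets harder from here.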

My wish for a better world is to integrate messaging into the architecture rather than have an I/O device model for communications.  It is criminal that machine-to-machine comms are still stuck at 800 nanoseconds or so of latency.  It takes 200 instructions or so to send a message under the best circumstances and a similar number to receive it, plus bus, adapter, wire, and switch time.

-L