The first step for OCaml would be to be able to run multiple
communicating instances of the runtime bound to one core each in one
process and have them communicate via lock-free queues.

We've done some experiments in this direction at Jane Street. On Linux, we've been able to get IPC channels fast enough for our purposes that slamming things into the same memory space has not, in the end, been necessary. (There is, I agree, some pain associated with running multiple runtimes in the same process. If you're interested, contact me off-list and I can try to get you some of the details of what we ran into.)

But have you tried using shared-memory segments for communicating between different processes? You say the latencies are too high, but do you have any measurements you could share? Have you tried queues built on shared-memory segments, in particular? Inter-thread communication has latency as well, and the performance issues depend on lots of things, OS and hardware platform included. Concrete numbers would help in understanding the tradeoffs.

As we go to higher and higher numbers of cores, I suspect that message-passing solutions are likely to scale better than shared memory, so I'm not so sure that OCaml is on the wrong path here. I think that most of the work that's needed is going to come in the form of libraries, with only a little work in the compiler and the runtime. Given that, I think this is an issue for the community to solve, not INRIA.