> It seems it was a completely broken C++ solution

 

For this particular problem (which is essentially building a stock exchange) C++ is only OK for the initial core. As soon as you start adding features that interact with each other, a decent solution becomes intractable and you end up with 40+ developers writing millions of lines of code that deep-copy data structures to avoid memory management problems, share heavily contended locks, use thousands of OS threads (so you have to start tweaking the default stack size to fit them all into RAM) and so on. Everybody loses track of the big picture and after 15 years of this you’ve got an unmaintainable code base with runaway costs.

 

> Do I correctly understand that OCaml is suitable for latencies ~10ms and worse?

 

I’m seeing 95% of messages handled in under 13 microseconds.

 

> Also, there is still an issue with multithreading. Did you use any existing solution?

 

My OCaml solution is single threaded. Latency is great but throughput could be a lot better. In particular, serialization and deserialization to each client is embarrassingly parallel but done in series by OCaml.

 

Cheers,

Jon.

 

From: Stanislav Artemkin [mailto:artemkin@gmail.com] 
Sent: 10 June 2016 22:34
To: jon@ffconsultancy.com
Cc: Gabriel Scherer; caml users; Damien Doligez
Subject: Re: [Caml-list] Measuring GC latencies for OCaml program

 

Very interesting! It seems it was a completely broken C++ solution. I wish I could use OCaml for my current project, but we still have to use C++ to get microsecond latencies.

 

Do I correctly understand that OCaml is suitable for latencies ~10ms and worse?

 

Also, there is still an issue with multithreading. Did you use any existing solution?

 

Thank you

 

On Sat, Jun 11, 2016 at 12:35 AM, Jon Harrop <jon@ffconsultancy.com> wrote:


Very interesting, thank you!

We just implemented a substantial client and server system for the finance sector with the "low" latency server written in OCaml. I have done this before in other languages and seen it done in many more languages. The OCaml version is by far the most consistently fast solution I have ever seen. Orders of magnitude faster than the last C++ solution I saw. In particular, compared to Java and .NET, where we see substantial latencies from the GC at around 100ms, with OCaml there is no visible peak at high latency due to the GC at all. And this project was implemented to a very short deadline with no time for optimisation at all.

On a related note, we used Jane St.'s Core and Async libraries as well as Cohttp and found them all to be *phenomenally* efficient and robust.

In case anyone is interested, the only pain point I had was the development environment. I actually prototyped all my hard code in simplified F# in Visual Studio on Windows and then ported to OCaml. Emacs and Merlin crash and hang a lot for me: maybe 50 times per day. Hence my other post. :-)

In terms of the language, OCaml was very well suited to this task. Lots of purely functional data structures forming in-memory databases that can be queried in different ways and have many different versions of them stored in different places at different times. Perhaps the main language feature I missed from F# was (surprisingly!) reflection. My F# client code uses reflection to serialize and deserialize messages. With no reflection I couldn't do that in OCaml so I used reflection in F# to autogenerate the OCaml code.
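
To illustrate what "many different versions" of a purely functional structure looks like, here is a minimal sketch using the standard library's Map. The order-book shape, the names (PriceMap, add_order, best_bid, depth_at_or_above) and the integer prices are all hypothetical and are not taken from the system described above.

```ocaml
(* Hypothetical order book keyed by price level; all names are illustrative. *)
module PriceMap = Map.Make (Int)

type book = int PriceMap.t (* price -> total quantity resting at that price *)

(* Adding quantity returns a new book and leaves the old version untouched. *)
let add_order (price : int) (qty : int) (b : book) : book =
  PriceMap.update price
    (function None -> Some qty | Some q -> Some (q + qty))
    b

(* One way to query the data: the highest price level, if any. *)
let best_bid (b : book) : (int * int) option = PriceMap.max_binding_opt b

(* Another query over the same data: total quantity at or above a price. *)
let depth_at_or_above (limit : int) (b : book) : int =
  PriceMap.fold (fun p q acc -> if p >= limit then acc + q else acc) b 0

(* Every intermediate version stays valid and can be stored anywhere. *)
let v0 : book = PriceMap.empty
let v1 = add_order 100 5 v0
let v2 = add_order 101 3 v1
let v3 = add_order 100 2 v2
```

Because each update shares most of its structure with the previous version, keeping v1 around after building v3 costs little and needs no deep copy or lock.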

Cheers,
Jon.


-----Original Message-----
From: caml-list-request@inria.fr [mailto:caml-list-request@inria.fr] On Behalf Of Gabriel Scherer
Sent: 30 May 2016 20:48
To: caml users
Cc: Damien Doligez
Subject: [Caml-list] Measuring GC latencies for OCaml program

Dear caml-list,

You may be interested in the following blog post, in which I give instructions to measure the worst-case latencies incurred by the GC:

  Measuring GC latencies in Haskell, OCaml, Racket
  http://prl.ccs.neu.edu/blog/2016/05/24/measuring-gc-latencies-in-haskell-ocaml-racket/

In summary, the commands to measure latencies look something like:

    % build the program with the instrumented runtime
    ocamlbuild -tag "runtime_variant(i)" myprog.native

    % run with instrumentation enabled
    OCAML_INSTR_FILE="ocaml.log" ./myprog.native

    % visualize results from the raw log
    $(OCAML_SOURCES)/tools/ocaml-instr-graph ocaml.log
    $(OCAML_SOURCES)/tools/ocaml-instr-report ocaml.log

While the OCaml GC has had a good incremental mode with low latencies for most workloads for a long time, the ability to instrument it to actually measure latencies is still in its infancy: it is a side-result of Damien Doligez's stint at Jane Street last year, and 4.03.0 is the first release in which this work is available.
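
For readers without an instrumented runtime at hand, a much coarser view can be had from inside the program by timing each step of an allocating workload and keeping the worst one. The sketch below is only that: the workload, the constants window and steps, and the use of Sys.time (processor time, chosen to stay within the standard library; a wall-clock or monotonic timer would be the more usual choice) are assumptions, and none of it touches the instrumented runtime.

```ocaml
(* Coarse, in-program view of pauses: time each step of an allocating
   workload and remember the slowest step. Illustrative only. *)
module IntMap = Map.Make (Int)

let window = 10_000 (* hypothetical: messages kept alive at once *)
let steps = 200_000 (* hypothetical: total messages pushed *)

let run () =
  let worst = ref 0.0 in
  let live = ref IntMap.empty in
  for i = 0 to steps - 1 do
    let t0 = Sys.time () in
    (* Allocate a fresh message and retire the oldest one. *)
    live := IntMap.add i (Bytes.make 1024 'x') !live;
    if i >= window then live := IntMap.remove (i - window) !live;
    let dt = Sys.time () -. t0 in
    if dt > !worst then worst := dt
  done;
  (!worst, IntMap.cardinal !live)

let worst_step, live_count = run ()

(* Collection counters from the standard Gc module, for context. *)
let gc = Gc.quick_stat ()

let () =
  Printf.printf "worst step: %.6f s, minor GCs: %d, major GCs: %d\n"
    worst_step gc.Gc.minor_collections gc.Gc.major_collections
```

A slow step here may come from the GC or from anything else the step does, which is exactly why runtime-level instrumentation is the better tool when it is available.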

A practical consequence of this youth is that the "user experience" of actually performing these measurements is currently very bad. The GC measurements are activated in an instrumented runtime variant (OCaml supports having several variants of the runtime available, and deciding which one to use for a specific program at link-time), which is the right design choice, but right now this variant is not built by default by the compiler distribution -- building it is a configure-time option disabled by default. This means that, as a user interested in doing the measurements, you have to compile an alternative OCaml compiler.
Furthermore, processing the raw instrumented log requires tools that are also in their infancy, and are currently included in the compiler distribution sources but not installed -- so you have to have a source checkout available to use them. In contrast, GHC's instrumentation is enabled by just passing the option "+RTS -s" to the Haskell program of interest; this is superbly convenient and something we should aim at.

I discussed with Damien whether we should enable building the instrumented runtime by default (for example pass the --with-instrumented-runtime option to the opam switches people are using, and encourage distributions to use it in their packages as well). Of course there is a cost/benefit trade-off: currently virtually nobody is using this instrumentation, but enabling it by default would increase the compilation time of the compiler distribution for everyone. (On my machine it only adds 5 seconds to total build time.)

I personally think that we should aim for a rock-solid experience for profiling and instrumenting OCaml programs, enabled by default¹. It is worth making it take slightly longer for everyone to install the compiler if we can make it vastly easier to measure GC pauses in our programs when the need arises (even if it's not very often). But that is a discussion that should be had before making any choice.

Regarding the log analysis tools, before finding out about Damien's included-but-not-installed tools (a shell and an awk script, in the finest Unix tradition) I built a quick&dirty OCaml script to do some measurements, which can be found in the benchmark repository below. It would not be much more work to grow this into a reusable library that extracts the current log format into a structured data type -- the format is undocumented but the provided scripts in tools/ have enough information to infer the structure. Such a script/library would, of course, remain tightly coupled to the OCaml version, but I think it could be useful to have it packaged for users to play with.

  https://gitlab.com/gasche/gc-latency-experiment/blob/master/parse_ocaml_log.ml
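
Once durations have been pulled out of a log, summarising them is the easy part. The sketch below assumes the extraction has already happened (the log format is undocumented, so no parsing is attempted here) and computes nearest-rank percentiles over a list of durations. The names percentile, summary and summarize are hypothetical, and the sample values are made up for illustration.

```ocaml
(* Nearest-rank percentile over a sorted, non-empty array of durations.
   [p] is an integer percentage between 1 and 100. *)
let percentile (sorted : float array) (p : int) : float =
  let n = Array.length sorted in
  let rank = ((p * n) + 99) / 100 in (* ceiling of p*n/100, in integers *)
  sorted.(max 0 (min (n - 1) (rank - 1)))

type summary = { p50 : float; p95 : float; worst : float }

let summarize (durations : float list) : summary =
  let a = Array.of_list durations in
  Array.sort compare a;
  { p50 = percentile a 50;
    p95 = percentile a 95;
    worst = a.(Array.length a - 1) }

(* Made-up durations in microseconds, for illustration only. *)
let sample = [ 3.0; 4.0; 5.0; 4.0; 6.0; 5.0; 4.0; 3.0; 250.0; 5.0 ]
let s = summarize sample
```

Reporting the worst case next to the percentiles matters for GC work, since a single long pause can hide behind an otherwise healthy median.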

¹: We cannot expect users to effectively write performant code if they don't have the tool support for it. The fact that laziness in Haskell makes it harder for users to reason about efficiency or memory usage has made the availability of excellent performance tooling *necessary*, where it is merely nice-to-have in OCaml. Rather ironically, Haskell tooling is now much better than OCaml's in this area, to the point that it can be easier to write efficient code in Haskell.

Four side-notes on profiling tools:

1. `perf record --call-graph=dwarf` works fine for ocamlopt binaries
  (no need for a frame-pointers switch), and this is documented:
    https://ocaml.org/learn/tutorials/performance_and_profiling.html#UsingperfonLinux

2. Thomas Leonard has done excellent work on domain-specific profiling
   tools for Mirage, and this is the kind of tool support that I think
   should be available to anyone out of the box.
     http://roscidus.com/blog/blog/2014/08/15/optimising-the-unikernel/
     http://roscidus.com/blog/blog/2014/10/27/visualising-an-asynchronous-monad/

3. There is currently more debate than anyone could wish for around
   a pull request of Mark Shinwell for runtime support for dynamic call
   graph construction and its use for memory profiling.
     https://github.com/ocaml/ocaml/pull/585

4. Providing a good user experience for performance or space profiling
   is a fundamentally harder problem than for GC pauses. It may
   require specially-compiled versions of the libraries used by your
   program, and thus a general policy/agreement across the
   ecosystem. Swapping a different runtime at link-time is very easy.

--
Caml-list mailing list.  Subscription management and archives:
https://sympa.inria.fr/sympa/arc/caml-list
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners

Bug reports: http://caml.inria.fr/bin/caml-bugs


