caml-list - the Caml user's mailing list
 help / color / mirror / Atom feed
From: Anil Madhavapeddy <anil@recoil.org>
To: "Daniel Bünzli" <daniel.buenzli@erratique.ch>
Cc: caml-list <caml-list@inria.fr>
Subject: Re: [Caml-list] Non-blocking IO interface design
Date: Mon, 9 Apr 2012 20:28:51 +0100	[thread overview]
Message-ID: <1341A6A9-0273-434A-BCA9-819DACA53FFF@recoil.org> (raw)
In-Reply-To: <83F8677AD5E142A3BB0EC27C78C0658B@erratique.ch>

On 8 Apr 2012, at 21:37, Daniel Bünzli wrote:
> 
> I think that the first five points are mostly met and the
> cps transformation of blocking into non-blocking is relatively
> straightforward and remains readable in my opinion. Regarding the 6th
> point, using the included `setrip.native` program on 32 Mo of randomly
> generated s-expressions seem to indicate that:
> 
>  The non-blocking decoder can be at least 1.35 slower than blocking
>  The non-blocking encoder can be at least 1.1 slower than blocking
> 
> Now I don't think these "bad" numbers should be taken to dismiss the
> approach since in the context of a larger reactive program a blocking
> codec may actually incur performance and scability issues that cannot
> be shown by this example program.

To track this down (as I'm learning about ocaml profiling at the moment),
I firstly ran the blocking and non-blocking decoder through ocamlcp to check 
call counts, and:

(non-blocking)
let decode_nb_unix usize fd =                                                                         
  (* 2 *) let rec loop d fd buf = (* 18039140 *) match Se.Nb.decode d with                            
  | `Lexeme l -> (* 18035048 *) loop d fd buf
  | `End -> (* 2 *) `Ok  
  | `Error -> (* 0 *) `Error                                                                          
  | `Await -> 
      (* 4090 *) let rc = unix_read fd buf 0 (String.length buf) in                                   
      Se.Nb.decode_src d buf 0 rc; loop d fd buf                                                      
  in
  loop (Se.Nb.decoder `Manual) fd (String.create usize)                                               

(blocking)
let decode_b src =
  (* 1 *) let rec loop d = (* 9017525 *) match Se.B.decode d with
  | `Lexeme l -> (* 9017524 *) loop d
  | `End -> (* 1 *) `Ok
  | `Error -> (* 0 *) `Error
  in
  loop (Se.B.decoder src)

The I/O loop is being called twice for the non-blocking version, as it receives 
the `Await signal, does the Unix syscall, and then jumps into decode_src. Presumably
a full non-blocking version would have to register with a select handler if it
gets an EAGAIN at this point,

In terms of the number of system calls, the non-blocking one is more efficient,
as it uses a 16KB buffer versus the 4K reads done by the blocking version. 

$ strace ./setrip.native -dec -unix  < foo.sexp  2>&1 | wc -l
2107
$ strace ./setrip.native -dec -unix -b  < foo.sexp  2>&1 | wc -l
8238

Running 'perf stat' on the decoder on Linux shows that the non-blocking version
is spending more CPU time doing something:

$ perf stat --log-fd 1 ./setrip.native -dec -unix -b  < foo.sexp  
 Performance counter stats for './setrip.native -dec -unix -b':

       1095.337180 task-clock                #    1.000 CPUs utilized          
                 1 context-switches          #    0.000 M/sec                  
                 0 CPU-migrations            #    0.000 M/sec                  
               800 page-faults               #    0.001 M/sec                  

$ perf stat --log-fd 1 ./setrip.native -dec -unix < foo.sexp  
 Performance counter stats for './setrip.native -dec -unix':

       1339.360940 task-clock                #    1.000 CPUs utilized          
                 2 context-switches          #    0.000 M/sec                  
                 0 CPU-migrations            #    0.000 M/sec                  
               823 page-faults               #    0.001 M/sec                  


Setting the GC to verbose shows that the non-blocking one is doing way more
heap allocations than the blocking version (14 heap checks, versus 3). So this
quick'n'dirty function is stuck in the source to bracket hot-looking calls and
see how many minor heap allocations are taking place.

let gcbracket fn a =                                                                                                                       
  let open Gc in
  compact ();
  let s1 = stat () in
  let r = fn a in
  let s2 = stat () in
  let r = fn a in
  Printf.eprintf "gc: %.0f [%.0f %.0f %.0f / %.0f %0.f %0.f]\n%!"
    (s2.minor_words -. s1.minor_words) s1.minor_words s1.promoted_words
    s1.major_words s2.minor_words s2.promoted_words s2.major_words;
  r

Looking at the two decoders in src/se.ml, it looks like the non-blocking one
allocates closures on every loop, which the blocking one doesn't. This is so it
can store the continuation in d.k for the next loop. So putting gcbracket around
the two decode calls confirms this. Switching to bytecode so we have accurate minor
heap states, we have:

For the blocking one:
gc: 31 [1450548 0 44224 / 1450579 0 44224]
gc: 31 [1451229 0 44255 / 1451260 0 44255]
gc: 31 [1452506 0 44278 / 1452537 0 44278]
gc: 31 [1453187 0 44309 / 1453218 0 44309]

The important number is the first one (difference in minor heap size before and
after the function, which is solidly 31.

With the non-blocking one, more words are allocated in the minor heap:
gc: 67 [5368 0 4684 / 5435 0 4684]
gc: 42 [6096 0 4723 / 6138 0 4723]
gc: 42 [7395 0 4762 / 7437 0 4762]
gc: 56 [8119 0 4809 / 8175 0 4809]

So to summarise, instead of storing a continuation closure, it would probably be better 
to explicitly store the state in d.k to minimise allocation?

The library looks very useful by the way: I have exactly the same issue with several
Lwt-only protocol libraries we're developing at the moment. Would love to use yours before
the first release of them to make them more independent of the underlying I/O mechanism...

Anyone else got any useful profiling tips? I used ocaml-4-trunk for this, and one useful
feature I just discovered (probably thanks to PR#5487 being closed) is that you can do:

$ <add 'debug' to _tags>
$ perf record ./foo.native
$ perf annotate

...and it will annotate hotspots in the binary, show you the x86 assembly, and also the
OCaml source code amidst the assembly (due to GDB debug symbols being present).

-anil

  reply	other threads:[~2012-04-09 19:28 UTC|newest]

Thread overview: 5+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-04-08 20:37 Daniel Bünzli
2012-04-09 19:28 ` Anil Madhavapeddy [this message]
2012-04-10  8:21   ` Daniel Bünzli
2012-04-10 12:40     ` Yaron Minsky
2012-04-14  9:46       ` Daniel Bünzli

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1341A6A9-0273-434A-BCA9-819DACA53FFF@recoil.org \
    --to=anil@recoil.org \
    --cc=caml-list@inria.fr \
    --cc=daniel.buenzli@erratique.ch \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).