caml-list - the Caml user's mailing list
* [Caml-list] Non-blocking IO interface design
@ 2012-04-08 20:37 Daniel Bünzli
  2012-04-09 19:28 ` Anil Madhavapeddy
  0 siblings, 1 reply; 5+ messages in thread
From: Daniel Bünzli @ 2012-04-08 20:37 UTC (permalink / raw)
  To: caml-list

Hello,

This is problematic:

  https://github.com/williamleferrand/xmlm
  http://ambassadortothecomputers.blogspot.com/2010/08/mixing-monadic-and-direct-style-code.html

To solve this problem I'm looking for a simple interface design to
make my IO modules compatible with monadic concurrency libraries (lwt,
async, [insert your own here]) and event based loops (select(2),
poll(2), etc.). The design should have the following properties:

1. Unified interface for blocking and non-blocking mode. 
2. The existence of the non-blocking mode should not significantly
   impact blocking mode users.
3. Input possible from in_channel, string, refillable fixed-size string 
   buffer (non-blocking mode). 
4. Output possible to out_channel, Buffer.t, flushable fixed-size string
   buffer (non-blocking mode).
5. No third-party IO libraries/paradigms so that the module can adapt
   to the one the user chooses.
6. Reasonably efficient.

I looked for some time into Haskell's enumerators, pipes and other
conduits but I eventually came back to a more ad-hoc approach that
abstracts as follows. I'll gladly take any feedback you may have.

Suppose we want to do IO on streams of values of type [t]. For
example xmlm's signals (lexemes, as they should be called), if you are
familiar with that.

For input (decoding) we begin with a type for input sources, decoders
and a function to create them.

  type src = [ `Channel of in_channel | `String of string | `Manual ]
  type decoder 
  val decoder : src -> decoder

A [`Manual] source means that the client will provide the decoder with
chunks of bytes to decode at its own pace. The function for decoding
is:

  val decode : decoder -> [ `Await | `End | `Error of e | `Yield of t ]

[decode d] is:

- [`Await] iff [d] has a [`Manual] input source and awaits more
  input. The client must use [decode_src] (see below) to provide it.
- [`Yield v], if a value [v] of type [t] was decoded. 
- [`End], if the end of input was reached.
- [`Error e], if an error [e] occurred. If you are interested in
  best-effort decoding, you can still continue to decode after
  the error.
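To make the contract concrete, here is a toy instance of the interface (illustrative only, not xmlm's actual code) in which the decoded values are simply the bytes of a `String source, together with a typical blocking client loop:

```ocaml
(* Toy decoder: values of type [t] are just chars, and only the
   [`String] source is wired up; [`Await] and [`Error] never occur. *)
type src = [ `String of string | `Manual ]
type decoder = { input : string; mutable pos : int }

let decoder (s : src) = match s with
| `String input -> { input; pos = 0 }
| `Manual -> { input = ""; pos = 0 }  (* not supported in this toy *)

let decode d =
  if d.pos >= String.length d.input then `End
  else (d.pos <- d.pos + 1; `Yield d.input.[d.pos - 1])

(* A typical blocking client folds over decoded values until [`End]. *)
let fold f acc d =
  let rec loop acc = match decode d with
  | `Yield v -> loop (f acc v)
  | `End -> acc
  in
  loop acc
```

For example, `fold (fun n _ -> n + 1) 0 (decoder (`String "abc"))` counts the decoded values.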

For [`Manual] sources the function [decode_src] is used to provide
the byte chunks to read from:

  val decode_src : decoder -> string -> int -> int -> unit

[decode_src d s k l] provides [d] with [l] bytes to read, starting at
[k] in [s]. This byte range is read by calls to [decode] with [d]
until [`Await] is returned. To signal the end of input, call the
function with [l = 0].
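To see the `Manual protocol end to end, here is a toy decoder (again illustrative: the decoded values are just the bytes it is fed) driven by a client loop that feeds 4-byte chunks and signals end of input with [l = 0]:

```ocaml
(* Toy [`Manual] decoder: yields each byte of the chunks it is fed. *)
type decoder =
  { mutable buf : string; mutable pos : int; mutable max : int;
    mutable eoi : bool }

let decoder () = { buf = ""; pos = 0; max = 0; eoi = false }

let decode_src d s k l =
  if l = 0 then d.eoi <- true          (* l = 0 signals end of input *)
  else (d.buf <- s; d.pos <- k; d.max <- k + l)

let decode d =
  if d.pos < d.max then
    (let c = d.buf.[d.pos] in d.pos <- d.pos + 1; `Yield c)
  else if d.eoi then `End
  else `Await

(* Client loop: feed [input] in 4-byte chunks, collect decoded bytes. *)
let run input =
  let d = decoder () in
  let b = Buffer.create 16 in
  let n = String.length input in
  let fed = ref 0 in
  let rec loop () = match decode d with
  | `Yield c -> Buffer.add_char b c; loop ()
  | `End -> Buffer.contents b
  | `Await ->
      if !fed >= n then decode_src d "" 0 0
      else begin
        let l = min 4 (n - !fed) in
        decode_src d input !fed l; fed := !fed + l
      end;
      loop ()
  in
  loop ()
```

The client owns the pacing: in a real event loop the `Await branch would return to the scheduler instead of feeding the next chunk immediately.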

That's all that is needed for input. Just a note on the [`Error]
case. Decoders should report any decoding errors with [`Error] to
allow standards-compliant decoding. However, at that point they should
give the client the opportunity to continue with a best-effort
decoding. In that case [decode] should always eventually return
[`End], even if [`Error]s were reported before. I think best-effort
decoding on errors is a good thing: I was annoyed more than once by
xmlm simply failing with `Malformed_char_stream on files produced by
legacy software that emitted invalid UTF-8 encodings for funky
characters. Rather than fail and block the client at that point, it is
better to report an error and let the client continue, if it wishes,
by replacing the invalid byte sequence with U+FFFD.

For output (encoding) we begin with a type for output destinations,
encoders and a function to create them.

  type dst = [ `Channel of out_channel | `Buffer of Buffer.t | `Manual ]
  type encoder
  val encoder : dst -> encoder

A [`Manual] destination means that the client will provide the
encoder with the chunks of bytes to encode into at its own pace. The
function for encoding is:

  val encode : 
    encoder -> [ `Await | `End | `Yield of t ] -> [ `Ok | `Partial ]

[encode e v] is:

- [`Partial] iff [e] has a [`Manual] destination and needs more output
  storage. The client must use [encode_dst] (see below) to provide it
  and then call [encode e `Await] until [`Ok] is returned.
- [`Ok] when the encoder is ready to encode a new [`Yield] or [`End].

Raises [Invalid_argument] if a [`Yield] or [`End] is encoded after a
[`Partial] encode (this prevents the encoder from having to buffer
[`Yield]s).

For [`Manual] destinations the function [encode_dst] is used to provide
the byte chunks to write to:

  val encode_dst : encoder -> string -> int -> int -> unit

[encode_dst e s k l] provides [e] with [l] bytes to write, starting at
[k] in [s]. This byte range is written by calls to [encode] with [e]
until [`Partial] is returned. To find out how many free bytes remain
unwritten in [s], use [encode_dst_rem]:

  val encode_dst_rem : encoder -> int

[encode_dst_rem e] is the number of free, unwritten bytes remaining
in the last buffer provided with [encode_dst]. A well-behaved encoder
should always fill all the bytes it is given, except in the buffer in
which [`End] is encoded.

One note on [`Manual] destinations: encoding [`End] always returns
[`Partial]. The client should then, as usual, use [encode_dst] and
continue with [`Await] until [`Ok] is returned, at which point
[encode_dst_rem e] is guaranteed to be the size of the last provided
buffer (i.e. nothing was written to it; this is a good property for
the client's loops, see the example code).
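A toy encoder can make this protocol concrete too. The sketch below (illustrative only; it uses bytes where the 2012 interface used string, the [`Channel]/[`Buffer] cases are omitted, and only part of the Invalid_argument check is implemented) writes each yielded string byte-for-byte, and a client loop drains it through fixed 4-byte buffers:

```ocaml
(* Toy [`Manual] encoder for string values. *)
type encoder =
  { mutable out : Bytes.t; mutable pos : int; mutable max : int;
    mutable pending : string; mutable ppos : int; mutable ended : bool }

let encoder () =
  { out = Bytes.empty; pos = 0; max = 0; pending = ""; ppos = 0;
    ended = false }

let encode_dst e s k l = e.out <- s; e.pos <- k; e.max <- k + l
let encode_dst_rem e = e.max - e.pos

(* Copy pending bytes to the current range; `Partial when it fills up. *)
let rec flush e =
  if e.ppos >= String.length e.pending then `Ok
  else if e.pos >= e.max then `Partial
  else begin
    Bytes.set e.out e.pos e.pending.[e.ppos];
    e.pos <- e.pos + 1; e.ppos <- e.ppos + 1; flush e
  end

let encode e v = match v with
| `Yield s ->
    if e.ended then invalid_arg "`Yield after `End";
    e.pending <- s; e.ppos <- 0; flush e
| `Await -> flush e
| `End -> e.ended <- true; `Partial   (* always `Partial, per the spec *)

(* Client loop: encode [vs] through 4-byte buffers, collecting output. *)
let run vs =
  let e = encoder () in
  let out = Buffer.create 16 in
  let buf = Bytes.create 4 in
  let provide () =                    (* keep written bytes, re-provide *)
    Buffer.add_subbytes out buf 0 (Bytes.length buf - encode_dst_rem e);
    encode_dst e buf 0 (Bytes.length buf)
  in
  encode_dst e buf 0 (Bytes.length buf);
  let rec drain = function
  | `Ok -> ()
  | `Partial -> provide (); drain (encode e `Await)
  in
  List.iter (fun s -> drain (encode e (`Yield s))) vs;
  drain (encode e `End);
  (* nothing was written to the last provided buffer *)
  assert (encode_dst_rem e = Bytes.length buf);
  Buffer.contents out
```

Note how the `End handshake forces one final [provide], so the client flushes everything without a special case.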

To validate the approach and provide a blueprint for implementing the
interface I implemented both a blocking codec and a (cps) non-blocking
codec for a simplified grammar of s-expressions. It's available here :

  http://erratique.ch/repos/nbcodec
  git clone http://erratique.ch/repos/nbcodec.git

I think the first five points are mostly met, and the
CPS transformation of blocking into non-blocking is relatively
straightforward and, in my opinion, remains readable. Regarding the 6th
point, running the included `setrip.native` program on 32 MB of randomly
generated s-expressions seems to indicate that:

  The non-blocking decoder can be at least 1.35× slower than the blocking one.
  The non-blocking encoder can be at least 1.1× slower than the blocking one.

Now, I don't think these "bad" numbers should be taken to dismiss the
approach, since in the context of a larger reactive program a blocking
codec may incur performance and scalability issues that this example
program cannot show.

Thanks in advance for your input,

Daniel

P.S. 
Numbers above were gathered by timing these invocations:

  ./setrip.native -enc -unix -rseed 1067894368 > 1067894368.sexp
  ./setrip.native -enc -b -rseed 1067894368 > 1067894368.sexp

  ./setrip.native -dec -unix < 1067894368.sexp
  ./setrip.native -dec -b < 1067894368.sexp

This does, however, also compare two different IO mechanisms:
Pervasives channels for blocking vs direct calls to Unix for
non-blocking. Remove the `-unix` flag to compare the timings with the
same IO mechanism (I then get 1.45× for the decoder and still 1.1×
for the encoder).



* Re: [Caml-list] Non-blocking IO interface design
  2012-04-08 20:37 [Caml-list] Non-blocking IO interface design Daniel Bünzli
@ 2012-04-09 19:28 ` Anil Madhavapeddy
  2012-04-10  8:21   ` Daniel Bünzli
  0 siblings, 1 reply; 5+ messages in thread
From: Anil Madhavapeddy @ 2012-04-09 19:28 UTC (permalink / raw)
  To: Daniel Bünzli; +Cc: caml-list

On 8 Apr 2012, at 21:37, Daniel Bünzli wrote:
> 
> I think the first five points are mostly met, and the
> CPS transformation of blocking into non-blocking is relatively
> straightforward and, in my opinion, remains readable. Regarding the 6th
> point, running the included `setrip.native` program on 32 MB of randomly
> generated s-expressions seems to indicate that:
> 
>  The non-blocking decoder can be at least 1.35× slower than the blocking one.
>  The non-blocking encoder can be at least 1.1× slower than the blocking one.
> 
> Now, I don't think these "bad" numbers should be taken to dismiss the
> approach, since in the context of a larger reactive program a blocking
> codec may incur performance and scalability issues that this example
> program cannot show.

To track this down (as I'm learning about OCaml profiling at the moment),
I first ran the blocking and non-blocking decoders through ocamlcp to check
call counts:

(non-blocking)
let decode_nb_unix usize fd =
  (* 2 *) let rec loop d fd buf = (* 18039140 *) match Se.Nb.decode d with
  | `Lexeme l -> (* 18035048 *) loop d fd buf
  | `End -> (* 2 *) `Ok
  | `Error -> (* 0 *) `Error
  | `Await ->
      (* 4090 *) let rc = unix_read fd buf 0 (String.length buf) in
      Se.Nb.decode_src d buf 0 rc; loop d fd buf
  in
  loop (Se.Nb.decoder `Manual) fd (String.create usize)

(blocking)
let decode_b src =
  (* 1 *) let rec loop d = (* 9017525 *) match Se.B.decode d with
  | `Lexeme l -> (* 9017524 *) loop d
  | `End -> (* 1 *) `Ok
  | `Error -> (* 0 *) `Error
  in
  loop (Se.B.decoder src)

The I/O loop is being called twice for the non-blocking version, as it receives
the `Await signal, does the Unix syscall, and then jumps into decode_src. Presumably
a full non-blocking version would have to register with a select handler if it
gets an EAGAIN at this point.
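That missing piece might look something like this minimal sketch (hypothetical; a real event loop would register a callback rather than block in select, and it uses modern OCaml's bytes and match-on-exception, plus the unix library):

```ocaml
(* When a non-blocking read would block, wait for readability with
   select, then retry; the decoder stays parked on `Await meanwhile.
   Requires linking the unix library. *)
let rec read_ready fd buf =
  match Unix.read fd buf 0 (Bytes.length buf) with
  | n -> n
  | exception Unix.Unix_error ((Unix.EAGAIN | Unix.EWOULDBLOCK), _, _) ->
      (* fd not readable yet: park on select until it is *)
      ignore (Unix.select [fd] [] [] (-1.));
      read_ready fd buf
```

The bytes read would then be handed to the decoder via decode_src, exactly as in the profiled loop above.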

In terms of the number of system calls, the non-blocking one is more efficient,
as it uses a 16KB buffer versus the 4K reads done by the blocking version. 

$ strace ./setrip.native -dec -unix  < foo.sexp  2>&1 | wc -l
2107
$ strace ./setrip.native -dec -unix -b  < foo.sexp  2>&1 | wc -l
8238

Running 'perf stat' on the decoder on Linux shows that the non-blocking version
is spending more CPU time doing something:

$ perf stat --log-fd 1 ./setrip.native -dec -unix -b  < foo.sexp  
 Performance counter stats for './setrip.native -dec -unix -b':

       1095.337180 task-clock                #    1.000 CPUs utilized          
                 1 context-switches          #    0.000 M/sec                  
                 0 CPU-migrations            #    0.000 M/sec                  
               800 page-faults               #    0.001 M/sec                  

$ perf stat --log-fd 1 ./setrip.native -dec -unix < foo.sexp  
 Performance counter stats for './setrip.native -dec -unix':

       1339.360940 task-clock                #    1.000 CPUs utilized          
                 2 context-switches          #    0.000 M/sec                  
                 0 CPU-migrations            #    0.000 M/sec                  
               823 page-faults               #    0.001 M/sec                  


Setting the GC to verbose shows that the non-blocking version does far
more heap allocations than the blocking one (14 heap checks versus 3). So I
stuck this quick'n'dirty function into the source to bracket hot-looking calls
and see how many minor-heap allocations take place.

let gcbracket fn a =
  let open Gc in
  compact ();                       (* start from a compacted heap *)
  let s1 = stat () in
  let r = fn a in
  let s2 = stat () in
  Printf.eprintf "gc: %.0f [%.0f %.0f %.0f / %.0f %.0f %.0f]\n%!"
    (s2.minor_words -. s1.minor_words) s1.minor_words s1.promoted_words
    s1.major_words s2.minor_words s2.promoted_words s2.major_words;
  r

Looking at the two decoders in src/se.ml, it looks like the non-blocking one
allocates closures on every loop, which the blocking one doesn't; this is so it
can store the continuation in d.k for the next loop. Putting gcbracket around
the two decode calls confirms this. Switching to bytecode, so that we get
accurate minor-heap counts, we have:

For the blocking one:
gc: 31 [1450548 0 44224 / 1450579 0 44224]
gc: 31 [1451229 0 44255 / 1451260 0 44255]
gc: 31 [1452506 0 44278 / 1452537 0 44278]
gc: 31 [1453187 0 44309 / 1453218 0 44309]

The important number is the first one (the difference in minor-heap words
before and after the function), which is solidly 31.

With the non-blocking one, more words are allocated in the minor heap:
gc: 67 [5368 0 4684 / 5435 0 4684]
gc: 42 [6096 0 4723 / 6138 0 4723]
gc: 42 [7395 0 4762 / 7437 0 4762]
gc: 56 [8119 0 4809 / 8175 0 4809]

So to summarise: instead of storing a continuation closure, would it not be
better to explicitly store the state in d.k, to minimise allocation?
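The trade-off can be sketched in miniature (a hypothetical toy, not se.ml's actual code): a stream consumer whose suspension is represented either as a closure or as a mutable record.

```ocaml
(* CPS suspension: every `Await allocates a fresh closure capturing
   the accumulator. *)
type cont = Await of (int -> cont) | Done of int

let rec sum_k acc n =
  if n = 0 then Done acc
  else Await (fun x -> sum_k (acc + x) (n - 1))

(* Explicit-state suspension: one mutable record, mutated in place,
   with no per-step closure allocation. *)
type st = { mutable acc : int; mutable left : int }

let step st x = st.acc <- st.acc + x; st.left <- st.left - 1
```

Feeding n ints to each, the CPS version builds n closures while the record version allocates once. For a simple codec the explicit state stays small; for a complex grammar it grows into a hand-written state machine, which is the readability trade-off discussed in the replies below.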

The library looks very useful, by the way: I have exactly the same issue with
several Lwt-only protocol libraries we're developing at the moment. I would love
to use yours before their first release, to make them more independent of the
underlying I/O mechanism...

Anyone else got any useful profiling tips? I used ocaml-4-trunk for this, and one useful
feature I just discovered (probably thanks to PR#5487 being closed) is that you can do:

$ <add 'debug' to _tags>
$ perf record ./foo.native
$ perf annotate

...and it will annotate hotspots in the binary, show you the x86 assembly, and also the
OCaml source code amidst the assembly (due to GDB debug symbols being present).

-anil


* Re: [Caml-list] Non-blocking IO interface design
  2012-04-09 19:28 ` Anil Madhavapeddy
@ 2012-04-10  8:21   ` Daniel Bünzli
  2012-04-10 12:40     ` Yaron Minsky
  0 siblings, 1 reply; 5+ messages in thread
From: Daniel Bünzli @ 2012-04-10  8:21 UTC (permalink / raw)
  To: Anil Madhavapeddy; +Cc: caml-list

Anil,

Thanks for the analysis. 

> The I/O loop is being called twice for the non-blocking version, as it receives
> the `Await signal, does the Unix syscall, and then jumps into decode_src. Presumably
> a full non-blocking version would have to register with a select handler if it
> gets an EAGAIN at this point.


Yes.

> In terms of the number of system calls, the non-blocking one is more efficient,
> as it uses a 16KB buffer versus the 4K reads done by the blocking version.


Yes, the 4K reads are a limitation of Pervasives channels. For each mechanism I used the largest buffer size that the OCaml runtime itself uses.

> Looking at the two decoders in src/se.ml, it looks like the non-blocking one
> allocates closures on every loop, which the blocking one doesn't. This is so it
> can store the continuation in d.k for the next loop. 


Yes, that's a side effect of writing in continuation-passing style in general, since continuations are often partially applied functions.

> So to summarise, instead of storing a continuation closure, it would probably be better
> to explicitly store the state in d.k to minimise allocation?


Maybe, but keep in mind that s-expressions are very simple to parse. It may be obvious in this case, but depending on what you decode, defining and storing the state may become complex. CPS is an easy and general way to solve the problem while keeping the whole thing reasonably readable. But do you maybe see another pattern that I don't?

> The library looks very useful by the way: I have exactly the same issue with several
> Lwt-only protocol libraries we're developing at the moment. Would love to use yours before
> the first release of them to make them more independent of the underlying I/O mechanism...


That would be nice; I'm glad if you can somehow reuse the pattern.


Best,

Daniel


* Re: [Caml-list] Non-blocking IO interface design
  2012-04-10  8:21   ` Daniel Bünzli
@ 2012-04-10 12:40     ` Yaron Minsky
  2012-04-14  9:46       ` Daniel Bünzli
  0 siblings, 1 reply; 5+ messages in thread
From: Yaron Minsky @ 2012-04-10 12:40 UTC (permalink / raw)
  To: Daniel Bünzli; +Cc: Anil Madhavapeddy, caml-list

On Tue, Apr 10, 2012 at 4:21 AM, Daniel Bünzli
<daniel.buenzli@erratique.ch> wrote:
> Anil,
>
> Thanks for the analysis.
>
>> The I/O loop is being called twice for the non-blocking version, as it receives
>> the `Await signal, does the Unix syscall, and then jumps into decode_src. Presumably
>> a full non-blocking version would have to register with a select handler if it
>> gets an EAGAIN at this point.
>
>
> Yes.
>
>> In terms of the number of system calls, the non-blocking one is more efficient,
>> as it uses a 16KB buffer versus the 4K reads done by the blocking version.
>
>
> Yes, the 4K reads are a limitation of pervasives channels. For each mechanism I used the largest buffer that the OCaml runtime uses.
>
>> Looking at the two decoders in src/se.ml, it looks like the non-blocking one
>> allocates closures on every loop, which the blocking one doesn't. This is so it
>> can store the continuation in d.k for the next loop.
>
>
> Yes, that's a side effect of writing in continuation passing style in general since continuations are often partially applied functions.

I believe this particular performance issue is fixed in the upcoming
4.0 release, based on some work by OCamlPro.




* Re: [Caml-list] Non-blocking IO interface design
  2012-04-10 12:40     ` Yaron Minsky
@ 2012-04-14  9:46       ` Daniel Bünzli
  0 siblings, 0 replies; 5+ messages in thread
From: Daniel Bünzli @ 2012-04-14  9:46 UTC (permalink / raw)
  To: caml-list; +Cc: Anil Madhavapeddy

Just a word on naming. 

I have now gathered all functions specific to `Manual sources and destinations in a Manual submodule, by renaming as follows:

Codec.decode_src -> Codec.Manual.src
Codec.encode_dst -> Codec.Manual.dst
Codec.encode_dst_rem -> Codec.Manual.dst_rem

The new names are just as short and seem just as clear to me, maybe even clearer, since the module name suggests they should only be used with `Manual sources and destinations. It also makes it easier for non-`Manual users to skip these functions and their documentation.
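In interface form, the result looks something like this sketch (the surrounding signature is abridged and hypothetical; only the Manual submodule reflects the renaming above):

```ocaml
module type CODEC = sig
  type t                               (* decoded/encoded values *)
  type error
  type decoder
  type encoder

  val decoder :
    [ `Channel of in_channel | `String of string | `Manual ] -> decoder
  val decode : decoder -> [ `Await | `End | `Error of error | `Yield of t ]
  val encoder :
    [ `Channel of out_channel | `Buffer of Buffer.t | `Manual ] -> encoder
  val encode : encoder -> [ `Await | `End | `Yield of t ] -> [ `Ok | `Partial ]

  (* All `Manual-only entry points live here. *)
  module Manual : sig
    val src : decoder -> string -> int -> int -> unit    (* was decode_src *)
    val dst : encoder -> string -> int -> int -> unit    (* was encode_dst *)
    val dst_rem : encoder -> int                  (* was encode_dst_rem *)
  end
end
```

Clients using `Channel, `String or `Buffer sources and destinations never need to open Manual at all.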

The sample code at http://erratique.ch/repos/nbcodec/ was updated accordingly. 

Best,

Daniel




