caml-list - the Caml user's mailing list
* Serialisation of PXP DTDs
@ 2008-10-22 20:11 Dario Teixeira
  2008-10-22 23:05 ` Sylvain Le Gall
  2008-10-23 14:55 ` [Caml-list] " Gerd Stolpmann
  0 siblings, 2 replies; 21+ messages in thread
From: Dario Teixeira @ 2008-10-22 20:11 UTC (permalink / raw)
  To: caml-list

Hi,

I am using PXP to parse the MathML2 DTD.  This is a fairly large DTD,
which even on a fast machine takes several seconds to parse.  I am
therefore looking at ways to serialise a parsed DTD, in such a way
that it can be reused by other processes.

Does PXP already offer primitives for (un)serialising DTDs?  (I couldn't
find any).  Note that using Marshal is out of the question, because DTDs
are stored as objects, and we all know that objects cannot be serialised
across process boundaries.  But are there alternative solutions I'm
overlooking?

On a more general but related note, I think we should start an OSP
discussion about standardising serialisation methods.  The rationale
should be obvious.  Myself, I am partial to Sexplib, since it is
reasonably fast, very simple to use, human-readable, and future-proof.
I reckon that bin-prot could also be considered, as long as at some
point the binary format is "set in stone", or at least deserialisers
are always backwards compatible.  Any other opinions?

Thanks for your time!
Cheers,
Dario Teixeira






^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Serialisation of PXP DTDs
  2008-10-22 20:11 Serialisation of PXP DTDs Dario Teixeira
@ 2008-10-22 23:05 ` Sylvain Le Gall
  2008-10-23 15:34   ` [Caml-list] " Dario Teixeira
  2008-10-23 14:55 ` [Caml-list] " Gerd Stolpmann
  1 sibling, 1 reply; 21+ messages in thread
From: Sylvain Le Gall @ 2008-10-22 23:05 UTC (permalink / raw)
  To: caml-list

On 22-10-2008, Dario Teixeira <darioteixeira@yahoo.com> wrote:
> Hi,
>
> I am using PXP to parse the MathML2 DTD.  This is a fairly large DTD,
> which even on a fast machine takes several seconds to parse.  I am
> therefore looking at ways to serialise a parsed DTD, in such a way
> that it can be reused by other processes.
>
> Does PXP already offer primitives for (un)serialising DTDs?  (I couldn't
> find any).  Note that using Marshal is out of the question, because DTDs
> are stored as objects, and we all know that objects cannot be serialised
> across process boundaries.  But are there alternative solutions I'm
> overlooking?
>
> On a more general but related note, I think we should start an OSP
> discussion about standardising serialisation methods.  The rationale
> should be obvious.  Myself, I am partial to Sexplib, since it is
> reasonably fast, very simple to use, human-readable, and future-proof.
> I reckon that bin-prot could also be considered, as long as at some
> point the binary format is "set in stone", or at least deserialisers
> are always backwards compatible.  Any other opinions?
>

You seem to have some ideas already. Before starting any discussion on
this topic, the best approach is to implement and benchmark the
different solutions (at least partially).

Sexplib/bin-prot/json/marshal need to be compared on a real example. 

You seem to need this for a particular task. Could you try implementing
the different approaches on your particular example and give us some
benchmarks, along with how easy each was to use and to implement?
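
A minimal sketch of what such a comparison harness could look like, using
only the standard library. The sample data, the hand-written text codec and
the names time_it, marshal_roundtrip and text_roundtrip are all assumptions
made for illustration; a real comparison would plug in Sexplib, bin-prot and
a JSON library, and would use an actual parsed DTD as input.

```ocaml
(* Hypothetical benchmark harness; the data and the text codec are
   placeholders, not a parsed DTD or a real Sexplib/bin-prot codec. *)

(* Run [f] [n] times and return the CPU seconds consumed. *)
let time_it n f =
  let start = Sys.time () in
  for _i = 1 to n do ignore (f ()) done;
  Sys.time () -. start

(* Sample data: a non-empty list of (name, count) pairs. *)
let sample = List.init 1000 (fun i -> (Printf.sprintf "elem%d" i, i))

(* Candidate 1: Marshal from the standard library. *)
let marshal_roundtrip () : (string * int) list =
  Marshal.from_string (Marshal.to_string sample []) 0

(* Candidate 2: a naive tab/newline text codec written by hand.
   It assumes names contain neither tabs nor newlines. *)
let text_encode l =
  String.concat "\n" (List.map (fun (s, i) -> s ^ "\t" ^ string_of_int i) l)

let text_decode s =
  String.split_on_char '\n' s
  |> List.map (fun line ->
         match String.split_on_char '\t' line with
         | [name; n] -> (name, int_of_string n)
         | _ -> failwith "text_decode: malformed line")

let text_roundtrip () = text_decode (text_encode sample)

let () =
  Printf.printf "marshal: %.4fs\n" (time_it 100 marshal_roundtrip);
  Printf.printf "text:    %.4fs\n" (time_it 100 text_roundtrip)
```

Checking that each candidate round-trips correctly before timing it avoids
benchmarking a codec that silently drops data.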

Without these numbers, I think an OSP discussion is pointless.

(But with these numbers, at least on a small example, and if your use
case is not trivial, I think an OSP discussion will be very interesting.)

Regards,
Sylvain Le Gall


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Caml-list] Serialisation of PXP DTDs
  2008-10-22 20:11 Serialisation of PXP DTDs Dario Teixeira
  2008-10-22 23:05 ` Sylvain Le Gall
@ 2008-10-23 14:55 ` Gerd Stolpmann
  1 sibling, 0 replies; 21+ messages in thread
From: Gerd Stolpmann @ 2008-10-23 14:55 UTC (permalink / raw)
  To: Dario Teixeira; +Cc: caml-list


Am Mittwoch, den 22.10.2008, 13:11 -0700 schrieb Dario Teixeira:
> Hi,
> 
> I am using PXP to parse the MathML2 DTD.  This is a fairly large DTD,
> which even on a fast machine takes several seconds to parse.  I am
> therefore looking at ways to serialise a parsed DTD, in such a way
> that it can be reused by other processes.
> 
> Does PXP already offer primitives for (un)serialising DTDs?  (I couldn't
> find any).  Note that using Marshal is out of the question, because DTDs
> are stored as objects, and we all know that objects cannot be serialised
> across process boundaries.  But are there alternative solutions I'm
> overlooking?

No, there is currently no built-in function to serialize DTDs. The DTD
objects are, however, mostly containers, and you can get all their
properties by invoking methods of the object interface. That allows you
to write your own serialization. You are then a bit dependent on the PXP
version, but I don't think the interface of DTDs will change anytime soon.
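
The idea can be sketched as follows, under loud assumptions: toy_dtd is a
hypothetical stand-in for a container-like object and is NOT the PXP DTD
class, and its methods (add_element, element_names, content_model) are
invented for this illustration. The point is only the shape of the approach:
read everything out through methods into plain data, serialise that, and
rebuild the object on the other side.

```ocaml
(* Hypothetical container-like object; this is NOT the PXP API. *)
class toy_dtd = object
  val elements : (string, string) Hashtbl.t = Hashtbl.create 16
  method add_element name model = Hashtbl.replace elements name model
  method element_names =
    List.sort compare (Hashtbl.fold (fun k _ acc -> k :: acc) elements [])
  method content_model name = Hashtbl.find elements name
end

(* Plain-data snapshot: no objects and no closures, only strings. *)
type snapshot = (string * string) list

(* Read every property out through the object's methods. *)
let snapshot_of_dtd (d : toy_dtd) : snapshot =
  List.map (fun n -> (n, d#content_model n)) d#element_names

(* Rebuild a fresh object from the plain data. *)
let dtd_of_snapshot (s : snapshot) : toy_dtd =
  let d = new toy_dtd in
  List.iter (fun (n, m) -> d#add_element n m) s;
  d

(* Marshal is used here only on the plain snapshot; any other
   serialiser for lists of strings would do as well. *)
let save (d : toy_dtd) : string = Marshal.to_string (snapshot_of_dtd d) []

let load (s : string) : toy_dtd =
  dtd_of_snapshot (Marshal.from_string s 0 : snapshot)
```

Because the snapshot type is ordinary data, the same two conversion functions
would also work with a text-based serialiser if long-term storage matters.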

Gerd

> 
> On a more general but related note, I think we should start an OSP
> discussion about standardising serialisation methods.  The rationale
> should be obvious.  Myself, I am partial to Sexplib, since it is
> reasonably fast, very simple to use, human-readable, and future-proof.
> I reckon that bin-prot could also be considered, as long as at some
> point the binary format is "set in stone", or at least deserialisers
> are always backwards compatible.  Any other opinions?
> 
> Thanks for your time!
> Cheers,
> Dario Teixeira
-- 
------------------------------------------------------------
Gerd Stolpmann * Viktoriastr. 45 * 64293 Darmstadt * Germany 
gerd@gerd-stolpmann.de          http://www.gerd-stolpmann.de
Phone: +49-6151-153855                  Fax: +49-6151-997714
------------------------------------------------------------



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Caml-list] Re: Serialisation of PXP DTDs
  2008-10-22 23:05 ` Sylvain Le Gall
@ 2008-10-23 15:34   ` Dario Teixeira
  2008-10-23 16:37     ` Stefano Zacchiroli
  2008-10-23 16:46     ` Markus Mottl
  0 siblings, 2 replies; 21+ messages in thread
From: Dario Teixeira @ 2008-10-23 15:34 UTC (permalink / raw)
  To: caml-list, Sylvain Le Gall

Hi,

First, and concerning the more general problem of serialisation:
one often comes across a situation where a library encodes a very
complex and opaque value of type Foobar.t, but offers no dedicated
(de)serialisation functions.  The assumption is of course that users
can just use Marshal and be done with.  There are however situations
for which Marshal is badly suited.  Long term storage is one, and
portability is another.  Moreover, if the value is an object and you
wish to carry it across process boundaries, using Marshal is just
not possible.

Take also into consideration that the Sexplib syntax extension makes it
trivial to add (de)serialisers to a data structure.  The litmus test for
considering this task should be "is there any reasonable situation where
users of my Foobar library would like to serialise Foobar.t values in
a portable and long-term manner?".  If the answer is yes, I reckon it
wouldn't be too much to ask that the library adds support for Sexplib.
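
To make concrete what such (de)serialisers amount to, here is a hand-written
sketch of the kind of converter pair that a syntax extension would otherwise
generate. It does not use Sexplib itself: the sexp type is redefined locally
so that the example is self-contained, and the foobar record is a
hypothetical stand-in for Foobar.t.

```ocaml
(* Minimal S-expression type, similar in shape to Sexplib's but
   defined here so the sketch needs no external library. *)
type sexp = Atom of string | List of sexp list

(* Hypothetical record standing in for Foobar.t. *)
type foobar = { name : string; arity : int }

(* Serialiser: one (field value) pair per record field. *)
let sexp_of_foobar { name; arity } =
  List [ List [ Atom "name"; Atom name ];
         List [ Atom "arity"; Atom (string_of_int arity) ] ]

(* Deserialiser: accepts exactly the shape produced above. *)
let foobar_of_sexp = function
  | List [ List [ Atom "name"; Atom name ];
           List [ Atom "arity"; Atom arity ] ] ->
      { name; arity = int_of_string arity }
  | _ -> failwith "foobar_of_sexp: unexpected S-expression"

(* Naive printer: atoms containing spaces or parentheses would
   need quoting, which this sketch does not handle. *)
let rec string_of_sexp = function
  | Atom s -> s
  | List l -> "(" ^ String.concat " " (List.map string_of_sexp l) ^ ")"
```

Writing this by hand for every type is tedious and error-prone, which is the
argument for having it generated.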

Note that in the paragraph above, "Sexplib" may be substituted by
another serialisation mechanism.  And this gets us to the question
of performance numbers you speak of.  In fact, besides performance I
would like to bring other variables to the table:

- ease of use
- "future-proofness"
- portability
- human-readability


Sexplib scores very well on ease of use, future-proofness, and
portability, and reasonably well on performance and human-readability.
My guess is that bin-prot has better performance but worse portability
and future-proofness, and nil human-readability.  Marshal gets
top scores in performance and ease of use, but fails miserably in
future-proofness, human-readability, and portability.

As for my particular problem with PXP DTDs, I will look at writing a
(de)serialiser by hand.  According to Gerd it shouldn't be too much
trouble.

Best regards,
Dario Teixeira







^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Caml-list] Re: Serialisation of PXP DTDs
  2008-10-23 15:34   ` [Caml-list] " Dario Teixeira
@ 2008-10-23 16:37     ` Stefano Zacchiroli
  2008-10-23 16:53       ` Markus Mottl
  2008-10-23 19:26       ` Dario Teixeira
  2008-10-23 16:46     ` Markus Mottl
  1 sibling, 2 replies; 21+ messages in thread
From: Stefano Zacchiroli @ 2008-10-23 16:37 UTC (permalink / raw)
  To: caml-list

On Thu, Oct 23, 2008 at 08:34:21AM -0700, Dario Teixeira wrote:
> - ease of use
> - "future-proofness"
> - portability
> - human-readability
> 
> Sexplib scores very good on ease of use, future-proofness, and
                                           ^^^^^^^^^^^^^^^^

Does it?

I mean, as long as types are as simple as pairs we will probably
write down the very same S-expression, but for more complex types you
end up having to choose how to encode them in S-expressions. Such
design choices may need to be changed in the future as more types are
supported.  I fail to see why the future-proofness of such choices
should be better than that of bin-prot.

Yes, in case of changes you can imagine writing converters from the
old format to the new one, but you can do that also for binary
representations. In fact, doing that in OCaml using bitmatch would
lead to the same code as per S-expressions, I believe.

Besides this comment,
thanks for the nice analysis.

-- 
Stefano Zacchiroli -*- PhD in Computer Science \ PostDoc @ Univ. Paris 7
zack@{upsilon.cc,pps.jussieu.fr,debian.org} -<>- http://upsilon.cc/zack/
Dietro un grande uomo c'è sempre /oo\ All one has to do is hit the right
uno zaino        -- A.Bergonzoni \__/ keys at the right time -- J.S.Bach


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Caml-list] Re: Serialisation of PXP DTDs
  2008-10-23 15:34   ` [Caml-list] " Dario Teixeira
  2008-10-23 16:37     ` Stefano Zacchiroli
@ 2008-10-23 16:46     ` Markus Mottl
  1 sibling, 0 replies; 21+ messages in thread
From: Markus Mottl @ 2008-10-23 16:46 UTC (permalink / raw)
  To: Dario Teixeira; +Cc: caml-list, Sylvain Le Gall

On Thu, Oct 23, 2008 at 11:34 AM, Dario Teixeira
<darioteixeira@yahoo.com> wrote:
> Sexplib scores very well on ease of use, future-proofness, and
> portability, and reasonably well on performance and human-readability.
> My guess is that bin-prot has better performance but worse portability
> and future-proofness, and nil human-readability.  Marshal gets
> top scores in performance and ease of use, but fails miserably in
> future-proofness, human-readability, and portability.

Bin-prot is settled in its design.  We heavily rely on it here at Jane
Street and store TBs of data in it so there is no way it's going to
change.  I would say it is future-proof.

Portability could be improved, of course, e.g. to big-endian
architectures, etc., but that's not hard to do.  Performance is
definitely competitive with Marshal: writing is noticeably faster, and
reading only marginally slower.  It also requires a little less
storage space.  The main problem here is actually that it doesn't support
shared / cyclic data structures.  I don't think anybody would blame it
for not being human-readable, because that's the nature of binary
protocols ;-)
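
For readers unsure what "shared" means here, the following sketch shows the
property in question using only the standard library's Marshal module. It
does not use bin-prot; it merely illustrates what is lost when a format
writes each occurrence of a value separately. The names cell and pair are
chosen for the example.

```ocaml
(* Standard-library Marshal only; this does not use bin-prot. *)
let cell = ref 42
let pair = (cell, cell)   (* both components are the same heap block *)

(* Default flags: sharing is recorded, so the copy still shares. *)
let (a, b) : int ref * int ref =
  Marshal.from_string (Marshal.to_string pair []) 0

(* No_sharing: each occurrence is written out separately. *)
let (c, d) : int ref * int ref =
  Marshal.from_string (Marshal.to_string pair [Marshal.No_sharing]) 0

let () =
  Printf.printf "default flags keep sharing: %b\n" (a == b);
  Printf.printf "No_sharing keeps sharing:   %b\n" (c == d)
```

With sharing lost, an update through one component is no longer visible
through the other, and a cyclic value cannot be written out at all.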

Regards,
Markus

-- 
Markus Mottl        http://www.ocaml.info        markus.mottl@gmail.com


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Caml-list] Re: Serialisation of PXP DTDs
  2008-10-23 16:37     ` Stefano Zacchiroli
@ 2008-10-23 16:53       ` Markus Mottl
  2008-10-23 19:26       ` Dario Teixeira
  1 sibling, 0 replies; 21+ messages in thread
From: Markus Mottl @ 2008-10-23 16:53 UTC (permalink / raw)
  To: caml-list

On Thu, Oct 23, 2008 at 12:37 PM, Stefano Zacchiroli <zack@upsilon.cc> wrote:
> I mean, as long as types are as simple as pairs we will probably
> write down the very same S-expression, but for more complex types you
> end up having to choose how to encode them in S-expressions. Such
> design choices may need to be changed in the future as more types are
> supported.  I fail to see why the future-proofness of such choices
> should be better than that of bin-prot.

Both the S-expression converters and the binary protocol already
support all extensionally defined datatypes in OCaml, and there are no
plans to change their representation.  I think it is fair to say that
both of them are reasonably future-safe.

Regards,
Markus

-- 
Markus Mottl        http://www.ocaml.info        markus.mottl@gmail.com


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Caml-list] Re: Serialisation of PXP DTDs
  2008-10-23 16:37     ` Stefano Zacchiroli
  2008-10-23 16:53       ` Markus Mottl
@ 2008-10-23 19:26       ` Dario Teixeira
  2008-10-23 21:05         ` Mauricio Fernandez
  1 sibling, 1 reply; 21+ messages in thread
From: Dario Teixeira @ 2008-10-23 19:26 UTC (permalink / raw)
  To: caml-list, Stefano Zacchiroli

> I mean, as long as types are as simple as pairs we will
> probably write down the very same S-expression, but for more
> complex types you end up having to choose how to encode them
> in S-expressions.  Such design choices may need to be changed
> in the future as more types are supported.  I fail to see
> why the future-proofness of such choices
> should be better than that of bin-prot.

Hi,

Well, there are several types of "future-proofness".  If in the
far future I were faced with the task of reverse-engineering
and deserialising a structure about whose contents I only had
a rough idea, then a human-readable text format like that of
S-expressions would simplify things enormously.  In a more
down-to-earth scenario, bear in mind that S-expressions offer
forward-compatibility as long as you are only adding to
a structure.

For example, suppose I have a type foobar_t with two
constructors:

type foobar_t = One | Two

If later on I add a third constructor "Three" to this type,
the deserialiser for the new version can still read S-expressions
written with the serialiser for the old version.
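
The scenario can be sketched in a self-contained way as follows. The sexp
type is a minimal local definition rather than Sexplib's, and the modules V1
and V2 are hypothetical names for the old and new versions of the program;
the converters are written by hand where a syntax extension would normally
generate them.

```ocaml
(* Minimal local S-expression type; not Sexplib's. *)
type sexp = Atom of string | List of sexp list

(* Old version of the program: two constructors, serialiser only. *)
module V1 = struct
  type foobar_t = One | Two
  let sexp_of_foobar = function
    | One -> Atom "One"
    | Two -> Atom "Two"
end

(* New version: a third constructor was added, deserialiser only. *)
module V2 = struct
  type foobar_t = One | Two | Three
  let foobar_of_sexp = function
    | Atom "One" -> One
    | Atom "Two" -> Two
    | Atom "Three" -> Three
    | _ -> failwith "foobar_of_sexp: unexpected S-expression"
end

(* Data written by the old program is read by the new one. *)
let migrated : V2.foobar_t = V2.foobar_of_sexp (V1.sexp_of_foobar V1.Two)
```

This works because constructors are identified by name in the text, so
adding a name does not disturb the existing ones. The opposite direction
(old reader, new data) fails as soon as a "Three" is encountered.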

Cheers,
Dario Teixeira







^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Caml-list] Re: Serialisation of PXP DTDs
  2008-10-23 19:26       ` Dario Teixeira
@ 2008-10-23 21:05         ` Mauricio Fernandez
  2008-10-23 22:18           ` Gerd Stolpmann
  2008-10-23 22:21           ` Dario Teixeira
  0 siblings, 2 replies; 21+ messages in thread
From: Mauricio Fernandez @ 2008-10-23 21:05 UTC (permalink / raw)
  To: caml-list

On Thu, Oct 23, 2008 at 12:26:54PM -0700, Dario Teixeira wrote:
> > I mean, as long as types are as simple as pairs we will
> > probably write down the very same S-expression, but for more
> > complex types you end up having to choose how to encode them
> > in S-expressions.  Such design choices may need to be changed
> > in the future as more types are supported.  I fail to see
> > why the future-proofness of such choices
> > should be better than that of bin-prot.
> 
> Hi,
> 
> Well, there's several types of "future-proofness".  If in the far-future I
> was faced with the task of reverse-engineering and deserialising a structure
> about whose contents I only had a rough idea, then a human-readable
> text-format like that of S-expressions would simplify things enormously.  On
> a more down-to-earth scenario, bear in mind that S-expressions offer
> forward-compatibility as long as you are only adding to a structure.
> 
> For example, suppose I have a type foobar_t with two
> constructors:
> 
> type foobar_t = One | Two
> 
> If later on I add a third constructor "Three" to this type,
> the deserialiser for the new version can still read S-expressions
> written with the serialiser for the old version.

I have been working for a while on a self-describing, compact, extensible
binary protocol, along with an OCaml implementation which I intend to release
before too long.

It differs from sexplib and bin-prot in two main ways:
* the data model is deliberately more limited, as the format is meant to be
  de/encodable in multiple languages.
* it is extensible at several levels, achieving both forward and backward
  compatibility across changes in the data type

You can think of it as an extensible Protocol Buffers[1] with a richer data
model (albeit not in 1:1 accordance with OCaml's for the above mentioned
reason).

In the criteria you gave in another message, namely
(1) ease of use
(2) "future-proofness"
(3) portability
(4) human-readability,

it does fairly well at the first three --- especially at (2) and (3), which
were poorly supported by existing solutions (I looked into bin-prot, sexplib,
Google's Protocol Buffers, Thrift and XDR; I also referred to IIOP and ITU-T
X.690 DER during the design). Being a binary format, it obviously doesn't do
that well at (4), but it is possible to get a human-readable dump of the
binary data even in the absence of the interface definition, making
reverse-engineering no harder than sexplib (and arguably easier in some ways).

For example, here's a bogus message definition to illustrate (2) and (4).
This protocol definition is fed to the compiler, which generates the OCaml
type definitions, as well as the encoders/decoders and pretty-printers (as you
can see, the specification uses a mix of OCaml, Haskell and C++ syntax, but
it's pretty clear IMO)

    type sum_type 'a 'b 'c = A 'a | B 'b | C 'c

    message complex_rtt =
      A {
        a1 : [(int * [|bool|])];
        a2 : [ sum_type<int, string, long> ]
      }
    | B {
        b1 : bool;
        b2 : (string * [int])
      }

The protocol is extensible in the sense that you can add new constructors to a
sum or message type, add new elements to a tuple, and replace any primitive
type by a sum type including the original type. For instance, if at some point
in time we find that the b1 field should have a different type, we can do

    type bool_or_something 'a = Orig unboxed_bool | New_constructor 'a

and then 
   ...
   | B { b1 : bool_or_something<some_type>; ... }

This, along with a way to specify default values, allows both forward and
backward compatibility.
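
As an illustration of how appended tuple elements plus default values can
give compatibility in both directions, here is a small OCaml sketch. It is
not the protocol described in this message: the value type, the readers
read_v1 and read_v2, and the default of 0 are all assumptions made up for the
example.

```ocaml
(* Hypothetical self-describing value; not the actual wire format. *)
type value = Int of int | Str of string | Tuple of value list

(* Version 1 of the reader knows two fields and ignores any
   extra trailing elements written by a newer sender. *)
let read_v1 = function
  | Tuple (Int id :: Str name :: _) -> (id, name)
  | _ -> failwith "read_v1: malformed message"

(* Version 2 knows a third field; if an older sender omitted it,
   the reader falls back to a default value. *)
let default_level = 0

let read_v2 = function
  | Tuple (Int id :: Str name :: rest) ->
      let level = match rest with Int l :: _ -> l | _ -> default_level in
      (id, name, level)
  | _ -> failwith "read_v2: malformed message"

(* Messages as written by the old and by the new sender. *)
let old_msg = Tuple [ Int 1; Str "math" ]
let new_msg = Tuple [ Int 1; Str "math"; Int 7 ]
```

The old reader can consume new_msg and the new reader can consume old_msg,
which is only possible because the encoding carries enough structure for a
reader to skip or detect elements it was not compiled to expect.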

The compiler generates a pretty printer for these structures, useful for
debugging. Here's a message generated randomly:

{
  Complex_rtt.a1 =
   [ ((-5378), [| false; false; false; true; true |]);
     (3942717140522000971, [| false; true; true; true; false |]);
     ((-6535386320450295), [| false |]); ((-238860767206), [|  |]);
     (1810196202, [| false; false; true; true |]) ];
  Complex_rtt.a2 =
   [ Sum_type.A (-13830); Sum_type.A 369334576; Sum_type.A 83;
     Sum_type.A (-3746796577167465774); Sum_type.A (-1602586945) ] }

Now, this is the information decoded in the absence of the above definitions
(iow., what you'd have to work with if you were reverse-engineering the
protocol):

T0 {
     T0 [
          T0 { Vint_t0 (-5378);
               T0 [ Vint_t0 0; Vint_t0 0; Vint_t0 0; Vint_t0 (-1);
                    Vint_t0 (-1)]};
          T0 { Vint_t0 3942717140522000971;
               T0 [ Vint_t0 0; Vint_t0 (-1); Vint_t0 (-1); Vint_t0 (-1);
                    Vint_t0 0]};
          T0 { Vint_t0 (-6535386320450295); T0 [ Vint_t0 0]};
          T0 { Vint_t0 (-238860767206); T0 [ ]};
          T0 { Vint_t0 1810196202;
               T0 [ Vint_t0 0; Vint_t0 0; Vint_t0 (-1); Vint_t0 (-1)]}];
     T0 [ T0 { Vint_t0 (-13830)}; T0 { Vint_t0 369334576}; T0 { Vint_t0 83};
          T0 { Vint_t0 (-3746796577167465774)}; T0 { Vint_t0 (-1602586945)}]}

(I'm still changing some details so it might look better than this shortly.)

It's not a drop-in solution like sexplib's "with sexp", by design (since it is
meant to allow interoperability between different languages), but it's still
fairly easy to use.

If you're interested in this, tell me and I'll let you know when it's ready for
serious usage.

[1] http://code.google.com/p/protobuf/

-- 
Mauricio Fernandez  -   http://eigenclass.org


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Caml-list] Re: Serialisation of PXP DTDs
  2008-10-23 21:05         ` Mauricio Fernandez
@ 2008-10-23 22:18           ` Gerd Stolpmann
  2008-10-23 22:50             ` Mauricio Fernandez
  2008-10-23 22:21           ` Dario Teixeira
  1 sibling, 1 reply; 21+ messages in thread
From: Gerd Stolpmann @ 2008-10-23 22:18 UTC (permalink / raw)
  To: Mauricio Fernandez; +Cc: caml-list


Am Donnerstag, den 23.10.2008, 23:05 +0200 schrieb Mauricio Fernandez:
> I have been working for a while on a self-describing, compact, extensible
> binary protocol, along with an OCaml implementation which I intent to release
> in not too long.
> 
> It differs from sexplib and that bin-prot in two main ways:
> * the data model is deliberately more limited, as the format is meant to be
>   de/encodable in multiple languages.
> * it is extensible at several levels, achieving both forward and backward
>   compatibility across changes in the data type
> 
> You can think of it as an extensible Protocol Buffers[1] with a richer data
> model (albeit not in 1:1 accordance with OCaml's for the above mentioned
> reason).

Have you looked at ICEP (see zeroc.com)? It has bindings for many
languages, even for OCaml (http://oss.wink.com/hydro/).

It is, however, not self-describing. Anyway, you may find there ideas
for portability.

Gerd

> In the criteria you gave in another message, namely
> (1) ease of use
> (2) "future-proofness"
> (3) portability
> (4) human-readability,
> 
> it does fairly well at the 3 first ones --- especially at (2) and (3), which
> were poorly supported by existing solutions (I looked into bin-prot, sexplib,
> Google's Protocol Buffers, Thrift and XDR; I also referred to IIOP and ITU-T
> X.690 DER during the design). Being a binary format, it obviously doesn't do
> that well at (4), but it is possible to get a human-readable dump of the
> binary data even in the absence of the interface definition, making
> reverse-engineering no harder than sexplib (and arguably easier in some ways).
> 
> For example, here's a bogus message definition to illustrate (2) and (4).
> This protocol definition is fed to the compiler, which generates the OCaml
> type definitions, as well as the encoders/decoders and pretty-printers (as you
> can see, the specification uses a mix of OCaml, Haskell and C++ syntax, but
> it's pretty clear IMO)
> 
>     type sum_type 'a 'b 'c = A 'a | B 'b | C 'c
> 
>     message complex_rtt =
>       A {
> 	a1 : [(int * [|bool|])];
> 	a2 : [ sum_type<int, string, long> ]
> 	}
>     | B {
> 	b1 : bool;
> 	b2 : (string * [int])
>       }
> 
> The protocol is extensible in the sense that you can add new constructors to a
> sum or message type, add new elements to a tuple, and replace any primitive
> type by a sum type including the original type. For instance, if at some point
> in time we find that the b1 field should have a different type, we can do
> 
>     type bool_or_something 'a = Orig unboxed_bool | New_constructor 'a
> 
> and then 
>    ...
>    | B { b1 : bool_or_something<some_type>; ... }
> 
> This, along with a way to specify default values, allows both forward and
> backward compatibility.
> 
> The compiler generates a pretty printer for these structures, useful for
> debugging. Here's a message generated randomly:
> 
> {
>   Complex_rtt.a1 =
>    [ ((-5378), [| false; false; false; true; true |]);
>      (3942717140522000971, [| false; true; true; true; false |]);
>      ((-6535386320450295), [| false |]); ((-238860767206), [|  |]);
>      (1810196202, [| false; false; true; true |]) ];
>   Complex_rtt.a2 =
>    [ Sum_type.A (-13830); Sum_type.A 369334576; Sum_type.A 83;
>      Sum_type.A (-3746796577167465774); Sum_type.A (-1602586945) ] }
> 
> Now, this is the information decoded in the absence of the above definitions
> (iow., what you'd have to work with if you were reverse-engineering the
> protocol):
> 
> T0 {
>      T0 [
>           T0 { Vint_t0 (-5378);
>                T0 [ Vint_t0 0; Vint_t0 0; Vint_t0 0; Vint_t0 (-1);
>                     Vint_t0 (-1)]};
>           T0 { Vint_t0 3942717140522000971;
>                T0 [ Vint_t0 0; Vint_t0 (-1); Vint_t0 (-1); Vint_t0 (-1);
>                     Vint_t0 0]};
>           T0 { Vint_t0 (-6535386320450295); T0 [ Vint_t0 0]};
>           T0 { Vint_t0 (-238860767206); T0 [ ]};
>           T0 { Vint_t0 1810196202;
>                T0 [ Vint_t0 0; Vint_t0 0; Vint_t0 (-1); Vint_t0 (-1)]}];
>      T0 [ T0 { Vint_t0 (-13830)}; T0 { Vint_t0 369334576}; T0 { Vint_t0 83};
>           T0 { Vint_t0 (-3746796577167465774)}; T0 { Vint_t0 (-1602586945)}]}
> 
> (I'm still changing some details so it might look better than this shortly.)
> 
> It's not a drop-in solution like sexplib's "with sexp", by design (since it is
> meant to allow interoperability between different languages), but it's still
> fairly easy to use.
> 
> If you're interested in this, tell me and I'll let you know when it's ready for
> serious usage.
> 
> [1] http://code.google.com/p/protobuf/
> 
-- 
------------------------------------------------------------
Gerd Stolpmann * Viktoriastr. 45 * 64293 Darmstadt * Germany 
gerd@gerd-stolpmann.de          http://www.gerd-stolpmann.de
Phone: +49-6151-153855                  Fax: +49-6151-997714
------------------------------------------------------------



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Caml-list] Re: Serialisation of PXP DTDs
  2008-10-23 21:05         ` Mauricio Fernandez
  2008-10-23 22:18           ` Gerd Stolpmann
@ 2008-10-23 22:21           ` Dario Teixeira
  2008-10-23 23:36             ` Mauricio Fernandez
  1 sibling, 1 reply; 21+ messages in thread
From: Dario Teixeira @ 2008-10-23 22:21 UTC (permalink / raw)
  To: caml-list, Mauricio Fernandez

Hi,

> This protocol definition is fed to the compiler, which
> generates the OCaml type definitions, as well as the 
> encoders/decoders and pretty-printers (as you can see,
> the specification uses a mix of OCaml, Haskell and C++
> syntax, but it's pretty clear IMO)

Basically the XDR approach, but with a syntax inspired
by more modern, functional languages, right?


> It's not a drop-in solution like sexplib's "with sexp",
> by design (since it is meant to allow interoperability between
> different languages), but it's still fairly easy to use.

Personally, I think that a sexplib-like syntax extension
is the killer feature for serialisation libraries, and the
reason why I was immediately swayed by sexplib.  However,
writing a sexplib-like syntax extension for your serialisation
library would entail solving the reverse problem now handled
by your compiler.  This might not always be possible because
some features of OCaml's type system might not map neatly
into your format.  Nevertheless, the sheer convenience of
the syntax extension approach makes it worthwhile having,
even if on occasion the preprocessor were to produce an
error message stating that it could not convert a certain
structure.  For reference purposes, you could even have the
syntax extension output to an external file the inferred
structure definition in your language format!  (I know this
would be a very complex project, but it does illustrate the
power of Camlp4).


Anyway, what you described looks very interesting.
Keep us posted!

Cheers,
Dario Teixeira






^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Caml-list] Re: Serialisation of PXP DTDs
  2008-10-23 22:18           ` Gerd Stolpmann
@ 2008-10-23 22:50             ` Mauricio Fernandez
  0 siblings, 0 replies; 21+ messages in thread
From: Mauricio Fernandez @ 2008-10-23 22:50 UTC (permalink / raw)
  To: Gerd Stolpmann

On Fri, Oct 24, 2008 at 12:18:50AM +0200, Gerd Stolpmann wrote:
> 
> Am Donnerstag, den 23.10.2008, 23:05 +0200 schrieb Mauricio Fernandez:
> > I have been working for a while on a self-describing, compact, extensible
> > binary protocol, along with an OCaml implementation which I intent to release
> > in not too long.
> > 
> > It differs from sexplib and that bin-prot in two main ways:
> > * the data model is deliberately more limited, as the format is meant to be
> >   de/encodable in multiple languages.
> > * it is extensible at several levels, achieving both forward and backward
> >   compatibility across changes in the data type
> > 
> > You can think of it as an extensible Protocol Buffers[1] with a richer data
> > model (albeit not in 1:1 accordance with OCaml's for the above mentioned
> > reason).
> 
> Have you looked at ICEP (see zeroc.com)? It has bindings for many
> languages, even for Ocaml (http://oss.wink.com/hydro/).
> 
> It is, however, not self-describing. Anyway, you may find there ideas
> for portability.

I've just taken a quick look at the manual (in particular, the definition of
the Slice language and the Data Encoding section of the Ice protocol). Even
though it solves a different problem, it looks very interesting --- both as a
source of inspiration, as you say, and for its intended use as a middleware
technology.  Thanks a lot for the reference.

Regards,
-- 
Mauricio Fernandez  -   http://eigenclass.org


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Caml-list] Re: Serialisation of PXP DTDs
  2008-10-23 22:21           ` Dario Teixeira
@ 2008-10-23 23:36             ` Mauricio Fernandez
  2008-10-24  9:11               ` Mikkel Fahnøe Jørgensen
  0 siblings, 1 reply; 21+ messages in thread
From: Mauricio Fernandez @ 2008-10-23 23:36 UTC (permalink / raw)
  To: Dario Teixeira; +Cc: caml-list

On Thu, Oct 23, 2008 at 03:21:01PM -0700, Dario Teixeira wrote:
> Hi,
> 
> > This protocol definition is fed to the compiler, which
> > generates the OCaml type definitions, as well as the 
> > encoders/decoders and pretty-printers (as you can see,
> > the specification uses a mix of OCaml, Haskell and C++
> > syntax, but it's pretty clear IMO)
> 
> Basically the XDR approach, but with a syntax inspired
> by more modern, functional languages, right?

Yes, something like XDR (and Google's Protocol Buffers, and Facebook's Thrift,
and and :) with richer data types (algebraic and polymorphic types, etc.) and a
self-describing encoding that allows you to extend the type definitions while
ensuring interoperability.

> > It's not a drop-in solution like sexplib's "with sexp",
> > by design (since it is meant to allow interoperability between
> > different languages), but it's still fairly easy to use.
> 
> Personally, I think that a sexplib-like syntax extension is the killer
> feature for serialisation libraries, and the reason why I was immediately
> swayed by sexplib.  However, writing a sexplib-like syntax extension for
> your serialisation library would entail solving the reverse problem now
> handled by your compiler.  This might not always be possible because some
> features of Ocaml's type system might not map neatly into your format. 
> Nevertheless, the sheer convenience of the syntax extension approach makes
> it worth while having, even if on occasion the preprocessor were to produce
> an error message stating that it could not convert a certain structure.  For
> reference purposes, you could even have the syntax extension output to an
> external file the inferred structure definition in your language format!  (I
> know this would be a very complex project, but it does illustrate the power
> of Camlp4).

In fact, the wire format easily supports all of OCaml's type system (bin-prot
does, after all, and this is essentially a self-describing, extensible
bin-prot). I introduced limitations in the data schema to ensure extensibility
and portability. Any OCaml type can be encoded easily, but not all possible
changes to an OCaml type are safe with regard to protocol compatibility. Using
a separate language makes it easier to prevent such errors altogether (by
making them impossible to express) or to catch them.

Leaving unsafe protocol modifications aside (which just means that you have to
be careful when you change a type), the approach you suggest (supporting only
a subset of OCaml's type system in a "with protocol"-style syntax extension)
seems very doable. However, sexplib seems to be the safest option for
convenient, more or less future-proof serialization in OCaml, for the time
being.

Cheers,
-- 
Mauricio Fernandez  -   http://eigenclass.org


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Caml-list] Re: Serialisation of PXP DTDs
  2008-10-23 23:36             ` Mauricio Fernandez
@ 2008-10-24  9:11               ` Mikkel Fahnøe Jørgensen
  2008-10-24 14:03                 ` Markus Mottl
  2008-10-24 21:39                 ` Mauricio Fernandez
  0 siblings, 2 replies; 21+ messages in thread
From: Mikkel Fahnøe Jørgensen @ 2008-10-24  9:11 UTC (permalink / raw)
  To: Dario Teixeira, caml-list

I guess this discussion is overkill for the problem at hand, but
speaking of binary extensible protocols, have you looked at ASN.1? It
is an abstraction over any number of encodings. At least one binary
encoding has extension bits to allow future growth of object
collections and similar.

Mikkel

2008/10/24 Mauricio Fernandez <mfp@acm.org>:
> On Thu, Oct 23, 2008 at 03:21:01PM -0700, Dario Teixeira wrote:
>> Hi,
>>
>> > This protocol definition is fed to the compiler, which
>> > generates the OCaml type definitions, as well as the
>> > encoders/decoders and pretty-printers (as you can see,
>> > the specification uses a mix of OCaml, Haskell and C++
>> > syntax, but it's pretty clear IMO)
>>
>> Basically the XDR approach, but with a syntax inspired
>> by more modern, functional languages, right?
>
> Yes, something like XDR (and Google's Protocol Buffers, and Facebook's Thrift,
> and and :) with richer data types (algebraic and polymorphic types, etc.) and a
> self-describing encoding that allows you to extend the type definitions while
> ensuring interoperability.
>
>> > It's not a drop-in solution like sexplib's "with sexp",
>> > by design (since it is meant to allow interoperability between
>> > different languages), but it's still fairly easy to use.
>>
>> Personally, I think that a sexplib-like syntax extension is the killer
>> feature for serialisation libraries, and the reason why I was immediately
>> swayed by sexplib.  However, writing a sexplib-like syntax extension for
>> your serialisation library would entail solving the reverse problem now
>> handled by your compiler.  This might not always be possible because some
>> features of OCaml's type system might not map neatly into your format.
>> Nevertheless, the sheer convenience of the syntax extension approach makes
>> it worthwhile having, even if on occasion the preprocessor were to produce
>> an error message stating that it could not convert a certain structure.  For
>> reference purposes, you could even have the syntax extension output to an
>> external file the inferred structure definition in your language format!  (I
>> know this would be a very complex project, but it does illustrate the power
>> of Camlp4).
>
> In fact, the wire format easily supports all of OCaml's type system (bin-prot
> does, after all, and this is essentially a self-describing, extensible
> bin-prot). I introduced limitations in the data schema to ensure extensibility
> and portability. Any OCaml type can be encoded easily, but not all possible
> changes to an OCaml type are safe with regard to protocol compatibility. Using
> a separate language makes it easier to prevent such errors altogether (by
> making them impossible to express) or to catch them.
>
> Leaving unsafe protocol modifications aside (which just means that you have to
> be careful when you change a type), the approach you suggest (supporting only
> a subset of OCaml's type system in a "with protocol"-style syntax extension)
> seems very doable. However, sexplib seems to be the safest option for
> convenient, more or less future-proof serialization in OCaml, for the time
> being.
>
> Cheers,
> --
> Mauricio Fernandez  -   http://eigenclass.org


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Caml-list] Re: Serialisation of PXP DTDs
  2008-10-24  9:11               ` Mikkel Fahnøe Jørgensen
@ 2008-10-24 14:03                 ` Markus Mottl
  2008-10-25 18:58                   ` Mauricio Fernandez
  2008-10-24 21:39                 ` Mauricio Fernandez
  1 sibling, 1 reply; 21+ messages in thread
From: Markus Mottl @ 2008-10-24 14:03 UTC (permalink / raw)
  To: Mikkel Fahnøe Jørgensen; +Cc: Dario Teixeira, caml-list

On Fri, Oct 24, 2008 at 5:11 AM, Mikkel Fahnøe Jørgensen
<mikkel@dvide.com> wrote:
> I guess this discussion is overkill for the problem at hand, but
> speaking of binary extensible protocols, have you looked at ASN.1? It
> is an abstraction over any number of encodings. At least one binary
> encoding has extension bits to allow future growth of object
> collections and similar.

Note that it is perfectly safe to grow sum types with bin-prot.  It
was designed that way intentionally.  It's just not safe to reorder or
remove elements.  Nobody needs to reorder elements, because it doesn't
make any operational difference in the program.  Backward
compatibility of protocols you define necessarily requires the
presence of old constructors in sum types anyway so you may not want
to remove those in any case.  There is hardly any harm from the
protocol perspective in leaving old constructors in there.
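
To illustrate the point about growing versus reordering, here is a
minimal sketch. It assumes, as a simplification, that a constant
constructor is encoded as its position in the type definition; this is
not bin-prot's actual code, and all names below are hypothetical.

```ocaml
(* Hypothetical simplification: a constant constructor is encoded as its
   position in the type definition. This is not bin-prot's real code. *)
type v1 = Red | Green
type v2 = Red2 | Green2 | Blue2          (* v1 with a constructor appended *)
type v2_bad = Blue3 | Red3 | Green3      (* same constructors, reordered *)

let encode_v1 = function Red -> 0 | Green -> 1

let decode_v2 = function
  | 0 -> Red2 | 1 -> Green2 | 2 -> Blue2
  | n -> failwith (Printf.sprintf "unknown tag %d" n)

let decode_v2_bad = function
  | 0 -> Blue3 | 1 -> Red3 | 2 -> Green3
  | n -> failwith (Printf.sprintf "unknown tag %d" n)

let () =
  (* Old data read by the grown type: Green stays Green. *)
  assert (decode_v2 (encode_v1 Green) = Green2);
  (* Old data read by the reordered type: Green is silently misread as Red. *)
  assert (decode_v2_bad (encode_v1 Green) = Red3)
```

Appending keeps every existing position stable, which is why old data
still decodes; reordering shifts positions without any error being raised.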

Note, too, that polymorphic variants even allow reordering with
bin-prot.  They are also generally safer, because they are always
encoded as 32bit integers, thus making it extremely unlikely to get
accidental "good" matches when reading incompatible protocols (at the
expense of space and a tiny bit of performance).

Except for human-readability, I think bin-prot should scale very well
on the other requirements of serialization protocols once it has been
ported to architectures with unusual endianness (almost all machines
are little endian nowadays so hardly anybody on this list should be
affected).

Regards,
Markus

-- 
Markus Mottl        http://www.ocaml.info        markus.mottl@gmail.com


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Caml-list] Re: Serialisation of PXP DTDs
  2008-10-24  9:11               ` Mikkel Fahnøe Jørgensen
  2008-10-24 14:03                 ` Markus Mottl
@ 2008-10-24 21:39                 ` Mauricio Fernandez
  2008-10-24 22:27                   ` Mikkel Fahnøe Jørgensen
  1 sibling, 1 reply; 21+ messages in thread
From: Mauricio Fernandez @ 2008-10-24 21:39 UTC (permalink / raw)
  To: caml-list

On Fri, Oct 24, 2008 at 11:11:10AM +0200, Mikkel Fahnøe Jørgensen wrote:
> I guess this discussion is overkill for the problem at hand, but
> speaking of binary extensible protocols, have you looked at ASN.1? It
> is an abstraction over any number of encodings. At least one binary
> encoding has extension bits to allow future growth of object
> collections and similar.

Yes, I referred to it indirectly in my previous message. Indeed, ASN.1
supports disjoint unions ("tagged types") that would allow extending a type.
It is obviously possible to build extensible protocols with ASN.1, but if I
understand it correctly, not all protocols expressed in ASN.1's abstract
syntax are automatically extensible --- it requires some care when designing
them (i.e., tagging).

My main problem with ASN.1 is that even the distinguished encoding rules are
fairly complex; also, explicit tagging results in relatively heavy
serialization too. My protocol family is both substantially simpler and better
adapted for extensibility. For example, the generic pretty-printer (able to
decode any message) takes ~40 lines of code.

-- 
Mauricio Fernandez  -   http://eigenclass.org


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Caml-list] Re: Serialisation of PXP DTDs
  2008-10-24 21:39                 ` Mauricio Fernandez
@ 2008-10-24 22:27                   ` Mikkel Fahnøe Jørgensen
  2008-10-25 19:19                     ` Mauricio Fernandez
  0 siblings, 1 reply; 21+ messages in thread
From: Mikkel Fahnøe Jørgensen @ 2008-10-24 22:27 UTC (permalink / raw)
  To: caml-list

> serialization too. My protocol family is both substantially simpler and better
> adapted for extensibility. For example, the generic pretty-printer (able to
> decode any message) takes ~40 lines of code.
>

I see - somehow it reminds me of stackish - kind of S expressions
backwards I guess - apparently with good performance, but also tagged
I reckon.

http://www.zedshaw.com/essays/stackish_xml_alternative.html

More specifically regarding DTD's:
Since I have been playing around with Ragel: http://www.complang.org/ragel/
I was also wondering about converting DTD's to state-machines with a
stack, then feed them to a Ragel input file and have Ragel produce a
table that can be run by a small interpreter.
I did something similar for an XML parser as a kind of DTD
replacement, although I manually wrote the state-machines and compiled
to C, not a table.
For OCaml you would link in the C interpreter, or rewrite it in OCaml.

Mikkel


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Caml-list] Re: Serialisation of PXP DTDs
  2008-10-24 14:03                 ` Markus Mottl
@ 2008-10-25 18:58                   ` Mauricio Fernandez
  2008-10-26 18:15                     ` Markus Mottl
  0 siblings, 1 reply; 21+ messages in thread
From: Mauricio Fernandez @ 2008-10-25 18:58 UTC (permalink / raw)
  To: caml-list

On Fri, Oct 24, 2008 at 10:03:47AM -0400, Markus Mottl wrote:
> On Fri, Oct 24, 2008 at 5:11 AM, Mikkel Fahnøe Jørgensen
> <mikkel@dvide.com> wrote:
> > I guess this discussion is overkill for the problem at hand, but
> > speaking of binary extensible protocols, have you looked at ASN.1? It
> > is an abstraction over any number of encodings. At least one binary
> > encoding has extension bits to allow future growth of object
> > collections and similar.
> 
> Note that it is perfectly safe to grow sum types with bin-prot.  It
> was designed that way intentionally.  It's just not safe to reorder or
> remove elements.  Nobody needs to reorder elements, because it doesn't
> make any operational difference in the program.  Backward
> compatibility of protocols you define necessarily requires the
> presence of old constructors in sum types anyway so you may not want
> to remove those in any case.  There is hardly any harm from the
> protocol perspective in leaving old constructors in there.
> 
> Note, too, that polymorphic variants even allow reordering with
> bin-prot. (...)
> 
> Except for human-readability, I think bin-prot should scale very well
> on the other requirements of serialization protocols once it has been
> ported to architectures with unusual endianness (almost all machines
> are little endian nowadays so hardly anybody on this list should be
> affected).

Unfortunately, growing sum types is far from being the only protocol extension
of interest. There's a trivial extension which, I suspect, will be at
least as common in practice, namely adding new fields to a record (or new
elements to a tuple). bin-prot is unable to handle it adequately --- a
self-describing format like the one I'm working on is required.

You might argue that this extension is subsumed by the ability to grow sum types,
since you can go from

    type record = { a : int } with bin_io
    type msg = A of record

to 

    type record1 = { a : int } with bin_io
    type record2 = { a' : int; b : int } with bin_io
    type msg = A of record1 | B of record2

(Note how special care has to be taken to tag the record --- "explicit
tagging" in ASN.1 parlance.)

However, this merely solves a part of a problem: that all serializations
according to an old type belong to the possible serializations for an
updated type, or, in other words, that new consumers be able to read data
written by old producers. Even with the above encoding (not with any arbitrary
type definition, but with a carefully constructed one), with bin-prot, this
implies that producers not be updated before consumers.

My design lifts that restriction and allows an old consumer to read the data
from a new producer when new fields have been added to a record or a tuple. 
It even allows a node to operate on data it doesn't understand completely
(e.g., when a new constructor is used): it can for instance update one
field it does know while leaving those it is unable to interpret (or doesn't
even know about!) unmodified. I think this is very important in many of the
scenarios where one would need an extensible binary protocol. Google's
Protocol Buffers support this; I'm not sure this is explicitly supported by
Facebook's Thrift compiler, but IIRC the protocol should allow it.

AFAICS the ability to process data not understood in full requires the use of
a self-describing format like the one I'm working on.

-- 
Mauricio Fernandez  -   http://eigenclass.org


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Caml-list] Re: Serialisation of PXP DTDs
  2008-10-24 22:27                   ` Mikkel Fahnøe Jørgensen
@ 2008-10-25 19:19                     ` Mauricio Fernandez
  0 siblings, 0 replies; 21+ messages in thread
From: Mauricio Fernandez @ 2008-10-25 19:19 UTC (permalink / raw)
  To: caml-list

On Sat, Oct 25, 2008 at 12:27:08AM +0200, Mikkel Fahnøe Jørgensen wrote:
> > serialization too. My protocol family is both substantially simpler and better
> > adapted for extensibility. For example, the generic pretty-printer (able to
> > decode any message) takes ~40 lines of code.
> >
> 
> I see - somehow it reminds me of stackish - kind of S expressions
> backwards I guess - apparently with good performance, but also tagged
> I reckon.
> 
> http://www.zedshaw.com/essays/stackish_xml_alternative.html

Heh, I read about Stackish a while ago (a few years?).
Besides being human-readable, Stackish uses tags in a different way.
Whereas Stackish uses tags for the node names (behaving like Google's Protocol
Buffers or Facebook's Thrift in this regard), in my design tags are like
OCaml's: a way to encode different constructors for a given field.

For instance, if you have a field 
   ...
   length : float  
   ...
and later decide that a mere float is not enough, and that it should actually be

    type len = Cm of float | Inch of float
      
      ...
      length : len
      ...
my system assigns a tag to each constructor, the way OCaml does (the original
type definition carries a default tag which corresponds to the Cm
constructor). AFAICS this can only be encoded in a roundabout way in Stackish,
since it doesn't have sum types.
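
A minimal sketch of the default-tag idea, assuming a hypothetical model
of decoded wire values (the `wire` type and the tag numbers 0 and 1 are
illustrative assumptions, not the actual format):

```ocaml
(* Hypothetical model of a self-describing wire value; not the actual format. *)
type wire =
  | Float of float                (* a bare float, as old producers write it *)
  | Tagged of int * wire          (* constructor tag plus payload *)

type len = Cm of float | Inch of float

let len_of_wire = function
  | Float x -> Cm x                         (* no tag: default constructor *)
  | Tagged (0, Float x) -> Cm x
  | Tagged (1, Float x) -> Inch x
  | _ -> failwith "len: unexpected value"

let () =
  (* Data written before the type was extended still decodes, as Cm. *)
  assert (len_of_wire (Float 2.5) = Cm 2.5);
  assert (len_of_wire (Tagged (1, Float 1.0)) = Inch 1.0)
```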

> More specifically regarding DTD's:
> Since I have been playing around with Ragel: http://www.complang.org/ragel/
> I was also wondering about converting DTD's to state-machines with a
> stack, then feed them to a Ragel input file and have Ragel produce a
> table that can be run by a small interpreter.
> I did something similar for an XML parser as a kind of DTD
> replacement, although I manually wrote the state-machines and compiled
> to C, not a table.
> For OCaml you would link in the C interpreter, or rewrite it in OCaml.

Turning each data schema into a state-machine sounds like a fair amount of
work. What I was looking for and ended up implementing is similar in spirit
to bin-prot's "with bin_io" extension, with the difference that the type is
specified using a language-independent abstract syntax instead of OCaml's type
language, and that the wire format is designed to allow extensions happening
in both producers and consumers non-atomically. 

-- 
Mauricio Fernandez  -   http://eigenclass.org


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Caml-list] Re: Serialisation of PXP DTDs
  2008-10-25 18:58                   ` Mauricio Fernandez
@ 2008-10-26 18:15                     ` Markus Mottl
  2008-10-26 19:47                       ` Mauricio Fernandez
  0 siblings, 1 reply; 21+ messages in thread
From: Markus Mottl @ 2008-10-26 18:15 UTC (permalink / raw)
  To: caml-list

On Sat, Oct 25, 2008 at 2:58 PM, Mauricio Fernandez <mfp@acm.org> wrote:
> Unfortunately, growing sum types is far from being the only protocol extension
> of interest. There's a trivial extension which, I suspect, will be at
> least as common in practice, namely adding new fields to a record (or new
> elements to a tuple). bin-prot is unable to handle it adequately --- a
> self-describing format like the one I'm working on is required.

If you add a tag to a sum type, previous protocol implementations
cannot read these values, whereas new implementations will be able to
read both protocols.  With records / tuples it is exactly the other
way round: you could, in principle, read both in the old
implementation, which just needs to drop new, unknown fields, whereas
the new implementation requires these fields and hence cannot parse
old protocols.

I don't see how any approach could "handle" the respective unsolvable
case.  If a receiver doesn't know how to handle a tag, or if it
requires data that is not there, you'll be stuck.

Note, too, that even if you created an implementation which allows
handling extended records in old protocols, this would undoubtedly come
at a pretty hefty cost.  The only efficient way to do that would be to
exchange protocols and generate code at runtime to translate quickly
between protocols.  I don't think it's worth it.

> You might argue that this extension is subsumed by the ability to grow sum types,
> since you can go from
>
>    type record = { a : int } with bin_io
>    type msg = A of record
>
> to
>
>    type record1 = { a : int } with bin_io
>    type record2 = { a' : int; b : int } with bin_io
>    type msg = A of record1 | B of record2
>
> (Note how special care has to be taken to tag the record --- "explicit
> tagging" in ASN.1 parlance.)

This is surely a clean way to extend protocols without losing backward
compatibility.

> My design lifts that restriction and allows an old consumer to read the data
> from a new producer when new fields have been added to a record or a tuple.

I'd probably bet that simply putting a protocol translator in front of
some old application you don't want to / cannot recompile would be
about as efficient.  Unless, of course, you go for the "generate
efficient translation code from new protocol specifications at
runtime" approach, which seems very hard to implement.  And it
wouldn't even be as general, since an intermediate translator could
translate between previously completely unrelated, arbitrary protocols
(as long as you can define a meaningful translation).  It's hard to
imagine that anybody wouldn't want to use a type safe language with
pattern matching (like OCaml) to specify that part...
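
As a rough sketch of what such a translator could look like, assuming two
hypothetical message types (msg_v1, msg_v2) and an assumed default of 0
for the field the old protocol lacks:

```ocaml
(* Hypothetical message types for two protocol versions. *)
type msg_v1 = A1 of int
type msg_v2 = A2 of int | B2 of int * int   (* B2 carries the new field *)

(* Translator placed in front of an old application: downgrade v2 to v1
   by dropping the field the old code does not know. *)
let v2_to_v1 = function
  | A2 a -> A1 a
  | B2 (a, _b) -> A1 a

(* Upgrade direction, assuming 0 is an acceptable default for the new field. *)
let v1_to_v2 ?(default_b = 0) (A1 a) = B2 (a, default_b)

let () =
  assert (v2_to_v1 (B2 (1, 2)) = A1 1);
  assert (v1_to_v2 (A1 5) = B2 (5, 0))
```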

> AFAICS the ability to process data not understood in full requires the use of
> a self-describing format like the one I'm working on.

I'd go for the protocol translator.  Especially if two protocols share
a lot of structure, it should be trivial to define translations.
Another very reasonable approach, which does not diminish performance,
would be to exchange protocol versions.  Assuming that one side is
always more recent than the other, they should be able to support old
protocols directly.

Regards,
Markus

-- 
Markus Mottl        http://www.ocaml.info        markus.mottl@gmail.com


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Caml-list] Re: Serialisation of PXP DTDs
  2008-10-26 18:15                     ` Markus Mottl
@ 2008-10-26 19:47                       ` Mauricio Fernandez
  0 siblings, 0 replies; 21+ messages in thread
From: Mauricio Fernandez @ 2008-10-26 19:47 UTC (permalink / raw)
  To: caml-list

On Sun, Oct 26, 2008 at 02:15:18PM -0400, Markus Mottl wrote:
> On Sat, Oct 25, 2008 at 2:58 PM, Mauricio Fernandez <mfp@acm.org> wrote:
> > Unfortunately, growing sum types is far from being the only protocol extension
> > of interest. There's a trivial extension which, I suspect, will be at
> > least as common in practice, namely adding new fields to a record (or new
> > elements to a tuple). bin-prot is unable to handle it adequately --- a
> > self-describing format like the one I'm working on is required.
(...) 
> With records / tuples it is exactly the other way round: you could, in
> principle, read both in the old implementation, which just needs to drop
> new, unknown fields, whereas the new implementation requires these fields
> and hence cannot parse old protocols.

This (having old consumers ignore extra fields) is what bin-prot doesn't
support because records/tuples aren't self-delimited. It can only be done at
the outermost level if you prepend the length of the message, and breaks as
soon as you have a nested record or tuple type.  In my format, records and
tuples are self-delimited, so this is supported trivially.
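
A minimal sketch of why self-delimiting helps, using a hypothetical toy
format (not the actual wire format): every int is one byte and every
record is a length byte followed by that many bytes of fields. An old
consumer that only knows field `a` of the nested record can still find
the field that follows it.

```ocaml
(* Hypothetical toy format, not the actual wire format discussed here:
   every int is one byte (0..255) and every record is a length byte
   followed by that many bytes of fields. *)

(* New producer: the inner record has gained a second field [b]. *)
let encode_inner_v2 a b =
  let payload = Printf.sprintf "%c%c" (Char.chr a) (Char.chr b) in
  Printf.sprintf "%c%s" (Char.chr (String.length payload)) payload

let encode_outer_v2 a b z =
  let payload = encode_inner_v2 a b ^ String.make 1 (Char.chr z) in
  String.make 1 (Char.chr (String.length payload)) ^ payload

(* Old consumer: only knows about field [a] of the inner record.
   Returns the decoded value and the position just past the record. *)
let decode_inner_v1 s pos =
  let len = Char.code s.[pos] in
  let a = Char.code s.[pos + 1] in
  (a, pos + 1 + len)              (* skip whatever follows [a] *)

let decode_outer_v1 s pos =
  let len = Char.code s.[pos] in
  let a, next = decode_inner_v1 s (pos + 1) in
  let z = Char.code s.[next] in
  ((a, z), pos + 1 + len)

let () =
  let msg = encode_outer_v2 1 2 3 in
  let (a, z), stop = decode_outer_v1 msg 0 in
  assert (a = 1 && z = 3);
  assert (stop = String.length msg)
```

Without the inner length byte, the old consumer would read the new field
`b` where it expects `z`.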

Note that it is possible for a new implementation to read old protocols by
specifying default values for missing fields. This basically amounts to
turning newly added fields into generalized option types (the only diff being
whether the 'a option -> 'a conversion is controlled at the level of the type
definition or distributed throughout the code). New code has to cope with the
possibility that the fields might be None, that's all. Old code never sees
those fields and works unmodified.
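
A minimal sketch of the default-value idea, again in a hypothetical toy
format (a length byte followed by one byte per int field); the default
of 0 for `b` is an assumption made for illustration:

```ocaml
(* Hypothetical toy format: a record is a length byte followed by one byte
   per int field. *)
type record_v2 = { a : int; b : int }

let default_b = 0   (* assumed default for the newly added field *)

let decode_v2 s =
  let len = Char.code s.[0] in
  let a = Char.code s.[1] in
  (* An old producer wrote only [a]; the length tells us [b] is absent. *)
  let b = if len >= 2 then Char.code s.[2] else default_b in
  { a; b }

let () =
  assert (decode_v2 "\001\007" = { a = 7; b = 0 });        (* old data *)
  assert (decode_v2 "\002\007\009" = { a = 7; b = 9 })     (* new data *)
```

Declaring the field as `b : int option` instead would move the handling
of the missing case from the decoder into the code that uses the record.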

> I don't see how any approach could "handle" the respective unsolvable
> case.  If a receiver doesn't know how to handle a tag, or if it
> requires data that is not there, you'll be stuck.

The former case is indeed unsolvable if the reader is to operate with that
field in a specific (not polymorphic) way. (It can still do things involving
only other fields, though.) In the second case, however, the receiver has got
the advantage of hindsight: it knows that the extra data might not be present,
and the code can cope with that.

> Note, too, that even if you created an implementation which allows
> handling extended records in old protocols, this would undoubtedly come
> at a pretty hefty cost.  The only efficient way to do that would be to
> exchange protocols and generate code at runtime to translate quickly
> between protocols.  I don't think it's worth it.

? I haven't optimized the generated code yet, but I'm seeing only a 25% drop
in decoding speed compared to Marshal in my preliminary tests. Extra fields
aren't even decoded, just saved in encoded form and appended to the output
when serializing again.
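
A minimal sketch of that pass-through, in a hypothetical toy format (a
length byte followed by one byte per int field; not the actual format).
The old node knows only field `a` and keeps the remaining bytes verbatim:

```ocaml
(* Hypothetical toy format: a record is a length byte followed by one byte
   per int field. The old node only knows the first field, [a]. *)
type known = { a : int; rest : string }   (* [rest]: undecoded extra fields *)

let decode s =
  let len = Char.code s.[0] in
  { a = Char.code s.[1]; rest = String.sub s 2 (len - 1) }

let encode r =
  let payload = String.make 1 (Char.chr r.a) ^ r.rest in
  String.make 1 (Char.chr (String.length payload)) ^ payload

let () =
  let from_new_producer = "\003\001\002\003" in   (* fields a=1, b=2, c=3 *)
  let r = decode from_new_producer in
  let updated = encode { r with a = 42 } in
  (* "\042" is the byte 42; b and c come back out untouched. *)
  assert (updated = "\003\042\002\003")
```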

> > You might argue that this extension is subsumed by the ability to grow sum types,
> > since you can go from
> >
> >    type record = { a : int } with bin_io
> >    type msg = A of record
> >
> > to
> >
> >    type record1 = { a : int } with bin_io
> >    type record2 = { a' : int; b : int } with bin_io
> >    type msg = A of record1 | B of record2
> >
> > (Note how special care has to be taken to tag the record --- "explicit
> > tagging" in ASN.1 parlance.)
> 
> This is surely a clean way to extend protocols without losing backward
> compatibility.

It's bothersome for the programmer (picture 
  type msg = ... | F of record6  
  and record6 = { a''''' : int; b'''': int; c''': float; d'': foo; e': bar; f : baz }),
and arguably worse than extending the record directly, because, as you said
above, the receiver will not know how to handle the "B" tag, even though it
would be perfectly able to decode the subset of the record it understands.
It's safe only in one direction (new code can read old data).

> > My design lifts that restriction and allows an old consumer to read the data
> > from a new producer when new fields have been added to a record or a tuple.
> 
> I'd probably bet that simply putting a protocol translator in front of
> some old application you don't want to / cannot recompile would be
> about as efficient. 

It's not always a matter of not recompiling the application, but rather of not
having recompiled it *yet*: in a system with multiple nodes, it is hard to
migrate them all to the updated code atomically...  Putting a protocol
translator in front of the old code is just as hard as updating it: it also
means that all exchanges have to stop while the protocol translators are put
in place --- hardly any advantage over just migrating to updated code.

> > AFAICS the ability to process data not understood in full requires the use of
> > a self-describing format like the one I'm working on.
> 
> I'd go for the protocol translator.  Especially if two protocols share
> a lot of structure, it should be trivial to define translations.
> Another very reasonable approach, which does not diminish performance,
> would be to exchange protocol versions.  Assuming that one side is
> always more recent than the other, they should be able to support old
> protocols directly.

Protocol negotiation is not always possible. Consider the case of data stored
on disk (or on any dummy server that only knows about files, not protocols)
and accessed directly without an intermediate translation layer.

-- 
Mauricio Fernandez  -   http://eigenclass.org


^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2008-10-26 19:47 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-10-22 20:11 Serialisation of PXP DTDs Dario Teixeira
2008-10-22 23:05 ` Sylvain Le Gall
2008-10-23 15:34   ` [Caml-list] " Dario Teixeira
2008-10-23 16:37     ` Stefano Zacchiroli
2008-10-23 16:53       ` Markus Mottl
2008-10-23 19:26       ` Dario Teixeira
2008-10-23 21:05         ` Mauricio Fernandez
2008-10-23 22:18           ` Gerd Stolpmann
2008-10-23 22:50             ` Mauricio Fernandez
2008-10-23 22:21           ` Dario Teixeira
2008-10-23 23:36             ` Mauricio Fernandez
2008-10-24  9:11               ` Mikkel Fahnøe Jørgensen
2008-10-24 14:03                 ` Markus Mottl
2008-10-25 18:58                   ` Mauricio Fernandez
2008-10-26 18:15                     ` Markus Mottl
2008-10-26 19:47                       ` Mauricio Fernandez
2008-10-24 21:39                 ` Mauricio Fernandez
2008-10-24 22:27                   ` Mikkel Fahnøe Jørgensen
2008-10-25 19:19                     ` Mauricio Fernandez
2008-10-23 16:46     ` Markus Mottl
2008-10-23 14:55 ` [Caml-list] " Gerd Stolpmann

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).