caml-list - the Caml user's mailing list
 help / color / mirror / Atom feed
* SafeUnmarshal: questions/problems/timings
@ 2006-08-29 23:39 Hendrik Tews
  2006-08-31 10:01 ` [Caml-list] " Grégoire Henry
  0 siblings, 1 reply; 3+ messages in thread
From: Hendrik Tews @ 2006-08-29 23:39 UTC (permalink / raw)
  To: caml-list

Dear all,

I used the safeUnmarshal module
(http://www.pps.jussieu.fr/~henry/marshal/) in order to check if
some marshaled ocaml data is compatible with its type (the data
is generated in a mixed C++/ocaml program). Here are my
experiences:

1. I made some measurements that suggest that the 
   unmarshal has quadratic complexity in the data size, see

   http://www.cs.ru.nl/~tews/marshal-plot.eps
   http://www.cs.ru.nl/~tews/marshal-plot-detail.eps

   If my (simple-minded) estimations are correct it would take
   more than 2 hours to check 4 MB of marshaled data (which I
   generate in less than 3 seconds).

   Is there any hope that the time complexity will improve?


2.	    Objective Caml version 3.09.3+dev0+ty1

    # SafeUnmarshal.copy [^nativeint^] 4;;
    Segmentation fault


3. Would it be possible to put an english version of
   http://www.pps.jussieu.fr/~henry/marshal/docTy/Ty.html online?


4. Instead of type-safe unmarshal functions, I am more interested
   in checking if some value that has been constructed outside
   ocaml is type correct. Therefore I would suggest to make
   Check.check available in come way. I am using now

     let check obj ty = Check.check (Obj.repr obj) (Ty.dump ty)

   with type 

     val check : 'a -> 'a tyrepr -> bool

   Am I right that the type parameter of tyrepr is a kind of
   phantom type that is mainly used to restrict the type of
   SafeUnmarshal.from_channel? Then I could also use 

     val check : 'a -> 'b tyrepr -> bool  ?

   It would be great if (as a debugging feature) this check could
   produce some sort of trace that helps to locate those parts
   that violate the given type.

5. nativeint, int32, and int64 are not supported yet (I would
   suggest to make the documentation a bit more precise in that
   point). As fix I use (in Check.check_block):

    | Tnativeint -> 
	tag = Obj.custom_tag && size = 2 && 
	((Obj.field obj 0) == (Obj.field (Obj.repr Nativeint.zero) 0))
    | Tint32 ->
	tag = Obj.custom_tag && size = 2 && 
	((Obj.field obj 0) == (Obj.field (Obj.repr Int32.zero) 0))
    | Tint64 ->
	tag = Obj.custom_tag && size = 3 && 
	((Obj.field obj 0) == (Obj.field (Obj.repr Int64.zero) 0))

   Any comments? On a 64 bit architecture the int64 size should be
   required to be 2.

   I would strongly suggest to replace the catch all cases

    | _ -> false

   in check.ml with some more precise code (it took me several
   hours of bug hunting until I found out that the above line
   made my unmarshal fail without even looking at the
   nativeints). 


6. Thanks for SafeUnmarshal, it helped me a lot when checking the
   data created inside C++!


Bye,

Hendrik


^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [Caml-list] SafeUnmarshal: questions/problems/timings
  2006-08-29 23:39 SafeUnmarshal: questions/problems/timings Hendrik Tews
@ 2006-08-31 10:01 ` Grégoire Henry
  2006-09-01  9:23   ` Hendrik Tews
  0 siblings, 1 reply; 3+ messages in thread
From: Grégoire Henry @ 2006-08-31 10:01 UTC (permalink / raw)
  To: Hendrik Tews; +Cc: caml-list, michel

[-- Attachment #1: Type: text/plain, Size: 4845 bytes --]

Hello,

> 1. I made some measurements that suggest that the
>    unmarshal has quadratic complexity in the data size, see
>
>    http://www.cs.ru.nl/~tews/marshal-plot.eps
>    http://www.cs.ru.nl/~tews/marshal-plot-detail.eps
>
>    If my (simple-minded) estimations are correct it would take
>    more than 2 hours to check 4 MB of marshaled data (which I
>    generate in less than 3 seconds).
>
>    Is there any hope that the time complexity will improve?

If your value contains no sharing (hence no cycle), the type checking
of safe-unmarshalling should be linear in the size of the data.

If there is some sharing (but still no cycle), and if we assume that
looking for shared type information takes constant time, then type
checking should still be linear.

With cycles, things are a bit worse because of the computation of
strongly connected components.

Therefore, if your data contains shared parts but no cycle, then it
should be close to linear. Currently, mapping shared blocks to their
type is represented by an association list: accessing such type
information could explain this quadratic behaviour.

If you are in this situation, this could be the reason and the
implementation needs a more clever way of retrieving types of shared
blocks. Note that a simple workaround such as using the Map module is
incorrect because a memory compaction may occur and break the order
used for storing blocks.

With cycles, the `height' of cycles (cycles inside cycles) comes as an
extra complexity factor. But in order to produce a quadratic
behaviour, you need data with a very particular shape.

Could you please send us more information about the shape of your data
(sharing or not, cycles or not, and wether sharing increase linearly
or not with the size)?

> 2.	    Objective Caml version 3.09.3+dev0+ty1
>
>     # SafeUnmarshal.copy [^nativeint^] 4;;
>     Segmentation fault

See your fifth question :)

> 3. Would it be possible to put an english version of
>    http://www.pps.jussieu.fr/~henry/marshal/docTy/Ty.html online?

Certainly. I'm joining a translated ty.mli, I will put it online too.

> 4. Instead of type-safe unmarshal functions, I am more interested
>    in checking if some value that has been constructed outside
>    ocaml is type correct. Therefore I would suggest to make
>    Check.check available in come way. I am using now
>
>      let check obj ty = Check.check (Obj.repr obj) (Ty.dump ty)
>
>    with type
>
>      val check : 'a -> 'a tyrepr -> bool
>
>    Am I right that the type parameter of tyrepr is a kind of
>    phantom type that is mainly used to restrict the type of
>    SafeUnmarshal.from_channel? Then I could also use
>
>      val check : 'a -> 'b tyrepr -> bool  ?

Yes and yes.

>    It would be great if (as a debugging feature) this check could
>    produce some sort of trace that helps to locate those parts
>    that violate the given type.

My hope for reducing the complexity is to make some "pre-processing"
on the value when the unmarshaller reallocates it in the memory.
In this case, the Check.check function can't operate on "standard"
values, and should not be exported.

Yet I don't know if the "preprocessor" would be tightly bound to the
unmarshaller or have a chance to be exported too.

We can keep exporting a quadratic check for "standard" values.

> 5. nativeint, int32, and int64 are not supported yet (I would
>    suggest to make the documentation a bit more precise in that
>    point). As fix I use (in Check.check_block):
>
>     | Tnativeint ->
> 	tag = Obj.custom_tag && size = 2 &&
> 	((Obj.field obj 0) == (Obj.field (Obj.repr Nativeint.zero) 0))
>     | Tint32 ->
> 	tag = Obj.custom_tag && size = 2 &&
> 	((Obj.field obj 0) == (Obj.field (Obj.repr Int32.zero) 0))
>     | Tint64 ->
> 	tag = Obj.custom_tag && size = 3 &&
> 	((Obj.field obj 0) == (Obj.field (Obj.repr Int64.zero) 0))
>
>    Any comments? On a 64 bit architecture the int64 size should be
>    required to be 2.

let custom_int64_size = 1 + 64 / Sys.word_size;;

I will add those cases.

>    I would strongly suggest to replace the catch all cases
>
>     | _ -> false
>
>    in check.ml with some more precise code (it took me several
>    hours of bug hunting until I found out that the above line
>    made my unmarshal fail without even looking at the
>    nativeints).

Something like printing the type clash (looking for ty_xx, found ty_yy)?

Or pretty-printing part of the value that have been checked before  
failure?

> 6. Thanks for SafeUnmarshal, it helped me a lot when checking the
>    data created inside C++!

This is quite an original use for unmarshalling functions :-)

Many thanks for your reports and suggestions !

-- Grégoire Henry


[-- Attachment #2: ty.mli --]
[-- Type: application/octet-stream, Size: 4477 bytes --]

(*
 *
 * A run-time representation for OCaml types
 *
 *)

(** Functions for de-constructing values of type ['a tyrepr]. *)

(** {6 Type declaration} *)

(** A value of type [declaration] is produced by the compiler for each
    type declaration in a module. This allows at runtime to ``break
    the abstraction barriers'' imposed by the type-checker... *)
type declaration  

(** [name decl] returns the name of the type constructor defined by [decl]. *)
val name: declaration -> string

(** [fullname decl] returns the name of the type constructor defined by [decl],
    as well as the line and file name of this decleration. *)
val fullname: declaration -> string

(** [arity decl] returns the arity of the type constructor defined by [decl]. *)
val arity: declaration -> int

(** [is_alias decl] returns [true] if [decl] is an alias type declaration.
    When [decl] is abstract both on the structure and the signature, it
    returns [false]. *)
val is_alias: declaration -> bool

(** [is_datatype decl] returns [true] if [decl] is a record or sum type 
    declaration. When [decl] is abstract both on the structure and the 
    signature, it returns [false]. *)
val is_datatype: declaration -> bool

(** {6 Type expression} *)

(** This is the ``untyped'' equivalent of the predefined type ['a tyrepr]. *)
type expression

(** The [dump] function allows to project a type expression  [([^ ty ^] :
    ty tyrepr)] to the type [expression]. *)
val dump: 'a tyrepr -> expression

(** [body decl] returns a type expression describing the body of the 
    type declaration [decl]. *)
val body: declaration -> expression

(** [canonical ty] resolves type aliases and returns a type expression
    that corresponds to a datatype declaration.

    Therefore,  [desc (canonical ty)]
    - is a predefined type ;
    - or is an abstract type ;
    - or has shape [Tconstr(decl,_)] such that  [is_datatype(decl)] is true.
*)
val canonical: expression -> expression

(** Type equality modulo aliases. (It is  useless to call [canonical] on
    [equal] arguments.) *)
val equal: expression -> expression -> bool

(** Pretty-printing, FIXME ... *)
val print: Format.formatter -> expression -> unit

(** {6 Description of type expression } *)

(** Description of type expressions. *)
type description = 
  | Tunit
  | Tbool
  | Tint
  | Tnativeint
  | Tint32
  | Tint64
  | Tchar
  | Tstring
  | Tfloat
  | Texn
  | Tarray of expression
  | Tlist of expression
  | Toption of expression
  | Tlazy of expression
  | Ttyrepr of expression
  | Tformat4 of expression * expression * expression * expression
  | Tconstr of declaration * expression array 
  | Ttuple of expression array
  | Tarrow of string * expression * expression
  | Tvariant of bool * (string * int * expression option) array

  | Tabstract 
  | Tsum of string array * (string * expression array) array
  | Tdoublerecord of (string * bool) array
  | Trecord of (string * bool * expression) array
  | Targ of int
  
  | Tpoly of int * expression
  | Tvar of int 

(** [desc ty] returns a description of the type expression [ty].

    It corresponds to the type expression that we can write using
    OCaml concrete syntax. It therefore excludes constructors such as
    [Tabstract], [Tsum], [Tdoublerecord] or [Trecord]. See [repr]. *)
val desc: expression -> description

(** [repr ty] returns the description of possible memory
    representations for the type expression [ty].

    By contrast with [desc ty], it can be [Tabstract], [Tsum],
    [Tdoublerecord] ou [Trecord], but not [Tconstr]. *)
val repr: expression -> description

(** {6 Safe unmarshalling} *)

(** The following functions are specific to safe unmarshalling. They
    operate on implicitly universally quantified type expression. The
    constant [Tvar i] represents the constant {i Top} of the
    (currently only in French) formalization. *)

(** [cut_head_quantification ty] removes hypothetical headings
    [Tpoly _] from [ty]. *)
val cut_head_quantification: expression -> expression

(** [instance ty1 ty2] returns true when [ty1] is a polymorphic
    instance [ty2]. *) 
val instance: expression -> expression -> bool

(** Memoryless anti-unification (also sometimes called weak anti-unification).

    We use [Tvar (-1)] for resolving conflicts. *)
val antiunif: expression -> expression -> expression

(** [is_monomorphic ty] returns [true] when [ty] contains no
    occurrence of [Tvar (-1)]. *)
val is_monomorphic: expression -> bool

[-- Attachment #3: Type: text/plain, Size: 1 bytes --]



^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [Caml-list] SafeUnmarshal: questions/problems/timings
  2006-08-31 10:01 ` [Caml-list] " Grégoire Henry
@ 2006-09-01  9:23   ` Hendrik Tews
  0 siblings, 0 replies; 3+ messages in thread
From: Hendrik Tews @ 2006-09-01  9:23 UTC (permalink / raw)
  To: caml-list

Grégoire Henry <gregoire.henry@pps.jussieu.fr> writes:

   Hendrik wrote:

   > 1. I made some measurements that suggest that the
   >    unmarshal has quadratic complexity in the data size, see

   Therefore, if your data contains shared parts but no cycle, then it
   should be close to linear. Currently, mapping shared blocks to their
   type is represented by an association list: accessing such type
   information could explain this quadratic behaviour.

That would explain something: An association list hardly scales
to several megabyte of data.

   Could you please send us more information about the shape of your data
   (sharing or not, cycles or not, and wether sharing increase linearly
   or not with the size)?

The data contains sharing and cycles. I have no idea about the
height of the cycles. It might well be the case that the amount
of shared data grows faster than the overal size. But almost all
shared data has type string * int * int.

The data are abstract syntax trees of C++ programs (produced by a
modified elsa/elkhound parser). Its type is defined with about 40
recursive variant and record types. Its monomorphic, no aliasing
or other fancy features. I would say the type is bigger, but not
more complex than "int list". I would be really surprised if the
stronly connected components are necessary for my application.

I am working on a release of the modified elsa parser (hopefully
early next week). Then you can examine the data and its type
yourself.

Maybe you can follow the approach of the Marshal module, where
the user can choose (via the flags) between different algorithms?

   > 2.	    Objective Caml version 3.09.3+dev0+ty1
   >
   >     # SafeUnmarshal.copy [^nativeint^] 4;;
   >     Segmentation fault

   See your fifth question :)

I don't quite understand. My point is that even if nativeint is
not supported yet, it should not end in a segmentation fault,
should it?

   My hope for reducing the complexity is to make some "pre-processing"
   on the value when the unmarshaller reallocates it in the memory.
   In this case, the Check.check function can't operate on "standard"
   values, and should not be exported.

   Yet I don't know if the "preprocessor" would be tightly bound to the
   unmarshaller or have a chance to be exported too.

   We can keep exporting a quadratic check for "standard" values.

My application scenario is as follows: I have this big C++
program (a C++ parser) that eventually generates an ocaml value
(the abstract syntax tree of a C++ program, possibly > 10MB). I
would then like to enter the clean world of ocaml to
manipulate/process those abstract syntax trees. 

If everything goes wrong I would first like to check the
integrety of the abstract syntax tree. If the integrety check
fails I need to know which node and which of its fields contain
the bad data (otherwise I can't fix the problem). To locate the
bad data I currently invoke check on the top node, and if it
fails I call it on the child nodes and descend into one for which
the check fails. That is, I invoke check very often, but always
on the same unmodified data.

It would be nice if the interface of SafeUnmarshal permits this
kind of use. What's necessary is 
- either a fast unmarshal function with the ability to output
  some error trace if requested
- or a check function that is fast as long as one works on the
  same unmodified data.


   >    I would strongly suggest to replace the catch all cases
   >
   >     | _ -> false
   >
   >    in check.ml with some more precise code (it took me several
   >    hours of bug hunting until I found out that the above line
   >    made my unmarshal fail without even looking at the
   >    nativeints).

   Something like printing the type clash (looking for ty_xx, found ty_yy)?

   Or pretty-printing part of the value that have been checked before
   failure?

No, we probably have some misunderstanding here. I meant the
following: Instead of 

      | _ -> false

enumerate all possible constructors and carefully distinguish
those cases for check should return false from those cases that
have not been implemented yet and which should therefore trigger
an assertion. (Because: your original code delivered false on a
nativeint. If it delivered an assertion instead I would have
immediately known that nativeints are not supported. Instead I
spent several hours to search for the inconsistency in my data.)


I tell you how to reproduce those unmarshals that run for hours
when I made the release of the modified elsa parser. I would be
very much interested in trying a new version of SafeUnmarshal,
especially one where the association list problem has been fixed
(I still have a problem in one ast of 4.7MB, it's only that check
runs from more than 2 hours on it, so I can't track it)

Bye,

Hendrik


PS. I assume that Michel Mouny and Grégoire Henry are subscribed
to the caml list, so I don't cc them.


^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2006-09-01  9:26 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2006-08-29 23:39 SafeUnmarshal: questions/problems/timings Hendrik Tews
2006-08-31 10:01 ` [Caml-list] " Grégoire Henry
2006-09-01  9:23   ` Hendrik Tews

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).