caml-list - the Caml user's mailing list
 help / color / mirror / Atom feed
* [Caml-list] Implicits for the masses
@ 2019-09-04 15:25 Oleg
  2019-09-04 20:41 ` Ivan Gotovchits
  0 siblings, 1 reply; 4+ messages in thread
From: Oleg @ 2019-09-04 15:25 UTC (permalink / raw)
  To: caml-list


This is to announce a simple, plain OCaml library to experiment with
type-class/implicits resolution, which can be thought of as evaluating
a Prolog-like program. One may allow `overlapping instances' -- or
prohibit them, insisting on uniqueness. One may make the search fully
deterministic, fully non-deterministic, or something in-between.
There is an immediate, albeit modest practical benefit: the facility
like "#install_printer", which was restricted to top-level, is now
available for all -- as a small, self-contained, plain OCaml library
with no knowledge or dependencies on the compiler internals. We show
an example at the end of the message.

This message has been inspired by the remarkable paper
        Canonical Structures for the working Coq user
        Assia Mahboubi, Enrico Tassi
        DOI: 10.1007/978-3-642-39634-2_5
Its introduction is particularly insightful: the power of
(mathematical) notation is in eliding distracting details. Yet to
formally check a proof, or to run a program, the omitted has to be
found. When pressed to fill in details, people `skillful in the art'
look in the database of the `state of the art', with the context as
the key. Computers can be programmed similarly; types well represent
the needed context to guide the search.

Mahboubi and Tassi's paper explains very well how this eliding and
filling-in is realized, as a programmable unification, and used in
Coq. Yet their insights go beyond Coq and deserve to be known better.
This message and the accompanying code is to explain them in
plain OCaml and encourage experimentation. It could have been titled
`Canonical Structures for the working OCaml (meta) programmer'.

The rudiment of canonical structures is already present in OCaml, in
the form of the registry of printers for user-defined types. This
facility is available only at the top-level, however, and deeply
intertwined with it. As a modest practical benefit, this facility is
now available for all programs, as a plain, small, self-contained
library, with no compiler or other magic. The full potential of the
method is realized however in (multi-) staged programming. In fact, I'm
planning to use it in the upcoming version of MetaOCaml to implement
`lifting' from a value to the code that produces it -- letting the
users register lifting functions for their own data types.


http://okmij.org/ftp/ML/canonical.ml
        The implementation and the examples, some of which are noted below.
http://okmij.org/ftp/ML/trep.ml
        A generic library of type representations: something like 
        Typeable in Haskell. Some day it may be built-in into the compiler
http://okmij.org/ftp/ML/canonical_leadup.ml
        A well-commented code that records the progressive development of
        canonical.ml. It is not part of the library, but may serve as
        its explanation.

Here are a few examples, starting with the most trivial one
   module Show = MakeResolve(struct type 'a dict = 'a -> string end)
   let () = Show.register Int string_of_int    (* Define `instances' *)
   let () = Show.register Bool string_of_bool
   Show.find Int 1;;

However contrived and flawed, it is instructive. Here (Int : int trep)
is the value representing the type int. The type checker can certainly
figure out that 1 is of the type int, and could potentially save us
trouble writing Int explicitly. What the type checker cannot do by
itself is to find out which function to use to convert an int to a
string. After all, there are many of them. Show.register lets us
register the *canonical* int->string function. Show.find is to search
the database of such canonical functions: in effect, finding *the*
evidence that the type int->string is populated. Keeping Curry-Howard
in mind, Show.find does a _proof search_.

The type of Show.find is 'a trep -> ('a -> string). Compare with
Haskell's show : Show a => a -> String (or, desuraging => and Show)
show : ('a -> string) -> ('a -> string).  Haskell's show indeed does
not actually do anything: it is the identity function. All the hard
work -- finding out the right dictionary (the string producing
function) -- is done by the compiler. If one does not like the way the
compiler goes about it -- tough luck. There is little one may do save
complaining on reddit. In contrast, the first argument of Show.find is
trivial: it is a mere reflection of the type int, with no further
information. Hence Show.find has to do a non-trivial work.  In the
case of int, this work is the straightforward database search --
or, if you prefer, running the query ?- dict(int,R) against a logic
program
     dict(int,string_of_int).
     dict(bool,string_of_bool).
The program becomes more interesting when it comes to pairs:
     dict(T,R) :- T = pair(X,Y), !,  
         dict(X,DX), dict(Y,DY), R=make_pair_dict(DX,DY).
Here is how it is expressed in OCaml:
let () = 
  let open Show in
  let pat : type b. b trep -> b rule_body option = function
    | Pair (x,y) -> 
        Some (Arg (x, fun dx -> Arg (y, fun dy -> 
          Fact (fun (x,y) -> "(" ^ dx x ^ "," ^ dy y ^ ")"))))
    | _      -> None
  in register_rule {pat}

let () = Show.register (Pair(Bool,Bool)) 
           (fun (x,y) -> string_of_bool x ^ string_of_bool y)

Our library permits `overlapping instances'. We hence registered the
printer for generic pairs, and a particular printer just for pairs of
booleans.

The library is extensible with user-defined data types, for example:
   type my_fancy_datatype = Fancy of int * string * (int -> string)

After registering the type with trep library, and the printer
   type _ trep += MyFancy : my_fancy_datatype trep
   let () = Show.register MyFancy (function Fancy(x,y,_) ->
     string_of_int x ^ "/" ^ y ^ "/" ^ "<fun>")

one can print rather complex data with fancy, with no further ado:
   Show.find (List(List(Pair(MyFancy,Int)))) [[(Fancy ...,5)];[]]

As Mahboubi and Tassi would say, proof synthesis at work!

We should stress that what we have described is not a type-class
facility for OCaml. It is *meta* type-class facility.  Show.find has
many drawbacks: we have to explicitly pass the trep argument like
Int. The resolution happens at run time, and hence the failure of the
resolution is a run-time exception. But the canonical instance
resolution was intended to be a part of a type checker. There, the
resolution failure is a type checking error. The trep argument,
representing the type in the object program, is also at
hand. Likewise, the drawbacks of Show.find disappear when we use the
library in a meta-program (code generator). The library then becomes a
type-class/implicits facility, for the generated code -- the facility,
we can easily (re)program.

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [Caml-list] Implicits for the masses
  2019-09-04 15:25 [Caml-list] Implicits for the masses Oleg
@ 2019-09-04 20:41 ` Ivan Gotovchits
  2019-09-10 14:40   ` [Caml-list] Type-indexed heterogeneous collections (Was: Implicits for the masses) Oleg
  0 siblings, 1 reply; 4+ messages in thread
From: Ivan Gotovchits @ 2019-09-04 20:41 UTC (permalink / raw)
  To: Oleg, caml-list

[-- Attachment #1: Type: text/plain, Size: 12124 bytes --]

Very interesting and thought-provoking writeup, thank you!

Incidentally, we're investigating the same venues, in our CMU BAP project,
as we found out that we need the extensibility in the style of type
classes/canonical structures to decouple complex dependencies which arise
in the program analysis domain.
In fact, we build our new BAP 2.0 framework largely on your
[tagless-final][1] style which, let's admit it, works much better with type
classes. Therefore we ended up implementing extensible type representations
along with registries for our type classes. Unfortunately, the idea of
storing rules in the registry didn't visit us, but we're now thinking about
how to incorporate it (the classes that we have are very nontrivial,
usually having hundreds of methods, so we're currently using functors to
manually derive on class from another, and registering the resulting
structures - but using your approach we can register functors as well and
automate the derivation). We also didn't generalize the type class
instantiation, so our solutions do have some boilerplate (but I have to
admit, that the total number of type classes that we need is not very big,
so it really never bothered us). What could be surprising is that the
universe of types actually grew quite large, that large that the linear
search in the registry is not an option for us anymore. In fact, we have so
many comparisons between treps, that instead of extracting the extension
constructor number from an extensible variant we had to rely on our own
freshly generated identifier. But I'm running in front of myself, an
important lesson that we have learned is that treps should not only be
equality comparable but also ordered (and even hashable) so that we can
implement our registries as hash tables. It is also better to keep them
abstract so that we can later extend them without breaking user code (to
implement introspection as well as different resolution schemes). This is
basically an elaboration of your approach (which is also could be commonly
found in Janestreet's Core (Type_equal.Uid.t) and other implementations of
existentials). In our case, we ended up with the following implementation
```
    type 'a witness = ..

    module type Witness = sig
      type t
      type _ witness += Id : t witness
    end

    type 'a typeid = (module Witness with type t = 'a)

    type 'a key = {
      ord : int;
      key : 'a typeid;
      name : string; (* for introspection *)
      show : 'a -> Sexp.t; (* also for introspection *)
    }
```
Now, we can use the `ord` field to order types, compare them, store in
maps, hash tables, and even arrays. E.g., this is how our `teq` function
looks like,
```
    let same (type a b) x y : (a,b) Type_equal.t =
      if x.id =  y.id then
        let module X = (val x.key : Witness with type t = a) in
        let module Y = (val y.key : Witness with type t = b) in
        match X.Id with
        | Y.Id -> Type_equal.T
        | _ -> failwith "broken type equality"
      else failwith "types are not equal"
```

It is often used in the context where we already know that `x.id = y.id`,
e.g., when we already found an entry, so we just need to obtain the
equality witness (we use Janestreet's Type_equal.T, which is the same as
yours eq type).

Concerning the representation of the registry, we also experimented with
different approaches (since we have a few ways to make a type existential
in OCaml), and found out the following to be the most efficient and easy to
work with,

```
type ordered = {
    order : 'a. 'a key -> 'a -> 'a -> int;
  } [@@unboxed]
```

Notice, that thanks to `[@@unboxed]` we got a free unpacked existential. We
will next store `ordered` in our registry, which is a hash table,

```
let ordered : ordered Hashtbl.M(Int).t = Hashtbl.create (module Int)
```
and register it as simple as,
```
  let register: type p. p Key.t -> (p -> p -> int) -> unit = fun key order
->
    Hashtbl.add_exn vtables ~key:(uid key) ~data:{
      order = fun (type a) (k : a key) (x : a) (y : a) ->
        let T = same k key in (* obtain the witness that we found the right
structure *)
        order x y
     }
```

Instead of a hashtable, it is also possible to use `ordered array ref`
(since our `ord` is just an integer which we increment every time a new
class is declared). This will give us even faster lookup.

I hope that this was interesting. And if yes, I'm ready to elaborate more
on our design decision or to hear suggestions and critics. Here are a few
links:

- https://github.com/BinaryAnalysisPlatform/bap - the BAP project per se.
- https://binaryanalysisplatform.github.io/knowledge-intro-1  - a small
introductionary post about BAP 2.0 Knowledge representation
-
https://github.com/BinaryAnalysisPlatform/bap/blob/master/lib/knowledge/bap_knowledge.ml
- the implementation of the knowledge system
-
https://github.com/BinaryAnalysisPlatform/bap/tree/master/lib/bap_core_theory
- The Core Theory, an exemplar type class of the theory that we're
developing :)

Cheers,
Ivan Gotovchits
Research Scientist, CMU Cylab

[1]: http://okmij.org/ftp/tagless-final/index.html

On Wed, Sep 4, 2019 at 11:23 AM Oleg <oleg@okmij.org> wrote:

>
> This is to announce a simple, plain OCaml library to experiment with
> type-class/implicits resolution, which can be thought of as evaluating
> a Prolog-like program. One may allow `overlapping instances' -- or
> prohibit them, insisting on uniqueness. One may make the search fully
> deterministic, fully non-deterministic, or something in-between.
> There is an immediate, albeit modest practical benefit: the facility
> like "#install_printer", which was restricted to top-level, is now
> available for all -- as a small, self-contained, plain OCaml library
> with no knowledge or dependencies on the compiler internals. We show
> an example at the end of the message.
>
> This message has been inspired by the remarkable paper
>         Canonical Structures for the working Coq user
>         Assia Mahboubi, Enrico Tassi
>         DOI: 10.1007/978-3-642-39634-2_5
> Its introduction is particularly insightful: the power of
> (mathematical) notation is in eliding distracting details. Yet to
> formally check a proof, or to run a program, the omitted has to be
> found. When pressed to fill in details, people `skillful in the art'
> look in the database of the `state of the art', with the context as
> the key. Computers can be programmed similarly; types well represent
> the needed context to guide the search.
>
> Mahboubi and Tassi's paper explains very well how this eliding and
> filling-in is realized, as a programmable unification, and used in
> Coq. Yet their insights go beyond Coq and deserve to be known better.
> This message and the accompanying code is to explain them in
> plain OCaml and encourage experimentation. It could have been titled
> `Canonical Structures for the working OCaml (meta) programmer'.
>
> The rudiment of canonical structures is already present in OCaml, in
> the form of the registry of printers for user-defined types. This
> facility is available only at the top-level, however, and deeply
> intertwined with it. As a modest practical benefit, this facility is
> now available for all programs, as a plain, small, self-contained
> library, with no compiler or other magic. The full potential of the
> method is realized however in (multi-) staged programming. In fact, I'm
> planning to use it in the upcoming version of MetaOCaml to implement
> `lifting' from a value to the code that produces it -- letting the
> users register lifting functions for their own data types.
>
>
> http://okmij.org/ftp/ML/canonical.ml
>         The implementation and the examples, some of which are noted below.
> http://okmij.org/ftp/ML/trep.ml
>         A generic library of type representations: something like
>         Typeable in Haskell. Some day it may be built-in into the compiler
> http://okmij.org/ftp/ML/canonical_leadup.ml
>         A well-commented code that records the progressive development of
>         canonical.ml. It is not part of the library, but may serve as
>         its explanation.
>
> Here are a few examples, starting with the most trivial one
>    module Show = MakeResolve(struct type 'a dict = 'a -> string end)
>    let () = Show.register Int string_of_int    (* Define `instances' *)
>    let () = Show.register Bool string_of_bool
>    Show.find Int 1;;
>
> However contrived and flawed, it is instructive. Here (Int : int trep)
> is the value representing the type int. The type checker can certainly
> figure out that 1 is of the type int, and could potentially save us
> trouble writing Int explicitly. What the type checker cannot do by
> itself is to find out which function to use to convert an int to a
> string. After all, there are many of them. Show.register lets us
> register the *canonical* int->string function. Show.find is to search
> the database of such canonical functions: in effect, finding *the*
> evidence that the type int->string is populated. Keeping Curry-Howard
> in mind, Show.find does a _proof search_.
>
> The type of Show.find is 'a trep -> ('a -> string). Compare with
> Haskell's show : Show a => a -> String (or, desuraging => and Show)
> show : ('a -> string) -> ('a -> string).  Haskell's show indeed does
> not actually do anything: it is the identity function. All the hard
> work -- finding out the right dictionary (the string producing
> function) -- is done by the compiler. If one does not like the way the
> compiler goes about it -- tough luck. There is little one may do save
> complaining on reddit. In contrast, the first argument of Show.find is
> trivial: it is a mere reflection of the type int, with no further
> information. Hence Show.find has to do a non-trivial work.  In the
> case of int, this work is the straightforward database search --
> or, if you prefer, running the query ?- dict(int,R) against a logic
> program
>      dict(int,string_of_int).
>      dict(bool,string_of_bool).
> The program becomes more interesting when it comes to pairs:
>      dict(T,R) :- T = pair(X,Y), !,
>          dict(X,DX), dict(Y,DY), R=make_pair_dict(DX,DY).
> Here is how it is expressed in OCaml:
> let () =
>   let open Show in
>   let pat : type b. b trep -> b rule_body option = function
>     | Pair (x,y) ->
>         Some (Arg (x, fun dx -> Arg (y, fun dy ->
>           Fact (fun (x,y) -> "(" ^ dx x ^ "," ^ dy y ^ ")"))))
>     | _      -> None
>   in register_rule {pat}
>
> let () = Show.register (Pair(Bool,Bool))
>            (fun (x,y) -> string_of_bool x ^ string_of_bool y)
>
> Our library permits `overlapping instances'. We hence registered the
> printer for generic pairs, and a particular printer just for pairs of
> booleans.
>
> The library is extensible with user-defined data types, for example:
>    type my_fancy_datatype = Fancy of int * string * (int -> string)
>
> After registering the type with trep library, and the printer
>    type _ trep += MyFancy : my_fancy_datatype trep
>    let () = Show.register MyFancy (function Fancy(x,y,_) ->
>      string_of_int x ^ "/" ^ y ^ "/" ^ "<fun>")
>
> one can print rather complex data with fancy, with no further ado:
>    Show.find (List(List(Pair(MyFancy,Int)))) [[(Fancy ...,5)];[]]
>
> As Mahboubi and Tassi would say, proof synthesis at work!
>
> We should stress that what we have described is not a type-class
> facility for OCaml. It is *meta* type-class facility.  Show.find has
> many drawbacks: we have to explicitly pass the trep argument like
> Int. The resolution happens at run time, and hence the failure of the
> resolution is a run-time exception. But the canonical instance
> resolution was intended to be a part of a type checker. There, the
> resolution failure is a type checking error. The trep argument,
> representing the type in the object program, is also at
> hand. Likewise, the drawbacks of Show.find disappear when we use the
> library in a meta-program (code generator). The library then becomes a
> type-class/implicits facility, for the generated code -- the facility,
> we can easily (re)program.
>

[-- Attachment #2: Type: text/html, Size: 14810 bytes --]

^ permalink raw reply	[flat|nested] 4+ messages in thread

* [Caml-list] Type-indexed heterogeneous collections (Was: Implicits for the masses)
  2019-09-04 20:41 ` Ivan Gotovchits
@ 2019-09-10 14:40   ` Oleg
  2019-09-10 19:03     ` Ivan Gotovchits
  0 siblings, 1 reply; 4+ messages in thread
From: Oleg @ 2019-09-10 14:40 UTC (permalink / raw)
  To: ivg; +Cc: caml-list


> https://github.com/BinaryAnalysisPlatform/bap - the BAP project per se.
> ...
That is quite a project! And interesting, too. Thank you for letting
me know. Now I understand what you mean by ``universe of types
actually grew quite large, that large that the linear search in the
registry is not an option for us anymore.''

What your message has made it very clear, to me, is that we really
need the standard type representation library -- hopefully being part
of Stdlib and eventually integrated with the compiler. For one, it is
high time we standardized the spelling of the ('a,'b) eq type and its
constructor. Mainly, as your message has well demonstrated, an
efficient trep is actually non-trivial, and takes lots of experience
to design. Incidentally, the situation on Haskell is very similar:
Typeable is now well integrated with the compiler. It is quite a
popular library.

I'd also like to draw attention to truly type-indexed heterogeneous
collections where trep is a part of a key rather than a part of a
value. To explain what I mean, let's recall the state of the art of
heterogeneous collections. Let ('key,'v) coll be an ordinary
homogeneous collection (a map, a hashtable, or an associative list)
with a look-up function
        lookup : 'key -> ('key,'v) coll -> 'v option
We can turn it into a heterogeneous collection that stores data of
various types by using dynamics dyn: an existential that contains the
data of some type t along with t trep, the representation of the type t.
You have showed a very efficient dyn (which I will keep in mind, thank
you). The dyn extraction
        from_dyn_opt : 'a trep -> dyn -> 'a option
checks if the trep given as an argument matches trep stored inside
dyn, and if so, returns dyn's value, of the requested type. Related
        from_dyn : 'a trep -> dyn -> 'a
throws an exception on mismatch. The
heterogeneous collection is then ('key, dyn) coll, with the
look-up operation that is a `composition' of lookup and from_dyn

        let lookup_het : 'key -> 'a trep -> ('key,dyn) coll -> 'a option =
          fun key trep coll ->
            match lookup key coll with
            | None -> None
            | Some dyn -> Some (from_dyn trep dyn)

I grant the elegance of the implementation and the reuse of the
existing collections. What bothers me however is that the 'a trep is
a part of the element value (part of dyn) rather a part of the
key. The real lookup is done solely on the key, and the
trep comparison is used as a verification, just to make sure we found
the right dyn. If the code is correct, the check does not
fail. Somehow I am always bothered by operations whose results isn't
actually used but still have to be done. Wouldn't it be nice if all
operations advanced us towards the goal, with no need for looking back
and double-checking.

Heterogeneous collections, at least, could be designed that way. Here
is the gist:
        type 'key binding = B : 'key * 'a trep * 'a -> 'key binding
(a more efficient existential could be designed, following the ideas
of your message.) The binding is what its name says: an association of
the key, the trep, and the value. The simplest heterogeneous
collection, the analogue of association lists, is a list of bindings.
        type 'key halist = 'key binding list

let lookup_hetkey : type a key. key -> a trep -> key halist -> a option =
  fun key trep coll ->
    let rec loop : key halist -> a option = function
      | [] -> None
      | B (key',trep',v) :: t -> match (key=key',teq trep' trep) with
        | (true,Some Refl) -> Some v
        | _                -> loop t
    in loop coll

Now both key and trep equally participate in search: the mismatch of
trep and trep' is not a failure but an indicator to continue search.
The clear drawback is that the shown code uses a linear search -- and
we cannot use off-the-shelf Maps or hashtables to speed it up. Still,
an adaptation of an extant Hashtable is rather straightforward. In
fact, the following file

        http://okmij.org/ftp/ML/hetcoll.ml

shows the implementation of type-indexed heterogeneous hash tables. It
is almost the same as Stdlib.Hashtable, but stripped to bare minimum
and slightly adjusted. (The biggest adjustment has nothing to do with
types; I simply didn't like how resizing was implemented). 

module type TypedHashedType =
  sig
    type t                              (* key type *)
    val equal: t -> t -> bool           (* key equality *)
    val hash:  t -> 'a trep -> int      (* Hash the key and trep *)
  end
module type S =
  sig
    type key
    type t
    val create: int -> t
    val clear : t -> unit
    val add: t -> key -> 'a trep -> 'a -> unit
    val find_opt: t -> key -> 'a trep -> 'a option
    val length: t -> int
  end

Although the interface is very minimal, it is enough for registering
your types and classes and my canonical functions. Stdlib.Map could
likewise be adjusted.


^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [Caml-list] Type-indexed heterogeneous collections (Was: Implicits for the masses)
  2019-09-10 14:40   ` [Caml-list] Type-indexed heterogeneous collections (Was: Implicits for the masses) Oleg
@ 2019-09-10 19:03     ` Ivan Gotovchits
  0 siblings, 0 replies; 4+ messages in thread
From: Ivan Gotovchits @ 2019-09-10 19:03 UTC (permalink / raw)
  To: Oleg, Ivan Gotovchits, caml-list

[-- Attachment #1: Type: text/plain, Size: 9327 bytes --]

On Tue, Sep 10, 2019 at 10:38 AM Oleg <oleg@okmij.org> wrote:

>
> > https://github.com/BinaryAnalysisPlatform/bap - the BAP project per se.
> > ...
> That is quite a project! And interesting, too. Thank you for letting
> me know. Now I understand what you mean by ``universe of types
> actually grew quite large, that large that the linear search in the
> registry is not an option for us anymore.''
>
> What your message has made it very clear, to me, is that we really
> need the standard type representation library -- hopefully being part
> of Stdlib and eventually integrated with the compiler. For one, it is
> high time we standardized the spelling of the ('a,'b) eq type and its
> constructor. Mainly, as your message has well demonstrated, an
> efficient trep is actually non-trivial, and takes lots of experience
> to design. Incidentally, the situation on Haskell is very similar:
> Typeable is now well integrated with the compiler. It is quite a
> popular library.
>

Definitely! So far the closest to what could be called a standard
representation is the Base.Type_equal [1] module (and its extensions [2]).
But unfortunately, it comes with strings attached and that hampers its
adoption in other projects.
Basically, we need at least to put the data constructor in a separate
library (as it was done for the `result` type previously). Going beyond
just defining the data type would require some consensus. A good example
would be deciding how to implement the total order relation on type
indices. The Base Library is using `Obj.extension_constructor` which has a
few problems:
1) It is in the Obj module with all the ramifications
2) The implementation is inefficient and overly cautious (so cautious that
it doesn't work with the Ancient module, for example)
3) The corresponding index is too global, not only shared by all extensible
variants and exceptions but every new object increments the identifier
(i.e., it is the same global counter which is used by `Oo.id`,
which is probably an error which should be fixed in the compiler),
therefore there is a chance to overflow it.

That's why we decided to implement our own indexing, for example.

[1]: https://github.com/janestreet/base/blob/master/src/type_equal.mli
[2]: https://github.com/janestreet/typerep


>
> I'd also like to draw attention to truly type-indexed heterogeneous
> collections where trep is a part of a key rather than a part of a
> value.


I like to treat such structures as a reification of OCaml records so that
trep is now a first-class field and a record is an n-tuple with named
fields. Since the fields are named, the n-tuple can grow and shrink, i.e.,
the size is not fixed. We can implement the n-tuple using heterogeneous
maps, hashtables, or just an assoc list. So far we had a few libraries
which provided heterogeneous maps indexed by the type index, the `Univ_map`
[3] from Janestreet and Hmap [4] from Daniel Bünzli, to name a few.

Both libraries follow the same approach, namely parametrizing the existing
non-heterogeneous Map functor with the order function induced by the age of
a key, with some minor differences, e.g., Base's version is using the int
type as the key (where the key denotes the extension constructor
identifier), while Daniel is using the trep itself as the key, and
extracting the order from the age of the key.
However, though semantically data is indeed indexed by the trpep,
syntactically it is represented with an extra layer of indirection, e.g.,
instead of having a single binding

     type 'key binding = Binding : 'a trep * 'a -> 'key binding

we have a non-heterogeneous binding

    type ('a,'b) binding = Binding of 'a * 'b

which packs a universal something like this as `Binding (trep,Pack
(trep,x))`.

This creates a lot of overhead memory-wise and (given that OCaml
performance is usually conditioned by the allocation rate)
performance-wise. OCaml AVL-trees are already having a lot of overhead,
about 5 extra words per bindings (Base has a little bit less, but still a
lot - about 3 words per binding). Therefore, (after lots of profiling) we
actually ended up implementing heterogenous AVL trees, which doesn't suffer
from this problem. We ended up with the following representation [5]
```
  type record =
    | T0
    | T1 : 'a key * 'a -> record
    | T2 : 'a key * 'a *
           'b key * 'b -> record
    | T3 : 'a key * 'a *
           'b key * 'b *
           'c key * 'c -> record
    | T4 : 'a key * 'a *
           'b key * 'b *
           'c key * 'c *
           'd key * 'd -> record
    | LL : record * 'a key * 'a * record -> record (* h(x) = h(y) - 1 *)
    | EQ : record * 'a key * 'a * record -> record (* h(x) = h(y) *)
    | LR : record * 'a key * 'a * record -> record (* h(x) = h(y) + 1 *)
```

Which stores the key and data directly on the branches of a tree without
any unnecessary indirection. And also unfolds n-tuples with n < 5, an
implementation detail which is not really important wrt to the current
discussion, (but was a dealbreaker for us as it significantly reduced the
memory consumption and the total number of rebalances, so that at the end
of the day we got a factor of 5 improvements both in time and space).

 Another interesting feature is that if implemented as a functor, this
implementation could be used to derive non-heterogeneous maps (e.g., when
`'a key = int`).

During the implementation, we had to find solutions to a few very
interesting problems, like how to represent an existential folding function
(or how to fold over heterogeneous map)[6]. Or more interesting, how to
update keys and merge two maps [7] and do this efficiently (hint: you have
to use CPS).


[3]: https://github.com/janestreet/core_kernel/blob/master/src/univ_map.ml
[4]: https://github.com/dbuenzli/hmap/blob/master/src/hmap.ml
[5]:
https://github.com/BinaryAnalysisPlatform/bap/blob/0352edfe42ba5560b178b6bf990cf15e81d78f57/lib/knowledge/bap_knowledge.ml#L766
[6]:
https://github.com/BinaryAnalysisPlatform/bap/blob/0352edfe42ba5560b178b6bf990cf15e81d78f57/lib/knowledge/bap_knowledge.ml#L824
[7]:
https://github.com/BinaryAnalysisPlatform/bap/blob/0352edfe42ba5560b178b6bf990cf15e81d78f57/lib/knowledge/bap_knowledge.ml#L1316

Treating heterogenous maps as a generalization of records brings us to
another interesting development. We do not want to completely forfeit types
of records and treat all records as a member of one universal record type.
We would like to be able to define type schemes so that we can distinguish
between a `student record` and a `course record` while keeping both
universal and extensible. In BAP we implemented this using classes and
slots (borrowing those ideas from frame languages of AI, any resemblance to
OO is purely coincidental). We turn our untyped `record` into a value
parameterized by its class, e.g.,

```
type 'a value = {
     cls : 'a cls;
     data : record;
}
```

The `'a cls` type could be phantom, e.g., `type 'a cls = 'a`, but we can
benefit from storing some type information to implement runtime reflection.
And an extra layer of indirection on top of treps enables the structural
definition of a record type that can span across multiple modules:
```
type ('a,'p) slot = {
   cls : 'a cls;
   key : 'p key;
}
```

so that now a class is defined by its slots, e.g.,

```
val name : (student, string) slot
val age : (student, age) slot
...
val teacher : (course, teacher) slot
```

and an abstract interface of the Value will prevent us from confusing
different fields:
```
module Value : sig
  type 'a t
  val get : ('a,'p) slot -> 'a t -> 'p
  val put : ('a,'p) slot -> 'a t -> p -> 'a t
end
```

The real implementation in BAP [8], is a little bit more convoluted since
we want to create hierarchies of classes (which we need for type-safe
representation of their denotational semantics) and make classes indexable
by runtime values. Therefore our class is actually a collection of sorts.
And we also have triggers (i.e., computable fields), as well as we use
domains (aka Scott-Ershov domains) to represent the properties of classes,
which gives us a nice feature - our `Value.get` function is total as for
each type of property we have the bottom value as the default.

Finally, we can store our values in a database and hide this storage under
a monad interface, so that we can implement different access schemes (and
put a fixed point computation inside), but this is a completely different
story. The takeaway is that OCaml as a language is now fully equipped for
implementing efficient heterogeneous data structures and can be used in
domains which previously were previously dominated by languages with
dynamic typing.

Besides, the Knowledge library is totally independent on BAP despite the
fact that it is currently distributed from the BAP repository. We've tried
not to leak any domain-specific details into the knowledge abstraction
albeit keeping it versatile enough to be able to represent the knowledge
representation of the program semantics.

[8]:
https://github.com/BinaryAnalysisPlatform/bap/blob/master/lib/knowledge/bap_knowledge.mli#L153

Cheers,
Ivan Gotovchits

[-- Attachment #2: Type: text/html, Size: 12168 bytes --]

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2019-09-10 19:01 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-09-04 15:25 [Caml-list] Implicits for the masses Oleg
2019-09-04 20:41 ` Ivan Gotovchits
2019-09-10 14:40   ` [Caml-list] Type-indexed heterogeneous collections (Was: Implicits for the masses) Oleg
2019-09-10 19:03     ` Ivan Gotovchits

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).