On Tue, Sep 10, 2019 at 10:38 AM Oleg <oleg@okmij.org> wrote:

> https://github.com/BinaryAnalysisPlatform/bap - the BAP project per se.
> ...
That is quite a project! And interesting, too. Thank you for letting
me know. Now I understand what you mean by ``universe of types
actually grew quite large, that large that the linear search in the
registry is not an option for us anymore.''

What your message has made it very clear, to me, is that we really
need the standard type representation library -- hopefully being part
of Stdlib and eventually integrated with the compiler. For one, it is
high time we standardized the spelling of the ('a,'b) eq type and its
constructor. Mainly, as your message has well demonstrated, an
efficient trep is actually non-trivial, and takes lots of experience
to design. Incidentally, the situation on Haskell is very similar:
Typeable is now well integrated with the compiler. It is quite a
popular library.

Definitely! So far the closest to what could be called a standard representation is the Base.Type_equal [1] module (and its extensions [2]). But unfortunately, it comes with strings attached and that hampers its adoption in other projects. 
Basically, we need at least to put the data constructor in a separate library (as it was done for the `result` type previously). Going beyond just defining the data type would require some consensus. A good example
would be deciding how to implement the total order relation on type indices. The Base Library is using `Obj.extension_constructor` which has a few problems:
1) It is in the Obj module with all the ramifications
2) The implementation is inefficient and overly cautious (so cautious that it doesn't work with the Ancient module, for example)
3) The corresponding index is too global, not only shared by all extensible variants and exceptions but every new object increments the identifier (i.e., it is the same global counter which is used by `Oo.id`, 
which is probably an error which should be fixed in the compiler), therefore there is a chance to overflow it. 

That's why we decided to implement our own indexing, for example. 

[1]: https://github.com/janestreet/base/blob/master/src/type_equal.mli
[2]: https://github.com/janestreet/typerep
 

I'd also like to draw attention to truly type-indexed heterogeneous
collections where trep is a part of a key rather than a part of a
value.

I like to treat such structures as a reification of OCaml records so that trep is now a first-class field and a record is an n-tuple with named fields. Since the fields are named, the n-tuple can grow and shrink, i.e., the size is not fixed. We can implement the n-tuple using heterogeneous maps, hashtables, or just an assoc list. So far we had a few libraries which provided heterogeneous maps indexed by the type index, the `Univ_map` [3] from Janestreet and Hmap [4] from Daniel Bünzli, to name a few.

Both libraries follow the same approach, namely parametrizing the existing non-heterogeneous Map functor with the order function induced by the age of a key, with some minor differences, e.g., Base's version is using the int type as the key (where the key denotes the extension constructor identifier), while Daniel is using the trep itself as the key, and extracting the order from the age of the key.
However, though semantically data is indeed indexed by the trpep, syntactically it is represented with an extra layer of indirection, e.g., instead of having a single binding 

     type 'key binding = Binding : 'a trep * 'a -> 'key binding

we have a non-heterogeneous binding

    type ('a,'b) binding = Binding of 'a * 'b
    
which packs a universal something like this as `Binding (trep,Pack (trep,x))`.

This creates a lot of overhead memory-wise and (given that OCaml performance is usually conditioned by the allocation rate) performance-wise. OCaml AVL-trees are already having a lot of overhead, about 5 extra words per bindings (Base has a little bit less, but still a lot - about 3 words per binding). Therefore, (after lots of profiling) we actually ended up implementing heterogenous AVL trees, which doesn't suffer from this problem. We ended up with the following representation [5]
```
  type record =
    | T0
    | T1 : 'a key * 'a -> record
    | T2 : 'a key * 'a *
           'b key * 'b -> record
    | T3 : 'a key * 'a *
           'b key * 'b *
           'c key * 'c -> record
    | T4 : 'a key * 'a *
           'b key * 'b *
           'c key * 'c *
           'd key * 'd -> record
    | LL : record * 'a key * 'a * record -> record (* h(x) = h(y) - 1 *)
    | EQ : record * 'a key * 'a * record -> record (* h(x) = h(y) *)
    | LR : record * 'a key * 'a * record -> record (* h(x) = h(y) + 1 *)
```

Which stores the key and data directly on the branches of a tree without any unnecessary indirection. And also unfolds n-tuples with n < 5, an implementation detail which is not really important wrt to the current discussion, (but was a dealbreaker for us as it significantly reduced the memory consumption and the total number of rebalances, so that at the end of the day we got a factor of 5 improvements both in time and space).

 Another interesting feature is that if implemented as a functor, this implementation could be used to derive non-heterogeneous maps (e.g., when `'a key = int`).

During the implementation, we had to find solutions to a few very interesting problems, like how to represent an existential folding function (or how to fold over heterogeneous map)[6]. Or more interesting, how to update keys and merge two maps [7] and do this efficiently (hint: you have to use CPS).


[3]: https://github.com/janestreet/core_kernel/blob/master/src/univ_map.ml
[4]: https://github.com/dbuenzli/hmap/blob/master/src/hmap.ml
[5]: https://github.com/BinaryAnalysisPlatform/bap/blob/0352edfe42ba5560b178b6bf990cf15e81d78f57/lib/knowledge/bap_knowledge.ml#L766
[6]: https://github.com/BinaryAnalysisPlatform/bap/blob/0352edfe42ba5560b178b6bf990cf15e81d78f57/lib/knowledge/bap_knowledge.ml#L824
[7]: https://github.com/BinaryAnalysisPlatform/bap/blob/0352edfe42ba5560b178b6bf990cf15e81d78f57/lib/knowledge/bap_knowledge.ml#L1316

Treating heterogenous maps as a generalization of records brings us to another interesting development. We do not want to completely forfeit types of records and treat all records as a member of one universal record type. We would like to be able to define type schemes so that we can distinguish between a `student record` and a `course record` while keeping both universal and extensible. In BAP we implemented this using classes and slots (borrowing those ideas from frame languages of AI, any resemblance to OO is purely coincidental). We turn our untyped `record` into a value parameterized by its class, e.g.,

```
type 'a value = {
     cls : 'a cls;
     data : record;
}
```

The `'a cls` type could be phantom, e.g., `type 'a cls = 'a`, but we can benefit from storing some type information to implement runtime reflection. And an extra layer of indirection on top of treps enables the structural definition of a record type that can span across multiple modules:
```
type ('a,'p) slot = {
   cls : 'a cls;
   key : 'p key;
}
```

so that now a class is defined by its slots, e.g.,

```
val name : (student, string) slot
val age : (student, age) slot
...
val teacher : (course, teacher) slot
```

and an abstract interface of the Value will prevent us from confusing different fields:
```
module Value : sig 
  type 'a t
  val get : ('a,'p) slot -> 'a t -> 'p
  val put : ('a,'p) slot -> 'a t -> p -> 'a t
end
```

The real implementation in BAP [8], is a little bit more convoluted since we want to create hierarchies of classes (which we need for type-safe representation of their denotational semantics) and make classes indexable by runtime values. Therefore our class is actually a collection of sorts. And we also have triggers (i.e., computable fields), as well as we use domains (aka Scott-Ershov domains) to represent the properties of classes, which gives us a nice feature - our `Value.get` function is total as for each type of property we have the bottom value as the default. 

Finally, we can store our values in a database and hide this storage under a monad interface, so that we can implement different access schemes (and put a fixed point computation inside), but this is a completely different story. The takeaway is that OCaml as a language is now fully equipped for implementing efficient heterogeneous data structures and can be used in domains which previously were previously dominated by languages with dynamic typing. 

Besides, the Knowledge library is totally independent on BAP despite the fact that it is currently distributed from the BAP repository. We've tried not to leak any domain-specific details into the knowledge abstraction albeit keeping it versatile enough to be able to represent the knowledge representation of the program semantics. 

[8]: https://github.com/BinaryAnalysisPlatform/bap/blob/master/lib/knowledge/bap_knowledge.mli#L153

Cheers,
Ivan Gotovchits