caml-list - the Caml user's mailing list
 help / color / mirror / Atom feed
* [Caml-list] ANN: unboxed-arrays-0.1: Unboxed arrays for OCaml
@ 2014-01-19 18:01 Bob Atkey
  2014-01-20  8:47 ` Mark Shinwell
  0 siblings, 1 reply; 4+ messages in thread
From: Bob Atkey @ 2014-01-19 18:01 UTC (permalink / raw)
  To: caml-list

Hello all,

Apropos the recent discussion about optimizing values of type 'a
option, I'd like to announce my little prototype library for the
related problem of defining unboxed array types in OCaml. The library
uses GADTs to describe the memory layout of unboxed elements, and the
module system to hide the unsafe operations used to implement the
arrays. Arrays of monomorphic and polymorphic values are supported.

The library is available on GitHub:
   https://github.com/bobatkey/unboxed-arrays

For your convenience, I have reproduced the README below (in markdown
format), which includes simple example of usage:


# Unboxed Arrays

An implementation of unboxed arrays for OCaml by Robert Atkey
<bob.atkey@gmail.com>.

The runtime representation of OCaml arrays usually stores the elements
using a 'boxed' representation, meaning that if the value of a
particular element cannot be stored within a single machine word
(i.e., it is not an argument-less constructor or an `int` value), then
the array actually contains a pointer to the representation of the
element elsewhere in the heap. (Note that `float` arrays are special
cased.) This boxing representation leads to an extra indirection when
accessing elements of the array, means that memory is wasted, and
decreases data locality, to the detriment of efficent cache usage.

The `unboxed-arrays` library implements functors that generate new
abstract types representing arrays, whose run-time representation is
'unboxed', so that each of the elements in stored inline in the
array. The library uses unsafe operations from the `Obj` module in its
implementation, but this unsafety is completely hidden from clients
behind abstract types.

Note that this implementation has not yet been benchmarked, and so can
only be regarded as a proof-of-concept for implementing unboxed
arrays. In particular, it is not clear to me that the overhead of
allocating `Data` constructor (see the `Element_Descriptor` signature
below) is not too expensive.

In the future, it would be nice to explore alternative representations
for more specialised representations of unboxed arrays. For example,
using bitmaps to representation occupancy information for unboxed `'a
option array`s.

## To build

````
$ ocaml setup.ml -build
````

There are no dependencies beyond plain OCaml (version 4.01.0 tested).

## Interface

New unboxed array implementations are generated by using the following
functor:

````ocaml
module Make (Elt : Element_Descriptor) : S with type elt = Elt.t
````

where the signature `S` describes a basic imperative array interface:

````ocaml
module type S = sig
  type t
  type elt
  val create : int -> elt -> t
  val init   : int -> (int -> elt) -> t
  val get    : t -> int -> elt
  val set    : t -> int -> elt -> unit
  val length : t -> int
end
````

The element type is described using modules matching the following
signature. Modules implementing this signature set out (a) the
allowable constructors for the elements; (b) the types of the data
associated with each constructor; (c) the maximum size of an element;
and (d) the specific width associated with each constructor.

````ocaml
module type Element_Descriptor = sig
  type 'a constructor
  type size
  val size : size is_a_natural
  val width_of : 'a constructor -> ('a, size) width
  type t = Data : 'a constructor * 'a -> t
end
````

The type `'a constructor` is usually a GADT (Generalised Algebraic
Datatype) that uses the type parameter to describe the types of the
data associated with that constructor. The *type* `size` is expected
to be of the form `zero succ ... succ`, and represents in unary
notation the maximum size of any element. The *value* `size` is a
runtime representation of the number represented by `size`. The
function `width_of` describes the necessary width to store the data
associated with each constructor, with a static check that all the
widths are less than `size`. Finally, the type `t` defines a GADT that
represents 'boxed' representations used to represent element values
outwith the array.

The types `is_a_natural` and `width` are defined by the `UnboxedArray`
library.

There is also another functor `Make1` that allows for the generation
of arrays with polymorphic element types.

## A Simple Example

The following implementation of `Element_Descriptor` describes a
datatype with three constructors:

````ocaml
module Elt = struct
  open UnboxedArray

  type 'a constructor =
    | Empty   : unit constructor
    | Deleted : unit constructor
    | Value   : (int * string) constructor

  type size = zero succ succ
  let size = Succ (Succ Zero)

  let width_of (type d) (c : d constructor) : (d,size) width =
      match c with
        | Empty   -> Width0
        | Deleted -> Width0
        | Value   -> Width2

  type t = Data : 'd constructor * 'd -> t
end

module A = UnboxedArray.Make (Elt)

open Elt
````

Values of type `A.t` now represent boxed values of type `Elt.t`, and
can be manipulated through the interface `S with type elt = Elt.t`, in
a fairly natural way:

````ocaml
# let arr = A.create 10 (Data (Empty, ()));;
val arr : A.t = <abstr>
# A.set arr 0 (Data (Value, (1, "foo")));;
- : unit = ()
# A.set arr 5 (Data (Deleted, ()));;
- : unit = ()
# let elt_to_string arr i =
    match A.get arr i with
      | Data (Empty, ()) -> "Empty"
      | Data (Deleted, ()) -> "Deleted"
      | Data (Value, (k,v)) -> Printf.sprintf "Value (%d,%S)" k v;;
val elt_to_string : A.t -> int -> string = <fun>
# elt_to_string arr 0 |> print_endline;;
Value (1,"foo")
- : unit = ()
# elt_to_string arr 1 |> print_endline;;
Empty
- : unit = ()
# elt_to_string arr 5 |> print_endline;;
Deleted
- : unit = ()
#
````

There is a more substantial example using unboxed arrays to implement
a hashtable with linear probing in the file `hashtable.ml`. This
example also demonstrates the use of the `UnboxedArray.Make1` functor
for generated polymorphic unboxed array types.



Bob

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [Caml-list] ANN: unboxed-arrays-0.1: Unboxed arrays for OCaml
  2014-01-19 18:01 [Caml-list] ANN: unboxed-arrays-0.1: Unboxed arrays for OCaml Bob Atkey
@ 2014-01-20  8:47 ` Mark Shinwell
  2014-01-20 12:24   ` Bob Atkey
  0 siblings, 1 reply; 4+ messages in thread
From: Mark Shinwell @ 2014-01-20  8:47 UTC (permalink / raw)
  To: Bob Atkey; +Cc: caml-list

On 19 January 2014 18:01, Bob Atkey <bob.atkey@gmail.com> wrote:
> The runtime representation of OCaml arrays usually stores the elements
> using a 'boxed' representation, meaning that if the value of a
> particular element cannot be stored within a single machine word
> (i.e., it is not an argument-less constructor or an `int` value), then
> the array actually contains a pointer to the representation of the
> element elsewhere in the heap. (Note that `float` arrays are special
> cased.) This boxing representation leads to an extra indirection when
> accessing elements of the array, means that memory is wasted, and
> decreases data locality, to the detriment of efficent cache usage.

Thanks for sharing your code.  I also did an experiment
with this recently (although I would argue that it isn't always
clear that flat arrays are desirable).  My flat array module
is designed to hold a sequence of values of record type.  Pointers
to the records may be retrieved and then normal mutable field
updates may be used to change them.  (In case you are wondering,
if you put some other variety of value in the array, then it's
probably safe but won't be useful---since in particular there
is no "set" method provided.)

module Flat_array : sig
  type 'a t

  val create : num_elements:int -> default:'a -> 'a t
  val element_at : 'a t -> index:int -> 'a
end

Here's a simple example of its use.  The code of the
implementation is left as an interesting exercise for the
reader ;)

type t = {
  mutable foo : int;
  mutable bar : string;
}

let default = { foo = 42; bar = "hello world"; }

let test () =
  let arr = Flat_array.create ~num_elements:10 ~default in
  let r = Flat_array.element_at arr ~index:4 in
  Printf.printf "foo=%d, bar=%s\n%!" r.foo r.bar;
  Gc.compact (); (* make sure we haven't corrupted the heap *)
  r.foo <- 10;
  r.bar <- "bar";
  Gc.compact ();
  Printf.printf "foo=%d, bar=%s\n%!" r.foo r.bar

--

module Flat_array : sig
  type 'a t

  val create : num_elements:int -> default:'a -> 'a t
  val element_at : 'a t -> index:int -> 'a
end = struct
  type _ t = Obj.t

  let create ~num_elements ~default =
    let fields_per_elt = Obj.size (Obj.repr default) in
    let size = 1 + num_elements * (1 + fields_per_elt) - 1 in
    let t = Obj.new_block Obj.closure_tag size in
    let stride = Sys.word_size/8 * (fields_per_elt + 1) in
    Obj.set_field t (size - 1) (Obj.repr stride);
    let default = Obj.repr default in
    for elt = 0 to num_elements - 1 do
      let header_index_for_elt = (1 + fields_per_elt) * elt - 1 in
      for field = 0 to fields_per_elt - 1 do
        Obj.set_field t (header_index_for_elt + 1 + field)
          (Obj.field default field)
      done;
      if elt > 0 then begin
        let infix_header =
          let infix_offset = elt * (fields_per_elt + 1) in
          let actual_value =
            (infix_offset lsl 10) lor Obj.infix_tag
          in
          assert (actual_value land 1 = 1);
          actual_value lsr 1
        in
        Obj.set_field t header_index_for_elt (Obj.repr infix_header)
      end
    done;
    t

  let element_at t ~index =
    let offset =
      if index = 0 then 0
      else
        let t = ((Obj.magic t) : int array) in
        let stride = t.(Array.length t - 1) in
        stride * index
    in
    let record_is_at = Obj.add_offset t (Int32.of_int offset) in
    Obj.obj record_is_at
end

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [Caml-list] ANN: unboxed-arrays-0.1: Unboxed arrays for OCaml
  2014-01-20  8:47 ` Mark Shinwell
@ 2014-01-20 12:24   ` Bob Atkey
  2014-01-20 12:33     ` Mark Shinwell
  0 siblings, 1 reply; 4+ messages in thread
From: Bob Atkey @ 2014-01-20 12:24 UTC (permalink / raw)
  To: Mark Shinwell; +Cc: caml-list

On 20 January 2014 08:47, Mark Shinwell <mshinwell@janestreet.com> wrote:
> Thanks for sharing your code.  I also did an experiment
> with this recently (although I would argue that it isn't always
> clear that flat arrays are desirable).  My flat array module
> is designed to hold a sequence of values of record type.  Pointers
> to the records may be retrieved and then normal mutable field
> updates may be used to change them.

This looks really useful. Especially the fact that your implementation
has no allocation overhead when accessing an element in the array.
unboxed-arrays definitely has allocation overhead, and also an
interpretation overhead when reading and writing. Maybe these could be
mitigated with enough inlining though.

> (In case you are wondering,
> if you put some other variety of value in the array, then it's
> probably safe but won't be useful---since in particular there
> is no "set" method provided.)

Unfortunately, Flat_array seems to go a bit awry when you put strings
or arrays in as the default element:

(OCaml 4.01.0)

# let s = Flat_array.create 10 "hello";;
val s : string Flat_array.t = <abstr>
# Flat_array.element_at s ~index:0;;
- : string =
"hello\000\000\002\249\b\000\000\000\000\000\000hello\000\000\002\249\016\000\000\000\000\000\000hello\000\000\002\249\024\000\000\000\000\000\000hello\000\000\002\249
\000\000\000\000\000\000hello\000\000\002\249(\000\000\000\000\000\000hello\000\000\002\2490\000\000\000\000\000\000hello\000\000\002\2498\000\000\000\000\000\000hello\000\000\002\249@\000\000\000\000\000\000hello\000\000\002\249H\000\000\000\000\000\000hello\000\000\002!\000\000\000\000\000\000"

# let t = Flat_array.create 10 [|1;2;3|];;
val t : int array Flat_array.t = <abstr>
# Flat_array.element_at t ~index:0;;
- : int array =
[|1; 2; 3; 2172; 1; 2; 3; 4220; 1; 2; 3; 6268; 1; 2; 3; 8316; 1; 2; 3; 10364;
  1; 2; 3; 12412; 1; 2; 3; 14460; 1; 2; 3; 16508; 1; 2; 3; 18556; 1; 2; 3;
  32|]

Even worse, because it is accessing memory off the end of the block
allocated by Flat_array:

# Flat_array.element_at t ~index:9;;
- : int array =
[|1; 2; 3; 32; 1920; 1; 2; 3; 1019; 19742592; 2043; 70323357739924;
  70323353168284; 70323352779268; 2555; 70323357739208; 70323353168344;
  70323353168448; 0; 1019; 70323357739294; 1408; 70323358792352;
  70323358792376; 1408; 5224; 70323358798112; 1408; 5225; 70323352741540;
  1408; 70323358792388; 0; 1408; 5223; 70323358796124|]


I can see how one might detect at runtime that a string is being used
as the default element, but I'm not sure if there is a way to tell the
difference between arrays, tuples and records at runtime. I thought
that it might be possible to get a segfault during GC by putting a
string full of 0 bytes in, but I wasn't able to actually make this
happen.

Bob

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [Caml-list] ANN: unboxed-arrays-0.1: Unboxed arrays for OCaml
  2014-01-20 12:24   ` Bob Atkey
@ 2014-01-20 12:33     ` Mark Shinwell
  0 siblings, 0 replies; 4+ messages in thread
From: Mark Shinwell @ 2014-01-20 12:33 UTC (permalink / raw)
  To: Bob Atkey; +Cc: caml-list

On 20 January 2014 12:24, Bob Atkey <bob.atkey@gmail.com> wrote:
>> (In case you are wondering,
>> if you put some other variety of value in the array, then it's
>> probably safe but won't be useful---since in particular there
>> is no "set" method provided.)
>
> Unfortunately, Flat_array seems to go a bit awry when you put strings
> or arrays in as the default element

Ah, yeah, so thinking about this: any value whose accessors
depend on the size of the value [that is stored within the
value itself] isn't going to work.  The "sub-values" (e.g.
the individual records) inside a single [Flat_array] value
don't have the usual size field in the header.  Strings are
an example that will hit this problematic case.

Mark

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2014-01-20 12:33 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-01-19 18:01 [Caml-list] ANN: unboxed-arrays-0.1: Unboxed arrays for OCaml Bob Atkey
2014-01-20  8:47 ` Mark Shinwell
2014-01-20 12:24   ` Bob Atkey
2014-01-20 12:33     ` Mark Shinwell

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).