caml-list - the Caml user's mailing list
 help / color / mirror / Atom feed
From: Gabriel Scherer <gabriel.scherer@gmail.com>
To: Radu Grigore <radugrigore@gmail.com>
Cc: fa.caml@googlegroups.com, Caml List <caml-list@inria.fr>
Subject: Re: [Caml-list] Writing the function Set.map using first-class modules and 4.00 inference
Date: Mon, 5 Nov 2012 11:12:11 +0100	[thread overview]
Message-ID: <CAPFanBFF-onqLFqSaa200kKS9Bxx2roE47YorfQd4L2BoUzkPg@mail.gmail.com> (raw)
In-Reply-To: <9da48315-639b-4de2-a3d5-f86685602425@googlegroups.com>

You haven't tested the bad case, where the list is nearly sorted but
not exactly -- random lists are a good-case behavior as unsortedness
is detected very early. I don't expect the performances to be that bad
("len" avoids any allocation). Another problem is that you call the
user-defined comparison more than necessary, which can be problematic
if it is costly -- you can solve that by rather using of_sorted_list
to build the partial result list before raising Unsolved rather than
rebuilding it from the ground up.

I have played with such dynamic algorithm selection in the Map
implementation of Batteries¹, but I'm not really satisfied with it:
you end up making guesses about the user situation which are hard to
evaluate (we have no real-world-looking data to measure against) and
pay costs that a demanding user could request to skip by having
of_sorted_list and of_list exposed directly, and making the choice
herself.

¹: the use case was similar: there is an implementation of
non-functorized maps inherited from Extlib that some user find
convenient to use, where maps are records carrying their own
comparison operation, but then the implementation of binary functions
such as union/diff/intersect and merge are very different depending on
whether the two comparison operations provided are compatible or not.
  https://github.com/ocaml-batteries-team/batteries-included/blob/master/src/batMap.ml#L571

A note on API design: the type abstraction of Set and Map are
inconvenient when users want to implement better performing operations
on top of the balanced tree model. My personal opinion is that we
should preserve the abstraction of Set and Map, reject all operations
that rely on a representation assumption of sorted balanced trees
(unfortunately some of them, such as split, have already crept in),
and provide as a different module a "purely algorithmic" transparent
definition of balanced binary trees that users could extend easily and
on which they would build their own data structures. I insist on
having both abstract purpose-oriented modules and concrete
algorithm-oriented data structures, and not mixing them as is
currently done.

Not relying on ordered representations would be useful if we
considered moving to a different representation for sets and maps.
Thibault Suzanne has been experimenting with Hash Array Mapped Tries
(HAMT) in OCaml and the results are relatively promising, but this
implementation would have a hard time transparently replacing stdlib's
Set and Map because of their exposed balanced implementation details.
  https://gitorious.org/ocaml-hamt


On Mon, Nov 5, 2012 at 9:05 AM, Radu Grigore <radugrigore@gmail.com> wrote:
> Just to clarify what I meant, see the attached (hackish) benchmarks and patch.
>
> -- begin benchmarks.ml --
> (* Times (in seconds)
>
> 1000000 random lists of length 10
>   of_list1 9.23
>   of_list2 9.80
>   S.of_list 10.15 (+3.6% compared to of_list2)
>
> 1 random list of length 10000000
>   of_list1 Stack_overflow
>   of_list2 102.39
>   S.of_list 104.15 (+1.8% compared to of_list2)
>
> 1000000 sorted lists of length 10
>   of_list1 11.14
>   of_list2 12.51
>   S.of_list 4.53 (-63% compared to of_list2)
>
> 1 sorted list of length 10000000
>   of_list1 Stack_overflow
>   of_list2 71.91
>   S.of_list 6.73 (-90% compared to of_list2)
>
> *)
> open Printf
>
> module S = Set.Make (struct type t = int let compare = compare end)
>
> let of_list1 xs = List.fold_right S.add xs S.empty
> let of_list2 xs = List.fold_left (fun x y -> S.add y x ) S.empty xs
>
> let mk_list n =
>   let xs = ref [] in
>   for i = n downto 1 do xs := i (*Random.int max_int*) :: !xs done;
>   !xs
>
> let time s f xss =
>   let a = Sys.time () in
>   List.iter (fun xs -> ignore (f xs)) xss;
>   let b = Sys.time () in
>   printf "%s %.2f\n" s (b -. a)
>
> let _ =
>   Random.init 0;
>   let n = 1 in
>   let xss = ref [] in
>   for i = 1 to n do xss := mk_list 10000000 :: !xss done;
> (*   time "of_list1" of_list1 !xss; *)
>   time "of_list2" of_list2 !xss;
>   time "S.of_list" S.of_list !xss
> -- end benchmarks.ml --
>
>
>
> --begin of_list.patch--
> Index: stdlib/moreLabels.mli
> ===================================================================
> --- stdlib/moreLabels.mli       (revision 13061)
> +++ stdlib/moreLabels.mli       (working copy)
> @@ -155,6 +155,7 @@
>        val partition : f:(elt -> bool) -> t -> t * t
>        val cardinal : t -> int
>        val elements : t -> elt list
> +      val of_list : elt list -> t
>        val min_elt : t -> elt
>        val max_elt : t -> elt
>        val choose : t -> elt
> Index: stdlib/set.ml
> ===================================================================
> --- stdlib/set.ml       (revision 13061)
> +++ stdlib/set.ml       (working copy)
> @@ -43,6 +43,7 @@
>      val partition: (elt -> bool) -> t -> t * t
>      val cardinal: t -> int
>      val elements: t -> elt list
> +    val of_list: elt list -> t
>      val min_elt: t -> elt
>      val max_elt: t -> elt
>      val choose: t -> elt
> @@ -343,6 +344,29 @@
>          Empty -> accu
>        | Node(l, v, r, _) -> elements_aux (v :: elements_aux accu r) l
>
> +    exception Unsorted
> +
> +    let rec of_sorted_list n xs =
> +      if n = 0 then (empty, xs) else begin
> +        let l, xs = of_sorted_list (n / 2) xs in
> +        let v, xs = (match xs with v :: xs -> v, xs | [] -> assert false) in
> +        let r, xs = of_sorted_list (n - n / 2 - 1) xs in
> +        (create l v r, xs)
> +      end
> +
> +    let of_list = function
> +        [] -> empty
> +      | (x :: xs) as ys -> begin
> +          let rec len n x = function
> +              [] -> n
> +            | z :: zs ->
> +                if Ord.compare x z < 0
> +                then len (n + 1) z zs
> +                else raise Unsorted in
> +          try let n = len 1 x xs in let set, _ = of_sorted_list n ys in set
> +          with Unsorted -> List.fold_left (fun t elt -> add elt t) empty ys
> +        end
> +
>      let elements s =
>        elements_aux [] s
>
> Index: stdlib/set.mli
> ===================================================================
> --- stdlib/set.mli      (revision 13061)
> +++ stdlib/set.mli      (working copy)
> @@ -121,6 +121,9 @@
>         to the ordering [Ord.compare], where [Ord] is the argument
>         given to {!Set.Make}. *)
>
> +    val of_list : elt list -> t
> +    (** Makes a set out of a list. *)
> +
>      val min_elt: t -> elt
>      (** Return the smallest element of the given set
>         (with respect to the [Ord.compare] ordering), or raise
> --end of_list.patch--

  reply	other threads:[~2012-11-05 10:12 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <fa.yP8czrxqEGCABee05wzAlISIlN4@ifi.uio.no>
     [not found] ` <fa.rve4x46ugIvZs4BizovvjOCeTG0@ifi.uio.no>
     [not found]   ` <fa.i9E+7rAvghoTZ1MXH4g+uL4dpCY@ifi.uio.no>
     [not found]     ` <fa.AX0JYJa8kXsH/YSZ08tuyBc8m+0@ifi.uio.no>
     [not found]       ` <fa.YAoOn2iUcpHbr53eeBfGhPTDvtM@ifi.uio.no>
2012-11-05  8:05         ` Radu Grigore
2012-11-05 10:12           ` Gabriel Scherer [this message]
2012-11-05 12:02             ` Radu Grigore
     [not found] <fa.FUGAe9RTYsSPA4RwbpKO14B0oVo@ifi.uio.no>
2012-11-02 14:59 ` Radu Grigore
2012-11-02 15:11   ` Radu Grigore
2012-11-02 16:00     ` Gabriel Scherer
2012-11-02 16:21       ` Radu Grigore
2012-11-01 21:00 Jeff Meister

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CAPFanBFF-onqLFqSaa200kKS9Bxx2roE47YorfQd4L2BoUzkPg@mail.gmail.com \
    --to=gabriel.scherer@gmail.com \
    --cc=caml-list@inria.fr \
    --cc=fa.caml@googlegroups.com \
    --cc=radugrigore@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).