caml-list - the Caml user's mailing list
 help / color / mirror / Atom feed
From: Gabriel Scherer <gabriel.scherer@gmail.com>
To: Oleg <oleg@okmij.org>,
	Gabriel Scherer <gabriel.scherer@gmail.com>,
	 Yaron Minsky <yminsky@janestreet.com>,
	caml users <caml-list@inria.fr>,
	 Gregory Malecha <gmalecha@gmail.com>
Subject: Re: [Caml-list] The fastest stream library [Was: Question about
Date: Tue, 8 Nov 2016 10:45:26 -0500	[thread overview]
Message-ID: <CAPFanBEwVP2tcih3y9xUoc9xypeG47Eijf9bQHNhLLAKL5Cz_g@mail.gmail.com> (raw)
In-Reply-To: <20161108124731.GA6455@Magus.localnet>

[-- Attachment #1: Type: text/plain, Size: 6318 bytes --]

I think we are on the same wavelength, but given that you are an expert on
streaming library I'm going to expand on Enum as you might be interested.
There are many feature axes for streaming libraries and Enum tries to cover
a bit too much of them. It was designed, by the Extlib project, as a
go-between library to translate from one data structure to the other, and
to be as flexible as possible:

- It supports a wide variety of combinators common to stream libraries, on
both finite and infinite stream; there is also a notion of whether it's
easy/fast, for a given stream, to compute its size (without consuming the
stream), to eg. preallocate arrays when converting from enum to array, and
combinators try to preserve that information when they can
- In the style of the venerable Stream module, Enum is a "destructive
streaming" library, in that it mutates its internal state when you access
the next element. I think the idea is that in general this is more
memory-efficient (you cannot keep a reference to the start of the stream
that leaks memory), but it's also a bit error-prone in non-linear usage
scenarios.
- To still support non-linear usage scenarios, Enum streams have a "clone"
operation that duplicates a stream into two copies. The useful, interesting
and difficult thing about clone is that getting the same element on both
the original and the clone stream should not duplicate any side-effects
involved in computing the value: clone should duplicate the stream of
values, but keep threading the computation effect.

In my experience most of the complexity (and fragility) of the current Enum
implementation comes from the destructive aspects and cloning. The
implementation is in a fairly imperative style (enumerations are defined as
object-style record of *mutable* methods that close over the enumeration
state) and there is a fair amount of bookkeeping involved on each
update/next to support this feature set. This is not a big issue for the
original use-case, which is to convert between data structures (have 2*N
conversion functions, Foo.{to,of}_enum, instead of N^2 conversion
functions), where performance bottlenecks are usually on the
data-structure-processing side, but this means that using Enum for heavy
stream processing is known to be slow. Again, I would expect BatSeq (which
is neither destructive nor memoizing) to do much better on these workflows.

It is perfectly reasonable to question whether we need this complex feature
set for a central streaming library. I have mixed thoughts on this question:

- as a library developer, my experience is that the current implementation
is too fragile and too slow for its own good, and I think that users would
be better served by a simpler abstraction that does less in a more robust
way. The pragmatic choice is thus to use simpler stream libraries. Another
interesting point is that, if you develop those simpler, more robust stream
libraries, it is sometimes possible to reuse them to build more complex
streams on top of them (for example, a solid purely-functional stream
implementation can be turned into a destructive stream implementation by
passing references to functional streams around), so this decomposition of
concern would also help rebuilding a more robust Enum. Simon Cruanes did
good work in that direction in preliminary versions of his
Containers/Sequence libraries (I can't find specific references right now,
to the various streaming types with different feature support).

- as a language designer, I think that our inability to implement Enum
robustly on the first attempt is the sign of an interesting problem to be
solved. The feature set, from a user point of view, is *not* unreasonable,
and justifiably useful in some scenarios. As language designers, we know
that the cloning / non-duplication of effects is like playing with fire,
but that is still a problem that we should be able to solve. (We have
people properly design lock-free data structures, which are also
unreasonably complex; we should be demanding of our standard libraries!) If
I had a good way to *specify* the behaviour of a streaming library
respecting all these constraints, I think that would also clarify how to
give an simpler, more efficient, more robust implementation of it. (Another
concern that it could be interesting to throw in the mix is batching.) I
regret not being informed enough about the myriad of Haskell-side solutions
to the problem of effectful streaming, as there may also be interesting
inspiration to be taken. I tried to get François Pottier interested in the
problem of formalizing a specification for Enum, but he ran away screaming
before I finished enumerating the design points above.

On Tue, Nov 8, 2016 at 7:47 AM, Oleg <oleg@okmij.org> wrote:

>
> Gabriel wrote:
>
> > I regret not being pointed to this work earlier, because I think that
> > measuring the performance of Enum as a representative OCaml stream
> library
> > performance is not the best choice : Enum is designed to be flexible in a
> > bit too many ways to be efficient on pure-streaming scenarios (it
> supports
> > a generic "clone", and effectful generators, that makes the codebase too
> > complex for its own good; I think that Batteries community is aware that
> > Enum is not as good as it should be right now). There are other, more
> > efficient streaming libraries out there, including BatSeq in Batteries
> that
> > should already be sensibly faster; Core and Containers also have more
> > efficient streaming libraries.
>
> Please do note that we make the double claim: wide expressivity *and*
> guaranteed highest performance. We support not just the standard map and
> filter, but zipping of streams with flat_maps in them, with both
> finite and infinite streams. So we do want generality. From this
> point, Enum was quite an appropriate.
>
> Speaking of benchmarks, the real point of comparison is the
> hand-written code for each benchmark, written with imperative loops,
> references, etc. (We admit that some of that on several occasions
> the hand-written baseline code was fine-tuned when it turned out that the
> generated code outperformed it.) Please do feel free to suggest faster
> versions.
>

[-- Attachment #2: Type: text/html, Size: 6865 bytes --]

  reply	other threads:[~2016-11-08 15:46 UTC|newest]

Thread overview: 16+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-04-21  7:13 [Caml-list] Question about Optimization Gregory Malecha
2016-04-21  9:32 ` Jonas Jensen
2016-04-21 11:45   ` Yaron Minsky
2016-04-21 15:45     ` Gregory Malecha
2016-04-21 16:02       ` Gabriel Scherer
2016-04-21 16:05         ` Daniel Bünzli
2016-04-21 16:35           ` Ben Millwood
2016-04-22 16:09             ` Gregory Malecha
2016-11-08 12:07         ` [Caml-list] The fastest stream library [Was: Question about Optimization] Oleg
2016-11-08 12:05           ` Gregory Malecha
2016-11-08 12:15             ` Gabriel Scherer
2016-11-08 12:47               ` [Caml-list] The fastest stream library [Was: Question about Oleg
2016-11-08 15:45                 ` Gabriel Scherer [this message]
2016-11-12 13:01                   ` Oleg
2016-11-12 16:21                     ` Simon Cruanes
2016-11-12 16:35                       ` Gabriel Scherer

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CAPFanBEwVP2tcih3y9xUoc9xypeG47Eijf9bQHNhLLAKL5Cz_g@mail.gmail.com \
    --to=gabriel.scherer@gmail.com \
    --cc=caml-list@inria.fr \
    --cc=gmalecha@gmail.com \
    --cc=oleg@okmij.org \
    --cc=yminsky@janestreet.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).