From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Original-To: caml-list@sympa.inria.fr Delivered-To: caml-list@sympa.inria.fr Received: from mail3-relais-sop.national.inria.fr (mail3-relais-sop.national.inria.fr [192.134.164.104]) by sympa.inria.fr (Postfix) with ESMTPS id 0E0FA7F787 for ; Tue, 8 Nov 2016 16:46:12 +0100 (CET) Authentication-Results: mail3-smtp-sop.national.inria.fr; spf=None smtp.pra=gabriel.scherer@gmail.com; spf=Pass smtp.mailfrom=gabriel.scherer@gmail.com; spf=None smtp.helo=postmaster@mail-it0-f54.google.com Received-SPF: None (mail3-smtp-sop.national.inria.fr: no sender authenticity information available from domain of gabriel.scherer@gmail.com) identity=pra; client-ip=209.85.214.54; receiver=mail3-smtp-sop.national.inria.fr; envelope-from="gabriel.scherer@gmail.com"; x-sender="gabriel.scherer@gmail.com"; x-conformance=sidf_compatible Received-SPF: Pass (mail3-smtp-sop.national.inria.fr: domain of gabriel.scherer@gmail.com designates 209.85.214.54 as permitted sender) identity=mailfrom; client-ip=209.85.214.54; receiver=mail3-smtp-sop.national.inria.fr; envelope-from="gabriel.scherer@gmail.com"; x-sender="gabriel.scherer@gmail.com"; x-conformance=sidf_compatible; x-record-type="v=spf1" Received-SPF: None (mail3-smtp-sop.national.inria.fr: no sender authenticity information available from domain of postmaster@mail-it0-f54.google.com) identity=helo; client-ip=209.85.214.54; receiver=mail3-smtp-sop.national.inria.fr; envelope-from="gabriel.scherer@gmail.com"; x-sender="postmaster@mail-it0-f54.google.com"; x-conformance=sidf_compatible IronPort-PHdr: =?us-ascii?q?9a23=3AZ9EmSxyJkTvbbtnXCy+O+j09IxM/srCxBDY+r6Qd?= =?us-ascii?q?1+sfIJqq85mqBkHD//Il1AaPBtSBraMawLeM+4nbGkU4qa6bt34DdJEeHzQksu?= =?us-ascii?q?4x2zIaPcieFEfgJ+TrZSFpVO5LVVti4m3peRMNQJW2WVTerzWI4CIIHV2nbEwu?= =?us-ascii?q?d76zS9CZ0p7//tvx0qWbWx9Piju5bOE6BzSNhiKViPMrh5B/IL060BrDrygAUe?= =?us-ascii?q?1XwWR1OQDbxE6ktY+YtaRu+CVIuv8n69UIEeCjJ/x5HvRkC2EBN206rJnssRTM?= =?us-ascii?q?ZQyM43oeFGIMnUwMSyfM5gv7U5O5iSD6u/BwwmHOMsT8V7E5XXK55KdmUhLyoC?= =?us-ascii?q?gCPj89tmrQj5o0xJNSuhWn7zl+xZXXccnBJf9/eLjebPsYTGxMRdpLWiFdRIi7?= =?us-ascii?q?at1LR+EIOOIQspLwvUBG+RC3AA3pAOL01hdJgGX31Os0ybJlWS7c0QMnBcNGlX?= =?us-ascii?q?3Qod71Pe9GXuW8yKTDzTzrYPZf2DO744/NJEMPu/aJCJ15e9DQxE1nLAjFg0+d?= =?us-ascii?q?s8SxMDqfzOUAty6A5OptT++1o2EiogB15DOow5F/2cHymosJxwWcpm1Cy4EvKI?= =?us-ascii?q?j9ERYjbA=3D=3D?= X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: =?us-ascii?q?A0ByAAA18iFYfzbWVdFdHQEFAQsBGQYMg?= =?us-ascii?q?wQBAQEBAYF2B6Q4iFaGY4cihiQCggUHQxABAQEBAQEBAQEBARIBAQkLCwkbMoI?= =?us-ascii?q?zBAEVAQSCDwEBAQMBEhEdARsPDwMBCwYFCw0qAgIiAREBBQEcBgESCAwHB4gfA?= =?us-ascii?q?QMPCKQ/gTI/MotPggwGAS2DGgWDdAoZJw1UgxABCgEBARsCBhCGLoNSgQWEFIM?= =?us-ascii?q?4glwFj1KKXYFrjl2BboRyiTSHMoV+gkMTHoESNYEDMoMHDxyBeyA0hWyBTgEBA?= =?us-ascii?q?Q?= X-IPAS-Result: =?us-ascii?q?A0ByAAA18iFYfzbWVdFdHQEFAQsBGQYMgwQBAQEBAYF2B6Q?= =?us-ascii?q?4iFaGY4cihiQCggUHQxABAQEBAQEBAQEBARIBAQkLCwkbMoIzBAEVAQSCDwEBA?= =?us-ascii?q?QMBEhEdARsPDwMBCwYFCw0qAgIiAREBBQEcBgESCAwHB4gfAQMPCKQ/gTI/Mot?= =?us-ascii?q?PggwGAS2DGgWDdAoZJw1UgxABCgEBARsCBhCGLoNSgQWEFIM4glwFj1KKXYFrj?= =?us-ascii?q?l2BboRyiTSHMoV+gkMTHoESNYEDMoMHDxyBeyA0hWyBTgEBAQ?= X-IronPort-AV: E=Sophos;i="5.31,462,1473112800"; d="scan'208,217";a="199645909" Received: from mail-it0-f54.google.com ([209.85.214.54]) by mail3-smtp-sop.national.inria.fr with ESMTP/TLS/AES128-GCM-SHA256; 08 Nov 2016 16:46:10 +0100 Received: by mail-it0-f54.google.com with SMTP id q124so106041521itd.1 for ; Tue, 08 Nov 2016 07:46:10 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:from:date:message-id:subject:to; bh=1TE3zaNy43IHG5d1ncVeNSKfobr0aj8mxLSFZyfpFoY=; b=H5nMdI7xeGkY9N2Rugj89IkPEurdquNbjYH88wlN6EolpbAd3oYgtZG56rs4KV+58D jfmjw84+iBgV/3Mz8bQLobAU/Qng56kPHg4hF1PlER8YQkrm4biDD76mJEKWhuqelP82 DX1/tgIdNrY2I3sswLFAXdrcfzuhQM0SKqH5DDoeF6GkTvuFNPL4AQtahMoWOrdG0YgC 41PZpqMrMmn9Gaq9CCO7pXuymq3MgvPnJ9QNKaKGFiwtJvF/6LdUCxpQ+kLFcYJTwmaL S67YlZ8MiHBn1eeSlcfqTUv8YF7012L+q7sVt6xIqGeu+Vdz3vE8flUq9he878B5G05k GK9g== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:mime-version:in-reply-to:references:from:date :message-id:subject:to; bh=1TE3zaNy43IHG5d1ncVeNSKfobr0aj8mxLSFZyfpFoY=; b=RUe4ETskzEjaJlA3rdpt7Tj0Ei1XMaTwv8tL7l8KsEo6ZI23ijTMRLAq8mXNkTAn4/ 47iOQZBMgTksc4k4PACivNOlsOLY2Dp5RDDDErHZYkzsvDm8p/v6kam+9JwCFsXX0ZVS ihFM0nrTmVzlho99EdzRRzkzZTV4iOckNTMPTJ0ns43mmpZ4L5xZrzqw7KgyqW32Z5Sx LbL6NDBlD62OBH+MxJc8+PEQtSyKStCFsf1GJZO5LsVXYtNS9rgDFWGAH7sqw2EKwmqL 5gCD6z8cnWFzZ+nslo2yAq/Lo4SRDZ2rASOObr5KCHBzmnWkJDAqFxkp30q1o1v5N2YM QVYw== X-Gm-Message-State: ABUngvcyr8WxDdQu4bUrexZAzs1rFLGk4hHU88AqVWR2AKpL/iKS4rHhz48XNBLKGnpKHBm8ISnhwDY8eFdvlg== X-Received: by 10.107.134.26 with SMTP id i26mr13600268iod.182.1478619966842; Tue, 08 Nov 2016 07:46:06 -0800 (PST) MIME-Version: 1.0 Received: by 10.79.30.129 with HTTP; Tue, 8 Nov 2016 07:45:26 -0800 (PST) In-Reply-To: <20161108124731.GA6455@Magus.localnet> References: <20161108124731.GA6455@Magus.localnet> From: Gabriel Scherer Date: Tue, 8 Nov 2016 10:45:26 -0500 Message-ID: To: Oleg , Gabriel Scherer , Yaron Minsky , caml users , Gregory Malecha Content-Type: multipart/alternative; boundary=001a113ecf34e1b5920540cc0948 Subject: Re: [Caml-list] The fastest stream library [Was: Question about --001a113ecf34e1b5920540cc0948 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable I think we are on the same wavelength, but given that you are an expert on streaming library I'm going to expand on Enum as you might be interested. There are many feature axes for streaming libraries and Enum tries to cover a bit too much of them. It was designed, by the Extlib project, as a go-between library to translate from one data structure to the other, and to be as flexible as possible: - It supports a wide variety of combinators common to stream libraries, on both finite and infinite stream; there is also a notion of whether it's easy/fast, for a given stream, to compute its size (without consuming the stream), to eg. preallocate arrays when converting from enum to array, and combinators try to preserve that information when they can - In the style of the venerable Stream module, Enum is a "destructive streaming" library, in that it mutates its internal state when you access the next element. I think the idea is that in general this is more memory-efficient (you cannot keep a reference to the start of the stream that leaks memory), but it's also a bit error-prone in non-linear usage scenarios. - To still support non-linear usage scenarios, Enum streams have a "clone" operation that duplicates a stream into two copies. The useful, interesting and difficult thing about clone is that getting the same element on both the original and the clone stream should not duplicate any side-effects involved in computing the value: clone should duplicate the stream of values, but keep threading the computation effect. In my experience most of the complexity (and fragility) of the current Enum implementation comes from the destructive aspects and cloning. The implementation is in a fairly imperative style (enumerations are defined as object-style record of *mutable* methods that close over the enumeration state) and there is a fair amount of bookkeeping involved on each update/next to support this feature set. This is not a big issue for the original use-case, which is to convert between data structures (have 2*N conversion functions, Foo.{to,of}_enum, instead of N^2 conversion functions), where performance bottlenecks are usually on the data-structure-processing side, but this means that using Enum for heavy stream processing is known to be slow. Again, I would expect BatSeq (which is neither destructive nor memoizing) to do much better on these workflows. It is perfectly reasonable to question whether we need this complex feature set for a central streaming library. I have mixed thoughts on this question: - as a library developer, my experience is that the current implementation is too fragile and too slow for its own good, and I think that users would be better served by a simpler abstraction that does less in a more robust way. The pragmatic choice is thus to use simpler stream libraries. Another interesting point is that, if you develop those simpler, more robust stream libraries, it is sometimes possible to reuse them to build more complex streams on top of them (for example, a solid purely-functional stream implementation can be turned into a destructive stream implementation by passing references to functional streams around), so this decomposition of concern would also help rebuilding a more robust Enum. Simon Cruanes did good work in that direction in preliminary versions of his Containers/Sequence libraries (I can't find specific references right now, to the various streaming types with different feature support). - as a language designer, I think that our inability to implement Enum robustly on the first attempt is the sign of an interesting problem to be solved. The feature set, from a user point of view, is *not* unreasonable, and justifiably useful in some scenarios. As language designers, we know that the cloning / non-duplication of effects is like playing with fire, but that is still a problem that we should be able to solve. (We have people properly design lock-free data structures, which are also unreasonably complex; we should be demanding of our standard libraries!) If I had a good way to *specify* the behaviour of a streaming library respecting all these constraints, I think that would also clarify how to give an simpler, more efficient, more robust implementation of it. (Another concern that it could be interesting to throw in the mix is batching.) I regret not being informed enough about the myriad of Haskell-side solutions to the problem of effectful streaming, as there may also be interesting inspiration to be taken. I tried to get Fran=C3=A7ois Pottier interested in= the problem of formalizing a specification for Enum, but he ran away screaming before I finished enumerating the design points above. On Tue, Nov 8, 2016 at 7:47 AM, Oleg wrote: > > Gabriel wrote: > > > I regret not being pointed to this work earlier, because I think that > > measuring the performance of Enum as a representative OCaml stream > library > > performance is not the best choice : Enum is designed to be flexible in= a > > bit too many ways to be efficient on pure-streaming scenarios (it > supports > > a generic "clone", and effectful generators, that makes the codebase too > > complex for its own good; I think that Batteries community is aware that > > Enum is not as good as it should be right now). There are other, more > > efficient streaming libraries out there, including BatSeq in Batteries > that > > should already be sensibly faster; Core and Containers also have more > > efficient streaming libraries. > > Please do note that we make the double claim: wide expressivity *and* > guaranteed highest performance. We support not just the standard map and > filter, but zipping of streams with flat_maps in them, with both > finite and infinite streams. So we do want generality. From this > point, Enum was quite an appropriate. > > Speaking of benchmarks, the real point of comparison is the > hand-written code for each benchmark, written with imperative loops, > references, etc. (We admit that some of that on several occasions > the hand-written baseline code was fine-tuned when it turned out that the > generated code outperformed it.) Please do feel free to suggest faster > versions. > --001a113ecf34e1b5920540cc0948 Content-Type: text/html; charset=UTF-8 Content-Transfer-Encoding: quoted-printable
I think we are on the same wavele= ngth, but given that you are an expert on streaming library I'm going t= o expand on Enum as you might be interested. There are many feature axes fo= r streaming libraries and Enum tries to cover a bit too much of them. It wa= s designed, by the Extlib project, as a go-between library to translate fro= m one data structure to the other, and to be as flexible as possible:
- It supports a wide variety of combinators common to stream librar= ies, on both finite and infinite stream; there is also a notion of whether = it's easy/fast, for a given stream, to compute its size (without consum= ing the stream), to eg. preallocate arrays when converting from enum to arr= ay, and combinators try to preserve that information when they can
- In the style of the venerable Stream module, Enum is a "destructive= streaming" library, in that it mutates its internal state when you ac= cess the next element. I think the idea is that in general this is more mem= ory-efficient (you cannot keep a reference to the start of the stream that = leaks memory), but it's also a bit error-prone in non-linear usage scen= arios.
- To still support non-linear usage scenarios, Enum streams= have a "clone" operation that duplicates a stream into two copie= s. The useful, interesting and difficult thing about clone is that getting = the same element on both the original and the clone stream should not dupli= cate any side-effects involved in computing the value: clone should duplica= te the stream of values, but keep threading the computation effect.

=
In my experience most of the complexity (and fragility) of the c= urrent Enum implementation comes from the destructive aspects and cloning. = The implementation is in a fairly imperative style (enumerations are define= d as object-style record of *mutable* methods that close over the enumerati= on state) and there is a fair amount of bookkeeping involved on each update= /next to support this feature set. This is not a big issue for the original= use-case, which is to convert between data structures (have 2*N conversion= functions, Foo.{to,of}_enum, instead of N^2 conversion functions), where p= erformance bottlenecks are usually on the data-structure-processing side, b= ut this means that using Enum for heavy stream processing is known to be sl= ow. Again, I would expect BatSeq (which is neither destructive nor memoizin= g) to do much better on these workflows.

It is perf= ectly reasonable to question whether we need this complex feature set for a= central streaming library. I have mixed thoughts on this question:

=
- as a library developer, my experience is that the current implement= ation is too fragile and too slow for its own good, and I think that users = would be better served by a simpler abstraction that does less in a more ro= bust way. The pragmatic choice is thus to use simpler stream libraries. Ano= ther interesting point is that, if you develop those simpler, more robust s= tream libraries, it is sometimes possible to reuse them to build more compl= ex streams on top of them (for example, a solid purely-functional stream im= plementation can be turned into a destructive stream implementation by pass= ing references to functional streams around), so this decomposition of conc= ern would also help rebuilding a more robust Enum. Simon Cruanes did good w= ork in that direction in preliminary versions of his Containers/Sequence li= braries (I can't find specific references right now, to the various str= eaming types with different feature support).

- as a language designer, I thin= k that our inability to implement Enum robustly on the first attempt is the= sign of an interesting problem to be solved. The feature set, from a user = point of view, is *not* unreasonable, and justifiably useful in some scenar= ios. As language designers, we know that the cloning / non-duplication of e= ffects is like playing with fire, but that is still a problem that we shoul= d be able to solve. (We have people properly design lock-free data structur= es, which are=20 also unreasonably complex; we should be demanding of our standard=20 libraries!) If I had a good way to *specify* the behaviour of a streaming l= ibrary respecting all these constraints, I think that would also clarify ho= w to give an simpler, more efficient, more robust implementation of it. (An= other concern that it could be interesting to throw in the mix is batching.= ) I regret not being informed enough about the myriad of Haskell-side solut= ions to the problem of effectful streaming, as there may also be interestin= g inspiration to be taken. I tried to get Fran=C3=A7ois Pottier interested = in the problem of formalizing a specification for Enum, but he ran away scr= eaming before I finished enumerating the design points above.

On Tue, Nov 8, 2016 a= t 7:47 AM, Oleg <oleg@okmij.org> wrote:

Gabriel wrote:

> I regret not being pointed to this work earlier, because I think that<= br> > measuring the performance of Enum as a representative OCaml stream lib= rary
> performance is not the best choice : Enum is designed to be flexible i= n a
> bit too many ways to be efficient on pure-streaming scenarios (it supp= orts
> a generic "clone", and effectful generators, that makes the = codebase too
> complex for its own good; I think that Batteries community is aware th= at
> Enum is not as good as it should be right now). There are other, more<= br> > efficient streaming libraries out there, including BatSeq in Batteries= that
> should already be sensibly faster; Core and Containers also have more<= br> > efficient streaming libraries.

Please do note that we make the double claim: wide expressivity *and*
guaranteed highest performance. We support not just the standard map and
filter, but zipping of streams with flat_maps in them, with both
finite and infinite streams. So we do want generality. From this
point, Enum was quite an appropriate.

Speaking of benchmarks, the real point of comparison is the
hand-written code for each benchmark, written with imperative loops,
references, etc. (We admit that some of that on several occasions
the hand-written baseline code was fine-tuned when it turned out that the generated code outperformed it.) Please do feel free to suggest faster
versions.

--001a113ecf34e1b5920540cc0948--