From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <jon@ffconsultancy.com>
X-Spam-Checker-Version: SpamAssassin 3.1.3 (2006-06-01) on yquem.inria.fr
X-Spam-Level: 
X-Spam-Status: No, score=0.4 required=5.0 tests=AWL autolearn=disabled 
	version=3.1.3
X-Original-To: caml-list@yquem.inria.fr
Delivered-To: caml-list@yquem.inria.fr
Received: from mail3-relais-sop.national.inria.fr (mail3-relais-sop.national.inria.fr [192.134.164.104])
	by yquem.inria.fr (Postfix) with ESMTP id 38022BB84
	for <caml-list@yquem.inria.fr>; Tue, 17 Feb 2009 10:49:09 +0100 (CET)
X-IronPort-Anti-Spam-Filtered: true
X-IronPort-Anti-Spam-Result: AuMAAGQWmknUnwdjlGdsb2JhbACCPZIIAQEBAQkLCAkRBLswhBMG
X-IronPort-AV: E=Sophos;i="4.38,222,1233529200"; 
   d="scan'208";a="23069336"
Received: from relay.pcl-ipout01.plus.net ([212.159.7.99])
  by mail3-smtp-sop.national.inria.fr with ESMTP/TLS/RC4-SHA; 17 Feb 2009 10:49:08 +0100
X-IronPort-Anti-Spam-Filtered: true
X-IronPort-Anti-Spam-Result: ApsEAFMXmknUnw4T/2dsb2JhbACCPc13hBMG
Received: from pih-relay06.plus.net ([212.159.14.19])
  by relay.pcl-ipout01.plus.net with ESMTP; 17 Feb 2009 09:49:03 +0000
Received: from [87.114.148.153] (helo=leper.local)
	 by pih-relay06.plus.net with esmtp (Exim) id 1LZMZb-0002GA-1L
	for caml-list@yquem.inria.fr; Tue, 17 Feb 2009 09:49:03 +0000
From: Jon Harrop <jon@ffconsultancy.com>
Organization: Flying Frog Consultancy Ltd.
To: caml-list@yquem.inria.fr
Subject: Re: [Caml-list] Threads performance issue.
Date: Tue, 17 Feb 2009 09:53:56 +0000
User-Agent: KMail/1.9.9
References: <2184b2340902160715y1f935b5ehc0e6195b3f75b66b@mail.gmail.com> <2184b2340902162340s540c5ac7g9f42b59d03f643cb@mail.gmail.com> <20090217085944.GC29651@janestcapital.com>
In-Reply-To: <20090217085944.GC29651@janestcapital.com>
MIME-Version: 1.0
Content-Type: text/plain;
  charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable
Content-Disposition: inline
Message-Id: <200902170953.56911.jon@ffconsultancy.com>
X-Plusnet-Relay: 1451dc9600f7fc121d9a70621a5c2aab
X-Spam: no; 0.00; shinwell:01 0100,:01 ocamlopt:01 foo:01 ocamlopt:01 -thread:01 cmxa:01 cmxa:01 foo:01 uncomment:01 threading:01 threading:01 ocaml's:01 ocaml:01 byte:01 

On Tuesday 17 February 2009 08:59:44 Mark Shinwell wrote:
> On Tue, Feb 17, 2009 at 08:40:11AM +0100, R=E9mi Dewitte wrote:
> > I have made some further experiments.
> > I have a functional version of the reading algorithm. I have the origin=
al
> > imperative version of the algorithm.
> > Either it is linked to thread (T) or not (X). Either it uses extlib (E)
> > or not (X).
>
> Using:
>
> ocamlopt -o foo transf.ml
> ocamlopt -thread -o foothread transf.ml unix.cmxa threads.cmxa
>
> on local disk with a 24Mb, approx. 500,000-line CSV file, I only see a
> minor slowdown with foothread as opposed to foo.  (A minor slowdown would
> inded be expected.)  So I'm confused as to why your results are so
> different.

If you uncomment R=E9mi's line of code that requires threading (or factor i=
t out=20
into a different file and compile that as well when using threading) then y=
ou=20
should be able to reproduce his results as I have.

> You could use ocamlopt -p and run gprof on the resulting gmon.out file.

Indeed the profiling results are most enlightening. Without threads everyth=
ing=20
looks great:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total          =20
 time   seconds   seconds    calls  ms/call  ms/call  name   =20
 38.89      0.07     0.07        3    23.33    60.00  camlTransf__loop_109
 22.22      0.11     0.04  2873940     0.00     0.00  caml_ml_input_char
 11.11      0.13     0.02  2873940     0.00     0.00 =20
camlTransf__read_char_106
 11.11      0.15     0.02   381696     0.00     0.00  camlTransf__code_begin
=2E..

With threads, everything looks awful:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total          =20
 time   seconds   seconds    calls  ms/call  ms/call  name   =20
 22.22      0.08     0.08                            =20
caml_thread_leave_blocking_section
 16.67      0.14     0.06                            =20
caml_thread_enter_blocking_section
 13.89      0.19     0.05        3    16.67    50.00  camlTransf__loop_109
  9.72      0.23     0.04                             caml_io_mutex_unlock
  5.56      0.24     0.02  3255698     0.00     0.00  caml_c_call
=2E..

So most of the time is spent locking and unlocking OCaml's good old giant=20
global lock. Unfortunately the blocking calls are "spontaneous" so you cann=
ot=20
track them down using a profile but, given the huge rise in time spent in=20
read_char, my guess is that OCaml has introduced an OS kernel lock around=20
every single byte read! If so, I'm surprised it is running *this* fast...

R=E9mi: assuming I am correct, I'd recommend using memory mapping (and rewr=
iting=20
that code ;-).

=2D-=20
Dr Jon Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/?e