Subject: Re: [Caml-list] Re: Why OCaml sucks
From: Gerd Stolpmann
To: Brian Hurt
Cc: Robert Fischer, caml-list@yquem.inria.fr
Date: Tue, 13 May 2008 16:25:49 +0200

On Tuesday, 13 May 2008, at 10:01 -0400, Brian Hurt wrote:
> Robert Fischer wrote:
> >
> > > > > > 5. Strings: pushing unicode throughout a general purpose
> > > > > > language is a mistake, IMHO. This is why languages like Java
> > > > > > and C# are so slow.
> > > > > >
> > > > > Unicode by itself, when wider-than-byte encodings are used, adds
> > > > > "zero" runtime overhead; the only overhead is storage (2 or 4
> > > > > bytes per character).
> > > > >
> > > > You cannot degrade memory consumption without also degrading
> > > > performance. Moreover, there are hidden costs such as the added
> > > > complexity in a lexer, which potentially has 256x larger dispatch
> > > > tables or an extra indirection for every byte read.
> > > >
> > Okay, I was going to let this slide, but it kept resurfacing and
> > annoying me.
> >
> > Is there any empirical support for the assertion that Java and C# are
> > slow because of *unicode*? Of all things, *unicode*? Isn't the fact
> > that they're bytecode languages a bigger hit? At least with the JVM,
> > the hypercomplicated GC should probably take some of the blame, too --
> > I've seen 2x speed increases by *reducing* the space available to the
> > GC, and 10x speed increases by boosting the space available to
> > ridiculous levels so that the full GC barely ever has to fire.
> > The nigh-universal optimization-ruining mutable data and virtual
> > function (i.e. method) dispatch surely doesn't help either. And this
> > is to say nothing of user-space problems like the explosion of
> > nontrivial types associated with the object-driven style. With all
> > that going on, you're blaming their *Unicode support* for why they're
> > slow? "This is why languages like Java and C# are so slow." Really?
> > Got evidence for that?
> >
> > ~~ Robert.
>
> The problem, as I understand it, is in writing parsers. Your standard
> finite-automaton-based regular expression library or lexical analyzer
> is based, at its heart, on a table lookup: you have a 2D array whose
> size is the number of input characters times the number of states. For
> ASCII input, the number of possible input characters is small, at most
> 256. 256 input characters times hundreds of states isn't that big of a
> table; we're looking at sizes in the tens of KB, easily manageable even
> in the bad old days of 64K segments. But even going to UTF-16 ups the
> number of input characters from 256 to 65,536, and now a moderately
> large state machine (hundreds of states) weighs in at tens of megabytes
> of table space. And, of course, if you try to handle the entire 31-bit
> Unicode code point space, welcome to really large tables :-).
>
> The solution, I think, is to change the implementation of your finite
> automata to use some data structure smarter than a flat 2D array, but
> that's just me.

Note that we have such a lexer already: ulex. It uses binary decision
trees, AFAIK, so the resulting code has moderate size. It can, however,
take some time for ulex to convert the NFA to a DFA.

Gerd
--
------------------------------------------------------------
Gerd Stolpmann * Viktoriastr. 45 * 64293 Darmstadt * Germany
gerd@gerd-stolpmann.de          http://www.gerd-stolpmann.de
Phone: +49-6151-153855                  Fax: +49-6151-997714
------------------------------------------------------------
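
P.S. Since the table-size arithmetic comes up above, here is a minimal
OCaml sketch of the difference between a flat transition table and a
decision tree over character ranges. This is not ulex's actual code;
all names and types below are made up for illustration. The flat table
costs (number of states) x (alphabet size) entries, so hundreds of
states times 65,536 UTF-16 code units is already tens of megabytes,
while the tree stores one node per distinct character range, no matter
how many code points each range covers.

(* Flat 2D table: one row per state, one column per input character.
   For a byte-oriented lexer this is n_states * 256 ints; for UTF-16
   it becomes n_states * 65536, hence the size explosion. *)
let table_next (table : int array array) (state : int) (c : int) : int =
  table.(state).(c)

(* Decision tree over character ranges, one tree per state: a whole
   range of code points mapping to one target costs a single leaf,
   so the size depends on the number of distinct ranges, not on the
   size of the alphabet. *)
type dtree =
  | Leaf of int                   (* target state, -1 = reject *)
  | Node of int * dtree * dtree   (* Node (split, lt, ge):
                                     take lt if c < split, else ge *)

let rec tree_next (t : dtree) (c : int) : int =
  match t with
  | Leaf target -> target
  | Node (split, lt, ge) ->
      if c < split then tree_next lt c else tree_next ge c

(* Example: from some state, ASCII letters go to state 1 and every
   other code point is rejected. Nine nodes and leaves cover the whole
   code point space; a flat table row would need 65,536+ entries. *)
let letters =
  Node (0x41, Leaf (-1),
    Node (0x5B, Leaf 1,
      Node (0x61, Leaf (-1),
        Node (0x7B, Leaf 1, Leaf (-1)))))

let () =
  assert (tree_next letters (Char.code 'z') = 1);
  assert (tree_next letters (Char.code '!') = -1)

The tradeoff: a tree lookup is a handful of comparisons, logarithmic
in the number of ranges, instead of a single array access, so it is
typically a bit slower per character but orders of magnitude smaller.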