From mboxrd@z Thu Jan  1 00:00:00 1970
Received: (from majordomo@localhost) by pauillac.inria.fr (8.7.6/8.7.3) id TAA02196; Wed, 27 Nov 2002 19:06:37 +0100 (MET)
X-Authentication-Warning: pauillac.inria.fr: majordomo set sender to owner-caml-list@pauillac.inria.fr using -f
Received: from concorde.inria.fr (concorde.inria.fr [192.93.2.39]) by pauillac.inria.fr (8.7.6/8.7.3) with ESMTP id TAA02327 for <caml-list@pauillac.inria.fr>; Wed, 27 Nov 2002 19:06:36 +0100 (MET)
Received: from relay.pair.com (relay1.pair.com [209.68.1.20])
	by concorde.inria.fr (8.11.1/8.11.1) with SMTP id gARI6YX15138
	for <caml-list@inria.fr>; Wed, 27 Nov 2002 19:06:35 +0100 (MET)
Received: (qmail 21825 invoked from network); 27 Nov 2002 18:06:31 -0000
Received: from arda.pair.com (HELO compaqreview.d6.com) (209.68.1.133)
  by relay1.pair.com with SMTP; 27 Nov 2002 18:06:31 -0000
X-pair-Authenticated: 209.68.1.133
Message-Id: <4.3.2.7.2.20021127090821.032eae90@localhost>
X-Sender: checker@localhost
X-Mailer: QUALCOMM Windows Eudora Version 4.3.2
Date: Wed, 27 Nov 2002 10:04:55 -0800
To: Damien Doligez <damien.doligez@inria.fr>, caml-list@inria.fr
From: Chris Hecker <checker@d6.com>
Subject: Re: [Caml-list] Why systhreads?
In-Reply-To: <E4BA1D26-0209-11D7-8024-0003930FCE12@inria.fr>
References: <4.3.2.7.2.20021125134858.037b4ef8@localhost>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"; format=flowed
Sender: owner-caml-list@pauillac.inria.fr
Precedence: bulk


[sorry for the longwinded response]

>Do you really think so ?  In my experience, 95% of the costs of threads
>(with shared memory) are in the debugging (of the threads implementation,
>AND of the programs).  Cheap SMP machines and HT do not change the
>cost/benefit equation very much.

Like I said in my previous mail, I think it's going to be similar to 
MMX/SSE.  The performance improvement you get is not worth the development 
and support headache, until the technology is ubiquitous.  Once it's 
everywhere, it becomes worthwhile.  I'm using a middleware library for my 
game right now that requires MMX.  That's finally an acceptable 
requirement.  On xbox, which is a fixed platform with a known cpu, every 
game uses SSE, because it's just guaranteed to be there, and can make a big 
difference if you're willing to work with its problems (using structure of 
arrays layout, etc.).  And let's not even talk about the insanity of the 
PS2 architecture.  Xbox2 will use a CPU with HT, because there won't be any 
Intel CPUs that don't have HT, so it'll get used there by apps.

Now, as you point out, threads are complicated to design, program, and 
debug.  I agree with this completely.  As I said, I never use threaded 
designs if I can avoid it.  However, if it becomes very easy to spawn very 
small scale parallel threads in C on an HT processor, then it could make a 
big performance difference for some algorithms.  People are working on C 
compilers that have these extensions built in.  Intel's got one 
now.  They'll be first, everyone will ignore it until the installed base is 
big enough, and then it'll go into msvc.  MMX, SSE, and 3dnow followed the 
exact same path.

The reason this is different (or has the potential to be different) with HT 
compared to discrete cpus is that a) HT is free so it will be ubiquitous 
eventually, and b) HT drops the thread context switch time to 0.  It's not 
worth starting up a thread on another cpu to do a few instructions worth of 
work, but it is conceivable that it would be for HT.  Again, I think this 
will mirror MMX.  The original version of MMX has a horrible context switch 
time, and overloaded the FPU registers.  It was worthless.  They fixed 
it.  I assume there are similar gotchas with the first version of HT.  But, 
in a couple revs, they'll fix it and it will be possible to have a second 
thread do half the work in a small loop, with no overhead (there'll be a hw 
thread pool, hw wait on mutex/sleep, etc.).

The reason HT can make a performance difference is that your app is 
stalling in the CPU all the time anyway.  Even tight loops aren't memory 
bandwidth bound (unless it's a copy or fill), they're memory access bound; 
there's a huge difference between the two.  HT can take advantage of the 
latter and give you way more utilization, even on a smallscale loop.  In 
theory, anyway.  :)  But, as I said, I have [non-Intel] colleagues who have 
seen big wins with HT on some applications, enough to make them say, "huh, 
this actually works!"

Now, you could just say, "hey, caml's not for that kind of lowlevel stuff", 
which is a fine response.  However, I've been doing a lot of lowlevel stuff 
in my game, all in caml (linear algebra, 3d transforms, bitmap operations, 
etc.), and it's so close to being good enough to just stay in caml and not 
have to drop to C.  I understand the point of using the right tool for the 
job, but there is overhead (both cognitive and development-process-wise, 
both important) associated with hooking something in C, and so it would be 
really nice to stay in caml all the time.  Bringing this back to HT, this 
is the kind of feature that requires inria to do it, because I don't think 
anybody else understands the gc.  By contrast, I could probably get an SSE 
code generator working if I thought it was worth it.  But there's no way I 
could multithread the gc.  :)

>More important, you don't need threads and shared memory to make use
>of a SMP machine.  Any kind of parallelism will do.  Several processes
>with message-passing can easily get you 100% load on all your processors.
>Also, message-passing is more general; for example it will work on clusters.

Sure, but an HT cpu shares L1 and L2 caches between the threads.  This 
means that you really want your threads to be working on the same data and 
code if you can help it.  It'll still work for processes, but you're going 
to thrash way more than if you're doing local stuff.

Again, I'm not an HT zealot; I don't even know if it's going to 
succeed.  But, I do think it has the potential to have a big impact on 
performance oriented programming, and it would be great if there's a plan 
for supporting it in caml if it actually works.  If it's simply not 
possible to multithread the gc well, then that's that.  But it seems like 
something you want to have simmering on the mental back burner in case it 
turns out you want it later.

Sorry for the huge post,
Chris


-------------------
To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr
Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners