From: Yang Shouxun
Organization: FLTRP
To: caml-list@inria.fr
Subject: Re: [Caml-list] stack overflow
Date: Wed, 9 Apr 2003 17:23:30 +0800
References: <200304091045.25094.yangsx@fltrp.com> <20030409081451.GA18772@mail4.ai.univie.ac.at>
In-Reply-To: <20030409081451.GA18772@mail4.ai.univie.ac.at>
Message-Id: <200304091723.30890.yangsx@fltrp.com>

On Wednesday 09 April 2003 16:14, Markus Mottl wrote:
> On Wed, 09 Apr 2003, Yang Shouxun wrote:
> > Yes, the decision tree building function is not tail recursive. I heard
> > people saying C4.5 (in C) also has a stack overflow problem when the
> > training dataset becomes very large.
>
> I can't imagine that this is the problem: either the data is
> well-distributed, in which case the stack size will grow roughly
> logarithmically with the size of the data due to partitioning. And if not,
> the maximum depth of the tree is limited by the number of available input
> variables anyway. You'd need many, many thousands of those before this
> becomes a problem, which even large, industrial datasets that I know do
> not exceed.

My training data contain statistical values for word combinations (or
collocations) extracted from a corpus. The number is indeed very large.

> > I don't know how to write a tail recursive version to build trees.
> > If there are not that many continuous attributes and the dataset is
> > not so large, the tree stops growing before stack overflow.
>
> The trick is to use continuation passing style (CPS): you pass a function
> closure (continuation) containing everything that's needed in subsequent
> computations. Instead of returning a result, the sub-function calls the
> continuation with the result, which makes the functions tail-recursive.

I've learned this style in Scheme. Yet I feel paralyzed when trying to use
it to build trees. The type declaration may make my point clearer.
--8<--
type dtree = Dnode of dnode | Dtree of (dnode * int * dtree list)
--8<--
The problem is that a tree is not complete until the recursive call that
builds it returns, and each call may recurse several times, once per child.

> But anyway, I think there must be some fishy operation going on. Why not
> use the debugger to find out? Or even better: send a link to the code :-)

I suppose the program is not buggy insofar as it works as expected. The flaw
is in the design itself: it must recurse (as far as I can implement it), yet
it cannot afford to recurse too deeply. Sorry, I don't have a homepage.

> > Can one know the maximal number of calls before it overflows the stack?
>
> It depends: byte-code uses its own stack, which you can query using the
> Gc-module.
> Otherwise, for native-code, call the ulimit-program (Unix),
> which displays resource limits including stack usage or interface to
> the system call "getrlimit".

I'm running Debian unstable. I checked just now on my laptop and "ulimit -s"
returned "unlimited". I suppose the desktop that actually ran the program was
similarly configured.

> In any case, it would be interesting to see your code. Are you going to
> release it under some free license?

Yes, I'm going to release it under the GPL. As you can see, I basically use
free software and am willing to give something back. I intend to register a
project for it on savannah soon. Be warned that my code may look rather ugly.

I also downloaded your AIFAD and had a cursory look at it. I found it does
not handle continuous attributes yet and your design goal is quite different
from mine. So I wrote mine from scratch and called it DTLR (Decision Tree
Learner for Retrieval).

If you are interested, I can send a copy to you tomorrow. It does not yet
implement all the features I planned and has no documentation apart from some
comments, but it is enough for my own needs right now.
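
For illustration of the continuation passing idea quoted above, here is a
minimal sketch of how a tree whose nodes have several children might be built
in CPS. It is not DTLR's or AIFAD's code. The record type dnode, the splitting
rule split, and the functions build and build_children are all hypothetical
stand-ins; only the shape of dtree follows the declaration given earlier. The
point is that the "several calls on itself" become a chain of continuations:
each child is built with a continuation that goes on to build its siblings,
so every call is a tail call.

```ocaml
(* Hypothetical leaf/node payload; the real dnode is not shown in the mail. *)
type dnode = { label : string }

type dtree =
  | Dnode of dnode
  | Dtree of (dnode * int * dtree list)

(* Hypothetical splitting rule, chosen only to force a very deep tree:
   a "dataset" is an int n; n <= 0 is a leaf, otherwise it splits on
   "attribute" n into two parts, a trivial one and one of size n - 1. *)
let split (n : int) : (int * int list) option =
  if n <= 0 then None else Some (n, [ 0; n - 1 ])

(* CPS builder: instead of returning the tree, pass it to k.
   Pending work lives in heap-allocated closures, not on the stack. *)
let rec build (data : int) (k : dtree -> 'a) : 'a =
  match split data with
  | None -> k (Dnode { label = string_of_int data })
  | Some (attr, parts) ->
      build_children parts [] (fun kids ->
          k (Dtree ({ label = string_of_int data }, attr, kids)))

(* Build the children left to right; the continuation given to each child
   carries on with the remaining siblings. *)
and build_children parts acc k =
  match parts with
  | [] -> k (List.rev acc)
  | p :: rest -> build p (fun t -> build_children rest (t :: acc) k)

(* Tail-recursive depth via an explicit work list, so that inspecting a
   deep tree does not overflow the stack either. *)
let depth (t : dtree) : int =
  let rec go best = function
    | [] -> best
    | (d, Dnode _) :: rest -> go (max best d) rest
    | (d, Dtree (_, _, kids)) :: rest ->
        let rest' =
          List.fold_left (fun acc kid -> (d + 1, kid) :: acc) rest kids
        in
        go (max best d) rest'
  in
  go 0 [ (0, t) ]
```

A call such as build 100_000 (fun t -> t) passes the identity function as the
final continuation and gets the finished tree back. Note that CPS does not
make the work smaller: the memory that would have been stack frames is
allocated on the heap as closures instead, which is usually the less tightly
limited resource.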