From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.1.3 (2006-06-01) on yquem.inria.fr X-Spam-Level: X-Spam-Status: No, score=0.0 required=5.0 tests=AWL autolearn=disabled version=3.1.3 X-Original-To: caml-list@yquem.inria.fr Delivered-To: caml-list@yquem.inria.fr Received: from discorde.inria.fr (discorde.inria.fr [192.93.2.38]) by yquem.inria.fr (Postfix) with ESMTP id 39FE5BC69 for ; Mon, 13 Aug 2007 23:32:22 +0200 (CEST) Received: from ptb-relay03.plus.net (ptb-relay03.plus.net [212.159.14.214]) by discorde.inria.fr (8.13.6/8.13.6) with ESMTP id l7DLWLu5008824 (version=TLSv1/SSLv3 cipher=AES256-SHA bits=256 verify=NO) for ; Mon, 13 Aug 2007 23:32:22 +0200 Received: from [80.229.56.224] (helo=beast.local) by ptb-relay03.plus.net with esmtp (Exim) id 1IKhWP-00065Q-4Y for caml-list@inria.fr; Mon, 13 Aug 2007 22:32:21 +0100 From: Jon Harrop Organization: Flying Frog Consultancy Ltd. To: caml-list Subject: Ropes: progressively crazy ideas Date: Mon, 13 Aug 2007 22:23:42 +0100 User-Agent: KMail/1.9.7 MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: <200708132223.43128.jon@ffconsultancy.com> X-Miltered: at discorde with ID 46C0CDE5.000 by Joe's j-chkmail (http://j-chkmail . ensmp . fr)! X-Spam: no; 0.00; abstracted:01 vectors:01 nodes:01 toying:01 computes:01 recursive:01 vastly:01 ml's:01 ocaml:01 ocaml:01 frog:98 pairs:01 reuse:01 lazy:02 purely:02 I decided to have another go at curing cancer over the weekend. Downloaded the human genome sequence from UCSC and started reading papers on suffix trees. The data is primarily a sequence of A, C, G and T, of course. However, there are also sparse entries of other characters, primarily N. You really want to store the bulk of the data as a dense array of bit pairs for the four most common characters and then store the exceptional cases in a sparse data structure. Perhaps it would be useful if your Vect implementation could be abstracted over different concrete leaf implementations, such as bit vectors? Presumably leaf size should be measured in bytes rather than elements? Also, I've noticed a correlation between the purely functional implementations of the linear-time suffix tree generation algorithms, Huet's zippers and my work that I mentioned before on lazy metadata in nodes. I'm toying with the idea of a sequence type based on Vect that lazily computes its own suffix tree. You would need some way to compose suffix trees when sequences are combined (e.g. concatenation of sequences) but the result could be very useful. For example, Mathematica's pattern matcher allows you to search for subsequences: f[{front___, x, y, z, back___}] := {front, back} This is currently implemented completely naively by recursive linear searches of an array. If the sequence type could generate and reuse suffix trees then subsequence matches should be vastly more efficient. The ability to search for subsequences is the main reason why Mathematica's pattern matches are not optimized as ML's are. -- Dr Jon D Harrop, Flying Frog Consultancy Ltd. OCaml for Scientists http://www.ffconsultancy.com/products/ocaml_for_scientists/?e