From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Original-To: caml-list@yquem.inria.fr Delivered-To: caml-list@yquem.inria.fr Received: from nez-perce.inria.fr (nez-perce.inria.fr [192.93.2.78]) by yquem.inria.fr (Postfix) with ESMTP id 73EF7BB83 for ; Tue, 4 Jul 2006 16:42:46 +0200 (CEST) Received: from ptb-relay03.plus.net (ptb-relay03.plus.net [212.159.14.214]) by nez-perce.inria.fr (8.13.6/8.13.6) with ESMTP id k64Egluh019975 (version=TLSv1/SSLv3 cipher=AES256-SHA bits=256 verify=NO) for ; Tue, 4 Jul 2006 16:42:47 +0200 Received: from [80.229.56.224] (helo=chetara) by ptb-relay03.plus.net with esmtp (Exim) id 1Fxhkx-0001SP-Ku for caml-list@yquem.inria.fr; Tue, 04 Jul 2006 11:03:47 +0100 From: Jon Harrop Organization: Flying Frog Consultancy Ltd. To: caml-list@yquem.inria.fr Subject: Re: [Caml-list] Package with multidimensional AND sparse matrices? Date: Tue, 4 Jul 2006 10:03:58 +0000 User-Agent: KMail/1.9.1 References: In-Reply-To: MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: <200607041103.59059.jon@ffconsultancy.com> X-Miltered: at nez-perce with ID 44AA7E67.000 by Joe's j-chkmail (http://j-chkmail.ensmp.fr)! X-Spam: no; 0.00; matrices:01 computes:01 alignment:01 sequences:01 computed:01 non-zero:01 non-zero:01 sequences:01 hashtables:01 matrices:01 hashtable:01 ocaml:01 2006:98 46,:98 frog:98 X-Spam-Checker-Version: SpamAssassin 3.0.3 (2005-04-27) on yquem.inria.fr X-Spam-Level: X-Spam-Status: No, score=0.0 required=5.0 tests=none autolearn=disabled version=3.0.3 On Tuesday 04 July 2006 08:46, Andreas Biegert wrote: > Well, in short: the algorithm computes the optimal multiple alignment > of up to n protein sequences with maximum length L. The > straightforward and theoretically optimal approach to solve this > problem is by dynamic programming, therefore the time and space > complexity is O(L^n). I managed to come up with a pretty good > heuristic that lets me boil down the number of cells that need to be > computed giving me a complexitiy of O(L^2) instead of O(L^n), in fact > I can even manually adjust the number of nonzero entries in the > matrix. In the 2000*2000*2000*2000 matrix only at most 2000 cells will > be non-zero. Furthermore for all non-zero cells M(i,j,k,l) holds > i program has to run under 5min. An estimation of the time requirements > given the number of sequences n and masimum lengths L is: > > 24 * L^2 * T_x * 2^(n+1) - 1 > with T_x being about 100ns Ok. What patterns of access will there be? > For n=4 and L=2000 I get just about 5min. Would it be a bad idea to > use Hashtables all the way instead of matrices? No, I don't think so. If you need more performance then you can always change things later. I'd concentrate on getting a working algorithm first. > Accessing elements in > a hashtable has O(1), right? Roughly, yes. :-) -- Dr Jon D Harrop, Flying Frog Consultancy Ltd. Objective CAML for Scientists http://www.ffconsultancy.com/products/ocaml_for_scientists