From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Original-To: caml-list@yquem.inria.fr Delivered-To: caml-list@yquem.inria.fr Received: from nez-perce.inria.fr (nez-perce.inria.fr [192.93.2.78]) by yquem.inria.fr (Postfix) with ESMTP id DE741BB83 for ; Tue, 4 Jul 2006 19:58:48 +0200 (CEST) Received: from out3.smtp.messagingengine.com (out3.smtp.messagingengine.com [66.111.4.27]) by nez-perce.inria.fr (8.13.6/8.13.6) with ESMTP id k64Hwi2j028791 for ; Tue, 4 Jul 2006 19:58:50 +0200 Received: from frontend3.internal (frontend3.internal [10.202.2.152]) by frontend1.messagingengine.com (Postfix) with ESMTP id 6A96BD8B0F7; Tue, 4 Jul 2006 05:06:23 -0400 (EDT) Received: from heartbeat1.messagingengine.com ([10.202.2.160]) by frontend3.internal (MEProxy); Tue, 04 Jul 2006 05:06:25 -0400 X-Sasl-enc: 9Vydc1KlG5+WrsATqtZWjE3y+JgdgTe8WLsk5SFngT+2 1152003981 Received: from ppp-71-128-205-59.dsl.sndg02.pacbell.net (ppp-71-128-205-59.dsl.sndg02.pacbell.net [71.128.205.59]) by mail.messagingengine.com (Postfix) with ESMTP id 650281D96; Tue, 4 Jul 2006 05:06:21 -0400 (EDT) Date: Tue, 4 Jul 2006 09:06:16 +0000 (GMT) From: Martin Jambon X-X-Sender: martin@droopy To: Andreas Biegert Cc: caml-list@yquem.inria.fr Subject: Re: [Caml-list] Package with multidimensional AND sparse matrices? In-Reply-To: Message-ID: References: <200001020456.19386.jon@ffconsultancy.com> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-Miltered: at nez-perce with ID 44AAAC54.000 by Joe's j-chkmail (http://j-chkmail.ensmp.fr)! X-Spam: no; 0.00; matrices:01 computes:01 alignment:01 sequences:01 computed:01 non-zero:01 non-zero:01 sequences:01 hashtables:01 matrices:01 hashtable:01 hashtbl:01 hashtbl:01 howto:01 wrote:01 X-Spam-Checker-Version: SpamAssassin 3.0.3 (2005-04-27) on yquem.inria.fr X-Spam-Level: X-Spam-Status: No, score=0.4 required=5.0 tests=DNS_FROM_RFC_ABUSE autolearn=disabled version=3.0.3 On Tue, 4 Jul 2006, Andreas Biegert wrote: > Well, in short: the algorithm computes the optimal multiple alignment > of up to n protein sequences with maximum length L. The > straightforward and theoretically optimal approach to solve this > problem is by dynamic programming, therefore the time and space > complexity is O(L^n). I managed to come up with a pretty good > heuristic that lets me boil down the number of cells that need to be > computed giving me a complexitiy of O(L^2) instead of O(L^n), in fact > I can even manually adjust the number of nonzero entries in the > matrix. In the 2000*2000*2000*2000 matrix only at most 2000 cells will > be non-zero. Furthermore for all non-zero cells M(i,j,k,l) holds > i program has to run under 5min. An estimation of the time requirements > given the number of sequences n and masimum lengths L is: > > 24 * L^2 * T_x * 2^(n+1) - 1 > with T_x being about 100ns > > For n=4 and L=2000 I get just about 5min. Would it be a bad idea to > use Hashtables all the way instead of matrices? Accessing elements in > a hashtable has O(1), right? Yes, if you create it big enough so that it doesn't get resized. The best is that you look at hashtbl.ml in the standard library to see how it is implemented. But I wouldn't worry too much for now: try Hashtbl and you will quickly see if it's fast enough or not. Martin -- Martin Jambon, PhD - http://martin.jambon.free.fr Visit http://wikiomics.org, the Bioinformatics Howto Wiki