From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Original-To: caml-list@yquem.inria.fr Delivered-To: caml-list@yquem.inria.fr Received: from nez-perce.inria.fr (nez-perce.inria.fr [192.93.2.78]) by yquem.inria.fr (Postfix) with ESMTP id A95F9BB83 for ; Tue, 4 Jul 2006 09:46:33 +0200 (CEST) Received: from nf-out-0910.google.com (nf-out-0910.google.com [64.233.182.191]) by nez-perce.inria.fr (8.13.6/8.13.6) with ESMTP id k647kYlK031150 for ; Tue, 4 Jul 2006 09:46:34 +0200 Received: by nf-out-0910.google.com with SMTP id m18so155134nfc for ; Tue, 04 Jul 2006 00:46:33 -0700 (PDT) DomainKey-Signature: a=rsa-sha1; q=dns; c=nofws; s=beta; d=googlemail.com; h=received:message-id:date:from:to:subject:in-reply-to:mime-version:content-type:content-transfer-encoding:content-disposition:references; b=ZpT5IDsxKRIUbT2KY5gir+FxJfEOp+s1wFRn5vLVJSOadzchJ/Vo2lKsML3kdnv4iqxnHayb523VRJ+GpJcO36UYOrK4Z17sASOY0qmirQnlwrkXLO642f3TPXAlOs9zDQNUvfl7RIcADX7CATfqZXY/qkJpfXtfE9tdmkQ1mHg= Received: by 10.49.54.13 with SMTP id g13mr3059119nfk; Tue, 04 Jul 2006 00:46:33 -0700 (PDT) Received: by 10.48.220.18 with HTTP; Tue, 4 Jul 2006 00:46:33 -0700 (PDT) Message-ID: Date: Tue, 4 Jul 2006 09:46:33 +0200 From: "Andreas Biegert" To: caml-list@yquem.inria.fr Subject: Re: [Caml-list] Package with multidimensional AND sparse matrices? In-Reply-To: MIME-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Content-Disposition: inline References: <200001020456.19386.jon@ffconsultancy.com> X-j-chkmail-Score: MSGID : 44AA1CDA.001 on nez-perce : j-chkmail score : X : 0/20 1 X-Miltered: at nez-perce with ID 44AA1CDA.001 by Joe's j-chkmail (http://j-chkmail.ensmp.fr)! X-Spam: no; 0.00; matrices:01 computes:01 alignment:01 sequences:01 computed:01 non-zero:01 non-zero:01 sequences:01 hashtables:01 matrices:01 hashtable:01 computes:01 alignment:01 computed:01 hashtables:01 X-Spam-Checker-Version: SpamAssassin 3.0.3 (2005-04-27) on yquem.inria.fr X-Spam-Level: X-Spam-Status: No, score=0.0 required=5.0 tests=RCVD_BY_IP autolearn=disabled version=3.0.3 Well, in short: the algorithm computes the optimal multiple alignment of up to n protein sequences with maximum length L. The straightforward and theoretically optimal approach to solve this problem is by dynamic programming, therefore the time and space complexity is O(L^n). I managed to come up with a pretty good heuristic that lets me boil down the number of cells that need to be computed giving me a complexitiy of O(L^2) instead of O(L^n), in fact I can even manually adjust the number of nonzero entries in the matrix. In the 2000*2000*2000*2000 matrix only at most 2000 cells will be non-zero. Furthermore for all non-zero cells M(i,j,k,l) holds i: > Well, in short: the algorithm computes the optimal multiple alignment > of up to n protein sequences with maximum length L. The > straightforward and theoretically optimal approach to solve this > problem is by dynamic programming, therefore the time and space > complexity is O(L^n). I managed to come up with a pretty good > heuristic that lets me boil down the number of cells that need to be > computed giving me a complexitiy of O(L^2) instead of O(L^n), in fact > I can even manually adjust the number of nonzero entries in the > matrix. In the 2000*2000*2000*2000 matrix only at most 2000 cells will > be non-zero. Furthermore for all non-zero cells M(i,j,k,l) holds > i program has to run under 5min. An estimation of the time requirements > given the number of sequences n and masimum lengths L is: > > 24 * L^2 * T_x * 2^(n+1) - 1 > with T_x being about 100ns > > For n=4 and L=2000 I get just about 5min. Would it be a bad idea to > use Hashtables all the way instead of matrices? Accessing elements in > a hashtable has O(1), right? > > Andreas > > > > 2000/1/2, Jon Harrop : > > On Monday 03 July 2006 23:55, Andreas Biegert wrote: > > > The ocamlfloat package provides a sparse matrix implmentation for > > > 2-dimensional matrices, but I would need at least 4 dimensions. > > > > You probably mean "tensor" rather than "matrix". > > > > > In case there is no such package, would it be very hard to write one by > > > myself? > > > > Provided performance and memory usage are not too important, it is quite > > simple to write your own implementation. First, you must decide what > > requirements you have, i.e. what do you want to do with these tensors? Then, > > you must choose a data structure to represent your tensors, e.g. a hash table > > that maps int * int * int * int to float. Finally, you must implement the > > functions that you need. > > > > You could start with something like this: > > > > module SparseArray : sig > > type t > > > > val make : int list -> t > > val get : t -> int list -> float > > val set : t -> int list -> float -> unit > > end = struct > > type t = int list * (int list, float) Hashtbl.t > > > > let make n = n, Hashtbl.create 0 > > > > let check s n i = > > List.iter2 (fun n i -> if i<0 || i>=n then invalid_arg s) n i > > > > let get (n, m) i = > > check "SparseArray.get" n i; > > try Hashtbl.find m i with Not_found -> 0. > > > > let set (n, m) i x = > > check "SparseArray.set" n i; > > if x<>0. then Hashtbl.replace m i x else > > try Hashtbl.remove m i with Not_found -> () > > end;; > > > > That should be fine for getting and setting but you probably want to loop over > > dimensions and perform arithmetic operations, in which case you need a way to > > jump to the non-zero elements. > > > > How many non-zero elements do you have and how long do you expect your > > calculations to take? > > > > -- > > Dr Jon D Harrop, Flying Frog Consultancy Ltd. > > Objective CAML for Scientists > > http://www.ffconsultancy.com/products/ocaml_for_scientists > > > > _______________________________________________ > > Caml-list mailing list. Subscription management: > > http://yquem.inria.fr/cgi-bin/mailman/listinfo/caml-list > > Archives: http://caml.inria.fr > > Beginner's list: http://groups.yahoo.com/group/ocaml_beginners > > Bug reports: http://caml.inria.fr/bin/caml-bugs > > >