From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.1.3 (2006-06-01) on yquem.inria.fr X-Spam-Level: ** X-Spam-Status: No, score=2.8 required=5.0 tests=DNS_FROM_RFC_POST, HTML_MESSAGE,SPF_NEUTRAL autolearn=disabled version=3.1.3 X-Original-To: caml-list@yquem.inria.fr Delivered-To: caml-list@yquem.inria.fr Received: from mail1-relais-roc.national.inria.fr (mail1-relais-roc.national.inria.fr [192.134.164.82]) by yquem.inria.fr (Postfix) with ESMTP id 2562FBBC4 for ; Fri, 20 Feb 2009 23:23:17 +0100 (CET) X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: ArcBANi7nknRVd0VkGdsb2JhbACCQDGKG4cNPwEBAQEJCQwHEQOuOYEFjz0BAwEDhAwGhiQ X-IronPort-AV: E=Sophos;i="4.38,243,1233529200"; d="scan'208";a="24457312" Received: from mail-qy0-f21.google.com ([209.85.221.21]) by mail1-smtp-roc.national.inria.fr with ESMTP; 20 Feb 2009 23:23:16 +0100 Received: by qyk14 with SMTP id 14so4593765qyk.9 for ; Fri, 20 Feb 2009 14:23:15 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:mime-version:received:in-reply-to:references :date:message-id:subject:from:to:cc:content-type; bh=bSCRbJltxC0tTcDca8v6HH6HGj14KpELDldaYz9Uzc4=; b=uT/w+UMeo0PMdMKmSeV2zO96qXVB5DhvFsqyml118JPINDGbmJj9yeIBaWtuALu8nC 5Tje3i9HsNsZ/FR4m+PTD0mXPhbbKdPpKzb12tBHInFcgaNIDy5wVhOFBS4ekOElB4fl L3509YwXdeiK+ChBsCTGan42iDEOO44MMSjA8= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :cc:content-type; b=v2ORwW9x2YB5VMoNDpm+Hc95BssNXkEO2EUjsFkEl9Lrsp7SwtGXR6JUpjCbUcb941 UQLd/PJLkE7IwlaS1ymLjipFXVOYlJ9suP/+aiQzREFY8btVf/JlIrGR+0Xxj4IPvi0c yN35d2awWBudoPt8CvOBmx8+vFNROSgXQv1DE= MIME-Version: 1.0 Received: by 10.224.37.139 with SMTP id x11mr2322842qad.144.1235168595199; Fri, 20 Feb 2009 14:23:15 -0800 (PST) In-Reply-To: <243054520902201153q4efbe1f0j686bd7212b515c03@mail.gmail.com> References: <243054520902200740j1d1874adib27645f6e090b073@mail.gmail.com> <499EFA7E.7010400@inria.fr> <243054520902201153q4efbe1f0j686bd7212b515c03@mail.gmail.com> Date: Fri, 20 Feb 2009 17:23:15 -0500 Message-ID: <2a1a1a0c0902201423o5ae7dd3dhb5c64da3c3255a69@mail.gmail.com> Subject: Re: [Caml-list] speeding up matrix multiplication (newbie question) From: Mike Lin To: Erick Matsen Cc: caml-list@inria.fr Content-Type: multipart/alternative; boundary=0015175ce16c5ac7400463611901 X-Spam: no; 0.00; matrices:01 blas:01 lacaml:01 ocaml:01 iirc:01 speedup:01 speedup:01 ocaml:01 model:01 matrices:01 model:01 multiplying:01 vectors:01 node:01 vectors:01 --0015175ce16c5ac7400463611901 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Erick, we should compare notes sometime. I have a lot of code for doing this kind of stuff (I am working on empirical codon models with 61x61 rate matrices). The right way to speed up matrix-vector operations is to use BLAS via either Lacaml or ocamlgsl. But if, like me, you like to counterproductively fiddle around with stupid things, here's a little ocaml library I wrote that links to a C function to do vector-vector dot using SSE2 vectorization. http://www.broad.mit.edu/~mlin/for_others/SuperFast_dot.zip IIRC in my tests it gave about 50% speedup vs. the obvious C code and 100% speedup vs. the obvious ocaml code. But, I am doing 61x61 and I'm not sure if the speedup scales down to 20x20 or especially 4x4. Mike On Fri, Feb 20, 2009 at 2:53 PM, Erick Matsen wrote: > Wow, once again I am amazed by the vitality of this list. Thank you > for your suggestions. > > Here is the context: we are interested in calculating the likelihood > of taxonomic placement of short "metagenomics" sequence fragments from > unknown organisms in the ocean. We start by assuming a model of > sequence evolution, which is a reversible Markov chain. The taxonomy > is represented as a tree, and the sequence information is a collection > of likelihoods of sequence identities. As we move up the tree, these > sequences "evolve" by getting multiplied by the exponentiated > instantaneous Markov matrix. > > The matrices are of the size of the sequence model: 4x4 when looking > at nucleotides, and 20x20 when looking at proteins. > > The bottleneck is (I mis-spoke before) that we are multiplying many > length-4 or length-20 vectors by a collection of matrices which > represent the time evolution of those sequences as follows. > > Outer loop: > modify the amount of time each markov process runs > exponentiate the rate matrices to get transition matrices > > Recur over the tree, starting at the leaves: > at a node, multiply all of the daughter likelihood vectors together > return the multiplication of that product by the trasition matrix > (bottleneck!) > > The trees are on the order of 50 leaves, and there are about 500 > likelihood vectors done at once. > > All of this gets run on a big cluster of Xeons. It's not worth > parallelizing because we are running many instances of this process > already, which fills up the cluster nodes. > > So far I have been doing the simplest thing possible, which is just to > multiply the matrices out like \sum_j a_ij v_j. Um, this is a bit > embarassing. > > let mul_vec m v = > if Array.length v <> n_cols m then > failwith "mul_vec: matrix size and vector size don't match!"; > let result = Array.create (n_rows m) N.zero in > for i=0 to (n_rows m)-1 do > for j=0 to (n_cols m)-1 do > result.(i) <- N.add result.(i) (N.mul (get m i j) v.(j)) > done; > done; > result > > I have implemented it in a functorial way for flexibility. N is the > number class. How much improvement might I hope for if I make a > dedicated float vector multiplication function? I'm sorry, I know > nothing about "boxing." Where can I read about that? > > > Thank you again, > > Erick > > > > On Fri, Feb 20, 2009 at 10:46 AM, Xavier Leroy > wrote: > >> I'm working on speeding up some code, and I wanted to check with > >> someone before implementation. > >> > >> As you can see below, the code primarily spends its time multiplying > >> relatively small matrices. Precision is of course important but not > >> an incredibly crucial issue, as the most important thing is relative > >> comparison between things which *should* be pretty different. > > > > You need to post your matrix multiplication code so that the regulars > > on this list can tear it to pieces :-) > > > > From the profile you gave, it looks like you parameterized your matrix > > multiplication code over the + and * operations over matrix elements. > > This is good for genericity but not so good for performance, as it > > will result in more boxing (heap allocation) of floating-point values. > > The first thing you should try is write a version of matrix > > multiplication that is specialized for type "float". > > > > Then, there are several ways to write the textbook matrix > > multiplication algorithm, some of which perform less boxing than > > others. Again, post your code and we'll let you know. > > > >> Currently I'm just using native (double-precision) ocaml floats and > >> the native ocaml arrays for a first pass on the problem. Now I'm > >> thinking about moving to using float32 bigarrays, and I'm hoping > >> that the code will double in speed. I'd like to know: is that > >> realistic? Any other suggestions? > > > > It won't double in speed: arithmetic operations will take exactly the > > same time in single or double precision. What single-precision > > bigarrays buy you is halving the memory footprint of your matrices. > > That could result in better cache behavior and therefore slightly > > better speed, but it depends very much on the sizes and number of your > > matrices. > > > > - Xavier Leroy > > > > _______________________________________________ > Caml-list mailing list. Subscription management: > http://yquem.inria.fr/cgi-bin/mailman/listinfo/caml-list > Archives: http://caml.inria.fr > Beginner's list: http://groups.yahoo.com/group/ocaml_beginners > Bug reports: http://caml.inria.fr/bin/caml-bugs > --0015175ce16c5ac7400463611901 Content-Type: text/html; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable Erick, we should compare notes sometime. I have a lot of code for doing thi= s kind of stuff (I am working on empirical codon models with 61x61 rate mat= rices). The right way to speed up matrix-vector operations is to use BLAS v= ia either Lacaml or ocamlgsl. But if, like me, you like to counterproductiv= ely fiddle around with stupid things, here's a little ocaml library I w= rote that links to a C function to do vector-vector dot using SSE2 vectoriz= ation.

http://www.broad.mit.edu/~mlin/for_others/SuperFast_dot.zip

IIR= C in my tests it gave about 50% speedup vs. the obvious C code and 100% spe= edup vs. the obvious ocaml code. But, I am doing 61x61 and I'm not sure= if the speedup scales down to 20x20 or especially 4x4.

Mike

On Fri, Feb 20, 2009 at 2:53 PM, Erick Matse= n <matsen@berkeley.edu> wrote:
Wow, once again I am amazed by the vitality of this list. Thank you
for your suggestions.

Here is the context: we are interested in calculating the likelihood
of taxonomic placement of short "metagenomics" sequence fragments= from
unknown organisms in the ocean. We start by assuming a model of
sequence evolution, which is a reversible Markov chain. The taxonomy
is represented as a tree, and the sequence information is a collection
of likelihoods of sequence identities. As we move up the tree, these
sequences "evolve" by getting multiplied by the exponentiated
instantaneous Markov matrix.

The matrices are of the size of the sequence model: 4x4 when looking
at nucleotides, and 20x20 when looking at proteins.

The bottleneck is (I mis-spoke before) that we are multiplying many
length-4 or length-20 vectors by a collection of matrices which
represent the time evolution of those sequences as follows.

Outer loop:
 modify the amount of time each markov process runs
 exponentiate the rate matrices to get transition matrices

 Recur over the tree, starting at the leaves:
   at a node, multiply all of the daughter likelihood vectors to= gether
   return the multiplication of that product by the trasition ma= trix
(bottleneck!)

The trees are on the order of 50 leaves, and there are about 500
likelihood vectors done at once.

All of this gets run on a big cluster of Xeons. It's not worth
parallelizing because we are running many instances of this process
already, which fills up the cluster nodes.

So far I have been doing the simplest thing possible, which is just to
multiply the matrices out like \sum_j a_ij v_j. Um, this is a bit
embarassing.

let mul_vec m v =3D
   if Array.length v <> n_cols m then
     failwith "mul_vec: matrix size and vector size do= n't match!";
   let result =3D Array.create (n_rows m) N.zero in
   for i=3D0 to (n_rows m)-1 do
     for j=3D0 to (n_cols m)-1 do
       result.(i) <- N.add result.(i) (N.mul (get m= i j) v.(j))
     done;
   done;
   result

I have implemented it in a functorial way for flexibility. N is the
number class. How much improvement might I hope for if I make a
dedicated float vector multiplication function? I'm sorry, I know
nothing about "boxing." Where can I read about that?


Thank you again,

Erick



On Fri, Feb 20, 2009 at 10:46 AM, Xavier Leroy <Xavier.Leroy@inria.fr> wrote:
>> I'm working on speeding up some code, and I wanted to check wi= th
>> someone before implementation.
>>
>> As you can see below, the code primarily spends its time multiplyi= ng
>> relatively small matrices. Precision is of course important but no= t
>> an incredibly crucial issue, as the most important thing is relati= ve
>> comparison between things which *should* be pretty different.
>
> You need to post your matrix multiplication code so that the regulars<= br> > on this list can tear it to pieces :-)
>
> From the profile you gave, it looks like you parameterized your matrix=
> multiplication code over the + and * operations over matrix elements.<= br> > This is good for genericity but not so good for performance, as it
> will result in more boxing (heap allocation) of floating-point values.=
> The first thing you should try is write a version of matrix
> multiplication that is specialized for type "float".
>
> Then, there are several ways to write the textbook matrix
> multiplication algorithm, some of which perform less boxing than
> others.  Again, post your code and we'll let you know.
>
>> Currently I'm just using native (double-precision) ocaml float= s and
>> the native ocaml arrays for a first pass on the problem.  Now= I'm
>> thinking about moving to using float32 bigarrays, and I'm hopi= ng
>> that the code will double in speed. I'd like to know: is that<= br> >> realistic? Any other suggestions?
>
> It won't double in speed: arithmetic operations will take exactly = the
> same time in single or double precision.  What single-precision > bigarrays buy you is halving the memory footprint of your matrices. > That could result in better cache behavior and therefore slightly
> better speed, but it depends very much on the sizes and number of your=
> matrices.
>
> - Xavier Leroy
>

_______________________________________________
Caml-list mailing list. Subscription management:
http://yquem.inria.fr/cgi-bin/mailman/listinfo/caml-list Archives: http://caml.in= ria.fr
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
Bug reports: http://caml.inria.fr/bin/caml-bugs

--0015175ce16c5ac7400463611901--