From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <nilekim@gmail.com>
X-Spam-Checker-Version: SpamAssassin 3.1.3 (2006-06-01) on yquem.inria.fr
X-Spam-Level: **
X-Spam-Status: No, score=2.8 required=5.0 tests=DNS_FROM_RFC_POST,
	HTML_MESSAGE,SPF_NEUTRAL autolearn=disabled version=3.1.3
X-Original-To: caml-list@yquem.inria.fr
Delivered-To: caml-list@yquem.inria.fr
Received: from mail1-relais-roc.national.inria.fr (mail1-relais-roc.national.inria.fr [192.134.164.82])
	by yquem.inria.fr (Postfix) with ESMTP id 2562FBBC4
	for <caml-list@yquem.inria.fr>; Fri, 20 Feb 2009 23:23:17 +0100 (CET)
X-IronPort-Anti-Spam-Filtered: true
X-IronPort-Anti-Spam-Result: ArcBANi7nknRVd0VkGdsb2JhbACCQDGKG4cNPwEBAQEJCQwHEQOuOYEFjz0BAwEDhAwGhiQ
X-IronPort-AV: E=Sophos;i="4.38,243,1233529200"; 
   d="scan'208";a="24457312"
Received: from mail-qy0-f21.google.com ([209.85.221.21])
  by mail1-smtp-roc.national.inria.fr with ESMTP; 20 Feb 2009 23:23:16 +0100
Received: by qyk14 with SMTP id 14so4593765qyk.9
        for <caml-list@inria.fr>; Fri, 20 Feb 2009 14:23:15 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=gmail.com; s=gamma;
        h=domainkey-signature:mime-version:received:in-reply-to:references
         :date:message-id:subject:from:to:cc:content-type;
        bh=bSCRbJltxC0tTcDca8v6HH6HGj14KpELDldaYz9Uzc4=;
        b=uT/w+UMeo0PMdMKmSeV2zO96qXVB5DhvFsqyml118JPINDGbmJj9yeIBaWtuALu8nC
         5Tje3i9HsNsZ/FR4m+PTD0mXPhbbKdPpKzb12tBHInFcgaNIDy5wVhOFBS4ekOElB4fl
         L3509YwXdeiK+ChBsCTGan42iDEOO44MMSjA8=
DomainKey-Signature: a=rsa-sha1; c=nofws;
        d=gmail.com; s=gamma;
        h=mime-version:in-reply-to:references:date:message-id:subject:from:to
         :cc:content-type;
        b=v2ORwW9x2YB5VMoNDpm+Hc95BssNXkEO2EUjsFkEl9Lrsp7SwtGXR6JUpjCbUcb941
         UQLd/PJLkE7IwlaS1ymLjipFXVOYlJ9suP/+aiQzREFY8btVf/JlIrGR+0Xxj4IPvi0c
         yN35d2awWBudoPt8CvOBmx8+vFNROSgXQv1DE=
MIME-Version: 1.0
Received: by 10.224.37.139 with SMTP id x11mr2322842qad.144.1235168595199; 
	Fri, 20 Feb 2009 14:23:15 -0800 (PST)
In-Reply-To: <243054520902201153q4efbe1f0j686bd7212b515c03@mail.gmail.com>
References: <243054520902200740j1d1874adib27645f6e090b073@mail.gmail.com>
	 <499EFA7E.7010400@inria.fr>
	 <243054520902201153q4efbe1f0j686bd7212b515c03@mail.gmail.com>
Date: Fri, 20 Feb 2009 17:23:15 -0500
Message-ID: <2a1a1a0c0902201423o5ae7dd3dhb5c64da3c3255a69@mail.gmail.com>
Subject: Re: [Caml-list] speeding up matrix multiplication (newbie question)
From: Mike Lin <nilekim@gmail.com>
To: Erick Matsen <matsen@berkeley.edu>
Cc: caml-list@inria.fr
Content-Type: multipart/alternative; boundary=0015175ce16c5ac7400463611901
X-Spam: no; 0.00; matrices:01 blas:01 lacaml:01 ocaml:01 iirc:01 speedup:01 speedup:01 ocaml:01 model:01 matrices:01 model:01 multiplying:01 vectors:01 node:01 vectors:01 

--0015175ce16c5ac7400463611901
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit

Erick, we should compare notes sometime. I have a lot of code for doing this
kind of stuff (I am working on empirical codon models with 61x61 rate
matrices). The right way to speed up matrix-vector operations is to use BLAS
via either Lacaml or ocamlgsl. But if, like me, you like to
counterproductively fiddle around with stupid things, here's a little ocaml
library I wrote that links to a C function to do vector-vector dot using
SSE2 vectorization.

http://www.broad.mit.edu/~mlin/for_others/SuperFast_dot.zip

IIRC in my tests it gave about 50% speedup vs. the obvious C code and 100%
speedup vs. the obvious ocaml code. But, I am doing 61x61 and I'm not sure
if the speedup scales down to 20x20 or especially 4x4.

Mike

On Fri, Feb 20, 2009 at 2:53 PM, Erick Matsen <matsen@berkeley.edu> wrote:

> Wow, once again I am amazed by the vitality of this list. Thank you
> for your suggestions.
>
> Here is the context: we are interested in calculating the likelihood
> of taxonomic placement of short "metagenomics" sequence fragments from
> unknown organisms in the ocean. We start by assuming a model of
> sequence evolution, which is a reversible Markov chain. The taxonomy
> is represented as a tree, and the sequence information is a collection
> of likelihoods of sequence identities. As we move up the tree, these
> sequences "evolve" by getting multiplied by the exponentiated
> instantaneous Markov matrix.
>
> The matrices are of the size of the sequence model: 4x4 when looking
> at nucleotides, and 20x20 when looking at proteins.
>
> The bottleneck is (I mis-spoke before) that we are multiplying many
> length-4 or length-20 vectors by a collection of matrices which
> represent the time evolution of those sequences as follows.
>
> Outer loop:
>  modify the amount of time each markov process runs
>  exponentiate the rate matrices to get transition matrices
>
>  Recur over the tree, starting at the leaves:
>    at a node, multiply all of the daughter likelihood vectors together
>    return the multiplication of that product by the trasition matrix
> (bottleneck!)
>
> The trees are on the order of 50 leaves, and there are about 500
> likelihood vectors done at once.
>
> All of this gets run on a big cluster of Xeons. It's not worth
> parallelizing because we are running many instances of this process
> already, which fills up the cluster nodes.
>
> So far I have been doing the simplest thing possible, which is just to
> multiply the matrices out like \sum_j a_ij v_j. Um, this is a bit
> embarassing.
>
> let mul_vec m v =
>    if Array.length v <> n_cols m then
>      failwith "mul_vec: matrix size and vector size don't match!";
>    let result = Array.create (n_rows m) N.zero in
>    for i=0 to (n_rows m)-1 do
>      for j=0 to (n_cols m)-1 do
>        result.(i) <- N.add result.(i) (N.mul (get m i j) v.(j))
>      done;
>    done;
>    result
>
> I have implemented it in a functorial way for flexibility. N is the
> number class. How much improvement might I hope for if I make a
> dedicated float vector multiplication function? I'm sorry, I know
> nothing about "boxing." Where can I read about that?
>
>
> Thank you again,
>
> Erick
>
>
>
> On Fri, Feb 20, 2009 at 10:46 AM, Xavier Leroy <Xavier.Leroy@inria.fr>
> wrote:
> >> I'm working on speeding up some code, and I wanted to check with
> >> someone before implementation.
> >>
> >> As you can see below, the code primarily spends its time multiplying
> >> relatively small matrices. Precision is of course important but not
> >> an incredibly crucial issue, as the most important thing is relative
> >> comparison between things which *should* be pretty different.
> >
> > You need to post your matrix multiplication code so that the regulars
> > on this list can tear it to pieces :-)
> >
> > From the profile you gave, it looks like you parameterized your matrix
> > multiplication code over the + and * operations over matrix elements.
> > This is good for genericity but not so good for performance, as it
> > will result in more boxing (heap allocation) of floating-point values.
> > The first thing you should try is write a version of matrix
> > multiplication that is specialized for type "float".
> >
> > Then, there are several ways to write the textbook matrix
> > multiplication algorithm, some of which perform less boxing than
> > others.  Again, post your code and we'll let you know.
> >
> >> Currently I'm just using native (double-precision) ocaml floats and
> >> the native ocaml arrays for a first pass on the problem.  Now I'm
> >> thinking about moving to using float32 bigarrays, and I'm hoping
> >> that the code will double in speed. I'd like to know: is that
> >> realistic? Any other suggestions?
> >
> > It won't double in speed: arithmetic operations will take exactly the
> > same time in single or double precision.  What single-precision
> > bigarrays buy you is halving the memory footprint of your matrices.
> > That could result in better cache behavior and therefore slightly
> > better speed, but it depends very much on the sizes and number of your
> > matrices.
> >
> > - Xavier Leroy
> >
>
> _______________________________________________
> Caml-list mailing list. Subscription management:
> http://yquem.inria.fr/cgi-bin/mailman/listinfo/caml-list
> Archives: http://caml.inria.fr
> Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
> Bug reports: http://caml.inria.fr/bin/caml-bugs
>

--0015175ce16c5ac7400463611901
Content-Type: text/html; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable

Erick, we should compare notes sometime. I have a lot of code for doing thi=
s kind of stuff (I am working on empirical codon models with 61x61 rate mat=
rices). The right way to speed up matrix-vector operations is to use BLAS v=
ia either Lacaml or ocamlgsl. But if, like me, you like to counterproductiv=
ely fiddle around with stupid things, here&#39;s a little ocaml library I w=
rote that links to a C function to do vector-vector dot using SSE2 vectoriz=
ation.<br>
<br><a href=3D"http://www.broad.mit.edu/~mlin/for_others/SuperFast_dot.zip"=
>http://www.broad.mit.edu/~mlin/for_others/SuperFast_dot.zip</a><br><br>IIR=
C in my tests it gave about 50% speedup vs. the obvious C code and 100% spe=
edup vs. the obvious ocaml code. But, I am doing 61x61 and I&#39;m not sure=
 if the speedup scales down to 20x20 or especially 4x4.<br>
<br>Mike<br>
<br><div class=3D"gmail_quote">On Fri, Feb 20, 2009 at 2:53 PM, Erick Matse=
n <span dir=3D"ltr">&lt;<a href=3D"mailto:matsen@berkeley.edu" target=3D"_b=
lank">matsen@berkeley.edu</a>&gt;</span> wrote:<br><blockquote class=3D"gma=
il_quote" style=3D"border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0=
pt 0pt 0.8ex; padding-left: 1ex;">

Wow, once again I am amazed by the vitality of this list. Thank you<br>
for your suggestions.<br>
<br>
Here is the context: we are interested in calculating the likelihood<br>
of taxonomic placement of short &quot;metagenomics&quot; sequence fragments=
 from<br>
unknown organisms in the ocean. We start by assuming a model of<br>
sequence evolution, which is a reversible Markov chain. The taxonomy<br>
is represented as a tree, and the sequence information is a collection<br>
of likelihoods of sequence identities. As we move up the tree, these<br>
sequences &quot;evolve&quot; by getting multiplied by the exponentiated<br>
instantaneous Markov matrix.<br>
<br>
The matrices are of the size of the sequence model: 4x4 when looking<br>
at nucleotides, and 20x20 when looking at proteins.<br>
<br>
The bottleneck is (I mis-spoke before) that we are multiplying many<br>
length-4 or length-20 vectors by a collection of matrices which<br>
represent the time evolution of those sequences as follows.<br>
<br>
Outer loop:<br>
 &nbsp;modify the amount of time each markov process runs<br>
 &nbsp;exponentiate the rate matrices to get transition matrices<br>
<br>
 &nbsp;Recur over the tree, starting at the leaves:<br>
 &nbsp; &nbsp;at a node, multiply all of the daughter likelihood vectors to=
gether<br>
 &nbsp; &nbsp;return the multiplication of that product by the trasition ma=
trix<br>
(bottleneck!)<br>
<br>
The trees are on the order of 50 leaves, and there are about 500<br>
likelihood vectors done at once.<br>
<br>
All of this gets run on a big cluster of Xeons. It&#39;s not worth<br>
parallelizing because we are running many instances of this process<br>
already, which fills up the cluster nodes.<br>
<br>
So far I have been doing the simplest thing possible, which is just to<br>
multiply the matrices out like \sum_j a_ij v_j. Um, this is a bit<br>
embarassing.<br>
<br>
let mul_vec m v =3D<br>
 &nbsp; &nbsp;if Array.length v &lt;&gt; n_cols m then<br>
 &nbsp; &nbsp; &nbsp;failwith &quot;mul_vec: matrix size and vector size do=
n&#39;t match!&quot;;<br>
 &nbsp; &nbsp;let result =3D Array.create (n_rows m) N.zero in<br>
 &nbsp; &nbsp;for i=3D0 to (n_rows m)-1 do<br>
 &nbsp; &nbsp; &nbsp;for j=3D0 to (n_cols m)-1 do<br>
 &nbsp; &nbsp; &nbsp; &nbsp;result.(i) &lt;- N.add result.(i) (N.mul (get m=
 i j) v.(j))<br>
 &nbsp; &nbsp; &nbsp;done;<br>
 &nbsp; &nbsp;done;<br>
 &nbsp; &nbsp;result<br>
<br>
I have implemented it in a functorial way for flexibility. N is the<br>
number class. How much improvement might I hope for if I make a<br>
dedicated float vector multiplication function? I&#39;m sorry, I know<br>
nothing about &quot;boxing.&quot; Where can I read about that?<br>
<br>
<br>
Thank you again,<br>
<font color=3D"#888888"><br>
Erick<br>
</font><div><div></div><div><br>
<br>
<br>
On Fri, Feb 20, 2009 at 10:46 AM, Xavier Leroy &lt;<a href=3D"mailto:Xavier=
.Leroy@inria.fr" target=3D"_blank">Xavier.Leroy@inria.fr</a>&gt; wrote:<br>
&gt;&gt; I&#39;m working on speeding up some code, and I wanted to check wi=
th<br>
&gt;&gt; someone before implementation.<br>
&gt;&gt;<br>
&gt;&gt; As you can see below, the code primarily spends its time multiplyi=
ng<br>
&gt;&gt; relatively small matrices. Precision is of course important but no=
t<br>
&gt;&gt; an incredibly crucial issue, as the most important thing is relati=
ve<br>
&gt;&gt; comparison between things which *should* be pretty different.<br>
&gt;<br>
&gt; You need to post your matrix multiplication code so that the regulars<=
br>
&gt; on this list can tear it to pieces :-)<br>
&gt;<br>
&gt; From the profile you gave, it looks like you parameterized your matrix=
<br>
&gt; multiplication code over the + and * operations over matrix elements.<=
br>
&gt; This is good for genericity but not so good for performance, as it<br>
&gt; will result in more boxing (heap allocation) of floating-point values.=
<br>
&gt; The first thing you should try is write a version of matrix<br>
&gt; multiplication that is specialized for type &quot;float&quot;.<br>
&gt;<br>
&gt; Then, there are several ways to write the textbook matrix<br>
&gt; multiplication algorithm, some of which perform less boxing than<br>
&gt; others. &nbsp;Again, post your code and we&#39;ll let you know.<br>
&gt;<br>
&gt;&gt; Currently I&#39;m just using native (double-precision) ocaml float=
s and<br>
&gt;&gt; the native ocaml arrays for a first pass on the problem. &nbsp;Now=
 I&#39;m<br>
&gt;&gt; thinking about moving to using float32 bigarrays, and I&#39;m hopi=
ng<br>
&gt;&gt; that the code will double in speed. I&#39;d like to know: is that<=
br>
&gt;&gt; realistic? Any other suggestions?<br>
&gt;<br>
&gt; It won&#39;t double in speed: arithmetic operations will take exactly =
the<br>
&gt; same time in single or double precision. &nbsp;What single-precision<b=
r>
&gt; bigarrays buy you is halving the memory footprint of your matrices.<br=
>
&gt; That could result in better cache behavior and therefore slightly<br>
&gt; better speed, but it depends very much on the sizes and number of your=
<br>
&gt; matrices.<br>
&gt;<br>
&gt; - Xavier Leroy<br>
&gt;<br>
<br>
_______________________________________________<br>
Caml-list mailing list. Subscription management:<br>
<a href=3D"http://yquem.inria.fr/cgi-bin/mailman/listinfo/caml-list" target=
=3D"_blank">http://yquem.inria.fr/cgi-bin/mailman/listinfo/caml-list</a><br=
>
Archives: <a href=3D"http://caml.inria.fr" target=3D"_blank">http://caml.in=
ria.fr</a><br>
Beginner&#39;s list: <a href=3D"http://groups.yahoo.com/group/ocaml_beginne=
rs" target=3D"_blank">http://groups.yahoo.com/group/ocaml_beginners</a><br>
Bug reports: <a href=3D"http://caml.inria.fr/bin/caml-bugs" target=3D"_blan=
k">http://caml.inria.fr/bin/caml-bugs</a><br>
</div></div></blockquote></div><br>

--0015175ce16c5ac7400463611901--