Erick, we should compare notes sometime. I have a lot of code for doing
this kind of stuff (I am working on empirical codon models with 61x61
rate matrices).

The right way to speed up matrix-vector operations is to use BLAS via
either Lacaml or ocamlgsl. But if, like me, you like to
counterproductively fiddle around with stupid things, here's a little
OCaml library I wrote that links to a C function doing vector-vector
dot products with SSE2 vectorization:

http://www.broad.mit.edu/~mlin/for_others/SuperFast_dot.zip

IIRC, in my tests it gave about a 50% speedup vs. the obvious C code
and a 100% speedup vs. the obvious OCaml code. But I am doing 61x61,
and I'm not sure whether the speedup scales down to 20x20, or
especially to 4x4.
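If you do go the BLAS route, the Lacaml version of your operation is
only a few lines. A rough, untested sketch (module paths from memory;
I'm assuming the double-precision Lacaml.D interface and its
array-conversion helpers, so check the docs of whatever version you
install):

  (* y = m * v via BLAS dgemv, through Lacaml *)
  open Lacaml.D

  let mul_vec_blas m v =
    (* Mat.of_array / Vec.of_array copy plain OCaml arrays into
       Fortran-layout bigarrays; in real code you would keep the data
       in Lacaml's mat/vec types throughout instead of converting on
       every call *)
    let a = Mat.of_array m in
    let x = Vec.of_array v in
    Vec.to_array (gemv a x)  (* gemv with default arguments computes a * x *)

For 4x4, though, the fixed overhead of the BLAS call may eat most of
the benefit; BLAS really pays off at sizes like my 61x61.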
Mike

On Fri, Feb 20, 2009 at 2:53 PM, Erick Matsen wrote:
> Wow, once again I am amazed by the vitality of this list. Thank you
> for your suggestions.
>
> Here is the context: we are interested in calculating the likelihood
> of taxonomic placement of short "metagenomics" sequence fragments from
> unknown organisms in the ocean. We start by assuming a model of
> sequence evolution, which is a reversible Markov chain. The taxonomy
> is represented as a tree, and the sequence information is a collection
> of likelihoods of sequence identities. As we move up the tree, these
> sequences "evolve" by being multiplied by the exponentiated
> instantaneous Markov matrix.
>
> The matrices are of the size of the sequence model: 4x4 when looking
> at nucleotides, and 20x20 when looking at proteins.
>
> The bottleneck is (I misspoke before) that we are multiplying many
> length-4 or length-20 vectors by a collection of matrices which
> represent the time evolution of those sequences, as follows.
>
> Outer loop:
>   modify the amount of time each Markov process runs
>   exponentiate the rate matrices to get transition matrices
>
>   Recur over the tree, starting at the leaves:
>     at a node, multiply all of the daughter likelihood vectors together
>     return the multiplication of that product by the transition
>     matrix (bottleneck!)
>
> The trees are on the order of 50 leaves, and there are about 500
> likelihood vectors done at once.
>
> All of this gets run on a big cluster of Xeons. It's not worth
> parallelizing, because we are already running many instances of this
> process, which fills up the cluster nodes.
>
> So far I have been doing the simplest thing possible, which is just
> to multiply the matrices out like \sum_j a_{ij} v_j. Um, this is a
> bit embarrassing.
>
> let mul_vec m v =
>   if Array.length v <> n_cols m then
>     failwith "mul_vec: matrix size and vector size don't match!";
>   let result = Array.create (n_rows m) N.zero in
>   for i=0 to (n_rows m)-1 do
>     for j=0 to (n_cols m)-1 do
>       result.(i) <- N.add result.(i) (N.mul (get m i j) v.(j))
>     done;
>   done;
>   result
>
> I have implemented it in a functorial way for flexibility. N is the
> number class. How much improvement might I hope for if I make a
> dedicated float vector multiplication function? I'm sorry, I know
> nothing about "boxing." Where can I read about that?
>
> Thank you again,
>
> Erick
>
> On Fri, Feb 20, 2009 at 10:46 AM, Xavier Leroy wrote:
> >> I'm working on speeding up some code, and I wanted to check with
> >> someone before implementation.
> >>
> >> As you can see below, the code primarily spends its time
> >> multiplying relatively small matrices. Precision is of course
> >> important but not an incredibly crucial issue, as the most
> >> important thing is relative comparison between things which
> >> *should* be pretty different.
> >
> > You need to post your matrix multiplication code so that the
> > regulars on this list can tear it to pieces :-)
> >
> > From the profile you gave, it looks like you parameterized your
> > matrix multiplication code over the + and * operations over matrix
> > elements. This is good for genericity but not so good for
> > performance, as it will result in more boxing (heap allocation) of
> > floating-point values. The first thing you should try is to write a
> > version of matrix multiplication that is specialized for type
> > "float".
> >
> > Then, there are several ways to write the textbook matrix
> > multiplication algorithm, some of which perform less boxing than
> > others. Again, post your code and we'll let you know.
> >
> >> Currently I'm just using native (double-precision) OCaml floats
> >> and the native OCaml arrays for a first pass on the problem. Now
> >> I'm thinking about moving to using float32 bigarrays, and I'm
> >> hoping that the code will double in speed. I'd like to know: is
> >> that realistic? Any other suggestions?
> >
> > It won't double in speed: arithmetic operations will take exactly
> > the same time in single or double precision. What single-precision
> > bigarrays buy you is halving the memory footprint of your matrices.
> > That could result in better cache behavior and therefore slightly
> > better speed, but it depends very much on the sizes and number of
> > your matrices.
> >
> > - Xavier Leroy
>
> _______________________________________________
> Caml-list mailing list. Subscription management:
> http://yquem.inria.fr/cgi-bin/mailman/listinfo/caml-list
> Archives: http://caml.inria.fr
> Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
> Bug reports: http://caml.inria.fr/bin/caml-bugs
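P.S. Bottom-posting this since it refers to Xavier's reply above: to
make his "specialize for float" suggestion concrete, here is roughly
what a float-only version of your mul_vec might look like. An untested
sketch; I'm assuming the matrix is a plain float array array rather
than your functorized type:

  (* Float-specialized version: OCaml stores float arrays unboxed, so
     row.(j) *. v.(j) needs no heap allocation, unlike the generic
     N.mul/N.add working on boxed floats. *)
  let mul_vec_float m v =
    let n_rows = Array.length m in
    let n_cols = Array.length v in
    let result = Array.make n_rows 0. in
    for i = 0 to n_rows - 1 do
      let row = m.(i) in
      if Array.length row <> n_cols then
        failwith "mul_vec_float: matrix size and vector size don't match!";
      (* local accumulator; the native compiler can usually keep it
         unboxed in a register *)
      let acc = ref 0. in
      for j = 0 to n_cols - 1 do
        acc := !acc +. row.(j) *. v.(j)
      done;
      result.(i) <- !acc
    done;
    result

The local accumulator is one of the variants Xavier alludes to: summing
directly into result.(i) inside the inner loop forces an array
read-modify-write on every iteration, whereas a local float can stay in
a register.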