caml-list - the Caml user's mailing list
* speeding up matrix multiplication (newbie question)
@ 2009-02-20 15:40 Erick Matsen
  2009-02-20 15:45 ` [Caml-list] " RABIH.ELCHAAR
                   ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Erick Matsen @ 2009-02-20 15:40 UTC (permalink / raw)
  To: caml-list

Hello Ocaml community---


I'm working on speeding up some code, and I wanted to check with
someone before implementation.

As you can see below, the code primarily spends its time multiplying relatively
small matrices. Precision is of course important but not an incredibly crucial
issue, as the most important thing is relative comparison between things which
*should* be pretty different. Currently I'm just using native
(double-precision) ocaml floats and the native ocaml arrays for a first pass on
the problem.

Now I'm thinking about moving to using float32 bigarrays, and I'm hoping that
the code will double in speed. I'd like to know: is that realistic? Any other
suggestions?


Thank you,

Erick



--------   profiling information --------

%      cumulative  self     self       total
time   seconds     seconds  calls      s/call  s/call  name
30.27  7.44        7.44     836419     0.00    0.00    camlMat__mul_vec_263
15.42  11.23       3.79     335237785  0.00    0.00    camlMat__get_447
14.65  14.83       3.60     334624076  0.00    0.00    camlNumber__mul_185
13.75  18.21       3.38     682814594  0.00    0.00    caml_apply2
11.31  20.99       2.78     334624076  0.00    0.00    camlNumber__add_183
6.02   22.47       1.48     335724401  0.00    0.00    caml_apply3
1.14   22.75       0.28     480860     0.00    0.00    camlDiagd__fun_304
1.06   23.01       0.26     159338     0.00    0.00    caml_oldify_local_roots
1.06   23.27       0.26     79634      0.00    0.00    sweep_slice
0.90   23.49       0.22     79828      0.00    0.00    mark_slice
0.65   23.65       0.16     10455018   0.00    0.00    camlQtree__code_begin
0.61   23.80       0.15     1517329    0.00    0.00    caml_oldify_one
0.57   23.94       0.14     17592082   0.00    0.00    camlMat__n_cols_458
0.57   24.08       0.14     13102569   0.00    0.00    caml_modify
0.57   24.22       0.14     522761     0.00    0.00    camlArray__mapi_142
...



* RE: [Caml-list] speeding up matrix multiplication (newbie question)
  2009-02-20 15:40 speeding up matrix multiplication (newbie question) Erick Matsen
@ 2009-02-20 15:45 ` RABIH.ELCHAAR
  2009-02-20 17:46 ` Jon Harrop
  2009-02-20 18:46 ` Xavier Leroy
  2 siblings, 0 replies; 12+ messages in thread
From: RABIH.ELCHAAR @ 2009-02-20 15:45 UTC (permalink / raw)
  To: matsen, caml-list

I don't think you can do better than calling some C functions (bounds checking, ...).
Why not have a look at OCaml bindings of C libraries (using bigarrays),
like ocamlgsl (O. Andrieu) http://oandrieu.nerim.net/ocaml/gsl/
or lacaml http://caml.inria.fr/cgi-bin/hump.fr.cgi?contrib=255

Hope this helps

Rabih

-----Original Message-----
From: caml-list-bounces@yquem.inria.fr [mailto:caml-list-bounces@yquem.inria.fr] On behalf of Erick Matsen
Sent: Friday, February 20, 2009 16:40
To: caml-list@inria.fr
Subject: [Caml-list] speeding up matrix multiplication (newbie question)

Hello Ocaml community---


I'm working on speeding up some code, and I wanted to check with
someone before implementation.

As you can see below, the code primarily spends its time multiplying relatively
small matrices. Precision is of course important but not an incredibly crucial
issue, as the most important thing is relative comparison between things which
*should* be pretty different. Currently I'm just using native
(double-precision) ocaml floats and the native ocaml arrays for a first pass on
the problem.

Now I'm thinking about moving to using float32 bigarrays, and I'm hoping that
the code will double in speed. I'd like to know: is that realistic? Any other
suggestions?


Thank you,

Erick



--------   profiling information --------

%      cumulative  self     self       total
time   seconds     seconds  calls      s/call  s/call  name
30.27  7.44        7.44     836419     0.00    0.00    camlMat__mul_vec_263
15.42  11.23       3.79     335237785  0.00    0.00    camlMat__get_447
14.65  14.83       3.60     334624076  0.00    0.00    camlNumber__mul_185
13.75  18.21       3.38     682814594  0.00    0.00    caml_apply2
11.31  20.99       2.78     334624076  0.00    0.00    camlNumber__add_183
6.02   22.47       1.48     335724401  0.00    0.00    caml_apply3
1.14   22.75       0.28     480860     0.00    0.00    camlDiagd__fun_304
1.06   23.01       0.26     159338     0.00    0.00    caml_oldify_local_roots
1.06   23.27       0.26     79634      0.00    0.00    sweep_slice
0.90   23.49       0.22     79828      0.00    0.00    mark_slice
0.65   23.65       0.16     10455018   0.00    0.00    camlQtree__code_begin
0.61   23.80       0.15     1517329    0.00    0.00    caml_oldify_one
0.57   23.94       0.14     17592082   0.00    0.00    camlMat__n_cols_458
0.57   24.08       0.14     13102569   0.00    0.00    caml_modify
0.57   24.22       0.14     522761     0.00    0.00    camlArray__mapi_142
...

_______________________________________________
Caml-list mailing list. Subscription management:
http://yquem.inria.fr/cgi-bin/mailman/listinfo/caml-list
Archives: http://caml.inria.fr
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
Bug reports: http://caml.inria.fr/bin/caml-bugs



* Re: [Caml-list] speeding up matrix multiplication (newbie question)
  2009-02-20 15:40 speeding up matrix multiplication (newbie question) Erick Matsen
  2009-02-20 15:45 ` [Caml-list] " RABIH.ELCHAAR
@ 2009-02-20 17:46 ` Jon Harrop
  2009-02-20 18:46 ` Xavier Leroy
  2 siblings, 0 replies; 12+ messages in thread
From: Jon Harrop @ 2009-02-20 17:46 UTC (permalink / raw)
  To: caml-list

On Friday 20 February 2009 15:40:00 Erick Matsen wrote:
> Hello Ocaml community---
>
> I'm working on speeding up some code, and I wanted to check with
> someone before implementation.
>
> As you can see below, the code primarily spends its time multiplying
> relatively small matrices. Precision is of course important but not an
> incredibly crucial issue, as the most important thing is relative
> comparison between things which *should* be pretty different. Currently I'm
> just using native
> (double-precision) ocaml floats and the native ocaml arrays for a first
> pass on the problem.
>
> Now I'm thinking about moving to using float32 bigarrays, and I'm hoping
> that the code will double in speed. I'd like to know: is that realistic?
> Any other suggestions?

What exactly are you doing? Exactly how big are the matrices? What is your 
exact code? What is the higher-level algorithm using the matrix multiply? Are 
you doing a loop just over many matrix multiplies? What platform and 
architecture are you using? Is the code parallelized?

Depending upon the answers to the above questions, you may be able to achieve 
huge performance improvements using tools like LLVM to generate SSE code 
whilst still acting upon OCaml's data structures.

-- 
Dr Jon Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/?e



* Re: [Caml-list] speeding up matrix multiplication (newbie question)
  2009-02-20 15:40 speeding up matrix multiplication (newbie question) Erick Matsen
  2009-02-20 15:45 ` [Caml-list] " RABIH.ELCHAAR
  2009-02-20 17:46 ` Jon Harrop
@ 2009-02-20 18:46 ` Xavier Leroy
  2009-02-20 19:53   ` Erick Matsen
  2 siblings, 1 reply; 12+ messages in thread
From: Xavier Leroy @ 2009-02-20 18:46 UTC (permalink / raw)
  To: Erick Matsen; +Cc: caml-list

> I'm working on speeding up some code, and I wanted to check with
> someone before implementation.
> 
> As you can see below, the code primarily spends its time multiplying
> relatively small matrices. Precision is of course important but not
> an incredibly crucial issue, as the most important thing is relative
> comparison between things which *should* be pretty different.

You need to post your matrix multiplication code so that the regulars
on this list can tear it to pieces :-)

From the profile you gave, it looks like you parameterized your matrix
multiplication code over the + and * operations over matrix elements.
This is good for genericity but not so good for performance, as it
will result in more boxing (heap allocation) of floating-point values.
The first thing you should try is to write a version of matrix
multiplication that is specialized for type "float".

Then, there are several ways to write the textbook matrix
multiplication algorithm, some of which perform less boxing than
others.  Again, post your code and we'll let you know.

> Currently I'm just using native (double-precision) ocaml floats and
> the native ocaml arrays for a first pass on the problem.  Now I'm
> thinking about moving to using float32 bigarrays, and I'm hoping
> that the code will double in speed. I'd like to know: is that
> realistic? Any other suggestions?

It won't double in speed: arithmetic operations will take exactly the
same time in single or double precision.  What single-precision
bigarrays buy you is halving the memory footprint of your matrices.
That could result in better cache behavior and therefore slightly
better speed, but it depends very much on the sizes and number of your
matrices.
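
For concreteness, a 20x20 single-precision matrix would be created
along these lines (a minimal sketch, not code from this thread):

  (* elements are stored unboxed as C floats; reading m.{i,j} yields
     an ordinary OCaml float, i.e. a double *)
  let m = Bigarray.Array2.create Bigarray.float32 Bigarray.c_layout 20 20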

- Xavier Leroy



* Re: [Caml-list] speeding up matrix multiplication (newbie question)
  2009-02-20 18:46 ` Xavier Leroy
@ 2009-02-20 19:53   ` Erick Matsen
  2009-02-20 21:21     ` Will M. Farr
                       ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Erick Matsen @ 2009-02-20 19:53 UTC (permalink / raw)
  To: Xavier Leroy; +Cc: caml-list

Wow, once again I am amazed by the vitality of this list. Thank you
for your suggestions.

Here is the context: we are interested in calculating the likelihood
of taxonomic placement of short "metagenomics" sequence fragments from
unknown organisms in the ocean. We start by assuming a model of
sequence evolution, which is a reversible Markov chain. The taxonomy
is represented as a tree, and the sequence information is a collection
of likelihoods of sequence identities. As we move up the tree, these
sequences "evolve" by getting multiplied by the exponentiated
instantaneous Markov matrix.

The matrices are of the size of the sequence model: 4x4 when looking
at nucleotides, and 20x20 when looking at proteins.

The bottleneck is (I misspoke before) that we are multiplying many
length-4 or length-20 vectors by a collection of matrices which
represent the time evolution of those sequences as follows.

Outer loop:
  modify the amount of time each Markov process runs
  exponentiate the rate matrices to get transition matrices

  Recur over the tree, starting at the leaves:
    at a node, multiply all of the daughter likelihood vectors together
    return the multiplication of that product by the transition matrix
(bottleneck!)

The trees are on the order of 50 leaves, and there are about 500
likelihood vectors done at once.
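
Roughly, in code, the recursion looks like this (just a sketch to fix
ideas: the names are illustrative, the real code threads a per-branch
transition matrix through the recursion, and mul_vec is the function
shown below):

  type tree =
    | Leaf of float array    (* likelihood vector at a tip *)
    | Node of tree list      (* internal node and its daughter subtrees *)

  (* componentwise product of two likelihood vectors *)
  let vec_prod u w = Array.mapi (fun i x -> x *. w.(i)) u

  let rec likes transition = function
    | Leaf v -> mul_vec transition v
    | Node daughters ->
        let vs = List.map (likes transition) daughters in
        mul_vec transition (List.fold_left vec_prod (List.hd vs) (List.tl vs))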

All of this gets run on a big cluster of Xeons. It's not worth
parallelizing because we are running many instances of this process
already, which fills up the cluster nodes.

So far I have been doing the simplest thing possible, which is just to
multiply the matrices out like \sum_j a_ij v_j. Um, this is a bit
embarrassing.

let mul_vec m v =
  if Array.length v <> n_cols m then
    failwith "mul_vec: matrix size and vector size don't match!";
  let result = Array.create (n_rows m) N.zero in
  for i = 0 to (n_rows m) - 1 do
    for j = 0 to (n_cols m) - 1 do
      result.(i) <- N.add result.(i) (N.mul (get m i j) v.(j))
    done;
  done;
  result

I have implemented it in a functorial way for flexibility. N is the
number class. How much improvement might I hope for if I make a
dedicated float vector multiplication function? I'm sorry, I know
nothing about "boxing." Where can I read about that?


Thank you again,

Erick



On Fri, Feb 20, 2009 at 10:46 AM, Xavier Leroy <Xavier.Leroy@inria.fr> wrote:
>> I'm working on speeding up some code, and I wanted to check with
>> someone before implementation.
>>
>> As you can see below, the code primarily spends its time multiplying
>> relatively small matrices. Precision is of course important but not
>> an incredibly crucial issue, as the most important thing is relative
>> comparison between things which *should* be pretty different.
>
> You need to post your matrix multiplication code so that the regulars
> on this list can tear it to pieces :-)
>
> From the profile you gave, it looks like you parameterized your matrix
> multiplication code over the + and * operations over matrix elements.
> This is good for genericity but not so good for performance, as it
> will result in more boxing (heap allocation) of floating-point values.
> The first thing you should try is to write a version of matrix
> multiplication that is specialized for type "float".
>
> Then, there are several ways to write the textbook matrix
> multiplication algorithm, some of which perform less boxing than
> others.  Again, post your code and we'll let you know.
>
>> Currently I'm just using native (double-precision) ocaml floats and
>> the native ocaml arrays for a first pass on the problem.  Now I'm
>> thinking about moving to using float32 bigarrays, and I'm hoping
>> that the code will double in speed. I'd like to know: is that
>> realistic? Any other suggestions?
>
> It won't double in speed: arithmetic operations will take exactly the
> same time in single or double precision.  What single-precision
> bigarrays buy you is halving the memory footprint of your matrices.
> That could result in better cache behavior and therefore slightly
> better speed, but it depends very much on the sizes and number of your
> matrices.
>
> - Xavier Leroy
>



* Re: [Caml-list] speeding up matrix multiplication (newbie question)
  2009-02-20 19:53   ` Erick Matsen
@ 2009-02-20 21:21     ` Will M. Farr
  2009-02-20 21:37     ` Martin Jambon
  2009-02-20 22:23     ` Mike Lin
  2 siblings, 0 replies; 12+ messages in thread
From: Will M. Farr @ 2009-02-20 21:21 UTC (permalink / raw)
  To: Erick Matsen, ocaml

Erick,

Sorry about the long email, but here is an explanation of what
"boxing" means, how it slows you down in this case, and how you can
(eventually) figure out whether it will slow you down in general.  I'm
not an expert, so I've probably made mistakes in the following, but I
think the broad outlines are correct.  True experts can weigh in to
correct me.

Due to polymorphism (i.e. the fact that one must be able to stuff
*any* ocaml value into an 'a list), every ocaml value must have a
uniform format and size if it is possible that it can be used
polymorphically.  The current compiler can prove in certain cases that
a value is not used polymorphically; when this is possible, the value
can be stored more efficiently.  For example, in the expression:

let f a b =
  let x = a *. b in
  3.0 *. x

the compiler can tell that x is always going to be used as a float and
not polymorphically.  However, a, b, and (3.0 *. x) can escape to the
larger world, where they could potentially be used polymorphically,
and therefore need to have the uniform representation.

It turns out that the uniform representation chosen for the OCaml
runtime is a machine word (a 32-bit integer on a 32-bit platform),
holding either an immediate value or a pointer.  So, to represent a
float that can be used polymorphically, the runtime allocates a 64-bit
block of memory, stuffs the float into it, and uses the pointer to
that block to represent the float in the program.  This procedure is
called "boxing".  In the case above, the contents of x can be
represented "unboxed" as a real 64-bit floating point number, while
the contents of a and b, and the result (3.0 *. x), must be boxed.

The boxing rules for the ocaml compiler are described at
http://caml.inria.fr/pub/old_caml_site/ocaml/numerical.html (I don't
know if this is current or not).  The relevant rules for analyzing
your code are probably a) no cross-module inlining of functor argument
functions (i.e. N.add, N.mul, etc), b) boxing of floats at all
(non-inlined) function boundaries, and c) floating point arrays store
their elements unboxed.  There are other rules, given at the above
website.

So, the code you wrote

result.(i) <- N.add result.(i) (N.mul (get m i j) v.(j))

allocates a box for the result of (get m i j) (unless this procedure
is inlined, which is possible if it's short, and not coming in via
functor argument---it would be better to use m.(i).(j) if possible),
another box for v.(j), passes them to N.mul, which allocates a box for
the result of the multiplication.  Then a box is allocated for
result.(i), and the previous result and result.(i) are passed to
N.add, which allocates a box for its result.  Then this box is opened
and the value in it is stored in result.(i).  Ugh.  If, instead, you
write

result.(i) <- result.(i) +. (get m i j)*.v.(j)

there is *at most* one box for (get m i j).  If we assume that (get m
i j) is inlined and therefore the boxing overhead is eliminated, the
entire process takes only a few clock cycles.  Compare to possibly
several hundred or more clock cycles above, plus the extra stress on
the GC from collecting all the boxes when they are no longer
referenced by the program, and it adds up to *significant* overhead.

In short, you'll get a lot faster if you specialize the operations to
floats.  Like order-of-magnitude or greater speedups.
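
For instance, a float-specialized version of your loop might look
something like the sketch below (untested, and it assumes the matrix
is a plain float array array rather than the abstract type behind
get):

  let mul_vec_float (m : float array array) (v : float array) =
    let n_rows = Array.length m in
    let n_cols = Array.length v in
    let result = Array.make n_rows 0.0 in  (* float array: unboxed elements *)
    for i = 0 to n_rows - 1 do
      let row = m.(i) in      (* hoist the row lookup out of the inner loop *)
      let acc = ref 0.0 in    (* local float ref: kept unboxed by ocamlopt *)
      for j = 0 to n_cols - 1 do
        acc := !acc +. row.(j) *. v.(j)
      done;
      result.(i) <- !acc
    done;
    result

No N.add/N.mul calls and no per-element allocation: the only
allocation is the result array itself.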

Good luck!

Will

On Fri, Feb 20, 2009 at 2:53 PM, Erick Matsen <matsen@berkeley.edu> wrote:
> Wow, once again I am amazed by the vitality of this list. Thank you
> for your suggestions.
>
> Here is the context: we are interested in calculating the likelihood
> of taxonomic placement of short "metagenomics" sequence fragments from
> unknown organisms in the ocean. We start by assuming a model of
> sequence evolution, which is a reversible Markov chain. The taxonomy
> is represented as a tree, and the sequence information is a collection
> of likelihoods of sequence identities. As we move up the tree, these
> sequences "evolve" by getting multiplied by the exponentiated
> instantaneous Markov matrix.
>
> The matrices are of the size of the sequence model: 4x4 when looking
> at nucleotides, and 20x20 when looking at proteins.
>
> The bottleneck is (I misspoke before) that we are multiplying many
> length-4 or length-20 vectors by a collection of matrices which
> represent the time evolution of those sequences as follows.
>
> Outer loop:
>  modify the amount of time each Markov process runs
>  exponentiate the rate matrices to get transition matrices
>
>  Recur over the tree, starting at the leaves:
>    at a node, multiply all of the daughter likelihood vectors together
>    return the multiplication of that product by the transition matrix
> (bottleneck!)
>
> The trees are on the order of 50 leaves, and there are about 500
> likelihood vectors done at once.
>
> All of this gets run on a big cluster of Xeons. It's not worth
> parallelizing because we are running many instances of this process
> already, which fills up the cluster nodes.
>
> So far I have been doing the simplest thing possible, which is just to
> multiply the matrices out like \sum_j a_ij v_j. Um, this is a bit
> embarrassing.
>
> let mul_vec m v =
>    if Array.length v <> n_cols m then
>      failwith "mul_vec: matrix size and vector size don't match!";
>    let result = Array.create (n_rows m) N.zero in
>    for i=0 to (n_rows m)-1 do
>      for j=0 to (n_cols m)-1 do
>        result.(i) <- N.add result.(i) (N.mul (get m i j) v.(j))
>      done;
>    done;
>    result
>
> I have implemented it in a functorial way for flexibility. N is the
> number class. How much improvement might I hope for if I make a
> dedicated float vector multiplication function? I'm sorry, I know
> nothing about "boxing." Where can I read about that?
>
>
> Thank you again,
>
> Erick
>
>
>
> On Fri, Feb 20, 2009 at 10:46 AM, Xavier Leroy <Xavier.Leroy@inria.fr> wrote:
>>> I'm working on speeding up some code, and I wanted to check with
>>> someone before implementation.
>>>
>>> As you can see below, the code primarily spends its time multiplying
>>> relatively small matrices. Precision is of course important but not
>>> an incredibly crucial issue, as the most important thing is relative
>>> comparison between things which *should* be pretty different.
>>
>> You need to post your matrix multiplication code so that the regulars
>> on this list can tear it to pieces :-)
>>
>> From the profile you gave, it looks like you parameterized your matrix
>> multiplication code over the + and * operations over matrix elements.
>> This is good for genericity but not so good for performance, as it
>> will result in more boxing (heap allocation) of floating-point values.
>> The first thing you should try is to write a version of matrix
>> multiplication that is specialized for type "float".
>>
>> Then, there are several ways to write the textbook matrix
>> multiplication algorithm, some of which perform less boxing than
>> others.  Again, post your code and we'll let you know.
>>
>>> Currently I'm just using native (double-precision) ocaml floats and
>>> the native ocaml arrays for a first pass on the problem.  Now I'm
>>> thinking about moving to using float32 bigarrays, and I'm hoping
>>> that the code will double in speed. I'd like to know: is that
>>> realistic? Any other suggestions?
>>
>> It won't double in speed: arithmetic operations will take exactly the
>> same time in single or double precision.  What single-precision
>> bigarrays buy you is halving the memory footprint of your matrices.
>> That could result in better cache behavior and therefore slightly
>> better speed, but it depends very much on the sizes and number of your
>> matrices.
>>
>> - Xavier Leroy
>>
>
> _______________________________________________
> Caml-list mailing list. Subscription management:
> http://yquem.inria.fr/cgi-bin/mailman/listinfo/caml-list
> Archives: http://caml.inria.fr
> Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
> Bug reports: http://caml.inria.fr/bin/caml-bugs
>



* Re: [Caml-list] speeding up matrix multiplication (newbie question)
  2009-02-20 19:53   ` Erick Matsen
  2009-02-20 21:21     ` Will M. Farr
@ 2009-02-20 21:37     ` Martin Jambon
  2009-02-20 22:23     ` Mike Lin
  2 siblings, 0 replies; 12+ messages in thread
From: Martin Jambon @ 2009-02-20 21:37 UTC (permalink / raw)
  To: Erick Matsen; +Cc: Xavier Leroy, caml-list

Erick Matsen wrote:
> Wow, once again I am amazed by the vitality of this list. Thank you
> for your suggestions.
> 
> Here is the context: we are interested in calculating the likelihood
> of taxonomic placement of short "metagenomics" sequence fragments from
> unknown organisms in the ocean. We start by assuming a model of
> sequence evolution, which is a reversible Markov chain. The taxonomy
> is represented as a tree, and the sequence information is a collection
> of likelihoods of sequence identities. As we move up the tree, these
> sequences "evolve" by getting multiplied by the exponentiated
> instantaneous Markov matrix.
> 
> The matrices are of the size of the sequence model: 4x4 when looking
> at nucleotides, and 20x20 when looking at proteins.
> 
> The bottleneck is (I misspoke before) that we are multiplying many
> length-4 or length-20 vectors by a collection of matrices which
> represent the time evolution of those sequences as follows.
> 
> Outer loop:
>   modify the amount of time each Markov process runs
>   exponentiate the rate matrices to get transition matrices
> 
>   Recur over the tree, starting at the leaves:
>     at a node, multiply all of the daughter likelihood vectors together
>     return the multiplication of that product by the transition matrix
> (bottleneck!)
> 
> The trees are on the order of 50 leaves, and there are about 500
> likelihood vectors done at once.
> 
> All of this gets run on a big cluster of Xeons. It's not worth
> parallelizing because we are running many instances of this process
> already, which fills up the cluster nodes.
> 
> So far I have been doing the simplest thing possible, which is just to
> multiply the matrices out like \sum_j a_ij v_j. Um, this is a bit
> embarrassing.
> 
> let mul_vec m v =
>     if Array.length v <> n_cols m then
>       failwith "mul_vec: matrix size and vector size don't match!";
>     let result = Array.create (n_rows m) N.zero in
>     for i=0 to (n_rows m)-1 do
>       for j=0 to (n_cols m)-1 do
> 	result.(i) <- N.add result.(i) (N.mul (get m i j) v.(j))
>       done;
>     done;
>     result
> 
> I have implemented it in a functorial way for flexibility. N is the
> number class. How much improvement might I hope for if I make a
> dedicated float vector multiplication function? I'm sorry, I know
> nothing about "boxing." Where can I read about that?

Depending on the savings, you can afford to spend more or less time optimizing
this. Here are some simple things to consider:

In OCaml land, first try getting rid of the functor (or use a
defunctorizer: ocamldefun?).

Limit memory accesses by doing something like:

for i = 0 to n_rows - 1 do
  let m_i = m.(i) in
  for j = 0 to n_cols - 1 do
    let m_ij = m_i.(j) in (* instead of m.(i).(j) *)
    ...
  done
done

Also you can use Array.unsafe_get where it really matters.
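
For example, a dot product with the bounds checks elided might look
like this (a sketch; only safe when the caller has already checked
the lengths):

  let dot_unsafe (a : float array) (b : float array) =
    let n = Array.length a in
    let acc = ref 0.0 in
    for i = 0 to n - 1 do
      (* unsafe_get skips the bounds check *)
      acc := !acc +. Array.unsafe_get a i *. Array.unsafe_get b i
    done;
    !acc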


You can also use bigarrays and implement the loop in C. It could be fun. I'm
not sure how much it saves.


Martin



* Re: [Caml-list] speeding up matrix multiplication (newbie question)
  2009-02-20 19:53   ` Erick Matsen
  2009-02-20 21:21     ` Will M. Farr
  2009-02-20 21:37     ` Martin Jambon
@ 2009-02-20 22:23     ` Mike Lin
  2009-02-20 22:30       ` Will M. Farr
  2 siblings, 1 reply; 12+ messages in thread
From: Mike Lin @ 2009-02-20 22:23 UTC (permalink / raw)
  To: Erick Matsen; +Cc: caml-list

[-- Attachment #1: Type: text/plain, Size: 5438 bytes --]

Erick, we should compare notes sometime. I have a lot of code for doing this
kind of stuff (I am working on empirical codon models with 61x61 rate
matrices). The right way to speed up matrix-vector operations is to use BLAS
via either Lacaml or ocamlgsl. But if, like me, you like to
counterproductively fiddle around with stupid things, here's a little ocaml
library I wrote that links to a C function to do vector-vector dot using
SSE2 vectorization.

http://www.broad.mit.edu/~mlin/for_others/SuperFast_dot.zip

IIRC in my tests it gave about 50% speedup vs. the obvious C code and 100%
speedup vs. the obvious ocaml code. But, I am doing 61x61 and I'm not sure
if the speedup scales down to 20x20 or especially 4x4.

Mike

On Fri, Feb 20, 2009 at 2:53 PM, Erick Matsen <matsen@berkeley.edu> wrote:

> Wow, once again I am amazed by the vitality of this list. Thank you
> for your suggestions.
>
> Here is the context: we are interested in calculating the likelihood
> of taxonomic placement of short "metagenomics" sequence fragments from
> unknown organisms in the ocean. We start by assuming a model of
> sequence evolution, which is a reversible Markov chain. The taxonomy
> is represented as a tree, and the sequence information is a collection
> of likelihoods of sequence identities. As we move up the tree, these
> sequences "evolve" by getting multiplied by the exponentiated
> instantaneous Markov matrix.
>
> The matrices are of the size of the sequence model: 4x4 when looking
> at nucleotides, and 20x20 when looking at proteins.
>
> The bottleneck is (I misspoke before) that we are multiplying many
> length-4 or length-20 vectors by a collection of matrices which
> represent the time evolution of those sequences as follows.
>
> Outer loop:
>  modify the amount of time each Markov process runs
>  exponentiate the rate matrices to get transition matrices
>
>  Recur over the tree, starting at the leaves:
>    at a node, multiply all of the daughter likelihood vectors together
>    return the multiplication of that product by the transition matrix
> (bottleneck!)
>
> The trees are on the order of 50 leaves, and there are about 500
> likelihood vectors done at once.
>
> All of this gets run on a big cluster of Xeons. It's not worth
> parallelizing because we are running many instances of this process
> already, which fills up the cluster nodes.
>
> So far I have been doing the simplest thing possible, which is just to
> multiply the matrices out like \sum_j a_ij v_j. Um, this is a bit
> embarrassing.
>
> let mul_vec m v =
>    if Array.length v <> n_cols m then
>      failwith "mul_vec: matrix size and vector size don't match!";
>    let result = Array.create (n_rows m) N.zero in
>    for i=0 to (n_rows m)-1 do
>      for j=0 to (n_cols m)-1 do
>        result.(i) <- N.add result.(i) (N.mul (get m i j) v.(j))
>      done;
>    done;
>    result
>
> I have implemented it in a functorial way for flexibility. N is the
> number class. How much improvement might I hope for if I make a
> dedicated float vector multiplication function? I'm sorry, I know
> nothing about "boxing." Where can I read about that?
>
>
> Thank you again,
>
> Erick
>
>
>
> On Fri, Feb 20, 2009 at 10:46 AM, Xavier Leroy <Xavier.Leroy@inria.fr>
> wrote:
> >> I'm working on speeding up some code, and I wanted to check with
> >> someone before implementation.
> >>
> >> As you can see below, the code primarily spends its time multiplying
> >> relatively small matrices. Precision is of course important but not
> >> an incredibly crucial issue, as the most important thing is relative
> >> comparison between things which *should* be pretty different.
> >
> > You need to post your matrix multiplication code so that the regulars
> > on this list can tear it to pieces :-)
> >
> > From the profile you gave, it looks like you parameterized your matrix
> > multiplication code over the + and * operations over matrix elements.
> > This is good for genericity but not so good for performance, as it
> > will result in more boxing (heap allocation) of floating-point values.
> > The first thing you should try is to write a version of matrix
> > multiplication that is specialized for type "float".
> >
> > Then, there are several ways to write the textbook matrix
> > multiplication algorithm, some of which perform less boxing than
> > others.  Again, post your code and we'll let you know.
> >
> >> Currently I'm just using native (double-precision) ocaml floats and
> >> the native ocaml arrays for a first pass on the problem.  Now I'm
> >> thinking about moving to using float32 bigarrays, and I'm hoping
> >> that the code will double in speed. I'd like to know: is that
> >> realistic? Any other suggestions?
> >
> > It won't double in speed: arithmetic operations will take exactly the
> > same time in single or double precision.  What single-precision
> > bigarrays buy you is halving the memory footprint of your matrices.
> > That could result in better cache behavior and therefore slightly
> > better speed, but it depends very much on the sizes and number of your
> > matrices.
> >
> > - Xavier Leroy
> >
>
> _______________________________________________
> Caml-list mailing list. Subscription management:
> http://yquem.inria.fr/cgi-bin/mailman/listinfo/caml-list
> Archives: http://caml.inria.fr
> Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
> Bug reports: http://caml.inria.fr/bin/caml-bugs
>

[-- Attachment #2: Type: text/html, Size: 6990 bytes --]


* Re: [Caml-list] speeding up matrix multiplication (newbie question)
  2009-02-20 22:23     ` Mike Lin
@ 2009-02-20 22:30       ` Will M. Farr
  2009-02-20 22:43         ` Markus Mottl
  2009-02-20 22:43         ` Mike Lin
  0 siblings, 2 replies; 12+ messages in thread
From: Will M. Farr @ 2009-02-20 22:30 UTC (permalink / raw)
  To: Mike Lin; +Cc: Erick Matsen, caml-list

Mike and Erick,

In some of my work, I've got code which is constantly creating and
multiplying 4x4 matrices (Lorentz transforms).  I usually write in a
functional style, so I do not generally overwrite old matrices with
the multiplication results.  I have discovered that, at these sizes,
it's about a factor of two to three faster to do this with OCaml
arrays for the matrices rather than the Bigarrays you would need to
use to interface with the BLAS in Lacaml or ocamlgsl.  The reason is
that the memory block for a bigarray is malloc-ed, which is extremely
slow compared to the OCaml native allocation.  That overhead
completely eliminates the advantage of going through C or Fortran
libraries.

The ideal solution would be to use OCaml native float arrays (fast
allocation) with LAPACK (fast matrix multiply), but I'm satisfied with
the performance of my code, and too lazy to put out the effort to
implement the C glue for this.

Just thought you might appreciate another data point.  I have no idea
where the turnover is between 4x4 and 61x61 :).

Will

2009/2/20 Mike Lin <nilekim@gmail.com>:
> Erick, we should compare notes sometime. I have a lot of code for doing this
> kind of stuff (I am working on empirical codon models with 61x61 rate
> matrices). The right way to speed up matrix-vector operations is to use BLAS
> via either Lacaml or ocamlgsl. But if, like me, you like to
> counterproductively fiddle around with stupid things, here's a little ocaml
> library I wrote that links to a C function to do vector-vector dot using
> SSE2 vectorization.
>
> http://www.broad.mit.edu/~mlin/for_others/SuperFast_dot.zip
>
> IIRC in my tests it gave about 50% speedup vs. the obvious C code and 100%
> speedup vs. the obvious ocaml code. But, I am doing 61x61 and I'm not sure
> if the speedup scales down to 20x20 or especially 4x4.
>
> Mike
>
> On Fri, Feb 20, 2009 at 2:53 PM, Erick Matsen <matsen@berkeley.edu> wrote:
>>
>> Wow, once again I am amazed by the vitality of this list. Thank you
>> for your suggestions.
>>
>> Here is the context: we are interested in calculating the likelihood
>> of taxonomic placement of short "metagenomics" sequence fragments from
>> unknown organisms in the ocean. We start by assuming a model of
>> sequence evolution, which is a reversible Markov chain. The taxonomy
>> is represented as a tree, and the sequence information is a collection
>> of likelihoods of sequence identities. As we move up the tree, these
>> sequences "evolve" by getting multiplied by the exponentiated
>> instantaneous Markov matrix.
>>
>> The matrices are of the size of the sequence model: 4x4 when looking
>> at nucleotides, and 20x20 when looking at proteins.
>>
>> The bottleneck is (I misspoke before) that we are multiplying many
>> length-4 or length-20 vectors by a collection of matrices which
>> represent the time evolution of those sequences as follows.
>>
>> Outer loop:
>>  modify the amount of time each Markov process runs
>>  exponentiate the rate matrices to get transition matrices
>>
>>  Recur over the tree, starting at the leaves:
>>    at a node, multiply all of the daughter likelihood vectors together
>>    return the multiplication of that product by the transition matrix
>> (bottleneck!)
>>
>> The trees are on the order of 50 leaves, and there are about 500
>> likelihood vectors done at once.
>>
>> All of this gets run on a big cluster of Xeons. It's not worth
>> parallelizing because we are running many instances of this process
>> already, which fills up the cluster nodes.
>>
>> So far I have been doing the simplest thing possible, which is just to
>> multiply the matrices out like \sum_j a_ij v_j. Um, this is a bit
>> embarrassing.
>>
>> let mul_vec m v =
>>    if Array.length v <> n_cols m then
>>      failwith "mul_vec: matrix size and vector size don't match!";
>>    let result = Array.create (n_rows m) N.zero in
>>    for i=0 to (n_rows m)-1 do
>>      for j=0 to (n_cols m)-1 do
>>        result.(i) <- N.add result.(i) (N.mul (get m i j) v.(j))
>>      done;
>>    done;
>>    result
>>
>> I have implemented it in a functorial way for flexibility. N is the
>> number class. How much improvement might I hope for if I make a
>> dedicated float vector multiplication function? I'm sorry, I know
>> nothing about "boxing." Where can I read about that?
>>
>>
>> Thank you again,
>>
>> Erick
>>
>>
>>
>> On Fri, Feb 20, 2009 at 10:46 AM, Xavier Leroy <Xavier.Leroy@inria.fr>
>> wrote:
>> >> I'm working on speeding up some code, and I wanted to check with
>> >> someone before implementation.
>> >>
>> >> As you can see below, the code primarily spends its time multiplying
>> >> relatively small matrices. Precision is of course important but not
>> >> an incredibly crucial issue, as the most important thing is relative
>> >> comparison between things which *should* be pretty different.
>> >
>> > You need to post your matrix multiplication code so that the regulars
>> > on this list can tear it to pieces :-)
>> >
>> > From the profile you gave, it looks like you parameterized your matrix
>> > multiplication code over the + and * operations over matrix elements.
>> > This is good for genericity but not so good for performance, as it
>> > will result in more boxing (heap allocation) of floating-point values.
>> > The first thing you should try is to write a version of matrix
>> > multiplication that is specialized for type "float".
>> >
>> > Then, there are several ways to write the textbook matrix
>> > multiplication algorithm, some of which perform less boxing than
>> > others.  Again, post your code and we'll let you know.
>> >
>> >> Currently I'm just using native (double-precision) ocaml floats and
>> >> the native ocaml arrays for a first pass on the problem.  Now I'm
>> >> thinking about moving to using float32 bigarrays, and I'm hoping
>> >> that the code will double in speed. I'd like to know: is that
>> >> realistic? Any other suggestions?
>> >
>> > It won't double in speed: arithmetic operations will take exactly the
>> > same time in single or double precision.  What single-precision
>> > bigarrays buy you is halving the memory footprint of your matrices.
>> > That could result in better cache behavior and therefore slightly
>> > better speed, but it depends very much on the sizes and number of your
>> > matrices.
>> >
>> > - Xavier Leroy
>> >
>>
>> _______________________________________________
>> Caml-list mailing list. Subscription management:
>> http://yquem.inria.fr/cgi-bin/mailman/listinfo/caml-list
>> Archives: http://caml.inria.fr
>> Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
>> Bug reports: http://caml.inria.fr/bin/caml-bugs
>
>
> _______________________________________________
> Caml-list mailing list. Subscription management:
> http://yquem.inria.fr/cgi-bin/mailman/listinfo/caml-list
> Archives: http://caml.inria.fr
> Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
> Bug reports: http://caml.inria.fr/bin/caml-bugs
>
>



* Re: [Caml-list] speeding up matrix multiplication (newbie question)
  2009-02-20 22:30       ` Will M. Farr
@ 2009-02-20 22:43         ` Markus Mottl
  2009-02-23 22:59           ` Erick Matsen
  2009-02-20 22:43         ` Mike Lin
  1 sibling, 1 reply; 12+ messages in thread
From: Markus Mottl @ 2009-02-20 22:43 UTC (permalink / raw)
  To: Will M. Farr; +Cc: Mike Lin, caml-list

Unless you want to interface C-calls into BLAS/LAPACK directly without
bounds checking, releasing the OCaml-lock, and other "fru-fru", it
seems unlikely that you will get much of an advantage using those
libraries given the small size of your matrices.  E.g. Lacaml is
optimized for larger matrices (probably > 10x10).

I guess you should be fine rolling your own implementation for such
small matrices.

Regards,
Markus

-- 
Markus Mottl        http://www.ocaml.info        markus.mottl@gmail.com



* Re: [Caml-list] speeding up matrix multiplication (newbie question)
  2009-02-20 22:30       ` Will M. Farr
  2009-02-20 22:43         ` Markus Mottl
@ 2009-02-20 22:43         ` Mike Lin
  1 sibling, 0 replies; 12+ messages in thread
From: Mike Lin @ 2009-02-20 22:43 UTC (permalink / raw)
  To: Will M. Farr; +Cc: Erick Matsen, caml-list

[-- Attachment #1: Type: text/plain, Size: 7924 bytes --]

These are good points. I tend to compulsively eliminate any kind of memory
allocation from my numerical loops -- it's true the OCaml allocator is a lot
faster than malloc, but you could end up repaying a lot of that back to the
GC later!
The silly library I sent out does operate on OCaml float arrays (which are
unboxed and directly accessible as double arrays to C, though I'm not sure
if it's totally kosher to use 'em that way). Something to kick around...
Mike

On Fri, Feb 20, 2009 at 5:30 PM, Will M. Farr <farr@mit.edu> wrote:

> Mike and Erick,
>
> In some of my work, I've got code which is constantly creating and
> multiplying 4x4 matrices (Lorentz transforms).  I usually write in a
> functional style, so I do not generally overwrite old matrices with
> the multiplication results.  I have discovered that, at these sizes,
> it's about a factor of two to three faster to do this with OCaml
> arrays for the matrices rather than the Bigarrays you would need to
> use to interface with the BLAS in Lacaml or ocamlgsl.  The reason is
> that the memory block for a bigarray is malloc-ed, which is extremely
> slow compared to the OCaml native allocation.  That overhead
> completely eliminates the advantage of going through C or Fortran
> libraries.
>
> The ideal solution would be to use OCaml native float arrays (fast
> allocation) with LAPACK (fast matrix multiply), but I'm satisfied with
> the performance of my code, and too lazy to put out the effort to
> implement the C glue for this.
>
> Just thought you might appreciate another data point.  I have no idea
> where the turnover is between 4x4 and 61x61 :).
>
> Will
>
> 2009/2/20 Mike Lin <nilekim@gmail.com>:
> > Erick, we should compare notes sometime. I have a lot of code for doing
> this
> > kind of stuff (I am working on empirical codon models with 61x61 rate
> > matrices). The right way to speed up matrix-vector operations is to use
> BLAS
> > via either Lacaml or ocamlgsl. But if, like me, you like to
> > counterproductively fiddle around with stupid things, here's a little
> ocaml
> > library I wrote that links to a C function to do vector-vector dot using
> > SSE2 vectorization.
> >
> > http://www.broad.mit.edu/~mlin/for_others/SuperFast_dot.zip
> >
> > IIRC in my tests it gave about 50% speedup vs. the obvious C code and
> 100%
> > speedup vs. the obvious ocaml code. But, I am doing 61x61 and I'm not
> sure
> > if the speedup scales down to 20x20 or especially 4x4.
> >
> > Mike
> >
> > On Fri, Feb 20, 2009 at 2:53 PM, Erick Matsen <matsen@berkeley.edu>
> wrote:
> >>
> >> Wow, once again I am amazed by the vitality of this list. Thank you
> >> for your suggestions.
> >>
> >> Here is the context: we are interested in calculating the likelihood
> >> of taxonomic placement of short "metagenomics" sequence fragments from
> >> unknown organisms in the ocean. We start by assuming a model of
> >> sequence evolution, which is a reversible Markov chain. The taxonomy
> >> is represented as a tree, and the sequence information is a collection
> >> of likelihoods of sequence identities. As we move up the tree, these
> >> sequences "evolve" by getting multiplied by the exponentiated
> >> instantaneous Markov matrix.
> >>
> >> The matrices are of the size of the sequence model: 4x4 when looking
> >> at nucleotides, and 20x20 when looking at proteins.
> >>
> >> The bottleneck is (I misspoke before) that we are multiplying many
> >> length-4 or length-20 vectors by a collection of matrices which
> >> represent the time evolution of those sequences as follows.
> >>
> >> Outer loop:
> >>  modify the amount of time each Markov process runs
> >>  exponentiate the rate matrices to get transition matrices
> >>
> >>  Recur over the tree, starting at the leaves:
> >>    at a node, multiply all of the daughter likelihood vectors together
> >>    return the multiplication of that product by the transition matrix
> >> (bottleneck!)
> >>
> >> The trees are on the order of 50 leaves, and there are about 500
> >> likelihood vectors done at once.
> >>
> >> All of this gets run on a big cluster of Xeons. It's not worth
> >> parallelizing because we are running many instances of this process
> >> already, which fills up the cluster nodes.
> >>
> >> So far I have been doing the simplest thing possible, which is just to
> >> multiply the matrices out like \sum_j a_ij v_j. Um, this is a bit
> >> embarrassing.
> >>
> >> let mul_vec m v =
> >>    if Array.length v <> n_cols m then
> >>      failwith "mul_vec: matrix size and vector size don't match!";
> >>    let result = Array.create (n_rows m) N.zero in
> >>    for i=0 to (n_rows m)-1 do
> >>      for j=0 to (n_cols m)-1 do
> >>        result.(i) <- N.add result.(i) (N.mul (get m i j) v.(j))
> >>      done;
> >>    done;
> >>    result
> >>
> >> I have implemented it in a functorial way for flexibility. N is the
> >> number class. How much improvement might I hope for if I make a
> >> dedicated float vector multiplication function? I'm sorry, I know
> >> nothing about "boxing." Where can I read about that?
> >>
> >>
> >> Thank you again,
> >>
> >> Erick
> >>
> >>
> >>
> >> On Fri, Feb 20, 2009 at 10:46 AM, Xavier Leroy <Xavier.Leroy@inria.fr>
> >> wrote:
> >> >> I'm working on speeding up some code, and I wanted to check with
> >> >> someone before implementation.
> >> >>
> >> >> As you can see below, the code primarily spends its time multiplying
> >> >> relatively small matrices. Precision is of course important but not
> >> >> an incredibly crucial issue, as the most important thing is relative
> >> >> comparison between things which *should* be pretty different.
> >> >
> >> > You need to post your matrix multiplication code so that the regulars
> >> > on this list can tear it to pieces :-)
> >> >
> >> > From the profile you gave, it looks like you parameterized your matrix
> >> > multiplication code over the + and * operations over matrix elements.
> >> > This is good for genericity but not so good for performance, as it
> >> > will result in more boxing (heap allocation) of floating-point values.
> >> > The first thing you should try is to write a version of matrix
> >> > multiplication that is specialized for type "float".
> >> >
> >> > Then, there are several ways to write the textbook matrix
> >> > multiplication algorithm, some of which perform less boxing than
> >> > others.  Again, post your code and we'll let you know.
> >> >
> >> >> Currently I'm just using native (double-precision) ocaml floats and
> >> >> the native ocaml arrays for a first pass on the problem.  Now I'm
> >> >> thinking about moving to using float32 bigarrays, and I'm hoping
> >> >> that the code will double in speed. I'd like to know: is that
> >> >> realistic? Any other suggestions?
> >> >
> >> > It won't double in speed: arithmetic operations will take exactly the
> >> > same time in single or double precision.  What single-precision
> >> > bigarrays buy you is halving the memory footprint of your matrices.
> >> > That could result in better cache behavior and therefore slightly
> >> > better speed, but it depends very much on the sizes and number of your
> >> > matrices.
> >> >
> >> > - Xavier Leroy
> >> >
> >>
> >> _______________________________________________
> >> Caml-list mailing list. Subscription management:
> >> http://yquem.inria.fr/cgi-bin/mailman/listinfo/caml-list
> >> Archives: http://caml.inria.fr
> >> Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
> >> Bug reports: http://caml.inria.fr/bin/caml-bugs
> >
> >
> > _______________________________________________
> > Caml-list mailing list. Subscription management:
> > http://yquem.inria.fr/cgi-bin/mailman/listinfo/caml-list
> > Archives: http://caml.inria.fr
> > Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
> > Bug reports: http://caml.inria.fr/bin/caml-bugs
> >
> >
>

[-- Attachment #2: Type: text/html, Size: 10653 bytes --]


* Re: [Caml-list] speeding up matrix multiplication (newbie question)
  2009-02-20 22:43         ` Markus Mottl
@ 2009-02-23 22:59           ` Erick Matsen
  0 siblings, 0 replies; 12+ messages in thread
From: Erick Matsen @ 2009-02-23 22:59 UTC (permalink / raw)
  To: Markus Mottl; +Cc: Will M. Farr, Mike Lin, caml-list

Hello Caml-community--


First of all, big thanks to Will Farr, Mike Lin, Martin Jambon, and Markus
Mottl.

So far, I rewrote things in a non-functional way to avoid unnecessary memory
allocation, and then used the float operations directly. This resulted in a
4-fold increase in speed. Second, I moved some of the code to use Bigarrays and
then used the Lacaml BLAS interface. This resulted in another 25% savings. I
didn't have big expectations, given Markus' email.
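
For reference, the Lacaml call involved is essentially a one-liner;
this is a sketch assuming Lacaml's standard double-precision
interface, not my actual code:

  open Lacaml.D
  let a = Mat.of_array [| [| 1.0; 2.0 |]; [| 3.0; 4.0 |] |]
  let x = Vec.of_array [| 5.0; 6.0 |]
  let y = gemv a x   (* y = a * x in a single dgemv call *)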

Indeed, he appears to be right:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
 14.90      9.60     9.60                             caml_ba_offset
 13.82     18.50     8.90 11368070     0.00     0.00  camlFqtree__fun_144
 10.35     25.17     6.67                             caml_ba_get_N
  7.50     30.00     4.83  7456800     0.00     0.00  camlDiagd__fun_309
  7.31     34.71     4.71 456351614     0.00     0.00  caml_copy_double
  6.85     39.12     4.41                             caml_ba_set_aux
  5.84     42.88     3.76 880051135     0.00     0.00  caml_c_call
  4.43     45.73     2.85                             caml_ba_get_1
  3.55     48.02     2.29                             caml_ba_set_1
  1.90     49.24     1.23                             __i686.get_pc_thunk.bx
  1.89     50.47     1.22                             camlVec4_D__of_array_155
  1.87     51.67     1.21                             lacaml_Zlansy_stub_bc
  1.65     52.73     1.06 151152614     0.00     0.00  caml_modify
  1.37     53.61     0.88  9788472     0.00     0.00  caml_adjust_gc_speed
  1.33     54.47     0.86 12992080     0.00     0.00  camlFqtree__calcLikes_87
  0.99     55.11     0.64                             caml_ceil_float
  0.96     55.73     0.62                             lacaml_Dgemv_stub
  0.89     56.30     0.57                             caml_ba_get_2
  0.88     56.88     0.57                             camlLacaml_utils__gXmv_get_params_375
  0.80     57.39     0.52                             caml_ba_dim
  0.79     57.90     0.51                             camlLacaml4_D__gemv_227
  0.78     58.41     0.51    46584     0.00     0.00  sweep_slice
  0.76     58.90     0.49  9788472     0.00     0.00  caml_alloc_custom
  0.39     59.15     0.25    43714     0.00     0.00  mark_slice
  0.39     59.40     0.25                             caml_ba_multov

As he suggested would be the case, the code is spending most of its time in the
bigarray stub function responsible for checking bounds and pulling out the
right array element.

I wanted to double-check that writing my own implementation in Fortran or C
would eliminate such overhead. Or will I still have to do the "fru-fru" he
describes anyway?
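
If it helps clarify the question: my understanding is that the point of
a dedicated C implementation would be to cross the OCaml/C boundary once
per multiply rather than once per element. On the OCaml side that would
just be an external declaration, sketched below (the stub name is made
up):

  external mul_vec_c :
    (float, Bigarray.float64_elt, Bigarray.c_layout) Bigarray.Array2.t ->
    (float, Bigarray.float64_elt, Bigarray.c_layout) Bigarray.Array1.t ->
    (float, Bigarray.float64_elt, Bigarray.c_layout) Bigarray.Array1.t ->
    unit = "mul_vec_stub"
  (* the C side would read the data pointers out of the bigarrays once
     and run the whole y <- A*x loop in C *)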


Again, thank you all. The improvements have already been quite significant.


Erick


On Fri, Feb 20, 2009 at 2:43 PM, Markus Mottl <markus.mottl@gmail.com> wrote:
> Unless you want to interface C-calls into BLAS/LAPACK directly without
> bounds checking, releasing the OCaml-lock, and other "fru-fru", it
> seems unlikely that you will get much of an advantage using those
> libraries given the small size of your matrices.  E.g. Lacaml is
> optimized for larger matrices (probably > 10x10).
>
> I guess you should be fine rolling your own implementation for such
> small matrices.
>
> Regards,
> Markus
>
> --
> Markus Mottl        http://www.ocaml.info        markus.mottl@gmail.com
>
> _______________________________________________
> Caml-list mailing list. Subscription management:
> http://yquem.inria.fr/cgi-bin/mailman/listinfo/caml-list
> Archives: http://caml.inria.fr
> Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
> Bug reports: http://caml.inria.fr/bin/caml-bugs
>



end of thread, other threads:[~2009-02-23 22:59 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-02-20 15:40 speeding up matrix multiplication (newbie question) Erick Matsen
2009-02-20 15:45 ` [Caml-list] " RABIH.ELCHAAR
2009-02-20 17:46 ` Jon Harrop
2009-02-20 18:46 ` Xavier Leroy
2009-02-20 19:53   ` Erick Matsen
2009-02-20 21:21     ` Will M. Farr
2009-02-20 21:37     ` Martin Jambon
2009-02-20 22:23     ` Mike Lin
2009-02-20 22:30       ` Will M. Farr
2009-02-20 22:43         ` Markus Mottl
2009-02-23 22:59           ` Erick Matsen
2009-02-20 22:43         ` Mike Lin
