caml-list - the Caml user's mailing list
 help / color / mirror / Atom feed
* Re: [Caml-list] Package with multidimensional AND sparse matrices?
  2006-07-03 22:55 Package with multidimensional AND sparse matrices? Andreas Biegert
@ 2000-01-02  4:56 ` Jon Harrop
       [not found]   ` <fd0847170607040038u117d2594r38adcbdf1f00ec04@mail.gmail.com>
  2006-07-03 23:59 ` Markus Mottl
  1 sibling, 1 reply; 7+ messages in thread
From: Jon Harrop @ 2000-01-02  4:56 UTC (permalink / raw)
  To: caml-list

On Monday 03 July 2006 23:55, Andreas Biegert wrote:
> The ocamlfloat package provides a sparse matrix implmentation for
> 2-dimensional matrices, but I would need at least 4 dimensions.

You probably mean "tensor" rather than "matrix".

> In case there is no such package, would it be very hard to write one by
> myself?

Provided performance and memory usage are not too important, it is quite 
simple to write your own implementation. First, you must decide what 
requirements you have, i.e. what do you want to do with these tensors? Then, 
you must choose a data structure to represent your tensors, e.g. a hash table 
that maps int * int * int * int to float. Finally, you must implement the 
functions that you need.

You could start with something like this:

module SparseArray : sig
  type t

  val make : int list -> t
  val get : t -> int list -> float
  val set : t -> int list -> float -> unit
end = struct
  type t = int list * (int list, float) Hashtbl.t

  let make n = n, Hashtbl.create 0

  let check s n i =
    List.iter2 (fun n i -> if i<0 || i>=n then invalid_arg s) n i

  let get (n, m) i =
    check "SparseArray.get" n i;
    try Hashtbl.find m i with Not_found -> 0.

  let set (n, m) i x =
    check "SparseArray.set" n i;
    if x<>0. then Hashtbl.replace m i x else
    try Hashtbl.remove m i with Not_found -> ()
end;;

That should be fine for getting and setting but you probably want to loop over 
dimensions and perform arithmetic operations, in which case you need a way to 
jump to the non-zero elements.

How many non-zero elements do you have and how long do you expect your 
calculations to take?

-- 
Dr Jon D Harrop, Flying Frog Consultancy Ltd.
Objective CAML for Scientists
http://www.ffconsultancy.com/products/ocaml_for_scientists


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Package with multidimensional AND sparse matrices?
@ 2006-07-03 22:55 Andreas Biegert
  2000-01-02  4:56 ` [Caml-list] " Jon Harrop
  2006-07-03 23:59 ` Markus Mottl
  0 siblings, 2 replies; 7+ messages in thread
From: Andreas Biegert @ 2006-07-03 22:55 UTC (permalink / raw)
  To: caml-list

Hi,

I am a bioinformatics student from Germany and OCaml is my language of
choice for my diploma thesis program. So far, everything worked out
great, thanks to OCaml allowing for short development cycles. However,
no I am at a point that involves computations of very sparse but
relatively big multidimensional matrices. The worst case scenario is a
4-dim 2000*2000*2000*2000 float matrix. Here is what I found so far:

1. Bigarray module:
The Bigarray module fits perfectly, BUT it does not implement sparse
matrix storage. Therefore the worst case as stated above is not
feasable memorywise.

2. ocamlfloat package:
The ocamlfloat package provides a sparse matrix implmentation for
2-dimensional matrices, but I would need at least 4 dimensions.

Does anybody know a package that implements multidimensional AND
sparse matrices? In case there is no such package, would it be very
hard to write one by myself?

Thanks in advance for helping,

Andreas


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [Caml-list] Package with multidimensional AND sparse matrices?
  2006-07-03 22:55 Package with multidimensional AND sparse matrices? Andreas Biegert
  2000-01-02  4:56 ` [Caml-list] " Jon Harrop
@ 2006-07-03 23:59 ` Markus Mottl
  1 sibling, 0 replies; 7+ messages in thread
From: Markus Mottl @ 2006-07-03 23:59 UTC (permalink / raw)
  To: Andreas Biegert; +Cc: caml-list

On 7/3/06, Andreas Biegert <andreas.biegert@googlemail.com> wrote:
> Does anybody know a package that implements multidimensional AND
> sparse matrices? In case there is no such package, would it be very
> hard to write one by myself?

I don't know of any OCaml-libraries implementing datastructures and
algorithms for sparse tensors.  You haven't mentioned anything about
the kind of algorithms you want to implement.  Maybe the problem can
be represented differently so that you can use matrices without too
many conceptual hassles.  Otherwise I'd first look around for some
Fortran libraries that could be interfaced, because implementing
stable and efficient numerical algorithms, especially given this kind
of data, is pretty hard.

Regards,
Markus

-- 
Markus Mottl        http://www.ocaml.info        markus.mottl@gmail.com


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [Caml-list] Package with multidimensional AND sparse matrices?
       [not found]   ` <fd0847170607040038u117d2594r38adcbdf1f00ec04@mail.gmail.com>
@ 2006-07-04  7:46     ` Andreas Biegert
  2006-07-04  9:06       ` Martin Jambon
  2006-07-04 10:03       ` Jon Harrop
  0 siblings, 2 replies; 7+ messages in thread
From: Andreas Biegert @ 2006-07-04  7:46 UTC (permalink / raw)
  To: caml-list

Well, in short: the algorithm computes the optimal multiple alignment
of up to n protein sequences with maximum length L. The
straightforward and theoretically optimal approach to solve this
problem is by dynamic programming, therefore the time and space
complexity is O(L^n). I managed to come up with a pretty good
heuristic that lets me boil down the number of cells that need to be
computed giving me a complexitiy of O(L^2) instead of O(L^n), in fact
I can even manually adjust the number of nonzero entries in the
matrix. In the 2000*2000*2000*2000 matrix only at most 2000 cells will
be non-zero. Furthermore for all non-zero cells M(i,j,k,l) holds
i<j<k<l. The bad news is that performance is important since the
program has to run under 5min. An estimation of the time requirements
given the number of sequences n and masimum lengths L is:

24 * L^2 * T_x * 2^(n+1) - 1
with T_x being about 100ns

For n=4 and L=2000 I get just about 5min. Would it be a bad idea to
use Hashtables all the way instead of matrices? Accessing elements in
a hashtable has O(1), right?

Andreas

2006/7/4, Andreas Biegert <andreas.biegert@googlemail.com>:
> Well, in short: the algorithm computes the optimal multiple alignment
> of up to n protein sequences with maximum length L. The
> straightforward and theoretically optimal approach to solve this
> problem is by dynamic programming, therefore the time and space
> complexity is O(L^n). I managed to come up with a pretty good
> heuristic that lets me boil down the number of cells that need to be
> computed giving me a complexitiy of O(L^2) instead of O(L^n), in fact
> I can even manually adjust the number of nonzero entries in the
> matrix. In the 2000*2000*2000*2000 matrix only at most 2000 cells will
> be non-zero. Furthermore for all non-zero cells M(i,j,k,l) holds
> i<j<k<l. The bad news is that performance is important since the
> program has to run under 5min. An estimation of the time requirements
> given the number of sequences n and masimum lengths L is:
>
> 24 * L^2 * T_x * 2^(n+1) - 1
> with T_x being about 100ns
>
> For n=4 and L=2000 I get just about 5min. Would it be a bad idea to
> use Hashtables all the way instead of matrices? Accessing elements in
> a hashtable has O(1), right?
>
> Andreas
>
>
>
> 2000/1/2, Jon Harrop <jon@ffconsultancy.com>:
> > On Monday 03 July 2006 23:55, Andreas Biegert wrote:
> > > The ocamlfloat package provides a sparse matrix implmentation for
> > > 2-dimensional matrices, but I would need at least 4 dimensions.
> >
> > You probably mean "tensor" rather than "matrix".
> >
> > > In case there is no such package, would it be very hard to write one by
> > > myself?
> >
> > Provided performance and memory usage are not too important, it is quite
> > simple to write your own implementation. First, you must decide what
> > requirements you have, i.e. what do you want to do with these tensors? Then,
> > you must choose a data structure to represent your tensors, e.g. a hash table
> > that maps int * int * int * int to float. Finally, you must implement the
> > functions that you need.
> >
> > You could start with something like this:
> >
> > module SparseArray : sig
> >   type t
> >
> >   val make : int list -> t
> >   val get : t -> int list -> float
> >   val set : t -> int list -> float -> unit
> > end = struct
> >   type t = int list * (int list, float) Hashtbl.t
> >
> >   let make n = n, Hashtbl.create 0
> >
> >   let check s n i =
> >     List.iter2 (fun n i -> if i<0 || i>=n then invalid_arg s) n i
> >
> >   let get (n, m) i =
> >     check "SparseArray.get" n i;
> >     try Hashtbl.find m i with Not_found -> 0.
> >
> >   let set (n, m) i x =
> >     check "SparseArray.set" n i;
> >     if x<>0. then Hashtbl.replace m i x else
> >     try Hashtbl.remove m i with Not_found -> ()
> > end;;
> >
> > That should be fine for getting and setting but you probably want to loop over
> > dimensions and perform arithmetic operations, in which case you need a way to
> > jump to the non-zero elements.
> >
> > How many non-zero elements do you have and how long do you expect your
> > calculations to take?
> >
> > --
> > Dr Jon D Harrop, Flying Frog Consultancy Ltd.
> > Objective CAML for Scientists
> > http://www.ffconsultancy.com/products/ocaml_for_scientists
> >
> > _______________________________________________
> > Caml-list mailing list. Subscription management:
> > http://yquem.inria.fr/cgi-bin/mailman/listinfo/caml-list
> > Archives: http://caml.inria.fr
> > Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
> > Bug reports: http://caml.inria.fr/bin/caml-bugs
> >
>


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [Caml-list] Package with multidimensional AND sparse matrices?
  2006-07-04  7:46     ` Andreas Biegert
@ 2006-07-04  9:06       ` Martin Jambon
  2006-07-04 10:03       ` Jon Harrop
  1 sibling, 0 replies; 7+ messages in thread
From: Martin Jambon @ 2006-07-04  9:06 UTC (permalink / raw)
  To: Andreas Biegert; +Cc: caml-list

On Tue, 4 Jul 2006, Andreas Biegert wrote:

> Well, in short: the algorithm computes the optimal multiple alignment
> of up to n protein sequences with maximum length L. The
> straightforward and theoretically optimal approach to solve this
> problem is by dynamic programming, therefore the time and space
> complexity is O(L^n). I managed to come up with a pretty good
> heuristic that lets me boil down the number of cells that need to be
> computed giving me a complexitiy of O(L^2) instead of O(L^n), in fact
> I can even manually adjust the number of nonzero entries in the
> matrix. In the 2000*2000*2000*2000 matrix only at most 2000 cells will
> be non-zero. Furthermore for all non-zero cells M(i,j,k,l) holds
> i<j<k<l. The bad news is that performance is important since the
> program has to run under 5min. An estimation of the time requirements
> given the number of sequences n and masimum lengths L is:
>
> 24 * L^2 * T_x * 2^(n+1) - 1
> with T_x being about 100ns
>
> For n=4 and L=2000 I get just about 5min. Would it be a bad idea to
> use Hashtables all the way instead of matrices? Accessing elements in
> a hashtable has O(1), right?

Yes, if you create it big enough so that it doesn't get resized.
The best is that you look at hashtbl.ml in the standard library to see 
how it is implemented.

But I wouldn't worry too much for now: try Hashtbl and you will quickly 
see if it's fast enough or not.


Martin

--
Martin Jambon, PhD - http://martin.jambon.free.fr
Visit http://wikiomics.org, the Bioinformatics Howto Wiki


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [Caml-list] Package with multidimensional AND sparse matrices?
  2006-07-04  7:46     ` Andreas Biegert
  2006-07-04  9:06       ` Martin Jambon
@ 2006-07-04 10:03       ` Jon Harrop
  2006-07-04 16:55         ` Andreas Biegert
  1 sibling, 1 reply; 7+ messages in thread
From: Jon Harrop @ 2006-07-04 10:03 UTC (permalink / raw)
  To: caml-list

On Tuesday 04 July 2006 08:46, Andreas Biegert wrote:
> Well, in short: the algorithm computes the optimal multiple alignment
> of up to n protein sequences with maximum length L. The
> straightforward and theoretically optimal approach to solve this
> problem is by dynamic programming, therefore the time and space
> complexity is O(L^n). I managed to come up with a pretty good
> heuristic that lets me boil down the number of cells that need to be
> computed giving me a complexitiy of O(L^2) instead of O(L^n), in fact
> I can even manually adjust the number of nonzero entries in the
> matrix. In the 2000*2000*2000*2000 matrix only at most 2000 cells will
> be non-zero. Furthermore for all non-zero cells M(i,j,k,l) holds
> i<j<k<l. The bad news is that performance is important since the
> program has to run under 5min. An estimation of the time requirements
> given the number of sequences n and masimum lengths L is:
>
> 24 * L^2 * T_x * 2^(n+1) - 1
> with T_x being about 100ns

Ok. What patterns of access will there be?

> For n=4 and L=2000 I get just about 5min. Would it be a bad idea to
> use Hashtables all the way instead of matrices?

No, I don't think so. If you need more performance then you can always change 
things later. I'd concentrate on getting a working algorithm first.

> Accessing elements in 
> a hashtable has O(1), right?

Roughly, yes. :-)

-- 
Dr Jon D Harrop, Flying Frog Consultancy Ltd.
Objective CAML for Scientists
http://www.ffconsultancy.com/products/ocaml_for_scientists


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [Caml-list] Package with multidimensional AND sparse matrices?
  2006-07-04 10:03       ` Jon Harrop
@ 2006-07-04 16:55         ` Andreas Biegert
  0 siblings, 0 replies; 7+ messages in thread
From: Andreas Biegert @ 2006-07-04 16:55 UTC (permalink / raw)
  To: caml-list

Since the algorithm computes alignemnts in most casses the pattern of
access will be adjacent tuples, for example (i,j,k,l) (i,j,k-1,l-1)
etc. But it does not have to be that way. The only contstraint that
holds for sure is i<j<k<l.

2006/7/4, Jon Harrop <jon@ffconsultancy.com>:
> On Tuesday 04 July 2006 08:46, Andreas Biegert wrote:
> > Well, in short: the algorithm computes the optimal multiple alignment
> > of up to n protein sequences with maximum length L. The
> > straightforward and theoretically optimal approach to solve this
> > problem is by dynamic programming, therefore the time and space
> > complexity is O(L^n). I managed to come up with a pretty good
> > heuristic that lets me boil down the number of cells that need to be
> > computed giving me a complexitiy of O(L^2) instead of O(L^n), in fact
> > I can even manually adjust the number of nonzero entries in the
> > matrix. In the 2000*2000*2000*2000 matrix only at most 2000 cells will
> > be non-zero. Furthermore for all non-zero cells M(i,j,k,l) holds
> > i<j<k<l. The bad news is that performance is important since the
> > program has to run under 5min. An estimation of the time requirements
> > given the number of sequences n and masimum lengths L is:
> >
> > 24 * L^2 * T_x * 2^(n+1) - 1
> > with T_x being about 100ns
>
> Ok. What patterns of access will there be?
>
> > For n=4 and L=2000 I get just about 5min. Would it be a bad idea to
> > use Hashtables all the way instead of matrices?
>
> No, I don't think so. If you need more performance then you can always change
> things later. I'd concentrate on getting a working algorithm first.
>
> > Accessing elements in
> > a hashtable has O(1), right?
>
> Roughly, yes. :-)
>
> --
> Dr Jon D Harrop, Flying Frog Consultancy Ltd.
> Objective CAML for Scientists
> http://www.ffconsultancy.com/products/ocaml_for_scientists
>
> _______________________________________________
> Caml-list mailing list. Subscription management:
> http://yquem.inria.fr/cgi-bin/mailman/listinfo/caml-list
> Archives: http://caml.inria.fr
> Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
> Bug reports: http://caml.inria.fr/bin/caml-bugs
>


^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2006-07-04 17:58 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2006-07-03 22:55 Package with multidimensional AND sparse matrices? Andreas Biegert
2000-01-02  4:56 ` [Caml-list] " Jon Harrop
     [not found]   ` <fd0847170607040038u117d2594r38adcbdf1f00ec04@mail.gmail.com>
2006-07-04  7:46     ` Andreas Biegert
2006-07-04  9:06       ` Martin Jambon
2006-07-04 10:03       ` Jon Harrop
2006-07-04 16:55         ` Andreas Biegert
2006-07-03 23:59 ` Markus Mottl

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).