caml-list - the Caml user's mailing list
* Re: Re: Re: [Caml-list] newbie questions
From: Dr.Dr.Ruediger M.Flaig @ 2003-04-09 14:04 UTC (permalink / raw)
  To: caml-list

> Please define "fast processing of large amounts of data". This can mean
> widely different things.

I am working with DNA, so my idea of a large amount of data is a huge annotated sequence file. The annotations are not a problem: they are small by comparison and may be kept separate. What remains is a simple string of up to several million base pairs: 2-bit elements accessed by index, strictly speaking, but usually handled as bytes.
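
To make the 2-bit representation concrete, here is a minimal OCaml sketch that packs four bases into each byte and reads a single base back by index. The names (code_of_base, pack, get) and the code assignment a=0, c=1, g=2, t=3 are assumptions chosen for illustration, not an established format.

```ocaml
(* Map a base to a 2-bit code: a=0, c=1, g=2, t=3.
   The assignment is an arbitrary choice for this sketch. *)
let code_of_base = function
  | 'a' | 'A' -> 0
  | 'c' | 'C' -> 1
  | 'g' | 'G' -> 2
  | 't' | 'T' -> 3
  | ch -> invalid_arg (Printf.sprintf "code_of_base: %C" ch)

let base_of_code n = "acgt".[n land 3]

(* Four bases per byte: base i lives in byte i/4 at bit offset 2*(i mod 4). *)
let pack (s : string) : Bytes.t =
  let n = String.length s in
  let packed = Bytes.make ((n + 3) / 4) '\000' in
  String.iteri
    (fun i ch ->
      let byte = Char.code (Bytes.get packed (i / 4)) in
      let shifted = code_of_base ch lsl (2 * (i mod 4)) in
      Bytes.set packed (i / 4) (Char.chr (byte lor shifted)))
    s;
  packed

(* Read base i back without unpacking the whole sequence. *)
let get (packed : Bytes.t) (i : int) : char =
  let byte = Char.code (Bytes.get packed (i / 4)) in
  base_of_code (byte lsr (2 * (i mod 4)))
```

Packing cuts memory use to a quarter of the byte-per-base form, at the price of a shift and a mask on every access; whether that trade pays off would have to be measured.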

Imagine I want to do the following: in order to plan an experiment, I have to find the positions on a long DNA sequence to which a shorter one may bind under certain conditions. There are approximations for that, but lab experience shows that they just don't work properly in real life. So the most reliable thing to do is: for all possible substrings (contiguous subsequences) of both sequences, calculate their affinity:

Seq 1 = ggatcggctaag -> Substrings: ggatcggctaa, ggatcggcta, ggatcggct, ggatcggc, ..., gg, gatcggctaag, gatcggctaa, gatcggcta, gatcggc, ..., ga, ..., ctaa, cta, ct, taa, ta .
Seq 2 = aacgtaa -> Substrings: aacgta, aacgt, aacg, ..., aa, acgta, acgt, acg, ac, ..., taa, ta, aa .
Match ggatcggctaa with aacgta, aacgt, aacg, ..., aa, acgta, acgt, acg, ac, ..., taa, ta, aa ; match ggatcggcta with aacgta, aacgt, aacg, ..., aa, acgta, acgt, acg, ac, ..., taa, ta, aa ; ......... ; match ta with aa.
where "match" means: calculate the maximal temperature at which these sequences may bind to each other.
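
The enumeration above can be sketched in OCaml roughly as follows. This is only an illustration under assumptions: the function names (iter_substring_pairs, best_pair) are made up here, the scoring function is left as a caller-supplied parameter because the actual temperature calculation is not specified, and the sketch also visits the two full sequences themselves.

```ocaml
(* Visit every pair of contiguous pieces (length >= 2) of [s1] and [s2].
   Pieces are handed to [f] as (start, length) index pairs, so the
   sequences themselves are never copied. *)
let iter_substring_pairs (s1 : string) (s2 : string)
    (f : int -> int -> int -> int -> unit) : unit =
  let n1 = String.length s1 and n2 = String.length s2 in
  for i1 = 0 to n1 - 2 do
    for len1 = 2 to n1 - i1 do
      for i2 = 0 to n2 - 2 do
        for len2 = 2 to n2 - i2 do
          f i1 len1 i2 len2
        done
      done
    done
  done

(* Keep the best-scoring pair under a caller-supplied [score] function
   (hypothetical: it stands in for the real temperature calculation).
   On ties, the pair found first is kept. *)
let best_pair s1 s2 (score : int -> int -> int -> int -> float) =
  let best = ref None in
  iter_substring_pairs s1 s2 (fun i1 len1 i2 len2 ->
      let t = score i1 len1 i2 len2 in
      match !best with
      | Some (t0, _) when t0 >= t -> ()
      | _ -> best := Some (t, (i1, len1, i2, len2)));
  !best
```

A sequence of length n has n(n-1)/2 pieces of length 2 or more, so the number of pairs grows with the square of each sequence length. For the two example sequences that is 66 * 21 = 1386 pairs; for a megabase sequence the brute-force loop is clearly out of reach without pruning.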

Okay, getting this done either recursively or iteratively is freshman-level programming, and you can add lots of "cutoffs" to reduce the workload, but getting it done FAST is quite a different matter when one of the sequences is in the megabyte range...

> And figure out how you can minimize the rate of new object
> creation.

No new object creation is needed at all, if all this is done by indexing... (I have followed the thread about GC efficiency)
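
As a small illustration of what "done by indexing" could look like in OCaml, the sketch below compares two regions in place, using only integer offsets instead of extracting substrings with String.sub. The function names and the scoring (counting Watson-Crick pairs with the second region read backwards) are assumptions for the example, not the actual affinity calculation.

```ocaml
(* Watson-Crick complement check on lowercase bases. *)
let complementary a b =
  match a, b with
  | 'a', 't' | 't', 'a' | 'c', 'g' | 'g', 'c' -> true
  | _ -> false

(* Count complementary positions between s1[i1 .. i1+len-1] and
   s2[i2 .. i2+len-1], with the second region read backwards (antiparallel).
   Only integer indices are used: no String.sub, no intermediate strings. *)
let paired_bases s1 i1 s2 i2 len =
  if len < 0 || i1 < 0 || i2 < 0
     || i1 + len > String.length s1 || i2 + len > String.length s2
  then invalid_arg "paired_bases";
  let count = ref 0 in
  for k = 0 to len - 1 do
    if complementary s1.[i1 + k] s2.[i2 + len - 1 - k] then incr count
  done;
  !count
```

For example, `paired_bases "ggatcggctaag" 8 "aacgtaa" 4 2` compares "ta" in the first sequence with "ta" in the second, read backwards, and finds both positions complementary.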

> If you are dealing with matrices (numerical analysis...), yes, probably
> you want Array's or Bigarray's.
>
> Otherwise, even for structures mapping an integer range to values, arrays
> may not be the best choice. I have in mind a particular example where we
> used a balanced binary map from integers to values, because this allowed
> implementing certain optimizations (see section 6.2 of
> http://www.di.ens.fr/~monniaux/biblio/Static_analyzer_LNCS2566.pdf ).

Yup, that sounds very interesting. I'll have a look. 

Yours,
   Ruediger

Dr. Dr. Ruediger Marcus Flaig
Institute for Immunology
University of Heidelberg
Im Neuenheimer Feld 305
D-69120 Heidelberg
<flaig@cirith-ungol.sanctacaris.net>
Tel. +49-172-7652946
Fax  +49-4075110-17171



