caml-list - the Caml user's mailing list
 help / color / mirror / Atom feed
* Ropes: progressively crazy ideas
@ 2007-08-13 21:23 Jon Harrop
  0 siblings, 0 replies; only message in thread
From: Jon Harrop @ 2007-08-13 21:23 UTC (permalink / raw)
  To: caml-list


I decided to have another go at curing cancer over the weekend. Downloaded the 
human genome sequence from UCSC and started reading papers on suffix trees.

The data is primarily a sequence of A, C, G and T, of course. However, there 
are also sparse entries of other characters, primarily N. You really want to 
store the bulk of the data as a dense array of bit pairs for the four most 
common characters and then store the exceptional cases in a sparse data 
structure.

Perhaps it would be useful if your Vect implementation could be abstracted 
over different concrete leaf implementations, such as bit vectors? Presumably 
leaf size should be measured in bytes rather than elements?

Also, I've noticed a correlation between the purely functional implementations 
of the linear-time suffix tree generation algorithms, Huet's zippers and my 
work that I mentioned before on lazy metadata in nodes.

I'm toying with the idea of a sequence type based on Vect that lazily computes 
its own suffix tree. You would need some way to compose suffix trees when 
sequences are combined (e.g. concatenation of sequences) but the result could 
be very useful.

For example, Mathematica's pattern matcher allows you to search for 
subsequences:

  f[{front___, x, y, z, back___}] := {front, back}

This is currently implemented completely naively by recursive linear searches 
of an array. If the sequence type could generate and reuse suffix trees then 
subsequence matches should be vastly more efficient. The ability to search 
for subsequences is the main reason why Mathematica's pattern matches are not 
optimized as ML's are.

-- 
Dr Jon D Harrop, Flying Frog Consultancy Ltd.
OCaml for Scientists
http://www.ffconsultancy.com/products/ocaml_for_scientists/?e


^ permalink raw reply	[flat|nested] only message in thread

only message in thread, other threads:[~2007-08-13 21:32 UTC | newest]

Thread overview: (only message) (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-08-13 21:23 Ropes: progressively crazy ideas Jon Harrop

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).