caml-list - the Caml user's mailing list
 help / color / mirror / Atom feed
* [Caml-list] Substring search on an array of strings
@ 2004-06-24  7:52 Diego Olivier Fernandez Pons
  2004-06-24 15:51 ` Shawn Wagner
  0 siblings, 1 reply; 5+ messages in thread
From: Diego Olivier Fernandez Pons @ 2004-06-24  7:52 UTC (permalink / raw)
  To: caml-list

    Bonjour,

Caml Str library does not seem to provide a function allowing to match
efficently a string in a array of strings.

I was wondering if there were already known algorithms for this before
I try the workarounds I had in mind :

i) building a 'superstring' with the array of strings (the shortest
superstring problem is NP but there are simple primal-dual
approximation algorithms) and use a classical search algorithm.

ii) try to lift the suffix Knuth-Morris-Prat automaton to work on the
union of the words instead of a single word (is this possible ?)


        Diego Olivier

-------------------
To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr
Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [Caml-list] Substring search on an array of strings
  2004-06-24  7:52 [Caml-list] Substring search on an array of strings Diego Olivier Fernandez Pons
@ 2004-06-24 15:51 ` Shawn Wagner
  2004-06-25 16:21   ` Diego Olivier Fernandez Pons
  0 siblings, 1 reply; 5+ messages in thread
From: Shawn Wagner @ 2004-06-24 15:51 UTC (permalink / raw)
  To: caml-list

On Thu, Jun 24, 2004 at 09:52:37AM +0200, Diego Olivier Fernandez Pons wrote:
>     Bonjour,
> 
> Caml Str library does not seem to provide a function allowing to match
> efficently a string in a array of strings.
> 
> I was wondering if there were already known algorithms for this before
> I try the workarounds I had in mind :


Do you want to find out if one of the strings in the array equals some other
string, or if that other string appears as a substring in one of the array's
strings? Either way, annexlib has the needed functions:
http://raevnos.pennmush.org/code/annexlib/

-- 
Shawn Wagner
shawnw@speakeasy.org

-------------------
To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr
Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [Caml-list] Substring search on an array of strings
  2004-06-24 15:51 ` Shawn Wagner
@ 2004-06-25 16:21   ` Diego Olivier Fernandez Pons
  0 siblings, 0 replies; 5+ messages in thread
From: Diego Olivier Fernandez Pons @ 2004-06-25 16:21 UTC (permalink / raw)
  To: Shawn Wagner; +Cc: caml-list

    Bonjour,

> Do you want to find out if one of the strings in the array equals some other
> string, or if that other string appears as a substring in one of the array's
> strings? Either way, annexlib has the needed functions:
> http://raevnos.pennmush.org/code/annexlib/

Find out if a string appears as a substring in an array of strings and
return the list of couples (string_index, position) of all matches.

After having examined annexlib, it seems that it doesn't have a
_specific_ function to do what I want. Of course I could always launch
Array.size searches, one on each string. But it is a performance issue
of course because the database is really big.

        Diego Olivier


-------------------
To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr
Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [Caml-list] Substring search on an array of strings
  2004-06-25 17:05 ` Diego Olivier Fernandez Pons
@ 2004-06-25 17:25   ` Shawn Wagner
  0 siblings, 0 replies; 5+ messages in thread
From: Shawn Wagner @ 2004-06-25 17:25 UTC (permalink / raw)
  To: caml-list

On Fri, Jun 25, 2004 at 07:05:13PM +0200, Diego Olivier Fernandez Pons wrote:
>     Bonjour,
> 
> > If you need to find every substring in every element of the array
> > then, yes, you'll have to look at every element of the array to see
> > if that substring appears or not in it.
> 
> No, or at least only in the worst case.
> 
> In the same way the Knuth-Morris-Pratt suffix automaton allows you not
> to search for every position in a string if it is not necessary, a
> specific procedure for searching in an array of strings could avoid
> you searching on every string.
> 
> I will try to give some simple examples to get some intuition :
> 
> - Suppose there are many strings that only contains a few different
> letters. Then you could build first a map from subsets of the alphabet
> to string indexes and only search in those strings

And, of course, to build that map you have to look at every element of the
array. Of course, you can then re-use the map, but it has to be generated
somehow.

> 
> - Suppose that the strings have a particular pattern (say
> "processortype.memory.operatingsystem"). Then you could build indexes
> on every processor type, memory or operating system, and search only
> in the corresponding strings
> 

You still have to look at the strings to find out where those corresponding
strings are. Now, with something like your second example you can sort the
array (Once again having to look at everything in the array) and use a
binary search to find the range of strings starting with, say, athlon,
without having to look at every element in that particular search.

Your original questions didn't say anything about such extra information or
patterns in the strings you're searching. If you want an answer you like,
you have to give us enough data about the problem that we have a hope of
coming up with one! Don't say: How do I find every substring in an array of
strings. Say: In an array of strings of the format
'processortype.memory.operatingsystem', how can I find all strings that have
'linux' in the operatingsystem field?


Aside: Why did you cc'ing caml-list when I only replied to you in that last
email? (This one is going to caml-list too)

-- 
Shawn Wagner
shawnw@speakeasy.org

-------------------
To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr
Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [Caml-list] Substring search on an array of strings
       [not found] <20040625165512.GG595@speakeasy.org>
@ 2004-06-25 17:05 ` Diego Olivier Fernandez Pons
  2004-06-25 17:25   ` Shawn Wagner
  0 siblings, 1 reply; 5+ messages in thread
From: Diego Olivier Fernandez Pons @ 2004-06-25 17:05 UTC (permalink / raw)
  To: Shawn Wagner; +Cc: caml-list

    Bonjour,

> If you need to find every substring in every element of the array
> then, yes, you'll have to look at every element of the array to see
> if that substring appears or not in it.

No, or at least only in the worst case.

In the same way the Knuth-Morris-Pratt suffix automaton allows you not
to search for every position in a string if it is not necessary, a
specific procedure for searching in an array of strings could avoid
you searching on every string.

I will try to give some simple examples to get some intuition :

- Suppose there are many strings that only contains a few different
letters. Then you could build first a map from subsets of the alphabet
to string indexes and only search in those strings

- Suppose that the strings have a particular pattern (say
"processortype.memory.operatingsystem"). Then you could build indexes
on every processor type, memory or operating system, and search only
in the corresponding strings


        Diego Olivier

-------------------
To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr
Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners


^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2004-06-25 17:22 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2004-06-24  7:52 [Caml-list] Substring search on an array of strings Diego Olivier Fernandez Pons
2004-06-24 15:51 ` Shawn Wagner
2004-06-25 16:21   ` Diego Olivier Fernandez Pons
     [not found] <20040625165512.GG595@speakeasy.org>
2004-06-25 17:05 ` Diego Olivier Fernandez Pons
2004-06-25 17:25   ` Shawn Wagner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).