[Caml-list] [CAML]:: efficient data structure for storing and searching int list list

caml-list - the Caml user's mailing list

* [Caml-list] [CAML]:: efficient data structure for storing and searching int list list
@ 2013-04-12 14:36 沈胜宇
  2013-04-12 15:01 ` simon cruanes
                   ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: 沈胜宇 @ 2013-04-12 14:36 UTC (permalink / raw)
  To: caml-list

Dear all:

I have an int list list named LL,

and I frequently need to decide whether a particular int list, named L, is a sublist of some element of LL.

Is there an efficient data structure for this?

At the moment, I store LL as an (int, bool) Hashtbl.t list; that is, each element of LL is stored as a hash table.

So searching for L in LL reduces to deciding whether there exists an element of LL in which every element of L is found.
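
For reference, the hash-table approach described above might look roughly like the following. This is only a sketch of the baseline; the names table_of_list, covers and search are hypothetical and do not come from the original code.

```ocaml
(* Baseline sketch: each element of LL is stored as a membership table. *)

(* Build a membership table from one int list. *)
let table_of_list (l : int list) : (int, bool) Hashtbl.t =
  let h = Hashtbl.create (List.length l) in
  List.iter (fun x -> Hashtbl.replace h x true) l;
  h

(* [covers h l] holds when every element of [l] is a key of [h]. *)
let covers (h : (int, bool) Hashtbl.t) (l : int list) : bool =
  List.for_all (fun x -> Hashtbl.mem h x) l

(* [search ll l] holds when at least one table of [ll] covers [l]. *)
let search (ll : (int, bool) Hashtbl.t list) (l : int list) : bool =
  List.exists (fun h -> covers h l) ll
```

Each query visits the tables one by one, so its cost grows with the number of elements of LL times the length of L, which is presumably the overhead in question.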

At the moment, space is not a big problem, but run-time overhead is a major concern.

So is there a faster data structure?

Thank you

Shen


* Re: [Caml-list] [CAML]:: efficient data structure for storing and searching int list list
  2013-04-12 14:36 [Caml-list] [CAML]:: efficient data structure for storing and searching int list list 沈胜宇
@ 2013-04-12 15:01 ` simon cruanes
  2013-04-12 15:48 ` Jean-Francois Monin
  2013-04-12 22:15 ` Toby Kelsey
  2 siblings, 0 replies; 8+ messages in thread
From: simon cruanes @ 2013-04-12 15:01 UTC (permalink / raw)
  To: caml-list

If the order in the lists does not matter, I would suggest some kind of
Trie (http://en.wikipedia.org/wiki/Trie) to store the *sorted* int
lists; the algorithm for search would recursively explore all the
branches of the trie that can be a superlist of the input list.

Here is a code snippet (not thoroughly tested):

(* ------------------- %< ------ >% ----------------- *)

type trie =
  | Node of bool *  (* end of a list? *)
            (int * trie) list  (* subtries, indexed by their first element *)

let empty = Node (false, [])

(* add [l] to [trie], assuming [l] is sorted *)
let rec add trie l =
  match trie, l with
  | Node (_, subtries), [] -> Node (true, subtries)
  | Node (b, subtries), x::l' ->
    let subtrie =
      try List.assoc x subtries
      with Not_found -> Node (false, []) in
    (* recursive add *)
    let subtrie = add subtrie l' in
    let subtries = List.remove_assoc x subtries in
    Node (b, (x,subtrie) :: subtries)

(* find whether [l] is a sublist of some list of [trie] *)
let rec find trie l =
  match trie, l with
  | _, [] -> true
  | Node (_, subtries), (x::l') ->
    find_list x subtries l'
and find_list x subtries l' = match subtries with
  | [] -> false
  | (y,subtrie)::subtries' ->
    (if y < x then find subtrie (x::l')
    else if y = x then find subtrie l'
    else false) || find_list x subtries' l'

(* ------------------- %< ------ >% ----------------- *)
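
A possible variant of the same idea, given only as a sketch: the association list of subtries is swapped for a Map, and every list is sorted and deduplicated on the way in, which assumes that order and duplicates do not matter. The names IntMap, add_sorted and find_sorted are introduced here for illustration.

```ocaml
module IntMap = Map.Make (struct
  type t = int
  let compare (a : int) (b : int) = compare a b
end)

(* A node records whether a stored list ends here, plus its subtries. *)
type trie = Node of bool * trie IntMap.t

let empty = Node (false, IntMap.empty)

(* Insert a list that is already sorted. *)
let rec add_sorted (Node (b, m)) l =
  match l with
  | [] -> Node (true, m)
  | x :: l' ->
    let sub = try IntMap.find x m with Not_found -> empty in
    Node (b, IntMap.add x (add_sorted sub l') m)

(* Assumption: lists are treated as sets, so sort and deduplicate first. *)
let add trie l = add_sorted trie (List.sort_uniq compare l)

(* Is the sorted list [l] contained in some stored list? *)
let rec find_sorted (Node (_, m)) l =
  match l with
  | [] -> true
  | x :: l' ->
    IntMap.exists
      (fun y sub ->
         if y < x then find_sorted sub l        (* skip y, keep looking for x *)
         else if y = x then find_sorted sub l'  (* x found, move on *)
         else false)                            (* y > x: x cannot appear below *)
      m

let find trie l = find_sorted trie (List.sort_uniq compare l)
```

As in the snippet above, a search may still explore several branches, so the worst case is not linear in the length of the query.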

Cheers!


Simon


* Re: [Caml-list] [CAML]:: efficient data structure for storing and searching int list list
  2013-04-12 14:36 [Caml-list] [CAML]:: efficient data structure for storing and searching int list list 沈胜宇
  2013-04-12 15:01 ` simon cruanes
@ 2013-04-12 15:48 ` Jean-Francois Monin
  2013-04-13  6:58   ` 沈胜宇
  2013-04-12 22:15 ` Toby Kelsey
  2 siblings, 1 reply; 8+ messages in thread
From: Jean-Francois Monin @ 2013-04-12 15:48 UTC (permalink / raw)
  To: 沈胜宇; +Cc: caml-list

You may have some total order on the elements of your lists.
Then consider only sorted lists, and implement LL with tries.

JF

-- 
Jean-Francois Monin
LIAMA Project FORMES, CNRS  &  Universite de Grenoble 1 &
Tsinghua University

* Re: [Caml-list] [CAML]:: efficient data structure for storing and searching int list list
  2013-04-12 14:36 [Caml-list] [CAML]:: efficient data structure for storing and searching int list list 沈胜宇
  2013-04-12 15:01 ` simon cruanes
  2013-04-12 15:48 ` Jean-Francois Monin
@ 2013-04-12 22:15 ` Toby Kelsey
  2013-04-13  6:57   ` 沈胜宇
  2 siblings, 1 reply; 8+ messages in thread
From: Toby Kelsey @ 2013-04-12 22:15 UTC (permalink / raw)
  To: caml-list; +Cc: syshen

On 12/04/13 15:36, 沈胜宇 wrote:
> Dear all:
> I have an int list list, whose name is LL
> and I need to frequently decide whether a particular int list, whose name is L, is a sublist of an element of LL.
> 
> Is there any efficient data structure to do this?

A data structure useful for finding substrings quickly is the "suffix tree".
It can be built in O(n) time for small alphabets, or O(n log n) time
otherwise, and a substring search takes O(length of the substring) time. The
suffix tree takes more space than the original string, though. An int list can
take the role of the string here.

Toby

* Re: Re: [Caml-list] [CAML]:: efficient data structure for storing and searching int list list
  2013-04-12 22:15 ` Toby Kelsey
@ 2013-04-13  6:57   ` 沈胜宇
  2013-04-23  9:05     ` Goswin von Brederlow
  0 siblings, 1 reply; 8+ messages in thread
From: 沈胜宇 @ 2013-04-13  6:57 UTC (permalink / raw)
  To: Toby Kelsey; +Cc: caml-list

Dear Toby:

Thank you for your help.

But my problem is a little different from the substring search problem that suffix trees solve.

In my problem, the notion of a list L1 being a sublist of another list L2 is much more general than the substring relation.

For example, bcd is a substring of abcde, because bcd occurs contiguously in abcde.

At the same time, bd is not a substring of abcde, because it does not occur contiguously in abcde.

But in my problem, the list b->d is a sublist of a->b->c->d->e.
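
The two relations can be spelled out as follows. This is a minimal sketch; the function names is_subsequence, is_prefix and is_substring are chosen here for illustration.

```ocaml
(* [is_subsequence l1 l2]: [l1] can be obtained from [l2] by deleting
   elements, keeping the remaining ones in order (the "sublist" relation). *)
let rec is_subsequence l1 l2 =
  match l1, l2 with
  | [], _ -> true
  | _, [] -> false
  | x :: l1', y :: l2' ->
    if x = y then is_subsequence l1' l2' else is_subsequence l1 l2'

(* [is_prefix l1 l2]: [l2] starts with [l1]. *)
let rec is_prefix l1 l2 =
  match l1, l2 with
  | [], _ -> true
  | _, [] -> false
  | x :: l1', y :: l2' -> x = y && is_prefix l1' l2'

(* [is_substring l1 l2]: [l1] occurs contiguously somewhere in [l2]. *)
let rec is_substring l1 l2 =
  is_prefix l1 l2
  || (match l2 with
      | [] -> false
      | _ :: l2' -> is_substring l1 l2')
```

With abcde as the second argument, bcd satisfies both relations, while bd satisfies only is_subsequence.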

So after reading the introduction to suffix trees on Wikipedia, I think they may not fit my problem.

I also found that a trie is more general than a suffix tree and can be used to handle my problem. But it is too general, in the sense that it does not efficiently handle the case where two lists share multiple (not just one) sublists.

For example, I first insert the list a->b->c->d->e->f into the trie, and then I insert a->b->d->e.

The trie cannot store the second shared sublist d->e in the same place; it can only store them like this:
a->b->c->d->e->f
    ->d->e

So do you have any more suggestions on this?

Shen

* Re: Re: [Caml-list] [CAML]:: efficient data structure for storing and searching int list list
  2013-04-12 15:48 ` Jean-Francois Monin
@ 2013-04-13  6:58   ` 沈胜宇
  2013-04-13  7:56     ` Gabriel Scherer
  0 siblings, 1 reply; 8+ messages in thread
From: 沈胜宇 @ 2013-04-13  6:58 UTC (permalink / raw)
  To: Jean-Francois Monin; +Cc: caml-list

Dear Monin:

Thank you for your help.

But I think a trie is too general, in the sense that it does not efficiently handle the case where two lists share multiple (not just one) sublists.

For example, I first insert the list a->b->c->d->e->f into the trie, and then I insert a->b->d->e.

The trie cannot store the second shared sublist d->e in the same place; it can only store them like this:
a->b->c->d->e->f
    ->d->e

Do you have any more suggestions on this?

Shen

* Re: Re: [Caml-list] [CAML]:: efficient data structure for storing and searching int list list
  2013-04-13  6:58   ` 沈胜宇
@ 2013-04-13  7:56     ` Gabriel Scherer
  0 siblings, 0 replies; 8+ messages in thread
From: Gabriel Scherer @ 2013-04-13  7:56 UTC (permalink / raw)
  To: 沈胜宇; +Cc: Jean-Francois Monin, caml-list

There is a fairly generic way to get an efficient data structure if
you don't mind huge preprocessing costs. You can see your problem as a
word recognition problem (you want to accept only words that are
sublists of one of the lists in your set), so a natural data
representation of this is a finite-state automaton.
Getting an efficient automaton out of your data set is easy (but may
be extremely costly): you only need to implement a determinization
algorithm (and if you want to avoid space explosion, maybe a
minimization algorithm as well) and those are well-known. Given an
automaton for a list LL, you can add a new list L by creating an
automaton recognizing sublists of L, making its union with your LL
automaton, and determinizing again.

Of course, that is a kind of giant hammer; there are probably more
specialized approaches that may be suitable for your problem. I didn't
understand whether you're trying to check a subsequence problem ('ac'
is a subsequence of 'abcd') or a substring problem ('ac' is not a
substring, while 'abc' would be). For the substring problem, a common
trick is to add to your trie not only L, but also the suffixes of L,
each followed by the prefix it leaves out: for the word 'abcd' you
would store 'abcd', 'bcd|a', 'cd|ab', 'd|abc'. Checking substring
inclusion is then immediate. This results in a multiplication of the
memory usage; note that DFA minimization can be seen as an optimal,
principled way to introduce sharing in this data structure.
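
A simplified sketch of the substring trick: it stores only the suffixes of each list, without the trailing prefix part shown in the 'bcd|a' notation, and it makes no attempt at sharing or minimization. The names insert, insert_suffixes and mem_substring are hypothetical.

```ocaml
(* A plain trie over ints; no end-of-list marker is needed here,
   because any path starting at the root is a valid substring. *)
type trie = Node of (int * trie) list

let empty = Node []

(* Insert one list as a path starting at the root. *)
let rec insert (Node children) l =
  match l with
  | [] -> Node children
  | x :: l' ->
    let sub = try List.assoc x children with Not_found -> empty in
    Node ((x, insert sub l') :: List.remove_assoc x children)

(* Insert [l] together with all of its suffixes. *)
let rec insert_suffixes trie l =
  let trie = insert trie l in
  match l with
  | [] -> trie
  | _ :: l' -> insert_suffixes trie l'

(* A substring is a prefix of some suffix, so walking down from the
   root answers the query without any backtracking. *)
let rec mem_substring (Node children) l =
  match l with
  | [] -> true
  | x :: l' ->
    (match List.assoc_opt x children with
     | Some sub -> mem_substring sub l'
     | None -> false)
```

A list of length n contributes up to n(n+1)/2 nodes in this naive form, which is the memory multiplication mentioned above.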

* Re: Re: [Caml-list] [CAML]:: efficient data structure for storing and searching int list list
  2013-04-13  6:57   ` 沈胜宇
@ 2013-04-23  9:05     ` Goswin von Brederlow
  0 siblings, 0 replies; 8+ messages in thread
From: Goswin von Brederlow @ 2013-04-23  9:05 UTC (permalink / raw)
  To: caml-list

Note: A suffix tree can be built in O(n) and takes O(n) space. It takes
something like 48-64 times the space of the string in OCaml.


Seems like you aren't looking for sublists (in which the order would
matter) but subsets (order doesn't matter and elements are unique).

You can build a lookup tree containing all subsets of each set like this:

Tree with {a,b,c,d,e} inserted:

+a+b+c+d-e
| | | \e-d
| | +d+c-e
| | | \e-c
| | \e+c-d
| |   \d-c
| +c+b+d-e
| | | \e-d
| | +d+b-e
| | | \e-b
| | \e+b-d
| |   \d-b
| +d+b+c-e
| | | \e-c
| | +c+b-e
| | | \e-b
| | \e+b-c
| |   \c-b
| ...

That gets rather large. If you need to know not only whether L is a
subset of one of the sets in LL but also which sets contain it, then
each node also needs to store a list of the sets containing the subset
expressed so far.

If you can get L sorted that reduces the tree quite a bit:

+a+b+c+d-e
| | | \e
| | +d-e
| | \e
| +c+d-e
| | \e
| +d-e
| \e
+b+c+d-e
| | \e
| +d-e
| \e
+c+d-e
| \e
+d-e
\e

Since L is sorted you only need the paths that are sorted. That gives
you a tree of size O(2^n) where n is the number of unique ints in all
sets. Still huge but your n might be small enough. This will give you
O(|L|) lookup.
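
A rough sketch of the sorted variant of this tree. It assumes the inputs really are sets, so each one is sorted and deduplicated first, and it uses a Map for the children, which makes each step O(log k) in the number of children rather than O(1). The names add_subsets, mem_sorted and size are introduced here for illustration.

```ocaml
module IntMap = Map.Make (struct
  type t = int
  let compare (a : int) (b : int) = compare a b
end)

(* Every path from the root spells a sorted subset of some inserted set. *)
type tree = Node of tree IntMap.t

let empty = Node IntMap.empty

(* Add every subset of the sorted list [l] below the given node. *)
let rec add_subsets (Node m) l =
  match l with
  | [] -> Node m
  | x :: l' ->
    (* Subsets that contain x continue below the child labelled x. *)
    let child = try IntMap.find x m with Not_found -> empty in
    let m = IntMap.add x (add_subsets child l') m in
    (* Subsets that skip x stay at the current node. *)
    add_subsets (Node m) l'

let add tree l = add_subsets tree (List.sort_uniq compare l)

(* Follow the sorted query from the root, one element per level. *)
let rec mem_sorted (Node m) l =
  match l with
  | [] -> true
  | x :: l' ->
    (match IntMap.find_opt x m with
     | Some child -> mem_sorted child l'
     | None -> false)

let mem tree l = mem_sorted tree (List.sort_uniq compare l)

(* Number of nodes, root included. *)
let rec size (Node m) = IntMap.fold (fun _ child acc -> acc + size child) m 1
```

For a single five-element set, size reports 32 nodes (the root plus the 31 non-empty subsets), which illustrates the 2^n growth.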

Instead of sorting L, you could still use the above tree. Start
at the root and check the first child: a. Is a in L? If so, go down
that branch; otherwise check the next child. With L as a list, each
lookup would be O(n). As a Set it would be O(log n), and as a
Hashtbl.t it would be O(1).

MfG
	Goswin
