Re: Re: [Caml-list] [CAML]:: efficient data structure for storing and searching int list list

From: Gabriel Scherer <gabriel.scherer@gmail.com>
To: 沈胜宇 <syshen@nudt.edu.cn>
Cc: Jean-Francois Monin <jean-francois.monin@imag.fr>,
	caml-list <caml-list@inria.fr>
Subject: Re: Re: [Caml-list] [CAML]:: efficient data structure for storing and searching int list list
Date: Sat, 13 Apr 2013 09:56:20 +0200
Message-ID: <CAPFanBE19vET5afD_yQLEOVeUzOr5QRqTt-BZhYPkekdgH-KEg@mail.gmail.com>
In-Reply-To: <6936226f.468c.13e02309329.Coremail.syshen@nudt.edu.cn>

There is a fairly generic way to get an efficient data structure if
you don't mind huge preprocessing costs. You can see your problem as a
word recognition problem (you want to accept only words that are
sublists of one of the lists in your set), so a natural data
representation of this is a finite-state automaton.
Getting an efficient automaton out of your data set is easy (but may
be extremely costly): you only need to implement a determinization
algorithm (and if you want to avoid space explosion, maybe a
minimization algorithm as well) and those are well-known. Given an
automaton for a list LL, you can add a new list L by creating an
automaton recognizing sublists of L, making its union with your LL
automaton, and determinizing again.
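
To make the automaton idea concrete, here is a minimal OCaml sketch for the subsequence reading of the problem. It is not a full determinize-then-minimize pipeline: it builds the deterministic states on demand and memoizes the transitions, which is one lazy form of the subset construction. All names (`make`, `step`, `accepts`, the `automaton` record) are made up for illustration and assume OCaml 4.08 or later.

```ocaml
(* A DFA state is the list of (list index, position) pairs still alive:
   stored list number i has been consumed up to, but not including,
   position p. The pairs stay sorted by list index. *)
type state = (int * int) list

(* First position at or after p where x occurs in arr, if any. *)
let next_occurrence arr p x =
  let n = Array.length arr in
  let rec go i =
    if i >= n then None else if arr.(i) = x then Some i else go (i + 1)
  in
  go p

type automaton = {
  lists : int array array;                 (* the stored lists, i.e. LL *)
  cache : (state * int, state) Hashtbl.t;  (* memoized DFA transitions *)
}

let make ll =
  { lists = Array.of_list (List.map Array.of_list ll);
    cache = Hashtbl.create 64 }

(* Initially every stored list is alive at position 0. *)
let initial a : state = List.init (Array.length a.lists) (fun i -> (i, 0))

(* One determinized step: advance every live thread on symbol x and drop
   the threads whose list has no further occurrence of x. *)
let step a (s : state) x : state =
  match Hashtbl.find_opt a.cache (s, x) with
  | Some s' -> s'
  | None ->
    let s' =
      List.filter_map
        (fun (i, p) ->
           match next_occurrence a.lists.(i) p x with
           | Some j -> Some (i, j + 1)
           | None -> None)
        s
    in
    Hashtbl.add a.cache (s, x) s';
    s'

(* Every non-empty state is accepting: a prefix of a subsequence is
   itself a subsequence. *)
let accepts a l =
  let rec run s = function
    | [] -> s <> []
    | x :: rest -> (match step a s x with [] -> false | s' -> run s' rest)
  in
  run (initial a) l
```

For example, after `let a = make [[1;2;3;4;5;6]; [1;2;4;5]]`, the query `accepts a [1;3;6]` follows a single path through the memoized states. Adding a new list would invalidate the cache in this sketch, which mirrors the "determinize again" cost mentioned above.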

Of course, that is a kind of giant hammer; there are probably more
specialized approaches that may be suitable for your problem. I didn't
understand whether you're trying to solve a subsequence problem ('ac'
is a subsequence of 'abcd') or a substring problem ('ac' is not a
substring of 'abcd', while 'abc' is). For the substring problem, a common
trick is to add to your trie not only L, but also the reversed
prefixes of L: for the word 'abcd' you would store 'abcd', 'bcd|a',
'cd|ab', 'd|abc'. Checking substring inclusion is then immediate. This
multiplies the memory usage; note that DFA
minimization can be seen as an optimal, principled way to introduce
sharing in this data structure.
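
As a rough illustration of the substring trick, here is a simplified OCaml sketch. It stores only the suffixes of each list ('abcd', 'bcd', 'cd', 'd') and leaves out the '|a'-style continuations shown above, which is enough to answer contiguous-sublist queries. The names `insert_suffixes`, `mem` and `index_of_lists` are hypothetical, and the `Int` module assumes OCaml 4.08 or later.

```ocaml
module IntMap = Map.Make (Int)

(* A trie over ints: each node maps the next element to a child node. *)
type trie = Node of trie IntMap.t

let empty = Node IntMap.empty

(* Insert one list as a path starting at the root. *)
let rec insert (Node children) = function
  | [] -> Node children
  | x :: rest ->
    let child =
      match IntMap.find_opt x children with
      | Some c -> c
      | None -> empty
    in
    Node (IntMap.add x (insert child rest) children)

(* Insert every suffix of l, so that every contiguous run of l labels a
   path starting at the root. *)
let rec insert_suffixes t = function
  | [] -> t
  | _ :: rest as l -> insert_suffixes (insert t l) rest

(* l is a contiguous sublist of some stored list iff it labels a path
   from the root. *)
let rec mem (Node children) = function
  | [] -> true
  | x :: rest ->
    (match IntMap.find_opt x children with
     | Some c -> mem c rest
     | None -> false)

let index_of_lists ll = List.fold_left insert_suffixes empty ll
```

A lookup then costs one map access per element of the query, independent of how many lists are stored. The price is the memory blow-up mentioned above: a list of length n contributes up to n(n+1)/2 trie nodes before any sharing.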

On Sat, Apr 13, 2013 at 8:58 AM, 沈胜宇 <syshen@nudt.edu.cn> wrote:
> Dear Monin:
>
> thank you for your help.
>
> But I think a trie is too general, in the sense that it does not efficiently handle the case of two lists with multiple (not just one) shared sublists.
>
> For example, I first insert the list a->b->c->d->e->f into the trie, and then I insert a->b->d->e.
>
> The trie cannot store the second shared sublist d->e in the same place; it can only store them like
> a->b->c->d->e->f
>     ->d->e
>
> Do you have any more suggestions on this?
>
> Shen
>> -----Original Message-----
>> From: "Jean-Francois Monin" <jean-francois.monin@imag.fr>
>> Sent: 2013-04-12 23:48:04 (Friday)
>> To: "沈胜宇" <syshen@nudt.edu.cn>
>> Cc: caml-list <caml-list@inria.fr>
>> Subject: Re: [Caml-list] [CAML]:: efficient data structure for storing and searching int list list
>>
>> You may have some total order on the elements of your lists.
>> Then consider only sorted lists, and implement LL with tries.
>>
>> JF
>>
>> On Fri, Apr 12, 2013 at 10:36:22PM +0800, 沈胜宇 wrote:
>> >    Dear all:
>> >    I have an int list list, named LL,
>> >    and I need to frequently decide whether a particular int list, named
>> >    L, is a sublist of an element of LL.
>> >    Is there an efficient data structure to do this?
>> >    At the moment, I store LL as an (int, bool) Hashtbl.t list, that is, each
>> >    element of LL is stored as a hash table.
>> >    So searching for L in LL reduces to deciding whether there exists an element of
>> >    LL such that every element of L is found in this element.
>> >    At the moment, space is not a big problem, but the run-time
>> >    overhead is a major concern.
>> >    So is there a faster data structure?
>> >    Thank you
>> >    Shen
>>
>> --
>> Jean-Francois Monin
>> LIAMA Project FORMES, CNRS  &  Universite de Grenoble 1 &
>> Tsinghua University

Thread overview: 8+ messages
2013-04-12 14:36 沈胜宇
2013-04-12 15:01 ` simon cruanes
2013-04-12 15:48 ` Jean-Francois Monin
2013-04-13  6:58   ` 沈胜宇
2013-04-13  7:56     ` Gabriel Scherer [this message]
2013-04-12 22:15 ` Toby Kelsey
2013-04-13  6:57   ` 沈胜宇
2013-04-23  9:05     ` Goswin von Brederlow