caml-list - the Caml user's mailing list
 help / color / mirror / Atom feed
From: Brian Hurt <bhurt@janestcapital.com>
To: sasha.mal@excite.com
Cc: caml-list@yquem.inria.fr
Subject: Re: [Caml-list] More efficient implementation of intersection of sets?
Date: Fri, 04 Apr 2008 13:27:53 -0400	[thread overview]
Message-ID: <47F66519.1020400@janestcapital.com> (raw)
In-Reply-To: <20080402140127.055C08B314@xprdmxin.myway.com>

[-- Attachment #1: Type: text/plain, Size: 2065 bytes --]

sasha mal wrote:

>>I'm thinking of an alternative approach which inserts keys one at a time.
>>    
>>
>
>  
>
>>Basically, we go through both trees simultaneously, starting with smallest
>>    
>>
>
>  
>
>>nodes of both trees. If two nodes with the same key are seen, the are inserted
>>    
>>
>
>  
>
>>into a new tree that will contain the intersection; then the pair of nodes with
>>    
>>
>
>  
>
>>next larger keys is taken. Otherwise, if, say, key1<key2, then we search for
>>    
>>
>
>  
>
>>the smallest key k>=key2 inside the first tree. If k=key2, we add k into the
>>    
>>
>
>  
>
>>new tree and proceed to the next pair. If k>key2, we proceed with the pair 
>>    
>>
>
>  
>
>>(k, key2) instead of (key1, key2). For key1>key2, proceed analogously, i.e.
>>    
>>
>
>  
>
>>search for the smallest key k>=key1 in the second tree. If at some point of
>>    
>>
>
>  
>
>>time no "larger" key is found in one of the two original trees, the new tree
>>    
>>
>
>  
>
>>where we insert elements to contains exactly the intersection.
>>    
>>
>
>
>
>[Small corrections.]
>
>
>
>By the way, the current implemented procedure always splits the second tree, regardless of the sizes of the trees. Imagine that we have two trees of very different sizes (e.g. with 5 and 10^5 elements). What tree should be split then?
>
>  
>
Sorry for the late response:

The length-5 tree should be the one which is split, for two reasons.  
Reason number one, it makes it the most likely that one half or the 
other will be empty.  This means that it's much more likely that I'll be 
able to skip large subtrees of the million-element set without 
inspection.  The second reason is that split (the function) is O((log 
N)^2) in cost, this is obviously cheaper if N=5 than if N=1 million.

Looking at this algorithm over the last couple of days, I've come to 
conclusion that it's O(N) normal-case, but O(N (log N)^2) worst case, 
and O(log N) best-cast, where N = the size of the larger set.  But I'm 
not sure of it.  It's certainly an interesting approach.

Brian


[-- Attachment #2: Type: text/html, Size: 3366 bytes --]

  reply	other threads:[~2008-04-04 17:27 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2008-04-02 14:01 sasha mal
2008-04-04 17:27 ` Brian Hurt [this message]
  -- strict thread matches above, loose matches on Subject: below --
2008-04-02 13:30 sasha mal
2008-04-01 15:55 sasha mal
2008-04-01 23:42 ` [Caml-list] " Jon Harrop
2008-04-02 14:05   ` Frédéric Gava
2008-04-02 15:34 ` Mike Furr

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=47F66519.1020400@janestcapital.com \
    --to=bhurt@janestcapital.com \
    --cc=caml-list@yquem.inria.fr \
    --cc=sasha.mal@excite.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).