From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.1.3 (2006-06-01) on yquem.inria.fr X-Spam-Level: X-Spam-Status: No, score=0.0 required=5.0 tests=AWL,HTML_MESSAGE autolearn=disabled version=3.1.3 X-Original-To: caml-list@yquem.inria.fr Delivered-To: caml-list@yquem.inria.fr Received: from mail3-relais-sop.national.inria.fr (mail3-relais-sop.national.inria.fr [192.134.164.104]) by yquem.inria.fr (Postfix) with ESMTP id 6AAC3BBCA for ; Fri, 4 Apr 2008 19:27:55 +0200 (CEST) X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: AlgBAIcB9kdCm3xro2dsb2JhbACCPjQpimiDSgEBAQEBAQcFCQcWmXo X-IronPort-AV: E=Sophos;i="4.25,605,1199660400"; d="scan'208,217";a="11084627" Received: from www.janestcapital.com (HELO smtp.janestcapital.com) ([66.155.124.107]) by mail3-smtp-sop.national.inria.fr with ESMTP; 04 Apr 2008 19:27:54 +0200 Received: from [172.25.129.161] [38.96.172.125] by janestcapital.com with ESMTP (SMTPD-9.10) id A5190A78; Fri, 04 Apr 2008 13:27:53 -0400 Message-ID: <47F66519.1020400@janestcapital.com> Date: Fri, 04 Apr 2008 13:27:53 -0400 From: Brian Hurt User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.2) Gecko/20040804 Netscape/7.2 (ax) X-Accept-Language: en-us, en MIME-Version: 1.0 To: sasha.mal@excite.com Cc: caml-list@yquem.inria.fr Subject: Re: [Caml-list] More efficient implementation of intersection of sets? References: <20080402140127.055C08B314@xprdmxin.myway.com> In-Reply-To: <20080402140127.055C08B314@xprdmxin.myway.com> Content-Type: multipart/alternative; boundary="------------090509060705010305090100" X-Spam: no; 0.00; intersection:01 nodes:01 nodes:01 intersection:01 subtrees:01 subtrees:01 wrote:01 wrote:01 caml-list:01 pair:01 pair:01 algorithm:01 algorithm:01 corrections:01 corrections:01 This is a multi-part message in MIME format. --------------090509060705010305090100 Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit sasha mal wrote: >>I'm thinking of an alternative approach which inserts keys one at a time. >> >> > > > >>Basically, we go through both trees simultaneously, starting with smallest >> >> > > > >>nodes of both trees. If two nodes with the same key are seen, the are inserted >> >> > > > >>into a new tree that will contain the intersection; then the pair of nodes with >> >> > > > >>next larger keys is taken. Otherwise, if, say, key1> >> > > > >>the smallest key k>=key2 inside the first tree. If k=key2, we add k into the >> >> > > > >>new tree and proceed to the next pair. If k>key2, we proceed with the pair >> >> > > > >>(k, key2) instead of (key1, key2). For key1>key2, proceed analogously, i.e. >> >> > > > >>search for the smallest key k>=key1 in the second tree. If at some point of >> >> > > > >>time no "larger" key is found in one of the two original trees, the new tree >> >> > > > >>where we insert elements to contains exactly the intersection. >> >> > > > >[Small corrections.] > > > >By the way, the current implemented procedure always splits the second tree, regardless of the sizes of the trees. Imagine that we have two trees of very different sizes (e.g. with 5 and 10^5 elements). What tree should be split then? > > > Sorry for the late response: The length-5 tree should be the one which is split, for two reasons. Reason number one, it makes it the most likely that one half or the other will be empty. This means that it's much more likely that I'll be able to skip large subtrees of the million-element set without inspection. The second reason is that split (the function) is O((log N)^2) in cost, this is obviously cheaper if N=5 than if N=1 million. Looking at this algorithm over the last couple of days, I've come to conclusion that it's O(N) normal-case, but O(N (log N)^2) worst case, and O(log N) best-cast, where N = the size of the larger set. But I'm not sure of it. It's certainly an interesting approach. Brian --------------090509060705010305090100 Content-Type: text/html; charset=us-ascii Content-Transfer-Encoding: 7bit sasha mal wrote:
I'm thinking of an alternative approach which inserts keys one at a time.
    

  
Basically, we go through both trees simultaneously, starting with smallest
    

  
nodes of both trees. If two nodes with the same key are seen, the are inserted
    

  
into a new tree that will contain the intersection; then the pair of nodes with
    

  
next larger keys is taken. Otherwise, if, say, key1<key2, then we search for
    

  
the smallest key k>=key2 inside the first tree. If k=key2, we add k into the
    

  
new tree and proceed to the next pair. If k>key2, we proceed with the pair 
    

  
(k, key2) instead of (key1, key2). For key1>key2, proceed analogously, i.e.
    

  
search for the smallest key k>=key1 in the second tree. If at some point of
    

  
time no "larger" key is found in one of the two original trees, the new tree
    

  
where we insert elements to contains exactly the intersection.
    



[Small corrections.]



By the way, the current implemented procedure always splits the second tree, regardless of the sizes of the trees. Imagine that we have two trees of very different sizes (e.g. with 5 and 10^5 elements). What tree should be split then?

  
Sorry for the late response:

The length-5 tree should be the one which is split, for two reasons.  Reason number one, it makes it the most likely that one half or the other will be empty.  This means that it's much more likely that I'll be able to skip large subtrees of the million-element set without inspection.  The second reason is that split (the function) is O((log N)^2) in cost, this is obviously cheaper if N=5 than if N=1 million.

Looking at this algorithm over the last couple of days, I've come to conclusion that it's O(N) normal-case, but O(N (log N)^2) worst case, and O(log N) best-cast, where N = the size of the larger set.  But I'm not sure of it.  It's certainly an interesting approach.

Brian

--------------090509060705010305090100--