From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <bhurt@janestcapital.com>
X-Spam-Checker-Version: SpamAssassin 3.1.3 (2006-06-01) on yquem.inria.fr
X-Spam-Level: 
X-Spam-Status: No, score=0.0 required=5.0 tests=AWL,HTML_MESSAGE 
	autolearn=disabled version=3.1.3
X-Original-To: caml-list@yquem.inria.fr
Delivered-To: caml-list@yquem.inria.fr
Received: from mail3-relais-sop.national.inria.fr (mail3-relais-sop.national.inria.fr [192.134.164.104])
	by yquem.inria.fr (Postfix) with ESMTP id 6AAC3BBCA
	for <caml-list@yquem.inria.fr>; Fri,  4 Apr 2008 19:27:55 +0200 (CEST)
X-IronPort-Anti-Spam-Filtered: true
X-IronPort-Anti-Spam-Result: AlgBAIcB9kdCm3xro2dsb2JhbACCPjQpimiDSgEBAQEBAQcFCQcWmXo
X-IronPort-AV: E=Sophos;i="4.25,605,1199660400"; 
   d="scan'208,217";a="11084627"
Received: from www.janestcapital.com (HELO smtp.janestcapital.com) ([66.155.124.107])
  by mail3-smtp-sop.national.inria.fr with ESMTP; 04 Apr 2008 19:27:54 +0200
Received: from [172.25.129.161] [38.96.172.125] by janestcapital.com with ESMTP
  (SMTPD-9.10) id A5190A78; Fri, 04 Apr 2008 13:27:53 -0400
Message-ID: <47F66519.1020400@janestcapital.com>
Date: Fri, 04 Apr 2008 13:27:53 -0400
From: Brian Hurt <bhurt@janestcapital.com>
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.2) Gecko/20040804 Netscape/7.2 (ax)
X-Accept-Language: en-us, en
MIME-Version: 1.0
To: sasha.mal@excite.com
Cc: caml-list@yquem.inria.fr
Subject: Re: [Caml-list] More efficient implementation of intersection of
 sets?
References: <20080402140127.055C08B314@xprdmxin.myway.com>
In-Reply-To: <20080402140127.055C08B314@xprdmxin.myway.com>
Content-Type: multipart/alternative;
 boundary="------------090509060705010305090100"
X-Spam: no; 0.00; intersection:01 nodes:01 nodes:01 intersection:01 subtrees:01 subtrees:01 wrote:01 wrote:01 caml-list:01 pair:01 pair:01 algorithm:01 algorithm:01 corrections:01 corrections:01 

This is a multi-part message in MIME format.
--------------090509060705010305090100
Content-Type: text/plain; charset=us-ascii; format=flowed
Content-Transfer-Encoding: 7bit

sasha mal wrote:

>>I'm thinking of an alternative approach which inserts keys one at a time.
>>    
>>
>
>  
>
>>Basically, we go through both trees simultaneously, starting with smallest
>>    
>>
>
>  
>
>>nodes of both trees. If two nodes with the same key are seen, the are inserted
>>    
>>
>
>  
>
>>into a new tree that will contain the intersection; then the pair of nodes with
>>    
>>
>
>  
>
>>next larger keys is taken. Otherwise, if, say, key1<key2, then we search for
>>    
>>
>
>  
>
>>the smallest key k>=key2 inside the first tree. If k=key2, we add k into the
>>    
>>
>
>  
>
>>new tree and proceed to the next pair. If k>key2, we proceed with the pair 
>>    
>>
>
>  
>
>>(k, key2) instead of (key1, key2). For key1>key2, proceed analogously, i.e.
>>    
>>
>
>  
>
>>search for the smallest key k>=key1 in the second tree. If at some point of
>>    
>>
>
>  
>
>>time no "larger" key is found in one of the two original trees, the new tree
>>    
>>
>
>  
>
>>where we insert elements to contains exactly the intersection.
>>    
>>
>
>
>
>[Small corrections.]
>
>
>
>By the way, the current implemented procedure always splits the second tree, regardless of the sizes of the trees. Imagine that we have two trees of very different sizes (e.g. with 5 and 10^5 elements). What tree should be split then?
>
>  
>
Sorry for the late response:

The length-5 tree should be the one which is split, for two reasons.  
Reason number one, it makes it the most likely that one half or the 
other will be empty.  This means that it's much more likely that I'll be 
able to skip large subtrees of the million-element set without 
inspection.  The second reason is that split (the function) is O((log 
N)^2) in cost, this is obviously cheaper if N=5 than if N=1 million.

Looking at this algorithm over the last couple of days, I've come to 
conclusion that it's O(N) normal-case, but O(N (log N)^2) worst case, 
and O(log N) best-cast, where N = the size of the larger set.  But I'm 
not sure of it.  It's certainly an interesting approach.

Brian


--------------090509060705010305090100
Content-Type: text/html; charset=us-ascii
Content-Transfer-Encoding: 7bit

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
  <meta content="text/html;charset=ISO-8859-1" http-equiv="Content-Type">
  <title></title>
</head>
<body bgcolor="#ffffff" text="#000000">
sasha mal wrote:<br>
<blockquote cite="mid20080402140127.055C08B314@xprdmxin.myway.com"
 type="cite">
  <blockquote type="cite">
    <pre wrap="">I'm thinking of an alternative approach which inserts keys one at a time.
    </pre>
  </blockquote>
  <pre wrap=""><!---->
  </pre>
  <blockquote type="cite">
    <pre wrap="">Basically, we go through both trees simultaneously, starting with smallest
    </pre>
  </blockquote>
  <pre wrap=""><!---->
  </pre>
  <blockquote type="cite">
    <pre wrap="">nodes of both trees. If two nodes with the same key are seen, the are inserted
    </pre>
  </blockquote>
  <pre wrap=""><!---->
  </pre>
  <blockquote type="cite">
    <pre wrap="">into a new tree that will contain the intersection; then the pair of nodes with
    </pre>
  </blockquote>
  <pre wrap=""><!---->
  </pre>
  <blockquote type="cite">
    <pre wrap="">next larger keys is taken. Otherwise, if, say, key1&lt;key2, then we search for
    </pre>
  </blockquote>
  <pre wrap=""><!---->
  </pre>
  <blockquote type="cite">
    <pre wrap="">the smallest key k&gt;=key2 inside the first tree. If k=key2, we add k into the
    </pre>
  </blockquote>
  <pre wrap=""><!---->
  </pre>
  <blockquote type="cite">
    <pre wrap="">new tree and proceed to the next pair. If k&gt;key2, we proceed with the pair 
    </pre>
  </blockquote>
  <pre wrap=""><!---->
  </pre>
  <blockquote type="cite">
    <pre wrap="">(k, key2) instead of (key1, key2). For key1&gt;key2, proceed analogously, i.e.
    </pre>
  </blockquote>
  <pre wrap=""><!---->
  </pre>
  <blockquote type="cite">
    <pre wrap="">search for the smallest key k&gt;=key1 in the second tree. If at some point of
    </pre>
  </blockquote>
  <pre wrap=""><!---->
  </pre>
  <blockquote type="cite">
    <pre wrap="">time no "larger" key is found in one of the two original trees, the new tree
    </pre>
  </blockquote>
  <pre wrap=""><!---->
  </pre>
  <blockquote type="cite">
    <pre wrap="">where we insert elements to contains exactly the intersection.
    </pre>
  </blockquote>
  <pre wrap=""><!---->


[Small corrections.]


By the way, the current implemented procedure always splits the second tree, regardless of the sizes of the trees. Imagine that we have two trees of very different sizes (e.g. with 5 and 10^5 elements). What tree should be split then?

  </pre>
</blockquote>
Sorry for the late response:<br>
<br>
The length-5 tree should be the one which is split, for two reasons.&nbsp;
Reason number one, it makes it the most likely that one half or the
other will be empty.&nbsp; This means that it's much more likely that I'll
be able to skip large subtrees of the million-element set without
inspection.&nbsp; The second reason is that split (the function) is O((log
N)^2) in cost, this is obviously cheaper if N=5 than if N=1 million.<br>
<br>
Looking at this algorithm over the last couple of days, I've come to
conclusion that it's O(N) normal-case, but O(N (log N)^2) worst case,
and O(log N) best-cast, where N = the size of the larger set.&nbsp; But I'm
not sure of it.&nbsp; It's certainly an interesting approach.<br>
<br>
Brian<br>
<br>
</body>
</html>

--------------090509060705010305090100--