From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.1.3 (2006-06-01) on yquem.inria.fr X-Spam-Level: * X-Spam-Status: No, score=1.3 required=5.0 tests=SPF_FAIL autolearn=disabled version=3.1.3 X-Original-To: caml-list@yquem.inria.fr Delivered-To: caml-list@yquem.inria.fr Received: from mail4-relais-sop.national.inria.fr (mail4-relais-sop.national.inria.fr [192.134.164.105]) by yquem.inria.fr (Postfix) with ESMTP id 9B309BC54 for ; Fri, 4 Sep 2009 01:13:41 +0200 (CEST) X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: AuQEAOPpn0pDErp/X2dsb2JhbACbMh6/fYQbBQ X-IronPort-AV: E=Sophos;i="4.44,327,1249250400"; d="scan'208";a="45984610" Received: from li21-127.members.linode.com ([67.18.186.127]) by mail4-smtp-sop.national.inria.fr with ESMTP; 04 Sep 2009 01:13:40 +0200 Received: from pcp062738pcs.wireless.calpoly.edu (pcp062738pcs.wireless.calpoly.edu [129.65.201.168]) by li21-127.members.linode.com (Postfix) with ESMTPSA id E83C5E779; Thu, 3 Sep 2009 19:13:37 -0400 (EDT) Cc: caml-list@inria.fr Message-Id: <98CD7C59-7CBB-4988-916E-0A4CB343344D@brinckerhoff.org> From: John Clements To: Nicolas barnier In-Reply-To: <4A9FDB6D.6060107@recherche.enac.fr> Content-Type: text/plain; charset=US-ASCII; format=flowed; delsp=yes Content-Transfer-Encoding: 7bit Mime-Version: 1.0 (Apple Message framework v936) Subject: Re: [Caml-list] Ocaml clone detector Date: Thu, 3 Sep 2009 16:13:37 -0700 References: <200909031 11944.6479d156.mle+ocaml@mega-nerd.com><4A9F2264.7000909@mcmaster.ca><20090 903131453.59a2d2e7.mle+ocaml@mega-nerd.com> <7d8707de0909022238w3124a0a7v1756a577e8467f76@mail.gmail.com> <4A9FDB6D.6060107@recherche.enac.fr> X-Mailer: Apple Mail (2.936) X-Spam: no; 0.00; ocaml:01 barnier:01 insofar:01 wrote:01 caml-list:01 compares:01 algorithm:01 algorithms:03 seems:03 languages:03 alex:03 shorter:04 side-effect:04 moss:04 size:95 On Sep 3, 2009, at 8:06 AM, Nicolas barnier wrote: > An amazing and simple technology to detect plagiarism is > compression-based similarity distance. It is a side-effect > of state-of-the-art compression algorithms that can be used > to compute a distance for many kind of documents (it seems > to work at least for program sources, books, music, DNA etc): > take any two files A and B, compress A, compress B, and compress > the concatenation of A and B, i.e. AB; take the size of these > compressed files c(A), c(B) and c(AB); the similarity distance > is simply d(A,B) = 1 - (c(A) + c(B) - c(AB)) / max (c(A), c(B)). > Indeed, if documents A and B share information, the compression > of AB will be much shorter than c(A) + c(B). Also see Alex Aiken's "MOSS" (measure of software similarity). It's online, language-specific, works for a variety of languages. Don't know how its algorithm compares to the one here. I suspect it's different insofar the one you describe is language-independent. John Clements