From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.1.3 (2006-06-01) on yquem.inria.fr X-Spam-Level: X-Spam-Status: No, score=0.0 required=5.0 tests=none autolearn=disabled version=3.1.3 X-Original-To: caml-list@yquem.inria.fr Delivered-To: caml-list@yquem.inria.fr Received: from mail1-relais-roc.national.inria.fr (mail1-relais-roc.national.inria.fr [192.134.164.82]) by yquem.inria.fr (Postfix) with ESMTP id 6119BBC37 for ; Thu, 3 Sep 2009 17:06:24 +0200 (CEST) X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: At4IAPp3n0rD3J8k/2dsb2JhbACBU4Ew2DuEGwU X-IronPort-AV: E=Sophos;i="4.44,325,1249250400"; d="scan'208";a="35437929" Received: from imss-1.enac.fr ([195.220.159.36]) by mail1-smtp-roc.national.inria.fr with ESMTP; 03 Sep 2009 17:06:24 +0200 Received: from mauve.recherche.enac.fr (imss-1.imss.interne.enac [127.0.0.1]) by imss-1.enac.fr (Postfix) with ESMTP id 96F8F18B3AB for ; Thu, 3 Sep 2009 17:06:21 +0200 (CEST) Received: from [10.31.1.89] (beige.recherche.enac.fr [10.31.1.89]) by mauve.recherche.enac.fr (Postfix) with ESMTP id 524112FA825 for ; Thu, 3 Sep 2009 17:06:21 +0200 (CEST) Message-ID: <4A9FDB6D.6060107@recherche.enac.fr> Date: Thu, 03 Sep 2009 17:06:21 +0200 From: Nicolas barnier User-Agent: Mozilla-Thunderbird 2.0.0.22 (X11/20090701) MIME-Version: 1.0 To: caml-list@inria.fr Subject: Re: [Caml-list] Ocaml clone detector References: <200909031 11944.6479d156.mle+ocaml@mega-nerd.com><4A9F2264.7000909@mcmaster.ca><20090 903131453.59a2d2e7.mle+ocaml@mega-nerd.com> <7d8707de0909022238w3124a0a7v1756a577e8467f76@mail.gmail.com> In-Reply-To: <7d8707de0909022238w3124a0a7v1756a577e8467f76@mail.gmail.com> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit X-Spam: no; 0.00; barnier:01 barnier:01 enac:01 ocaml:01 interstices:01 ocaml:01 regexp:01 wrote:01 recursively:01 caml-list:01 pairs:01 algorithm:01 output:02 algorithms:03 seems:03 An amazing and simple technology to detect plagiarism is compression-based similarity distance. It is a side-effect of state-of-the-art compression algorithms that can be used to compute a distance for many kind of documents (it seems to work at least for program sources, books, music, DNA etc): take any two files A and B, compress A, compress B, and compress the concatenation of A and B, i.e. AB; take the size of these compressed files c(A), c(B) and c(AB); the similarity distance is simply d(A,B) = 1 - (c(A) + c(B) - c(AB)) / max (c(A), c(B)). Indeed, if documents A and B share information, the compression of AB will be much shorter than c(A) + c(B). A good article (in French unfortunately) can be found at: http://interstices.info/jcms/c_21828/classer-musiques-langues-images-textes-et-genomes where a link points to "Baldr", a free Java application written by a French CS professor to compute this distance pairwise for a set of source codes and sort the result. What appears is that the distance between two documents will be much smaller in case of plagiarism than for any other two (even if good students will tend to produce close source codes for the same exam). I wrote a small Ocaml program (baldml) to perform the same task (but without GUI): ./baldml.opt -algo bz2 -regexp ".+ml$" -n 3 dir where you can choose the compression algorithm among bz2 or gzip, specify a Str-style regexp to filter the files (Ocaml files in the example, but I use it as well for C exam) and the number of sorted lines you want in the output among the n(n-1)/2 unique pairs of the n matching files recursively found in the directory "dir". I can provide the 100 lines of code it if anyone is interested. Hope this helps, -- Nicolas