From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <barnier@recherche.enac.fr>
X-Spam-Checker-Version: SpamAssassin 3.1.3 (2006-06-01) on yquem.inria.fr
X-Spam-Level: 
X-Spam-Status: No, score=0.0 required=5.0 tests=none autolearn=disabled 
	version=3.1.3
X-Original-To: caml-list@yquem.inria.fr
Delivered-To: caml-list@yquem.inria.fr
Received: from mail1-relais-roc.national.inria.fr (mail1-relais-roc.national.inria.fr [192.134.164.82])
	by yquem.inria.fr (Postfix) with ESMTP id 6119BBC37
	for <caml-list@yquem.inria.fr>; Thu,  3 Sep 2009 17:06:24 +0200 (CEST)
X-IronPort-Anti-Spam-Filtered: true
X-IronPort-Anti-Spam-Result: At4IAPp3n0rD3J8k/2dsb2JhbACBU4Ew2DuEGwU
X-IronPort-AV: E=Sophos;i="4.44,325,1249250400"; 
   d="scan'208";a="35437929"
Received: from imss-1.enac.fr ([195.220.159.36])
  by mail1-smtp-roc.national.inria.fr with ESMTP; 03 Sep 2009 17:06:24 +0200
Received: from mauve.recherche.enac.fr (imss-1.imss.interne.enac [127.0.0.1])
	by imss-1.enac.fr (Postfix) with ESMTP id 96F8F18B3AB
	for <caml-list@inria.fr>; Thu,  3 Sep 2009 17:06:21 +0200 (CEST)
Received: from [10.31.1.89] (beige.recherche.enac.fr [10.31.1.89])
	by mauve.recherche.enac.fr (Postfix) with ESMTP id 524112FA825
	for <caml-list@inria.fr>; Thu,  3 Sep 2009 17:06:21 +0200 (CEST)
Message-ID: <4A9FDB6D.6060107@recherche.enac.fr>
Date: Thu, 03 Sep 2009 17:06:21 +0200
From: Nicolas barnier <barnier@recherche.enac.fr>
User-Agent: Mozilla-Thunderbird 2.0.0.22 (X11/20090701)
MIME-Version: 1.0
To: caml-list@inria.fr
Subject: Re: [Caml-list] Ocaml clone detector
References: <F1B386A5-A12F-49B4-8B9B-42BD3E0C4726@ropas.snu.ac.kr><200909031	11944.6479d156.mle+ocaml@mega-nerd.com><4A9F2264.7000909@mcmaster.ca><20090	903131453.59a2d2e7.mle+ocaml@mega-nerd.com> <7d8707de0909022238w3124a0a7v1756a577e8467f76@mail.gmail.com>
In-Reply-To: <7d8707de0909022238w3124a0a7v1756a577e8467f76@mail.gmail.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
X-Spam: no; 0.00; barnier:01 barnier:01 enac:01 ocaml:01 interstices:01 ocaml:01 regexp:01 wrote:01 recursively:01 caml-list:01 pairs:01 algorithm:01 output:02 algorithms:03 seems:03 

An amazing and simple technology to detect plagiarism is
compression-based similarity distance. It is a side-effect
of state-of-the-art compression algorithms that can be used
to compute a distance for many kind of documents (it seems
to work at least for program sources, books, music, DNA etc):
take any two files A and B, compress A, compress B, and compress
the concatenation of A and B, i.e. AB; take the size of these
compressed files c(A), c(B) and c(AB); the similarity distance
is simply d(A,B) = 1 - (c(A) + c(B) - c(AB)) / max (c(A), c(B)).
Indeed, if documents A and B share information, the compression
of AB will be much shorter than c(A) + c(B).

A good article (in French unfortunately) can be found at:

http://interstices.info/jcms/c_21828/classer-musiques-langues-images-textes-et-genomes

where a link points to "Baldr", a free Java application written
by a French CS professor to compute this distance pairwise for a
set of source codes and sort the result. What appears is that the
distance between two documents will be much smaller in case of
plagiarism than for any other two (even if good students will tend
to produce close source codes for the same exam).

I wrote a small Ocaml program (baldml) to perform the same task
(but without GUI):

./baldml.opt -algo bz2 -regexp ".+ml$" -n 3 dir

where you can choose the compression algorithm among bz2 or gzip,
specify a Str-style regexp to filter the files (Ocaml files in the
example, but I use it as well for C exam) and the number of sorted
lines you want in the output among the n(n-1)/2 unique pairs of the
n matching files recursively found in the directory "dir".

I can provide the 100 lines of code it if anyone is interested.

Hope this helps,

-- Nicolas