caml-list - the Caml user's mailing list
 help / color / mirror / Atom feed
From: Nicolas barnier <barnier@recherche.enac.fr>
To: caml-list@inria.fr
Subject: Re: [Caml-list] Ocaml clone detector
Date: Thu, 03 Sep 2009 17:06:21 +0200	[thread overview]
Message-ID: <4A9FDB6D.6060107@recherche.enac.fr> (raw)
In-Reply-To: <7d8707de0909022238w3124a0a7v1756a577e8467f76@mail.gmail.com>

An amazing and simple technology to detect plagiarism is
compression-based similarity distance. It is a side-effect
of state-of-the-art compression algorithms that can be used
to compute a distance for many kind of documents (it seems
to work at least for program sources, books, music, DNA etc):
take any two files A and B, compress A, compress B, and compress
the concatenation of A and B, i.e. AB; take the size of these
compressed files c(A), c(B) and c(AB); the similarity distance
is simply d(A,B) = 1 - (c(A) + c(B) - c(AB)) / max (c(A), c(B)).
Indeed, if documents A and B share information, the compression
of AB will be much shorter than c(A) + c(B).

A good article (in French unfortunately) can be found at:

http://interstices.info/jcms/c_21828/classer-musiques-langues-images-textes-et-genomes

where a link points to "Baldr", a free Java application written
by a French CS professor to compute this distance pairwise for a
set of source codes and sort the result. What appears is that the
distance between two documents will be much smaller in case of
plagiarism than for any other two (even if good students will tend
to produce close source codes for the same exam).

I wrote a small Ocaml program (baldml) to perform the same task
(but without GUI):

./baldml.opt -algo bz2 -regexp ".+ml$" -n 3 dir

where you can choose the compression algorithm among bz2 or gzip,
specify a Str-style regexp to filter the files (Ocaml files in the
example, but I use it as well for C exam) and the number of sorted
lines you want in the output among the n(n-1)/2 unique pairs of the
n matching files recursively found in the directory "dir".

I can provide the 100 lines of code it if anyone is interested.

Hope this helps,

-- Nicolas


  reply	other threads:[~2009-09-03 15:06 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2009-09-03  0:21 Kihong Heo
2009-09-03  1:19 ` [Caml-list] " Erik de Castro Lopo
2009-09-03  1:56   ` Jacques Carette
2009-09-03  3:14     ` Erik de Castro Lopo
2009-09-03  5:38       ` Andrej Bauer
2009-09-03 15:06         ` Nicolas barnier [this message]
2009-09-03 23:13           ` John Clements
2009-09-04 11:30         ` Stéphane Glondu

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4A9FDB6D.6060107@recherche.enac.fr \
    --to=barnier@recherche.enac.fr \
    --cc=caml-list@inria.fr \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).