caml-list - the Caml user's mailing list
 help / color / mirror / Atom feed
* Ocaml clone detector
@ 2009-09-03  0:21 Kihong Heo
  2009-09-03  1:19 ` [Caml-list] " Erik de Castro Lopo
  0 siblings, 1 reply; 8+ messages in thread
From: Kihong Heo @ 2009-09-03  0:21 UTC (permalink / raw)
  To: caml-list; +Cc: ms1

Hi everyone.
I think you can help me.

I want to know there is a clone detector for Ocaml program.

If you know any program like that, please let me know.

Cheers,
--
Kihong Heo
khheo@ropas.snu.ac.kr
Programming Research Lab.
Seoul National University


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [Caml-list] Ocaml clone detector
  2009-09-03  0:21 Ocaml clone detector Kihong Heo
@ 2009-09-03  1:19 ` Erik de Castro Lopo
  2009-09-03  1:56   ` Jacques Carette
  0 siblings, 1 reply; 8+ messages in thread
From: Erik de Castro Lopo @ 2009-09-03  1:19 UTC (permalink / raw)
  To: caml-list

Kihong Heo wrote:

> I want to know there is a clone detector for Ocaml program.

Maybe it would help if you explained what a "clone detector"
is.

Erik
-- 
----------------------------------------------------------------------
Erik de Castro Lopo
http://www.mega-nerd.com/


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [Caml-list] Ocaml clone detector
  2009-09-03  1:19 ` [Caml-list] " Erik de Castro Lopo
@ 2009-09-03  1:56   ` Jacques Carette
  2009-09-03  3:14     ` Erik de Castro Lopo
  0 siblings, 1 reply; 8+ messages in thread
From: Jacques Carette @ 2009-09-03  1:56 UTC (permalink / raw)
  To: caml-list

Erik de Castro Lopo wrote:
> Maybe it would help if you explained what a "clone detector"
> is.
>   
A "clone" is software-engineering speak for "duplicated code".  Exactly
what qualifies as 'duplicated code' and how to efficiently find such
(without too many false positives nor false negatives) is still fairly
active research.  This is a huge issue in languages without decent
abstraction features, and less so otherwise (or so it seems).

Jacques


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [Caml-list] Ocaml clone detector
  2009-09-03  1:56   ` Jacques Carette
@ 2009-09-03  3:14     ` Erik de Castro Lopo
  2009-09-03  5:38       ` Andrej Bauer
  0 siblings, 1 reply; 8+ messages in thread
From: Erik de Castro Lopo @ 2009-09-03  3:14 UTC (permalink / raw)
  To: caml-list

Jacques Carette wrote:

> Erik de Castro Lopo wrote:
> > Maybe it would help if you explained what a "clone detector"
> > is.
> >   
> A "clone" is software-engineering speak for "duplicated code".  Exactly
> what qualifies as 'duplicated code' and how to efficiently find such
> (without too many false positives nor false negatives) is still fairly
> active research.  This is a huge issue in languages without decent
> abstraction features, and less so otherwise (or so it seems).

Thanks for the explanation. 

I can think of two situations where such a clone detector may be useful;
for finding similar chunks of code so that they may be refactored for
software QA management and for detecting plaguarism in student programming
assignments.

I suspect that these two kinds of clone detectors would actually be
quite different.

Cheers,
Erik
-- 
----------------------------------------------------------------------
Erik de Castro Lopo
http://www.mega-nerd.com/


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [Caml-list] Ocaml clone detector
  2009-09-03  3:14     ` Erik de Castro Lopo
@ 2009-09-03  5:38       ` Andrej Bauer
  2009-09-03 15:06         ` Nicolas barnier
  2009-09-04 11:30         ` Stéphane Glondu
  0 siblings, 2 replies; 8+ messages in thread
From: Andrej Bauer @ 2009-09-03  5:38 UTC (permalink / raw)
  To: caml-list

As far as student plagiarism goes, we found out that for Java
programs, you can pretty much detect frauds by erasing everything from
the programs except parentheses ( ) { } and then comparing the
remaining strings for editing distance. My explanation is that
students who copy code don't want to spend much time on it. In order
to chance the parenthesis they would have to understand the structure
of the program, which they don't.

Andrej


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [Caml-list] Ocaml clone detector
  2009-09-03  5:38       ` Andrej Bauer
@ 2009-09-03 15:06         ` Nicolas barnier
  2009-09-03 23:13           ` John Clements
  2009-09-04 11:30         ` Stéphane Glondu
  1 sibling, 1 reply; 8+ messages in thread
From: Nicolas barnier @ 2009-09-03 15:06 UTC (permalink / raw)
  To: caml-list

An amazing and simple technology to detect plagiarism is
compression-based similarity distance. It is a side-effect
of state-of-the-art compression algorithms that can be used
to compute a distance for many kind of documents (it seems
to work at least for program sources, books, music, DNA etc):
take any two files A and B, compress A, compress B, and compress
the concatenation of A and B, i.e. AB; take the size of these
compressed files c(A), c(B) and c(AB); the similarity distance
is simply d(A,B) = 1 - (c(A) + c(B) - c(AB)) / max (c(A), c(B)).
Indeed, if documents A and B share information, the compression
of AB will be much shorter than c(A) + c(B).

A good article (in French unfortunately) can be found at:

http://interstices.info/jcms/c_21828/classer-musiques-langues-images-textes-et-genomes

where a link points to "Baldr", a free Java application written
by a French CS professor to compute this distance pairwise for a
set of source codes and sort the result. What appears is that the
distance between two documents will be much smaller in case of
plagiarism than for any other two (even if good students will tend
to produce close source codes for the same exam).

I wrote a small Ocaml program (baldml) to perform the same task
(but without GUI):

./baldml.opt -algo bz2 -regexp ".+ml$" -n 3 dir

where you can choose the compression algorithm among bz2 or gzip,
specify a Str-style regexp to filter the files (Ocaml files in the
example, but I use it as well for C exam) and the number of sorted
lines you want in the output among the n(n-1)/2 unique pairs of the
n matching files recursively found in the directory "dir".

I can provide the 100 lines of code it if anyone is interested.

Hope this helps,

-- Nicolas


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [Caml-list] Ocaml clone detector
  2009-09-03 15:06         ` Nicolas barnier
@ 2009-09-03 23:13           ` John Clements
  0 siblings, 0 replies; 8+ messages in thread
From: John Clements @ 2009-09-03 23:13 UTC (permalink / raw)
  To: Nicolas barnier; +Cc: caml-list


On Sep 3, 2009, at 8:06 AM, Nicolas barnier wrote:

> An amazing and simple technology to detect plagiarism is
> compression-based similarity distance. It is a side-effect
> of state-of-the-art compression algorithms that can be used
> to compute a distance for many kind of documents (it seems
> to work at least for program sources, books, music, DNA etc):
> take any two files A and B, compress A, compress B, and compress
> the concatenation of A and B, i.e. AB; take the size of these
> compressed files c(A), c(B) and c(AB); the similarity distance
> is simply d(A,B) = 1 - (c(A) + c(B) - c(AB)) / max (c(A), c(B)).
> Indeed, if documents A and B share information, the compression
> of AB will be much shorter than c(A) + c(B).

Also see Alex Aiken's "MOSS" (measure of software similarity).  It's  
online, language-specific, works for a variety of languages.  Don't  
know how its algorithm compares to the one here. I suspect it's  
different insofar the one you describe is language-independent.

John Clements


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [Caml-list] Ocaml clone detector
  2009-09-03  5:38       ` Andrej Bauer
  2009-09-03 15:06         ` Nicolas barnier
@ 2009-09-04 11:30         ` Stéphane Glondu
  1 sibling, 0 replies; 8+ messages in thread
From: Stéphane Glondu @ 2009-09-04 11:30 UTC (permalink / raw)
  To: Andrej Bauer; +Cc: caml-list

Andrej Bauer a écrit :
> As far as student plagiarism goes, we found out that for Java
> programs, you can pretty much detect frauds by erasing everything from
> the programs except parentheses ( ) { } and then comparing the
> remaining strings for editing distance. My explanation is that
> students who copy code don't want to spend much time on it. In order
> to chance the parenthesis they would have to understand the structure
> of the program, which they don't.

This reminds me of:

  http://www.cs.ucdavis.edu/~su/publications/icse07.pdf


Cheers,

-- 
Stéphane


^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2009-09-04 11:30 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-09-03  0:21 Ocaml clone detector Kihong Heo
2009-09-03  1:19 ` [Caml-list] " Erik de Castro Lopo
2009-09-03  1:56   ` Jacques Carette
2009-09-03  3:14     ` Erik de Castro Lopo
2009-09-03  5:38       ` Andrej Bauer
2009-09-03 15:06         ` Nicolas barnier
2009-09-03 23:13           ` John Clements
2009-09-04 11:30         ` Stéphane Glondu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).