HowItWorks

Arthur Zaczek

How it works

Each file is split into tokens and symbols by a tokenizer. Comments and whitespace are ignored.

Example:

if(a < b)
{
    /* Yes! */
    printf("Yes");
}

The result would be:

Token     Description
if        Token
(         Symbol
a         Token
<         Symbol
b         Token
)         Symbol
{         Symbol
printf    Token
(         Symbol
"Yes"     Token (as quoted string)
)         Symbol
;         Symbol
}         Symbol
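
The tokenizing step can be sketched in a few lines of Python. This is only an illustration of the idea, not the tool's actual tokenizer: the regular expression, the function name `tokenize`, and the rule that every non-word character is a one-character symbol are assumptions made for this example.

```python
import re

# Hypothetical pattern: comments and whitespace are matched so they can be skipped.
TOKEN_RE = re.compile(r"""
    (?P<comment>/\*.*?\*/|//[^\n]*)   # block or line comment (ignored)
  | (?P<space>\s+)                    # whitespace (ignored)
  | (?P<string>"(?:\\.|[^"\\])*")     # quoted string, kept as one token
  | (?P<word>\w+)                     # identifier, keyword, or number
  | (?P<symbol>.)                     # any other single character
""", re.VERBOSE | re.DOTALL)


def tokenize(source):
    """Return a list of (text, description) pairs, skipping comments and whitespace."""
    result = []
    for m in TOKEN_RE.finditer(source):
        kind = m.lastgroup  # name of the alternative that matched
        if kind in ("comment", "space"):
            continue
        result.append((m.group(), "Symbol" if kind == "symbol" else "Token"))
    return result


example = 'if(a < b)\n{\n    /* Yes! */\n    printf("Yes");\n}'
for text, description in tokenize(example):
    print(text, description)
```

Run on the example above, this sketch yields the same 13 entries as the table.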

Then each file is compared with all other files. Each token is compared with all other tokens. Finally, the longest match is selected as the resulting match.

What is a match? A match is the longest chain of equal tokens, with some exceptions.

  1. At least one token in every {max-match-distance} tokens must match.
  2. At least {min-common-token} % of the tokens must match.
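
One possible reading of these two rules, sketched in Python. The exact semantics of {max-match-distance} and {min-common-token} are not spelled out here, so the interpretation below is an assumption: a chain ends once {max-match-distance} consecutive tokens differ, and a chain only counts if the given percentage of its tokens are equal. The function names and the default values (2 and 70.0) are hypothetical.

```python
def extend_match(a, b, i, j, max_match_distance):
    """Extend a chain starting at a[i], b[j].

    Returns (length, equal) for the longest chain that ends on an equal pair.
    """
    length = equal = 0  # best chain found so far
    seen_equal = 0      # equal pairs seen while scanning
    gap = 0             # consecutive unequal pairs since the last equal pair
    k = 0
    while i + k < len(a) and j + k < len(b):
        if a[i + k] == b[j + k]:
            seen_equal += 1
            gap = 0
            length, equal = k + 1, seen_equal
        else:
            gap += 1
            if gap >= max_match_distance:  # assumed meaning of rule 1
                break
        k += 1
    return length, equal


def longest_match(a, b, max_match_distance=2, min_common_token=70.0):
    """Try every pair of start positions; return (length, equal, start_a, start_b)."""
    best = (0, 0, 0, 0)
    for i in range(len(a)):
        for j in range(len(b)):
            length, equal = extend_match(a, b, i, j, max_match_distance)
            enough = length > 0 and 100.0 * equal / length >= min_common_token  # rule 2
            if enough and length > best[0]:
                best = (length, equal, i, j)
    return best
```

This brute-force version checks every pair of start positions, which is fine for illustration but slow for large files.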

Example:

File A

if(a < b)
{
    /* Yes! */
    printf("Yes");
}

File B

if(x < y)
{
    // True
    printf("True");
}

A         B         Match
if        if        Yes
(         (         Yes
a         x         No
<         <         Yes
b         y         No
)         )         Yes
{         {         Yes
printf    printf    Yes
(         (         Yes
"Yes"     "True"    No
)         )         Yes
;         ;         Yes
}         }         Yes

Totals: 13 tokens in A, 13 tokens in B; 10 match and 3 do not.

10 of 13 tokens are the same, resulting in a similarity of 76.92 %. Whether this match counts as equal depends on your individual programming course.
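
The percentage can be reproduced with a short Python sketch. The token lists are copied from the example; the function name `similarity` and the position-by-position comparison are assumptions made for illustration, not necessarily how the tool computes its result.

```python
tokens_a = ["if", "(", "a", "<", "b", ")", "{", "printf", "(", '"Yes"', ")", ";", "}"]
tokens_b = ["if", "(", "x", "<", "y", ")", "{", "printf", "(", '"True"', ")", ";", "}"]


def similarity(a, b):
    """Percentage of positions at which both token lists hold the same token."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    equal = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * equal / longest


print(f"{similarity(tokens_a, tokens_b):.2f} %")  # prints "76.92 %"
```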

