Each file is split into tokens and symbols by a Tokenizer. Comments and whitespace are ignored. For example, take this line:
if(a < b) { /* Yes! */ printf("Yes"); }
The result would be:
Token | Description |
---|---|
if | Token |
( | Symbol |
a | Token |
< | Symbol |
b | Token |
) | Symbol |
{ | Symbol |
printf | Token |
( | Symbol |
"Yes" | Token (as quoted string) |
) | Symbol |
; | Symbol |
} | Symbol |
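
The splitting step above can be sketched in a few lines of Python. This is only an illustration: the names `TOKEN_PATTERN` and `tokenize` are hypothetical, and the regular expression is an assumption about C-like input, not the actual rules of the Tokenizer.

```python
import re

# Hypothetical pattern for C-like source; the real Tokenizer's rules may differ.
TOKEN_PATTERN = re.compile(
    r'''
    (?P<comment>/\*.*?\*/|//[^\n]*)   # block or line comment, dropped
  | (?P<space>\s+)                    # whitespace, dropped
  | (?P<string>"(?:\\.|[^"\\])*")     # quoted string, kept as one token
  | (?P<word>\w+)                     # keywords, identifiers, numbers
  | (?P<symbol>.)                     # any other single character
    ''',
    re.VERBOSE | re.DOTALL,
)


def tokenize(source):
    """Return (text, kind) pairs, skipping comments and whitespace."""
    result = []
    for m in TOKEN_PATTERN.finditer(source):
        kind = m.lastgroup  # name of the alternative that matched
        if kind in ("comment", "space"):
            continue
        result.append((m.group(), "Symbol" if kind == "symbol" else "Token"))
    return result


for text, kind in tokenize('if(a < b) { /* Yes! */ printf("Yes"); }'):
    print(text, "|", kind)
```

Running it on the example line prints the same 13 rows as the table above, with the comment and all whitespace removed.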
Then each file is compared with all other files, and each token is compared with all other tokens. Finally, the longest match is selected as the resulting match.
What is a match? A match is the longest chain of equal tokens, with some exceptions.
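
As a rough sketch of what "longest chain of equal tokens" can mean, the following Python function finds the longest contiguous run of identical tokens in two token lists. The name `longest_match` is hypothetical, the comparison is plain equality, and the exceptions mentioned above are not modelled because they are not specified here.

```python
def longest_match(a, b):
    """Return (start_a, start_b, length) of the longest run of equal tokens."""
    best = (0, 0, 0)
    prev = [0] * (len(b) + 1)  # run lengths ending at the previous token of a
    for i, tok_a in enumerate(a, start=1):
        curr = [0] * (len(b) + 1)
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                # Extend the run that ended one token earlier in both lists.
                curr[j] = prev[j - 1] + 1
                if curr[j] > best[2]:
                    best = (i - curr[j], j - curr[j], curr[j])
        prev = curr
    return best


a = ['if', '(', 'a', '<', 'b', ')', '{', 'printf', '(', '"Yes"', ')', ';', '}']
b = ['if', '(', 'x', '<', 'y', ')', '{', 'printf', '(', '"True"', ')', ';', '}']
start_a, start_b, length = longest_match(a, b)
print(a[start_a:start_a + length])  # ['(' is excluded before ')': prints [')', '{', 'printf', '(']
```

With strict equality, the renamed variables break the chain, so the longest run here is only four tokens long.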
File A
if(a < b) { /* Yes! */ printf("Yes"); }
File B
if(x < y) { // True
printf("True"); }
A | B | Match |
---|---|---|
if | if | Yes |
( | ( | Yes |
a | x | No |
< | < | Yes |
b | y | No |
) | ) | Yes |
{ | { | Yes |
printf | printf | Yes |
( | ( | Yes |
"Yes" | "True" | No |
) | ) | Yes |
; | ; | Yes |
} | } | Yes |
13 tokens | 13 tokens | 10 Yes, 3 No |
10 of the 13 tokens are the same, resulting in a similarity of 76.92 %. Whether this match counts as equal or not depends on your individual programming course.
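
The percentage can be reproduced with a short Python sketch. It assumes a position-by-position comparison of the two token lists, as in the table above; the function name `positional_similarity` is hypothetical, and the tool itself may compute similarity differently.

```python
def positional_similarity(a, b):
    """Percentage of positions at which both token lists hold the same token."""
    if not a or not b:
        return 0.0
    same = sum(1 for x, y in zip(a, b) if x == y)
    # Divide by the longer list so that extra tokens lower the score.
    return 100.0 * same / max(len(a), len(b))


file_a = ['if', '(', 'a', '<', 'b', ')', '{', 'printf', '(', '"Yes"', ')', ';', '}']
file_b = ['if', '(', 'x', '<', 'y', ')', '{', 'printf', '(', '"True"', ')', ';', '}']
print(round(positional_similarity(file_a, file_b), 2))  # 76.92
```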