#5 Foreign language versions of some apps cause detection issues

open
psthomas
5
2010-09-03
2010-09-03
psthomas
No

Some users have reported that BlindElephant fails to detect (or imprecisely detects) foreign-language versions of some apps. This is partly a consequence of the current technique (cryptographic hashing) and partly of the source packages initially chosen to create the datafiles (generally default or English, but occasionally all-languages packages, which tend to prefer English documentation).

If BlindElephant attempts to fetch, say, a readme.html file and only has hashes for the various English versions of it, then it gains no information when it finds a Japanese version ("Retrieved file doesn't match known fingerprint. 55db377b389b213ea42eeda1ff99ea70"). Other files (say .js and .css) are usually common across languages and provide good data, but app guessing (which focuses on just a few files) often doesn't get to these. This can be fixed in most cases by including additional language distributions in the datafile construction process where possible; this bug will be updated as new datafiles become available (might be a while).
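To make the failure mode concrete, here is a minimal sketch of exact-match MD5 fingerprinting. It is not BlindElephant's actual code: the `KNOWN` table, the `lookup` helper, the paths, and the file contents are all made up for illustration. It only shows why a translated readme yields nothing while a language-neutral .js file still matches.

```python
import hashlib

def md5_hex(data: bytes) -> str:
    """Hex MD5 digest of a file body."""
    return hashlib.md5(data).hexdigest()

# Hypothetical file contents; a real datafile is built from release packages.
ENGLISH_README = b"<h1>Read Me</h1><p>Run install.php to begin.</p>"
JAPANESE_README = "<h1>お読みください</h1><p>install.php を実行してください。</p>".encode("utf-8")
SHARED_JS = b"function init() { return 1; }"

# Hypothetical fingerprint table built from English packages only:
# path -> {md5 hex digest -> versions that ship that exact file}
KNOWN = {
    "/readme.html": {md5_hex(ENGLISH_README): ["1.2"]},
    "/js/app.js": {md5_hex(SHARED_JS): ["1.2"]},
}

def lookup(path: str, body: bytes) -> list:
    """Return candidate versions, or [] if the hash is unknown."""
    return KNOWN.get(path, {}).get(md5_hex(body), [])

for path, body in [("/readme.html", JAPANESE_README), ("/js/app.js", SHARED_JS)]:
    versions = lookup(path, body)
    if versions:
        print(path, "->", versions)
    else:
        print(path, "-> no match for", md5_hex(body))
```

Because a cryptographic hash changes completely on any difference in the input, the Japanese readme contributes nothing even though most of its markup is identical to the English one. Adding its hash to the table (i.e. building datafiles from more language packages) is the fix described above.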

Additionally, MD5 is very much not the end of the story in using static files to identify applications; it just happened to be the easiest to implement up front. Fuzzy hashing (via something like ssdeep) and machine-identified selected substrings (similar in effect to Amazon's Statistically Improbable Phrases) both have promise, and I'd welcome collaborators to implement either.
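As a rough illustration of the selected-substrings idea (not ssdeep, and not anything implemented in BlindElephant), the sketch below treats each line of a file as a candidate substring and keeps the lines that occur in exactly one version. The function names, the sample corpus, and the "ExampleApp" contents are assumptions made up for this example; a real implementation would need a smarter way to pick substrings than whole lines.

```python
def _lines(text: str) -> set:
    """Non-empty, whitespace-stripped lines of a file body."""
    return {ln.strip() for ln in text.splitlines() if ln.strip()}

def unique_lines(corpus: dict) -> dict:
    """Map each version to the lines that appear in that version only."""
    markers = {}
    for version, text in corpus.items():
        others = set()
        for other, other_text in corpus.items():
            if other != version:
                others |= _lines(other_text)
        markers[version] = _lines(text) - others
    return markers

def guess_versions(body: str, markers: dict) -> list:
    """Return the version(s) whose marker lines occur most often in body."""
    body_lines = _lines(body)
    scores = {v: len(m & body_lines) for v, m in markers.items()}
    best = max(scores.values(), default=0)
    return sorted(v for v, s in scores.items() if s == best and s > 0)

# Hypothetical English readme files for two versions of a made-up app.
CORPUS = {
    "1.1": """<h1>Read Me</h1>
<meta name="generator" content="ExampleApp 1.1">
<p>Run install.php to begin.</p>""",
    "1.2": """<h1>Read Me</h1>
<meta name="generator" content="ExampleApp 1.2">
<p>Run install.php to begin.</p>
<li>upgrade.php</li>""",
}

# Hypothetical Japanese readme for 1.2: prose differs, markup lines do not.
TRANSLATED = """<h1>お読みください</h1>
<meta name="generator" content="ExampleApp 1.2">
<p>install.php を実行してください。</p>
<li>upgrade.php</li>"""

MARKERS = unique_lines(CORPUS)
print(guess_versions(TRANSLATED, MARKERS))
```

The point is that the distinguishing lines here happen to be language-neutral (a generator tag and a file name), so the translated file still matches where a whole-file MD5 would not. Whether real apps carry enough such substrings is exactly what would need testing.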

Please feel free to comment on apps or languages where this is particularly apparent and I'll attempt to prioritize those (no promises though).
