New version much slower?

Forum: Help
Creator: Daniel
Created: 2008-03-19
Updated: 2012-07-19
  • Daniel

    Daniel - 2008-03-19

    Hello;
    Let me commend you on an excellent utility that has saved me loads of time. I have a vast collection of images, somewhere over 100,000 of varying dimensions, that I am comparing with the Average Color Fingerprint. I know there are some duplicates, as the images are sorted by subject matter. I downloaded and started the latest release from a few days back. Seeing that there were database schema changes, I dropped the existing db tables and initiated a rescan. The problem is that handling only the first 7,000 images has taken over 24 hours, whereas the whole job took a few hours to complete on the previous version.
    From what I can tell, there is much more SQL activity in the newer version than in the older one, but the back end doesn't seem to be the bottleneck, as resources are still about 15% idle.

    Thanks!
    Daniel

     
    • Alexander Grasser

      Raegis,

      The difference is due to the fact that version 0.3.0 reads from the database when comparing, whilst version 0.2.x does not and keeps everything in memory. The objective of version 0.3 was to reduce the memory usage of DuMP3. This unfortunately impacted performance, but I'm hoping to implement a faster hash search algorithm in future, which should speed up comparison drastically.

      On the other hand, if you stop DuMP3 at any point during its processing and start it again, the files that were already compared shouldn't be compared again, because the result is stored.

      Alex

       
    • Alexander Grasser

      The new version of DuMP3 would have detected the missing tables and created them, so you made your life much more difficult than it needed to be. But the initial "speed loss" due to database access is recovered later when DuMP3 is restarted: the comparison results are then read from the database and not calculated again. I may have to think about batching the results and writing them in a separate thread.
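
      Roughly what I have in mind for the batching (only a sketch; the SimilarityResult holder and writeBatch() hook are made up for illustration, not the real DuMP3 classes):

      import java.util.ArrayList;
      import java.util.List;
      import java.util.concurrent.BlockingQueue;
      import java.util.concurrent.LinkedBlockingQueue;

      // Comparer threads enqueue results; one background thread drains the
      // queue and writes them to the database in batches.
      abstract class BatchingResultWriter implements Runnable {

          // Placeholder for one comparison result (fingerprint pair + similarity).
          static class SimilarityResult {
              final long fingerprint1, fingerprint2;
              final float similarity;
              SimilarityResult(long f1, long f2, float s) {
                  fingerprint1 = f1; fingerprint2 = f2; similarity = s;
              }
          }

          private final BlockingQueue<SimilarityResult> queue =
                  new LinkedBlockingQueue<SimilarityResult>();
          private final int batchSize;

          BatchingResultWriter(int batchSize) {
              this.batchSize = batchSize;
          }

          // Called from the comparer thread; never blocks on the database.
          void submit(SimilarityResult result) {
              queue.add(result);
          }

          // A subclass would issue one multi-row INSERT or a JDBC batch here.
          protected abstract void writeBatch(List<SimilarityResult> batch);

          public void run() {
              List<SimilarityResult> batch = new ArrayList<SimilarityResult>(batchSize);
              try {
                  while (true) {
                      batch.add(queue.take());             // wait for at least one result
                      queue.drainTo(batch, batchSize - 1); // grab whatever else is queued
                      writeBatch(batch);
                      batch.clear();
                  }
              } catch (InterruptedException e) {
                  Thread.currentThread().interrupt();      // stop when the comparer shuts us down
              }
          }
      }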

       
      • Daniel

        Daniel - 2008-03-20

        Well, what caused me to drop and rebuild the database was the combination of the two following errors that filled the dump3.log file. Looking at the stack traces, they seem to indicate a missing column as well as some duplicates on a primary key. My guess is that it was generating new fingerprints for each file (causing the slowness) and then failing to insert them into the database. These showed up too many times to count.
        13:43:05,872 [Comparer:] WARN net.za.grasser.duplicate.persist.SQLDatabase - Error while finding similarity!
        com.mysql.jdbc.exceptions.MySQLSyntaxErrorException: Unknown column 'p1.fingerprintid' in 'where clause'
        at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:1026)
        at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:956)
        at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3491)
        at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3423)
        at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:1936)
        at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2060)
        at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2542)
        at com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:1734)
        at com.mysql.jdbc.PreparedStatement.executeQuery(PreparedStatement.java:1885)
        at net.za.grasser.duplicate.persist.SQLDatabase.getSimilarity(SQLDatabase.java:1187)
        at net.za.grasser.duplicate.compare.Comparer.compare(Comparer.java:434)
        at net.za.grasser.duplicate.compare.Comparer.run(Comparer.java:583)
        at java.lang.Thread.run(Unknown Source)
        13:43:05,887 [Comparer:] WARN net.za.grasser.duplicate.persist.SQLDatabase - Fingerprint already exists!
        com.mysql.jdbc.exceptions.MySQLIntegrityConstraintViolationException: Duplicate entry '21-1' for key 1
        at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:1011)
        at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:956)
        at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3491)
        at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3423)
        at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:1936)
        at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2060)
        at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2542)
        at com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:1734)
        at com.mysql.jdbc.PreparedStatement.executeUpdate(PreparedStatement.java:2019)
        at com.mysql.jdbc.PreparedStatement.executeUpdate(PreparedStatement.java:1937)
        at com.mysql.jdbc.PreparedStatement.executeUpdate(PreparedStatement.java:1922)
        at net.za.grasser.duplicate.persist.SQLDatabase.insertFingerprint(SQLDatabase.java:1332)
        at net.za.grasser.duplicate.persist.SQLDatabase.put(SQLDatabase.java:1671)
        at net.za.grasser.duplicate.persist.SQLDatabase.putSimilarity(SQLDatabase.java:1739)
        at net.za.grasser.duplicate.compare.Comparer.compare(Comparer.java:518)
        at net.za.grasser.duplicate.compare.Comparer.run(Comparer.java:583)
        at java.lang.Thread.run(Unknown Source)

         
      • Daniel

        Daniel - 2008-03-26

        Alexander;
        I was again experiencing a slowdown during comparisons yesterday, and I began thinking about what you said in this post. Today I looked at the DB tables and what is actually contained within them. It seems that for every file scanned, there is an entry in the duplicate table comparing that file to every other file already scanned. This is causing huge growth in the database. It turns out that the slowdown was not caused by the comparison itself, but rather by the time taken to query the database for results as the duplicate table grew. The formula for determining the number of items in the duplicate table is
        n!/(n-2)!
        where n = the number of files scanned AND compared.
        So, as the first four files are scanned, the table would grow like this:
        1 file:  (none)
        2 files: (a,b) (b,a)
        3 files: (a,b) (a,c) (b,a) (b,c) (c,a) (c,b)
        4 files: (a,b) (a,c) (a,d) (b,a) (b,c) (b,d) (c,a) (c,b) (c,d) (d,a) (d,b) (d,c)
        I abandoned the process when my duplicate table reached 15,019,060 rows at a size of 243.5 MB.

        I'm not sure how many rows should have ended up in the database for the 13,498 files scanned, but the comparisons never would have completed.
        As far as Calc could tell me (it can't go higher than 170!), having only 170 files would result in 28,730 rows in the database - that sounds like it would be manageable for MySQL. (The factorials cancel anyway; see the snippet below.)
        Files: 170
        Numerator (170!): 7.26E+306
        Denominator (168!): 2.53E+302
        Total rows: 28,730
        However, I don't know how well others would fare with a larger number of files.
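
        Since n!/(n-2)! cancels down to n*(n-1), the row count can also be checked without Calc's factorial limit. A quick throwaway snippet (not DuMP3 code, just the arithmetic):

        // Quick check: n!/(n-2)! simplifies to n*(n-1), so no factorials are needed.
        public class RowEstimate {
            public static void main(String[] args) {
                long[] fileCounts = { 170L, 13498L };
                for (long n : fileCounts) {
                    long rows = n * (n - 1); // ordered pairs: (a,b) and (b,a) counted separately
                    System.out.println(n + " files -> " + rows + " rows");
                    // for n = 170 this prints 28730, matching the Calc result above
                }
            }
        }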

        Also, it seems the float value for the similarity field is kept to 13 significant digits; it might only need to be an int (for the average color fingerprint at least) to save space in the DB.

        This analysis seems to point to difficulties in comparing large numbers of files with the new version, but I don't experience that problem in the older version. I will check out the previous version's source tomorrow and see what the big difference is when comparing files that already have a fingerprint. I suspect you will know right away, though :) For what it's worth, it appears to detect duplicates accurately in the older version.

         
        • Alexander Grasser

          Actually, the number of entries in the duplicate table is n(n-1)/2 (search for "handshake problem"), or slightly more, since it saves fingerprint ids and not file ids. Each file can have more than one fingerprint, but if the files are only binary comparable, then it will save one entry.
          I do not store both (a,b) and (b,a); I only store (a,b), and the selects have OR clauses to check for (a,b) or (b,a).
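
          Schematically, the lookup is something like this (table and column names are simplified for illustration and are not the exact DuMP3 schema):

          import java.sql.Connection;
          import java.sql.PreparedStatement;
          import java.sql.ResultSet;
          import java.sql.SQLException;

          // Illustrative only: one row per unordered pair; the SELECT checks both
          // orderings with an OR so the stored (a,b) also answers a (b,a) lookup.
          class SimilarityLookup {
              static Float findSimilarity(Connection con, long fp1, long fp2) throws SQLException {
                  String sql = "SELECT similarity FROM duplicate "
                             + "WHERE (fingerprint1 = ? AND fingerprint2 = ?) "
                             + "   OR (fingerprint1 = ? AND fingerprint2 = ?)";
                  PreparedStatement ps = con.prepareStatement(sql);
                  try {
                      ps.setLong(1, fp1);
                      ps.setLong(2, fp2);
                      ps.setLong(3, fp2);
                      ps.setLong(4, fp1);
                      ResultSet rs = ps.executeQuery();
                      return rs.next() ? Float.valueOf(rs.getFloat(1)) : null; // null = not compared yet
                  } finally {
                      ps.close();
                  }
              }
          }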

          • Operation for Version 0.2.x:
            find files (and save in db)
            generate fingerprints (and save in db)
            compare fingerprints algorithmically
          • On the next run over the same files:
            find files (and retrieve ids from db)
            get fingerprints from db
            compare fingerprints algorithmically

          • Operation for Version 0.3.x:
            find files (and save in db)
            generate fingerprints (and save in db)
            compare fingerprints algorithmically (and save the compare result in db)
          • On the next run over the same files:
            find files (and retrieve ids from db)
            get fingerprints (and ids) from db
            get the compare result from db

          Version 0.2.x is faster than 0.3.x on the first run because it does not need the database accesses to determine the results, but it is slower on subsequent runs because the fingerprints need to be re-compared.
          Version 0.3.x is slower on the first run because there is a select, a comparison and an insert for each fingerprint pair, but it is much faster on subsequent runs over the same files because it only needs to select the result.

          Maybe in the next version I can find a happy medium between these two extremes.

           
          • Daniel

            Daniel - 2008-04-03

            You are right, I see how my math was a bit off. Perhaps making this a parameter in config.xml could let the user decide how the comparisons are done. It seems it would be too difficult to find a middle ground that gives both the fast results gained from the database on smaller sets of comparisons and a manageable amount of data for larger sets. For what it's worth, I am running an Athlon 64 X2 4400 and get about 130 comparisons/second on the older version. Even at 2200 files, that still takes only slightly more than half a minute to complete - so it seems that with newer hardware the stored results may not even be needed.

            If you are interested in making this a config parameter, I'll see if I can work out a patch for the next release.

             
    • Alexander Grasser

      You are correct. I forgot that I modified the fingerprint table to accommodate the new duplicate (results) table.

       
    • Daniel

      Daniel - 2008-03-21

      After comparing the times required to analyze the files, the difference is not that significant. I am going to hop back to the new version and re-analyze the files, unless there is a predictable way to export the last version's data to the new format. Does that exist? I could write some scripts to do that and send them your way to add to a 'contrib' location if you'd like.

      Thanks!
      -Daniel
      P.S.
      I would like to contribute to this project in some way... how can I help? Documentation? Testing? Coding of menial stuff? I'm not so hot with Java anymore, but I can programmatically figure out all sorts of things ;) Have you started on the back side of the Automatic Rules tab?

       
    • Alexander Grasser

      Daniel

      My current thinking actually goes more towards caching the result reads/selects, so that the database does not have to do a select for every fingerprint, and maybe doing a delayed write when the results are saved.
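
      Roughly what I mean by the cache (just a sketch with made-up names, not code that is in CVS):

      import java.util.LinkedHashMap;
      import java.util.Map;

      // Bounded LRU cache keyed by the fingerprint pair, sitting in front of
      // the similarity SELECT so recently seen pairs skip the database.
      class SimilarityCache {

          private final Map<String, Float> cache;

          SimilarityCache(final int maxEntries) {
              // accessOrder = true makes the LinkedHashMap evict the least recently used entry
              this.cache = new LinkedHashMap<String, Float>(maxEntries, 0.75f, true) {
                  protected boolean removeEldestEntry(Map.Entry<String, Float> eldest) {
                      return size() > maxEntries;
                  }
              };
          }

          // Normalise the pair so (a,b) and (b,a) share one cache key.
          private static String key(long fp1, long fp2) {
              return fp1 < fp2 ? fp1 + "-" + fp2 : fp2 + "-" + fp1;
          }

          Float get(long fp1, long fp2) {
              return cache.get(key(fp1, fp2)); // null means not cached: fall back to the SELECT
          }

          void put(long fp1, long fp2, float similarity) {
              cache.put(key(fp1, fp2), Float.valueOf(similarity));
          }
      }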

      But your idea of putting a lower limit on the number of files before the results are saved may actually work as well.

      If we do it, we create an entry in config.xml that specifies the number of files that are compared before DuMP3 starts writing the results:
      "save after" = "5000" would then mean that DuMP3 starts saving results once the total number of files reaches 5000.

      I'll make you a developer in the meantime so that you can get the code from CVS.
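
      To make the "save after" idea concrete, the check could be as simple as something like this (the entry name and default are only illustrative until it actually lands in config.xml):

      // Hypothetical: the threshold would come from a new config.xml entry,
      // e.g. "save after" = "5000" -- the exact name and format are not final.
      class ResultSavePolicy {

          private final int saveAfter; // 0 = always write results to the database
          private int filesCompared;

          ResultSavePolicy(int saveAfter) {
              this.saveAfter = saveAfter;
          }

          void fileCompared() {
              filesCompared++;
          }

          // Results are only written once the run is big enough to benefit
          // from them on a later run; small runs stay purely in memory.
          boolean shouldSaveResults() {
              return filesCompared >= saveAfter;
          }
      }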

       
    • Raegis

      Raegis - 2009-03-10

      I have a huge problem and I think you will understand it once you see this picture:
      http://img21.imageshack.us/img21/5866/vergl.jpg
      Once version 0.3.2 has compared somewhere between 200 and 300 pictures, the speed drops to 0.01 comparisons per hour.

       
  • Rose McCarry

    Rose McCarry - 2009-09-23

    I also know of an interesting problem like this one. I had a lot of duplicate music files on my PC, so the program Clone Remover was recommended to me. It is not only an audio duplicate finder (http://www.fileutilityblog.com/archives/230); it can also delete duplicates and has other features. What do you think about it?

     
