New version much slower?

Forum: Help
Creator: Daniel
Created: 2008-03-19
Updated: 2012-07-19
  • Daniel

    Daniel - 2008-03-19

    Hello;
    Let me commend you on an excellent utility that has saved me loads of time. I have a vast collection of images, somewhere over 100,000 of varying dimensions, that I am comparing with the Average Color Fingerprint. I know there are some duplicates, as the images are sorted by subject matter. I downloaded and started the latest release from a few days back. Seeing that there were database schema changes, I dropped the existing db tables and initiated a rescan. The problem is that handling only the first 7,000 images has taken over 24 hours, whereas the whole job took a few hours to complete on the previous version.
    From what I can tell, there is much more SQL activity in the newer version than in the older one, but the back end doesn't seem to be the bottleneck, as resources are still about 15% idle.

    Thanks!
    Daniel

     
    • Alexander Grasser

      Raegis,

      The difference is due to the fact that version 0.3.0 reads from the database when comparing, whilst version 0.2.x does not and keeps everything in memory. The objective of version 0.3 was to reduce the memory usage of DuMP3. This unfortunately impacted performance, but I'm hoping to implement a faster hash search algorithm in future, which should speed up comparison drastically.

      On the other hand, if you stop DuMP3 at any point during its processing and start it again, the files that were already compared shouldn't be compared again, because the result is stored.

      Alex

       
    • Alexander Grasser

      The new version of DuMP3 would have detected the missing tables and created them, so you made your life much more difficult than it needed to be. But the initial "speed loss" due to database access is recovered later when DuMP3 is restarted: the comparison results are then read from the database and not calculated again. I may have to think about batching the results and writing them in a separate thread.
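
      Roughly what I have in mind for the batching (only a sketch; the SimilarityResult holder and writeBatch() hook are made up for illustration, not the real DuMP3 classes):

      import java.util.ArrayList;
      import java.util.List;
      import java.util.concurrent.BlockingQueue;
      import java.util.concurrent.LinkedBlockingQueue;

      // Comparer threads enqueue results; one background thread drains the
      // queue and writes them to the database in batches.
      abstract class BatchingResultWriter implements Runnable {

          // Placeholder for one comparison result (fingerprint pair + similarity).
          static class SimilarityResult {
              final long fingerprint1, fingerprint2;
              final float similarity;
              SimilarityResult(long f1, long f2, float s) {
                  fingerprint1 = f1; fingerprint2 = f2; similarity = s;
              }
          }

          private final BlockingQueue<SimilarityResult> queue =
                  new LinkedBlockingQueue<SimilarityResult>();
          private final int batchSize;

          BatchingResultWriter(int batchSize) {
              this.batchSize = batchSize;
          }

          // Called from the comparer thread; never blocks on the database.
          void submit(SimilarityResult result) {
              queue.add(result);
          }

          // A subclass would issue one multi-row INSERT or a JDBC batch here.
          protected abstract void writeBatch(List<SimilarityResult> batch);

          public void run() {
              List<SimilarityResult> batch = new ArrayList<SimilarityResult>(batchSize);
              try {
                  while (true) {
                      batch.add(queue.take());             // wait for at least one result
                      queue.drainTo(batch, batchSize - 1); // grab whatever else is queued
                      writeBatch(batch);
                      batch.clear();
                  }
              } catch (InterruptedException e) {
                  Thread.currentThread().interrupt();      // stop when the comparer shuts us down
              }
          }
      }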

       
      • Daniel

        Daniel - 2008-03-20

        Well, what caused me to drop and rebuild the database was the combination of the two following errors that filled the dump3.log file. Looking at the stack traces, they seem to indicate a missing column as well as some duplicates on a primary key. My guess is that it was generating new fingerprints for each file (causing the slowness) and then failing to insert them into the database. These showed up too many times to count.
        13:43:05,872 [Comparer:] WARN net.za.grasser.duplicate.persist.SQLDatabase - Error while finding similarity!
        com.mysql.jdbc.exceptions.MySQLSyntaxErrorException: Unknown column 'p1.fingerprintid' in 'where clause'
        at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:1026)
        at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:956)
        at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3491)
        at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3423)
        at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:1936)
        at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2060)
        at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2542)
        at com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:1734)
        at com.mysql.jdbc.PreparedStatement.executeQuery(PreparedStatement.java:1885)
        at net.za.grasser.duplicate.persist.SQLDatabase.getSimilarity(SQLDatabase.java:1187)
        at net.za.grasser.duplicate.compare.Comparer.compare(Comparer.java:434)
        at net.za.grasser.duplicate.compare.Comparer.run(Comparer.java:583)
        at java.lang.Thread.run(Unknown Source)
        13:43:05,887 [Comparer:] WARN net.za.grasser.duplicate.persist.SQLDatabase - Fingerprint already exists!
        com.mysql.jdbc.exceptions.MySQLIntegrityConstraintViolationException: Duplicate entry '21-1' for key 1
        at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:1011)
        at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:956)
        at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3491)
        at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3423)
        at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:1936)
        at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2060)
        at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2542)
        at com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:1734)
        at com.mysql.jdbc.PreparedStatement.executeUpdate(PreparedStatement.java:2019)
        at com.mysql.jdbc.PreparedStatement.executeUpdate(PreparedStatement.java:1937)
        at com.mysql.jdbc.PreparedStatement.executeUpdate(PreparedStatement.java:1922)
        at net.za.grasser.duplicate.persist.SQLDatabase.insertFingerprint(SQLDatabase.java:1332)
        at net.za.grasser.duplicate.persist.SQLDatabase.put(SQLDatabase.java:1671)
        at net.za.grasser.duplicate.persist.SQLDatabase.putSimilarity(SQLDatabase.java:1739)
        at net.za.grasser.duplicate.compare.Comparer.compare(Comparer.java:518)
        at net.za.grasser.duplicate.compare.Comparer.run(Comparer.java:583)
        at java.lang.Thread.run(Unknown Source)

         
      • Daniel

        Daniel - 2008-03-26

        Alexander;
        I was again experiencing a slowdown during comparisons yesterday, and I began thinking about what you said in this post. Today I looked at the DB tables and what is actually contained within them. It seems that for every file scanned, there is an entry in the duplicate table comparing that file to every other file already scanned. This is causing huge growth in the database. It turns out that the slowdown was not caused by the comparison itself, but rather by the time taken to query the database for results as the duplicate table grew. The formula for determining the number of items in the duplicate table is
        n!/(n-2)!
        where n = the number of files scanned AND compared.
        So, as the first four files are scanned, the table would grow like this:
        1 file:  (none)
        2 files: (a,b) (b,a)
        3 files: (a,b) (a,c) (b,a) (b,c) (c,a) (c,b)
        4 files: (a,b) (a,c) (a,d) (b,a) (b,c) (b,d) (c,a) (c,b) (c,d) (d,a) (d,b) (d,c)
        I abandoned the process when my duplicate table reached 15,019,060 rows at a size of 243.5 MB.

        I'm not sure how many rows should have ended up in the database for the 13,498 files scanned, but the comparisons never would have completed.
        As far as Calc could tell me (it can't go higher than 170!), having only 170 files would result in 28,730 rows in the database - that sounds like it would be manageable for MySQL. (The factorials cancel anyway; see the snippet below.)
        Files: 170
        Numerator (170!): 7.26E+306
        Denominator (168!): 2.53E+302
        Total rows: 28,730
        However, I don't know how well others would fare with a larger number of files.
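
        Since n!/(n-2)! cancels down to n*(n-1), the row count can also be checked without Calc's factorial limit. A quick throwaway snippet (not DuMP3 code, just the arithmetic):

        // Quick check: n!/(n-2)! simplifies to n*(n-1), so no factorials are needed.
        public class RowEstimate {
            public static void main(String[] args) {
                long[] fileCounts = { 170L, 13498L };
                for (long n : fileCounts) {
                    long rows = n * (n - 1); // ordered pairs: (a,b) and (b,a) counted separately
                    System.out.println(n + " files -> " + rows + " rows");
                    // for n = 170 this prints 28730, matching the Calc result above
                }
            }
        }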

        Also, it seems the float value for the similarity field is kept to 13 significant digits; it might only need to be an int (for the average color fingerprint at least) to save space in the DB.

        This analysis seems to point to difficulties in comparing large numbers of files with the new version, but I don't experience that problem in the older version. I will check out the previous version's source tomorrow and see what the big difference is when comparing files that already have a fingerprint. I suspect you will know right away, though :) For what it's worth, it appears to detect duplicates accurately in the older version.

         
        • Alexander Grasser

          Actually, the number of entries in the duplicate table is n(n-1)/2 (search for "handshake problem"), or slightly more, since it saves fingerprint ids and not file ids. Each file can have more than one fingerprint, but if the files are only binary comparable, then it will save one entry.
          I do not store both (a,b) and (b,a); I only store (a,b), and the selects have OR clauses to check for (a,b) or (b,a).
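
          Schematically, the lookup is something like this (table and column names are simplified for illustration and are not the exact DuMP3 schema):

          import java.sql.Connection;
          import java.sql.PreparedStatement;
          import java.sql.ResultSet;
          import java.sql.SQLException;

          // Illustrative only: one row per unordered pair; the SELECT checks both
          // orderings with an OR so the stored (a,b) also answers a (b,a) lookup.
          class SimilarityLookup {
              static Float findSimilarity(Connection con, long fp1, long fp2) throws SQLException {
                  String sql = "SELECT similarity FROM duplicate "
                             + "WHERE (fingerprint1 = ? AND fingerprint2 = ?) "
                             + "   OR (fingerprint1 = ? AND fingerprint2 = ?)";
                  PreparedStatement ps = con.prepareStatement(sql);
                  try {
                      ps.setLong(1, fp1);
                      ps.setLong(2, fp2);
                      ps.setLong(3, fp2);
                      ps.setLong(4, fp1);
                      ResultSet rs = ps.executeQuery();
                      return rs.next() ? Float.valueOf(rs.getFloat(1)) : null; // null = not compared yet
                  } finally {
                      ps.close();
                  }
              }
          }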

          • Operation for Version 0.2.x:
            find files (and save in db)
            generate fingerprints (and save in db)
            compare fingerprints algorithmically
          • On the next run over the same files:
            find files (and retrieve ids from db)
            get fingerprints from db
            compare fingerprints algorithmically

          • Operation for Version 0.3.x:
            find files (and save in db)
            generate fingerprints (and save in db)
            compare fingerprints algorithmically (and save the compare result in db)
          • On the next run over the same files:
            find files (and retrieve ids from db)
            get fingerprints (and ids) from db
            get the compare result from db

          Version 0.2.x is faster than 0.3.x on the first run because it does not need the database accesses to determine the results, but it is slower on subsequent runs because the fingerprints need to be re-compared.
          Version 0.3.x is slower on the first run because there is a select, a comparison and an insert for each fingerprint pair, but it is much faster on subsequent runs over the same files because it only needs to select the result.

          Maybe in the next version I can find a happy medium between these two extremes.

           
          • Daniel

            Daniel - 2008-04-03

            You are right, I see how my math was a bit off. Perhaps making this a parameter in config.xml could let the user decide how the comparisons are done. It seems it would be too difficult to find a middle ground that gives both the fast results gained from the database on smaller sets of comparisons and a manageable amount of data for larger sets. For what it's worth, I am running an Athlon 64 X2 4400 and get about 130 comparisons/second on the older version. Even at 2200 files, that still takes only slightly more than half a minute to complete - so it seems that with newer hardware the stored results may not even be needed.

            If you are interested in making this a config parameter, I'll see if I can work out a patch for the next release.

             
    • Alexander Grasser

      You are correct. I forgot that I modified the fingerprint table to accommodate the new duplicate (results) table.

       
    • Daniel

      Daniel - 2008-03-21

      After comparing the times required to analyze the files, the difference is not that significant. I am going to hop back to the new version and re-analyze the files, unless there is a predictable way to export the last version's data to the new format. Does that exist? I could write some scripts to do that and send them your way to add to a 'contrib' location if you'd like.

      Thanks!
      -Daniel
      P.S.
      I would like to contribute to this project in some way... how can I help? Documentation? Testing? Coding of menial stuff? I'm not so hot with Java anymore, but I can programmatically figure out all sorts of things ;) Have you started on the back side of the Automatic Rules tab?

       
    • Alexander Grasser

      Daniel

      My current thinking actually goes more towards caching the result reads/selects, so that the database does not have to do a select for every fingerprint, and maybe doing a delayed write when the results are saved.
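
      Roughly what I mean by the cache (just a sketch with made-up names, not code that is in CVS):

      import java.util.LinkedHashMap;
      import java.util.Map;

      // Bounded LRU cache keyed by the fingerprint pair, sitting in front of
      // the similarity SELECT so recently seen pairs skip the database.
      class SimilarityCache {

          private final Map<String, Float> cache;

          SimilarityCache(final int maxEntries) {
              // accessOrder = true makes the LinkedHashMap evict the least recently used entry
              this.cache = new LinkedHashMap<String, Float>(maxEntries, 0.75f, true) {
                  protected boolean removeEldestEntry(Map.Entry<String, Float> eldest) {
                      return size() > maxEntries;
                  }
              };
          }

          // Normalise the pair so (a,b) and (b,a) share one cache key.
          private static String key(long fp1, long fp2) {
              return fp1 < fp2 ? fp1 + "-" + fp2 : fp2 + "-" + fp1;
          }

          Float get(long fp1, long fp2) {
              return cache.get(key(fp1, fp2)); // null means not cached: fall back to the SELECT
          }

          void put(long fp1, long fp2, float similarity) {
              cache.put(key(fp1, fp2), Float.valueOf(similarity));
          }
      }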

      But your idea of putting a lower limit on the number of files before the results are saved may actually work as well.

      If we do it, we create an entry in config.xml that specifies the number of files that are compared before DuMP3 starts writing the results:
      "save after" = "5000" would then mean that DuMP3 starts saving results once the total number of files reaches 5000.

      I'll make you a developer in the meantime so that you can get the code from CVS.
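
      To make the "save after" idea concrete, the check could be as simple as something like this (the entry name and default are only illustrative until it actually lands in config.xml):

      // Hypothetical: the threshold would come from a new config.xml entry,
      // e.g. "save after" = "5000" -- the exact name and format are not final.
      class ResultSavePolicy {

          private final int saveAfter; // 0 = always write results to the database
          private int filesCompared;

          ResultSavePolicy(int saveAfter) {
              this.saveAfter = saveAfter;
          }

          void fileCompared() {
              filesCompared++;
          }

          // Results are only written once the run is big enough to benefit
          // from them on a later run; small runs stay purely in memory.
          boolean shouldSaveResults() {
              return filesCompared >= saveAfter;
          }
      }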

       
    • Raegis

      Raegis - 2009-03-10

      I have a huge problem and I think you will understand it once you see this picture:
      http://img21.imageshack.us/img21/5866/vergl.jpg
      Once version 0.3.2 has compared somewhere between 200 and 300 pictures, the speed drops to 0.01 comparisons per hour.

       
  • Rose McCarry

    Rose McCarry - 2009-09-23

    I also know of an interesting problem like this one. I had a lot of duplicate music files on my PC, so the program Clone Remover was recommended to me. It is not only an audio duplicate finder (http://www.fileutilityblog.com/archives/230); it can also delete duplicates and has other features. What do you think about it?

     
