ant load-documentcounts never finishes

  • Paul Jungwirth

    Paul Jungwirth - 2012-06-27


    I'm trying to install the latest version of Perseus on Xubuntu 12.04 with MySQL 5.5.24 (64-bit). So far everything has gone smoothly, but the load freezes partway through this command:

    ant load-documentcounts -Doptions=-lemmas

    I know some of the jobs take a long time, but according to the documentation this one is supposed to finish in 4 minutes. The other jobs have been finishing in roughly 1/4 of their estimated time. This one was still running after I let it go for ~16 hours.

    Here are the last few lines of output:
          07:43:29,579  INFO UpdateTimestampsCache:41 - starting update timestamps cache at region: org.hibernate.cache.UpdateTimestampsCache
          07:43:29,580  WARN EhCacheProvider:86 - Could not find configuration ; using defaults.
          07:43:29,580  INFO StandardQueryCache:52 - starting query cache at region: org.hibernate.cache.StandardQueryCache
          07:43:29,580  WARN EhCacheProvider:86 - Could not find configuration ; using defaults.
          07:43:29,636  INFO DocumentCountLoader:53 - Processing language: Greek
          07:43:29,649  INFO SQLHandler:109 - Using offline data source
          07:43:29,668  INFO SQLHandler:112 - driver = com.mysql.jdbc.Driver
          07:43:29,668  INFO SQLHandler:114 - dbURL = jdbc:mysql://localhost:3306/sor
          07:43:29,668  INFO SQLHandler:116 - login = perseus
          07:43:29,669  INFO SQLHandler:118 - password = ********
          07:43:29,732  INFO DocumentCountLoader:59 - Counted 399 documents

    When I try typing `show processlist` in MySQL as it runs, I can see that it executes this query, which appears to complete successfully, but after that it does nothing:

         Id: 114
       User: perseus
       Host: localhost:59178
         db: sor
    Command: Execute
       Time: 36
      State: Sending data
       Info: select entitydocu0_.entity_id as col_0_0_, count(distinct entitydocu0_.document_id) as col_1_0_, sum

    Once this query is done, I can see that none of the connections are busy:

    mysql> show processlist;
    | Id  | User    | Host            | db   | Command | Time | State | Info             |
    | 141 | root    | localhost       | NULL | Query   |    0 | NULL  | show processlist |
    | 142 | perseus | localhost:59796 | sor  | Sleep   |   94 |       | NULL             |
    | 143 | perseus | localhost:59795 | sor  | Sleep   |   94 |       | NULL             |
    | 144 | perseus | localhost:59797 | sor  | Sleep   |   94 |       | NULL             |
    | 145 | perseus | localhost:59798 | sor  | Sleep   |   93 |       | NULL             |
    | 146 | perseus | localhost:59799 | sor  | Sleep   |   94 |       | NULL             |
    | 148 | perseus | localhost:59801 | sor  | Sleep   |   93 |       | NULL             |
    7 rows in set (0.00 sec)

    Any idea what is going on? How can I troubleshoot this problem?


  • Paul Jungwirth

    Paul Jungwirth - 2012-06-27

    Just some more information here:

    The line that never returns is in perseus/ie/freq/dao/

        return getSession().createQuery(
            "select f.entity, count(distinct f.documentID), " +
            "sum(f.maxFrequency), sum(f.minFrequency) " +
            "from EntityDocumentFrequency f where " +
            "f.documentID is not null and " +
            "f.entity.class = 'Lemma' and f.entity.language = :lang " +
            "group by f.entity")
            .setParameter("lang", language)

    If I split that out, it appears it's `scroll` that is hanging. If I call getQueryString after setParameter, I get this:

    select f.entity, count(distinct f.documentID), sum(f.maxFrequency), sum(f.minFrequency) from EntityDocumentFrequency f where f.documentID is not null and f.entity.class = 'Lemma' and f.entity.language = :lang group by f.entity
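    For anyone who wants to repeat the "split it out" step, here is a minimal sketch of the idea using only the Java standard library: run each call in the chain as a separate step with a timeout and report the first one that does not return. The class `HangFinder`, the method `firstHangingStep`, and the stand-in steps in `main` are hypothetical names for illustration and are not part of the Perseus code; in a real test the lambdas would wrap the actual Hibernate calls. It assumes Java 8 or later for the lambda syntax.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class HangFinder {

    /**
     * Runs the named steps in insertion order on a worker thread.
     * Returns the name of the first step that exceeds the timeout,
     * or null if every step returns in time.
     */
    static String firstHangingStep(Map<String, Callable<?>> steps, long timeoutMillis)
            throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            for (Map.Entry<String, Callable<?>> step : steps.entrySet()) {
                Future<?> result = pool.submit(step.getValue());
                try {
                    result.get(timeoutMillis, TimeUnit.MILLISECONDS);
                } catch (TimeoutException e) {
                    result.cancel(true); // interrupt the stuck step
                    return step.getKey();
                }
            }
            return null;
        } finally {
            pool.shutdownNow(); // make sure the worker thread does not keep the JVM alive
        }
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical stand-ins for the real calls; the sleep simulates a call that hangs.
        Map<String, Callable<?>> steps = new LinkedHashMap<>();
        steps.put("createQuery", () -> "query");
        steps.put("setParameter", () -> "query with :lang bound");
        steps.put("scroll", () -> { Thread.sleep(60_000); return null; });

        System.out.println("First step that did not return: "
                + firstHangingStep(steps, 500));
    }
}
```

    With the stand-in steps above, this reports `scroll` as the first step that did not return. A timeout only shows which call is stuck, not why, so it is a complement to the SQL logging and not a replacement for it.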

    By turning on SQL logging, I can also see the full query being run:

    [java] 08:25:14,743 DEBUG SQL:393 - select entitydocu0_.entity_id as col_0_0_, count(distinct entitydocu0_.document_id) as col_1_0_, sum(entitydocu0_.max_freq) as col_2_0_, sum(entitydocu0_.min_freq) as col_3_0_, as id5_, entity1_.auth_name as auth3_5_, entity1_.display_name as display4_5_, entity1_.sort_string as sort5_5_, entity1_.max_occ as max6_5_, entity1_.min_occ as min7_5_, entity1_.doc_count as doc8_5_, entity1_.idf as idf5_, entity1_1_.place_longitude as place2_6_, entity1_1_.place_latitude as place3_6_, entity1_1_.place_site_name as place4_6_, entity1_1_.place_state as place5_6_, entity1_1_.place_nation as place6_6_, entity1_2_.date_year as date2_8_, entity1_2_.date_month as date3_8_, entity1_2_.date_day as date4_8_, entity1_2_.date_hour as date5_8_, entity1_2_.date_minute as date6_8_, entity1_2_.date_second as date7_8_, entity1_2_.date_sec_fraction as date8_8_, entity1_3_.date_range_start_date_id as date2_9_, entity1_3_.date_range_end_date_id as date3_9_, entity1_4_.lemma_text as lemma2_10_, entity1_4_.bare_headword as bare3_10_, entity1_4_.lemma_sequence_number as lemma4_10_, entity1_4_.lemma_lang_id as lemma5_10_, entity1_4_.lemma_short_def as lemma6_10_, as name18_, entity1_5_.type as type18_, entity1_5_.location as location18_, entity1_5_.summary as summary18_, entity1_5_.perseus_version as perseus6_18_, entity1_5_.entered_by as entered7_18_, entity1_5_.sources_used as sources8_18_, entity1_5_.other_bibliography as other9_18_, entity1_5_.documentary_references as documen10_18_, entity1_6_.accession_number as accession2_19_, entity1_6_.dimensions as dimensions19_, entity1_6_.region as region19_, entity1_6_.start_date as start5_19_, entity1_6_.start_mod as start6_19_, entity1_6_.end_date as end7_19_, entity1_6_.end_mod as end8_19_, entity1_6_.unitary_date as unitary9_19_, entity1_6_.unitary_mod as unitary10_19_, entity1_6_.date_for_sort as date11_19_, entity1_6_.period as period19_, entity1_6_.period_for_sort as period13_19_, entity1_6_.culture as 
culture19_, entity1_6_.context as context19_, entity1_6_.context_mod as context16_19_, entity1_6_.findspot as findspot19_, entity1_6_.findspot_mod as findspot18_19_, entity1_6_.collection as collection19_, entity1_6_.date_description as date20_19_, entity1_6_.collection_history as collection21_19_, entity1_6_.donor as donor19_, entity1_6_.condit as condit19_, entity1_6_.condition_description as condition24_19_, entity1_6_.comparanda as comparanda19_, entity1_6_.material as material19_, entity1_6_.material_description as material27_19_, entity1_6_.other_notes as other28_19_, entity1_7_.actual_weight as actual2_20_, entity1_7_.commentary as commentary20_, entity1_7_.denomination as denomina4_20_, entity1_7_.die_axis as die5_20_, entity1_7_.issuing_authority as issuing6_20_, entity1_7_.obverse_legend as obverse7_20_, entity1_7_.obverse_type as obverse8_20_, entity1_7_.reverse_legend as reverse9_20_, entity1_7_.reverse_type as reverse10_20_, entity1_8_.category as category21_, entity1_8_.object_function as object3_21_, entity1_8_.graffiti as graffiti21_, entity1_8_.inscription as inscript5_21_, entity1_8_.inscription_bibliography as inscript6_21_, entity1_8_.original as original21_, entity1_8_.original_or_copy as original8_21_, entity1_8_.placement as placement21_, entity1_8_.primary_citation as primary10_21_, entity1_8_.scale as scale21_, entity1_8_.scale_for_sort as scale12_21_, entity1_8_.sculptor as sculptor21_, entity1_8_.sculptor_mod as sculptor14_21_, as style21_, entity1_8_.form_style_description as form16_21_, entity1_8_.subject_description as subject17_21_, entity1_8_.technique as technique21_, entity1_8_.technique_description as technique19_21_, entity1_8_.title as title21_, entity1_8_.sculpture_type as sculpture21_21_, entity1_8_.in_group as in22_21_, entity1_8_.in_whole as in23_21_, entity1_9_.assoc_building as assoc2_22_, entity1_9_.category as category22_, entity1_9_.object_function as object4_22_, entity1_9_.graffiti as graffiti22_, 
entity1_9_.inscription as inscript6_22_, entity1_9_.inscription_bibliography as inscript7_22_, entity1_9_.original as original22_, entity1_9_.original_or_copy as original9_22_, entity1_9_.placement as placement22_, entity1_9_.primary_citation as primary11_22_, entity1_9_.scale as scale22_, entity1_9_.scale_for_sort as scale13_22_, entity1_9_.sculptor as sculptor22_, entity1_9_.sculptor_mod as sculptor15_22_, as style22_, entity1_9_.form_style_description as form17_22_, entity1_9_.subject_description as subject18_22_, entity1_9_.technique as technique22_, entity1_9_.technique_description as technique20_22_, entity1_9_.title as title22_, entity1_9_.sculpture_type as sculpture22_22_, entity1_9_.in_group as in23_22_, entity1_9_.in_whole as in24_22_, entity1_10_.ceramic_phase as ceramic2_23_, entity1_10_.decoration_description as decoration3_23_, entity1_10_.essay_number as essay4_23_, entity1_10_.essay_text as essay5_23_, entity1_10_.graffiti as graffiti23_, entity1_10_.inscriptions as inscript7_23_, entity1_10_.painter as painter23_, entity1_10_.painter_mod as painter9_23_, entity1_10_.attributed_by as attributed10_23_, entity1_10_.potter as potter23_, entity1_10_.potter_mod as potter12_23_, entity1_10_.primary_citation as primary13_23_, entity1_10_.beazley_number as beazley14_23_, entity1_10_.relief as relief23_, entity1_10_.shape as shape23_, entity1_10_.shape_description as shape17_23_, entity1_10_.ware as ware23_, entity1_11_.architectural_order as architec2_24_, entity1_11_.architect as architect24_, entity1_11_.architect_evidence as architect4_24_, entity1_11_.building_type as building5_24_, entity1_11_.history as history24_, entity1_11_.plan as plan24_, entity1_11_.see_also as see8_24_, entity1_12_.extent as extent25_, entity1_12_.human_name as human3_25_, entity1_12_.region as region25_, entity1_12_.site_type as site5_25_, entity1_12_.description as descript6_25_, entity1_12_.exploration as explorat7_25_, entity1_12_.periods as periods25_, entity1_12_.physical 
as physical25_, entity1_.entity_type as entity2_5_ from hib_frequencies entitydocu0_ inner join hib_entities entity1_ on left outer join hib_places entity1_1_ on left outer join hib_dates entity1_2_ on left outer join hib_date_ranges entity1_3_ on left outer join hib_lemmas entity1_4_ on left outer join hib_artifacts entity1_5_ on left outer join hib_atomic_artifacts entity1_6_ on left outer join hib_coin_artifacts entity1_7_ on left outer join hib_gem_artifacts entity1_8_ on left outer join hib_sculpture_artifacts entity1_9_ on left outer join hib_vase_artifacts entity1_10_ on left outer join hib_building_artifacts entity1_11_ on left outer join hib_site_artifacts entity1_12_ on where entitydocu0_.type='E' and (entitydocu0_.document_id is not null) and entity1_.entity_type='Lemma' and entity1_4_.lemma_lang_id=? group by entitydocu0_.entity_id

    If I run that by hand, passing in 1 for lemma_lang_id (for Greek), I get zero results. Any idea why the `scroll` call would hang on a query that gives zero results?

    Also, any idea why there are zero results? All the other ant tasks up to now have completed successfully.


  • Bridget Almas

    Bridget Almas - 2012-07-01

    Hi, this is a particularly problematic area of the Perseus database load process. It consumes a huge amount of temp space, and it might be hanging for you if your temp space has filled up.

    Have you tried installing the database snapshot, as described in the INSTALLWITHDATA.html file of the distribution? That is quite a bit easier and much less time-consuming, and will give you a complete snapshot of the open source database.

    If you still want to proceed with installing from scratch, there have been some improvements to this part of the code recently which aren't available yet on SourceForge. Let me know if you are still interested in trying it and I'll send the updated code to you.

  • Paul Jungwirth

    Paul Jungwirth - 2012-07-13

    Hi, thanks for your reply! It doesn't seem to be a temp space issue. I'm not running /tmp on a separate partition, and the disk has plenty of space. Also, the command doesn't seem to use much space, even after running for a while:

    $ sudo du -sh /tmp/* | sort-by-filesize 
    0   /tmp/
    4.0K    /tmp/keyring-YClWMm
    4.0K    /tmp/plugtmp
    4.0K    /tmp/pulse-PKdhtXMmr18n
    4.0K    /tmp/ssh-maYuHDIT1937
    4.0K    /tmp/vDIerKB
    4.0K    /tmp/voH4MFb
    8.0K    /tmp/pulse-2L9K88eMlGn7
    12K /tmp/CRX_75DAF8CB7768
    100K    /tmp/hsperfdata_paul
    1.5M    /tmp/Oregon_Proposed_Wilderness.pdf
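    Since `du` shows what is used and not what is free, the free space can also be checked from the Java side. This is a minimal sketch, not part of Perseus: the class `TempSpaceCheck` and method `usableBytes` are hypothetical names, and it assumes MySQL's temporary directory is /tmp, which can be confirmed in MySQL with `show variables like 'tmpdir'`.

```java
import java.io.File;

public class TempSpaceCheck {

    /**
     * Usable bytes on the partition that holds the given directory.
     * File.getUsableSpace() returns 0 if the path does not name a partition.
     */
    static long usableBytes(String dir) {
        return new File(dir).getUsableSpace();
    }

    public static void main(String[] args) {
        // Assumption: MySQL writes its temporary files to /tmp.
        String[] dirs = { System.getProperty("java.io.tmpdir"), "/tmp" };
        for (String dir : dirs) {
            System.out.printf("%s: %d MiB usable%n", dir, usableBytes(dir) / (1024 * 1024));
        }
    }
}
```

    Running this while the load is in progress, and again after it has been stuck for a while, would show whether free space on that partition is actually shrinking.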

    I'd be delighted to see your improved code. Is there an svn repository I can pull it from?


