OSS v1.5.4-SNAPSHOT - build 9d2137feea and File Crawling

Help
Tiani
2014-07-02
2014-07-03
  • Tiani
    2014-07-02

    Dear,

    I am testing this build and have the following remarks. OSS is running on a virtual Windows 2008 server with 4 GB of RAM, 2 GB of which is dedicated to the Java VM, and 1 processor (I will definitely upgrade to 4).

    I am crawling a network share (NAS) that holds more than 450 GB and 700,000 files (doc, pdf, xls, etc.). While crawling, memory usage is at 1.7 GB and the processor is at 100%.

    • I ran the file crawl last night; this morning I found the server hanging, with the log below.
    • I then restarted the server.
    • When crawling, this version manages the temp folder far better; with the last build my disk filled up because of the temp folder.
    • So far, 45878 documents were indexed before the hang.

    • I will continue indexing. Is there any best practice for indexing such a huge share?

    Many thanks

    00:08:38,143 WARN: root - Thread aborting (time out): com.jaeksoft.searchlib.crawler.file.process.fileInstances.SmbFileInstance.getSecurity (237)
    00:08:39,596 INFO: root - RELOAD - Hourly - Tue Jul 01 23:00:00 CEST 2014 - Count:4 - Average:843.50006 - Min:843 - Max:844
    00:38:50,384 WARN: root - Thread aborting (time out): com.jaeksoft.searchlib.crawler.file.process.fileInstances.SmbFileInstance.getSecurity (237)
    01:09:03,285 WARN: root - Thread aborting (time out): com.jaeksoft.searchlib.crawler.file.process.fileInstances.SmbFileInstance.getSecurity (237)
    01:09:04,519 INFO: root - RELOAD - Hourly - Wed Jul 02 00:00:00 CEST 2014 - Count:4 - Average:1097.5 - Min:828 - Max:1687
    01:39:17,258 WARN: root - Thread aborting (time out): com.jaeksoft.searchlib.crawler.file.process.fileInstances.SmbFileInstance.getSecurity (237)
    02:09:29,601 WARN: root - Thread aborting (time out): com.jaeksoft.searchlib.crawler.file.process.fileInstances.SmbFileInstance.getSecurity (237)
    02:09:31,054 INFO: root - RELOAD - Hourly - Wed Jul 02 01:00:00 CEST 2014 - Count:4 - Average:835.0 - Min:828 - Max:842
    02:39:44,023 WARN: root - Thread aborting (time out): com.jaeksoft.searchlib.crawler.file.process.fileInstances.SmbFileInstance.getSecurity (237)
    03:09:58,849 WARN: root - Thread aborting (time out): com.jaeksoft.searchlib.crawler.file.process.fileInstances.SmbFileInstance.getSecurity (237)
    03:10:00,692 INFO: root - RELOAD - Hourly - Wed Jul 02 02:00:00 CEST 2014 - Count:4 - Average:847.5 - Min:843 - Max:859
    03:40:15,249 WARN: root - Thread aborting (time out): com.jaeksoft.searchlib.crawler.file.process.fileInstances.SmbFileInstance.getSecurity (237)
    04:10:30,141 WARN: root - Thread aborting (time out): com.jaeksoft.searchlib.crawler.file.process.fileInstances.SmbFileInstance.getSecurity (237)
    04:10:32,170 INFO: root - RELOAD - Hourly - Wed Jul 02 03:00:00 CEST 2014 - Count:4 - Average:995.75006 - Min:827 - Max:1469
    04:40:41,361 WARN: root - Thread aborting (time out): com.jaeksoft.searchlib.crawler.file.process.fileInstances.SmbFileInstance.getFileType (106)
    05:10:56,534 WARN: root - Thread aborting (time out): com.jaeksoft.searchlib.crawler.file.process.fileInstances.SmbFileInstance.getFileType (106)
    05:10:58,377 INFO: root - RELOAD - Hourly - Wed Jul 02 04:00:00 CEST 2014 - Count:4 - Average:987.25 - Min:812 - Max:1467
    05:41:08,182 WARN: root - Thread aborting (time out): com.jaeksoft.searchlib.crawler.file.process.fileInstances.SmbFileInstance.getSecurity (237)
    06:11:21,735 WARN: root - Thread aborting (time out): com.jaeksoft.searchlib.crawler.file.process.fileInstances.SmbFileInstance.getFileSize (223)
    06:11:23,579 INFO: root - RELOAD - Hourly - Wed Jul 02 05:00:00 CEST 2014 - Count:4 - Average:944.75006 - Min:827 - Max:1281
    06:41:32,135 WARN: root - Thread aborting (time out): com.jaeksoft.searchlib.crawler.file.process.fileInstances.SmbFileInstance.getFileSize (223)
    07:11:42,416 WARN: root - Thread aborting (time out): com.jaeksoft.searchlib.crawler.file.process.fileInstances.SmbFileInstance.getSecurity (237)
    07:11:43,650 INFO: root - RELOAD - Hourly - Wed Jul 02 06:00:00 CEST 2014 - Count:4 - Average:941.00006 - Min:812 - Max:1281
    07:41:54,713 WARN: root - Thread aborting (time out): com.jaeksoft.searchlib.crawler.file.process.fileInstances.SmbFileInstance.getSecurity (237)
    08:12:05,523 WARN: root - Thread aborting (time out): com.jaeksoft.searchlib.crawler.file.process.fileInstances.SmbFileInstance.getSecurity (237)
    08:12:06,976 INFO: root - RELOAD - Hourly - Wed Jul 02 07:00:00 CEST 2014 - Count:4 - Average:949.25006 - Min:828 - Max:1281
    08:42:16,496 WARN: root - Thread aborting (time out): com.jaeksoft.searchlib.crawler.file.process.fileInstances.SmbFileInstance.getFileSize (223)
    09:12:30,684 WARN: root - Thread aborting (time out): com.jaeksoft.searchlib.crawler.file.process.fileInstances.SmbFileInstance.getSecurity (237)
    09:12:32,121 INFO: root - RELOAD - Hourly - Wed Jul 02 08:00:00 CEST 2014 - Count:4 - Average:945.0 - Min:812 - Max:1281
    09:42:42,166 WARN: root - Thread aborting (time out): com.jaeksoft.searchlib.crawler.file.process.fileInstances.SmbFileInstance.getFileSize (223)
    10:45:29,086 WARN: root - Thread aborting (time out): com.jaeksoft.searchlib.parser.htmlParser.HtmlCleanerParser.getDocument (67)
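
A log like the one above is easier to read once the timeout warnings are tallied per method. The following is a minimal sketch, assuming the log has been saved to a text file; the regular expression is inferred only from the lines shown in this post, not from OpenSearchServer documentation, and `count_timeouts` is a hypothetical helper name.

```python
import re
from collections import Counter

# Pattern inferred from the "Thread aborting (time out)" lines in this thread.
# Group 1 is the fully qualified method name, group 2 the source line number.
TIMEOUT_RE = re.compile(
    r"WARN: root - Thread aborting \(time out\): (\S+) \((\d+)\)"
)


def count_timeouts(lines):
    """Count timeout warnings per 'Class.method' in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            # Keep only the last two dotted segments, e.g. "SmbFileInstance.getSecurity".
            short_name = ".".join(match.group(1).split(".")[-2:])
            counts[short_name] += 1
    return counts
```

It could be used as `count_timeouts(open("oss.log", encoding="utf-8", errors="replace"))`, where `oss.log` is a placeholder file name. A tally dominated by `SmbFileInstance` methods would suggest looking at the network share rather than at the document parsers.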

     
  • Hi,

    Memory is definitely the main point. We often use OpenSearchServer on huge indexes. The crawling process itself (browsing the file system) is safe even in low-memory conditions.

    However, to extract full-text information, we use external libraries (PDFBox, Apache POI, etc.), which may require a lot of memory. The more memory you provide, the better.

    As a best practice, we often allocate 6 GB of memory to the JVM on a machine with 8 GB of physical memory.

     
  • Tiani
    2014-07-03

    Hi,

    I added memory to my VM: it now has 8 GB.
    I dedicated 6 GB to the JVM.
    I upgraded the CPU count to 4.

    I ran the file crawling again with the "run forever" option and left the "Job to run when each session ends:" field empty.

    It does not index the entire share; it indexed maybe 10%.

    Any idea? There is nothing in the log, and I have full admin rights on the share. I suspected a directory with a "+" in its name, which had not been indexed, but when I add it as a location it works!
    So why does OSS index just a part of my drive?

     
  • Tiani
    2014-07-03

    When indexing this directory, I got the following in the log:
    Error while working on URL: smb://myserver/Projects/B1+E4%20cost-price%20pass-through/Material/data/QREA%2012.3_II.3_labour%20cost%20pass-tru_AB_AD_PP_all%20graphs%20final.xlsx : A part with the name '/xl/drawings/drawing12.xml' already exists : Packages shall not contain equivalent part names and package implementers shall neither create nor recognize packages with equivalent part names. [M1.12]
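
The "equivalent part names" message refers to the Open Packaging Conventions rule that a package (an .xlsx file is a zip archive) must not contain two parts whose names are equivalent, which OPC compares case-insensitively. One possible cause is a workbook whose zip archive really does list the same entry twice. The sketch below is a way to check a copy of the file for that; `duplicate_parts` is a hypothetical helper name, and a clean result would not rule out other causes.

```python
import zipfile
from collections import Counter


def duplicate_parts(path_or_file):
    """Return the part names that occur more than once in a zip-based package."""
    with zipfile.ZipFile(path_or_file) as package:
        # infolist() keeps every central-directory entry, including repeats.
        # Names are lowercased because OPC compares part names case-insensitively.
        names = ["/" + info.filename.lower() for info in package.infolist()]
    return sorted(name for name, count in Counter(names).items() if count > 1)
```

Calling `duplicate_parts("workbook.xlsx")` on a local copy (the file name is a placeholder) returns an empty list when no entry is repeated. If the drawing part from the error message shows up in the result, re-saving the workbook from the spreadsheet application may produce a package the parser accepts.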