file crawler is not crawling all subdirectories in path

Help
Stefan
2013-09-23
2013-10-24
  • Stefan
    2013-09-23

    Hi All,

    I have got another problem while learning to work with the OSS tool.

    I wanted to try out how big an index of my file system would get and how much time its creation takes.
    So I configured the file crawler to look at location "\hobby\X/".
    It started to crawl and finished without an error message.
    While studying the result I noticed that it had simply skipped some directories.
    There are files like .pdf and .txt in them, but it just skips some subdirectories.

    What is my mistake?

    Thank you for your time!

    Best Regards

    Stefan

    Infos:
    Windows version,
    1.4 stable rev2274 build 240,
    index created with the file crawler preset

    Last edit: Stefan 2013-09-23
    Attachments
  • Stefan
    2013-10-08

    Hi again,

    this problem is still not solved for me.

    So, again, some information about what I tried.

    I want to use OSS to index a big file system mounted via the SMB protocol.

    1)
    Working with the Windows version of OSS:
    I start OSS, create a new index (file crawler preset) and let it crawl my directory.
    The crawl finished at around 8,000 files, although the file system holds around 500k files. It also skipped most of the subdirectories, which is (I guess) why it didn't find all the files.
    That was with all parsers on, default preset.
    Then I tried it with all parsers off, with just the file system parser remaining in the parser list. Same result as above.

    2)
    The next step was to use a Linux machine for OSS. Unfortunately I only have a small Linux server, so I am limited to 1 GB of RAM.
    2.1)
    default file crawler preset -> same result as on Windows.
    2.2)
    Turning off all parsers except the file system parser gives me around 500k files and all my subdirectories!!
    2.3)
    So it seems to have something to do with the parsers, so I am running a long-term analysis.
    My results so far are surprising but not helpful for me; maybe the reason is in there, but I just cannot find it. It also seems like the index never grows beyond 1.2 GB in my tries, although there is definitely more free disk space available.

    Maybe you have a hint for me when looking at my data?
    The parsers for each run are listed in the last column. The '(>X)' indicates that the 'fail over to X' option is set for that parser.

    Thank you for your help!

    Best Regards

    Stefan

    (On both Windows and Linux I am using the latest stable release.)

    Attachments
  • Stefan
    2013-10-10

    No suggestions?..

    I guess I am not able to find a solution here on my own.

  • Naveen A.N
    2013-10-11

    Hello Stefan,

    Sorry for the delay in response.

    It would be helpful for us to have the two log files, oss.log and catalina.log.

    Could you also please check in the "URL Browser" whether the "Parsing" column contains any "ParserError"?

    A screenshot of the "URL Browser" would also be helpful.

    Naveen.A.N

  • Stefan
    2013-10-11

    Hi Naveen,

    no problem.

    There is just one "Parser Error", in test run LTT08-mm.

    The two log files are attached.

    Last edit: Stefan 2013-10-11
    Attachments
  • Stefan
    2013-10-11

    Here is the second file.

    Attachments
  • Stefan
    2013-10-17

    Hi again,

    some more analysis of the problem.
    Up to now the situation was as follows:
    On Linux the runs are parser-dependent: some parsers can reach all files, while others lead to a stop of the whole crawling process.

    On Windows all parsers result in the same situation, with around 8,700 files parsed.

    Now I have been trying to debug the software with Eclipse and my own log output.
    I now know that on Windows there is a problem with directories that I do not have access to.
    In CrawlFileThread.java the lines

        if (!checkDirectory((ItemDirectoryIterator) itemIterator,
                crawlQueue)) {

    result in a hard break of the crawler.

    Commenting that out leads to runs with far more files on Windows; Linux does not care about this change.
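
    For illustration, here is a minimal standalone sketch (my own code, not the OSS source) of the behaviour I would expect instead: java.io.File.listFiles() returns null when a directory cannot be read, and a crawler that treats that as fatal breaks out of the whole walk, while it could simply skip that one directory and continue:

        import java.io.File;

        public class SkipUnreadableDirs {

            // Recursively walk a tree, skipping unreadable directories
            // instead of aborting the whole crawl.
            static void crawl(File dir) {
                File[] entries = dir.listFiles();
                if (entries == null) {
                    // Access denied or I/O error: listFiles() returns null.
                    // Skip this directory and keep crawling.
                    System.err.println("Skipping unreadable directory: " + dir);
                    return;
                }
                for (File entry : entries) {
                    if (entry.isDirectory())
                        crawl(entry);
                    else
                        System.out.println("Found file: " + entry);
                }
            }

            public static void main(String[] args) {
                crawl(new File(args[0]));
            }
        }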

    The next thing I found out is that the CrawlFileThread runner is aborted before it has reached all files. I tried to prevent the abort by commenting out the "break" statement, to see what would happen.
    Now Windows can reach all the files with the "file system parser", but some other parsers, like the text parser, still do not reach all files.
    Looking at my logs I saw that the text parser itself does not finish its "parseContent" method, and then the CrawlFileMaster somehow ended. Maybe a timeout?
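
    If it really is a timeout, I imagine something along these lines could let the crawl skip a stuck file instead of dying (a purely hypothetical sketch with made-up names, not how OSS actually handles it):

        import java.util.concurrent.*;

        public class ParseWithTimeout {

            private static final ExecutorService pool =
                    Executors.newSingleThreadExecutor();

            // Stand-in for the real parser's parseContent work.
            static String parseContent(String path) throws InterruptedException {
                Thread.sleep(100); // simulate parsing
                return "parsed " + path;
            }

            // Run the parse under a timeout; skip the file if it hangs.
            static String parseOrSkip(String path, long timeoutSeconds) {
                Future<String> result = pool.submit(() -> parseContent(path));
                try {
                    return result.get(timeoutSeconds, TimeUnit.SECONDS);
                } catch (TimeoutException e) {
                    result.cancel(true); // interrupt the stuck parse
                    System.err.println("Parser timed out on " + path + ", skipping");
                    return null;
                } catch (InterruptedException | ExecutionException e) {
                    System.err.println("Parser failed on " + path + ": " + e);
                    return null;
                }
            }

            public static void main(String[] args) {
                System.out.println(parseOrSkip("test.txt", 5));
                pool.shutdownNow();
            }
        }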

    To put it in a nutshell: I guess the differences between Linux and Windows come from the checkDirectory method and from an abort I could not locate.
    Beyond that, both Linux and Windows seem to have a problem with some files, depending on the parser.

    So far my analysis.

    Maybe you got an idea?

    Best Regards

    Stefan

    Last edit: Stefan 2013-10-17
  • Stefan
    2013-10-24

    No more ideas? ;/