
File crawler incomplete

  • O Jung

    O Jung - 2016-03-02

    Dear OSS-Team and community,

    I've installed OSS v1.5.10 - build dab09220cf as a Docker container on Debian Jessie 8.
    Everything is fine, except that the file crawler crawled and indexed only 12 items (4 directories and 8 files),
    but there are many more (800).
    In oss.log I only get web crawler entries and some warnings, but nothing about file crawling events.

    The source file tree to be indexed is mapped into the container at /srv/opensearchserver, and all files and directories have the same permissions (755).
    How can I check why the file crawler stops at 12 items?
    Which parameters can I use to extend or adjust the file crawler's work?
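
    For reference, a minimal Python sketch (not an OSS tool; it only assumes the /srv/opensearchserver path above) that walks the mapped path from inside the container and reports anything the current user cannot read, so the totals can be compared against the 12 indexed items:

    # check_crawl_dir.py - hypothetical helper: count entries under the mapped
    # path and list anything the current user cannot read.
    import os

    ROOT = "/srv/opensearchserver"  # path mapped into the container (from this post)

    total_dirs = total_files = 0
    unreadable = []

    for dirpath, dirnames, filenames in os.walk(ROOT):
        total_dirs += len(dirnames)
        total_files += len(filenames)
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            if not os.access(path, os.R_OK):
                unreadable.append(path)

    print("{} directories, {} files under {}".format(total_dirs, total_files, ROOT))
    for path in unreadable:
        print("not readable: " + path)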

    Thanks for any hint.

    Best regards,

    Oliver
    oss.log:

    14:45:09,274 WARN: root - Error while working on URL: http://www.bionic-design.de/Logos/F-flagge12.png : INVALID_CHARACTER_ERR: An invalid or illegal XML character is specified.
    org.w3c.dom.DOMException: INVALID_CHARACTER_ERR: An invalid or illegal XML character is specified.
        at org.apache.xerces.dom.CoreDocumentImpl.checkQName(Unknown Source)
        at org.apache.xerces.dom.AttrNSImpl.setName(Unknown Source)
        at org.apache.xerces.dom.AttrNSImpl.<init>(Unknown Source)
        at org.apache.xerces.dom.CoreDocumentImpl.createAttributeNS(Unknown Source)
        at org.apache.xerces.dom.ElementImpl.setAttributeNS(Unknown Source)
        at org.apache.xalan.xsltc.trax.SAX2DOM.startElement(SAX2DOM.java:148)
        at org.ccil.cowan.tagsoup.Parser.push(Parser.java:794)
        at org.ccil.cowan.tagsoup.Parser.rectify(Parser.java:1061)
        at org.ccil.cowan.tagsoup.Parser.stagc(Parser.java:1016)
        at org.ccil.cowan.tagsoup.HTMLScanner.scan(HTMLScanner.java:463)
        at org.ccil.cowan.tagsoup.Parser.parse(Parser.java:449)
        at com.jaeksoft.searchlib.parser.htmlParser.TagsoupParser.getDomHtmlNode(TagsoupParser.java:51)
        at com.jaeksoft.searchlib.parser.htmlParser.TagsoupParser.getDocument(TagsoupParser.java:60)
        at com.jaeksoft.searchlib.parser.htmlParser.TagsoupParser.getDocument(TagsoupParser.java:38)
        at com.jaeksoft.searchlib.parser.htmlParser.HtmlDocumentProvider.getDocument(HtmlDocumentProvider.java:98)
        at com.jaeksoft.searchlib.parser.htmlParser.HtmlDocumentProvider.init(HtmlDocumentProvider.java:75)
        at com.jaeksoft.searchlib.parser.htmlParser.HtmlParserEnum.getHtmlParser(HtmlParserEnum.java:130)
        at com.jaeksoft.searchlib.parser.htmlParser.HtmlParserEnum.findBestProvider(HtmlParserEnum.java:101)
        at com.jaeksoft.searchlib.parser.htmlParser.HtmlParserEnum.getHtmlParser(HtmlParserEnum.java:126)
        at com.jaeksoft.searchlib.parser.HtmlParser.getHtmlDocumentProvider(HtmlParser.java:305)
        at com.jaeksoft.searchlib.parser.HtmlParser.parseContent(HtmlParser.java:388)
        at com.jaeksoft.searchlib.parser.Parser.doParserContent(Parser.java:172)
        at com.jaeksoft.searchlib.parser.ParserSelector.parserLoop(ParserSelector.java:502)
        at com.jaeksoft.searchlib.parser.ParserSelector.parseStream(ParserSelector.java:535)
        at com.jaeksoft.searchlib.crawler.web.spider.Crawl.parseContent(Crawl.java:163)
        at com.jaeksoft.searchlib.crawler.web.spider.Crawl.download(Crawl.java:324)
        at com.jaeksoft.searchlib.crawler.web.process.WebCrawlThread.crawl(WebCrawlThread.java:182)
        at com.jaeksoft.searchlib.crawler.web.process.WebCrawlThread.runner(WebCrawlThread.java:126)
        at com.jaeksoft.searchlib.process.ThreadAbstract.run(ThreadAbstract.java:291)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
    14:45:19,344 WARN: root - none
    15:19:04,539 INFO: root - RELOAD - Hourly - Tue Mar 01 12:00:00 UTC 2016 - Count:3 - Average:5.0 - Min:1 - Max:9
    17:26:59,293 INFO: root - RELOAD - Hourly - Tue Mar 01 15:00:00 UTC 2016 - Count:3 - Average:9.333334 - Min:4 - Max:13
    17:34:28,509 INFO: root - SEARCH - Hourly - Tue Mar 01 13:00:00 UTC 2016 - Count:2 - Average:16.0 - Min:3 - Max:29

    Today, after a fresh reindex, oss.log shows:
    08:10:17,894 INFO: root - RELOAD - Hourly - Tue Mar 01 17:00:00 UTC 2016 - Count:8 - Average:9.624999 - Min:3 - Max:19
    System:

    Version: OpenSearchServer v1.5.10 - build dab09220cf (the running OpenSearchServer version)
    Available processors: 8 (the maximum number of processors available to the virtual machine)
    Free memory: 4.4 GB (the amount of free memory in the Java Virtual Machine)
    Free memory rate: 91.8 % (the rate of free memory in the Java Virtual Machine)
    Max memory: 4.8 GB (the maximum amount of memory that the Java virtual machine will attempt to use)
    Total memory: 4.8 GB (the total amount of memory in the Java virtual machine)
    Data directory path: /srv/opensearchserver/data (the location of the directory containing the indices)
    Free disk space: 743.7 GB (the free space on a drive or volume)
    Disk space rate: 84 % (the rate of free space on a drive or volume)
    Total disk space: 885.3 GB (the total space on a drive or volume)
    Index count: 3 (the total number of indices)

     
    • O Jung

      O Jung - 2016-03-02

      Dear all,
      I have now installed OSS v1.5.13 via the .deb package directly on Jessie.
      Now the file crawler works fine on the same content.
      There has to be a problem in the Docker container.
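
      To narrow down where the Docker problem sits, here is a minimal Python sketch (the script name is mine, and it assumes Python is available inside the container) that counts the entries under a path. Running it on the host against the source directory and, via docker exec, inside the container against /srv/opensearchserver shows whether the container actually exposes all ~800 files:

      # count_tree.py - hypothetical helper: print how many directories and files
      # exist under a given path; run on the host and inside the container to compare.
      import os
      import sys

      root = sys.argv[1] if len(sys.argv) > 1 else "/srv/opensearchserver"

      dirs = files = 0
      for _, dirnames, filenames in os.walk(root):
          dirs += len(dirnames)
          files += len(filenames)

      print("{}: {} directories, {} files".format(root, dirs, files))

      If the two counts differ, the volume mapping (rather than the file crawler itself) is the likely culprit.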

      Best regards,

      Oliver

      CLOSED for OSS but not for Docker Container Manager (Alexandre?)

       
