Dear OSS team and community,
I've installed OSS v1.5.10 (build dab09220cf) as a Docker container on Debian 8 (Jessie).
Everything is fine, but the file crawler crawled and indexed only 12 items (4 directories and 8 files).
There are many more (800).
In oss.log I see only web crawler entries and some warnings, but nothing about file crawling events.
The source file space to be indexed is mapped into the container at /srv/opensearchserver, and all files and directories have the same permissions (755).
How can I check why the file crawler stops at 12 items?
Which parameters can I use to extend or adjust the file crawler's work?
Thanks for any hint.
Best regards,
Oliver
OSS.log:
14:45:09,274 WARN: root - Error while working on URL: http://www.bionic-design.de/Logos/F-flagge12.png : INVALID_CHARACTER_ERR: An invalid or illegal XML character is specified.
org.w3c.dom.DOMException: INVALID_CHARACTER_ERR: An invalid or illegal XML character is specified.
    at org.apache.xerces.dom.CoreDocumentImpl.checkQName(Unknown Source)
    at org.apache.xerces.dom.AttrNSImpl.setName(Unknown Source)
    at org.apache.xerces.dom.AttrNSImpl.<init>(Unknown Source)
    at org.apache.xerces.dom.CoreDocumentImpl.createAttributeNS(Unknown Source)
    at org.apache.xerces.dom.ElementImpl.setAttributeNS(Unknown Source)
    at org.apache.xalan.xsltc.trax.SAX2DOM.startElement(SAX2DOM.java:148)
    at org.ccil.cowan.tagsoup.Parser.push(Parser.java:794)
    at org.ccil.cowan.tagsoup.Parser.rectify(Parser.java:1061)
    at org.ccil.cowan.tagsoup.Parser.stagc(Parser.java:1016)
    at org.ccil.cowan.tagsoup.HTMLScanner.scan(HTMLScanner.java:463)
    at org.ccil.cowan.tagsoup.Parser.parse(Parser.java:449)
    at com.jaeksoft.searchlib.parser.htmlParser.TagsoupParser.getDomHtmlNode(TagsoupParser.java:51)
    at com.jaeksoft.searchlib.parser.htmlParser.TagsoupParser.getDocument(TagsoupParser.java:60)
    at com.jaeksoft.searchlib.parser.htmlParser.TagsoupParser.getDocument(TagsoupParser.java:38)
    at com.jaeksoft.searchlib.parser.htmlParser.HtmlDocumentProvider.getDocument(HtmlDocumentProvider.java:98)
    at com.jaeksoft.searchlib.parser.htmlParser.HtmlDocumentProvider.init(HtmlDocumentProvider.java:75)
    at com.jaeksoft.searchlib.parser.htmlParser.HtmlParserEnum.getHtmlParser(HtmlParserEnum.java:130)
    at com.jaeksoft.searchlib.parser.htmlParser.HtmlParserEnum.findBestProvider(HtmlParserEnum.java:101)
    at com.jaeksoft.searchlib.parser.htmlParser.HtmlParserEnum.getHtmlParser(HtmlParserEnum.java:126)
    at com.jaeksoft.searchlib.parser.HtmlParser.getHtmlDocumentProvider(HtmlParser.java:305)
    at com.jaeksoft.searchlib.parser.HtmlParser.parseContent(HtmlParser.java:388)
    at com.jaeksoft.searchlib.parser.Parser.doParserContent(Parser.java:172)
    at com.jaeksoft.searchlib.parser.ParserSelector.parserLoop(ParserSelector.java:502)
    at com.jaeksoft.searchlib.parser.ParserSelector.parseStream(ParserSelector.java:535)
    at com.jaeksoft.searchlib.crawler.web.spider.Crawl.parseContent(Crawl.java:163)
    at com.jaeksoft.searchlib.crawler.web.spider.Crawl.download(Crawl.java:324)
    at com.jaeksoft.searchlib.crawler.web.process.WebCrawlThread.crawl(WebCrawlThread.java:182)
    at com.jaeksoft.searchlib.crawler.web.process.WebCrawlThread.runner(WebCrawlThread.java:126)
    at com.jaeksoft.searchlib.process.ThreadAbstract.run(ThreadAbstract.java:291)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
14:45:19,344 WARN: root - none
15:19:04,539 INFO: root - RELOAD - Hourly - Tue Mar 01 12:00:00 UTC 2016 - Count:3 - Average:5.0 - Min:1 - Max:9
17:26:59,293 INFO: root - RELOAD - Hourly - Tue Mar 01 15:00:00 UTC 2016 - Count:3 - Average:9.333334 - Min:4 - Max:13
17:34:28,509 INFO: root - SEARCH - Hourly - Tue Mar 01 13:00:00 UTC 2016 - Count:2 - Average:16.0 - Min:3 - Max:29
Today I reindexed; fresh oss.log:
08:10:17,894 INFO: root - RELOAD - Hourly - Tue Mar 01 17:00:00 UTC 2016 - Count:8 - Average:9.624999 - Min:3 - Max:19
Version / System:
OpenSearchServer version: v1.5.10 - build dab09220cf
Available processors: 8
Free memory (JVM): 4.4 GB
Free memory rate (JVM): 91.8 %
Max memory (JVM): 4.8 GB
Total memory (JVM): 4.8 GB
Data directory path: /srv/opensearchserver/data
Free disk space: 743.7 GB
Disk space rate (free): 84 %
Total disk space: 885.3 GB
Index count: 3
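Since the log excerpt above contains only web crawler frames, one way to confirm that for the whole file is to count stack frames per crawler sub-package. The following is a minimal sketch, not an OSS tool: the function name `crawler_frames` is made up for illustration, and the only thing it relies on is the package prefix `com.jaeksoft.searchlib.crawler.` that appears in the trace above. The path to oss.log is passed on the command line because its location depends on the installation.

```python
import re
import sys
from collections import Counter

# Matches frames such as
#   at com.jaeksoft.searchlib.crawler.web.spider.Crawl.download(Crawl.java:324)
# and captures the sub-package right after "crawler." ("web" in this frame).
FRAME = re.compile(r"com\.jaeksoft\.searchlib\.crawler\.(\w+)\.")


def crawler_frames(log_text):
    """Count stack frames per crawler sub-package in a log excerpt."""
    return Counter(FRAME.findall(log_text))


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python crawler_frames.py /path/to/oss.log
    with open(sys.argv[1], encoding="utf-8", errors="replace") as fh:
        for name, count in crawler_frames(fh.read()).most_common():
            print(name, count)
```

If only `web` shows up, the file crawler either logged nothing or did not raise any exception; that by itself does not say why it stopped.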
Dear all,
I have now installed OSS v1.5.13 from the .deb package directly on Jessie.
The file crawler now works fine on the same content.
So the problem must be in the Docker container.
Best regards,
Oliver
CLOSED for OSS but not for Docker Container Manager (Alexandre?)
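For anyone who wants to narrow down the container side, a possible check is to walk the mapped directory once on the host and once inside the container, as the user the OSS process runs as, and compare the counts. This is only a sketch under that assumption; `scan_tree` is a hypothetical helper, not part of OSS or Docker, and it only tests what the operating system reports, not what the file crawler actually does.

```python
import os


def scan_tree(root):
    """Count directories and files below root and list unreadable entries."""
    dirs = 0
    files = 0
    unreadable = []
    walk_errors = []
    # onerror collects directories that os.walk itself could not list.
    for current, subdirs, names in os.walk(root, onerror=walk_errors.append):
        dirs += len(subdirs)
        files += len(names)
        for name in subdirs:
            path = os.path.join(current, name)
            # A directory needs read and execute permission to be traversed.
            if not os.access(path, os.R_OK | os.X_OK):
                unreadable.append(path)
        for name in names:
            path = os.path.join(current, name)
            if not os.access(path, os.R_OK):
                unreadable.append(path)
    return {
        "dirs": dirs,
        "files": files,
        "unreadable": unreadable,
        "errors": [str(err) for err in walk_errors],
    }
```

Calling `scan_tree("/srv/opensearchserver")` in both places should give the same numbers. The root directory itself is not counted. If the container reports far fewer entries than the host (for example 4 directories and 8 files), the mapping or the effective user inside the container is the more likely culprit than the crawler settings.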