Hello!
I've got the following error message:
java.lang.IllegalArgumentException
at java.net.URI.create(URI.java:841)
at com.jaeksoft.searchlib.util.LinkUtils.getLink(Unknown Source)
at com.jaeksoft.searchlib.parser.HtmlParser.parseContent(Unknown Source)
at com.jaeksoft.searchlib.parser.Parser.parseContent(Unknown Source)
at com.jaeksoft.searchlib.crawler.web.spider.Crawl.parseContent(Unknown Source)
at com.jaeksoft.searchlib.crawler.web.spider.Crawl.download(Unknown Source)
at com.jaeksoft.searchlib.crawler.web.process.WebCrawlThread.crawl(Unknown Source)
at com.jaeksoft.searchlib.crawler.web.process.WebCrawlThread.runner(Unknown Source)
at com.jaeksoft.searchlib.process.ThreadAbstract.run(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.net.URISyntaxException: Illegal character in fragment at index 86: http://example.org/doc/39/03/01/00/javax/swing/Icon.html#paintIcon(java.awt.Component, java.awt.Graphics, int, int)
at java.net.URI$Parser.fail(URI.java:2810)
at java.net.URI$Parser.checkChars(URI.java:2983)
at java.net.URI$Parser.parse(URI.java:3029)
at java.net.URI.<init>(URI.java:577)
at java.net.URI.create(URI.java:839)
... 11 more
It's an awful fragment, "#paintIcon(java.awt.Component, java.awt.Graphics, int, int)", but it exists.
So I had a look at RFC 3986:
fragment = *( pchar / "/" / "?" )
pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@"
sub-delims = "!" / "$" / "&" / "'" / "(" / ")"
           / "*" / "+" / "," / ";" / "="
The space is not allowed and should be percent-encoded, but "(", ")", "," and "." are allowed.
Would it be possible to replace the space automatically with, for example, "%20" and handle this kind of URI instead of throwing an exception?
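To illustrate the idea, here is a minimal sketch. LenientUri is a hypothetical helper name of mine, not part of the crawler, and I am assuming that the space is the only offending character. Any other illegal character would still make URI.create throw.

```java
import java.net.URI;

// Hypothetical helper, not part of the crawler's code base.
final class LenientUri {
    private LenientUri() {}

    // Percent-encodes literal spaces and then parses the result.
    // Assumption: spaces are the only illegal characters in the link;
    // anything else still raises IllegalArgumentException.
    static URI create(String raw) {
        return URI.create(raw.replace(" ", "%20"));
    }

    public static void main(String[] args) {
        URI uri = create("http://example.org/doc/39/03/01/00/javax/swing/Icon.html"
                + "#paintIcon(java.awt.Component, java.awt.Graphics, int, int)");
        // prints paintIcon(java.awt.Component,%20java.awt.Graphics,%20int,%20int)
        System.out.println(uri.getRawFragment());
        // prints paintIcon(java.awt.Component, java.awt.Graphics, int, int)
        System.out.println(uri.getFragment());
    }
}
```

The same replacement also covers a space in the path, since "%20" is valid in every component of the URI.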
Thanks in advance!
Best Regards.
Hello!
Meanwhile there are thousands of fetch errors due to this error. It's Java documentation, and nearly every page contains a URL with a fragment like
ConverterException(java.lang.String, java.lang.Throwable)
So an error occurs and the page gets the status "error". The page still seems to have been indexed, though: the index status is "indexed".
BTW: Most of the pages are handled correctly. :-)
Best Regards.
Hello!
Now the same error occurred with normal URLs:
java.lang.IllegalArgumentException
at java.net.URI.create(URI.java:841)
at com.jaeksoft.searchlib.util.LinkUtils.getLink(Unknown Source)
at com.jaeksoft.searchlib.parser.HtmlParser.parseContent(Unknown Source)
at com.jaeksoft.searchlib.parser.Parser.parseContent(Unknown Source)
at com.jaeksoft.searchlib.crawler.web.spider.Crawl.parseContent(Unknown Source)
at com.jaeksoft.searchlib.crawler.web.spider.Crawl.download(Unknown Source)
at com.jaeksoft.searchlib.crawler.web.process.WebCrawlThread.crawl(Unknown Source)
at com.jaeksoft.searchlib.crawler.web.process.WebCrawlThread.runner(Unknown Source)
at com.jaeksoft.searchlib.process.ThreadAbstract.run(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.net.URISyntaxException: Illegal character in path at index 64: http://example.org/doc/15/06/05/00/berlin/Sample0/slides/Sample0 (109).html
at java.net.URI$Parser.fail(URI.java:2810)
at java.net.URI$Parser.checkChars(URI.java:2983)
at java.net.URI$Parser.parseHierarchical(URI.java:3067)
at java.net.URI$Parser.parse(URI.java:3015)
at java.net.URI.<init>(URI.java:577)
at java.net.URI.create(URI.java:839)
... 11 more
The character at index 64 is the space.
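The index can also be checked programmatically. This is only a small sketch: UriDiagnostics and firstIllegalIndex are hypothetical names of mine, and it relies on URISyntaxException.getIndex(), which returns -1 when the position is not known.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical diagnostic helper, not part of the crawler's code base.
final class UriDiagnostics {
    private UriDiagnostics() {}

    // Returns the index of the first character java.net.URI rejects,
    // or -1 if the string parses (or the parser reports no position).
    static int firstIllegalIndex(String raw) {
        try {
            new URI(raw);
            return -1;
        } catch (URISyntaxException e) {
            return e.getIndex();
        }
    }

    public static void main(String[] args) {
        String raw = "http://example.org/doc/15/06/05/00/berlin/Sample0/slides/Sample0 (109).html";
        int index = firstIllegalIndex(raw);
        // prints 64 -> ' '
        System.out.println(index + " -> '" + raw.charAt(index) + "'");
    }
}
```

With the space written as "%20", the same URL parses and the method returns -1.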
Best Regards.