
#147 Illegal Argument parsed (java.lang.IllegalArgumentException)

v1.2.4
open
nobody
None
5
2014-10-27
2012-05-23
DR12
No

Hello!

I've got the following error message:

java.lang.IllegalArgumentException
    at java.net.URI.create(URI.java:841)
    at com.jaeksoft.searchlib.util.LinkUtils.getLink(Unknown Source)
    at com.jaeksoft.searchlib.parser.HtmlParser.parseContent(Unknown Source)
    at com.jaeksoft.searchlib.parser.Parser.parseContent(Unknown Source)
    at com.jaeksoft.searchlib.crawler.web.spider.Crawl.parseContent(Unknown Source)
    at com.jaeksoft.searchlib.crawler.web.spider.Crawl.download(Unknown Source)
    at com.jaeksoft.searchlib.crawler.web.process.WebCrawlThread.crawl(Unknown Source)
    at com.jaeksoft.searchlib.crawler.web.process.WebCrawlThread.runner(Unknown Source)
    at com.jaeksoft.searchlib.process.ThreadAbstract.run(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)
Caused by: java.net.URISyntaxException: Illegal character in fragment at index 86: http://example.org/doc/39/03/01/00/javax/swing/Icon.html#paintIcon(java.awt.Component, java.awt.Graphics, int, int)
    at java.net.URI$Parser.fail(URI.java:2810)
    at java.net.URI$Parser.checkChars(URI.java:2983)
    at java.net.URI$Parser.parse(URI.java:3029)
    at java.net.URI.<init>(URI.java:577)
    at java.net.URI.create(URI.java:839)
    ... 11 more

It's an awful fragment, "#paintIcon(java.awt.Component, java.awt.Graphics, int, int)", but it exists.

So I had a look at RFC 3986:

fragment = *( pchar / "/" / "?" )
pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@"
sub-delims = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="

So the space is illegal and must be percent-encoded, but "(", ")", "," and "." are allowed.
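
To double-check that reading of the grammar, here is a minimal sketch that tests a fragment against the RFC 3986 rules quoted above. The class name Rfc3986Fragment is only for illustration and is not part of the crawler. Note that java.net.URI is documented against the older RFC 2396, so its accepted character set may differ in detail.

```java
import java.util.regex.Pattern;

final class Rfc3986Fragment {
    // fragment = *( pchar / "/" / "?" ), with pchar expanded to
    // unreserved / pct-encoded / sub-delims / ":" / "@".
    private static final Pattern FRAGMENT = Pattern.compile(
        "(?:[A-Za-z0-9\\-._~!$&'()*+,;=:@/?]|%[0-9A-Fa-f]{2})*");

    private Rfc3986Fragment() {}

    // Returns true if the string (without the leading '#') is a valid fragment.
    static boolean isValid(String fragment) {
        return FRAGMENT.matcher(fragment).matches();
    }

    public static void main(String[] args) {
        // Raw space: prints false
        System.out.println(isValid("paintIcon(java.awt.Component, java.awt.Graphics, int, int)"));
        // Space encoded as %20: prints true
        System.out.println(isValid("paintIcon(java.awt.Component,%20java.awt.Graphics,%20int,%20int)"));
    }
}
```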

Would it be possible to replace the space automatically, for example with "%20", and handle this kind of URI instead of throwing an exception?
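
What I have in mind is roughly the following sketch. I don't know how LinkUtils.getLink is implemented (the trace only shows "Unknown Source"), so LenientUri is a hypothetical helper, not the project's code. It only handles literal spaces; any other illegal character would still make URI.create throw.

```java
import java.net.URI;

final class LenientUri {
    private LenientUri() {}

    // Percent-encode literal spaces before parsing.
    // Assumption: spaces are the only illegal characters worth repairing here.
    static URI create(String raw) {
        return URI.create(raw.trim().replace(" ", "%20"));
    }

    public static void main(String[] args) {
        URI uri = create("http://example.org/doc/39/03/01/00/javax/swing/Icon.html"
            + "#paintIcon(java.awt.Component, java.awt.Graphics, int, int)");
        // The raw fragment now carries %20 where the spaces were.
        System.out.println(uri.getRawFragment());
    }
}
```

With this, getFragment() still returns the decoded text with spaces, while the URI itself parses without an exception.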

Thanks in advance!

Best Regards.

Discussion

  • DR12

    DR12 - 2012-05-24

    Hello!

    In the meantime there have been thousands of fetch errors caused by this. The crawled site is Java documentation, and nearly every page contains a URL with a fragment like

    ConverterException(java.lang.String, java.lang.Throwable)

    So an error occurs and the page gets a status of "error". Hopefully the page was indexed anyway; its index status is "indexed".
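
    One way to keep a single bad link from marking the whole page as failed would be to skip unparsable links and carry on. This is only a sketch under that assumption; LinkCollector and its method are made-up names, not the crawler's actual classes.

    ```java
    import java.net.URI;
    import java.util.ArrayList;
    import java.util.List;

    final class LinkCollector {
        private LinkCollector() {}

        // Parses every href; malformed ones are recorded in 'rejected'
        // instead of aborting the whole page.
        static List<URI> parseAll(List<String> hrefs, List<String> rejected) {
            List<URI> parsed = new ArrayList<>();
            for (String href : hrefs) {
                try {
                    parsed.add(URI.create(href));
                } catch (IllegalArgumentException e) {
                    // URI.create wraps URISyntaxException in IllegalArgumentException.
                    rejected.add(href);
                }
            }
            return parsed;
        }
    }
    ```

    The rejected list could then be logged per page, so the malformed links stay visible without failing the fetch.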

    BTW: Most of the pages are handled correctly. :-)

    Best Regards.

     
  • DR12

    DR12 - 2012-05-25

    Hello!

    Now the same error has occurred with ordinary URLs:

    java.lang.IllegalArgumentException
    at java.net.URI.create(URI.java:841)
    at com.jaeksoft.searchlib.util.LinkUtils.getLink(Unknown Source)
    at com.jaeksoft.searchlib.parser.HtmlParser.parseContent(Unknown Source)
    at com.jaeksoft.searchlib.parser.Parser.parseContent(Unknown Source)
    at com.jaeksoft.searchlib.crawler.web.spider.Crawl.parseContent(Unknown Source)
    at com.jaeksoft.searchlib.crawler.web.spider.Crawl.download(Unknown Source)
    at com.jaeksoft.searchlib.crawler.web.process.WebCrawlThread.crawl(Unknown Source)
    at com.jaeksoft.searchlib.crawler.web.process.WebCrawlThread.runner(Unknown Source)
    at com.jaeksoft.searchlib.process.ThreadAbstract.run(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)
    Caused by: java.net.URISyntaxException: Illegal character in path at index 64: http://example.org/doc/15/06/05/00/berlin/Sample0/slides/Sample0 (109).html
    at java.net.URI$Parser.fail(URI.java:2810)
    at java.net.URI$Parser.checkChars(URI.java:2983)
    at java.net.URI$Parser.parseHierarchical(URI.java:3067)
    at java.net.URI$Parser.parse(URI.java:3015)
    at java.net.URI.<init>(URI.java:577)
    at java.net.URI.create(URI.java:839)
    ... 11 more

    The character at index 64 is the space.
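
    For a case like this, a possible alternative to replacing spaces by hand is the multi-argument java.net.URI constructor, which quotes illegal characters in each component. The sketch below is an assumption about how it could be used, with QuotingUri as a hypothetical helper name. One caveat: that constructor also quotes "%", so input that is already percent-encoded would be encoded twice. On recent JDKs the URL(String) constructor used for splitting is deprecated.

    ```java
    import java.net.MalformedURLException;
    import java.net.URI;
    import java.net.URISyntaxException;
    import java.net.URL;

    final class QuotingUri {
        private QuotingUri() {}

        // Splits the raw link with java.net.URL (which does not reject spaces),
        // then rebuilds it through the URI constructor that quotes illegal characters.
        static URI fromRaw(String raw) {
            try {
                URL url = new URL(raw);
                return new URI(url.getProtocol(), url.getAuthority(),
                               url.getPath(), url.getQuery(), url.getRef());
            } catch (MalformedURLException | URISyntaxException e) {
                throw new IllegalArgumentException(e);
            }
        }

        public static void main(String[] args) {
            URI uri = fromRaw(
                "http://example.org/doc/15/06/05/00/berlin/Sample0/slides/Sample0 (109).html");
            // The space in the path is emitted as %20.
            System.out.println(uri.toASCIIString());
        }
    }
    ```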

    Best Regards.

     
