Hi,
I am getting the following error when I try to extract links from a web site. Any help, please? Many thanks,
Shantha
D:\htmlparser1_4_2>java Robot http://www.keele.ac.uk/depts/cs/dake/vldb2000/panel2020/DeenVLDB2/index.htm
Crawlin Site http://www.keele.ac.uk/depts/cs/dake/vldb2000/panel2020/DeenVLDB2/index.htm 1
Exception in thread "main" org.htmlparser.util.ParserException: null;
sun.io.MalformedInputException
at sun.io.ByteToCharUTF8.convert(ByteToCharUTF8.java:152)
at java.io.InputStreamReader.convertInto(InputStreamReader.java:137)
at java.io.InputStreamReader.fill(InputStreamReader.java:186)
at java.io.InputStreamReader.read(InputStreamReader.java:249)
at org.htmlparser.lexer.Source.fill(Source.java:239)
at org.htmlparser.lexer.Source.read(Source.java:322)
at org.htmlparser.lexer.Source.read(Source.java:347)
at org.htmlparser.lexer.Page.setEncoding(Page.java:698)
at org.htmlparser.tags.MetaTag.doSemanticAction(MetaTag.java:115)
at org.htmlparser.scanners.TagScanner.scan(TagScanner.java:69)
at org.htmlparser.scanners.CompositeTagScanner.scan(CompositeTagScanner.java:162)
at org.htmlparser.util.IteratorImpl.nextNode(IteratorImpl.java:92)
at Robot.crawl(Robot.java:200)
at Robot.main(Robot.java:106)
From the stack trace, there is a problem trying to interpret the page as UTF-8. The META tag in the HEAD:
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
causes the parser to re-read the characters consumed so far (which isn't very many) using the UTF-8 encoding scheme. This is triggered by the META tag's doSemanticAction() method, which calls Page.setEncoding(), as the trace shows.
By the way, I don't get this error, so it may be something in your environment, perhaps a language setting or something similar. Alternatively, it could be a bug in your JVM, since the byte stream looks pretty normal.
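If you want to check whether the bytes your machine actually receives are valid UTF-8, a small standalone test along these lines might help narrow it down. This is only a sketch using the standard java.nio.charset API on a current JVM, not the sun.io converter from your trace; the class name Utf8Check and the method isValidUtf8 are made up for illustration, and the sample byte arrays are stand-ins for the bytes you would download from the page.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of htmlparser.
public class Utf8Check {

    /** Returns true if the bytes decode cleanly as UTF-8 under a strict decoder. */
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                // REPORT makes the decoder throw instead of silently substituting.
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            // MalformedInputException is a subclass of CharacterCodingException.
            return false;
        }
    }

    public static void main(String[] args) {
        // Plain ASCII is always valid UTF-8.
        byte[] ascii = "<title>ok</title>".getBytes(StandardCharsets.US_ASCII);
        // ISO-8859-1 encodes U+00E9 as the single byte 0xE9, which is an
        // incomplete multi-byte sequence when read as UTF-8.
        byte[] latin1 = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);

        System.out.println(isValidUtf8(ascii));
        System.out.println(isValidUtf8(latin1));
    }
}
```

If the real page bytes pass a check like this on your machine, that would point more toward the JVM or environment than toward the page itself.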