I wrote the code below to get all the locations, as one of the users in a previous thread tried to do. I don't know why my code is not working; can anyone tell me the reason? I am very new to parsing. My code is not working with this link:
http://cke.know-where.com/hardees/cgi/selection?mapid=US&lang=en&design=default&addr=&city=&region=&zip=19362&phone=
Thanks in advance
-Jeff
import org.htmlparser.Parser;
import org.htmlparser.filters.AndFilter;
import org.htmlparser.filters.HasAttributeFilter;
import org.htmlparser.filters.TagNameFilter;
import org.htmlparser.util.NodeList;

public class SampleParsing
{
    public static void parseData(String web_url) throws Exception
    {
        try
        {
            System.out.println(web_url);
            // Parse the page at the given URL (the original code passed
            // an undefined variable, htmlFileToParse, here)
            Parser parser = new Parser(web_url);
            // Collect every <table> node carrying valign="top"
            NodeList td_list = parser.parse(new AndFilter(
                    new TagNameFilter("table"),
                    new HasAttributeFilter("valign", "top")));
            System.out.println(td_list.size());
            for (int i = 0; i < td_list.size(); i++)
            {
                // Dump the plain-text contents of each matching table
                String s = td_list.elementAt(i).toPlainTextString();
                System.out.println(s);
            }
        }
        catch (Exception e)
        {
            e.printStackTrace();
        }
    }

    public static void main(String args[]) throws Exception
    {
        String weburl = null;
        try
        {
            weburl = "http://cke.know-where.com/hardees/cgi/selection?mapid=US&lang=en&design=default&addr=&city=&region=&zip=19362&phone=";
            parseData(weburl);
        }
        catch (Exception e)
        {
            e.printStackTrace();
        }
    }
}
You got snared by the same bug I just encountered tonight. I've reported it as:
http://sourceforge.net/tracker/index.php?func=detail&aid=1925846&group_id=24399&atid=381399
The URL you mention has this apparently harmless HTML comment at the top of the page:
<!--------------------------------------------->
It appears that if the parser encounters a THIRD dash before the closing ">", it gets confused and thinks the comment keeps going, I'm not sure until where. As a result, the parser absorbs large amounts of the page's HTML into the comment.
If you want to test your code, make a copy of the web page with these problematic comments stripped out.
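In case it helps, here is a minimal sketch of that workaround: fetch the page yourself and collapse every comment, including the dash-heavy ones, into a harmless placeholder before handing the text to the parser. The class name and the regex are mine, not part of htmlparser, and the regex blanks out every comment body, which should be fine for making a test copy:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

public class CommentStripper
{
    // Collapse every HTML comment, including dash-heavy ones like
    // <!----------->, into a harmless <!-- --> placeholder
    public static String stripComments(String html)
    {
        // (?s) lets "." cross line breaks; the reluctant .*? stops at
        // the first run of dashes followed by ">"
        return html.replaceAll("(?s)<!--+.*?--+>", "<!-- -->");
    }

    public static void main(String[] args) throws Exception
    {
        URL url = new URL("http://cke.know-where.com/hardees/cgi/selection?mapid=US&lang=en&design=default&addr=&city=&region=&zip=19362&phone=");
        StringBuilder page = new StringBuilder();
        BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));
        String line;
        while ((line = in.readLine()) != null)
        {
            page.append(line).append('\n');
        }
        in.close();
        // Write the cleaned copy to stdout; redirect it to a file and
        // point SampleParsing at that saved copy instead of the live URL
        System.out.print(stripComments(page.toString()));
    }
}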
I also think this bug might be related:
http://sourceforge.net/tracker/index.php?func=detail&aid=1845913&group_id=24399&atid=381399
These two bugs put a real crimp in using this htmlparser on random real-world web pages, because nowadays a lot of JavaScript gets embedded in pages for tracking and the like.
Give your code another try. It might actually work if it weren't for the bug(s).
Latest info! I got a response to my bug report.
The "problem" (or not) is that the default value of:
Lexer.STRICT_REMARKS is true, which causes the "misbehavior" that I believe you and are observing.
To get your code to do the "right thing" (or at least what I believe is the right thing), you need to add this to your program:
Lexer.STRICT_REMARKS = false;
If you add this to your code, I believe your program then has a good shot at working.
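To make that concrete, here is how the change might slot into the main method of the code above. Lexer.STRICT_REMARKS is the field named in the bug response; the import path is the one htmlparser uses for its lexer, but this is my sketch, so double-check it against your version of the library:

import org.htmlparser.lexer.Lexer;   // add alongside the other imports

public static void main(String args[]) throws Exception
{
    // Turn off strict comment (remark) parsing before doing any work,
    // so the long dashed comment ends where a browser would end it
    Lexer.STRICT_REMARKS = false;
    parseData("http://cke.know-where.com/hardees/cgi/selection?mapid=US&lang=en&design=default&addr=&city=&region=&zip=19362&phone=");
}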