Why My Parser is not working

Help
Jeff
2007-11-19
2013-04-27
  • Jeff

    Jeff - 2007-11-19

    I wrote the code below to get all the locations, as one of the users in a previous thread tried to do, but it is not working and I don't know why. Can anybody help me find the reason? I am very new to parsing. My code does not work with this link:
    http://cke.know-where.com/hardees/cgi/selection?mapid=US&lang=en&design=default&addr=&city=&region=&zip=19362&phone=
    Thanks in advance
    -Jeff

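    import org.htmlparser.Parser;
    import org.htmlparser.filters.AndFilter;
    import org.htmlparser.filters.HasAttributeFilter;
    import org.htmlparser.filters.TagNameFilter;
    import org.htmlparser.util.NodeList;

    public class SampleParsing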
    {
        public static void parseData(String web_url)throws Exception
        {
            try
            {
                System.out.println(web_url);
                // Parse the URL passed in; htmlFileToParse was never defined.
                Parser parser = new Parser(web_url);
               
                NodeList td_list = parser.parse( new AndFilter (new TagNameFilter("table"), new HasAttributeFilter("valign","top")));
               
                System.out.println(td_list.size());
               
                for(int i=0;i<td_list.size();i++)
                {
                    String s=td_list.elementAt(i).toPlainTextString();
                    System.out.println(s);
                } 
            }
            catch(Exception e)
            {
                e.printStackTrace();
            }
        }
       
       
       
        public static void main(String args[])throws Exception
        {
            String weburl = null;
            try
            {
                weburl = "http://cke.know-where.com/hardees/cgi/selection?mapid=US&lang=en&design=default&addr=&city=&region=&zip=19362&phone=";
                parseData(weburl);
            }
            catch(Exception e)
            {
                e.printStackTrace();
            }
        }
       
    }

     
    • Clem Wang

      Clem Wang - 2008-03-26

      You got snared by the same bug I just encountered tonight.  I've reported it as:

      http://sourceforge.net/tracker/index.php?func=detail&aid=1925846&group_id=24399&atid=381399

      The URL you mention has this apparently harmless HTML comment at the top of the page:

      <!--------------------------------------------->

      It appears that if the parser encounters a THIRD dash before the last ">", it gets confused and thinks the comment keeps on going; I'm not sure where it finally ends. As a result, large amounts of the following HTML get absorbed into the comment.

      If you want to test your code, make a copy of the web page with these problematic comments stripped out.
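      For a quick test copy, the stripping could be sketched along these lines. This is only a rough illustration using the standard java.util.regex package, not part of htmlparser; the class name CommentStripper and the method stripComments are made up for this example, and a plain regex like this would also remove comment-like text inside scripts, so it is meant for experiments rather than production use.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of htmlparser: removes HTML comments
// from a page so a test copy can be parsed without them.
class CommentStripper {
    // Matches "<!--", then as little as possible, then the first "-->".
    // DOTALL lets a comment body span several lines.
    private static final Pattern COMMENT =
        Pattern.compile("<!--.*?-->", Pattern.DOTALL);

    static String stripComments(String html) {
        return COMMENT.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        // Sample input modelled on the comment at the top of the page.
        String page = "<!--------------------------------------------->\n"
                    + "<table valign=\"top\"><tr><td>Store 1</td></tr></table>";
        System.out.println(stripComments(page));
    }
}
```

      Saving the stripped text to a local file and pointing the parser at that file would then show whether the filter itself finds the tables.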

      I also think this bug might be related:
      http://sourceforge.net/tracker/index.php?func=detail&aid=1845913&group_id=24399&atid=381399

      These two bugs put a real crimp in being able to use this htmlparser for random real-world web pages, because nowadays a lot of JavaScript gets embedded in web pages for tracking and similar purposes.

      Give your code another try.  It might actually be working if it weren't for the bug(s).

       
    • Clem Wang

      Clem Wang - 2008-03-26

      Latest info!  I got a response to my bug report.

      The "problem" (or not) is that the default value of Lexer.STRICT_REMARKS is true, which causes the "misbehavior" that I believe you and I are observing.

      To get your code to do the "right thing" (or at least what I believe is the right thing), you need to add this to your program:

      Lexer.STRICT_REMARKS = false;

      If you add this to your code, I believe then your program has a good shot at working.

       
