
Parsing hidden HTML tags

Guzman
2011-04-06
2013-01-03
  • Guzman

    Guzman - 2011-04-06

    Hi,

    I'm trying to parse hidden HTML tags, but when the program executes the following line:

    List<Element> divs=source.getAllElementsByClass("specifications");

    against these HTML tags:

    <div class="specifications" tabname="specifications" style="display: none; left: 0px; visibility: visible;">
    <h3 class="tit-features">Características principales </h3>
    <ul class="icon-list">
    <li style="height: 15px;">Sintonizador TDT HD y Cable</li>
    ….
    </div>

    the program returns an empty list of elements. How can I parse this kind of hidden tag?
    Thanks in advance,

     
  • Martin Jericho

    Martin Jericho - 2011-04-06

    The problem has nothing to do with the tags being hidden. It is most likely caused by some other problem in your code or the HTML.

    If you send a full copy of your code and HTML I can tell you where the problem is. Alternatively you can check the log output from the parser to see if there are problems with the HTML.

    Cheers 
    Martin

     
  • Martin Jericho

    Martin Jericho - 2011-04-07

    The problem is that the web page is returning different content based on the User-Agent header. The default user agent sent by the Java runtime is something like "Java/1.6.0_13", so you'll need to set it to emulate a real browser.

    A solution is provided in the following forum post: 
    https://sourceforge.net/projects/jerichohtml/forums/forum/350025/topic/3456586?message=7776734

    For example, using the code:

    URLConnection urlConnection=new URL("http://www.lg.com/es/tv-audio-video/television/LG-led-37LV570S.jsp").openConnection(); 
    urlConnection.setRequestProperty("User-agent","Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.2.16) Gecko/20110319 Firefox/3.6.16 ( .NET CLR 3.5.30729)"); 
    Source source=new Source(urlConnection); 
    List<Element> divs=source.getAllElementsByClass("specifications"); 
    for (Element element : divs) System.out.println(element.getDebugInfo());

    I get the result:

    Element <li >-</li> ((r183,c5,p14819)-(r183,c203,p15017)) 
    Element <div >-</div> ((r1551,c2,p52227)-(r1802,c7,p57750))
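
    To check that the header is actually set before handing the connection to the parser, a minimal sketch using only the standard library might look like the following. The class name UserAgentExample, the helper openWithBrowserUserAgent and the constant BROWSER_USER_AGENT are hypothetical names chosen for illustration; only URLConnection.setRequestProperty and getRequestProperty come from the JDK. Opening the connection object this way does not send a request yet, so the check runs without touching the network.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.net.URLConnection;

// Hypothetical helper class, not part of Jericho or the JDK.
class UserAgentExample {
    // Browser-like User-Agent string taken from the post above.
    static final String BROWSER_USER_AGENT =
        "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.2.16) "
        + "Gecko/20110319 Firefox/3.6.16 ( .NET CLR 3.5.30729)";

    // Creates a connection object with the User-Agent header set.
    // No request is sent until the caller connects or reads from it.
    static URLConnection openWithBrowserUserAgent(String address) {
        try {
            URLConnection connection = URI.create(address).toURL().openConnection();
            connection.setRequestProperty("User-Agent", BROWSER_USER_AGENT);
            return connection;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        URLConnection connection = openWithBrowserUserAgent(
            "http://www.lg.com/es/tv-audio-video/television/LG-led-37LV570S.jsp");
        // Prints the header value that will be sent with the request.
        System.out.println(connection.getRequestProperty("User-Agent"));
    }
}
```

    The returned connection can then be passed to new Source(urlConnection) exactly as in the code above.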

     
  • Guzman

    Guzman - 2011-04-12

    Thanks, Martin. It works.

    Greetings from Spain!

     
