From: David M. G. <mic...@gm...> - 2014-04-29 11:50:11
Hi all,

I crawled a page and HtmlPage.asText() did not return the desired result. I tracked it down to the table rows being ignored. The reason is that in the crawled page the table's tr elements are wrapped in a doc tag:

    <table width="100%" border="0" cellspacing="0" cellpadding="0">
      <tbody>
        <tr class="nopadding">
          <td> <a name="DOCNO_1"> </a> </td>
        </tr>
        <doc>
          <tr valign="baseline" height="8" class="toprow">
            <th width="6%" align="left" valign="center" nowrap="nowrap">
              <span>
                <input type="checkbox" id="frm_control_box" title="Click here to select or de-select all" name="frm_control_box" value="checkbox" onclick="javascript:subSetAllSelectionStatus()"/>
              </span>
            </th>
            <th width="94%" align="left" valign="center" nowrap="nowrap">
              <span> Results </span>
            </th>
          </tr>
          <tr class="noshaderow1st" style="padding-bottom: 8px;" height="8" valign="baseline">
            <td width="6%" align="left" nowrap="nowrap" valign="top">
              <input onclick="javascript:manageBox('1')" type="checkbox" value="1" name="frm_tagged_documents" title="Click here to deliver or to view tagged documents" id="frm_tagged_documents1"/>
              <label style="{cursor: pointer; cursor: hand;}" for="frm_tagged_documents1"> 1. </label>
            </td>
            <td width="94%" align="left" valign="top">
              <a href="aaa" target="_parent"> aaa </a>
              <br class="br"/>
              <span class="notranslate"> bbb </span>
              , November 19, 2011, Pg. 7, 758 words
            </td>
          </tr>
        </doc>
        <tr class="nopadding">
          <td> <a name="DOCNO_2"> </a> </td>
        </tr>
        <doc>
          <tr class="shaderow1st" style="padding-bottom: 8px;" height="8" valign="baseline">
            <td width="6%" align="left" nowrap="nowrap" valign="top">
              <input onclick="javascript:manageBox('2')" type="checkbox" value="2" name="frm_tagged_documents" title="Click here to deliver or to view tagged documents" id="frm_tagged_documents2"/>
              <label style="{cursor: pointer; cursor: hand;}" for="frm_tagged_documents2"> 2. </label>
            </td>
            <td width="94%" align="left" valign="top">
              <a href="ccc" target="_parent"> ddd </a>
              <br class="br"/>
              <span class="notranslate"> eee </span>
              , November 19, 2011, Pg. 18, 1216 words, MICHAEL HENDERSON
            </td>
          </tr>
        </doc>
      </tbody>

In Firefox the page is displayed fine. Is it somehow possible to tell HtmlUnit to ignore the doc tag and recurse into it to find the tr tags?

Thanks,
David
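
For illustration only, here is a minimal sketch of one possible workaround, not a confirmed HtmlUnit feature: strip the doc wrapper tags from the raw markup before it is parsed, so that the tr elements become direct children of tbody. The class name DocTagStripper and the method stripDocTags are hypothetical, the code uses only the Java standard library, and it assumes the raw page source is available as a String. How the cleaned markup is then fed back into HtmlUnit is deliberately left out.

```java
import java.util.regex.Pattern;

// Hypothetical helper (not part of HtmlUnit): removes <doc> wrapper tags
// from raw markup while keeping everything nested inside them.
final class DocTagStripper {

    // Matches <doc>, </doc>, <doc attr="x"> and <doc/>, case-insensitively.
    // Does not match <!DOCTYPE ...> or longer names such as <document>.
    private static final Pattern DOC_TAG =
            Pattern.compile("</?doc(?:\\s[^>]*)?/?>", Pattern.CASE_INSENSITIVE);

    private DocTagStripper() {
    }

    // Returns the markup with every doc start/end tag deleted.
    static String stripDocTags(String html) {
        return DOC_TAG.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String raw = "<table><tbody>"
                + "<doc><tr><td>aaa</td></tr></doc>"
                + "</tbody></table>";
        System.out.println(stripDocTags(raw));
    }
}
```

A regular expression is a blunt tool for HTML; it is used here only because the wrapper is a bare tag with a fixed name, and it would also remove any doc tags that appear inside scripts or comments.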