From: David M. G. <mic...@gm...> - 2014-04-29 12:33:50
Hi,

Looking at the function that creates the row iterator:

    private void setNextRow(final DomNode node) {
        nextRow_ = null;
        for (DomNode next = node; next != null; next = next.getNextSibling()) {
            if (next instanceof HtmlTableRow) {
                nextRow_ = (HtmlTableRow) next;
                return;
            }
            else if (currentGroup_ == null && next instanceof TableRowGroup) {
                currentGroup_ = (TableRowGroup) next;
                setNextRow(next.getFirstChild());
                return;
            }
        }
        if (currentGroup_ != null) {
            final DomNode group = currentGroup_;
            currentGroup_ = null;
            setNextRow(group.getNextSibling());
        }
    }

    public Iterator<HtmlTableRow> iterator() {
        return this;
    }
}

As we can see, it only descends into child elements if next is an instance
of TableRowGroup, such as tbody, but not if it is a custom tag.

I don't know whether this is the root cause, because maybe the HTML parser
should ignore these tags. Inspecting the element in Firefox (Inspect
Element), I see that Firefox does ignore them.

BR,
David

On Tue, Apr 29, 2014 at 2:50 PM, David Michael Gang <mic...@gm...> wrote:

> Hi all,
>
> I crawled a page and the HtmlPage.asText function did not return the
> desired result:
> I tracked it down that somehow that the table rows were ignored.
> The reason is that in the crawled page the table trs were wrapped into a
> doc tag.
>
> <table width="100%" border="0" cellspacing="0" cellpadding="0">
>   <tbody>
>     <tr class="nopadding">
>       <td>
>         <a name="DOCNO_1">
>         </a>
>       </td>
>     </tr>
>     <doc>
>       <tr valign="baseline" height="8" class="toprow">
>         <th width="6%" align="left" valign="center"
>             nowrap="nowrap">
>           <span>
>             <input type="checkbox" id="frm_control_box"
>                 title="Click here to select or de-select all" name="frm_control_box"
>                 value="checkbox" onclick="javascript:subSetAllSelectionStatus()"/>
>           </span>
>         </th>
>         <th width="94%" align="left" valign="center"
>             nowrap="nowrap">
>           <span>
>             Results
>           </span>
>         </th>
>       </tr>
>       <tr class="noshaderow1st" style="padding-bottom:
>           8px;" height="8" valign="baseline">
>         <td width="6%" align="left" nowrap="nowrap"
>             valign="top">
>           <input onclick="javascript:manageBox('1')"
>               type="checkbox" value="1" name="frm_tagged_documents" title="Click here to
>               deliver or to view tagged documents" id="frm_tagged_documents1"/>
>           <label style="{cursor: pointer; cursor: hand;}"
>               for="frm_tagged_documents1">
>             1.
>           </label>
>         </td>
>         <td width="94%" align="left" valign="top">
>           <a href="aaa" target="_parent">
>             aaa
>           </a>
>           <br class="br"/>
>           <span class="notranslate">
>             bbb
>           </span>
>           , November 19, 2011, Pg. 7, 758 words
>         </td>
>       </tr>
>     </doc>
>     <tr class="nopadding">
>       <td>
>         <a name="DOCNO_2">
>         </a>
>       </td>
>     </tr>
>     <doc>
>       <tr class="shaderow1st" style="padding-bottom: 8px;"
>           height="8" valign="baseline">
>         <td width="6%" align="left" nowrap="nowrap"
>             valign="top">
>           <input onclick="javascript:manageBox('2')"
>               type="checkbox" value="2" name="frm_tagged_documents" title="Click here to
>               deliver or to view tagged documents" id="frm_tagged_documents2"/>
>           <label style="{cursor: pointer; cursor: hand;}"
>               for="frm_tagged_documents2">
>             2.
>           </label>
>         </td>
>         <td width="94%" align="left" valign="top">
>           <a href="ccc" target="_parent">
>             ddd
>           </a>
>           <br class="br"/>
>           <span class="notranslate">
>             eee
>           </span>
>           , November 19, 2011, Pg. 18, 1216 words, MICHAEL
>           HENDERSON
>         </td>
>       </tr>
>     </doc>
>
>   </tbody>
> In firefox the page is displayed nice.
> Is it somehow possible to tell htmlunit to ignore the doc tag and recurse
> into it to find the tr tag?
>
>
> Thanks,
> David
>
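
To make the question about recursing into the doc tag concrete, here is a minimal sketch of a row search that descends into any wrapper element, not only into row groups. It is only an illustration of the traversal idea: Node and RowCollector are hypothetical stand-ins written for this example, not HtmlUnit classes, and this is not a patch for the row iterator shown in this thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class RowCollector {

    /** Hypothetical stand-in for a DOM node (assumption: not HtmlUnit's DomNode). */
    static final class Node {
        final String tag;
        final List<Node> children;

        Node(String tag, Node... children) {
            this.tag = tag;
            this.children = Arrays.asList(children);
        }
    }

    /** Collects the tr nodes below a table in document order. */
    static List<Node> collectRows(Node table) {
        List<Node> rows = new ArrayList<>();
        for (Node child : table.children) {
            collect(child, rows);
        }
        return rows;
    }

    private static void collect(Node node, List<Node> rows) {
        if (node.tag.equals("tr")) {
            // A row: record it and do not look inside, so rows of tables
            // nested in its cells are not reported.
            rows.add(node);
        } else if (!node.tag.equals("table")) {
            // tbody, doc, or any other wrapper: keep descending.
            for (Node child : node.children) {
                collect(child, rows);
            }
        }
    }

    public static void main(String[] args) {
        // Same shape as the quoted page: rows directly in tbody and rows wrapped in doc.
        Node table = new Node("table",
            new Node("tbody",
                new Node("tr"),
                new Node("doc", new Node("tr"), new Node("tr")),
                new Node("tr"),
                new Node("doc", new Node("tr"))));
        System.out.println(collectRows(table).size()); // prints 5
    }
}
```

Unlike setNextRow, which descends only once and only into a TableRowGroup, this sketch treats every non-row element as a transparent wrapper. Whether that is the right behaviour for HtmlUnit depends on how the parser is supposed to handle unknown tags inside a table.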