#1598 problems with crawling a table when html contains custom tags

Latest SVN
closed
None
1
2015-01-13
2014-04-30
david gang
No

This is a defect related to the thread
https://sourceforge.net/p/htmlunit/mailman/message/32284590/

I have a page (compressed as l.zip) in which a custom tag sits between the tbody and tr tags.
The asText function of HtmlPage does not descend into the custom tag to fetch the text.
Other parsers, such as jsoup, can access the text:

package test;

import java.io.File;
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupTest1 {

    public static void main(String[] args) throws IOException {
        // Parse the attached page; with a null charset jsoup reads the
        // encoding from the page's meta tag or falls back to UTF-8.
        File in = new File("l.html");
        Document doc = Jsoup.parse(in, null);
        Elements elems = doc.select("table");

        // Print the text of every table, including text under the custom tag.
        for (Element elem : elems) {
            System.out.println(elem.text());
        }
    }
}

The expected behavior is that asText returns the text content of the table, including the rows under the custom tag.

1 Attachments

Discussion

  • david gang
    2014-05-01

    (sorry for the typos)
    I tried to research the issue.
    I have the following code:

        final String html = "<html><head>\n"
                + "</head><body>\n"
                + "<table id='myId'>\n"
                + "<caption>This is the caption</caption>\n"
                + "<doc1>\n"
                + "<tr>\n"
                + "<td>cell 1,1</td>\n"
                + "<td>cell 1,2</td>\n"
                + "</tr>\n"
                + "</doc1>\n"
                + "<tr>\n"
                + "<td>cell 2,1</td>\n"
                + "<td>cell 2,2</td>\n"
                + "</tr>\n"
                + "</table>\n"
                + "</body></html>";

        // Dump the DOM that HtmlUnit built for the table.
        final HtmlPage page = loadPage(html);
        final HtmlElement table = page.getHtmlElementById("myId");
        System.out.println(table.asXml());
    

    gives:

    <table id="myId">
      <caption>
        This is the caption
      </caption>
      <doc1>
        <tbody>
          <tr>
            <td>
              cell 1,1
            </td>
            <td>
              cell 1,2
            </td>
          </tr>
        </tbody>
      </doc1>
      <tbody>
        <tr>
          <td>
            cell 2,1
          </td>
          <td>
            cell 2,2
          </td>
        </tr>
      </tbody>
    </table>
    

    This means that doc1, which is a child of the table, is ignored by the row iterator.
    I still don't know how to solve this issue.
    Does anyone have a hint?
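
To make the row-iterator observation concrete, here is a minimal sketch that models it with the JDK's own XML DOM API rather than HtmlUnit. It is not HtmlUnit's actual iterator code: the class `RowLookupSketch` and the methods `strictRows` and `lenientRows` are hypothetical names chosen for illustration, and `SAMPLE` is a condensed copy of the asXml output above. The strict lookup only accepts tr elements that are direct children of the table or of a direct tbody/thead/tfoot child; the lenient lookup also descends into unknown elements such as doc1.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class RowLookupSketch {

    // Condensed copy of the tree printed by asXml() in this post.
    static final String SAMPLE =
        "<table id='myId'>"
        + "<caption>This is the caption</caption>"
        + "<doc1><tbody><tr><td>cell 1,1</td><td>cell 1,2</td></tr></tbody></doc1>"
        + "<tbody><tr><td>cell 2,1</td><td>cell 2,2</td></tr></tbody>"
        + "</table>";

    // Parse a well-formed XML string and return its root element.
    static Element parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)))
            .getDocumentElement();
    }

    static boolean isRowGroup(String name) {
        return name.equals("tbody") || name.equals("thead") || name.equals("tfoot");
    }

    // Strict lookup: only direct tr children and tr children of direct row groups.
    static List<Element> strictRows(Element table) {
        List<Element> rows = new ArrayList<>();
        for (Node n = table.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (!(n instanceof Element)) {
                continue;
            }
            Element child = (Element) n;
            if (child.getTagName().equals("tr")) {
                rows.add(child);
            } else if (isRowGroup(child.getTagName())) {
                for (Node c = child.getFirstChild(); c != null; c = c.getNextSibling()) {
                    if (c instanceof Element && ((Element) c).getTagName().equals("tr")) {
                        rows.add((Element) c);
                    }
                }
            }
        }
        return rows;
    }

    // Lenient lookup: descend into every element except nested tables.
    static List<Element> lenientRows(Element table) {
        List<Element> rows = new ArrayList<>();
        collectRows(table, rows);
        return rows;
    }

    private static void collectRows(Node parent, List<Element> rows) {
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (!(n instanceof Element)) {
                continue;
            }
            Element child = (Element) n;
            if (child.getTagName().equals("tr")) {
                rows.add(child);
            } else if (!child.getTagName().equals("table")) {
                collectRows(child, rows);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Element table = parse(SAMPLE);
        for (Element row : strictRows(table)) {
            System.out.println("strict:  " + row.getTextContent());
        }
        for (Element row : lenientRows(table)) {
            System.out.println("lenient: " + row.getTextContent());
        }
    }
}
```

On this sample the strict lookup finds only the second row, while the lenient lookup finds both. Whether descending into unknown elements is the right fix is a separate question, since it depends on whether the parser should have placed doc1 inside the table at all.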

    Thanks,
    David

     
    Last edit: david gang 2014-05-01
  • Marc Guillemot
    2014-05-09

    This looks like a parsing problem: both Firefox and Chrome parse your HTML code to

    <doc1>
    </doc1>
    <table>
    ...
    </table>
    
     
  • Marc Guillemot
    2014-05-09

    • status: open --> accepted
    • assigned_to: Marc Guillemot
     
  • Marc Guillemot
    2014-05-12

    This should now be fixed in SVN. Thanks for reporting.

     
  • Marc Guillemot
    2014-05-12

    • status: accepted --> closed