urgent bug with invalid DOM

  • mironcaius

    mironcaius - 2009-10-21

    I have to decide whether to use the Cobra parser for my current project. The DOM support would help me a lot, but in my tests the DOM it produces does not match the HTML structure or what the DOM inspector shows.

    I am parsing Google and eBay with this code. What am I doing wrong?

        import java.io.InputStream;
        import java.io.InputStreamReader;
        import java.io.Reader;
        import java.net.URL;
        import org.lobobrowser.html.UserAgentContext;
        import org.lobobrowser.html.parser.DocumentBuilderImpl;
        import org.lobobrowser.html.parser.InputSourceImpl;
        import org.lobobrowser.html.test.SimpleUserAgentContext;
        import org.w3c.dom.Document;
        import org.w3c.dom.Node;
        import org.w3c.dom.NodeList;

        // TEST_URI is a String constant defined elsewhere in the class.
        public static void main(String[] args) throws Exception {
            URL url = new URL(TEST_URI);
            InputStream in = url.openConnection().getInputStream();
            try {
                Reader reader = new InputStreamReader(in, "UTF-8");
                // A document URI and a charset should be provided.
                InputSourceImpl inputSource = new InputSourceImpl(reader, TEST_URI);

                UserAgentContext context = new SimpleUserAgentContext();
                DocumentBuilderImpl dbi = new DocumentBuilderImpl(context);
                Document document = dbi.parse(inputSource);
                NodeList elementList = document.getChildNodes();
                //List elementList = source.getChildElements();
                displayDomTree(elementList.item(0), 0);
            } finally {
                // always release the connection's stream
                in.close();
            }
        }

        static void displayDomTree(Node child, int level) {
            if (child == null) {
                return;
            }
            // indent by depth, then print the depth and the node name
            for (int i = 0; i < level; i++) {
                System.out.print(" ");
            }
            System.out.println(level + " " + child.getNodeName());
            NodeList children = child.getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                Node element = children.item(i);
                // display the children; level + 1 keeps all siblings at the same depth
                displayDomTree(element, level + 1);
            }
        }

    And I get the following output:

       1 body
      2 textarea
      2 div

    Which is wrong, because after the textarea there is a script tag and an iframe. What am I doing wrong?
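
    For comparison, here is a minimal sketch of the same kind of recursive traversal that uses only the JDK's built-in XML parser rather than Cobra, so it can be run without any extra jars. The class name `DomTreeDemo`, the helper names `render` and `appendTree`, and the sample markup are assumptions made up for illustration. It only shows how the depth should be passed down (`level + 1` instead of incrementing `level` in place) so that siblings report the same depth; it says nothing about how Cobra handles scripts on a live page.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class DomTreeDemo {

    // Parses a small well-formed snippet and returns the indented tree as text.
    static String render(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document document = builder.parse(new InputSource(new StringReader(xml)));
            StringBuilder out = new StringBuilder();
            appendTree(document.getDocumentElement(), 0, out);
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample markup", e);
        }
    }

    // Appends one line per node: indentation, depth, node name.
    static void appendTree(Node node, int level, StringBuilder out) {
        if (node == null) {
            return;
        }
        for (int i = 0; i < level; i++) {
            out.append(' ');
        }
        out.append(level).append(' ').append(node.getNodeName()).append('\n');
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            // level + 1 leaves the local 'level' unchanged, so all siblings share a depth
            appendTree(children.item(i), level + 1, out);
        }
    }

    public static void main(String[] args) {
        // Hypothetical sample markup mirroring the elements mentioned in the post.
        System.out.print(render("<body><textarea/><script/><iframe/><div><p/></div></body>"));
    }
}
```

    With this sample input, `textarea`, `script`, `iframe` and `div` are all listed at depth 1 and `p` at depth 2.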

  • donnut

    donnut - 2009-10-21


    I tried to replicate the described behaviour.

    Indeed, the parser messes up the DOM. However, if you copy the page content into a local file and rerun the parser, the results are OK: it reports both the script and the iframe.
    When parsing the local file, Java starts off with some warning messages complaining about missing script files. The answer is probably somewhere in the scripts used, but I'm not a JavaScript geek…

  • mironcaius

    mironcaius - 2009-10-21

    Is it possible to disable JS and CSS parsing, or is there some other way to fix this?

