Urgent bug with invalid DOM

mironcaius
2009-10-21
2013-04-29
  • mironcaius
    2009-10-21

    Hello,
    I have to decide whether to use the Cobra parser for my current project. Its DOM support would help me a lot, but in my tests the DOM it builds does not match the HTML structure or what a DOM inspector shows.

    I am parsing Google and eBay with this code. What am I doing wrong?

        import java.io.InputStream;
        import java.io.InputStreamReader;
        import java.io.Reader;
        import java.net.URL;
        import java.util.logging.Level;
        import java.util.logging.Logger;
        import org.lobobrowser.html.UserAgentContext;
        import org.lobobrowser.html.parser.DocumentBuilderImpl;
        import org.lobobrowser.html.parser.InputSourceImpl;
        import org.lobobrowser.html.test.SimpleUserAgentContext;
        import org.w3c.dom.Document;
        import org.w3c.dom.Node;
        import org.w3c.dom.NodeList;

        // TEST_URI is a String constant defined elsewhere in the class (not shown).
        public static void main(String[] args) throws Exception {
            Logger.getLogger("org.lobobrowser").setLevel(Level.WARNING);
            URL url = new URL(TEST_URI);
            InputStream in = url.openConnection().getInputStream();
            try {
                // A document URI and a charset should be provided.
                Reader reader = new InputStreamReader(in, "UTF-8");
                InputSourceImpl inputSource = new InputSourceImpl(reader, TEST_URI);

                UserAgentContext context = new SimpleUserAgentContext();
                DocumentBuilderImpl dbi = new DocumentBuilderImpl(context);
                Document document = dbi.parse(inputSource);

                NodeList elementList = document.getChildNodes();
                displayDomTree(elementList.item(0), 0);
            } finally {
                in.close();
            }
        }

        // Prints the node name indented by its depth, then recurses into the children.
        static void displayDomTree(Node child, int level) {
            if (child == null) {
                return;
            }
            for (int i = 0; i < level; i++) {
                System.out.print(" ");
            }
            System.out.println(level + " " + child.getNodeName());
            NodeList children = child.getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                // Pass level + 1 rather than incrementing and decrementing level in place.
                displayDomTree(children.item(i), level + 1);
            }
        }

    And I get the following output:

       1 body
      2 textarea
      2 div

    This is wrong, because after the textarea there is a script tag and an iframe. What am I doing wrong?
    Thanks
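
    To check the traversal separately from the parser, here is a minimal sketch that runs the same kind of recursive walk over a tiny well-formed document using only the JDK's built-in XML parser, not Cobra. The class name `DomTreeSketch` and the sample markup are made up for illustration, and unlike the code above it prints element nodes only.

    ```java
    import java.io.StringReader;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.Document;
    import org.w3c.dom.Node;
    import org.w3c.dom.NodeList;
    import org.xml.sax.InputSource;

    // Hypothetical helper class; it uses the JDK's XML parser, not Cobra.
    class DomTreeSketch {

        // Parses well-formed XML/XHTML text into a W3C DOM document.
        static Document parse(String xml) throws Exception {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        }

        // Appends one line per element: indentation, depth, node name.
        static void render(Node node, int level, StringBuilder out) {
            if (node == null) {
                return;
            }
            if (node.getNodeType() == Node.ELEMENT_NODE) {
                for (int i = 0; i < level; i++) {
                    out.append(' ');
                }
                out.append(level).append(' ').append(node.getNodeName()).append('\n');
            }
            NodeList children = node.getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                // Pass level + 1 instead of mutating level inside the loop.
                render(children.item(i), level + 1, out);
            }
        }

        public static void main(String[] args) throws Exception {
            // Made-up sample markup with the same siblings as in the question.
            String page = "<html><body><textarea/><script/><iframe/><div/></body></html>";
            StringBuilder out = new StringBuilder();
            render(parse(page).getDocumentElement(), 0, out);
            System.out.print(out);
        }
    }
    ```

    With this input the walk lists script and iframe between textarea and div, so if the same walk over a Cobra document skips them, the difference would seem to come from the parsed document rather than from the traversal.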

     
  • donnut
    2009-10-21

    Hi,

    I tried to replicate the described behaviour.

    Indeed, the parser messes up the DOM. However, if you copy the content of www.google.com into a local file and rerun the parser, the results are OK! It reports both the script and the iframe.
    When parsing the local file, Java starts off with some warning messages complaining about missing script files. The answer is probably somewhere in the scripts used, but I'm not a JavaScript geek…
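
    The local-file comparison can be made repeatable by saving the exact bytes the server returns before parsing them. A minimal sketch, assuming only the JDK; the class name `PageSnapshot` and the command-line arguments are made up for illustration, and the saved file would then be handed to the parser in place of the network stream.

    ```java
    import java.io.InputStream;
    import java.net.URL;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.nio.file.StandardCopyOption;

    // Hypothetical helper; it only freezes the input and does no parsing itself.
    class PageSnapshot {

        // Copies everything from the stream into target, replacing an existing file.
        // Returns the number of bytes written.
        static long save(InputStream in, Path target) throws Exception {
            return Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }

        public static void main(String[] args) throws Exception {
            // args[0]: page URL, args[1]: local file name (both supplied by the caller).
            try (InputStream in = new URL(args[0]).openStream()) {
                long bytes = save(in, Paths.get(args[1]));
                System.out.println("Saved " + bytes + " bytes to " + args[1]);
            }
        }
    }
    ```

    Parsing the saved file and the live URL with the same code then shows whether the two runs really see the same markup.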

     
  • mironcaius
    2009-10-21

    Is it possible to turn off JS and CSS parsing, or is there some other way to fix this?