
How can I get a Document object

Help
2009-01-22
2014-07-09
  • David Martirosyan

    Is there a way to create an org.w3c.dom.Document object from the cleaned HTML?
    I want to use the Document with an XSLT file to reformat the HTML.

     
    • Vladimir Nikic

      Vladimir Nikic - 2009-01-22

      Yes, check documentation at:

      http://htmlcleaner.sourceforge.net/javause.php

      Regards, Vladimir.

       
    • David Martirosyan

      Thanks Vladimir,
      Here's the code I came up with.
      It seems to work.  Please let me know if this is the right way of doing it.

          public static Document getDocumentFromHtml(String html){
              try {
                  HtmlCleaner cleaner = new HtmlCleaner();
                  TagNode rootNode = cleaner.clean(html);
                  DomSerializer domSerializer = new DomSerializer(new CleanerProperties());

                  return domSerializer.createDOM(rootNode);
              } catch (ParserConfigurationException e) {
                  e.printStackTrace();
              } catch (IOException e) {
                  e.printStackTrace();
              }
              return null;
          }

       
      • Vladimir Nikic

        Vladimir Nikic - 2009-01-23

        That's it.
        Vladimir.
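
For the XSLT step mentioned in the original question, here is a minimal sketch that uses only the JDK's javax.xml.transform API. The class name XsltSketch, the tiny stylesheet, and the parse helper are assumptions made for illustration; parse merely stands in for the Document that DomSerializer.createDOM(rootNode) would return, so the sketch runs without HtmlCleaner on the classpath.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class XsltSketch {

    // Hypothetical stylesheet: writes out the text of every <h1> element.
    static final String STYLESHEET =
          "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:output method='text'/>"
        + "<xsl:template match='/'>"
        + "<xsl:for-each select='//h1'><xsl:value-of select='.'/></xsl:for-each>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    // Stand-in for the Document produced by DomSerializer (assumption for this sketch).
    static Document parse(String wellFormedXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(wellFormedXml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse input", e);
        }
    }

    // Applies the given XSLT source to the DOM and returns the result as a string.
    static String transform(Document doc, String xslt) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(xslt)));
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Transformation failed", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<html><body><h1>Hello</h1><p>text</p></body></html>");
        System.out.println(transform(doc, STYLESHEET));
    }
}
```

In real use, the Document passed to transform would be the one returned by getDocumentFromHtml above, and the stylesheet would normally be loaded from a file with new StreamSource(file) rather than from a string.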

         
  • Jose Aguirre

    Jose Aguirre - 2014-01-27

    Hello, I'm using the DomSerializer class to obtain a DOM object from a TagNode.

    When I debug the code, the Document object returned by DomSerializer.createDOM() is null.

    My code:

    public static Document transformResultPage(InputStream inputStream){
            HtmlCleaner cleaner = new HtmlCleaner(); // create the HtmlCleaner instance
            Document domTree;                        // result object
            DomSerializer domSerializer = new DomSerializer(cleaner.getProperties()); // DomSerializer using the cleaner's properties
    
            try{
                TagNode tagNode =  cleaner.clean(inputStream); // get the TagNode
                // At this point I can see that the tagNode object has data from the InputStream.
                domTree = domSerializer.createDOM(tagNode); // get the Document from the TagNode using the DomSerializer
                // At this point the value of domTree is null, and I don't know why!
                return domTree;
            }catch(IOException e){              
                e.printStackTrace();
                return null;
            }catch(ParserConfigurationException e){
                e.printStackTrace();
                return null;
            }
    }
    public static void main(String argv[]) throws IOException {
            URL pageUrl = new URL("http://www.google.com");
            URLConnection urlConnection = pageUrl.openConnection();
            Document dom = transformResultPage(urlConnection.getInputStream());
    }
    

    I hope you can help me

    Thanks

     
  • Scott Wilson

    Scott Wilson - 2014-01-28

    Hi Jose,

    There seems to be a problem with Java's DOMImplementation class when the input uses the HTML5 DocType. For example, if you add:

    tagNode.setDocType(null);

    ... to your code snippet above, then the output dom is as you would expect.

    I'll create a new issue for this - we should have a better workaround.

     
    • NAJAMUDDIN KHAN

      NAJAMUDDIN KHAN - 2014-04-21

      I am doing the same as Jose and tried setting the doctype to null, but I still don't get a DOM object. Is this still an open issue, or has it been fixed?

       
  • Scott Wilson

    Scott Wilson - 2014-04-22

    There seems to be an issue with the DOMImplementation in earlier JDKs (certainly 5 and some versions of 6). If you can use JDK 7 you should be fine.

    Otherwise, a workaround is to use a different DOCTYPE.
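
To check whether a given JDK is affected, one option is to exercise the JDK's DOMImplementation directly with an HTML5-style doctype (a name but no public or system identifier). This is only a diagnostic sketch: the class name DoctypeCheck is made up, and it assumes, as the posts above suggest, that the problem lies in how DOMImplementation handles that doctype, not in HtmlCleaner itself.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.DOMImplementation;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentType;

class DoctypeCheck {

    // Builds an empty document whose doctype mimics <!DOCTYPE html>.
    static Document createHtml5Document() {
        try {
            DOMImplementation impl = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .getDOMImplementation();
            // HTML5 doctype: qualified name only, no public or system identifier.
            DocumentType html5 = impl.createDocumentType("html", null, null);
            return impl.createDocument(null, "html", html5);
        } catch (Exception e) {
            throw new IllegalStateException("DOMImplementation rejected the doctype", e);
        }
    }

    public static void main(String[] args) {
        Document doc = createHtml5Document();
        // Test the reference itself instead of relying on what toString() prints.
        System.out.println("reference is null: " + (doc == null));
        System.out.println("java.version: " + System.getProperty("java.version"));
        if (doc != null) {
            System.out.println("root element: " + doc.getDocumentElement().getNodeName());
        }
    }
}
```

When checking the result in a debugger, it is worth comparing the reference against null explicitly: with the JDK's bundled DOM implementation, a Document's toString() usually renders as "[#document: null]", which shows the node value and does not by itself mean the reference is null.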

     
  • Alaa

    Alaa - 2014-06-13

    Hi, I am trying to extract some information using the code below, but I get the following result:

    run3:

    span=>Samsung UN39FH5000 39-Inch 1080p 60Hz LED TV
    del=>$549.99
    span=>

    span=>LG Electronics 42LB6300 42-Inch 1080p 120Hz Smart LED TV
    del=>$849.99
    span=>

    span=>Seiki SE32HY10 32-Inch 720p 60Hz LED HDTV (Black)
    del=>$289.99
    span=>$179.99

    span=>LG Electronics 39LN5300 39-Inch 1080p 60Hz LED TV
    del=>$479.99
    span=>

    span=>VIZIO E480i-B2 48-Inch 1080p Smart LED HDTV
    del=>$589.99
    span=>$568.00

    span=>Samsung UN65H6350 65-Inch 1080p 120Hz Smart LED TV
    del=>$2,199.99
    span=>

    span=>Samsung UN50EH6000 50-Inch 1080p 120Hz LED HDTV (2013 Model)
    del=>$1,249.99
    span=>$797.99

    span=>TCL LE40FHDE3010 40-Inch 1080p 60Hz LED HDTV (Black)
    del=>$349.99
    span=>$279.99

    span=>Samsung UN32H6350 32-Inch 1080p 120Hz Smart LED TV
    del=>$649.99
    span=>

    span=>VIZIO E320i-B2 32-Inch 720p 60Hz Smart LED HDTV
    del=>$269.99
    span=>$252.10

    span=>LG Electronics 42LN5400 42-Inch 1080p 120Hz LED TV
    del=>$699.00
    span=>

    span=>Samsung UN55H7150 55-Inch 1080p 240Hz 3D Smart LED TV
    del=>$1,899.99
    span=>$1,497.99

    span=>Sharp LC-80LE650U 80-inch Aquos HD 1080p 120Hz Smart LED TV
    del=>$4,999.99
    span=>

    span=>Samsung UN19F4000 19-Inch 720p 60Hz Slim LED HDTV
    del=>$229.00
    span=>

    span=>Samsung UN40H5500 40-Inch 1080p 60Hz Smart LED TV
    del=>$629.99
    span=>

    span=>Samsung UN75H6350 75-Inch 1080p 120Hz Smart LED TV
    del=>$4,299.99
    span=>

    span=>Samsung UN60H7150 60-Inch 1080p 240Hz 3D Smart LED TV
    del=>$2,199.99
    span=>$1,797.99

    span=>Samsung UN65HU8550 65-Inch 4K Ultra HD 120Hz 3D Smart LED HDTV
    del=>$3,999.99
    span=>$3,297.99

    span=>Samsung UN32H5500 32-Inch 1080p 60Hz Smart LED TV
    del=>$479.99
    span=>

    span=>VIZIO M801d-A3R 80-Inch 1080p LED 3D Smart TV with 8 3D glasses (2013 Model)
    del=>$3,799.99
    span=>$2,999.99

    span=>LG Electronics 47LB6300 47-Inch 1080p 60Hz Smart LED TV
    del=>$999.99
    span=>

    span=>VIZIO M701d-A3R 70-Inch 1080p 3D Smart LED HDTV
    del=>$2,499.99
    span=>$1,999.99

    span=>Samsung UN46EH5000 46-Inch 1080p 60Hz LED HDTV (Black)
    del=>$699.99
    span=>

    span=>VIZIO E280i-B1 28-Inch 720p 60Hz Smart LED HDTV
    del=>$229.99
    span=>$228.00

    As you can see, I am not able to get some of the prices even though they are there on the page. I tried printing the TagNode in the readDocument method
    as follows:
    TagNode node = cleaner.clean(reader);
    System.out.println(node.getText()) ;
    and the data is already missing after cleaning. Are there any properties I can set to get all the data from the source? I appreciate any help you can provide.

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.InputStreamReader;
    import java.net.URL;
    import java.net.URLConnection;
    import java.nio.charset.Charset;
    import java.util.List;
    import java.util.logging.Level;
    import java.util.logging.Logger;
    import javax.xml.parsers.ParserConfigurationException;
    import javax.xml.xpath.XPath;
    import javax.xml.xpath.XPathConstants;
    import javax.xml.xpath.XPathExpressionException;
    import javax.xml.xpath.XPathFactory;
    import org.htmlcleaner.CleanerProperties;
    import org.htmlcleaner.ContentNode;
    import org.htmlcleaner.DomSerializer;
    import org.htmlcleaner.HtmlCleaner;
    import org.htmlcleaner.TagNode;
    import org.htmlcleaner.XPatherException;
    import org.w3c.dom.Document;
    import org.w3c.dom.NodeList;

    /*
     * To change this template, choose Tools | Templates
     * and open the template in the editor.
     */

    public class TestXPath {

    public static void main(String args[]) {
        TestXPath txpath = new TestXPath();
        TagNode node = txpath.readDocument();
        txpath.run3(node);
    
    }
    
    private void run3(TagNode node) {
        System.out.println("run3:");
        try {
            Object[] categoryResult = node.evaluateXPath("//*[@id=\"atfResults\"]");
    
            if (categoryResult.length > 0) {
    
                TagNode categoryNode = (TagNode) categoryResult[0];
                // get name, price, discount ratio, and rating
                TagNode[] itemList = categoryNode.getChildTags();
    
                for(TagNode tag : itemList){
    
                    Object[] nameObj = tag.evaluateXPath("//span[@class=\"lrg bold\"]") ;
                    Object[] priceObj = tag.evaluateXPath("//li[@class=\"newp\"]/div/a/del") ; 
                    Object[] discountPriceObj = tag.evaluateXPath("//li[@class=\"newp\"]/div/a/span") ;
    
                    if(nameObj.length > 0){
                        TagNode nameTN = (TagNode) nameObj[0] ; 
                        System.out.println(nameTN +  "=>" + nameTN.getText().toString().trim());
                    }
                    if(priceObj.length > 0){
                        TagNode priceTN = (TagNode) priceObj[0];
                        System.out.println(priceTN + "=>" + priceTN.getText());
                    }
    
    
                    if(discountPriceObj.length > 0){
                        TagNode discountPriceTN = (TagNode) discountPriceObj[0] ; 
                        System.out.println(discountPriceTN + "=>" + discountPriceTN.getText().toString().trim());
                    }
    
                    System.out.println();
                   // nameObj = null ; 
                   // priceObj = null ;
    
    
                     Thread.sleep(100);
                }
    
            }
        } catch (XPatherException ex) {
            ex.printStackTrace();
            Logger.getLogger(TestXPath.class.getName()).log(Level.SEVERE, null, ex);
        } catch (InterruptedException ex) {
            Logger.getLogger(TestXPath.class.getName()).log(Level.SEVERE, null, ex);
        }
    
    }
    
    private TagNode readDocument() {
        try {
    
            String content = null;
            String strurl = "http://www.amazon.com/s/ref=lp_172659_pg_2?rh=n%3A172282%2Cn%3A!493964%2Cn%3A1266092011%2Cn%3A172659&page=2&ie=UTF8&qid=1402610812";
            HtmlCleaner cleaner = new HtmlCleaner();
            CleanerProperties props = cleaner.getProperties();
    
    
            // open a connection to the desired URL
            URL url = new URL(strurl);
            URLConnection conn = url.openConnection();
    
            InputStream in = conn.getInputStream();
            InputStreamReader reader = new InputStreamReader(in, "UTF-8");
    
            //use the cleaner to "clean" the HTML and return it as a TagNode object
            TagNode node = cleaner.clean(reader);
            return node;
    
        } catch (Exception exp) {
             exp.printStackTrace();
        }
        return null;
    }
    

    }

     
  • Scott Wilson

    Scott Wilson - 2014-07-09

    Some of those span tags you are looking for are empty in the page anyway, e.g.:

      <li class="newp">
        <div class="">
        <a href="">
        <del class="grey">$1,699.99</del>
        <span class="bld lrg red"> </span>
        </a><a href="http://www.amazon.com/Samsung-UN60H6350-60-Inch-1080p-120Hz/dp/B00ID2HGQ8/ref=sr_du_25_map?s=tv&amp;ie=UTF8&amp;qid=1404898572&amp;sr=1-25" class="map_popover" id="map_du_25">Click for product details</a>
    <span class="srSprite sprPrime"></span>
                </div>
        </li>
    

    So that would be consistent with your output:

    span=>Samsung UN39FH5000 39-Inch 1080p 60Hz LED TV
    del=>$549.99
    span=>
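
The same effect can be reproduced without HtmlCleaner. The sketch below parses a trimmed copy of the fragment quoted above with the JDK's own parser and runs equivalent XPath queries through javax.xml.xpath; the class name EmptySpanSketch and the textAt helper are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class EmptySpanSketch {

    // Trimmed copy of the <li class="newp"> fragment from the page.
    static final String FRAGMENT =
          "<li class=\"newp\"><div class=\"\"><a href=\"\">"
        + "<del class=\"grey\">$1,699.99</del>"
        + "<span class=\"bld lrg red\"> </span>"
        + "</a></div></li>";

    // Returns the trimmed text of the first node matched by the XPath expression.
    static String textAt(String xml, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(expression, doc).trim();
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        // The <del> element carries the list price.
        System.out.println("del=>" + textAt(FRAGMENT, "//li[@class='newp']/div/a/del"));
        // The <span> element exists but contains only whitespace.
        System.out.println("span=>" + textAt(FRAGMENT, "//li[@class='newp']/div/a/span"));
    }
}
```

Because the span is present but holds only whitespace, the query matches and the trimmed text is empty, so a check such as discountPriceObj.length > 0 still passes and an empty value is printed.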

     
