HtmlCleaner / Discussion / Help: How can I get a Document object

David Martirosyan - 2009-01-22

Is there a way to create an org.w3c.dom.Document object from the cleaned HTML?
I want to use the Document with an XSLT file to reformat the HTML.

- Vladimir Nikic - 2009-01-22
  
  Yes, check documentation at:
  
  http://htmlcleaner.sourceforge.net/javause.php
  
  Regards, Vladimir.
  
- David Martirosyan - 2009-01-23
  
  Thanks Vladimir,
  Here's the code I came up with.
  It seems to work. Please let me know if this is the right way of doing it.
  
      public static Document getDocumentFromHtml(String html){
          try {
              HtmlCleaner cleaner = new HtmlCleaner();
              TagNode rootNode = cleaner.clean(html);
              DomSerializer domSerializer = new DomSerializer(new CleanerProperties());
  
              return domSerializer.createDOM(rootNode);
          } catch (ParserConfigurationException e) {
              e.printStackTrace();
          } catch (IOException e) {
              e.printStackTrace();
          }
          return null;
      }
  
  - Vladimir Nikic - 2009-01-23
    
    That's it.
    Vladimir.
    
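For the XSLT step mentioned in the original question, here is a minimal sketch that uses only the JDK's javax.xml.transform API. The class name, the inline stylesheet, and the hand-built sample document are hypothetical; the sample document merely stands in for whatever DomSerializer.createDOM(rootNode) returns.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class XsltOnDom {

    // Hypothetical stylesheet: copies the text of every <h1> into a <title> element.
    static final String XSLT =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
      + "<xsl:template match='/'>"
      + "<titles><xsl:for-each select='//h1'>"
      + "<title><xsl:value-of select='.'/></title>"
      + "</xsl:for-each></titles>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Applies the stylesheet text to an in-memory DOM document and returns the result as a string.
    static String transform(Document doc, String xslt) throws TransformerException {
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xslt)));
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    // Stand-in for the Document that DomSerializer.createDOM(rootNode) would return.
    static Document sampleDocument() throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        String xhtml = "<html><head/><body><h1>Hello</h1><p>text</p></body></html>";
        return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xhtml)));
    }

    // Runs the whole pipeline, wrapping checked exceptions so callers need no throws clause.
    static String demo() {
        try {
            return transform(sampleDocument(), XSLT);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

In real use the stylesheet would more likely be loaded with new StreamSource(new File(...)), and the Document would come from getDocumentFromHtml(html) above.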

Hello, I'm using the DomSerializer class to obtain a DOM object from a TagNode.

When I debug the code, the Document object returned by DomSerializer.createDOM() is null.

My code:

public static Document transformResultPage(InputStream inputStream) {
        HtmlCleaner cleaner = new HtmlCleaner(); // create the HtmlCleaner instance
        Document domTree;                        // result object
        DomSerializer domSerializer = new DomSerializer(cleaner.getProperties()); // serializer using the cleaner's properties

        try {
            TagNode tagNode = cleaner.clean(inputStream); // get the TagNode
            // At this point I can see that tagNode holds data from the InputStream.
            domTree = domSerializer.createDOM(tagNode); // get the Document from the TagNode using the DomSerializer
            // At this point the value of domTree is null, and I don't know why!
            return domTree;
        } catch (IOException e) {
            e.printStackTrace();
            return null;
        } catch (ParserConfigurationException e) {
            e.printStackTrace();
            return null;
        }
}
public static void main(String[] argv) throws IOException {
        URL pageUrl = new URL("http://www.google.com");
        URLConnection urlConnection = pageUrl.openConnection();
        Document dom = transformResultPage(urlConnection.getInputStream());
}

I hope you can help me

Thanks

Scott Wilson - 2014-01-28

Hi Jose,

There seems to be a problem with Java's DOMImplementation class when the input uses the HTML5 DocType. For example, if you add:

tagNode.setDocType(null);

... to your code snippet above, then the output dom is as you would expect.

I'll create a new issue for this - we should have a better workaround.
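Separately from the doctype problem, one thing worth ruling out when a debugger shows "null": on JDKs whose bundled DOM implementation is Xerces-based, a Document's toString() typically renders as [#document: null] (node name plus node value) even when the tree is populated. The sketch below, which uses only JDK classes and hypothetical names, inspects the root element instead. The hand-built document is an assumption standing in for the result of domSerializer.createDOM(tagNode).

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class DomNullCheck {

    // Describes a Document by its root element rather than by toString().
    static String describe(Document doc) {
        if (doc == null) {
            return "reference is null";
        }
        if (doc.getDocumentElement() == null) {
            return "document has no root element";
        }
        return "root=" + doc.getDocumentElement().getNodeName()
             + ", children=" + doc.getDocumentElement().getChildNodes().getLength();
    }

    // Stand-in for the result of domSerializer.createDOM(tagNode).
    static Document sample() {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader("<html><head/><body/></html>")));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = sample();
        System.out.println(doc);           // the toString() form, which can be misleading
        System.out.println(describe(doc)); // root element name and child count
    }
}
```

If describe reports a root element with children, the Document is fine and only its debugger rendering is confusing. If it reports a null reference or no root element, the doctype workaround above is the thing to try.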

- NAJAMUDDIN KHAN - 2014-04-21
  
  I am doing the same as Jose and tried setting the doctype to null, but I still don't get a DOM object. Is this issue still open, or has it been fixed?
  

Scott Wilson - 2014-04-22

There seems to be an issue with the DOMImplementation in earlier JDKs (certainly 5 and some versions of 6). If you can use JDK 7 you should be fine.

Otherwise, a workaround is to use a different DOCTYPE.


Hi, I am trying to extract some information using the code below, but I get the following result:

run3:

span=>Samsung UN39FH5000 39-Inch 1080p 60Hz LED TV
del=>$549.99
span=>

span=>LG Electronics 42LB6300 42-Inch 1080p 120Hz Smart LED TV
del=>$849.99
span=>

span=>Seiki SE32HY10 32-Inch 720p 60Hz LED HDTV (Black)
del=>$289.99
span=>$179.99

span=>LG Electronics 39LN5300 39-Inch 1080p 60Hz LED TV
del=>$479.99
span=>

span=>VIZIO E480i-B2 48-Inch 1080p Smart LED HDTV
del=>$589.99
span=>$568.00

span=>Samsung UN65H6350 65-Inch 1080p 120Hz Smart LED TV
del=>$2,199.99
span=>

span=>Samsung UN50EH6000 50-Inch 1080p 120Hz LED HDTV (2013 Model)
del=>$1,249.99
span=>$797.99

span=>TCL LE40FHDE3010 40-Inch 1080p 60Hz LED HDTV (Black)
del=>$349.99
span=>$279.99

span=>Samsung UN32H6350 32-Inch 1080p 120Hz Smart LED TV
del=>$649.99
span=>

span=>VIZIO E320i-B2 32-Inch 720p 60Hz Smart LED HDTV
del=>$269.99
span=>$252.10

span=>LG Electronics 42LN5400 42-Inch 1080p 120Hz LED TV
del=>$699.00
span=>

span=>Samsung UN55H7150 55-Inch 1080p 240Hz 3D Smart LED TV
del=>$1,899.99
span=>$1,497.99

span=>Sharp LC-80LE650U 80-inch Aquos HD 1080p 120Hz Smart LED TV
del=>$4,999.99
span=>

span=>Samsung UN19F4000 19-Inch 720p 60Hz Slim LED HDTV
del=>$229.00
span=>

span=>Samsung UN40H5500 40-Inch 1080p 60Hz Smart LED TV
del=>$629.99
span=>

span=>Samsung UN75H6350 75-Inch 1080p 120Hz Smart LED TV
del=>$4,299.99
span=>

span=>Samsung UN60H7150 60-Inch 1080p 240Hz 3D Smart LED TV
del=>$2,199.99
span=>$1,797.99

span=>Samsung UN65HU8550 65-Inch 4K Ultra HD 120Hz 3D Smart LED HDTV
del=>$3,999.99
span=>$3,297.99

span=>Samsung UN32H5500 32-Inch 1080p 60Hz Smart LED TV
del=>$479.99
span=>

span=>VIZIO M801d-A3R 80-Inch 1080p LED 3D Smart TV with 8 3D glasses (2013 Model)
del=>$3,799.99
span=>$2,999.99

span=>LG Electronics 47LB6300 47-Inch 1080p 60Hz Smart LED TV
del=>$999.99
span=>

span=>VIZIO M701d-A3R 70-Inch 1080p 3D Smart LED HDTV
del=>$2,499.99
span=>$1,999.99

span=>Samsung UN46EH5000 46-Inch 1080p 60Hz LED HDTV (Black)
del=>$699.99
span=>

span=>VIZIO E280i-B1 28-Inch 720p 60Hz Smart LED HDTV
del=>$229.99
span=>$228.00

As you can see, some of the prices are missing from the output even though they are present on the page. I tried printing the TagNode in the readDocument method as follows:
TagNode node = cleaner.clean(reader);
System.out.println(node.getText()) ;
and the data is already missing after cleaning. Are there any properties I can set to get all the data from the source? I appreciate any help you can provide.

import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.Charset;
import java.util.List;
import java.util.logging.Level;
import java.util.logging.Logger;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.htmlcleaner.CleanerProperties;
import org.htmlcleaner.ContentNode;
import org.htmlcleaner.DomSerializer;
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;
import org.htmlcleaner.XPatherException;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

/*
 * To change this template, choose Tools | Templates
 * and open the template in the editor.
 */

public class TestXPath {

public static void main(String args[]) {
    TestXPath txpath = new TestXPath();
    TagNode node = txpath.readDocument();
    txpath.run3(node);

}

private void run3(TagNode node) {
    System.out.println("run3:");
    try {
        Object[] categoryResult = node.evaluateXPath("//*[@id=\"atfResults\"]");

        if (categoryResult.length > 0) {

            TagNode categoryNode = (TagNode) categoryResult[0];
            // get name, price, discount ratio, and rating
            TagNode[] itemList = categoryNode.getChildTags();

            for(TagNode tag : itemList){

                Object[] nameObj = tag.evaluateXPath("//span[@class=\"lrg bold\"]") ;
                Object[] priceObj = tag.evaluateXPath("//li[@class=\"newp\"]/div/a/del") ; 
                Object[] discountPriceObj = tag.evaluateXPath("//li[@class=\"newp\"]/div/a/span") ;

                if(nameObj.length > 0){
                    TagNode nameTN = (TagNode) nameObj[0] ; 
                    System.out.println(nameTN +  "=>" + nameTN.getText().toString().trim());
                }
                if(priceObj.length > 0){
                    TagNode priceTN = (TagNode) priceObj[0];
                    System.out.println(priceTN + "=>" + priceTN.getText());
                }


                if(discountPriceObj.length > 0){
                    TagNode discountPriceTN = (TagNode) discountPriceObj[0] ; 
                    System.out.println(discountPriceTN + "=>" + discountPriceTN.getText().toString().trim());
                }

                System.out.println();
                // nameObj = null ; 
                // priceObj = null ;

                Thread.sleep(100);
            }

        }
    } catch (XPatherException ex) {
        ex.printStackTrace();
        Logger.getLogger(TestXPath.class.getName()).log(Level.SEVERE, null, ex);
    } catch (InterruptedException ex) {
        Logger.getLogger(TestXPath.class.getName()).log(Level.SEVERE, null, ex);
    }

}

private TagNode readDocument() {
    try {

        String content = null;
        String strurl = "http://www.amazon.com/s/ref=lp_172659_pg_2?rh=n%3A172282%2Cn%3A!493964%2Cn%3A1266092011%2Cn%3A172659&page=2&ie=UTF8&qid=1402610812";
        HtmlCleaner cleaner = new HtmlCleaner();
        CleanerProperties props = cleaner.getProperties();


        // open a connection to the desired URL
        URL url = new URL(strurl);
        URLConnection conn = url.openConnection();

        InputStream in = conn.getInputStream();
        InputStreamReader reader = new InputStreamReader(in, "UTF-8");

        //use the cleaner to "clean" the HTML and return it as a TagNode object
        TagNode node = cleaner.clean(reader);
        return node;

    } catch (Exception exp) {
         exp.printStackTrace();
    }
    return null;
}

}

Some of the span tags you are looking for are empty in the page anyway, e.g.:

    <li class="newp">
      <div class="">
        <a href="">
          <del class="grey">$1,699.99</del>
          <span class="bld lrg red"> </span>
        </a><a href="http://www.amazon.com/Samsung-UN60H6350-60-Inch-1080p-120Hz/dp/B00ID2HGQ8/ref=sr_du_25_map?s=tv&amp;ie=UTF8&amp;qid=1404898572&amp;sr=1-25" class="map_popover" id="map_du_25">Click for product details</a>
        <span class="srSprite sprPrime"></span>
      </div>
    </li>

So that would be consistent with your output:

span=>Samsung UN39FH5000 39-Inch 1080p 60Hz LED TV
del=>$549.99
span=>
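To check this without hitting the live page, the markup quoted above can be run through the JDK's own XPath engine. This is only a sketch: the class name is hypothetical, the attributes are re-quoted with single quotes, and the long product link is shortened to '#'. It uses the same path expressions as run3 and shows that the del element carries a price while the span holds only whitespace.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class EmptySpanCheck {

    // Trimmed copy of the markup quoted above; the second link's href is shortened (assumption).
    static final String SNIPPET =
        "<li class='newp'><div class=''>"
      + "<a href=''><del class='grey'>$1,699.99</del><span class='bld lrg red'> </span></a>"
      + "<a href='#' class='map_popover'>Click for product details</a>"
      + "<span class='srSprite sprPrime'></span>"
      + "</div></li>";

    // Evaluates an XPath expression against the snippet and returns
    // the trimmed text of the first matching node ("" if nothing matches).
    static String firstText(String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(SNIPPET)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(expression, doc).trim();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("del=>" + firstText("//li[@class='newp']/div/a/del"));
        System.out.println("span=>" + firstText("//li[@class='newp']/div/a/span"));
    }
}
```

The empty span=> lines therefore reflect what the server sent for those items, not data lost during cleaning.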
