Is there a way to create an org.w3c.dom.Document object from the cleaned HTML?
I want to use the Document with an XSLT file to reformat the HTML.
Yes, check documentation at:
http://htmlcleaner.sourceforge.net/javause.php
Regards, Vladimir.
Thanks Vladimir,
Here's the code I came up with.
It seems to work. Please let me know if this is the right way of doing it.
public static Document getDocumentFromHtml(String html) {
    try {
        HtmlCleaner cleaner = new HtmlCleaner();
        TagNode rootNode = cleaner.clean(html);
        DomSerializer domSerializer = new DomSerializer(new CleanerProperties());
        return domSerializer.createDOM(rootNode);
    } catch (ParserConfigurationException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }
    return null;
}
That's it.
Vladimir.
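For the XSLT step mentioned in the original question, here is a rough sketch that uses only the JDK's javax.xml.transform classes. It assumes you already have an org.w3c.dom.Document (for example from getDocumentFromHtml above). The parse() helper, the sample markup, and the stylesheet are hypothetical stand-ins so that the snippet runs without HtmlCleaner on the classpath.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class XsltSketch {

    // Hypothetical stylesheet: emits the text of every <span>, each followed by ';'.
    static final String STYLESHEET =
          "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:output method='text'/>"
        + "<xsl:template match='/'>"
        + "<xsl:for-each select='//span'>"
        + "<xsl:value-of select='.'/><xsl:text>;</xsl:text>"
        + "</xsl:for-each>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    // Stand-in for getDocumentFromHtml(): parses well-formed markup with the JDK parser.
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample markup", e);
        }
    }

    // Applies the stylesheet to the DOM and returns the result as a String.
    static String transform(Document doc, String stylesheet) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(stylesheet)));
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("XSLT transform failed", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<html><body><span>A</span><span>B</span></body></html>");
        System.out.println(transform(doc, STYLESHEET));
    }
}
```

In real use you would probably load the stylesheet from a file with new StreamSource(new File(...)) and pass in the Document produced by DomSerializer instead of the parse() stand-in.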
Hello, I'm using the DomSerializer class to obtain a DOM object from a TagNode.
When I debug the code, the Document object returned by DomSerializer.createDOM() is null.
My code:
I hope you can help me
Thanks
Hi Jose,
There seems to be a problem with Java's DOMImplementation class when the input uses the HTML5 DocType. For example, if you add:
tagNode.setDocType(null);
... to your code snippet above, then the output DOM is as you would expect.
I'll create a new issue for this - we should have a better workaround.
I am doing the same as Jose and tried setting the doctype to null, but I still don't get a DOM object. Is this still an open issue, or has it been fixed?
There seems to be an issue with the DOMImplementation in earlier JDKs (certainly 5 and some versions of 6). If you can use JDK 7 you should be fine.
Otherwise, a workaround is to use a different DOCTYPE.
Hi, I am trying to extract some information using the code below, but I get the following result:
run3:
span=>Samsung UN39FH5000 39-Inch 1080p 60Hz LED TV
del=>$549.99
span=>
span=>LG Electronics 42LB6300 42-Inch 1080p 120Hz Smart LED TV
del=>$849.99
span=>
span=>Seiki SE32HY10 32-Inch 720p 60Hz LED HDTV (Black)
del=>$289.99
span=>$179.99
span=>LG Electronics 39LN5300 39-Inch 1080p 60Hz LED TV
del=>$479.99
span=>
span=>VIZIO E480i-B2 48-Inch 1080p Smart LED HDTV
del=>$589.99
span=>$568.00
span=>Samsung UN65H6350 65-Inch 1080p 120Hz Smart LED TV
del=>$2,199.99
span=>
span=>Samsung UN50EH6000 50-Inch 1080p 120Hz LED HDTV (2013 Model)
del=>$1,249.99
span=>$797.99
span=>TCL LE40FHDE3010 40-Inch 1080p 60Hz LED HDTV (Black)
del=>$349.99
span=>$279.99
span=>Samsung UN32H6350 32-Inch 1080p 120Hz Smart LED TV
del=>$649.99
span=>
span=>VIZIO E320i-B2 32-Inch 720p 60Hz Smart LED HDTV
del=>$269.99
span=>$252.10
span=>LG Electronics 42LN5400 42-Inch 1080p 120Hz LED TV
del=>$699.00
span=>
span=>Samsung UN55H7150 55-Inch 1080p 240Hz 3D Smart LED TV
del=>$1,899.99
span=>$1,497.99
span=>Sharp LC-80LE650U 80-inch Aquos HD 1080p 120Hz Smart LED TV
del=>$4,999.99
span=>
span=>Samsung UN19F4000 19-Inch 720p 60Hz Slim LED HDTV
del=>$229.00
span=>
span=>Samsung UN40H5500 40-Inch 1080p 60Hz Smart LED TV
del=>$629.99
span=>
span=>Samsung UN75H6350 75-Inch 1080p 120Hz Smart LED TV
del=>$4,299.99
span=>
span=>Samsung UN60H7150 60-Inch 1080p 240Hz 3D Smart LED TV
del=>$2,199.99
span=>$1,797.99
span=>Samsung UN65HU8550 65-Inch 4K Ultra HD 120Hz 3D Smart LED HDTV
del=>$3,999.99
span=>$3,297.99
span=>Samsung UN32H5500 32-Inch 1080p 60Hz Smart LED TV
del=>$479.99
span=>
span=>VIZIO M801d-A3R 80-Inch 1080p LED 3D Smart TV with 8 3D glasses (2013 Model)
del=>$3,799.99
span=>$2,999.99
span=>LG Electronics 47LB6300 47-Inch 1080p 60Hz Smart LED TV
del=>$999.99
span=>
span=>VIZIO M701d-A3R 70-Inch 1080p 3D Smart LED HDTV
del=>$2,499.99
span=>$1,999.99
span=>Samsung UN46EH5000 46-Inch 1080p 60Hz LED HDTV (Black)
del=>$699.99
span=>
span=>VIZIO E280i-B1 28-Inch 720p 60Hz Smart LED HDTV
del=>$229.99
span=>$228.00
As you can see, some of the prices are missing from the output even though they are present in the page. I tried printing the TagNode in the readDocument method as follows:
    TagNode node = cleaner.clean(reader);
    System.out.println(node.getText());
and the data is already missing after cleaning. Are there any properties I can set to get all the data from the source? I appreciate any help you can provide.
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.Charset;
import java.util.List;
import java.util.logging.Level;
import java.util.logging.Logger;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.htmlcleaner.CleanerProperties;
import org.htmlcleaner.ContentNode;
import org.htmlcleaner.DomSerializer;
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;
import org.htmlcleaner.XPatherException;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
/*
*/
public class TestXPath {
}
Some of the span tags you are looking for are empty in the page anyway, e.g.:
So that would be consistent with your output:
span=>Samsung UN39FH5000 39-Inch 1080p 60Hz LED TV
del=>$549.99
span=>
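If it helps to confirm this on your side, here is a small sketch that counts empty versus non-empty span elements with the JDK's javax.xml.xpath. The SAMPLE markup is a hypothetical reconstruction based on the output above, not the real page source, and the select() helper is only for illustration. In your code the Document would come from DomSerializer rather than from the JDK parser.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class EmptySpanSketch {

    // Hypothetical markup modelled on the printed output; the real page is not shown here.
    static final String SAMPLE =
          "<div>"
        + "<span>Samsung UN39FH5000 39-Inch 1080p 60Hz LED TV</span>"
        + "<del>$549.99</del>"
        + "<span></span>"
        + "<span>Seiki SE32HY10 32-Inch 720p 60Hz LED HDTV (Black)</span>"
        + "<del>$289.99</del>"
        + "<span>$179.99</span>"
        + "</div>";

    // Evaluates an XPath expression and returns the text content of each matching node.
    static List<String> select(String xml, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
            List<String> texts = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                texts.add(nodes.item(i).getTextContent());
            }
            return texts;
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed: " + expression, e);
        }
    }

    public static void main(String[] args) {
        // normalize-space() yields "" for empty or whitespace-only elements.
        System.out.println("all spans:   " + select(SAMPLE, "//span").size());
        System.out.println("empty spans: " + select(SAMPLE, "//span[not(normalize-space())]").size());
        System.out.println("with text:   " + select(SAMPLE, "//span[normalize-space()]"));
    }
}
```

If the count of empty spans matches the number of blank "span=>" lines in your output, the cleaner is not dropping anything and the prices are simply absent from the source markup for those items.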