Hello
I have to parse a lot of web sites.
I want to take their HTML and transform it into XML. My idea is to take the HTML, convert it to XML, apply an XSLT to that XML, and obtain my custom XML. Each site (XML) will have its own XSLT with XPath expressions.
So far I've done something like this:
org.xml.sax.XMLReader reader = org.xml.sax.helpers.XMLReaderFactory.createXMLReader ("org.htmlparser.sax.XMLReader");
org.xml.sax.ContentHandler content = new MyContentHandler ();
reader.setContentHandler (content);
org.xml.sax.ErrorHandler errors = new MyErrorHandler ();
reader.setErrorHandler (errors);
reader.parse("http://www.google.com");
I understand that MyContentHandler takes care of processing the XML tags. For the moment I've implemented it only with System.out calls to test it.
I really don't know how to do what I want.
For example: how can I apply an XSLT to the Google site's XML to obtain another XML?
I don't want to parse each tag with Java code in MyContentHandler; I want the XSLT to take care of that. Once I retrieve the clean XML from the HTML, I'll pass it to the XSLT so I can get my custom XML.
Can someone help me?
Thanks a lot, guys.
Martina
Logged In: YES
user_id=605407
Originator: NO
I would suggest a set of custom tags that implement a method you define, called toXML ().
It would work like the toHtml () code, but ensure the output is valid XML.
Then you could do something like this (pseudo code):
// register your special tags
Tag[] tags = new Tag[] { new MyXMLHeadTag (), new MyXMLBodyTag (), ... };
PrototypicalNodeFactory factory = new PrototypicalNodeFactory (tags);
parser.setNodeFactory (factory);
// get the entire page
NodeList list = parser.parse (null);
// print the XML (toXML () stands for the method you define; it is not a library call)
System.out.println (list.toXML ());
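For the XSLT step itself, here is a minimal sketch using the standard javax.xml.transform API. It assumes that any org.xml.sax.XMLReader can be wrapped in a SAXSource, so the transformer consumes the SAX events directly and no hand-written ContentHandler is needed. The class name XsltPipelineSketch, the SITE_XSLT stylesheet and the sample page are made up for illustration, and the JDK's own XML reader stands in for the HTML one. Whether org.htmlparser.sax.XMLReader delivers events clean enough for a transformer on real pages is something you would have to test.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.Source;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

class XsltPipelineSketch {

    // Hypothetical per-site stylesheet: copies the page title into a custom element.
    static final String SITE_XSLT =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
        + "<xsl:template match='/'>"
        + "<page><title><xsl:value-of select='/html/head/title'/></title></page>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    // Feeds the SAX events produced by the reader through the stylesheet
    // and returns the transformed document as a string.
    static String transform(XMLReader reader, InputSource page, Source stylesheet)
            throws TransformerException {
        Transformer transformer = TransformerFactory.newInstance().newTransformer(stylesheet);
        StringWriter out = new StringWriter();
        transformer.transform(new SAXSource(reader, page), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Stand-in reader from the JDK; the assumption is that the HTML
        // parser's SAX reader could be passed here instead.
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        XMLReader reader = factory.newSAXParser().getXMLReader();

        // Made-up, already well-formed sample page.
        String page = "<html><head><title>Example</title></head><body><p>text</p></body></html>";

        String result = transform(
            reader,
            new InputSource(new StringReader(page)),
            new StreamSource(new StringReader(SITE_XSLT)));
        System.out.println(result);
    }
}
```

With one stylesheet per site, only the Source passed as the third argument changes; the Java code stays the same for every site.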
Logged In: YES
user_id=1312539
Originator: NO
This Tracker item was closed automatically by the system. It was
previously set to a Pending status, and the original submitter
did not respond within 30 days (the time period specified by
the administrator of this Tracker).