#49 HTML to XML to my custom XML with XSLT


I have to parse a lot of web sites.
I want to take their HTML and convert it to XML, then apply an XSLT stylesheet to that XML to obtain my own custom XML. Each site (XML) will have its own XSLT stylesheet with its own XPath expressions.

So far I have done something like this:

org.xml.sax.XMLReader reader = org.xml.sax.helpers.XMLReaderFactory.createXMLReader ("org.htmlparser.sax.XMLReader");
org.xml.sax.ContentHandler content = new MyContentHandler ();
reader.setContentHandler (content);
org.xml.sax.ErrorHandler errors = new MyErrorHandler ();
reader.setErrorHandler (errors);

I understand that MyContentHandler is responsible for processing the XML tags. For the moment it only prints to System.out so that I can test it.

I really don't know how to do what I want.
For example: how can I apply an XSLT stylesheet to the XML of the Google site to obtain another XML document?
I don't want to handle each tag with Java code in MyContentHandler; I want the XSLT to take care of that. Once I have retrieved clean XML from the HTML, I will pass it to the XSLT stylesheet to get my custom XML.
Can someone help me?
Thanks a lot, guys.


  • Derrick Oswald

    Logged In: YES
    Originator: NO

    I would suggest a set of custom tags that implement a method you define called toXML ().
    It would be like the toHtml () code, but would ensure that the output is valid XML.
    Then you could do something like (pseudo code):

    // register your special tags (each one implements the toXML () method you define)
    Tag[] tags = new Tag[] { new MyXMLHeadTag (), new MyXMLBodyTag (), ... };
    PrototypicalNodeFactory factory = new PrototypicalNodeFactory (tags);
    parser.setNodeFactory (factory);

    // get the entire page (a null filter keeps every node)
    NodeList list = parser.parse (null);

    // print the XML; toXML () here is the method you define, not part of the library
    System.out.println (list.toXML ());
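
    Once well-formed XML is available, the XSLT step asked about in the question can be done with the JDK's javax.xml.transform API. The following is only a minimal sketch, not part of HTML Parser: the class name XsltSketch, the STYLESHEET constant and the sample input are made up for illustration, and it assumes the XML and the per-site stylesheet are already in memory as strings.

```java
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import java.io.StringReader;
import java.io.StringWriter;

public class XsltSketch {

    // Hypothetical per-site stylesheet: picks the page title and every link target via XPath.
    static final String STYLESHEET =
        "<xsl:stylesheet version=\"1.0\" xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
        + "<xsl:output method=\"xml\" omit-xml-declaration=\"yes\"/>"
        + "<xsl:template match=\"/\">"
        + "<page>"
        + "<title><xsl:value-of select=\"/html/head/title\"/></title>"
        + "<xsl:for-each select=\"//a\"><link><xsl:value-of select=\"@href\"/></link></xsl:for-each>"
        + "</page>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    /** Applies the stylesheet text to the XML text and returns the transformed document. */
    static String transform(String xml, String xslt) throws TransformerException {
        TransformerFactory factory = TransformerFactory.newInstance();
        // Compile the stylesheet into a Transformer.
        Transformer transformer = factory.newTransformer(new StreamSource(new StringReader(xslt)));
        StringWriter out = new StringWriter();
        // Run the transformation, collecting the output in memory.
        transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws TransformerException {
        // Stand-in for the cleaned-up XML obtained from a site's HTML.
        String xml = "<html><head><title>Example</title></head>"
            + "<body><a href=\"http://example.com/\">link</a></body></html>";
        System.out.println(transform(xml, STYLESHEET));
    }
}
```

    If the SAX reader from the original post produces well-formed events, it may be possible to skip the intermediate string by passing a javax.xml.transform.sax.SAXSource built from that reader and an InputSource to transform () in place of the StreamSource; that combination is an assumption and has not been tried here.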

  • Derrick Oswald

    • labels: --> Programming Problem
    • assigned_to: nobody --> derrickoswald
    • status: open --> pending
  • Logged In: YES
    Originator: NO

    This Tracker item was closed automatically by the system. It was
    previously set to a Pending status, and the original submitter
    did not respond within 30 days (the time period specified by
    the administrator of this Tracker).

    • status: pending --> closed