We will need to source an XML parser (there must be one out there somewhere) so that we can parse the HTML documents.
This implies the ability to convert the retrieved/stored HTML documents into syntactically correct XML documents that follow the HTML schema. We would need to write a design for this conversion, including heuristics for the cases where the required change to the HTML document is not obvious (for example, where it is not obvious where a missing HTML end tag should go).
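To make the end-tag heuristics a bit more concrete, here is a rough sketch of one possible approach, not a finished design. The class name `TagBalancer`, the method `balance`, and the two tag sets are all made up for illustration. It assumes three simple rules: a new `p` or `li` closes an open one of the same name, an end tag closes everything opened after its start tag, and anything still open at the end of the input gets closed. It does not quote attribute values, resolve entities such as `&nbsp;`, or treat script content specially, so its output is not guaranteed to be well-formed XML for arbitrary pages.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TagBalancer {
    // Elements written without an end tag in HTML (partial list, an assumption).
    private static final Set<String> VOID =
            new HashSet<>(Arrays.asList("br", "hr", "img", "input", "meta", "link"));
    // Elements that we assume never nest directly inside themselves.
    private static final Set<String> NO_NESTING =
            new HashSet<>(Arrays.asList("p", "li"));
    // group 1: "/" for an end tag, group 2: tag name, group 3: raw attribute text.
    private static final Pattern TAG =
            Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9]*)([^>]*)>");

    public static String balance(String html) {
        StringBuilder out = new StringBuilder();
        Deque<String> open = new ArrayDeque<>();   // currently open elements, innermost first
        Matcher m = TAG.matcher(html);
        int pos = 0;
        while (m.find()) {
            out.append(html, pos, m.start());      // copy the text before this tag
            pos = m.end();
            String name = m.group(2).toLowerCase();

            if (!m.group(1).isEmpty()) {
                // End tag: close everything opened after the matching start tag.
                // An end tag with no matching start tag is dropped.
                if (open.contains(name)) {
                    String top;
                    do {
                        top = open.pop();
                        out.append("</").append(top).append('>');
                    } while (!top.equals(name));
                }
                continue;
            }

            String attrs = m.group(3).trim();
            boolean selfClosed = attrs.endsWith("/");
            if (selfClosed) {
                attrs = attrs.substring(0, attrs.length() - 1).trim();
            }
            // Heuristic: <li> directly inside an open <li> means the first one ended.
            if (NO_NESTING.contains(name) && name.equals(open.peek())) {
                out.append("</").append(open.pop()).append('>');
            }
            out.append('<').append(name);
            if (!attrs.isEmpty()) {
                out.append(' ').append(attrs);
            }
            if (selfClosed || VOID.contains(name)) {
                out.append("/>");                  // write void elements in XML empty-element form
            } else {
                out.append('>');
                open.push(name);
            }
        }
        out.append(html.substring(pos));           // trailing text after the last tag
        while (!open.isEmpty()) {                  // close whatever is still open
            out.append("</").append(open.pop()).append('>');
        }
        return out.toString();
    }
}
```

With these rules, `TagBalancer.balance("<ul><li>one<li>two</ul>")` gives `<ul><li>one</li><li>two</li></ul>`. The genuinely ambiguous cases (for example a missing `</b>` in the middle of a paragraph) are exactly where the design would need more careful rules than this.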
Anonymous
-
2000-10-05
We should use SAX or DOM parsers, I think. I know of two which are free and in Java.
1. Apache Project's Xerces: go to http://xml.apache.org/
SAX 2 and DOM 1 and 2 (beta)
2. J. Clark's XP: go to http://www.jclark.com/ (SAX only?)
Isn't there an Oracle something too?
martin
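For anyone weighing SAX against DOM, the difference can be sketched with the standard `javax.xml.parsers` (JAXP) API, which Xerces implements. This is only an illustration: the class name `ParserComparison` and both method names are made up, and it assumes the input is already well-formed XML. DOM builds the whole document as a tree in memory, while SAX calls back into a handler as each element is read.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class ParserComparison {

    // DOM: parse the whole document into a tree, then navigate it.
    public static String rootNameWithDom(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return doc.getDocumentElement().getTagName();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("could not parse document", e);
        }
    }

    // SAX: the parser reports each start tag to the handler as it is read;
    // no tree is kept, so memory use stays small for large documents.
    public static List<String> elementNamesWithSax(String xml) {
        final List<String> names = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attributes) {
                names.add(qName);
            }
        };
        try {
            SAXParserFactory.newInstance()
                    .newSAXParser()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("could not parse document", e);
        }
        return names;
    }
}
```

For `<html><body><p>hi</p></body></html>`, the DOM method returns `html` and the SAX method returns the names `html`, `body`, `p` in document order. Going through JAXP rather than parser-specific classes would also leave room to swap the parser later.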
I like the look of Xerces at xml.apache.org. It appears to be fairly complete and sophisticated, and of course it comes from the Apache group, so it can't be too bad to use.
I'm all for adopting it as our standard XML parsing component, and by the looks of it, you are too, Martin. I'd say Chris would be too if he weren't so busy at Uni right now :)
I'll add a link to the home page just now.