From: Stefan S. <se...@sy...> - 2003-05-07 18:50:44
and...@am... wrote:

> Common misconception: parsers drop the original text.
> An artifact of the early lexers, like lex.
>
> Much recent work, e.g. in error handling and macro processing
> and refactoring, has provided access to the original text
> even in the parsed stream.

The question to ask is what content has semantic value. Are line breaks, whitespace, etc. to be preserved? These questions boil down to what level you are operating on. Are you interested in the infosets, or do you want to know the details of how things are encoded in a file?

As Murray suggests, we are talking about XML here, which is at least one level on top of raw file data. For example, a DTD may implicitly declare an attribute value for a node, even though the attribute was not present in the 'raw' input. When looking at the DOM, you won't see that distinction: the defaulted attribute appears just like one written in the file. That's a feature, not a bug. If you want to have access to lower levels, use lower levels, i.e. raw I/O.

> E.g. consider error handling... it should be possible
> to trace any error back to any and all files, lines, characters
> which are associated with the error. In the presence of macros,
> you should be able to say where the error occurred,
> in every phase of macro expansion. Yet most parsers that
> drop the original text lose this ability.

Well, no. The question is what error we are talking about. Is it a syntax error that makes the file content invalid XML? The XML parser can clearly catch that and report it to you. If we are talking about structural errors, such as a document that doesn't validate against a DTD/schema, that too can be done in a generic way, i.e. by telling the parser to match the document against a known DTD/schema. Everything else is up to the application.

Stefan