From: Alister P. <gsp...@gm...> - 2014-09-01 05:35:12
The error in more detail… Only happening when storing the document, not when parsing.

2014-08-31 17:19:02,585 [eXistThread-31] DEBUG (ModuleUtils.java [htmlToXHtml]:251) - Converting HTML to XML using NekoHTML parser for: alternative
2014-08-31 17:19:03,724 [eXistThread-31] DEBUG (TransactionManager.java [execute]:159) - Starting new transaction: 5
2014-08-31 17:19:03,730 [eXistThread-31] DEBUG (Collection.java [validateXMLResourceInternal]:1631) - Scanning document /db/test/message-6.xml
2014-08-31 17:19:03,731 [eXistThread-31] DEBUG (GrammarPool.java [retrieveInitialGrammarSet]:81) - Retrieve initial grammarset (http://www.w3.org/TR/REC-xml).
2014-08-31 17:19:03,731 [eXistThread-31] DEBUG (GrammarPool.java [retrieveInitialGrammarSet]:85) - Found 0 grammars.
2014-08-31 17:19:03,740 [eXistThread-31] DEBUG (Indexer.java [fatalError]:419) - fatal error at (58,2064) : The value of the attribute "prefix="xmlns",localpart="o",rawname="xmlns:o"" is invalid. Prefixed namespace bindings may not be empty.
2014-08-31 17:19:03,740 [eXistThread-31] DEBUG (GrammarPool.java [retrieveInitialGrammarSet]:81) - Retrieve initial grammarset (http://www.w3.org/TR/REC-xml).
2014-08-31 17:19:03,740 [eXistThread-31] DEBUG (GrammarPool.java [retrieveInitialGrammarSet]:85) - Found 0 grammars.
2014-08-31 17:19:03,741 [eXistThread-31] ERROR (XMLDBStore.java [evalWithCollection]:220) - The XML parser reported a problem: fatal error at (58,2064) : The value of the attribute "prefix="xmlns",localpart="o",rawname="xmlns:o"" is invalid. Prefixed namespace bindings may not be empty.
org.xmldb.api.base.XMLDBException: The XML parser reported a problem: fatal error at (58,2064) : The value of the attribute "prefix="xmlns",localpart="o",rawname="xmlns:o"" is invalid. Prefixed namespace bindings may not be empty.
    at org.exist.xmldb.LocalCollection.storeXMLResource(LocalCollection.java:893)
    at org.exist.xmldb.LocalCollection.storeResource(LocalCollection.java:766)
    at org.exist.xmldb.LocalCollection.storeResource(LocalCollection.java:754)

I’ve tried adding an ElementRemover filter to the NekoHTML parser, but can’t make that work: no content is returned. The documentation seems to suggest that I have to add an acceptElement for every element I want to keep, which means all of HTML!

Now I’m trying a FilterInputStream approach, but that seems such a blunt instrument. Any other suggestions on how to remove <o:p></o:p> from the result of htmlToXHtml?

This is how I modified ModuleUtils.htmlToXHtml:

    LOG.debug("Converting HTML to XML using NekoHTML parser for: " + url);
    reader = (XMLReader) Class.forName("org.cyberneko.html.parsers.SAXParser").newInstance();

    ElementRemover remover = new ElementRemover();
    remover.removeElement("o:p");

    // set up filter chain
    XMLDocumentFilter[] filters = { remover };

    reader.setProperty("http://cyberneko.org/html/properties/names/elems", "match");
    reader.setProperty("http://cyberneko.org/html/properties/names/attrs", "no-change");
    reader.setProperty("http://cyberneko.org/html/properties/filters", filters); // ADDED FILTER here

On 31 Aug 2014, at 6:11 pm, Alister Pillow <gsp...@gm...> wrote:

> Hi,
> Trying to finish off the mail:get-messages function. The error appears when trying to store an email as xml in the /db.
>
> I’ve hit a nasty little bug - nothing to do with eXist - and (surprise) related to mail from Microsoft Outlook - via Apple Mail. I now suspect that Apple Mail is the culprit.
>
> I forwarded an email from my Inbox to another account and then retrieved it using mail:get-messages.
> The original html section is full of <o:p></o:p> tags. These are end-of-paragraph markers inserted by MS Word when creating HTML.
>
> In the original email, the prefix and namespace are declared - but in the forwarded message, the declaration is missing - consequently I get an error from SAXParser when trying to store this content in the DB.
>
> Is there some way to tell the parser to skip these (empty) elements? Or will I have to write a filter for the text before parsing it?
>
> I’m using Wolfgang’s suggestion:
> DocumentImpl html = ModuleUtils.htmlToXHtml(context, "alternative", new StreamSource(part.getInputStream()), null, null);
> ElementImpl rootElem = (ElementImpl)html.getDocumentElement();
>
> (Otherwise, mail:get-messages is working quite nicely.)
>
> Thanks,
> Alister.
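
For the "filter the text before parsing" option raised above, here is a minimal sketch. It assumes the HTML part is small enough to read into a String, and that only the <o:p> tags themselves need to go while any text between them should stay. The class name OfficeTagStripper and the method strip are hypothetical, not part of eXist or NekoHTML, and a regex is admittedly still a blunt instrument compared with a real parser-level filter.

```java
import java.util.regex.Pattern;

// Hypothetical helper (not part of eXist or NekoHTML): strips the
// <o:p> tags that MS Word inserts, leaving any enclosed text in place.
public class OfficeTagStripper {

    // Matches <o:p>, </o:p>, <o:p/> and <o:p attr="...">, in any letter case.
    // The optional group must start with whitespace, so <o:pre> is not matched.
    private static final Pattern O_P_TAG =
            Pattern.compile("</?o:p(\\s[^>]*)?/?>", Pattern.CASE_INSENSITIVE);

    // Returns the input with every o:p tag removed.
    public static String strip(String html) {
        return O_P_TAG.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<p class=\"MsoNormal\">Hello<o:p></o:p></p>";
        System.out.println(strip(html)); // prints <p class="MsoNormal">Hello</p>
    }
}
```

The cleaned String could then presumably be turned back into bytes and wrapped in a ByteArrayInputStream for the StreamSource passed to htmlToXHtml. The charset to use for that round trip is an assumption that would have to come from the mail part's Content-Type.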