From: Adam R. <ad...@ex...> - 2014-09-01 15:41:11
Perhaps one option would be to switch namespace processing off in Neko for this, and then later add the namespaces back in using a custom filter (if that is desirable).

http://nekohtml.sourceforge.net/settings.html#namespaces
http://nekohtml.sourceforge.net/settings.html#filters

On 1 September 2014 06:36, Alister Pillow <gsp...@gm...> wrote:
> The error in more detail… It only happens when storing the document, not when parsing.
>
> 2014-08-31 17:19:02,585 [eXistThread-31] DEBUG (ModuleUtils.java [htmlToXHtml]:251) - Converting HTML to XML using NekoHTML parser for: alternative
> 2014-08-31 17:19:03,724 [eXistThread-31] DEBUG (TransactionManager.java [execute]:159) - Starting new transaction: 5
> 2014-08-31 17:19:03,730 [eXistThread-31] DEBUG (Collection.java [validateXMLResourceInternal]:1631) - Scanning document /db/test/message-6.xml
> 2014-08-31 17:19:03,731 [eXistThread-31] DEBUG (GrammarPool.java [retrieveInitialGrammarSet]:81) - Retrieve initial grammarset (http://www.w3.org/TR/REC-xml).
> 2014-08-31 17:19:03,731 [eXistThread-31] DEBUG (GrammarPool.java [retrieveInitialGrammarSet]:85) - Found 0 grammars.
> 2014-08-31 17:19:03,740 [eXistThread-31] DEBUG (Indexer.java [fatalError]:419) - fatal error at (58,2064) : The value of the attribute "prefix="xmlns",localpart="o",rawname="xmlns:o"" is invalid. Prefixed namespace bindings may not be empty.
> 2014-08-31 17:19:03,740 [eXistThread-31] DEBUG (GrammarPool.java [retrieveInitialGrammarSet]:81) - Retrieve initial grammarset (http://www.w3.org/TR/REC-xml).
> 2014-08-31 17:19:03,740 [eXistThread-31] DEBUG (GrammarPool.java [retrieveInitialGrammarSet]:85) - Found 0 grammars.
> 2014-08-31 17:19:03,741 [eXistThread-31] ERROR (XMLDBStore.java [evalWithCollection]:220) - The XML parser reported a problem: fatal error at (58,2064) : The value of the attribute "prefix="xmlns",localpart="o",rawname="xmlns:o"" is invalid. Prefixed namespace bindings may not be empty.
> org.xmldb.api.base.XMLDBException: The XML parser reported a problem: fatal error at (58,2064) : The value of the attribute "prefix="xmlns",localpart="o",rawname="xmlns:o"" is invalid. Prefixed namespace bindings may not be empty.
>     at org.exist.xmldb.LocalCollection.storeXMLResource(LocalCollection.java:893)
>     at org.exist.xmldb.LocalCollection.storeResource(LocalCollection.java:766)
>     at org.exist.xmldb.LocalCollection.storeResource(LocalCollection.java:754)
>
> I’ve tried adding an ElementRemover filter to the NekoHTML parser, but can’t make that work - no content is returned. The documentation seems to suggest that I have to add an acceptElement for every element I want to keep - which means all of HTML!!!
>
> Now I’m trying a FilterInputStream approach, but that seems such a blunt instrument. Any other suggestions on how to remove <o:p></o:p> from the result of htmlToXHtml?
>
> This is how I modified ModuleUtils.htmlToXHtml:
>
> LOG.debug("Converting HTML to XML using NekoHTML parser for: " + url);
> reader = (XMLReader) Class.forName("org.cyberneko.html.parsers.SAXParser").newInstance();
>
> ElementRemover remover = new ElementRemover();
> remover.removeElement("o:p");
> // setup filter chain
> XMLDocumentFilter[] filters = {
>     remover
> };
>
> reader.setProperty("http://cyberneko.org/html/properties/names/elems","match");
> reader.setProperty("http://cyberneko.org/html/properties/names/attrs","no-change");
> reader.setProperty("http://cyberneko.org/html/properties/filters", filters); // ADDED FILTER here
>
>
> On 31 Aug 2014, at 6:11 pm, Alister Pillow <gsp...@gm...> wrote:
>
>> Hi,
>> Trying to finish off the mail:get-messages function. The error appears when trying to store an email as XML in the /db.
>>
>> I’ve hit a nasty little bug - nothing to do with eXist - and (surprise) related to mail from Microsoft Outlook, via Apple Mail. I now suspect that Apple Mail is the culprit.
>>
>> I forwarded an email from my Inbox to another account and then retrieved it using mail:get-messages.
>> The original HTML section is full of <o:p></o:p> tags. These are end-of-paragraph markers inserted by MS Word when creating HTML.
>>
>> In the original email, the prefix and namespace are declared, but in the forwarded message the declaration is missing. Consequently I get an error from SAXParser when trying to store this content in the DB.
>>
>> Is there some way to tell the parser to skip these (empty) elements? Or will I have to write a filter for the text before parsing it?
>>
>> I’m using Wolfgang’s suggestion:
>> DocumentImpl html = ModuleUtils.htmlToXHtml(context, "alternative", new StreamSource(part.getInputStream()), null, null);
>> ElementImpl rootElem = (ElementImpl)html.getDocumentElement();
>>
>> (Otherwise, mail:get-messages is working quite nicely.)
>>
>> Thanks,
>> Alister.
>
> _______________________________________________
> Exist-development mailing list
> Exi...@li...
> https://lists.sourceforge.net/lists/listinfo/exist-development

--
Adam Retter

eXist Developer
{ United Kingdom }
ad...@ex...
irc://irc.freenode.net/existdb
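
For illustration, the suggestion at the top of this message (namespace processing off, then a custom filter) might be sketched roughly as follows. This is not code from eXist or NekoHTML: `PrefixedElementDropper` and `strip` are hypothetical names, and the JDK's own SAX parser stands in for the Neko reader so that the sketch is self-contained. It is a plain SAX filter (`XMLFilterImpl`) rather than a Neko `XMLDocumentFilter`; it drops every element with a given prefix, together with that element's content, and also drops the matching `xmlns:` attribute that the stored document was rejected for. In ModuleUtils the parent reader would presumably be the Neko `SAXParser` with the namespaces feature from the settings page switched off, which is an assumption I have not tested.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical helper: removes elements such as <o:p> (and their content)
// plus the matching xmlns:o attribute, working on raw qualified names only.
class PrefixedElementDropper extends XMLFilterImpl {
    private final String prefix;   // e.g. "o"
    private int skipDepth = 0;     // > 0 while inside a dropped element

    PrefixedElementDropper(XMLReader parent, String prefix) {
        super(parent);
        this.prefix = prefix;
    }

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) throws SAXException {
        if (skipDepth > 0 || qName.startsWith(prefix + ":")) {
            skipDepth++;           // swallow this element and everything in it
            return;
        }
        // Copy the attributes, leaving out xmlns:<prefix> and <prefix>:* ones.
        AttributesImpl kept = new AttributesImpl();
        for (int i = 0; i < atts.getLength(); i++) {
            String name = atts.getQName(i);
            if (!name.equals("xmlns:" + prefix) && !name.startsWith(prefix + ":")) {
                kept.addAttribute(atts.getURI(i), atts.getLocalName(i), name,
                                  atts.getType(i), atts.getValue(i));
            }
        }
        super.startElement(uri, localName, qName, kept);
    }

    @Override
    public void endElement(String uri, String localName, String qName)
            throws SAXException {
        if (skipDepth > 0) {
            skipDepth--;
            return;
        }
        super.endElement(uri, localName, qName);
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (skipDepth == 0) {
            super.characters(ch, start, length);
        }
    }

    // Demonstration only: parses with namespace processing OFF and writes the
    // surviving events back out as text (no escaping, so not a real serializer).
    static String strip(String xml, String prefix) {
        final StringBuilder out = new StringBuilder();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(false);
            XMLReader base = factory.newSAXParser().getXMLReader();
            PrefixedElementDropper filter = new PrefixedElementDropper(base, prefix);
            filter.setContentHandler(new DefaultHandler() {
                @Override
                public void startElement(String u, String l, String q, Attributes a) {
                    out.append('<').append(q);
                    for (int i = 0; i < a.getLength(); i++) {
                        out.append(' ').append(a.getQName(i))
                           .append("=\"").append(a.getValue(i)).append('"');
                    }
                    out.append('>');
                }
                @Override
                public void endElement(String u, String l, String q) {
                    out.append("</").append(q).append('>');
                }
                @Override
                public void characters(char[] ch, int start, int length) {
                    out.append(ch, start, length);
                }
            });
            filter.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return out.toString();
    }
}
```

Calling `PrefixedElementDropper.strip(xml, "o")` on a fragment containing `<o:p></o:p>` returns the same markup without those elements. Because the filter only looks at qualified names, it does not need the `o` prefix to be bound, which is the situation in the forwarded message.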