From: Misha K. <mis...@gm...> - 2010-05-25 16:05:35
|
Wow. Thank you so much.

So do I understand correctly that this is all it takes?

    XMLDocumentFilter idEnhancer = new DefaultFilter() {
        public void startElement(QName element, XMLAttributes attributes, Augmentations augs)
                throws XNIException {
            int idx = attributes.getIndex("id");
            if (idx > -1) {
                // Mark the attribute as type ID so that getElementById() can find it.
                attributes.setType(idx, "ID");
                Augmentations attrsAugs = attributes.getAugmentations(idx);
                attrsAugs.putItem(Constants.ATTRIBUTE_DECLARED, Boolean.TRUE);
            }
            super.startElement(element, attributes, augs);
        }
    };
    XMLDocumentFilter[] filters = { idEnhancer };
    fConfiguration.setProperty("http://cyberneko.org/html/properties/filters", filters);

p.s. I have not decided yet, but I will probably cheat and use XPath for now. It seems like that will be a solution that is less likely to change with the times. The filter definitely seems like something necessary, and I will keep it in mind for later work.

Thank you!

Misha

Jacob Kjome wrote:
> That's because it's not a validating parser. You can only define "id" as being of
> type "ID" if it's validated against a DTD or XML Schema.
>
> However, there is a workaround [1] that I implemented for the XMLC project [2].
> You can use a NekoHTML Filter [3] to automagically mark certain attributes as
> being of type "ID". Look for the "idEnhancer" filter in the linked code. The
> only problem with the solution I came up with is that it uses knowledge about
> Xerces internals that could change at any given release. That said, it's worked
> since at least Xerces 2.8.1 and the Xerces code that it takes advantage of doesn't
> appear to be up for refactoring anytime soon, IMO.
>
> What would be really nice is to figure out a less brittle implementation; that is,
> one that doesn't depend upon Xerces internals. If anyone on this list knows of
> one, it would be a great contribution as getElementById() won't work for HTML
> without it.
>
>
> [1]
> http://websvn.ow2.org/filedetails.php?repname=xmlc&path=%2Ftrunk%2Fxmlc%2Fxmlc%2Fmodules%2Fxmlc%2Fsrc%2Forg%2Fenhydra%2Fxml%2Fxmlc%2Fparsers%2Fxerces%2FXercesHTMLDOMParser.java
> [2] http://forge.ow2.org/projects/xmlc/
> [3] http://nekohtml.sourceforge.net/filters.html
>
>
> Jake
>
> On 5/24/2010 9:01 PM, Misha Koshelev wrote:
>> Dear Sirs:
>>
>> Again thank you for such a great product!
>>
>> I am undergoing step (ii) of converting my Web Automation Framework (www.mkosh.com - new version to be posted tomorrow)
>> to using NekoHTML.
>>
>> Thank you so much for your prior help with XPath expressions, etc.
>>
>> Specifically, I have now encountered the following issue.
>>
>> I parse the document and am able to correctly use XPath expressions with lowercase element names.
>>
>> The attribute names are also lowercase.
>>
>> However, it seems the id attribute is not marked as being of type "ID", and so document.getElementById always returns null
>> (I checked this by using an XPath that retrieves an Element, getting the "id" attribute, and then immediately doing document.getElementById for that exact attribute).
>>
>> I am using the following code to parse:
>>     DOMParser domParser=new DOMParser(new HTMLConfiguration());
>>     try {
>>         domParser.setFeature("http://cyberneko.org/html/features/augmentations",true);
>>         domParser.setProperty("http://cyberneko.org/html/properties/names/elems","lower");
>>     } catch (SAXNotRecognizedException saxnre) {
>>         throw new WebDriverException("Error parsing document",saxnre);
>>     } catch (SAXNotSupportedException saxnse) {
>>         throw new WebDriverException("Error parsing document",saxnse);
>>     }
>>     try {
>>         domParser.parse(new InputSource(new ByteArrayInputStream(pageSource.getBytes())));
>>     } catch (IOException ioe) {
>>         throw new WebDriverException("Error parsing document",ioe);
>>     } catch (SAXException saxe) {
>>         throw new WebDriverException("Error parsing document",saxe);
>>     }
>>     setDocument(domParser.getDocument());
>>
>> Thank you so much
>>
>> Misha
>>
>> ------------------------------------------------------------------------------
>>
>> _______________________________________________
>> nekohtml-user mailing list
>> nek...@li...
>> https://lists.sourceforge.net/lists/listinfo/nekohtml-user
>>
>
> ------------------------------------------------------------------------------
>
> _______________________________________________
> nekohtml-user mailing list
> nek...@li...
> https://lists.sourceforge.net/lists/listinfo/nekohtml-user |
From: Jacob K. <ho...@vi...> - 2010-05-25 04:30:16
|
That's because it's not a validating parser. You can only define "id" as being of type "ID" if it's validated against a DTD or XML Schema.

However, there is a workaround [1] that I implemented for the XMLC project [2]. You can use a NekoHTML Filter [3] to automagically mark certain attributes as being of type "ID". Look for the "idEnhancer" filter in the linked code. The only problem with the solution I came up with is that it uses knowledge about Xerces internals that could change at any given release. That said, it's worked since at least Xerces 2.8.1 and the Xerces code that it takes advantage of doesn't appear to be up for refactoring anytime soon, IMO.

What would be really nice is to figure out a less brittle implementation; that is, one that doesn't depend upon Xerces internals. If anyone on this list knows of one, it would be a great contribution as getElementById() won't work for HTML without it.

[1] http://websvn.ow2.org/filedetails.php?repname=xmlc&path=%2Ftrunk%2Fxmlc%2Fxmlc%2Fmodules%2Fxmlc%2Fsrc%2Forg%2Fenhydra%2Fxml%2Fxmlc%2Fparsers%2Fxerces%2FXercesHTMLDOMParser.java
[2] http://forge.ow2.org/projects/xmlc/
[3] http://nekohtml.sourceforge.net/filters.html

Jake

On 5/24/2010 9:01 PM, Misha Koshelev wrote:
> Dear Sirs:
>
> Again thank you for such a great product!
>
> I am undergoing step (ii) of converting my Web Automation Framework (www.mkosh.com - new version to be posted tomorrow)
> to using NekoHTML.
>
> Thank you so much for your prior help with XPath expressions, etc.
>
> Specifically, I have now encountered the following issue.
>
> I parse the document and am able to correctly use XPath expressions with lowercase element names.
>
> The attribute names are also lowercase.
>
> However, it seems the id attribute is not marked as being of type "ID", and so document.getElementById always returns null
> (I checked this by using an XPath that retrieves an Element, getting the "id" attribute, and then immediately doing document.getElementById for that exact attribute).
>
> I am using the following code to parse:
>     DOMParser domParser=new DOMParser(new HTMLConfiguration());
>     try {
>         domParser.setFeature("http://cyberneko.org/html/features/augmentations",true);
>         domParser.setProperty("http://cyberneko.org/html/properties/names/elems","lower");
>     } catch (SAXNotRecognizedException saxnre) {
>         throw new WebDriverException("Error parsing document",saxnre);
>     } catch (SAXNotSupportedException saxnse) {
>         throw new WebDriverException("Error parsing document",saxnse);
>     }
>     try {
>         domParser.parse(new InputSource(new ByteArrayInputStream(pageSource.getBytes())));
>     } catch (IOException ioe) {
>         throw new WebDriverException("Error parsing document",ioe);
>     } catch (SAXException saxe) {
>         throw new WebDriverException("Error parsing document",saxe);
>     }
>     setDocument(domParser.getDocument());
>
> Thank you so much
>
> Misha
>
> ------------------------------------------------------------------------------
>
> _______________________________________________
> nekohtml-user mailing list
> nek...@li...
> https://lists.sourceforge.net/lists/listinfo/nekohtml-user |
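On the request for a less brittle implementation: one possibility that stays inside the standard DOM Level 3 API is to walk the finished tree and call `Element.setIdAttribute("id", true)` on every element that carries an `id` attribute. The sketch below is only an illustration under stated assumptions. It uses the JDK's built-in parser on a small well-formed sample instead of NekoHTML, the class and method names (`IdAttributeDemo`, `markIdAttributes`) are made up for this example, and it has not been tested against a NekoHTML-produced document.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class IdAttributeDemo {

    /** Parses a well-formed string with the JDK's default, non-validating parser. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse sample markup", e);
        }
    }

    /** Recursively flags every "id" attribute at or below the node as being of type ID. */
    static void markIdAttributes(Node node) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            Element element = (Element) node;
            // setIdAttribute raises NOT_FOUND_ERR when the attribute is absent, so check first.
            if (element.hasAttribute("id")) {
                element.setIdAttribute("id", true);
            }
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            markIdAttributes(child);
        }
    }

    public static void main(String[] args) {
        Document doc = parse(
                "<html><body><div id=\"main\"><p id=\"intro\">hi</p></div></body></html>");
        // Without a DTD or schema nothing is typed as ID yet.
        System.out.println("before marking: " + doc.getElementById("main"));
        markIdAttributes(doc.getDocumentElement());
        System.out.println("after marking: " + doc.getElementById("main").getTagName());
    }
}
```

The trade-off compared with the filter approach is an extra pass over the tree after parsing, and elements added to the document later would need to be marked again.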
From: Misha K. <mis...@gm...> - 2010-05-25 03:01:50
|
Dear Sirs:

Again thank you for such a great product!

I am undergoing step (ii) of converting my Web Automation Framework (www.mkosh.com - new version to be posted tomorrow) to using NekoHTML.

Thank you so much for your prior help with XPath expressions, etc.

Specifically, I have now encountered the following issue.

I parse the document and am able to correctly use XPath expressions with lowercase element names.

The attribute names are also lowercase.

However, it seems the id attribute is not marked as being of type "ID", and so document.getElementById always returns null (I checked this by using an XPath that retrieves an Element, getting the "id" attribute, and then immediately doing document.getElementById for that exact attribute).

I am using the following code to parse:

    DOMParser domParser=new DOMParser(new HTMLConfiguration());
    try {
        domParser.setFeature("http://cyberneko.org/html/features/augmentations",true);
        domParser.setProperty("http://cyberneko.org/html/properties/names/elems","lower");
    } catch (SAXNotRecognizedException saxnre) {
        throw new WebDriverException("Error parsing document",saxnre);
    } catch (SAXNotSupportedException saxnse) {
        throw new WebDriverException("Error parsing document",saxnse);
    }
    try {
        domParser.parse(new InputSource(new ByteArrayInputStream(pageSource.getBytes())));
    } catch (IOException ioe) {
        throw new WebDriverException("Error parsing document",ioe);
    } catch (SAXException saxe) {
        throw new WebDriverException("Error parsing document",saxe);
    }
    setDocument(domParser.getDocument());

Thank you so much

Misha |
From: Misha K. <mis...@gm...> - 2010-05-21 02:45:34
|
My apologies: initial message got sent prematurely. Please find updated version below. Thank you! Main changes from version below. I actually have _two_ versions on site (being uploaded as we speak). The stable version was a step back today towards the Javascript-based model, somewhat simplified (uses unique ids for elements). Relies of sleep statements completely, but works quite well. The other (unstable) version actually retrieves a full DOM structure for each web page. There are two issues: 1) how to detect when a page has _truly_ loaded in a cross-browser way. I have some ideas with regards to this (setInterval with a function that will check for creation of any new elements whose IDs are not known, say, every 1 second; when no new elements have been created in 1 second, it means we are done) 2) how to detect whether any changes occur after a Javascript statement is executed (perhaps above will work as well?) In any case, looking much forward to your input. Highlights: Stable version - slightly better Javascript version Unstable version - DOM-based framework, along with tester application (Firebug-like). Unfortunately has some kinks on Windows. Thank you Misha Koshelev --- Dear All: Please bear with me as I release a new version of the SWT Web Automation Framework at www.mkosh.com now licensed under EPL v1. I would still like to contribute the parts of the framework that are in my org.eclipse.swt.browser.webdriver classes back to the SWT project if possible. Just to remind, this is a cross-platform solution that allows: * end user web automation applications * with the ability (that SWT provides) to hide the browser from the user You will find on the Web site, additionally, a sample application (Facebook Birthday Greeter), as well as a Web Automation Framework tester. Notably, the big change from the previous version is that I use NekoHTML to actually parse a DOM structure for the given document. 
This allows some additional capabilities not present in the prior version, specifically the use of XPath expression querying. Additionally, it makes the methodology somewhat neater, and, as you can see by the Web Automation Framework Tester tool (which, I have to say, simulates to some extent the Firebug Firefox Extension - and thank you to Grant Gayed on the SWT forums for all your help. There is much to be done on the back end side. Roughly, to keep a consistent API, I am following that of WebDriver http://selenium.googlecode.com/svn/trunk/docs/api/java/index.html although, as you can see, there are additional features that are not present there. In any case, I still have much work to do. I have implemented the examples from WebDriver in org.eclipse.swt.browser.webdriver.Test Additionally, although some keystrokes are still sent incorrectly (e.g., "!"), one really neat thing is that we can even send keystrokes correctly to Internet Explorer with the window _not visible_, which is impossible both using Javascript (I did not find a method to simulate keystrokes that worked well on Internet Explorer) and using WebDriver (as it is impossible to hide the IE window). In any case, I look forward to your comments/support/encouragement etc. Thank you! Misha p.s. There are some known problems right, especially with detecting mouse move events in the Tester in Windows IE. I will work on fixing this. Thank you! |
From: Misha K. <mis...@gm...> - 2010-05-21 02:41:27
|
Dear All: Please bear with me as I release a new version of the SWT Web Automation Framework at www.mkosh.com now licensed under EPL v1. I would still like to contribute the parts of the framework that are in my org.eclipse.swt.browser.webdriver classes back to the SWT project if possible. Just to remind, this is a cross-platform solution that allows: * end user web automation applications * with the ability (that SWT provides) to hide the browser from the user You will find on the Web site, additionally, a sample application (Facebook Birthday Greeter), as well as a Web Automation Framework tester. Notably, the big change from the previous version is that I use NekoHTML to actually parse a DOM structure for the given document. This allows some additional capabilities not present in the prior version, specifically the use of XPath expression querying. Additionally, it makes the methodology somewhat neater, and, as you can see by the Web Automation Framework Tester tool (which, I have to say, simulates to some extent the Firebug Firefox Extension - and thank you to Grant Gayed on the SWT forums for all your help. There is much to be done on the back end side. Roughly, to keep a consistent API, I am following that of WebDriver http://selenium.googlecode.com/svn/trunk/docs/api/java/index.html although, as you can see, there are additional features that are not present there. In any case, I still have much work to do. I have implemented the examples from WebDriver in org.eclipse.swt.browser.webdriver.Test Additionally, although some keystrokes are still sent incorrectly (e.g., "!"), one really neat thing is that we can even send keystrokes correctly to Internet Explorer with the window _not visible_, which is impossible both using Javascript (I did not find a method to simulate keystrokes that worked well on Internet Explorer) and using WebDriver (as it is impossible to hide the IE window). In any case, I look forward to your comments/support/encouragement etc. 
Thank you! Misha p.s. There are some known problems right, especially with detecting mouse move events in the Tester in Windows IE. I will work on fixing this. Thank you! |
From: Misha K. <mis...@gm...> - 2010-05-19 19:05:09
|
Thank you so much! That did the trick...

Misha

Ian Roberts wrote:
> Misha Koshelev wrote:
>> I would like to find _all_ descendants (even those that are, say, 2
>> levels deep), within the current context.
>
> You want "descendant::a".
>
> XPath has a number of different 'axes' on which you can look for nodes.
> "a" is equivalent to "child::a", meaning direct children, but there's
> also "descendant::", "descendant-or-self::", "following-sibling::",
> "preceding-sibling::", and several others that can be useful in
> different circumstances. For example, the expression
>
> a[not(preceding-sibling::a)]
>
> finds the *first* 'a' child element of the current context element
> (specifically it finds any 'a' element that doesn't have any other 'a'
> element siblings before it).
>
> Ian
> |
From: Ian R. <i.r...@dc...> - 2010-05-19 16:31:57
|
Misha Koshelev wrote:
> I would like to find _all_ descendants (even those that are, say, 2
> levels deep), within the current context.

You want "descendant::a".

XPath has a number of different 'axes' on which you can look for nodes. "a" is equivalent to "child::a", meaning direct children, but there's also "descendant::", "descendant-or-self::", "following-sibling::", "preceding-sibling::", and several others that can be useful in different circumstances. For example, the expression

    a[not(preceding-sibling::a)]

finds the *first* 'a' child element of the current context element (specifically it finds any 'a' element that doesn't have any other 'a' element siblings before it).

Ian

--
Ian Roberts  | Department of Computer Science
i.r...@dc... | University of Sheffield, UK |
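To make the difference between these expressions concrete, here is a minimal sketch using only the standard JAXP XPath API. The sample markup and the names `XPathAxesDemo`, `SAMPLE`, and `count` are invented for illustration; the context node plays the role of the "current element" from the question.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class XPathAxesDemo {

    // One <a> outside <myelement>, one direct child, and two nested one level deeper.
    static final String SAMPLE =
            "<root><a/><myelement><a/><div><a/><a/></div></myelement></root>";

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse sample markup", e);
        }
    }

    /** Evaluates the expression relative to the context node and returns the match count. */
    static int count(String expression, Node context) {
        try {
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(expression, context, XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse(SAMPLE);
        Node context = doc.getElementsByTagName("myelement").item(0);
        String[] expressions = {
            "a",                            // child::a, direct children only
            "descendant::a",                // every <a> below the context node
            ".//a",                         // same node set, abbreviated syntax
            "//a",                          // absolute path: starts at the document root
            "a[not(preceding-sibling::a)]"  // first <a> child of the context node
        };
        for (String expression : expressions) {
            System.out.println(expression + " -> " + count(expression, context));
        }
    }
}
```

Note that the result type for a multi-node query is `XPathConstants.NODESET`, and that an expression starting with "//" ignores the context node because it is an absolute path.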
From: Misha K. <mis...@gm...> - 2010-05-19 16:05:54
|
Thank you. Apparently XPath is a little stranger beast than I imagined ;)

I am reading here:
http://java.sun.com/developer/technicalArticles/xml/jaxp1-3/

Notably, if I do "a" as my XPath expression, it finds only the "a" elements that are _direct_ children of my current node (e.g., if there's a <myelement><div><a /></div></myelement>, and I search with myelement as the context, it does _not_ find my desired tag).

If I do "//a", it just ignores the current node and goes for the entire document.

Even if I do "//xpathtocurrentnode/a", it seems to only look for _direct_ descendants.

hmm... I would like to find _all_ descendants (even those that are, say, 2 levels deep), within the current context.

Any ideas? Besides doing the whole document and manually checking.

Thank you
Misha

Luis Fernando Gutiérrez wrote:
> Misha.
>
> As far as I understand your XPath expression "//a" is querying all the
> "a" elements from the root node. If you want to look inside a
> specific element, you should change the "//" part for the element
> you want.
>
>
>
> ----- Original Message ----
> From: Misha Koshelev <mis...@gm...>
> To: nek...@li...
> Sent: Tue, May 18, 2010 11:15:00 PM
> Subject: [nekohtml-user] Sorry to bother - Xalan JAXP XPath from _current_ element question
>
> My apologies if unclear.
>
> I would like to, say, find all "a" tags that are children of an Element element.
>
> However, when I use the XPathExpression.evaluate(element,XPathExpression.NODE_TYPE) function with the XPath "//a", I end up
> getting those for _entire_ document.
>
> Please let me know if I am missing something simple.
>
> Thank you
> Misha
>
> ------------------------------------------------------------------------------
>
> _______________________________________________
> nekohtml-user mailing list
> nek...@li...
> https://lists.sourceforge.net/lists/listinfo/nekohtml-user
>
>
> ------------------------------------------------------------------------------
>
> _______________________________________________
> nekohtml-user mailing list
> nek...@li...
> https://lists.sourceforge.net/lists/listinfo/nekohtml-user |
From: Luis F. G. <lui...@ya...> - 2010-05-19 15:25:02
|
Misha.

As far as I understand your XPath expression "//a" is querying all the "a" elements from the root node. If you want to look inside a specific element, you should change the "//" part for the element you want.

----- Original Message ----
From: Misha Koshelev <mis...@gm...>
To: nek...@li...
Sent: Tue, May 18, 2010 11:15:00 PM
Subject: [nekohtml-user] Sorry to bother - Xalan JAXP XPath from _current_ element question

My apologies if unclear.

I would like to, say, find all "a" tags that are children of an Element element.

However, when I use the XPathExpression.evaluate(element,XPathExpression.NODE_TYPE) function with the XPath "//a", I end up getting those for _entire_ document.

Please let me know if I am missing something simple.

Thank you
Misha

------------------------------------------------------------------------------

_______________________________________________
nekohtml-user mailing list
nek...@li...
https://lists.sourceforge.net/lists/listinfo/nekohtml-user |
From: Misha K. <mis...@gm...> - 2010-05-19 04:15:09
|
My apologies if unclear.

I would like to, say, find all "a" tags that are children of an Element element.

However, when I use the XPathExpression.evaluate(element,XPathExpression.NODE_TYPE) function with the XPath "//a", I end up getting those for _entire_ document.

Please let me know if I am missing something simple.

Thank you
Misha |
From: Misha K. <mis...@gm...> - 2010-05-16 02:54:33
|
Thank you. I have followed your advice a few days later :) Misha Jacob Kjome wrote: > On 5/13/2010 10:09 AM, Misha Koshelev wrote: >> Jacob Kjome wrote: >>> No NekoHML doesn't have any specific XPath integration (beyond what Xerces >>> itself provides). But you can use XOM's support (via Jaxen), Jaxen directly >>> (or any other custom XPath implementation), or just use the built in JAXP >>> support provided by the standard JAXP APIs, which Xerces developers would >>> encourage you to use anyway rather than the Xerces implementation directly >>> (which NekoHTML is basically an extension of, or plugin for). >> Thank you. I am a bit confused about this. Do I understand correctly from >> http://xml.apache.org/xalan-j/xpath_apis.html#xpathexpression >> that if I use a SAX parser like NekoHTML with JAXP, then I must _reparse_ the document >> every time I try to execute an XPath? >> >> Ideally, I'd like something simple like >> HTML SAX parser >> + >> XPath library >> > > You provide the DOM and you can use any XPath library you want. In fact, why not > use the standard APIs? > > http://xml.apache.org/xalan-j/xpath_apis.html > http://www.ibm.com/developerworks/library/x-javaxpathapi.html > > There's also Jaxen (which XOM uses internally)... > http://jaxen.codehaus.org/ > > There's commons-jxpath... > http://commons.apache.org/jxpath/ > > There's no lack of XPath APIs. > >> without anything else. So far NekoHTML/dom4j or Tagsoup/XOM both seem to work well. >> >> I am slightly favoring Tagsoup/XOM as it does not require the Xerces implementation, but perhaps >> if I use Xerces I don't need dom4j? I am still a little confused. >> > > I'm fairly certain XOM requires Xerces. It ships with it, after all (a minimal > version). 
I don't see why you'd ever need DOM4j > >> What I'd like ideally: >> 1) Process HTML document quickly so I can access it in some form >> 2) My access will involve _repeated_ XPath queries on same document >> >> Is NekoHTML/dom4j or Tagsoup/XOM the correct (simplest) solution for this? >> > > All you need is a document. You can create that without the help of dom4j, xom, > jdom, or any other specialized DOM API. Again, I point you to... > > http://nekohtml.sourceforge.net/usage.html > >> Thank you >> Misha >> >> p.s. Please see comment below about setProperty. >> >>> BTW, it occurred to me shortly after I sent my previous response that when you >>> use the SAX parser, then you probably don't need to set.... >>> >>> parser.setProperty("http://cyberneko.org/html/properties/names/elems", >>> "lower"); >> Thanks. Actually I did have to do this. Otherwise we get upper-case element names. >> > > Good to know. Though, I think what you are doing is just creating a DOM. Your > prior comments led me to believe that you were somehow applying XPath expressions > to a SAX stream, which would give you the elements in the case supplied by the > document. It is only when you create an HTML DOM that the case changes, as the > HTML DOM stores elemens in UPPER-case. > > If you are ending up with a Document in the end, there's no reason I can think of > to prefer a SAXParser over a DOMParser. > > > Jake > >>> That is likely only relevant when using a DOM parser. That's only because the >>> HTML DOM stores element names in UPPER-case regardless of the case of the tags >>> in the input document. I encourage you to test this and report back your >>> results. >>> >>> This discussion triggered one other memory. See here for reference... >>> http://archive.jaxen.codehaus.org/lists/org.codehaus.jaxen.dev/msg/111...@my... >>> >>> I asked that question back in 2005, but I don't recall whether I ever tested >>> it out? 
Basically, if one uses a standard validating XML parser (rather than >>> a non-validating HTML parser like NekoHTML), can one achieve non-namespaced >>> XPath queries by doing (below is an example using Jaxen)?... >>> >>> >>> >>> XPath xpath = new DOMXPath( "//p" ); >>> xpath.addNamespace( "", "http://www.w3.org/1999/xhtml" ); >>> Then again, maybe the XPath would need to be?.... >>> >>> >>> XPath xpath = new DOMXPath( "//:p" ); >>> ...or maybe that wouldn't work either and it would need to be the usual (which >>> I know works)... >>> >>> >>> XPath xpath = new DOMXPath( "//foo:p" ); >>> xpath.addNamespace( "foo", "http://www.w3.org/1999/xhtml" ); >>> >>> I will try this myself when I get time. But if you want to try it an report >>> back your findings, that would be great too. >>> >>> Jake >>> >>> >>> On Thu, 13 May 2010 07:16:01 -0500 >>> Misha Koshelev <mis...@gm...> wrote: >>>> Thank you. I have gotten NekoHTML to work well with dom4j except that >>>> xercesImpl must be included instead of xercesMinimal (believe this is due to >>>> SAX parser having to be included) >>>> >>>> 1. Just to double check - there is _no_ Xpath support in NekoHTML by itself, >>>> correct? >>>> 2. I have found the following (quite dated) bookmark: >>>> http://www.portletbridge.org/saxbenchmark/results.html >>>> Any ideas on performance of NekoHTML+dom4j vs Tagsoup+XOM? >>>> 3. I noticed the Website was "last updated" in 2009. Is this still an >>>> actively maintained project? >>>> >>>> Thank you >>>> Misha >>>> >>>> >>>> Andy Clark wrote: >>>>> There's no good reason for NekoHTML to *not* support this SAX feature >>>>> because there are no external general entities with HTML documents. The >>>>> parser would behave the same regardless of the value of the feature. The >>>>> parser configuration just needs to be changed to not throw an unsupported >>>>> feature exception. 
>>>>> >>>>> >>>> ------------------------------------------------------------------------------ >>>> >>>> _______________________________________________ >>>> nekohtml-user mailing list >>>> nek...@li... >>>> https://lists.sourceforge.net/lists/listinfo/nekohtml-user >>>> >>> >>> ------------------------------------------------------------------------------ >>> >>> _______________________________________________ >>> nekohtml-user mailing list >>> nek...@li... >>> https://lists.sourceforge.net/lists/listinfo/nekohtml-user >> >> ------------------------------------------------------------------------------ >> >> _______________________________________________ >> nekohtml-user mailing list >> nek...@li... >> https://lists.sourceforge.net/lists/listinfo/nekohtml-user >> >> >> > > ------------------------------------------------------------------------------ > > _______________________________________________ > nekohtml-user mailing list > nek...@li... > https://lists.sourceforge.net/lists/listinfo/nekohtml-user |
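On the namespace question quoted in the message above: XPath 1.0 specifies that an unprefixed name in a node test always refers to no namespace, and a default `xmlns` declaration is not consulted. The sketch below shows the consequence with the standard JAXP API, which is a swapped-in stand-in for the Jaxen calls in the quoted text; it says nothing about how Jaxen treats an empty prefix. The prefix "h" and the names `NamespacedXPathDemo`, `xhtmlContext`, and `count` are arbitrary choices for this example.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class NamespacedXPathDemo {

    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";
    static final String SAMPLE =
            "<html xmlns=\"" + XHTML_NS + "\"><body><p>one</p><p>two</p></body></html>";

    /** Parses with namespace awareness switched on (it is off by default in JAXP). */
    static Document parseNamespaceAware(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse sample markup", e);
        }
    }

    /** Binds the illustrative prefix "h" to the XHTML namespace. */
    static NamespaceContext xhtmlContext() {
        return new NamespaceContext() {
            @Override public String getNamespaceURI(String prefix) {
                return "h".equals(prefix) ? XHTML_NS : XMLConstants.NULL_NS_URI;
            }
            @Override public String getPrefix(String uri) {
                return XHTML_NS.equals(uri) ? "h" : null;
            }
            @Override public Iterator<String> getPrefixes(String uri) {
                String prefix = getPrefix(uri);
                return prefix == null
                        ? Collections.<String>emptyIterator()
                        : Collections.singletonList(prefix).iterator();
            }
        };
    }

    /** Counts the nodes selected by the expression, with "h" bound as above. */
    static int count(String expression, Document doc) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(xhtmlContext());
            NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        Document doc = parseNamespaceAware(SAMPLE);
        System.out.println("//p   -> " + count("//p", doc));   // unprefixed: no namespace
        System.out.println("//h:p -> " + count("//h:p", doc)); // prefixed: XHTML namespace
    }
}
```

With a non-namespace-aware parse, or with a parser that produces elements in no namespace, the unprefixed form is the one that matches instead.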
From: Jacob K. <ho...@vi...> - 2010-05-14 03:35:38
|
On 5/13/2010 10:09 AM, Misha Koshelev wrote: > Jacob Kjome wrote: >> No NekoHML doesn't have any specific XPath integration (beyond what Xerces >> itself provides). But you can use XOM's support (via Jaxen), Jaxen directly >> (or any other custom XPath implementation), or just use the built in JAXP >> support provided by the standard JAXP APIs, which Xerces developers would >> encourage you to use anyway rather than the Xerces implementation directly >> (which NekoHTML is basically an extension of, or plugin for). > Thank you. I am a bit confused about this. Do I understand correctly from > http://xml.apache.org/xalan-j/xpath_apis.html#xpathexpression > that if I use a SAX parser like NekoHTML with JAXP, then I must _reparse_ the document > every time I try to execute an XPath? > > Ideally, I'd like something simple like > HTML SAX parser > + > XPath library > You provide the DOM and you can use any XPath library you want. In fact, why not use the standard APIs? http://xml.apache.org/xalan-j/xpath_apis.html http://www.ibm.com/developerworks/library/x-javaxpathapi.html There's also Jaxen (which XOM uses internally)... http://jaxen.codehaus.org/ There's commons-jxpath... http://commons.apache.org/jxpath/ There's no lack of XPath APIs. > without anything else. So far NekoHTML/dom4j or Tagsoup/XOM both seem to work well. > > I am slightly favoring Tagsoup/XOM as it does not require the Xerces implementation, but perhaps > if I use Xerces I don't need dom4j? I am still a little confused. > I'm fairly certain XOM requires Xerces. It ships with it, after all (a minimal version). I don't see why you'd ever need DOM4j > What I'd like ideally: > 1) Process HTML document quickly so I can access it in some form > 2) My access will involve _repeated_ XPath queries on same document > > Is NekoHTML/dom4j or Tagsoup/XOM the correct (simplest) solution for this? > All you need is a document. You can create that without the help of dom4j, xom, jdom, or any other specialized DOM API. 
Again, I point you to... http://nekohtml.sourceforge.net/usage.html > Thank you > Misha > > p.s. Please see comment below about setProperty. > >> >> BTW, it occurred to me shortly after I sent my previous response that when you >> use the SAX parser, then you probably don't need to set.... >> >> parser.setProperty("http://cyberneko.org/html/properties/names/elems", >> "lower"); > Thanks. Actually I did have to do this. Otherwise we get upper-case element names. > Good to know. Though, I think what you are doing is just creating a DOM. Your prior comments led me to believe that you were somehow applying XPath expressions to a SAX stream, which would give you the elements in the case supplied by the document. It is only when you create an HTML DOM that the case changes, as the HTML DOM stores elemens in UPPER-case. If you are ending up with a Document in the end, there's no reason I can think of to prefer a SAXParser over a DOMParser. Jake >> >> That is likely only relevant when using a DOM parser. That's only because the >> HTML DOM stores element names in UPPER-case regardless of the case of the tags >> in the input document. I encourage you to test this and report back your >> results. >> >> This discussion triggered one other memory. See here for reference... >> http://archive.jaxen.codehaus.org/lists/org.codehaus.jaxen.dev/msg/111...@my... >> >> I asked that question back in 2005, but I don't recall whether I ever tested >> it out? Basically, if one uses a standard validating XML parser (rather than >> a non-validating HTML parser like NekoHTML), can one achieve non-namespaced >> XPath queries by doing (below is an example using Jaxen)?... >> >> >> >> XPath xpath = new DOMXPath( "//p" ); >> xpath.addNamespace( "", "http://www.w3.org/1999/xhtml" ); >> Then again, maybe the XPath would need to be?.... >> >> >> XPath xpath = new DOMXPath( "//:p" ); >> ...or maybe that wouldn't work either and it would need to be the usual (which >> I know works)... 
>> >> >> XPath xpath = new DOMXPath( "//foo:p" ); >> xpath.addNamespace( "foo", "http://www.w3.org/1999/xhtml" ); >> >> I will try this myself when I get time. But if you want to try it an report >> back your findings, that would be great too. >> >> Jake >> >> >> On Thu, 13 May 2010 07:16:01 -0500 >> Misha Koshelev <mis...@gm...> wrote: >>> Thank you. I have gotten NekoHTML to work well with dom4j except that >>> xercesImpl must be included instead of xercesMinimal (believe this is due to >>> SAX parser having to be included) >>> >>> 1. Just to double check - there is _no_ Xpath support in NekoHTML by itself, >>> correct? >>> 2. I have found the following (quite dated) bookmark: >>> http://www.portletbridge.org/saxbenchmark/results.html >>> Any ideas on performance of NekoHTML+dom4j vs Tagsoup+XOM? >>> 3. I noticed the Website was "last updated" in 2009. Is this still an >>> actively maintained project? >>> >>> Thank you >>> Misha >>> >>> >>> Andy Clark wrote: >>>> There's no good reason for NekoHTML to *not* support this SAX feature >>>> because there are no external general entities with HTML documents. The >>>> parser would behave the same regardless of the value of the feature. The >>>> parser configuration just needs to be changed to not throw an unsupported >>>> feature exception. >>>> >>>> >>> >>> ------------------------------------------------------------------------------ >>> >>> _______________________________________________ >>> nekohtml-user mailing list >>> nek...@li... >>> https://lists.sourceforge.net/lists/listinfo/nekohtml-user >>> >> >> >> ------------------------------------------------------------------------------ >> >> _______________________________________________ >> nekohtml-user mailing list >> nek...@li... 
>> https://lists.sourceforge.net/lists/listinfo/nekohtml-user > > > ------------------------------------------------------------------------------ > > _______________________________________________ > nekohtml-user mailing list > nek...@li... > https://lists.sourceforge.net/lists/listinfo/nekohtml-user > > > |
From: Misha K. <mis...@gm...> - 2010-05-13 16:10:08
|
Jacob Kjome wrote:
> No, NekoHTML doesn't have any specific XPath integration (beyond what Xerces
> itself provides). But you can use XOM's support (via Jaxen), Jaxen directly
> (or any other custom XPath implementation), or just use the built in JAXP
> support provided by the standard JAXP APIs, which Xerces developers would
> encourage you to use anyway rather than the Xerces implementation directly
> (which NekoHTML is basically an extension of, or plugin for).

Thank you. I am a bit confused about this. Do I understand correctly from
http://xml.apache.org/xalan-j/xpath_apis.html#xpathexpression
that if I use a SAX parser like NekoHTML with JAXP, then I must _reparse_ the document every time I try to execute an XPath?

Ideally, I'd like something simple like an HTML SAX parser + XPath library without anything else. So far NekoHTML/dom4j and TagSoup/XOM both seem to work well. I am slightly favoring TagSoup/XOM as it does not require the Xerces implementation, but perhaps if I use Xerces I don't need dom4j? I am still a little confused.

What I'd like ideally:
1) Process the HTML document quickly so I can access it in some form
2) My access will involve _repeated_ XPath queries on the same document

Is NekoHTML/dom4j or TagSoup/XOM the correct (simplest) solution for this?

Thank you
Misha

p.s. Please see comment below about setProperty.

>
> BTW, it occurred to me shortly after I sent my previous response that when you
> use the SAX parser, then you probably don't need to set....
>
> parser.setProperty("http://cyberneko.org/html/properties/names/elems",
> "lower");

Thanks. Actually I did have to do this. Otherwise we get upper-case element names.

>
> That is likely only relevant when using a DOM parser. That's only because the
> HTML DOM stores element names in UPPER-case regardless of the case of the tags
> in the input document. I encourage you to test this and report back your
> results.
>
> This discussion triggered one other memory. See here for reference...
> http://archive.jaxen.codehaus.org/lists/org.codehaus.jaxen.dev/msg/111...@my... > > I asked that question back in 2005, but I don't recall whether I ever tested > it out? Basically, if one uses a standard validating XML parser (rather than > a non-validating HTML parser like NekoHTML), can one achieve non-namespaced > XPath queries by doing (below is an example using Jaxen)?... > > > > XPath xpath = new DOMXPath( "//p" ); > xpath.addNamespace( "", "http://www.w3.org/1999/xhtml" ); > Then again, maybe the XPath would need to be?.... > > > XPath xpath = new DOMXPath( "//:p" ); > ...or maybe that wouldn't work either and it would need to be the usual (which > I know works)... > > > XPath xpath = new DOMXPath( "//foo:p" ); > xpath.addNamespace( "foo", "http://www.w3.org/1999/xhtml" ); > > I will try this myself when I get time. But if you want to try it an report > back your findings, that would be great too. > > Jake > > > On Thu, 13 May 2010 07:16:01 -0500 > Misha Koshelev <mis...@gm...> wrote: >> Thank you. I have gotten NekoHTML to work well with dom4j except that >> xercesImpl must be included instead of xercesMinimal (believe this is due to >> SAX parser having to be included) >> >> 1. Just to double check - there is _no_ Xpath support in NekoHTML by itself, >> correct? >> 2. I have found the following (quite dated) bookmark: >> http://www.portletbridge.org/saxbenchmark/results.html >> Any ideas on performance of NekoHTML+dom4j vs Tagsoup+XOM? >> 3. I noticed the Website was "last updated" in 2009. Is this still an >> actively maintained project? >> >> Thank you >> Misha >> >> >> Andy Clark wrote: >>> There's no good reason for NekoHTML to *not* support this SAX feature >>> because there are no external general entities with HTML documents. The >>> parser would behave the same regardless of the value of the feature. The >>> parser configuration just needs to be changed to not throw an unsupported >>> feature exception. 
>>> >>> >> >> ------------------------------------------------------------------------------ >> >> _______________________________________________ >> nekohtml-user mailing list >> nek...@li... >> https://lists.sourceforge.net/lists/listinfo/nekohtml-user >> > > > ------------------------------------------------------------------------------ > > _______________________________________________ > nekohtml-user mailing list > nek...@li... > https://lists.sourceforge.net/lists/listinfo/nekohtml-user |
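On the reparsing question above: with JAXP, an XPath can be evaluated against a DOM tree that is already in memory, so the document only has to be parsed once. The sketch below illustrates that under one loud assumption: the JDK's own XML parser stands in for NekoHTML's DOM parser so that the example runs without extra jars, and the class and method names (RepeatedXPath, parseOnce, eval) are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class RepeatedXPath {

    // Assumption: the JDK parser is a stand-in for an HTML-tolerant DOM parser
    // such as NekoHTML's; the input here must be well-formed.
    static Document parseOnce(String markup) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(markup)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Evaluates an expression against the in-memory tree; nothing is reparsed.
    static String eval(Document doc, String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(expression, doc);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = parseOnce(
                "<html><body><a href=\"one.html\">One</a>"
                + "<a href=\"two.html\">Two</a></body></html>");
        // Three queries, one parse.
        System.out.println(eval(doc, "//body/a[1]/@href"));
        System.out.println(eval(doc, "//body/a[2]/@href"));
        System.out.println(eval(doc, "count(//a)"));
    }
}
```

The reparsing cost only applies to the evaluate overloads that take an InputSource; passing a Node, as here, reuses the existing tree.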
From: Jacob K. <ho...@vi...> - 2010-05-13 15:52:59
|
No, NekoHTML doesn't have any specific XPath integration (beyond what Xerces itself provides). But you can use XOM's support (via Jaxen), Jaxen directly (or any other custom XPath implementation), or just use the built-in JAXP support provided by the standard JAXP APIs, which Xerces developers would encourage you to use anyway rather than the Xerces implementation directly (which NekoHTML is basically an extension of, or plugin for).

BTW, it occurred to me shortly after I sent my previous response that when you use the SAX parser, then you probably don't need to set....

    parser.setProperty("http://cyberneko.org/html/properties/names/elems", "lower");

That is likely only relevant when using a DOM parser. That's only because the HTML DOM stores element names in UPPER-case regardless of the case of the tags in the input document. I encourage you to test this and report back your results.

This discussion triggered one other memory. See here for reference...
http://archive.jaxen.codehaus.org/lists/org.codehaus.jaxen.dev/msg/111...@my...

I asked that question back in 2005, but I don't recall whether I ever tested it out. Basically, if one uses a standard validating XML parser (rather than a non-validating HTML parser like NekoHTML), can one achieve non-namespaced XPath queries by doing (below is an example using Jaxen)?...

    XPath xpath = new DOMXPath( "//p" );
    xpath.addNamespace( "", "http://www.w3.org/1999/xhtml" );

Then again, maybe the XPath would need to be?....

    XPath xpath = new DOMXPath( "//:p" );

...or maybe that wouldn't work either and it would need to be the usual (which I know works)...

    XPath xpath = new DOMXPath( "//foo:p" );
    xpath.addNamespace( "foo", "http://www.w3.org/1999/xhtml" );

I will try this myself when I get time. But if you want to try it and report back your findings, that would be great too.

Jake

On Thu, 13 May 2010 07:16:01 -0500
Misha Koshelev <mis...@gm...> wrote:
> Thank you.
I have gotten NekoHTML to work well with dom4j except that >xercesImpl must be included instead of xercesMinimal (believe this is due to >SAX parser having to be included) > > 1. Just to double check - there is _no_ Xpath support in NekoHTML by itself, >correct? > 2. I have found the following (quite dated) bookmark: > http://www.portletbridge.org/saxbenchmark/results.html > Any ideas on performance of NekoHTML+dom4j vs Tagsoup+XOM? > 3. I noticed the Website was "last updated" in 2009. Is this still an >actively maintained project? > > Thank you > Misha > > > Andy Clark wrote: >> There's no good reason for NekoHTML to *not* support this SAX feature >>because there are no external general entities with HTML documents. The >>parser would behave the same regardless of the value of the feature. The >>parser configuration just needs to be changed to not throw an unsupported >>feature exception. >> >> > > > ------------------------------------------------------------------------------ > > _______________________________________________ > nekohtml-user mailing list > nek...@li... > https://lists.sourceforge.net/lists/listinfo/nekohtml-user > |
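For the namespace question above, here is a sketch using the JDK's JAXP XPath API rather than Jaxen, so it says nothing definite about how Jaxen treats an empty prefix. In XPath 1.0 an unprefixed name test only matches elements in no namespace, so against a namespace-aware XHTML tree `//p` finds nothing and a bound prefix is needed. The prefix `foo` comes from the message; the class and method names are hypothetical.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class NamespacedXPath {
    static final String XHTML = "http://www.w3.org/1999/xhtml";

    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // required for namespace matching
            return factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    static String eval(Document doc, String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            // Binds the prefix "foo" to the XHTML namespace, nothing else.
            xpath.setNamespaceContext(new NamespaceContext() {
                public String getNamespaceURI(String prefix) {
                    return "foo".equals(prefix) ? XHTML : XMLConstants.NULL_NS_URI;
                }
                public String getPrefix(String uri) {
                    return XHTML.equals(uri) ? "foo" : null;
                }
                public Iterator<String> getPrefixes(String uri) {
                    List<String> prefixes = XHTML.equals(uri)
                            ? Collections.singletonList("foo")
                            : Collections.<String>emptyList();
                    return prefixes.iterator();
                }
            });
            return xpath.evaluate(expression, doc);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<html xmlns=\"" + XHTML + "\"><body><p>hi</p></body></html>");
        System.out.println(eval(doc, "count(//p)"));     // unprefixed: no namespace
        System.out.println(eval(doc, "count(//foo:p)")); // prefixed: XHTML namespace
    }
}
```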
From: Misha K. <mis...@gm...> - 2010-05-13 12:26:37
|
Thank you. I have gotten NekoHTML to work well with dom4j, except that xercesImpl must be included instead of xercesMinimal (I believe this is because the SAX parser has to be included).

1. Just to double check: there is _no_ XPath support in NekoHTML by itself, correct?
2. I have found the following (quite dated) bookmark: http://www.portletbridge.org/saxbenchmark/results.html
   Any ideas on the performance of NekoHTML+dom4j vs TagSoup+XOM?
3. I noticed the website was "last updated" in 2009. Is this still an actively maintained project?

Thank you
Misha

Andy Clark wrote:
> There's no good reason for NekoHTML to *not* support this SAX feature because there are no external general entities with HTML documents. The parser would behave the same regardless of the value of the feature. The parser configuration just needs to be changed to not throw an unsupported feature exception.
|
From: Andy C. <an...@cy...> - 2010-05-13 08:45:52
|
There's no good reason for NekoHTML to *not* support this SAX feature because there are no external general entities with HTML documents. The parser would behave the same regardless of the value of the feature. The parser configuration just needs to be changed to not throw an unsupported feature exception. On May 12, 2010, at 11:40 PM, Jacob Kjome <ho...@vi...> wrote: > > Interesting. Never tried using XOM and NekoHTML together. The problem below is > that org.cyberneko.html.HTMLConfiguration doesn't support the feature: > "http://xml.org/sax/features/external-general-entities". > > I'm not sure what it takes to support it, but I imagine it could be added to > satisfy XOM, though I wonder if you'd run into another error after getting by this > one? > > In any case, that's not the end of it. In order to use an XPath like... > > //body/a/@href > > ...you'd need to set the following property... > > parser.setProperty("http://cyberneko.org/html/properties/names/elems", "lower"); > > ...otherwise you'd have to do... > > //BODY/A/@href > > ...because the HTML DOM is UPPER-case by default. > > > BTW, you can always use NekoHTML directly rather than via XOM or Dom4j... > http://nekohtml.sourceforge.net/usage.html > > > Jake > > On 5/12/2010 9:58 AM, Misha Koshelev wrote: >> Dear All: >> >> Thank you for your great product! I am trying to use an HTML parser with XPath support for my needs. >> >> I am trying to use NekoHTML with XOM and am running into two issues. If I use the full Xerces implementation, I have the following issue: >> [java] nu.xom.XMLException: org.cyberneko.html.parsers.SAXParser does not support the entity resolution features XOM requires. 
>> [java] at nu.xom.Builder.<init>(Unknown Source) >> [java] at nu.xom.Builder.<init>(Unknown Source) >> [java] at nu.xom.Builder.<init>(Unknown Source) >> [java] at org.eclipse.swt.browser.webdriver.Driver$1.run(Driver.java:116) >> [java] at org.eclipse.swt.widgets.RunnableLock.run(Unknown Source) >> [java] at org.eclipse.swt.widgets.Synchronizer.runAsyncMessages(Unknown Source) >> [java] at org.eclipse.swt.widgets.Display.runAsyncMessages(Unknown Source) >> [java] at org.eclipse.swt.widgets.Display.readAndDispatch(Unknown Source) >> [java] at org.eclipse.swt.browser.webdriver.Test.main(Test.java:76) >> [java] Caused by: org.xml.sax.SAXNotRecognizedException: Feature 'http://xml.org/sax/features/external-general-entities' is not recognized. >> [java] at org.apache.xerces.parsers.AbstractSAXParser.setFeature(Unknown Source) >> [java] at nu.xom.Builder.setupParser(Unknown Source) >> [java] ... 9 more >> [java] nu.xom.XMLException: org.cyberneko.html.parsers.SAXParser does not support the entity resolution features XOM requires. 
>> [java] Java Result: 1 >> >> If I use xercesMinimal.jar, I have the following problem: >> [java] java.util.MissingResourceException: Can't find bundle for base name org.apache.xerces.impl.msg.SAXMessages, locale en_US >> [java] at java.util.ResourceBundle.throwMissingResourceException(ResourceBundle.java:1427) >> [java] at java.util.ResourceBundle.getBundleImpl(ResourceBundle.java:1250) >> [java] at java.util.ResourceBundle.getBundle(ResourceBundle.java:777) >> [java] at org.apache.xerces.util.SAXMessageFormatter.formatMessage(Unknown Source) >> [java] at org.apache.xerces.parsers.AbstractSAXParser.setFeature(Unknown Source) >> [java] at nu.xom.Builder.setupParser(Unknown Source) >> [java] at nu.xom.Builder.<init>(Unknown Source) >> [java] at nu.xom.Builder.<init>(Unknown Source) >> [java] at nu.xom.Builder.<init>(Unknown Source) >> [java] at org.eclipse.swt.browser.webdriver.Driver$1.run(Driver.java:116) >> [java] at org.eclipse.swt.widgets.RunnableLock.run(Unknown Source) >> [java] at org.eclipse.swt.widgets.Synchronizer.runAsyncMessages(Unknown Source) >> [java] at org.eclipse.swt.widgets.Display.runAsyncMessages(Unknown Source) >> [java] at org.eclipse.swt.widgets.Display.readAndDispatch(Unknown Source) >> [java] at org.eclipse.swt.browser.webdriver.Test.main(Test.java:76) >> [java] java.util.MissingResourceException: Can't find bundle for base name org.apache.xerces.impl.msg.SAXMessages, locale en_US >> [java] Java Result: 1 >> >> I appreciate any help. Is there better dom4j compatibility? >> >> Thank you >> Misha >> >> ------------------------------------------------------------------------------ >> >> _______________________________________________ >> nekohtml-user mailing list >> nek...@li... >> https://lists.sourceforge.net/lists/listinfo/nekohtml-user >> >> >> > > ------------------------------------------------------------------------------ > > _______________________________________________ > nekohtml-user mailing list > nek...@li... 
> https://lists.sourceforge.net/lists/listinfo/nekohtml-user |
From: Jacob K. <ho...@vi...> - 2010-05-13 05:54:58
|
Interesting. Never tried using XOM and NekoHTML together. The problem below is that org.cyberneko.html.HTMLConfiguration doesn't support the feature: "http://xml.org/sax/features/external-general-entities". I'm not sure what it takes to support it, but I imagine it could be added to satisfy XOM, though I wonder if you'd run into another error after getting by this one? In any case, that's not the end of it. In order to use an XPath like... //body/a/@href ...you'd need to set the following property... parser.setProperty("http://cyberneko.org/html/properties/names/elems", "lower"); ...otherwise you'd have to do... //BODY/A/@href ...because the HTML DOM is UPPER-case by default. BTW, you can always use NekoHTML directly rather than via XOM or Dom4j... http://nekohtml.sourceforge.net/usage.html Jake On 5/12/2010 9:58 AM, Misha Koshelev wrote: > Dear All: > > Thank you for your great product! I am trying to use an HTML parser with XPath support for my needs. > > I am trying to use NekoHTML with XOM and am running into two issues. If I use the full Xerces implementation, I have the following issue: > [java] nu.xom.XMLException: org.cyberneko.html.parsers.SAXParser does not support the entity resolution features XOM requires. > [java] at nu.xom.Builder.<init>(Unknown Source) > [java] at nu.xom.Builder.<init>(Unknown Source) > [java] at nu.xom.Builder.<init>(Unknown Source) > [java] at org.eclipse.swt.browser.webdriver.Driver$1.run(Driver.java:116) > [java] at org.eclipse.swt.widgets.RunnableLock.run(Unknown Source) > [java] at org.eclipse.swt.widgets.Synchronizer.runAsyncMessages(Unknown Source) > [java] at org.eclipse.swt.widgets.Display.runAsyncMessages(Unknown Source) > [java] at org.eclipse.swt.widgets.Display.readAndDispatch(Unknown Source) > [java] at org.eclipse.swt.browser.webdriver.Test.main(Test.java:76) > [java] Caused by: org.xml.sax.SAXNotRecognizedException: Feature 'http://xml.org/sax/features/external-general-entities' is not recognized. 
> [java] at org.apache.xerces.parsers.AbstractSAXParser.setFeature(Unknown Source) > [java] at nu.xom.Builder.setupParser(Unknown Source) > [java] ... 9 more > [java] nu.xom.XMLException: org.cyberneko.html.parsers.SAXParser does not support the entity resolution features XOM requires. > [java] Java Result: 1 > > If I use xercesMinimal.jar, I have the following problem: > [java] java.util.MissingResourceException: Can't find bundle for base name org.apache.xerces.impl.msg.SAXMessages, locale en_US > [java] at java.util.ResourceBundle.throwMissingResourceException(ResourceBundle.java:1427) > [java] at java.util.ResourceBundle.getBundleImpl(ResourceBundle.java:1250) > [java] at java.util.ResourceBundle.getBundle(ResourceBundle.java:777) > [java] at org.apache.xerces.util.SAXMessageFormatter.formatMessage(Unknown Source) > [java] at org.apache.xerces.parsers.AbstractSAXParser.setFeature(Unknown Source) > [java] at nu.xom.Builder.setupParser(Unknown Source) > [java] at nu.xom.Builder.<init>(Unknown Source) > [java] at nu.xom.Builder.<init>(Unknown Source) > [java] at nu.xom.Builder.<init>(Unknown Source) > [java] at org.eclipse.swt.browser.webdriver.Driver$1.run(Driver.java:116) > [java] at org.eclipse.swt.widgets.RunnableLock.run(Unknown Source) > [java] at org.eclipse.swt.widgets.Synchronizer.runAsyncMessages(Unknown Source) > [java] at org.eclipse.swt.widgets.Display.runAsyncMessages(Unknown Source) > [java] at org.eclipse.swt.widgets.Display.readAndDispatch(Unknown Source) > [java] at org.eclipse.swt.browser.webdriver.Test.main(Test.java:76) > [java] java.util.MissingResourceException: Can't find bundle for base name org.apache.xerces.impl.msg.SAXMessages, locale en_US > [java] Java Result: 1 > > I appreciate any help. Is there better dom4j compatibility? > > Thank you > Misha > > ------------------------------------------------------------------------------ > > _______________________________________________ > nekohtml-user mailing list > nek...@li... 
> https://lists.sourceforge.net/lists/listinfo/nekohtml-user > > > |
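The case issue described above comes down to XPath name tests being case-sensitive. A small sketch, assuming a tree whose element names are already UPPER-case (as the message says the HTML DOM stores them) and using the JDK parser as a stand-in for NekoHTML; the class name is made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class XPathCaseDemo {

    // Parses the markup and returns the string value of the expression.
    static String eval(String markup, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(markup)));
            return XPathFactory.newInstance().newXPath().evaluate(expression, doc);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Assumption: upper-case names mimic what an HTML DOM would hold.
        String upper = "<HTML><BODY><A href=\"x.html\">x</A></BODY></HTML>";
        System.out.println("[" + eval(upper, "//body/a/@href") + "]"); // no match
        System.out.println("[" + eval(upper, "//BODY/A/@href") + "]"); // match
    }
}
```

Setting the names/elems property to "lower", as suggested in the message, would let the lower-case expression match instead.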
From: Misha K. <mis...@gm...> - 2010-05-12 16:20:58
|
Dear All:

Thank you for your great product! I am trying to use an HTML parser with XPath support for my needs.

I am trying to use NekoHTML with XOM and am running into two issues. If I use the full Xerces implementation, I have the following issue:

[java] nu.xom.XMLException: org.cyberneko.html.parsers.SAXParser does not support the entity resolution features XOM requires.
[java]     at nu.xom.Builder.<init>(Unknown Source)
[java]     at nu.xom.Builder.<init>(Unknown Source)
[java]     at nu.xom.Builder.<init>(Unknown Source)
[java]     at org.eclipse.swt.browser.webdriver.Driver$1.run(Driver.java:116)
[java]     at org.eclipse.swt.widgets.RunnableLock.run(Unknown Source)
[java]     at org.eclipse.swt.widgets.Synchronizer.runAsyncMessages(Unknown Source)
[java]     at org.eclipse.swt.widgets.Display.runAsyncMessages(Unknown Source)
[java]     at org.eclipse.swt.widgets.Display.readAndDispatch(Unknown Source)
[java]     at org.eclipse.swt.browser.webdriver.Test.main(Test.java:76)
[java] Caused by: org.xml.sax.SAXNotRecognizedException: Feature 'http://xml.org/sax/features/external-general-entities' is not recognized.
[java]     at org.apache.xerces.parsers.AbstractSAXParser.setFeature(Unknown Source)
[java]     at nu.xom.Builder.setupParser(Unknown Source)
[java]     ... 9 more
[java] nu.xom.XMLException: org.cyberneko.html.parsers.SAXParser does not support the entity resolution features XOM requires.
[java] Java Result: 1

If I use xercesMinimal.jar, I have the following problem:

[java] java.util.MissingResourceException: Can't find bundle for base name org.apache.xerces.impl.msg.SAXMessages, locale en_US
[java]     at java.util.ResourceBundle.throwMissingResourceException(ResourceBundle.java:1427)
[java]     at java.util.ResourceBundle.getBundleImpl(ResourceBundle.java:1250)
[java]     at java.util.ResourceBundle.getBundle(ResourceBundle.java:777)
[java]     at org.apache.xerces.util.SAXMessageFormatter.formatMessage(Unknown Source)
[java]     at org.apache.xerces.parsers.AbstractSAXParser.setFeature(Unknown Source)
[java]     at nu.xom.Builder.setupParser(Unknown Source)
[java]     at nu.xom.Builder.<init>(Unknown Source)
[java]     at nu.xom.Builder.<init>(Unknown Source)
[java]     at nu.xom.Builder.<init>(Unknown Source)
[java]     at org.eclipse.swt.browser.webdriver.Driver$1.run(Driver.java:116)
[java]     at org.eclipse.swt.widgets.RunnableLock.run(Unknown Source)
[java]     at org.eclipse.swt.widgets.Synchronizer.runAsyncMessages(Unknown Source)
[java]     at org.eclipse.swt.widgets.Display.runAsyncMessages(Unknown Source)
[java]     at org.eclipse.swt.widgets.Display.readAndDispatch(Unknown Source)
[java]     at org.eclipse.swt.browser.webdriver.Test.main(Test.java:76)
[java] java.util.MissingResourceException: Can't find bundle for base name org.apache.xerces.impl.msg.SAXMessages, locale en_US
[java] Java Result: 1

I appreciate any help. Is there better dom4j compatibility?

Thank you
Misha
|
From: Misha K. <mis...@gm...> - 2010-05-12 10:05:45
|
Dear All:

Thank you for a great product! I am using TagSoup+XOM per:
http://nicklothian.com/blog/2006/09/11/using-xpath-on-real-world-html-documents/

It seems to work well except for the following namespace problem:
http://www.supermind.org/blog/613/dom4j-xpath-tagsoup-namespaces-sweet

Can I use NekoHTML for XPath? Any code samples? How does it compare to TagSoup, HTMLParser, JTidy, etc.?

Thank you
Misha
|
From: Tomas M. <tom...@2e...> - 2010-04-22 06:55:30
|
You're right, I figured it out. Was a hard day at work yesterday :-)

Thanks.

On 22 April 2010 06:34, Andy Clark <an...@cy...> wrote:
> This has nothing to do with tag balancing; it's caused by whatever is
> writing the output.
>
> --
> Sent from my iPhone
>
> On Apr 21, 2010, at 10:17 AM, Tomas Muldoon <tom...@2e...>
> wrote:
>
> > Hi,
> >
> > With tag balancing is switched on, I notice that <script src="xxxx"></script> elements are converted to <script src="xxx"/>. This causes problems down the line when attemting to render the parsed content in firefox and IE which expects script element with a src attribute to be empty.
> >
> > Is there any way to correct this behavior?
> >
> > Thanks a lot guys,
> > Tom

--
Tom Muldoon | Developer
|
From: Andy C. <an...@cy...> - 2010-04-22 05:47:41
|
This has nothing to do with tag balancing; it's caused by whatever is writing the output.

--
Sent from my iPhone

On Apr 21, 2010, at 10:17 AM, Tomas Muldoon <tom...@2e...> wrote:
> Hi,
>
> With tag balancing is switched on, I notice that <script src="xxxx"></script> elements are converted to <script src="xxx"/>. This causes problems down the line when attemting to render the parsed content in firefox and IE which expects script element with a src attribute to be empty.
>
> Is there any way to correct this behavior?
>
> Thanks a lot guys,
> Tom
|
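To illustrate the point that the writer, not the parser, decides how an empty element is emitted, here is a sketch that serializes the same DOM twice with the JDK's built-in Transformer, once with the "xml" output method and once with "html". Which serializer the original poster used is not stated in the thread, so this is only one possible writer; the class and method names are hypothetical.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class ScriptSerialization {

    // Parses well-formed markup and writes it back with the given output method.
    static String serialize(String markup, String method) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(markup)));

            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.METHOD, method);
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String markup = "<html><body><script src=\"xxx\"></script></body></html>";
        // The "xml" method collapses the childless element to a self-closing tag.
        System.out.println(serialize(markup, "xml"));
        // The "html" method keeps an explicit end tag for script.
        System.out.println(serialize(markup, "html"));
    }
}
```

If the output is meant for browsers as text/html, choosing an HTML-aware output method (or an equivalent option in whatever writer is in use) is the usual way to keep the explicit end tag.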
From: Tomas M. <tom...@2e...> - 2010-04-21 17:39:41
|
Hi,

With tag balancing switched on, I notice that <script src="xxxx"></script> elements are converted to <script src="xxx"/>. This causes problems down the line when attempting to render the parsed content in Firefox and IE, which expect a script element with a src attribute to be empty but still closed with an explicit end tag.

Is there any way to correct this behavior?

Thanks a lot guys,
Tom
|
From: Todd W. <to...@sc...> - 2010-04-07 18:45:59
|
Hi,

In using NekoHTML to tidy up HTML I've noticed that most entities (&nbsp;) seem to be getting garbled. This is probably due to my misuse of the library, so I'm hoping someone can clarify. Here's the relevant portion of my code:

    public void test() {
        try {
            DOMParser parser = new DOMParser();
            //parser.setFeature( "http://apache.org/xml/features/scanner/notify-char-refs", true );
            //parser.setFeature( "http://apache.org/xml/features/scanner/notify-builtin-refs", true );
            //parser.setFeature( "http://cyberneko.org/html/features/scanner/notify-builtin-refs", true );
            //parser.setFeature( "http://cyberneko.org/html/features/scanner/normalize-attrs", true );
            String html = "<html><body><p>Entity 1: &nbsp;</p><p>Entity 2: &reg;</p><p>Entity 3: &amp;</p></body></html>";
            ByteArrayInputStream byteArrayInputStream = new ByteArrayInputStream( html.getBytes() );
            InputSource inputSource = new InputSource( byteArrayInputStream );
            parser.parse( inputSource );
            this.print( parser.getDocument(), "" );
        } catch( Exception e ) {
            log.error( "Exception: " + e.getMessage(), e );
        }
    }

    public void print( Node node, String indent ) {
        System.out.println(indent+node.getNodeValue());
        Node child = node.getFirstChild();
        while (child != null) {
            print(child, indent+" ");
            child = child.getNextSibling();
        }
    }

In the output the &amp; is coming through fine, but the other two entities show up as # or ##. I've also tried turning the tidied HTML back into a string using XMLSerializer, but with that the entities come through as garbled characters. You can also see that I've attempted to use various "setFeature" method calls.

Any suggestions on what I might be doing wrong?

Thanks much,
Todd
|
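One way to narrow down a problem like this is to check whether the parser produced the right characters and only the console is mangling them. The sketch below is a hypothetical debugging helper (not part of NekoHTML) that rewrites every non-ASCII character as a numeric character reference, so a node value can be printed safely on any terminal. It assumes Java 8 or later.

```java
final class EntityDebug {

    // Replaces each non-ASCII code point with a decimal character reference.
    static String escapeNonAscii(String text) {
        StringBuilder out = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (cp < 0x80) {
                out.appendCodePoint(cp);
            } else {
                out.append("&#").append(cp).append(';');
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // U+00A0 is the no-break space, U+00AE the registered sign.
        System.out.println(escapeNonAscii("Entity 1: \u00a0"));
        System.out.println(escapeNonAscii("Entity 2: \u00ae"));
        System.out.println(escapeNonAscii("Entity 3: &"));
    }
}
```

Calling this on node.getNodeValue() inside the print method would show whether the text nodes hold the expected code points; if they do, the garbling is likely happening in the output encoding rather than in the parse.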
From: Thierry L. <tl...@gm...> - 2010-02-14 17:10:16
|
Hi,

I am a bit stuck with encoding: I am parsing an ISO-8859-1 page. This page contains this meta:

    <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />

I am using a DefaultFilter subclass like this:

    XMLParserConfiguration parser = new HTMLConfiguration();
    parser.setDocumentHandler(filter);
    XMLInputSource source = new XMLInputSource(null, null, null, myHttpResponse.getEntity().getContent(), "iso-8859-1");
    parser.parse(source);

Then in the filter I tried something like:

    public void characters(XMLString text, Augmentations augs) {
        String myString = new String(text.toString().getBytes("iso-8859-1"));
    }

The result: on some devices (Android devices, in fact), the result is fine; the é, è, etc. are correctly displayed. But on some others (I suspect US-imported devices), these characters are replaced by random symbols.

From what I understood, a String should always contain UTF-16, so my line "String myString = new String(text.toString().getBytes("iso-8859-1"));" seems incorrect.

So where is the problem? Is it the way I set the XMLInputSource? Or where I convert the XMLString to a Java String in characters()? Do I have to set a feature? Any clue?

--
Thierry.
|
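The suspicion about that line can be checked in isolation. The sketch below uses only the JDK and illustrates two things: decoding raw ISO-8859-1 bytes with an explicit charset gives the expected text, while encoding a correct String to ISO-8859-1 and decoding the bytes with a different charset (which is what `new String(bytes)` does when the platform default is not ISO-8859-1) corrupts accented characters. The class and method names are made up; UTF-8 as the "other" charset is an assumption for the demonstration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class CharsetDemo {

    // Correct: raw bytes from the wire are decoded once, with the known charset.
    static String decodeLatin1(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    // Mimics new String(s.getBytes("iso-8859-1")) on a platform whose
    // default charset is 'platformDefault'.
    static String lossyRoundTrip(String alreadyDecoded, Charset platformDefault) {
        byte[] latin1 = alreadyDecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(latin1, platformDefault);
    }

    public static void main(String[] args) {
        String eAcute = "\u00e9"; // e with acute accent

        // Survives when the default happens to be ISO-8859-1 ...
        System.out.println(eAcute.equals(lossyRoundTrip(eAcute, StandardCharsets.ISO_8859_1)));
        // ... and breaks when the default is UTF-8.
        System.out.println(eAcute.equals(lossyRoundTrip(eAcute, StandardCharsets.UTF_8)));
    }
}
```

Under that reading, the text handed to characters() is already decoded, so `text.toString()` alone should be enough as long as the encoding given to the input source matches the page.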
From: Ronald J. <rj...@go...> - 2010-02-14 13:11:05
|
Hi,

I am not sure if this is in the scope of Neko, but if not, hopefully one of you can point me in the right direction. I am extracting links from HTML using Neko, and I would like to convert the relative links to absolute ones. At the moment I am doing it "manually" (going through the links and checking whether they start with http://, ftp://, etc.), and if not I add the base URL. This does not feel very error-proof at all, and I am sure some function in some package can handle this for me.

Any ideas how I can convert relative to absolute URLs?

Cheers,
RJ
|
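One option that needs nothing beyond the JDK is java.net.URI.resolve, which applies the standard relative-reference resolution rules and leaves already-absolute links untouched. A minimal sketch, with a hypothetical helper name and example URLs:

```java
import java.net.URI;

final class LinkResolver {

    // Resolves href against the page URL; absolute hrefs come back unchanged.
    static String toAbsolute(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href.trim()).toString();
    }

    public static void main(String[] args) {
        String page = "http://example.com/docs/index.html";
        String[] hrefs = {
            "page.html", "../img/logo.png", "/top.html", "ftp://example.org/file.txt"
        };
        for (String href : hrefs) {
            System.out.println(href + " -> " + toAbsolute(page, href));
        }
    }
}
```

Two caveats: URI.create throws IllegalArgumentException for hrefs containing illegal characters such as unescaped spaces, so real-world links may need escaping or a try/catch first, and if the page declares a base element, its href rather than the page URL would be the base to resolve against.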