From: SourceForge.net <no...@so...> - 2012-10-05 06:56:00
|
Bugs item #3544334, was opened at 2012-07-15 03:51 Message generated for change (Settings changed) made by mguillem You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=952178&aid=3544334&group_id=195122 Please note that this message will contain a full copy of the comment thread, including the initial issue submission, for this request, not just the latest update. Category: scanner Group: 1.9.15 >Status: Closed Resolution: Fixed Priority: 5 Private: No Submitted By: Uwe Schindler (thetaphi) Assigned to: Marc Guillemot (mguillem) Summary: HTMLScanner class fails in Turkish locale Initial Comment: HTMLScanner uses String.toLowerCase() and String.toUpperCase() without specifying a Locale (Locale.ENGLISH or better Locale.ROOT), so in the Turkish default locale, the uppercasing and lowercasing of element names breaks. See also http://blog.thetaphi.de/2012/07/default-locales-default-charsets-and.html about the problem. This can be tested by setting the default using Locale.setDefault(new Locale("tr", "TR") and then trying to parse a HTML document. <title> gets reported to SAX as <TİTLE> when uppercasing element names (the default). The correct way to fix this is described in my blog, with Java 1.6 pass Locale.ROOT to alle (throughout the codebase of NEKOHTML) to String toUpperCase or String.toLowerCase. In previous Java versions a workaround is unsing Locale.ENGLISH. ---------------------------------------------------------------------- Comment By: Marc Guillemot (mguillem) Date: 2012-10-01 02:27 Message: Hi Uwe, the commit activity is not indicating a release now ;-) Next release will probably come shortly before next release of HtmlUnit, perhaps at the end of the month. Interesting to know that NekoHTML is used in Lucene's benkchmark. ---------------------------------------------------------------------- Comment By: Uwe Schindler (thetaphi) Date: 2012-09-28 09:40 Message: Hi Marc, thanks for fixing this. As soon as the new version is available via IVY, we can add this to the Lucene benchmark module and remove our workaround in the NekoHTML parser used to scan the test documents for Lucene indexing: https://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/benchmark/src/java/org/apache/lucene/benchmark/byTask/feeds/DemoHTMLParser.java Is a release coming soon because of lots of commit activity? Uwe ---------------------------------------------------------------------- Comment By: Marc Guillemot (mguillem) Date: 2012-09-28 03:15 Message: Now fixed. Thanks for reporting (and thanks for the patch Ronald). ---------------------------------------------------------------------- Comment By: RBRi (rbri) Date: 2012-08-29 12:09 Message: Index: src/org/cyberneko/html/filters/NamespaceBinder.java =================================================================== --- src/org/cyberneko/html/filters/NamespaceBinder.java (revision 306) +++ src/org/cyberneko/html/filters/NamespaceBinder.java (working copy) @@ -17,6 +17,7 @@ package org.cyberneko.html.filters; import java.util.Enumeration; +import java.util.Locale; import java.util.Vector; import org.apache.xerces.xni.Augmentations; @@ -372,8 +373,8 @@ /** Modifies the given name based on the specified mode. */ protected static final String modifyName(String name, short mode) { switch (mode) { - case NAMES_UPPERCASE: return name.toUpperCase(); - case NAMES_LOWERCASE: return name.toLowerCase(); + case NAMES_UPPERCASE: return name.toUpperCase(Locale.ENGLISH); + case NAMES_LOWERCASE: return name.toLowerCase(Locale.ENGLISH); } return name; } // modifyName(String,short):String @@ -393,7 +394,7 @@ for (int i = attrCount - 1; i >= 0; i--) { attrs.getName(i, fQName); String aname = fQName.rawname; - String ANAME = aname.toUpperCase(); + String ANAME = aname.toUpperCase(Locale.ENGLISH); if (ANAME.startsWith("XMLNS:") || ANAME.equals("XMLNS")) { int anamelen = aname.length(); ---------------------------------------------------------------------- Comment By: RBRi (rbri) Date: 2012-08-29 12:07 Message: Index: src/org/cyberneko/html/HTMLScanner.java =================================================================== --- src/org/cyberneko/html/HTMLScanner.java (revision 306) +++ src/org/cyberneko/html/HTMLScanner.java (working copy) @@ -25,6 +25,7 @@ import java.io.UnsupportedEncodingException; import java.net.URL; import java.util.BitSet; +import java.util.Locale; import java.util.Stack; import org.apache.xerces.util.EncodingMap; @@ -884,7 +885,7 @@ } } if (encodings[1] == null) { - encodings[1] = EncodingMap.getIANA2JavaMapping(encodings[0].toUpperCase()); + encodings[1] = EncodingMap.getIANA2JavaMapping(encodings[0].toUpperCase(Locale.ENGLISH)); if (encodings[1] == null) { encodings[1] = encodings[0]; if (fReportErrors) { @@ -896,7 +897,7 @@ fJavaEncoding = encodings[1]; /* PATCH: Asgeir Asgeirsson */ fIso8859Encoding = fIANAEncoding == null - || fIANAEncoding.toUpperCase().startsWith("ISO-8859") + || fIANAEncoding.toUpperCase(Locale.ENGLISH).startsWith("ISO-8859") || fIANAEncoding.equalsIgnoreCase(fDefaultIANAEncoding); encoding = fIANAEncoding; reader = new InputStreamReader(fByteStream, fJavaEncoding); @@ -1074,8 +1075,8 @@ /** Modifies the given name based on the specified mode. */ protected static final String modifyName(String name, short mode) { switch (mode) { - case NAMES_UPPERCASE: return name.toUpperCase(); - case NAMES_LOWERCASE: return name.toLowerCase(); + case NAMES_UPPERCASE: return name.toUpperCase(Locale.ENGLISH); + case NAMES_LOWERCASE: return name.toLowerCase(Locale.ENGLISH); } return name; } // modifyName(String,short):String @@ -2055,7 +2056,7 @@ fElementCount++; fSingleBoolean[0] = false; final String ename = scanStartElement(fSingleBoolean); - final String enameLC = ename == null ? null : ename.toLowerCase(); + final String enameLC = ename == null ? null : ename.toLowerCase(Locale.ENGLISH); fBeginLineNumber = fCurrentEntity.getLineNumber(); fBeginColumnNumber = fCurrentEntity.getColumnNumber(); fBeginCharacterOffset = fCurrentEntity.getCharacterOffset(); @@ -2584,7 +2585,7 @@ } else { fAttributes.getName(aindex,fQName); - fQName.rawname = fQName.rawname.toLowerCase(); + fQName.rawname = fQName.rawname.toLowerCase(Locale.ENGLISH); fAttributes.setName(aindex,fQName); aindex++; } @@ -2663,7 +2664,7 @@ String content = getValue(fAttributes, "content"); if (content != null) { content = removeSpaces(content); - int index1 = content.toLowerCase().indexOf("charset="); + int index1 = content.toLowerCase(Locale.ENGLISH).indexOf("charset="); if (index1 != -1 && !fIgnoreSpecifiedCharset) { final int index2 = content.indexOf(';', index1); final String charset = index2 != -1 ? content.substring(index1+8, index2) : content.substring(index1+8); @@ -2733,7 +2734,7 @@ boolean encodingChanged = false; try { String ianaEncoding = charset; - String javaEncoding = EncodingMap.getIANA2JavaMapping(ianaEncoding.toUpperCase()); + String javaEncoding = EncodingMap.getIANA2JavaMapping(ianaEncoding.toUpperCase(Locale.ENGLISH)); if (DEBUG_CHARSET) { System.out.println("+++ ianaEncoding: "+ianaEncoding); System.out.println("+++ javaEncoding: "+javaEncoding); @@ -2754,7 +2755,7 @@ // change the charset else { fIso8859Encoding = ianaEncoding == null - || ianaEncoding.toUpperCase().startsWith("ISO-8859") + || ianaEncoding.toUpperCase(Locale.ENGLISH).startsWith("ISO-8859") || ianaEncoding.equalsIgnoreCase(fDefaultIANAEncoding); fJavaEncoding = javaEncoding; fCurrentEntity.setStream(new InputStreamReader(fByteStream, javaEncoding)); @@ -3079,7 +3080,7 @@ private boolean isEnded(String ename) { String content = new String(fCurrentEntity.buffer, fCurrentEntity.offset, fCurrentEntity.length - fCurrentEntity.offset); - return content.toLowerCase().indexOf("</" + ename.toLowerCase() + ">") != -1; + return content.toLowerCase(Locale.ENGLISH).indexOf("</" + ename.toLowerCase(Locale.ENGLISH) + ">") != -1; } } // class ContentScanner Index: src/org/cyberneko/html/HTMLTagBalancer.java =================================================================== --- src/org/cyberneko/html/HTMLTagBalancer.java (revision 306) +++ src/org/cyberneko/html/HTMLTagBalancer.java (working copy) @@ -18,6 +18,7 @@ import java.util.ArrayList; import java.util.List; +import java.util.Locale; import org.apache.xerces.util.XMLAttributesImpl; import org.apache.xerces.xni.Augmentations; @@ -1045,7 +1046,7 @@ fErrorReporter.reportWarning("HTML2007", new Object[]{ename,iname}); } if (fDocumentHandler != null) { - // PATCH: Marc-Andr� Morissette + // PATCH: Marc-Andr� Morissette callEndElement(info.qname, i < depth - 1 ? synthesizedAugs() : augs); } } @@ -1222,8 +1223,8 @@ /** Modifies the given name based on the specified mode. */ protected static final String modifyName(String name, short mode) { switch (mode) { - case NAMES_UPPERCASE: return name.toUpperCase(); - case NAMES_LOWERCASE: return name.toLowerCase(); + case NAMES_UPPERCASE: return name.toUpperCase(Locale.ENGLISH); + case NAMES_LOWERCASE: return name.toLowerCase(Locale.ENGLISH); } return name; } // modifyName(String,short):String Index: src/org/cyberneko/html/filters/ElementRemover.java =================================================================== --- src/org/cyberneko/html/filters/ElementRemover.java (revision 306) +++ src/org/cyberneko/html/filters/ElementRemover.java (working copy) @@ -17,6 +17,7 @@ package org.cyberneko.html.filters; import java.util.Hashtable; +import java.util.Locale; import org.apache.xerces.xni.Augmentations; import org.apache.xerces.xni.NamespaceContext; @@ -136,12 +137,12 @@ * see #removeElement */ public void acceptElement(String element, String[] attributes) { - Object key = element.toLowerCase(); + Object key = element.toLowerCase(Locale.ENGLISH); Object value = NULL; if (attributes != null) { String[] newarray = new String[attributes.length]; for (int i = 0; i < attributes.length; i++) { - newarray[i] = attributes[i].toLowerCase(); + newarray[i] = attributes[i].toLowerCase(Locale.ENGLISH); } value = attributes; } @@ -157,7 +158,7 @@ * @param element The element to completely remove. */ public void removeElement(String element) { - Object key = element.toLowerCase(); + Object key = element.toLowerCase(Locale.ENGLISH); Object value = NULL; fRemovedElements.put(key, value); } // removeElement(String) @@ -306,26 +307,26 @@ /** Returns true if the specified element is accepted. */ protected boolean elementAccepted(String element) { - Object key = element.toLowerCase(); + Object key = element.toLowerCase(Locale.ENGLISH); return fAcceptedElements.containsKey(key); } // elementAccepted(String):boolean /** Returns true if the specified element should be removed. */ protected boolean elementRemoved(String element) { - Object key = element.toLowerCase(); + Object key = element.toLowerCase(Locale.ENGLISH); return fRemovedElements.containsKey(key); } // elementRemoved(String):boolean /** Handles an open tag. */ protected boolean handleOpenTag(QName element, XMLAttributes attributes) { if (elementAccepted(element.rawname)) { - Object key = element.rawname.toLowerCase(); + Object key = element.rawname.toLowerCase(Locale.ENGLISH); Object value = fAcceptedElements.get(key); if (value != NULL) { String[] anames = (String[])value; int attributeCount = attributes.getLength(); LOOP: for (int i = 0; i < attributeCount; i++) { - String aname = attributes.getQName(i).toLowerCase(); + String aname = attributes.getQName(i).toLowerCase(Locale.ENGLISH); for (int j = 0; j < anames.length; j++) { if (anames[j].equals(aname)) { continue LOOP; Index: src/org/cyberneko/html/filters/Purifier.java =================================================================== --- src/org/cyberneko/html/filters/Purifier.java (revision 306) +++ src/org/cyberneko/html/filters/Purifier.java (working copy) @@ -16,6 +16,8 @@ package org.cyberneko.html.filters; +import java.util.Locale; + import org.apache.xerces.util.XMLChar; import org.apache.xerces.util.XMLStringBuffer; import org.apache.xerces.xni.Augmentations; @@ -204,7 +206,7 @@ standalone = null; } else { - standalone = standalone.toLowerCase(); + standalone = standalone.toLowerCase(Locale.ENGLISH); } } super.xmlDecl(version,encoding,standalone,augs); @@ -464,7 +466,7 @@ for (int i = 0; i < len; i++) { str.insert(0, '0'); } - return str.toString().toUpperCase(); + return str.toString().toUpperCase(Locale.ENGLISH); } // toHexString(int,int):String } // class Purifier Index: src/org/cyberneko/html/filters/Writer.java =================================================================== --- src/org/cyberneko/html/filters/Writer.java (revision 306) +++ src/org/cyberneko/html/filters/Writer.java (working copy) @@ -20,6 +20,7 @@ import java.io.OutputStreamWriter; import java.io.PrintWriter; import java.io.UnsupportedEncodingException; +import java.util.Locale; import org.apache.xerces.xni.Augmentations; import org.apache.xerces.xni.NamespaceContext; @@ -333,11 +334,11 @@ // modify META[@http-equiv='content-type']/@content value int contentIndex = -1; String originalContent = null; - if (element.rawname.toLowerCase().equals("meta")) { + if (element.rawname.toLowerCase(Locale.ENGLISH).equals("meta")) { String httpEquiv = null; int length = attributes.getLength(); for (int i = 0; i < length; i++) { - String aname = attributes.getQName(i).toLowerCase(); + String aname = attributes.getQName(i).toLowerCase(Locale.ENGLISH); if (aname.equals("http-equiv")) { httpEquiv = attributes.getValue(i); } @@ -345,12 +346,12 @@ contentIndex = i; } } - if (httpEquiv != null && httpEquiv.toLowerCase().equals("content-type")) { + if (httpEquiv != null && httpEquiv.toLowerCase(Locale.ENGLISH).equals("content-type")) { fSeenHttpEquiv = true; String content = null; if (contentIndex != -1) { originalContent = attributes.getValue(contentIndex); - content = originalContent.toLowerCase(); + content = originalContent.toLowerCase(Locale.ENGLISH); } if (content != null) { int charsetIndex = content.indexOf("charset="); ---------------------------------------------------------------------- Comment By: Uwe Schindler (thetaphi) Date: 2012-07-15 03:55 Message: The same happens with charset names. If the charset is reported as "iso-8859-1" (in lowercase), this part of code will never get true: ianaEncoding.toUpperCase().startsWith("ISO-8859") ---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=952178&aid=3544334&group_id=195122 |