[Htmlparser-cvs] htmlparser/src/org/htmlparser/util CharacterReference.java,NONE,1.1 Translate.java,
From: <der...@us...> - 2004-02-09 02:12:55
Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser/util
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv9169/src/org/htmlparser/util

Modified Files:
    Translate.java package.html
Added Files:
    CharacterReference.java
Removed Files:
    Generate.java
Log Message:
Rework character entity translation.
See task 58599 enhance character reference translation.
Decode now handles missing semi colons, encoding is more efficient,
hexadecimal numeric character entity references are handled and
both encoding and decoding make minimal use of substring().
Augmented the tests in CharacterTranslationTest significantly, and
merged the Generate class into the tests.
Added translate command scripts in bin, which read from stdin
and write to stdout.

--- NEW FILE: CharacterReference.java ---
/*
 * CharacterReference.java
 *
 * Created on February 5, 2004, 9:40 PM
 */

package org.htmlparser.util;

import java.io.Serializable;

import org.htmlparser.util.sort.Ordered;

/**
 * Structure to hold a character and it's equivalent entity reference kernel.
 * For the character reference &copy; the character would be '©' and
 * the kernel would be "copy", for example.<p>
 * Character references are described at <a href="Character references">http://www.w3.org/TR/REC-html40/charset.html#entities</a>
 * Supports the Ordered interface so it's easy to create a list sorted by
 * kernel, to perform binary searches on.<p>
 */
public class CharacterReference
    implements
        Serializable,
        Cloneable,
        Ordered
{
    /**
     * The character value as an integer.
     */
    protected int mCharacter;

    /**
     * This entity reference kernel.
     * The text between the ampersand and the semicolon.
     */
    protected String mKernel;

    /**
     * Construct a <code>CharacterReference</code> with the character and kernel given.
     * @param kernel The kernel in the equivalent character entity reference.
     * @param character The character needing encoding.
     */
    public CharacterReference (String kernel, int character)
    {
        mKernel = kernel;
        mCharacter = character;
        if (null == mKernel)
            mKernel = "";
    }

    /**
     * Get this CharacterReference's kernel.
     * @return The kernel in the equivalent character entity reference.
     */
    public String getKernel ()
    {
        return (mKernel);
    }

    /**
     * Set this CharacterReference's kernel.
     * This is used to avoid creating a new object to perform a binary search.
     * @param kernel The kernel in the equivalent character entity reference.
     */
    void setKernel (String kernel)
    {
        mKernel = kernel;
    }

    /**
     * Get the character needing translation.
     * @return The character.
     */
    public int getCharacter ()
    {
        return (mCharacter);
    }

    /**
     * Set the character.
     * This is used to avoid creating a new object to perform a binary search.
     * @param character The character needing translation.
     */
    void setCharacter (int character)
    {
        mCharacter = character;
    }

    /**
     * Visualize this character reference as a string.
     * @return A string with the character and kernel.
     */
    public String toString ()
    {
        String hex;
        StringBuffer ret;

        ret = new StringBuffer (6 + 8 + 2); // max 8 in string
        hex = Integer.toHexString ((int)getCharacter ());
        ret.append ("\\u");
        for (int i = hex.length (); i < 4; i++)
            ret.append ("0");
        ret.append (hex);
        ret.append ("[");
        ret.append (getKernel ());
        ret.append ("]");

        return (ret.toString ());
    }

    //
    // Ordered interface
    //

    /**
     * Compare one reference to another.
     * @see org.htmlparser.util.sort.Ordered
     */
    public int compare (Object that)
    {
        CharacterReference r;

        r = (CharacterReference)that;

        return (getKernel ().compareTo (r.getKernel ()));
    }
}

Index: Translate.java
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/util/Translate.java,v
retrieving revision 1.42
retrieving revision 1.43
diff -C2 -d -r1.42 -r1.43
*** Translate.java 2 Jan 2004 16:24:58 -0000 1.42
--- Translate.java 9 Feb 2004 02:09:45 -0000 1.43
***************
*** 27,57 ****
  package org.htmlparser.util;
  import java.util.HashMap;
  import java.util.Iterator;
  import java.util.Map;
  /**
   * Translate numeric character references and character entity references to unicode characters.
   * Based on tables found at <a href="http://www.w3.org/TR/REC-html40/sgml/entities.html">
   * http://www.w3.org/TR/REC-html40/sgml/entities.html</a>
[...1684 lines suppressed...]
+      * Numeric character reference and character entity reference to unicode codec.
+      * Translate the <code>System.in</code> input into an encoded or decoded
+      * stream and send the results to <code>System.out</code>.
+      * @param args If arg[0] is <code>-encode</code> perform an encoding on
+      * <code>System.in</code>, otherwise perform a decoding.
+      */
+     public static void main (String[] args)
+     {
+         boolean encode;
+
+         if (0 < args.length && args[0].equalsIgnoreCase ("-encode"))
+             encode = true;
+         else
+             encode = false;
+         if (encode)
+             encode (System.in, System.out);
+         else
+             decode (System.in, System.out);
+     }
  }

Index: package.html
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/util/package.html,v
retrieving revision 1.19
retrieving revision 1.20
diff -C2 -d -r1.19 -r1.20
*** package.html 2 Jan 2004 16:24:58 -0000 1.19
--- package.html 9 Feb 2004 02:09:45 -0000 1.20
***************
*** 29,46 ****
  -->
  </head>
! <body bgcolor="white">
! The util package is intended for holding utility classes that dont directly help with the parsing,
! but can take responsibilities out from some classes. Resuable code which can be reused by many classes, should be located
! in this package.
!
! <h2>Related Documentation</h2>
!
! For overviews, tutorials, examples, guides, and tool documentation, please see:
! <ul>
! <li><a href="http://htmlparser.sourceforge.net">HTML Parser Home Page</a>
! </ul>
!
! <!-- Put @see and @since tags down here. -->
!
  </body>
  </html>
--- 29,36 ----
  -->
  </head>
! <body>
! Code which can be reused by many classes, is located in this package.
! The util package is intended for holding utility classes that don't directly
! help with parsing, but can take responsibilities out of some classes.
  </body>
  </html>

--- Generate.java DELETED ---
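
The decoding behaviour named in the log message (missing semicolons tolerated, hexadecimal numeric references accepted) and the binary search over kernels mentioned in the CharacterReference comments can be illustrated with a small standalone sketch. This is not the code in Translate.java, most of which is suppressed in the diff. The class name CharRefSketch, its five-entry table, and the longest-prefix fallback for named references are assumptions made purely for illustration, and unlike the committed code it makes no attempt to minimize substring() calls.

```java
import java.util.Arrays;

/** Hypothetical illustration only; not the htmlparser Translate class. */
final class CharRefSketch {
    // Tiny illustrative table (assumption), sorted by kernel so binary search works.
    private static final String[] KERNELS = { "amp", "copy", "gt", "lt", "quot" };
    private static final int[] CHARS = { '&', 0x00A9, '>', '<', '"' };

    /** Returns the code point for a kernel, or -1 if the kernel is unknown. */
    static int lookup(String kernel) {
        int i = Arrays.binarySearch(KERNELS, kernel);
        return i < 0 ? -1 : CHARS[i];
    }

    /** Decodes named, decimal and hexadecimal references; the ';' is optional. */
    static String decode(String s) {
        int n = s.length();
        StringBuilder out = new StringBuilder(n);
        int i = 0;
        while (i < n) {
            char c = s.charAt(i);
            if (c != '&') {
                out.append(c);
                i++;
                continue;
            }
            int j = i + 1;
            int code = -1;
            if (j < n && s.charAt(j) == '#') {
                // Numeric reference: &#NNN or &#xHHH.
                j++;
                int radix = 10;
                if (j < n && (s.charAt(j) == 'x' || s.charAt(j) == 'X')) {
                    radix = 16;
                    j++;
                }
                int start = j;
                int value = 0;
                while (j < n && Character.digit(s.charAt(j), radix) >= 0 && value <= 0x10FFFF) {
                    value = value * radix + Character.digit(s.charAt(j), radix);
                    j++;
                }
                if (j > start && value <= 0x10FFFF)
                    code = value;
            } else {
                // Named reference: take the run of letters and digits, then try the
                // longest candidate first so "&ltb" still yields '<' followed by 'b'.
                int start = j;
                while (j < n && Character.isLetterOrDigit(s.charAt(j)))
                    j++;
                int end = j;
                while (end > start && (code = lookup(s.substring(start, end))) < 0)
                    end--;
                j = end;
            }
            if (code < 0) {
                // Not a recognizable reference: keep the ampersand literally.
                out.append('&');
                i++;
                continue;
            }
            out.appendCodePoint(code);
            if (j < n && s.charAt(j) == ';')
                j++; // consume the terminator only when it is present
            i = j;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("x &lt y &#x3e; z")); // prints: x < y > z
        System.out.println(decode("AT&T &#65&#x42;"));  // prints: AT&T AB
    }
}
```

A real table would hold the full HTML 4.0 entity set; the sorted-array lookup stays the same regardless of table size.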