Hi.
I needed windows-1251 support in Saxon, so I patched the AElfred parser
that ships with it.
Michael, could you add it to the next release? Saxon has a good reputation in
Russia, but it lacks full windows-1251 support.
What you need to do is add this method to aelfred/XmlParser.java:
/**
* Convert a buffer of windows-1251-encoded bytes into UTF-16
* characters.
*
* <p>When readDataChunk () calls this method, the raw bytes are in
* rawReadBuffer, and the final characters will appear in
* readBuffer.
*
* @param count The number of bytes to convert.
* @see #readDataChunk
* @see #rawReadBuffer
* @see #readBuffer
*/
private void copyWin1251ReadBuffer (int count)
    throws IOException
{
    int i, j;
    // Unicode code points for bytes 0x80..0xBF, in byte order.
    // Byte 0x98 is unassigned in windows-1251; it is mapped to 0x0000 here.
    final int[] codes =
    {
        0x0402, 0x0403, 0x201A, 0x0453,
        0x201E, 0x2026, 0x2020, 0x2021,
        0x20AC, 0x2030, 0x0409, 0x2039,
        0x040A, 0x040C, 0x040B, 0x040F,
        0x0452, 0x2018, 0x2019, 0x201C,
        0x201D, 0x2022, 0x2013, 0x2014,
        0x0000, 0x2122, 0x0459, 0x203A,
        0x045A, 0x045C, 0x045B, 0x045F,
        0x00A0, 0x040E, 0x045E, 0x0408,
        0x00A4, 0x0490, 0x00A6, 0x00A7,
        0x0401, 0x00A9, 0x0404, 0x00AB,
        0x00AC, 0x00AD, 0x00AE, 0x0407,
        0x00B0, 0x00B1, 0x0406, 0x0456,
        0x0491, 0x00B5, 0x00B6, 0x00B7,
        0x0451, 0x2116, 0x0454, 0x00BB,
        0x0458, 0x0405, 0x0455, 0x0457
    };
    for (i = 0, j = readBufferPos; i < count; i++, j++)
    {
        int c = rawReadBuffer [i] & 0xff;
        if (c > 0xBF)
            c += 0x350;             // 0xC0..0xFF -> U+0410..U+044F
        else if (c > 0x7F)
            c = codes[c - 0x80];    // 0x80..0xBF -> table lookup
        // bytes 0x00..0x7F are plain ASCII and pass through unchanged
        readBuffer [j] = (char) c;
        if (c == '\r')
        {
            sawCR = true;
        }
    }
    readBufferLength = j;
}
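To check the table without building Saxon, the same conversion can be pulled out into a standalone class. This is only a sketch: Win1251Sketch, decode and mismatchesAgainstJdk are hypothetical names that are not part of Saxon or AElfred, and it assumes the JDK in use provides the "windows-1251" charset. Byte 0x98 is skipped in the comparison because it is unassigned in windows-1251.

```java
import java.nio.charset.Charset;

// Standalone sketch of the table-driven decoding used in the patch.
// Win1251Sketch and its methods are hypothetical names, not Saxon/AElfred API.
class Win1251Sketch {
    // Unicode code points for bytes 0x80..0xBF (same table as in the patch).
    private static final char[] CODES = {
        0x0402, 0x0403, 0x201A, 0x0453, 0x201E, 0x2026, 0x2020, 0x2021,
        0x20AC, 0x2030, 0x0409, 0x2039, 0x040A, 0x040C, 0x040B, 0x040F,
        0x0452, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
        0x0000, 0x2122, 0x0459, 0x203A, 0x045A, 0x045C, 0x045B, 0x045F,
        0x00A0, 0x040E, 0x045E, 0x0408, 0x00A4, 0x0490, 0x00A6, 0x00A7,
        0x0401, 0x00A9, 0x0404, 0x00AB, 0x00AC, 0x00AD, 0x00AE, 0x0407,
        0x00B0, 0x00B1, 0x0406, 0x0456, 0x0491, 0x00B5, 0x00B6, 0x00B7,
        0x0451, 0x2116, 0x0454, 0x00BB, 0x0458, 0x0405, 0x0455, 0x0457
    };

    // Decode windows-1251 bytes with the same three-way split as the patch.
    static String decode(byte[] raw) {
        char[] out = new char[raw.length];
        for (int i = 0; i < raw.length; i++) {
            int c = raw[i] & 0xff;
            if (c > 0xBF) {
                c += 0x350;             // 0xC0..0xFF -> U+0410..U+044F
            } else if (c > 0x7F) {
                c = CODES[c - 0x80];    // 0x80..0xBF -> table lookup
            }
            out[i] = (char) c;
        }
        return new String(out);
    }

    // Count single bytes whose decoding differs from the JDK charset.
    // 0x98 is skipped because it is unassigned in windows-1251.
    static int mismatchesAgainstJdk() {
        Charset jdk = Charset.forName("windows-1251");
        int mismatches = 0;
        for (int b = 0; b < 256; b++) {
            if (b == 0x98) continue;
            byte[] one = { (byte) b };
            if (!decode(one).equals(new String(one, jdk))) mismatches++;
        }
        return mismatches;
    }

    public static void main(String[] args) {
        System.out.println("mismatches: " + mismatchesAgainstJdk());
    }
}
```

If the mismatch count is not zero, the table (or the 0x350 offset) disagrees with the JDK's own mapping and should be rechecked before patching the parser.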
Then add
private final static int ENCODING_WINDOWS_1251 = 10;
to the list of supported encoding constants:
//
// Constants for supported encodings. "external" is just a flag.
//
private final static int ENCODING_EXTERNAL = 0;
private final static int ENCODING_UTF_8 = 1;
private final static int ENCODING_ISO_8859_1 = 2;
private final static int ENCODING_UCS_2_12 = 3;
private final static int ENCODING_UCS_2_21 = 4;
private final static int ENCODING_UCS_4_1234 = 5;
private final static int ENCODING_UCS_4_4321 = 6;
private final static int ENCODING_UCS_4_2143 = 7;
private final static int ENCODING_UCS_4_3412 = 8;
private final static int ENCODING_ASCII = 9;
// Here!!!
private final static int ENCODING_WINDOWS_1251 = 10;
And finally, in setupEncoding, map the encoding names to the new constant:
if (encoding == ENCODING_UTF_8 || encoding == ENCODING_EXTERNAL) {
    if (encodingName.equals ("ISO-8859-1")
        || encodingName.equals ("8859_1")
        || encodingName.equals ("ISO8859_1")
        ) {
        encoding = ENCODING_ISO_8859_1;
        return;
    }
    // Associate ENCODING_WINDOWS_1251 with the appropriate encoding designators
    else if (encodingName.equals ("WINDOWS-1251")
        || encodingName.equals ("CP1251")
        ) {
        encoding = ENCODING_WINDOWS_1251;
        return;
    }
    // And so on
    else if (encodingName.equals ("US-ASCII")
    ........
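The name matching above can also be tried in isolation. The sketch below is not the parser's code: EncodingNameSketch and lookup are hypothetical names, the constant values are copied from the list above, and it assumes the declared name should be compared case-insensitively (here by upper-casing it first) and that an unrecognised name falls back to ENCODING_EXTERNAL.

```java
import java.util.Locale;

// Hypothetical standalone version of the name-to-constant mapping.
class EncodingNameSketch {
    static final int ENCODING_EXTERNAL = 0;
    static final int ENCODING_ISO_8859_1 = 2;
    static final int ENCODING_WINDOWS_1251 = 10;

    // Return the encoding constant for a declared encoding name.
    // Assumption: names are matched case-insensitively, and unknown
    // names fall back to ENCODING_EXTERNAL.
    static int lookup(String encodingName) {
        switch (encodingName.toUpperCase(Locale.ROOT)) {
            case "ISO-8859-1":
            case "8859_1":
            case "ISO8859_1":
                return ENCODING_ISO_8859_1;
            case "WINDOWS-1251":
            case "CP1251":
                return ENCODING_WINDOWS_1251;
            default:
                return ENCODING_EXTERNAL;
        }
    }
}
```

With this, a declaration such as encoding="windows-1251" and the Java-style alias "Cp1251" both resolve to the new constant.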
As soon as I have time I'll write KOI8-R support, since that's another popular
Russian encoding.
Bye.
/lexi