#1 Fix accents

open
nobody
None
5
2007-03-13
2007-03-13
No

JDictionary has trouble displaying some accented characters and other special symbols correctly. Here I'll show you how to fix it. Don't be put off by the length of this article; it's not hard at all!

The phenomenon: jdictionary only shows accents correctly if it is run with a locale whose charset matches the legacy 8-bit charset of the dictionaries involved. For example, if you use the English-Spanish module, you have to use an iso-8859-1 locale, otherwise Spanish characters will be wrong: with iso-8859-2, for instance, you'll see ń instead of ñ. The situation is worst with the French-Hungarian dictionary, where no single charset is right because the two languages require different settings. The dictionary is stored in a mixed-encoding text file, so the same byte has different semantics depending on which column it occurs in. French û and Hungarian ű share the same byte value in their two encodings, so they cannot both show up correctly. And if you happen to use a UTF-8 locale (nowadays you probably have no reason not to), no accented letters will appear at all.
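To make the byte clash concrete, here is a small Java sketch that decodes the same byte under both legacy charsets. It is not part of jdictionary; the class and method names are made up for illustration. It assumes the JVM provides ISO-8859-2, which common JDKs do although the Java specification does not guarantee it.

```java
import java.nio.charset.Charset;

// Illustration only: ByteMeaningDemo and decode() are hypothetical names,
// not part of jdictionary.
class ByteMeaningDemo {
    // Decode a single byte using the named legacy 8-bit charset.
    static String decode(int byteValue, String charsetName) {
        byte[] raw = { (byte) byteValue };
        return new String(raw, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // 0xF1 is the Spanish example from the text, 0xFB the French-Hungarian one.
        int[] samples = { 0xF1, 0xFB };
        for (int b : samples) {
            String latin1 = decode(b, "ISO-8859-1");
            String latin2 = decode(b, "ISO-8859-2");
            // Print code points rather than glyphs so the output does not
            // depend on the terminal's own charset.
            System.out.printf("0x%02X  latin1=U+%04X  latin2=U+%04X%n",
                    b, (int) latin1.charAt(0), (int) latin2.charAt(0));
        }
    }
}
```

Byte 0xF1 comes out as U+00F1 (ñ) under ISO-8859-1 but U+0144 (ń) under ISO-8859-2, and byte 0xFB as U+00FB (û) versus U+0171 (ű), which is exactly the clash described above.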

I'll show you how to fix jdictionary so that all the accents and special characters show up correctly, no matter what your locale is (it can use latin1, latin2, utf8 or any other charset).

The main jdictionary.jar doesn't need any modification. What you have to fix are the plugins. You have to perform basically the same steps for every plugin except the online lexicon. (I have no information on how that plugin handles accents.) The English-Serbian plugin is slightly different from the eight other plugins; I'll point out the differences as we go.

So for each plugin perform the following four steps:

1. * Unpack the plugin's archive *

Create a new empty directory, copy the plugin's .jar file into it, change to that directory and extract the archive:
jar -xf *.jar
(If you have no "jar", download and install the Java Development Kit (JDK).)

2. * Fix the java code to read UTF-8 dictionary *

Choose exactly one of the two ways, the "quick" one or the "clever" one.

The quick way:

Simply overwrite Buffer.class with the fixed Buffer.class attached to this ticket. (A slightly different file, attached as Buffer-Serbian.class, is to be used for the Serbian plugin. Most likely the two are perfectly interchangeable, but I haven't tested that, and it's simpler to stay safe.)

The clever way:

i) Download and install Pavel Kouznetsov's "Jad" (Java Decompiler). Use your favorite web search engine to find it. I used the version "Jad 1.5.8e for Linux (statically linked)".

ii) Decompile the classes involved in fixing the bug:
jad -sjava Buffer MySelf Resources

iii) Patch Buffer.java using the attached patch:
patch -p0 <Buffer.patch
Remember to use the right patch for the Serbian plugin (attached as Buffer-Serbian.patch). If you use a different Java decompiler (or a different version of Jad), you might need to adjust the patch manually.

Note: The patch fixes two issues. First, jdictionary assumed that the dictionary was encoded in the current locale's charset; the patched code explicitly states that it is encoded in UTF-8. Second, jdictionary incorrectly assumed that the number of characters in the dictionary equals its length in bytes. This is true for legacy 8-bit encodings, but not for UTF-8.

iv) Make sure the CLASSPATH environment variable points to the main jdictionary.jar file:
export CLASSPATH=/your/path/to/jdictionary.jar

v) Compile the fixed source:
javac -g Buffer.java MySelf.java Resources.java

Note: I use "-g" because luckily the official jdictionary release was created with debugging enabled too. Without this, it would have been harder for me to understand the code and create the patch.
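The two issues named in the note under step iii can be illustrated with a small, self-contained sketch. This is not the code of Buffer.java or of the attached patch; Utf8ReadDemo and readAll() are hypothetical names. It only shows the general technique: decode with an explicitly named charset instead of the locale default, and size things by decoded characters rather than by raw bytes.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Illustration only: not taken from jdictionary or from Buffer.patch.
class Utf8ReadDemo {
    // Read the whole stream as text, always decoding it as UTF-8
    // instead of relying on the current locale's default charset.
    static String readAll(InputStream in) {
        StringBuilder text = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            char[] chunk = new char[4096];
            int n;
            while ((n = reader.read(chunk)) != -1) {
                text.append(chunk, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // A made-up dictionary line containing one accented letter (ê).
        byte[] raw = "fen\u00eatre|ablak\n".getBytes(StandardCharsets.UTF_8);
        String text = readAll(new ByteArrayInputStream(raw));
        // The byte count and the character count differ in UTF-8.
        System.out.println("bytes: " + raw.length + ", chars: " + text.length());
    }
}
```

For this sample line the program prints "bytes: 15, chars: 14", because ê takes two bytes in UTF-8. Code that allocates or indexes a character buffer by the file's byte length goes wrong in just this way.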

3. * Convert the dictionary to UTF-8 *

The dictionary file is under resources/dictionary. You need to convert it from the current 8-bit (probably mixed) encoding to UTF-8.

Assume the ISO-8859-1 charset for English, Spanish, French and German. Note that even English sometimes uses accents or other special symbols, so it would be wrong to assume 7-bit ASCII; ISO-8859-1 is the right choice for English. Hungarian uses ISO-8859-2. Serbian is perhaps ISO-8859-5, but that's not important: the Serbian dictionary has only one line (line no. 28660) that is not 7-bit ASCII, and that line is obviously faulty anyway. Just remove that line and you're done. For the other dictionaries:

If the two languages use the same character set, you can simply use iconv or recode, e.g.:
iconv -f ISO-8859-1 -t UTF-8 <oldfile >newfile
You have to use this method for the interlingua dictionaries, whose files use a different format from that of the other plugins.

If the file is in mixed encoding, you can use the attached script to convert it. Take a look at the dictionary file to see what the delimiter character between the columns is (either a pipe or a colon) and in which order the two languages appear (this is not necessarily the order the plugin name suggests!). The first parameter of the convert-dictionary script is the delimiter; the next two parameters are the current charsets of the two columns (in the same order, of course). For example, this converts the French-Hungarian dictionary to UTF-8:
perl ./convert-dictionary '|' ISO-8859-1 ISO-8859-2 <oldfile >newfile
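If you are curious what such a per-column conversion involves, here is a minimal Java sketch of the idea. It is not the attached convert-dictionary script, and its names (MixedEncodingDemo, decodeLine, convert) are hypothetical. It assumes two columns per line, split at the first delimiter, lines ending in "\n", and a JVM that provides ISO-8859-2.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Illustration only: not the attached convert-dictionary script.
class MixedEncodingDemo {
    // Decode one raw line (without its newline): the part before the first
    // delimiter uses charset `left`, the part after it uses charset `right`.
    static String decodeLine(byte[] line, char delimiter, Charset left, Charset right) {
        int split = -1;
        for (int i = 0; i < line.length; i++) {
            if (line[i] == (byte) delimiter) {
                split = i;
                break;
            }
        }
        if (split < 0) {
            // No delimiter found: treat the whole line as the first column.
            return new String(line, left);
        }
        String first = new String(line, 0, split, left);
        String second = new String(line, split + 1, line.length - split - 1, right);
        return first + delimiter + second;
    }

    // Convert a whole mixed-encoding file image to UTF-8, line by line.
    static byte[] convert(byte[] input, char delimiter, Charset left, Charset right) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int start = 0;
        while (start < input.length) {
            int end = start;
            while (end < input.length && input[end] != '\n') {
                end++;
            }
            byte[] line = Arrays.copyOfRange(input, start, end);
            byte[] utf8 = decodeLine(line, delimiter, left, right)
                    .getBytes(StandardCharsets.UTF_8);
            out.write(utf8, 0, utf8.length);
            if (end < input.length) {
                out.write('\n');
            }
            start = end + 1;
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Made-up sample line: byte 0xFB means û on the left, ű on the right.
        byte[] mixed = { 's', (byte) 0xFB, 'r', '|', 't', (byte) 0xFB, 'z', '\n' };
        byte[] utf8 = convert(mixed, '|',
                StandardCharsets.ISO_8859_1, Charset.forName("ISO-8859-2"));
        System.out.write(utf8, 0, utf8.length);
        System.out.flush();
    }
}
```

Splitting on the raw bytes before decoding is safe here because the delimiter is plain ASCII and has the same byte value in both ISO-8859-1 and ISO-8859-2.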

Open the new file in a UTF-8-capable viewer and make sure that it's really encoded in UTF-8 and that accents appear correctly for both languages. If everything's fine, replace the old dictionary file with the new one.
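Besides looking at the file, you can check mechanically that it is well-formed UTF-8. The following is a minimal sketch with hypothetical names (Utf8CheckDemo, isValidUtf8); it uses a strict decoder that reports malformed input instead of silently replacing it.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Illustration only: a strict well-formedness check, not part of jdictionary.
class Utf8CheckDemo {
    // Return true if the bytes decode as UTF-8 without any error.
    static boolean isValidUtf8(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.out.println("usage: java Utf8CheckDemo <file>");
            return;
        }
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(isValidUtf8(data) ? "valid UTF-8" : "NOT valid UTF-8");
    }
}
```

A passing check only proves that the bytes are well-formed UTF-8. A file converted from the wrong source charset is still valid UTF-8 with the wrong letters in it, so the visual check of the accents remains necessary.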

We're basically done. You might also want to fix several entries in the dictionary that suffer from accent problems. Go to https://svn.uhulinux.hu/packages/dev/, enter the directory of the corresponding plugin and then its "patches" directory. If you see a file called "dictionary.patch0" (or something similar) there, download it. For technical reasons it's most likely a "double patch" (a patch within a patch). First type:
patch -p0 <dictionary.patch0
This only creates the dictionary.patch file. To actually fix the dictionary, apply that file as well:
patch -p0 <dictionary.patch

4. * Update the jar file *

Type
jar -uf *.jar Buffer.class resources/dictionary/TheNameOfTheDictionary
This replaces the two files within the .jar archive with their fixed versions. Now install the fixed .jar plugin in its proper place. Restart jdictionary and enjoy it! :)

Discussion

  • Egmont Koblinger

    Patch to fix Buffer.java

    File Added: Buffer.patch

  • Egmont Koblinger

    Patch to fix Buffer.java of the Serbian plugin

    File Added: Buffer-Serbian.patch

  • Egmont Koblinger

    Fixed version of Buffer.class

    File Added: Buffer.class

  • Egmont Koblinger

    Fixed version of Buffer.class for the Serbian plugin

    File Added: Buffer-Serbian.class

  • Egmont Koblinger

    Script to convert mixed-encoding dictionaries

    File Added: convert-dictionary
