I read the new DSpace wiki page and learned that Jim Downing recently
suggested using Collator to sort in other languages. However, I do not
think this will work for Vietnamese. While digging into the DSpace
source code and trying to localize DSpace into Vietnamese, we found a
new approach to I18N for this sorting issue in particular. I would like
to share it with you and hope to receive your feedback before we go
further.

First, I realized that the sort_title column in the itembytitle table
was created to solve the sorting issue. Before data is inserted into
this table, it is passed through a normalization function that strips
leading articles such as "a" and "the" and converts all letters to lower
case. As far as I know, this column is used for nothing other than
sorting. If that is correct, we can store the data in this field in any
other format, even one that is not human-readable. So we decided to
store the data in an encoded format, which makes sorting easier
(especially in Vietnamese) and lets us change as little code as
possible.
- Create another properties file, called alphabet_xx.properties. In this
file, we use 2 Latin letters to represent a single 'prime' letter in our
language. (We use 2 because Vietnamese has 93 single "prime" letters in
its alphabet.) The rule for creating this file is to represent the first
letter of our alphabet by "aa" (the lowest value), the next letter by
"ab", and so on.
- When we load the messages file into a hash table, we also load this
alphabet file into its own hash table.
- In the normalization function, after stripping the articles (a, the,
...) and converting all words to lower case, we use the alphabet hash
table to encode the input string, then insert it into the database as
usual.
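
A minimal Java sketch of the encoding step described above. The class
name SortKeyEncoder, the "letter=code" file layout, and the choice to
reject unmapped characters are assumptions made for illustration; none
of this is existing DSpace code.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

// Hypothetical helper, not part of DSpace.
final class SortKeyEncoder {
    // Maps one letter of the target alphabet to its 2-letter Latin code.
    private final Map<String, String> codes = new HashMap<>();

    // alphabetProperties holds the text of an alphabet_xx.properties file,
    // assumed here to contain lines of the form "letter=code".
    SortKeyEncoder(String alphabetProperties) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(alphabetProperties));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        for (String letter : props.stringPropertyNames()) {
            codes.put(letter, props.getProperty(letter));
        }
    }

    // Expects a title that is already normalized (articles removed,
    // lower case) and returns the encoded sort key.
    String encode(String normalizedTitle) {
        StringBuilder key = new StringBuilder();
        normalizedTitle.codePoints().forEach(cp -> {
            String letter = new String(Character.toChars(cp));
            String code = codes.get(letter);
            if (code == null) {
                // Assumption: every character must appear in the table.
                throw new IllegalArgumentException("No code for: " + letter);
            }
            key.append(code);
        });
        return key.toString();
    }
}
```

With a table that assigns codes in alphabet order, plain string
comparison of the encoded keys follows the alphabet order instead of the
Unicode code point order.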
- The cost of this approach is that the sort_title column in itembytitle
doubles in size. Performance seemed fine when I tested with a small
amount of data (PostgreSQL with 10,000 items). However, I am not a
database expert, so I am not sure whether this will slow the system down
as our data grows. We would really appreciate your comments and
suggestions on this point. I think this approach can be applied to all
languages. With a 2-Latin-letter code, we can represent 26^2 = 676
single "target" letters.
- To create the browsing control bar (the A-Z bar in the English
interface), we just add a little code: a new class that returns an array
of all the single letters we can pull out of the database. First, select
the first 2 letters of the sort_title column:
SELECT DISTINCT LEFT(sort_title, 2) FROM itembytitle
Then fetch the resulting array and decode it using the alphabet hash
table above. This gives us the array of letters that exist in the
database. Pass this array to the JSP, then use a loop to echo them.
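
The decoding step for the browsing control bar could be sketched in Java
as follows. The class name BrowseBar and its method are hypothetical,
and the 2-letter prefixes are assumed to have been fetched already with
the SELECT statement above.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical helper, not part of DSpace.
final class BrowseBar {
    // prefixes: the 2-letter values returned by the SELECT DISTINCT query.
    // letterToCode: the alphabet hash table (letter -> 2-letter code).
    // Returns the letters present in the database, in alphabet order.
    static List<String> letters(Collection<String> prefixes,
                                Map<String, String> letterToCode) {
        // Invert the table so that codes can be turned back into letters.
        Map<String, String> codeToLetter = new HashMap<>();
        for (Map.Entry<String, String> e : letterToCode.entrySet()) {
            codeToLetter.put(e.getValue(), e.getKey());
        }
        List<String> result = new ArrayList<>();
        // TreeSet removes duplicates and sorts the codes; because codes
        // are assigned in alphabet order, this is also the letter order.
        for (String prefix : new TreeSet<>(prefixes)) {
            String letter = codeToLetter.get(prefix);
            if (letter != null) { // skip prefixes with no entry in the table
                result.add(letter);
            }
        }
        return result;
    }
}
```

On PostgreSQL versions that lack LEFT(), SUBSTR(sort_title, 1, 2) should
give the same prefixes.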
- The advantage of this approach is that it solves the tone mark problem,
a real headache in languages similar to Vietnamese. I told you above
that Vietnamese has 93 single 'prime' letters, but that is not quite
true; I simplified it on purpose. There are only 34 letters: compared to
the Latin alphabet, we have 6 more vowels (ă, â, ê, ô, ơ, ư) and 5 tone
marks. The tone marks can only go with vowels, and a vowel with a tone
mark is still considered part of that vowel's group in the browsing
control bar. To make this clear, please look at this example.
The letter "a" in our language has 6 different forms when combined with
the tone marks: a, à, ả, ã, á, ạ
They must all be in the same group A in the browsing control bar. To do
this, I make no change to the code and only redefine the alphabet
properties file as follows, using a combination of a major value and a
minor value for the encoded values.
- Ex: the letter "a" in our language has the major code "aa".
- And we define à, ả, ã, á and ạ with a combination of major and minor
values:
à aazv (minor value is zv)
ả aazw (minor value is zw)
ã aazx (minor value is zx)
á aazy (minor value is zy)
ạ aazz (minor value is zz)
next letter ab
next letter ac
...... .... ....
and so on...
Keeping the same SQL statement as above (taking only the first 2
letters), we can put a, à, ả, ã, á, ạ in a single group "A".
We used zz, zy, zx, zw, zv in reversed order to make sure of 3 things:
(1) keep the right order of tone marks under our language's sorting
rules, (2) ensure that no other letter already uses these values, and
(3) keep the length consistent (= original string length * 2).
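
The major/minor scheme can be sketched in Java as below. The codes for
"a" and its five toned forms are taken from the table above. The class
name ToneMarkTable is hypothetical, and assigning "ab" to ă (the letter
after "a" in the Vietnamese alphabet) is an assumption, since the table
only says "next letter".

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical demo class, not part of DSpace.
final class ToneMarkTable {
    private static final Map<String, String> CODES = new HashMap<>();
    static {
        CODES.put("a", "aa");        // major value only
        CODES.put("\u00e0", "aazv"); // a + grave: major aa, minor zv
        CODES.put("\u1ea3", "aazw"); // a + hook above: minor zw
        CODES.put("\u00e3", "aazx"); // a + tilde: minor zx
        CODES.put("\u00e1", "aazy"); // a + acute: minor zy
        CODES.put("\u1ea1", "aazz"); // a + dot below: minor zz
        CODES.put("\u0103", "ab");   // a + breve, assumed to be "next letter"
    }

    // Encodes a normalized string letter by letter.
    static String encode(String normalized) {
        StringBuilder key = new StringBuilder();
        normalized.codePoints().forEach(cp -> {
            String letter = new String(Character.toChars(cp));
            String code = CODES.get(letter);
            if (code == null) {
                throw new IllegalArgumentException("No code for: " + letter);
            }
            key.append(code);
        });
        return key.toString();
    }

    // Same idea as the SQL prefix: the first 2 letters name the group.
    static String group(String sortKey) {
        return sortKey.substring(0, 2);
    }
}
```

All six forms of "a" report group "aa", while comparing the full keys
keeps them in tone mark order and ahead of the next letter.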
- The biggest advantage of this approach is that it keeps the code as
simple as possible while shifting all the flexibility into the
properties file. In that way, we can support languages around the world.
I have tried my best with my limited vocabulary, and I hope you can all
understand my message. I apologize for anything unclear in my
explanation above. I will try to send you the code on Monday next week.
I am looking forward to your response. Have a nice weekend.
Duong Quang Minh
An Giang University, Library