Character Encoding Issues

Help
timt
2010-01-28
2013-05-28
  • timt

    timt - 2010-01-28

    Although I have Bibutils installed and working now, I am having some issues with character encoding. I installed my database with UTF-8 encoding; this is confirmed when I issue the MySQL command

    SHOW CREATE DATABASE DATABASE_NAME;

    However, when I change the value for '$contentTypeCharset' to UTF-8, I am not able to import XML files from EndNote (these files have been saved with UTF-8 encoding). When I try to import an XML file, I receive the following error message:

    There were validation errors regarding the data you entered:
    Record 1: Unrecognized data format! Required field missing: TY
    Skip records with unrecognized data format

    However, if I change the value of '$contentTypeCharset' to ISO-8859-1, I am then able to import the XML files, except that accented characters do not display correctly (all of my records are in Portuguese, so accents are important). Once I have imported the records, if I then change the value of '$contentTypeCharset' back to UTF-8, most of the characters display correctly, but many do not. I could correct the remaining errors manually, but it would be great if there were a global solution to the problem.
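
    One common source of this kind of symptom, offered here only as an illustration and not as a diagnosis of this particular setup, is UTF-8 data being interpreted as ISO-8859-1. The Python sketch below uses a hypothetical sample string to show how one accented character turns into two wrong characters, and how the text can be recovered as long as no bytes were lost along the way.

```python
# Illustration only: the sample string is hypothetical, not taken from the database.
original = "Abrahão"

# In UTF-8, 'ã' is stored as two bytes: 0xC3 0xA3.
utf8_bytes = original.encode("utf-8")

# Reading those bytes as ISO-8859-1 maps every single byte to one character.
misread = utf8_bytes.decode("iso-8859-1")
print(misread)   # AbrahÃ£o

# Reversing the wrong decoding step restores the original text.
repaired = misread.encode("iso-8859-1").decode("utf-8")
print(repaired)  # Abrahão
```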

    Thanks again for all your help!

    Tim

     
  • timt

    timt - 2010-01-29

    When my sysadmin compiled Bibutils on our server, he used version 4.7. I went back and installed version 3.4, and now all the character encoding issues have been solved.

    Thanks again!

     
  • timt

    timt - 2010-01-29

    Ah, but one more question. Is there a way to configure refbase so that users can search without using accent marks and still retrieve strings that contain them? For example, an author search for "Abrahao" would then also retrieve "Abrahão". This would make a huge difference.
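
    The behaviour asked for here is often called accent folding. The Python sketch below shows the general idea only; it is not a refbase feature, and the function names `fold_accents` and `accent_insensitive_contains` are hypothetical. It assumes that stripping Unicode combining marks after decomposition is an acceptable definition of "without accent marks".

```python
import unicodedata


def fold_accents(text: str) -> str:
    """Return text with combining accent marks removed (e.g. 'ã' becomes 'a')."""
    # NFD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Keep only the characters that are not combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def accent_insensitive_contains(haystack: str, needle: str) -> bool:
    """Case- and accent-insensitive substring test (hypothetical helper)."""
    return fold_accents(needle).casefold() in fold_accents(haystack).casefold()


print(fold_accents("Abrahão"))                                # Abrahao
print(accent_insensitive_contains("Abrahão, R.", "abrahao"))  # True
```

    A real search feature would have to apply the same folding on the database side as well, which this sketch does not attempt.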

     
  • Matthias Steffens

    Hi Tim,

    glad that Bibutils is working for you now.

    Is there a way to configure refbase so that users can search without using accent marks and still retrieve strings that contain them?

    This has been proposed previously, see e.g. here:

    http://www.refbase.net/index.php/Planned_feature_additions#Simple_handling_of_accented_characters_or_umlauts

    but, unfortunately, nothing has been done about it yet.

    Power users could use a search pattern like this:

    Abrah[aã]o

    to find both authors, "Abrahao" and "Abrahão", but that would require some knowledge about the refbase search syntax:

    http://www.refbase.net/index.php/Searching#Using_metacharacters_to_form_complex_queries

    Obviously, this wouldn't help with novice users. :-/

    That said, it might be already helpful if it would be possible to teach users just two special character sequences:

    - the dot character (".") matches any single character, i.e. a search for "Abrah.o" would also find both, "Abrahao" and "Abrahão"

    - the dot-plus character sequence (".+") matches a string of one or more characters

    Matthias

     
  • timt

    timt - 2010-01-31

    Being able to use regex syntax for search queries is a nice feature. However, I just tried a search for "Abrah.o" and was told that my query didn't produce any results. Yet if I search for "Abrah.+" or "Abrah.+o", I retrieve 4 records. Any idea why the first query doesn't work?

    Cheers,
    Tim

     
  • Matthias Steffens

    Hi Tim,

    I just tried a search for "Abrah.o" and was told that my query didn't produce any results.

    I just tried it in our refbase beta database (which is UTF8-based) at

    http://refbase.textdriven.com/beta/

    and I see the same behaviour as you describe. Searches that include literal non-ASCII characters (e.g. '… WHERE title RLIKE "Lógicas" …') work fine, but if non-ASCII characters are included in a regular expression pattern, then the search does not produce correct results. E.g. these two MySQL WHERE clauses fail:

    … WHERE title RLIKE "L[oó]gicas" …
    … WHERE title RLIKE "L.gicas" …

    whereas this works:

    … WHERE title RLIKE "L.+gicas" …

    The fact that '.' doesn't work but that '.+' does hints at a multi-byte problem. Googling for this issue gives e.g. these results:

    http://bugs.mysql.com/bug.php?id=34473
    http://bugs.mysql.com/bug.php?id=30241
    http://dev.mysql.com/doc/refman/5.1/en/regexp.html

    The latter link (i.e. the "Regular Expressions" page in the MySQL documentation) gives this warning:

    The REGEXP and RLIKE operators work in byte-wise fashion, so they are not multi-byte safe and may produce unexpected results with multi-byte character sets. In addition, these operators compare characters by their byte values and accented characters may not compare as equal even if a given collation treats them as equal.

    So, unfortunately, the MySQL Regex library is still not multi-byte aware. And this problem also seems to persist in upcoming MySQL versions. :-(
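
    The byte-wise behaviour can be reproduced outside MySQL. In the sketch below, Python's `re` module applied to `bytes` stands in for a byte-wise regex engine; this is an assumption made for illustration and not MySQL's actual regex library. The helper names are hypothetical, and the bracket pattern listing both "o" and "ó" is just one example of a pattern that contains a non-ASCII character.

```python
import re


def bytewise_search(pattern: str, text: str) -> bool:
    """Match on UTF-8 bytes, so '.' consumes exactly one byte."""
    return re.search(pattern.encode("utf-8"), text.encode("utf-8")) is not None


def charwise_search(pattern: str, text: str) -> bool:
    """Match on decoded characters, so '.' consumes one whole character."""
    return re.search(pattern, text) is not None


title = "Lógicas"  # 'ó' occupies two bytes in UTF-8

for pattern in ("L.gicas", "L[oó]gicas", "L.+gicas"):
    print(pattern, bytewise_search(pattern, title), charwise_search(pattern, title))
# L.gicas False True
# L[oó]gicas False True
# L.+gicas True True
```

    In byte mode, "." and the bracket expression each consume only the first byte of "ó", so the following "g" is compared against the second byte and the match fails. ".+" succeeds because it can consume both bytes.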

    So I fear that there's not much we can do about it. As a workaround, you could use something like:

    … WHERE title RLIKE "L[^[:space:][:punct:]]+gicas" …

    which should give more accurate results than ".+". This is because the strings matched by [^[:space:][:punct:]]+ (i.e. one or more characters that are not a space/tab/newline/return or a punctuation character) won't cross word boundaries.
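
    The difference between ".+" and a pattern that stays inside one word can be sketched in Python as follows. Python's `re` does not support POSIX classes, so ASCII whitespace and ASCII punctuation are used here as an approximation; that substitution, the byte-wise matching, the helper name and the sample titles are all assumptions made for illustration.

```python
import re
import string

# Approximation of "not whitespace and not punctuation", one or more bytes.
WORD_BYTES = b"[^\\s" + re.escape(string.punctuation.encode("ascii")) + b"]+"


def bytewise_search(pattern: bytes, text: str) -> bool:
    """Match on UTF-8 bytes to mimic a byte-wise regex engine."""
    return re.search(pattern, text.encode("utf-8")) is not None


greedy = b"L.+gicas"                         # may run across several words
word_bound = b"L" + WORD_BYTES + b"gicas"    # must stay within one word

# Hypothetical sample titles.
for title in ("Lógicas", "Linguagens e práticas pedagógicas"):
    print(title, bytewise_search(greedy, title), bytewise_search(word_bound, title))
# Lógicas True True
# Linguagens e práticas pedagógicas True False
```

    The second title is a false positive for ".+", because the match starts at "L" in the first word and ends at "gicas" in the last one. The word-bound pattern rejects it while still matching the two-byte "ó".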

    Obviously, these rather complex patterns are less than ideal. Sorry for the trouble.

    Matthias

     
