UTF-8 problems

Help
2013-03-12
2015-01-22
  • Laura Williams

    Laura Williams - 2013-03-12

    Hi,

    Following advice before I think I have finally managed to set up my database so that it can accept and interpret UTF-8 characters.  Thank you for your previous help, especially Karnesky.

    I am now at the point where I can import references with accents in and search for the accented character and the result will be returned.  All accented characters are also displayed correctly.

    The next problem I have is that I have author names that are sometimes accented and sometimes not.  For example, I have one name Megarbane which is sometimes listed as Mégarbane.  When a search is conducted I only find the accents if I put the accent into the search query.  So if I search for Megarbane I get 74 results and if I search for Mégarbane I get a different 33 results.  This means that users are missing references depending on whether they use the accent or not.  I know that you can search for both using M(e|é)garbane which also works very well on the database, however, there are some users who will never be able to remember/adapt this approach.

    In trying to simplify my interface I had a similar problem with synonyms which I resolved by using preg_replace to replace their query with an amended version.  For example, if they type paracetamol into the search engine I have code which replaces this with (paracetamol|acetaminophen).  This is also working really well, so I was thinking that I could do the same with accented characters.  I currently have the following code below in search.php.  However, this returns a php error so it obviously does not work in the same way as the example above.

    What I need to know is whether my database is in fact working correctly (because some websites seem to suggest that accented characters will automatically be picked up if you have character sets set up correctly).  Whether there is a simpler way of searching for both or whether you know why my current code is returning a php error!!

    I am sorry to ask so many questions but I have been searching online for months and I cannot seem to work out a solution.

    Any help would be greatly appreciated.

    Thanks,

    Laura

    $inputpatternaccent = array();
    $inputpatternaccent = '/a/';
    $inputpatternaccent = '/e/';
    $inputpatternaccent = '/i/';
    $inputpatternaccent = '/o/';
    $inputpatternaccent = '/u/';
    $inputpatternaccent = '/y/';
    $inputpatternaccent = '/s/';
    $inputpatternaccent = '/z/';
    $inputpatternaccent = '/ae/';
    $inputpatternaccent = '/n/';
    $inputpatternaccent = '/c/';
    $inputpatternaccent = '/ss/';

    $replacementpatternaccent = array();
    $replacementpatternaccent = '(a|à|á|â|ã|ä|å|À|Á|Â|Ã|Ä|Å)';
    $replacementpatternaccent = '(e|è|é|ê|ë|È|É|Ê|Ë)';
    $replacementpatternaccent = '(i|ì|í|î|ï|Ì|Í|Î|Ï)';
    $replacementpatternaccent = '(o|ð|ò|ó|ô|õ|ö|ø|Ò|Ó|Ô|Õ|Ö|Ø)';
    $replacementpatternaccent = '(u|ù|ú|û|ü|Ù|Ú|Û|Ü)';
    $replacementpatternaccent = '(y|ý|ÿ|Ý|Ÿ)';
    $replacementpatternaccent = '(s|š|Š)';
    $replacementpatternaccent = '(z|ž|Ž)';
    $replacementpatternaccent = '(ae|æ|Æ)';
    $replacementpatternaccent = '(n|ñ|Ñ)';
    $replacementpatternaccent = '(c|ç|Ç)';
    $replacementpatternaccent = '(ss|ß)';

    $query = preg_replace($inputpatternaccent, $replacementpatternaccent, $query);

     
  • Richard Karnesky

    We'd be interested in patches that reflected your changes that would either be includeded in 'contrib' to show other users how to accomplish this or, depending on implementation, to become part of the core product.

    Character set issues are hard.  Your PHP file should use the same character encoding at the database/tables/columns you're searching.  You could display the $query string to help troubleshoot.  Also note that there are multiple ways for UTF-8 to store accented characters.   And 'RLIKE' works byte-per-byte.  If some of your accented characters are stored as multibyte UTF-8, strange things might occur.

    We've talked about implementing accent-insensitve search in the past.  One thing to note is that it magically works in modern UTF-8 MySQL if you use 'LIKE' instead of 'RLIKE'.  The query

    SELECT * FROM `refs` WHERE author LIKE '%muller%';
    

    will find 'Müller', but RLIKE won't.

    And note that this trick can also be forced to work in latin-1 tables with conversion, a'la:

    select author from refs where CONVERT(author using utf8) LIKE _utf8'%muller%';
    

    If we just used 'LIKE', it would remove flexibility for power users.  This might not be an issue for you.  But, one compromise might be to do something along the lines of:

    SELECT * FROM `refs` WHERE author LIKE '%muller%' OR author RLIKE 'muller';
    
     
  • Laura Williams

    Laura Williams - 2013-03-13

    Thanks for your quick reply.  I have been looking at this all day but not really got any further.  Firstly I tried using the SQL search on Refbase to perform a LIKE search rather than an RLIKE.  This still only returns the non-accented characters.  I really am not sure why!

    Secondly I note that the php files have to be UTF-8 coded as well.  This was not the case before so I have changed the search.php file.  Would I have to change all my php files or just the search one?  At the moment all the others are not UTF-8.

    An interesting quirk I have just noticed.  When I search for Mégarbane I get the results for records with the accent.  When I search for M(e|é)garbane I get all results back as expected.  However, when I search for Mgarbane following your guidance pages on refbase.net I am only getting the non accented references returned.  Does this mean that my REGEX is not working in a way that is normally expected with Refbase?

    With regard to patches and contributing, I have no problem sharing if it would help but I am not a developer and a lot of my code is messy and potentially destroying other systems you have in place.  I wouldn't want to provide you with something when I have no way of checking the code thoroughly.

    Many thanks,

    Laura

     
  • Richard Karnesky

    I have no idea why 'LIKE' in sql_search isn't working for you.  Does it not work when you force encoding, as in my second example?

    The coding of any file that has those replacementpatterns needs to be correct.

     
  • Laura Williams

    Laura Williams - 2015-01-22

    Hi Richard,

    I am sorry to bring this back up again but I am hopefully a bit more wise this time! I have managed to start again with a new database and I have got 'LIKE' working in an sql search. You mentioned before that the website uses RLIKE. Does this mean that to get it working for the website as well I just need to change all instances of RLIKE to LIKE? Whereabouts would I find this?

    Thanks,

    Laura

     
  • Richard Karnesky

    Yes: The default SQL query uses RLIKE to enable regular expression searches to be available to power users. So, you have two options:

    • Replace 'RLIKE' with 'LIKE', making accent-agnostic searches work, but not depending on your custom code and breaking regular expression searches.
    • Edit the SQL query the way you were trying to in your first post. This is likely to be more challenging, but will lead to a more useful patch.

    In either case, there are several places where SQL queries are assembled. Use 'grep' or other file search to find the many places we use 'RLIKE'.

     
  • Laura Williams

    Laura Williams - 2015-01-22

    Thanks for the quick reply. I tried searching for all instances of RLIKE and replacing with LIKE. This did not achieve the desired outcome. I think it is also because I need to edit the search too add the '%' either side of the word. I am now thinking of trying the second option and editing the query as it is assembled in the SQL. It is a work in progress... are there any particular pages you can suggest I start in? I keep seeing a link to quote_smart but cannot find where this is stored.

    Thanks,

    Laura

     

Get latest updates about Open Source Projects, Conferences and News.

Sign up for the SourceForge newsletter:





No, thanks