RIS import error

Help
lamjas
2010-07-11
2013-05-28
  • lamjas

    lamjas - 2010-07-11

    I imported some records into Refbase. However, some author names were imported with errors like the following:

    Original:
    Hill, Peter C.;; Pargament, Kenneth I.

    Imported version:
    Hill, Pet.er C.; Pargament, Kennet.h I.

    How can I solve the problem??

     
  • lamjas

    lamjas - 2010-07-13

    Just another update:

    1) I manually corrected the improperly displayed names. Under "show all", they are displayed correctly. However, under "home", they still show the error.

    2) In another thread (https://sourceforge.net/projects/refbase/forums/forum/218758/topic/3659988), you mentioned that this improper display of author names may be due to PHP PCRE. I changed the setting so that it supports UTF-8, but that does not solve the error at all (yes, my database is set to UTF-8).

    Could you help??

     
  • Richard Karnesky

    > you mentioned that this improper display of author names may be due to PHP PCRE.

    More specifically, it is likely due to PHP PCRE being compiled without support for Unicode characters on your server. Did you confirm how it was built?  More details on your hosting setup would be useful too.

     
  • lamjas

    lamjas - 2010-07-13

    Hi, thanks for your reply. Because I installed the program on a server maintained by a university department, I asked the "tech" guy to change the setting. They confirmed the following:

    >> The PCRE package has been updated to support both UTF-8 and Unicode
    >> properties (the system package came with UTF-8 support but not Unicode
    >> properties).
    >>
    >> PCRE version 6.6 06-Feb-2006
    >> Compiled with
    >>     UTF-8 support
    >>     Unicode properties support
    >>     Newline character is LF
    >>     Internal link size = 2
    >>     POSIX malloc threshold = 10
    >>     Default match limit = 10000000
    >>     Default recursion depth limit = 10000000
    >>     Match recursion uses stack
    >>

     
  • Matthias Steffens

    Hi lamjas,

    since your server now seems to support Unicode properties (i.e. the PCRE package has been compiled with the "--enable-unicode-properties" option), the issue you've reported should be gone now. I.e., there shouldn't be any strange periods in first names (after the letters "U", "T" & "L"). Can you confirm this?
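
    One way to check this from within PHP itself is sketched below. It is only a minimal sketch: the function name 'pcreUnicodeSupport' is hypothetical and not part of refbase. Checking from PHP may be worthwhile because PHP can be linked against a different PCRE build than the system package (an assumption that is worth verifying on your setup).

    ```php
    <?php
    // Hypothetical helper (not part of refbase): reports whether the PCRE
    // library used by PHP handles UTF-8 mode and Unicode property escapes.
    function pcreUnicodeSupport()
    {
        // "@" hides the warning that preg_match() raises if a feature is missing;
        // in that case preg_match() returns false instead of 1.
        $utf8  = (@preg_match('/^.$/u', "\xC3\xA9") === 1);      // "é" seen as ONE character
        $props = (@preg_match('/^\p{Lu}$/u', "\xC3\x89") === 1); // "É" seen as an uppercase letter
        return array('utf8' => $utf8, 'unicode_properties' => $props);
    }

    print_r(pcreUnicodeSupport());
    ```

    If 'unicode_properties' comes back as false while 'utf8' is true, PHP's PCRE has UTF-8 support but lacks Unicode properties.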

    Thanks, Matthias

     
  • lamjas

    lamjas - 2010-07-14

    No. I tried it already and imported new references, but I still have problems.

     
  • Matthias Steffens

    Hi lamjas,

    are the problems exactly as before, i.e., do you get periods inserted in first names (after the letters "U", "T" & "L") on import? Or do you see different issues now? Please elaborate.

    Are you really sure that support for Unicode properties is now enabled on your server?

    If the issue persists, you may use a workaround, e.g. you could try to open file '' and change all occurrences of variable '$shortenGivenNames' from this:

    $shortenGivenNames = true;

    to this:

    $shortenGivenNames = false;

    To use a workaround for this problem that also pertains to the refbase citation output, you could open file 'includes/transtab_latin1_charset.inc.php', copy its contents, then open file 'includes/transtab_unicode_charset.inc.php', select all & paste the code from the clipboard. I.e. replace the contents of file 'includes/transtab_unicode_charset.inc.php' with the contents of 'includes/transtab_latin1_charset.inc.php'. Your refbase UTF-8 installation should then use regular POSIX character classes instead of Unicode character properties. In doing so, you'd likely lose the ability to properly match Unicode characters, e.g. in first names. But the problem you're seeing should go away.

    Let us know if this doesn't help.

    Matthias

     
  • lamjas

    lamjas - 2010-07-16

    Hi Matthias,

    Thanks for your inputs.

    Before the workaround, the problem was the same as you described in this thread. I got periods inserted in first names (of both authors and editors) after the letters "U", "T" and "L".

    I tried the workaround you indicated above. Now, it works perfectly.

    So, does it mean my server is still not UTF-8 supported??

     
  • Matthias Steffens

    Hi lamjas,

    good to hear that at least the workaround is working for you. Note that if you implemented the first workaround mentioned in my previous post (where I meant to write "open file 'import.inc.php'"), this will only affect import but not citation output. To work around issues with the latter, try the second suggestion from my earlier post.

    > So, does it mean my server is still not UTF-8 supported??

    Since the Unicode properties used in file 'includes/transtab_unicode_charset.inc.php' are still misinterpreted as regular characters by your server, I'd say that this indicates that your server still doesn't support Unicode properties correctly. But maybe I'm overlooking something here? Not sure.

    To test things further, you could create a new PHP script containing the below code and copy it to your server. When executing this script in your browser, you should see the following text on success:

    input: Keith Richard
    output: K. R.

    If your server doesn't support Unicode properties, then you'd probably see:

    input: Keith Richard
    output: Keit.h Richard

    Here's the code for the test script:

    <?php
        // Small PHP script to test Unicode properties. If successful, output should
        // be just the first uppercase character from the string given in $firstName.
        // For more info on Unicode properties see:
        // <http://www.php.net/manual/en/regexp.reference.unicode.php>
        // Test string (should start with an uppercase char and contain some
        // "U", "T" or "L" characters)
        $input = "Keith Richard";
        // Matches Unicode lower case letters
        $lower = "\p{Ll}"; // Unicode-aware equivalent of "[:lower:]"
        // Matches Unicode upper case letters
        $upper = "\p{Lu}\p{Lt}"; // Unicode-aware equivalent of "[:upper:]"
        // Defines the PCRE pattern modifier(s) to be used in conjunction with the
        // above variables. The "u" (PCRE_UTF8) pattern modifier causes PHP/PCRE
        // to treat pattern strings as UTF-8. More info on pattern modifiers:
        // <http://www.php.net/manual/en/reference.pcre.pattern.modifiers.php>
        $modifiers = "u";
        // Test search & replace pattern which should reduce given names to initials
        $output = preg_replace("/([$upper])[$lower]+/$modifiers", "\\1", $input);
        // Separate initials with a dot character:
        $output = preg_replace("/([$upper])(?=[^$lower]+|$)/$modifiers", "\\1.", $output);
    
        echo "<html>\n<head><title>Unicode properties test</title></head>\n<body>";
        echo "\ninput: " . $input;
        echo "\n<br>\n";
        echo "output: " . $output;
    
        echo "\n</body>\n</html>";
    ?>
    

    HTH, Matthias

     
  • lamjas

    lamjas - 2010-07-20

    Okay. I confirmed with the IT guys. The errors were due to the MySQL database character set. The default set is latin1 instead of utf8.

    I think I will stick with the workaround as it works smoothly.

    However, I found out that the first name is shortened into initials even though I have changed $shortenGivenNames = false; in "includes\import.inc.php". What's wrong now??

     
  • Matthias Steffens

    Hi lamjas,

    > The default set is latin1 instead of utf8

    No matter what charset you're using (latin1 or utf8), please make sure that the same charset is used throughout the whole system. More info and guidance is given here:

    http://www.refbase.net/index.php/Installation-Troubleshooting#Problems_with_special_characters

    > However, I found out that the first name is shortened into initials even though I have changed $shortenGivenNames = false; in "includes\import.inc.php"

    When exactly do you see first names being shortened to initials, on import or during output of formatted citations?

    For import, changing ALL occurrences of '$shortenGivenNames' to false in file 'includes/import.inc.php' should suffice. Note that, in total, there are seven places in that file where you'd need to change '$shortenGivenNames' to false.
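
    To avoid missing one of these places when editing by hand, the change could also be scripted. The following is only a sketch: the function 'disableShortenGivenNames' is hypothetical (not part of refbase), and it assumes that each occurrence is written as an assignment of the form '$shortenGivenNames = true;'.

    ```php
    <?php
    // Hypothetical helper (not part of refbase): switches every
    // "$shortenGivenNames = true;" assignment in the given PHP source to false.
    // $count receives the number of assignments that were changed.
    function disableShortenGivenNames($source, &$count = 0)
    {
        return preg_replace(
            '/(\$shortenGivenNames\s*=\s*)true(\s*;)/',
            '${1}false${2}',
            $source,
            -1,
            $count
        );
    }

    $file = 'includes/import.inc.php'; // path as given in this thread
    if (is_file($file)) {
        copy($file, $file . '.bak'); // keep a backup before patching
        file_put_contents($file, disableShortenGivenNames(file_get_contents($file), $n));
        echo "Changed $n assignment(s)\n";
    }
    ```

    If the reported number of changes differs from what you expect, compare the patched file against the '.bak' copy before using it.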

    W.r.t. citation output: This is defined by the citation style used. Either use a style that doesn't shorten first names to initials (e.g. the "Chicago" or "MLA" citation style), or modify your style of choice accordingly. To do so, open the 'cite_*.php' file and change the 12th parameter in all 'reArrangeAuthorContents()' function calls to false.

    Note that you can specify the default citation style in variable '$defaultCiteStyle' in file 'initialize/ini.inc.php'.

    HTH, Matthias

     
  • lamjas

    lamjas - 2010-07-21

    As I mentioned before, I installed the refbase on a departmental server which is shared with other people. I will talk to the tech guy later and see whether I can set the same charset throughout the whole system.

    For the import function I mentioned, let me explain a bit more.

    When I used the workaround, the author's first and middle names were shortened to initials automatically (even though the imported RIS file has the author's full name). For example, the full name in the file is "Miller, Paul Jonathan". When I import it, the record becomes "Miller, P.J.". When I click to edit the record, the author is "Miller, P.J.", not the full name. Obviously, this is not the citation output.

     
  • Matthias Steffens

    Hi lamjas,

    thanks for the clarifications.

    > I will talk to the tech guy later and see whether I can set the same charset throughout the whole system.

    Ok, good. To avoid any misunderstandings, please pass the link I gave in my last post to your tech guy.

    > When I import it, the record becomes "Miller, P.J.".

    Hmm, strange. One way to help you further would be that you'd send me all refbase files from your *server* that you've modified, plus the RIS file you're trying to import.

    Thanks, Matthias

     
  • lamjas

    lamjas - 2010-07-23

    You can download the refbase files here:

    http://www.lamjas.com/refbase.tar

    For security reasons, I have deleted out those server and database settings.

    You can use this RIS file for testing:

    http://www.lamjas.com/RIS.txt

     
  • lamjas

    lamjas - 2010-07-23

    Hello again,

    Just another update, hopefully it can help you out a bit to solve my problem:

    1) In mysql, I confirmed the following:

    mysql> SHOW VARIABLES LIKE '%character%';
    +--------------------------+----------------------------+
    | Variable_name            | Value                      |
    +--------------------------+----------------------------+
    | character_set_client     | utf8                       |
    | character_set_connection | utf8                       |
    | character_set_database   | utf8                       |
    | character_set_filesystem | binary                     |
    | character_set_results    | utf8                       |
    | character_set_server     | utf8                       |
    | character_set_system     | utf8                       |
    | character_sets_dir       | /usr/share/mysql/charsets/ |
    +--------------------------+----------------------------+
    8 rows in set (0.00 sec)

    mysql> SHOW VARIABLES LIKE '%collation%';
    +----------------------+-----------------+
    | Variable_name        | Value           |
    +----------------------+-----------------+
    | collation_connection | utf8_general_ci |
    | collation_database   | utf8_general_ci |
    | collation_server     | utf8_general_ci |
    +----------------------+-----------------+
    3 rows in set (0.00 sec)

    According to your troubleshooting (http://www.refbase.net/index.php/Troubleshooting#MySQL_migration_and_character_set_problems), the above looks good to me.

    When I use the original file 'includes/transtab_unicode_charset.inc.php', i.e., without the workaround, the old problem still exists: I get periods in first names (of both authors and editors) after the letters "U", "T" and "L".

    If I use the workaround, the full names become initials. Also, I found that if the first names contain Latin letters, all author names become blank.

     
  • Matthias Steffens

    Hi lamjas,

    you mentioned that your refbase MySQL database is UTF-8 based, but just to be sure: what's the output for this MySQL command entered in your MySQL command line interpreter (e.g. the 'mysql' CLI tool or phpMyAdmin):

    SHOW CREATE DATABASE database_name;

    where "database_name" is replaced by the actual name of your refbase MySQL database.

    Thanks for your refbase files, I've checked them and you seem to have replaced just the contents of file 'includes/transtab_unicode_charset.inc.php' with the ones from file 'includes/transtab_latin1_charset.inc.php' (the second workaround from the workarounds I proposed earlier). Since you still have '$shortenGivenNames = true' in file 'includes/import.inc.php', first names should get reduced to initials.

    Thanks also for the 'SHOW VARIABLES LIKE …' output, this looks indeed good. You have also set variable '$contentTypeCharset' in file 'initialize/ini.inc.php' to "UTF-8", so this all seems fine to me.

    Your RIS file looks also good to me and it imports just fine using my local refbase installation. And first names get reduced to initials as it should.

    So, are you saying that with exactly this setup, you still see strange periods in first names on import? And does this really only happen after the letters "U", "T" & "L"? I ask since this wouldn't really make sense because with your above modification to file 'includes/transtab_unicode_charset.inc.php' you are NOT using any Unicode character properties anymore.

    > Also, I found that if the first names contain Latin letters, all author names become blank.

    By "Latin letters" do you mean non-ASCII chars from the latin1 (ISO-8859-1) charset?

    If you haven't done so already, may I ask you to run the small PHP script to test Unicode properties (which I posted earlier) on your server. What's the output from this script?

    Thanks, Matthias

     
  • lamjas

    lamjas - 2010-07-23

    Hi Matthias,

    Thanks for your update. I did the following in mysql and the results were:

    mysql> SHOW CREATE DATABASE beliefs;
    +----------+------------------------------------------------------------------+
    | Database | Create Database                                                  |
    +----------+------------------------------------------------------------------+
    | beliefs  | CREATE DATABASE `beliefs` /*!40100 DEFAULT CHARACTER SET utf8 */ |
    +----------+------------------------------------------------------------------+
    1 row in set (0.00 sec)

    Many thanks for figuring out that I had overlooked the '$shortenGivenNames = true' setting for RIS import. Now, it works great.

    Let me provide you with another RIS file. I imported this one and found that all author names become blank when I use the workaround.
    http://www.lamjas.com/RIS2.txt

    Million thanks again.

     
  • Matthias Steffens

    Hi lamjas,

    thanks for the update, your refbase MySQL database seems to be UTF-8, so that's fine.

    W.r.t. your 'RIS2.txt' file: This file is saved as UTF-8 and includes a BOM (byte order mark) at the beginning of the file. This BOM character might confuse PHP or refbase, so I'd try to resave that RIS file as UTF-8 without BOM before importing. Also, it's worth re-saving the file with Unix (LF) line endings instead of Windows (CRLF) line endings.

    But instead of uploading the file via the "Upload file" button, you could also try to open the file in a text editor (making sure that all non-ASCII characters are displayed correctly), copy its contents and paste them directly into the text input form of 'import.php'. This often works better. Does this make any difference?
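
    If a text editor that can save UTF-8 without BOM isn't at hand, the cleanup described above (drop the BOM, convert CRLF to LF) could also be done with a few lines of PHP. This is only a sketch: the function name 'cleanRisText' is hypothetical and not part of refbase.

    ```php
    <?php
    // Hypothetical helper (not part of refbase): removes a leading UTF-8 BOM
    // (bytes EF BB BF) and converts Windows (CRLF) line endings to Unix (LF).
    function cleanRisText($text)
    {
        if (substr($text, 0, 3) === "\xEF\xBB\xBF") {
            $text = substr($text, 3); // drop the three BOM bytes
        }
        return str_replace("\r\n", "\n", $text);
    }

    $file = 'RIS2.txt'; // file name as used in this thread
    if (is_file($file)) {
        file_put_contents('RIS2_clean.txt', cleanRisText(file_get_contents($file)));
    }
    ```

    The cleaned copy is written to a separate file (here 'RIS2_clean.txt', an arbitrary name) so that the original stays untouched.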

    Anyways, while thinking about it, my proposed workaround of replacing the contents of file 'includes/transtab_unicode_charset.inc.php' with the ones from file 'includes/transtab_latin1_charset.inc.php' probably won't work correctly for a UTF-8 based setup (sorry about that!). So I guess this is why you're seeing problems and why we did establish a dedicated file ('includes/transtab_unicode_charset.inc.php') for Unicode workflows in the first place. The patterns in 'includes/transtab_latin1_charset.inc.php' won't work with UTF-8 strings, at least not without the 'u' pattern modifier. So you might try to use:

    $patternModifiers = "u";

    and see if that helps. However, it seems that PCRE and other regular expression extensions are not locale-aware, i.e. patterns like '[]' won't match Unicode characters anyway. So your best bet would be to get support for Unicode character properties working on your server and then use the original file 'includes/transtab_unicode_charset.inc.php'.

    As I mentioned earlier, it would be really interesting to get the output from the test script which I posted earlier. Did you have any chance to run this on your server?

    Thanks, Matthias

     
  • lamjas

    lamjas - 2010-07-26

    Hello again,

    Thanks for your help and explanations.

    I tried the test script. As you might have expected, my server doesn't support Unicode properties. Thus I saw:

    input: Keith Richard
    output: Keit.h Richard

    I also contacted the tech guy about this problem. In the email, he stated the following:

    The mySQL is definitely UTF-8 aware, please refer to:

    http://dev.mysql.com/doc/refman/5.0/en/charset.html

    Most of the UTF-8 problem lies in the PHP, please refer to these links:
    http://textpattern.net/wiki/index.php?title=Unicode_Support
    http://www.phpwact.org/php/i18n/charsets?s=utf8

    ====================
    I haven't read through the information he gave me yet. But, FYI, the PHP version installed on server is 5.1.6.

    I also tried your new workaround. However, it did not work out perfectly. After I set $patternModifiers = "u"; the authors' first names become blank, while the last names remain. So, the output becomes "Smith,, Miller,, & Lee,,"

    Hopefully, these updates can help you pinpoint the cause of the problem.

     
  • Matthias Steffens

    Hi lamjas,

    thanks for trying the script. The results you reported seem to indicate that your server still does NOT support Unicode properties. Please note that there's a difference between general Unicode/UTF-8 support and support for Unicode properties. The latter refers to special escape sequences in the PCRE regex syntax that match generic character types when UTF-8 mode (i.e. the 'u' pattern modifier) is used. More info is available here:

    http://www.php.net/manual/en/regexp.reference.unicode.php

    To enable support for Unicode properties, the PCRE package must have been compiled with the "--enable-unicode-properties" option.
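
    The difference between the two kinds of support can be illustrated with a short sketch (the sample name "Émile" is just an illustration and not taken from your RIS files):

    ```php
    <?php
    $name = "\xC3\x89mile"; // "Émile" encoded as UTF-8 (6 bytes, 5 characters)

    // General UTF-8 support: with the "u" modifier, "." consumes whole
    // characters instead of single bytes.
    $bytes = preg_match_all('/./', $name, $m1);  // counts bytes
    $chars = preg_match_all('/./u', $name, $m2); // counts characters

    // Unicode properties: \p{Lu} matches any uppercase letter, while
    // [A-Z] only covers ASCII letters and therefore misses "É".
    $ascii = preg_match('/^[A-Z]/u', $name);
    $prop  = @preg_match('/^\p{Lu}/u', $name);   // needs PCRE with Unicode properties

    var_dump($bytes, $chars, $ascii, $prop);
    ```

    On a server with both kinds of support, the last value should be 1. If PCRE lacks Unicode properties, the last call is expected to fail or not to match, even though the "u" modifier itself works.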

    Your tech guy wrote:

    > The mySQL is definitely UTF-8 aware, please refer to:
    >
    > http://dev.mysql.com/doc/refman/5.0/en/charset.html
    >
    > Most of the UTF-8 problem lies in the PHP, please refer to these links:
    > http://textpattern.net/wiki/index.php?title=Unicode_Support
    > http://www.phpwact.org/php/i18n/charsets?s=utf8

    But this is NOT the problem. The links given by your tech guy are just generic docs about Unicode support but they do not refer to Unicode properties (again, there's a difference). refbase *does* support Unicode/UTF-8 and your server generally seems to support it as well. However, installing refbase with Unicode/UTF-8 requires that your server's PCRE package supports Unicode properties as well, and this still doesn't seem to be the case on your server.

    Here's a refbase demo server that has both, general support for Unicode/UTF-8 *plus* support for PCRE Unicode properties, and it works as advertised with the current code base:

    http://refbase.textdriven.com/beta/

    E.g. try importing the contents of your 'RIS.txt' file there, the author first names should get reduced to initials correctly.

    Matthias

     
  • Richard Karnesky

    I realize I'm bumping a very old thread, but I just checked in code that will remove BOMs from uploaded files, if present.  RefWorks & some other providers include BOM by default.

     
