#108 Indexing string with codepage 1250

Milestone: Win32/Linux
Status: closed-wont-fix
Owner: nobody
Labels: None
Priority: 5
Updated: 2023-03-14
Created: 2022-02-19
Private: No

Indexing a string field (MDX) does not work when codepage 1250 is used with special characters (Central and Eastern European languages). The records are not sorted in alphabetical order.

Discussion

  • Paul Baker

    Paul Baker - 2022-02-20
    • status: open --> closed
     
  • Paul Baker

    Paul Baker - 2022-02-20
    • status: closed --> open
     
  • Paul Baker

    Paul Baker - 2022-02-20

    Does this happen with latest snapshot, currently revision 803, from https://sourceforge.net/p/tdbf/code/HEAD/tree/? If so, please provide some code that will reproduce it, expected behavior, and actual behavior.

     
  • Damir Šimunović

    Yes, it happens with the revision I downloaded on 2022-02-17. The attached ZIP contains code and two example tables. The first, test.dbf, was created and filled with a very old version of TDbf in Delphi 2007 and works correctly: set the index and you should see the right order. The second, proba3.dbf, was created and filled with a recent TDbf in Delphi 10.2, where codepage 1250 appears as the default. In my language the alphabetical order is AaBbCcČčĆćDdĐđEeFfGgHhIiJjKkLlMmNnOoPpRrSsŠšTtUuVvXxYyZzŽž.
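To make the expected ordering concrete, here is a minimal Python sketch (not part of the attached Delphi project) that sorts a few made-up sample words by the alphabet string quoted above and compares the result with a plain code-point sort. The word list and the helper name `croatian_key` are assumptions for illustration, and the per-character ranking is a simplification of real Croatian collation.

```python
# Alphabet string exactly as quoted in the post above.
CROATIAN_ORDER = "AaBbCcČčĆćDdĐđEeFfGgHhIiJjKkLlMmNnOoPpRrSsŠšTtUuVvXxYyZzŽž"
RANK = {ch: i for i, ch in enumerate(CROATIAN_ORDER)}


def croatian_key(text):
    """Hypothetical helper: rank each character by its position in the alphabet string.

    Characters not listed in the string sort after all listed ones.
    """
    return [RANK.get(ch, len(RANK) + ord(ch)) for ch in text]


# Made-up sample data, chosen to exercise the special letters.
words = ["Zagreb", "Čakovec", "Cavtat", "Đakovo", "Šibenik", "Dubrovnik", "Ćilipi"]

by_code_point = sorted(words)                  # plain Unicode code-point order
by_alphabet = sorted(words, key=croatian_key)  # order expected by the reporter

print("code point:", by_code_point)
print("alphabet:  ", by_alphabet)
```

In the code-point sort every word starting with an accented letter lands after "Zagreb", which is the kind of misordering described in this ticket.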

     
  • Paul Baker

    Paul Baker - 2022-03-10

    If you could provide the source for a small program that creates the table and populates it with data, I could look at it. Please use the most recent version of Delphi you have (10.2?) and the most recent revision of code available (currently 803) and specify which you used. Then specify the order of the records by record number you expected, and the order you actually got.

    Also, I will need to know what your Windows language is, as this decides the default code page for tables you create (and I have a hunch this is the part that is failing). To be specific, what does GetUserDefaultLCID() return?

    Thanks!

     

    Last edit: Paul Baker 2022-03-10
  • Damir Šimunović

    Regards, Paul,

    attached is a ZIP file with a program that creates a table and fills it with a series of strings that use specific characters of my language. Such a table is also proba4.dbf with indexes in the proba4.mdx file. You can create a new table and load it by clicking on bFillData with the same data. My Windows is set to Croatian language (HRV code). GetUserDefaultLCID() returns 1050. When creating dbf, codepage is automatically set to 1250. A strange code DB0 is written under the language, I don't know what that means. The Language ID is 0. The file proba4.txt shows the order in which the indexed strings should be displayed. At the bottom of the file is the correct alphabetical order for the Croatian language.

    I'm using the TDbf version I downloaded on 2022-02-27. I guess its revision is 803; I didn't pay attention to it when I downloaded it, and I don't see the revision number in the files.

    The example is made in Delphi version 10.2, but I think it doesn't matter which version of Delphi it is. I also have version 11, but also version 2007. The new version of TDbf is installed only in Delphi 10.2 for now.

    With Delphi 2007 I have been using TDbf version 6.4.7 for many years. It’s a very stable version and I’ve never had a problem with it. Except, of course, the problem that it doesn't work with unicode versions of Delphi. The attached test.dbf table is created in that version. You will see that this table, if loaded into the program with new TDbf, is indexed correctly. However, as soon as a new record is entered with a new version of TDbf, the index no longer gives the correct result.

    My guess is that the problem is in the dbf_collate.pas unit. Version 6.4.7 did not have such a unit. It seems to me that in dbf_collate the language connects to CollationTable, but I don't know where and how. I see that the Czech language with code page 1250 is particularly prominent here. The Czech language has some special characters similar to Croatian and the order is similar, but not the same.

    If I can help in any way, I am available. Thank you very much for your help.

    Damir

    p.s. I'll send the attachment with exe included separately

     
    • Paul Baker

      Paul Baker - 2022-03-12

      Thank you. This is what I know so far.

      I believe it is not able to correctly determine the language driver in the constructor of TDbfGlobals.Create(). What does GetACP() return? As you said, GetUserDefaultLCID() returns 1050. That's 0x041A, which is hr-HR (Croatian).
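For readers following the locale arithmetic, this small Python sketch splits a Windows LCID into its primary language and sublanguage parts, using the standard LANGID layout (low 10 bits primary language, next 6 bits sublanguage). The function name `split_lcid` is an assumption for illustration; it is not a TDbf or Windows API.

```python
def split_lcid(lcid):
    """Hypothetical helper: return (primary language ID, sublanguage ID) for an LCID."""
    lang_id = lcid & 0xFFFF    # LANGID occupies the low 16 bits of the LCID
    primary = lang_id & 0x3FF  # primary language: low 10 bits
    sub = lang_id >> 10        # sublanguage: high 6 bits
    return primary, sub


# 1050 is the value reported in this thread; 1029 is cs-CZ for comparison.
for lcid in (1050, 1029):
    primary, sub = split_lcid(lcid)
    print(f"{lcid} = 0x{lcid:04X}: primary 0x{primary:02X}, sublanguage 0x{sub:02X}")
```

The two locales differ in their primary language ID (0x1A versus 0x05), so a lookup keyed on the primary language will not treat them as the same entry.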

      I do not yet fully understand FindLangId() and some of the other code, but it is not finding a language driver. Code page 1250 has the characters you need, but it is associated with LANG_CZECH. I can't find a good reference for the dBASE language drivers, but this post similarly lists Czech and not Croatian.

      It seems likely that there is no Croatian language driver and that you need the Czech language driver. Does it work if (just for an experiment) you set your Windows language to Czech and create a new file?

      The error handling is not great, but I was afraid to change it. I don't think a language driver ID of 0x00 is valid and the language driver name of US0 is incorrectly constructed (it should be something like DBWINUS0). It might be there for BDE compatibility, which I have found tends to put invalid values in files.

      Code page is not in the DBF header. It is determined by the language driver. 1252 is the Western European code page that English systems use, and does not have some of the accented characters you need.

      The problem is that it has no way to properly encode some of the accented characters you need in the file. So the field values are wrong and the index keys and order are wrong. The fact that the behavior depends on the version isn't really relevant, because it just doesn't have the information it needs. Even the old version probably only works on a Croatian system and only by luck, not design.
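The code page point can be checked with Python's built-in codecs. This sketch tests which of the Croatian special letters can be encoded in cp1250 and in cp1252; the helper name `encodable` is an assumption for illustration.

```python
CROATIAN_LETTERS = "ČčĆćĐđŠšŽž"


def encodable(letters, codec):
    """Hypothetical helper: return the letters that the given codec can encode."""
    ok = []
    for ch in letters:
        try:
            ch.encode(codec)
            ok.append(ch)
        except UnicodeEncodeError:
            pass  # this letter has no byte in the code page
    return "".join(ok)


in_1250 = encodable(CROATIAN_LETTERS, "cp1250")
in_1252 = encodable(CROATIAN_LETTERS, "cp1252")

print("cp1250:", in_1250)
print("cp1252:", in_1252)
```

Code page 1250 covers all ten letters, while 1252 only has Š, š, Ž and ž, so Č, Ć and Đ cannot be stored under a Western European language driver.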

      Hopefully, we can get this fixed, so that you can confidently start using the latest revision. Unfortunately, I can't say when there will be an actual release, but you can grab the source code. By using an older version, you are missing dozens of great fixes. I started using TDbf myself with version 6.9.1, which was newer than what you use, and I considered it very unstable, especially with indexes.

       
    • Paul Baker

      Paul Baker - 2022-03-12

      PS. I see incorrect values in the text file, proba4.txt, as well, because it is ANSI rather than Unicode. Again, it only works on a Croatian system. It actually illustrates the problem facing TDbf! It needs a language driver in order to know how to interpret the characters.

      I understand the idea, though, from your alphabet here.
      AaBbCcČčĆćDdĐđEeFfGgHhIiJjKkLlMmNnOoPpRrSsŠšTtUuVvXxYyZzŽž
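The effect described in this PS can be reproduced in a few lines of Python: text saved as code page 1250 bytes and then read on a system that assumes code page 1252 shows different letters. This is only an illustration of the mechanism, not an analysis of the actual proba4.txt.

```python
text = "ČĆĐ"

raw = text.encode("cp1250")      # bytes as an ANSI file on a Croatian system would hold them
misread = raw.decode("cp1252")   # the same bytes interpreted on a Western European system

print(raw.hex(" "))  # c8 c6 d0
print(misread)       # ÈÆÐ
```

The bytes are unchanged; only the assumed code page differs, which is why a DBF file needs a language driver to say how its bytes should be interpreted.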

       

      Last edit: Paul Baker 2022-03-12
  • Damir Šimunović

    without exe

     
  • Damir Šimunović

    with exe

     
  • Damir Šimunović

    GetACP() returns ANSI codepage 1250, as it should.

    When I create and open dbf, both with the old and with the new TDbf, TDbf.CodePage returns 1250.

    dbf_dbffile.pas contains the property DbfGlobals.DefaultCreateLangId. This reads FDefaultCreateLangId, which I expected to be set in TDbfGlobals.Create but which is actually set in the SetDefaultCreateCodePage procedure. The value comes from the ConstructLangId function in the dbf_lang unit. However, at the moment I do not see what calls this function, or when. I'll try to follow what's going on here and I'll let you know.

    True, I also tried to use version 6.9.x when it was released, but it didn't work properly. However, the older version 6.4.7 worked correctly and still works today. The applications I made with Delphi 2007 and TDbf 6.4.7 also work properly under Windows versions 7, 8 and 10, although they are compiled under Windows XP. Of course, version 6.4.7 cannot be used with unicode versions of Delphi, which means since version 2009.

     
  • Damir Šimunović

    This happens when a Form is created even though the TDbf component was not created at all.

    Function ConstructLangId calls function FindLangId twice: first with CodePage, then without.

    In first run:

    function FindLangId

    • CodePage = 1250
    • Info2 = 1050
    • Info2Table = $66D3DC
    • IsFoxPro = false

    after loop for I := 0 to $FF

    • Region = 2
    • DbfRes = 0
    • FoxRes = 0

    finally: Result = 0

    In the second run the result is the same. Evidently the function can't find a LangId for Locale = 1050.
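The failure mode in the trace above can be modelled with a toy lookup. This Python sketch is not TDbf's FindLangId and does not reproduce its table or its loop; the dictionary `DRIVERS` and the function `find_lang_id` are hypothetical, and the only entry is the code page 1250 / locale 1029 / driver 0xC8 pairing mentioned later in this thread.

```python
# Hypothetical, heavily reduced table: (code page, locale ID) -> language driver ID.
# Not TDbf's real data; only the cs-CZ pairing is taken from the thread.
DRIVERS = {
    (1250, 1029): 0xC8,
}


def find_lang_id(code_page, locale_id):
    """Toy model: return the driver ID for an exact match, or 0 when nothing matches."""
    return DRIVERS.get((code_page, locale_id), 0)


print(hex(find_lang_id(1250, 1029)))  # Czech system: a driver is found
print(hex(find_lang_id(1250, 1050)))  # Croatian system: falls through to 0
```

Under this model a Croatian system ends up with driver ID 0 even though its code page is one the table knows about, which matches the Result = 0 seen in the debugger.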

     
    • Paul Baker

      Paul Baker - 2022-04-15

      I wanted to let you know that I got all the information I need, I am not ignoring you, I just need to find the time to handle it!

       

      Last edit: Paul Baker 2022-04-15
  • Damir Šimunović

    All right, Paul, thanks, I appreciate that.

     
  • Damir Šimunović

    Regards, Paul. Is there any news concerning the problem with string sorting with locale 1050?

     
  • Paul Baker

    Paul Baker - 2022-06-21

    To summarise, I believe you need language driver 0xC8 (aka DB1250CZ0), which is associated with code page 1250 and locale ID 1029 (aka cs-CZ). However, your system has code page 1250 and locale ID 1050 (aka hr-HR), which it fails to match to a language driver.

    It still needs to be fixed but, as a workaround, does setting DbfGlobals.DefaultCreateLangId := FoxLangId_EEurope_1250; before creating the table give you the expected results?

    BTW, the post "How to interpret the language driver name in a dBase (.dbf) file?" mostly agrees with the data in dbf_lang.pas.

    '01' => 'cp437'       # U.S. MS-DOS
    '02' => 'cp850'       # International MS-DOS
    '03' => 'cp1252'      # Windows ANSI
    '08' => 'cp865'       # Danish OEM
    '09' => 'cp437'       # Dutch OEM
    '0a' => 'cp850'       # Dutch OEM*
    '0b' => 'cp437'       # Finnish OEM
    '0d' => 'cp437'       # French OEM
    '0e' => 'cp850'       # French OEM*
    '0f' => 'cp437'       # German OEM
    '10' => 'cp850'       # German OEM*
    '11' => 'cp437'       # Italian OEM
    '12' => 'cp850'       # Italian OEM*
    '13' => 'cp932'       # Japanese Shift-JIS
    '14' => 'cp850'       # Spanish OEM*
    '15' => 'cp437'       # Swedish OEM
    '16' => 'cp850'       # Swedish OEM*
    '17' => 'cp865'       # Norwegian OEM
    '18' => 'cp437'       # Spanish OEM
    '19' => 'cp437'       # English OEM (Britain)
    '1a' => 'cp850'       # English OEM (Britain)*
    '1b' => 'cp437'       # English OEM (U.S.)
    '1c' => 'cp863'       # French OEM (Canada)
    '1d' => 'cp850'       # French OEM*
    '1f' => 'cp852'       # Czech OEM
    '22' => 'cp852'       # Hungarian OEM
    '23' => 'cp852'       # Polish OEM
    '24' => 'cp860'       # Portuguese OEM
    '25' => 'cp850'       # Portuguese OEM*
    '26' => 'cp866'       # Russian OEM
    '37' => 'cp850'       # English OEM (U.S.)*
    '40' => 'cp852'       # Romanian OEM
    '4d' => 'cp936'       # Chinese GBK (PRC)
    '4e' => 'cp949'       # Korean (ANSI/OEM)
    '4f' => 'cp950'       # Chinese Big5 (Taiwan)
    '50' => 'cp874'       # Thai (ANSI/OEM)
    '57' => 'cp1252'      # ANSI
    '58' => 'cp1252'      # Western European ANSI
    '59' => 'cp1252'      # Spanish ANSI
    '64' => 'cp852'       # Eastern European MS-DOS
    '65' => 'cp866'       # Russian MS-DOS
    '66' => 'cp865'       # Nordic MS-DOS
    '67' => 'cp861'       # Icelandic MS-DOS
    '6a' => 'cp737'       # Greek MS-DOS (437G)
    '6b' => 'cp857'       # Turkish MS-DOS
    '6c' => 'cp863'       # French-Canadian MS-DOS
    '78' => 'cp950'       # Taiwan Big 5
    '79' => 'cp949'       # Hangul (Wansung)
    '7a' => 'cp936'       # PRC GBK
    '7b' => 'cp932'       # Japanese Shift-JIS
    '7c' => 'cp874'       # Thai Windows/MS-DOS
    '86' => 'cp737'       # Greek OEM
    '87' => 'cp852'       # Slovenian OEM
    '88' => 'cp857'       # Turkish OEM
    'c8' => 'cp1250'      # Eastern European Windows
    'c9' => 'cp1251'      # Russian Windows
    'ca' => 'cp1254'      # Turkish Windows
    'cb' => 'cp1253'      # Greek Windows
    'cc' => 'cp1257'      # Baltic Windows
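As a sketch of how such a table is used, the following Python reads the language driver ID from a DBF header and maps it to a codec name. It assumes the commonly documented layout in which the ID is the single byte at offset 29 of the header; the dictionary holds only a few entries copied from the list above, and the function name and the synthetic header are illustrative.

```python
# A few entries copied from the list above: language driver ID -> Python codec name.
LANG_DRIVER_CODEC = {
    0x03: "cp1252",  # Windows ANSI
    0x64: "cp852",   # Eastern European MS-DOS
    0x87: "cp852",   # Slovenian OEM
    0xC8: "cp1250",  # Eastern European Windows
    0xC9: "cp1251",  # Russian Windows
}

LANG_DRIVER_OFFSET = 29  # assumed header offset of the language driver ID byte


def codec_from_header(header):
    """Hypothetical helper: return (driver ID, codec name or None) for a DBF header."""
    lang_id = header[LANG_DRIVER_OFFSET]
    return lang_id, LANG_DRIVER_CODEC.get(lang_id)


# Synthetic 32-byte header with only the language driver byte filled in.
header = bytearray(32)
header[LANG_DRIVER_OFFSET] = 0xC8
print(codec_from_header(bytes(header)))

# A header with ID 0x00, as reported for the Croatian system, maps to no codec.
print(codec_from_header(bytes(32)))
```

With an ID of 0x00 the reader has no code page to fall back on, which is the situation described earlier in the thread.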
    
     

    Last edit: Paul Baker 2022-06-22
  • Damir Šimunović

    Unfortunately, I already tried setting FoxLangId_EEurope_1250 earlier and that doesn't help. It's not a code page problem, it's a sorting problem. TDbf correctly recognizes code page 1250 and correctly writes and reads specific Croatian characters (which, by the way, are not the same as in the Czech language), but does not know how to sort them.
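The distinction between storing and sorting can be shown in Python with made-up sample words: the strings survive a cp1250 round trip unchanged, yet sorting the encoded bytes does not give the Croatian order. This illustrates the general point only and is not a description of how TDbf builds its index keys.

```python
# Made-up sample data.
words = ["Zagreb", "Čakovec", "Cavtat", "Šibenik"]

encoded = [w.encode("cp1250") for w in words]

# Encoding and decoding with cp1250 is lossless for these strings.
round_trip_ok = all(b.decode("cp1250") == w for b, w in zip(encoded, words))

# Sorting the raw cp1250 bytes: Z (0x5A) < Š (0x8A) < Č (0xC8).
byte_sorted = [b.decode("cp1250") for b in sorted(encoded)]

print(round_trip_ok)
print(byte_sorted)
```

The expected Croatian order would be Cavtat, Čakovec, Šibenik, Zagreb, so a correct code page alone is not enough; a collation for the language is also needed.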

     
    • Paul Baker

      Paul Baker - 2022-10-14

      It sounds like what you ideally need is a language driver for locale ID 1050 (hr-HR). There isn't much dBASE documentation, and what there is is scattered around the Internet, but I think this is the only European language that does not have a language driver. I don't think dBASE even supports your language 😞

      Are you able to use Borland Database Engine for this data? If so, we might be able to emulate that, if we can get a copy of a working database. Is this the same problem as yours? It seems to be unresolved.
      https://stackoverflow.com/questions/6202352/bde-and-slovak-language-is-there-any-driver-for-it

      Another option would be not to translate at all and to sort as binary, which I think is what it used to do. However, not specifying a valid language driver in the header of the DBF file would be incorrect and could cause compatibility problems with other software.

      I am not sure what to do, and am open to suggestions, but it looks like I will need to do more research when I have a lot of time (I don't know when that will be).

       

      Last edit: Paul Baker 2022-10-14
    • Paul Baker

      Paul Baker - 2023-03-14

      I am closing this, because dBASE does not support your language and TDbf does what it can to correctly follow the dBASE protocol.

       
  • Paul Baker

    Paul Baker - 2023-03-14
    • status: open --> closed-wont-fix
     
