
UTF-8 enabled?

Hibou57
2008-01-26
2013-04-26
  • Hibou57

    Hibou57 - 2008-01-26

    Hello, friends,

    I've just discovered Code Browser and felt like I was dreaming...

    But I was quite disappointed when, opening one of my UTF-8 files, I saw it displayed as ISO 8859. I must use UTF-8 for a foreign language, so I'm afraid I'll be forced to give up my hope of using Code Browser, for this reason alone.

    Or maybe I'm wrong? Perhaps there is a "hidden" option I did not see somewhere which would enable opening UTF-8 encoded files?

    ... please, tell me this is possible, dears

    • Marc Kerbiquet

      Marc Kerbiquet - 2008-01-27

      Code Browser supports only 8-bit encodings (Latin-1, Cyrillic, ...); UTF-8 is not supported.

    • Hibou57

      Hibou57 - 2008-01-27

      Thanks for the answer, Marc.

      Indeed, looking at the code I understood that there is no type for Unicode characters (32-bit characters). But since UTF-8 is an encoding based on an 8-bit character stream, it is sometimes possible to make applications that were not primarily designed for it more or less UTF-8 enabled.

      I'm currently spending some time modifying my copy of the source code, and I will report back if I manage to do it.

      But I could only do it for Windows, because I do not know X11 programming.

      Furthermore, as there is no support for multiple encodings, the modified version would only write files as UTF-8, but it would still be able to read ANSI, ISO, UTF-16 and UTF-8 (everything always being saved back as UTF-8).

      Have a nice day :)

      Yannick

      • Hibou57

        Hibou57 - 2008-01-28

        Well, I've finally found that output to the screen and to files, as well as input from files, can be managed easily with some code modifications, but other input causes a lot of trouble...

        I finally gave up.

    • Marc Kerbiquet

      Marc Kerbiquet - 2008-01-28

      Hi Yannick,

      The Linux version uses the GTK+ library, which uses UTF-8 encoding exclusively, so it is not an issue there.

      Congratulations if you manage to make it work with UTF-8, even partially.

      I've always thought of implementing Unicode internally, but your idea of using UTF-8 internally is appealing to me: it would make it possible to handle both 8-bit encodings and UTF-8. The great advantage I can see is that there is no need to convert files at load and save time:
      - no loss of information in conversions, even when using a wrong encoding
      - easy to change the encoding once the document is loaded (useful if the BOM marker is not present)
      - faster load and save times with huge files
      - less memory used compared to Unicode

      But I don't like multi-byte encodings very much, because they force a distinction between column and offset. Currently these are the same thing: the nth char in a line is accessed directly. So there are a lot of changes to make to support it:
      - in cursor move functions
      - in search functions
      - in the edit function
      - ...

      But maybe the most important change is to make the GUI abstraction layer (the interface between GTK+/Win32 and the editor) Unicode compliant.

      As this tool was primarily intended for programming and nobody had requested Unicode until now, I had never thought about implementing this feature.

      • Hibou57

        Hibou57 - 2008-01-29

        Hello :P

        UTF-8 is good for transport, because it is transparent to applications which store and retrieve data. That's why UTF-8 is so widely used. And also because a file encoded with UTF-8 preserves the encoding of ASCII characters, so in a basic raw text editor you may not be able to read all characters, but at least English text appears normally.

        These are the good reasons for UTF-8.

        But for internal processing in a real text-oriented application, UTF-32 or UTF-16 is really better.

        Indeed, the application's internal algorithms become much too complex when attempting to use UTF-8 as the internal representation... except if UTF-8 is an encapsulated internal representation for strings which the application does not deal with directly.

        Having "big characters" is so much better.

        I'm currently trying to port the application, replacing the "char" type everywhere with a "uchar" type, which is not really a 32-bit Unicode type, but rather a 16-bit type. I chose the 16-bit type because Windows uses it. It is formally UTF-16, but it can also be seen as Unicode, because most of the Unicode code points in common use lie below the 16-bit limit.

        But this is a long job, because in the original code no distinction was made between memory size (size in bytes) and text length (size in characters), obviously because with ANSI characters the two are the same.

        For the time being, the application still hangs at startup, and I've spent a lot of time tracking down the causes of the crashes.

        The side effect is that the code I've modified will no longer be portable.

        There is also a limitation: some parts of the Code Browser code use ANSI functions which have no Unicode equivalent, like fopen and the like. Even if I manage to port it to Unicode under Windows, it will still not be able to handle Unicode file names (I do not really care about that; file content was the primary matter).

        I took some time to modify the CB code in order to save the time of developing another application, but I think that in the future (I do not know when) I will attempt to create one myself (not in Zinc, though; rather using Ada).

        By the way, I've discovered a fun language, as fast to code with as C, while being more structured and more type-safe than C... fun to play with (even if it is not really my favorite kind of language, because I prefer very, very strict languages).

        I was about to look at the Zinc reference just now, to see if there is a way to pass a result out through a reference parameter (I need it to decode the file content after the raw data has been loaded).

        Well... still a lot of work to do...

        About my need for UTF-8... I know that most development work does not require this kind of language flexibility; it is just that I work on a web application and a binary CGI which contain some text in a foreign language that needs UTF-8. That's the reason why :P

        See you soon, or rather read you soon here :)

        Until next time

    • Marc Kerbiquet

      Marc Kerbiquet - 2008-01-31

      Hi Yannick,

      libc functions like fopen() should be easy to replace with native Windows functions; it is just a question of making an abstraction layer for Unix portability.

      If you try to use UTF-16 internally, I think you'll face a serious limitation with Zinc: it does not handle UTF-16 literal strings (L"..." in C). Wide char constants can be managed using a cast or a macro ($x:uchar).

      Zinc does not allow passing values by reference to a function (there is no equivalent of the C '&' operator). This was done intentionally to make optimizations easy: a local variable has no effective address, so it can safely be kept in a register. Sometimes I do something like this to work around the limitation:
        def result: [1] int
        f (result)
        def i = result[]
        ...
        func f(x:[]int)
          ...
          x[0] = ...
        end

      The value is stored in a single-cell array and the array is passed instead.

    • Hibou57

      Hibou57 - 2008-02-01

      Good evening Marc :p

      > If you try to use UTF-16 internally, I think you'll face a serious limitation with Zinc: it does not handle UTF-16 literal strings (L"..." in C). Wide char constants can be managed using a cast or a macro ($x:uchar). <

      I met this trouble right at the beginning, but quickly found a workaround: I use a kind of string resource, which is a Zinc file with entries like "message01 := const []uchar = ($A:uchar, $....., $0:uchar)".

      Obviously, this string resource file is built automatically by a little C program.

      By now I'm nearly finished, but I'm running into serious trouble with the Zinc compiler: it sometimes produces C code with duplicated function or structure definitions. It does not recognize some types defined in a scope (I've checked that the source file defining the struct is imported)...

      Well... the Zinc compiler seems to lose track of some type definitions, and that is the start of further chains of trouble.

      I really do not know how to work around this; I've tried many things.

      Where can I send you the CB code so you can see it for yourself? Perhaps you'll understand better what I'm talking about.

      I feel a bit disappointed about that... everything would have been finished tonight if not for these troubles.

    • Marc Kerbiquet

      Marc Kerbiquet - 2008-02-01

      You can send me the code at my email address:
      mkerbiquet -at- free -dot- fr
      I'll have a look.

      Which compiler version do you use? The one in the CB distribution is 1.1, and the separate package is still 1.0. Version 1.1 crashes on most errors, since there is a bug in the error-printing function.

