#64 UTF8 mode

open
3
2014-09-30
2014-09-16
No

UTF-8 support in Whitebeam needs improving. Currently the default interface to SpiderMonkey simply treats all 'C' strings as ASCII. On input the interface pads every character to 16 bits by clearing the upper 8 bits, and on output it truncates every character back to 8 bits.
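
To make the current behaviour concrete, here is a minimal sketch of byte-wise widening and truncation. The helper names `widenBytes` and `truncateUnits` are hypothetical and are not taken from the Whitebeam source; `std::u16string` stands in for the engine's 16-bit strings.

```cpp
#include <string>

// Hypothetical helper: widen each byte of a 'C' string to one 16-bit unit,
// leaving the upper 8 bits clear (the behaviour described above).
std::u16string widenBytes(const std::string& in) {
    std::u16string out;
    out.reserve(in.size());
    for (unsigned char c : in) {
        out.push_back(static_cast<char16_t>(c));
    }
    return out;
}

// Hypothetical helper: truncate each 16-bit unit back to its low 8 bits.
std::string truncateUnits(const std::u16string& in) {
    std::string out;
    out.reserve(in.size());
    for (char16_t u : in) {
        out.push_back(static_cast<char>(u & 0xFF));
    }
    return out;
}
```

With this scheme the two UTF-8 bytes of "é" (0xC3 0xA9) reach the script as two separate characters, U+00C3 and U+00A9, and a character such as U+20AC created inside a script loses its upper byte on output.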

The JS engine can be compiled to make all strings UTF-8.

Simply turning this on doesn't work because errors are thrown for all invalid UTF-8 encodings stored anywhere in the system.

It is also a compile-time, all-or-nothing switch: there is no way to leave existing sites working as they do while using UTF-8 on new sites.

Proposal: Add a Whitebeam UTF8 mode that can be configured in httpd.conf. Default to 'off'.

A new function in XmlScriptIf.cxx is responsible for converting a string from 'C' to UTF-16 if the mode is on, or using the old behaviour if off.
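
A rough sketch of what such a mode-aware conversion could look like follows. The name `cStringToUtf16`, its signature, and the choice to replace invalid UTF-8 with U+FFFD rather than raise an error are all assumptions for illustration; the actual function in XmlScriptIf.cxx is not shown in this ticket.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical function: convert a 'C' string to UTF-16 code units.
// utf8Mode == false reproduces the old byte-per-character behaviour.
std::u16string cStringToUtf16(const std::string& in, bool utf8Mode) {
    std::u16string out;
    if (!utf8Mode) {
        for (unsigned char c : in) out.push_back(static_cast<char16_t>(c));
        return out;
    }
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        std::size_t extra = 0;     // continuation bytes expected
        std::uint32_t cp = lead;   // code point being assembled
        std::uint32_t minCp = 0;   // smallest value allowed for this length
        bool ok = true;
        if (lead < 0x80)                { /* plain ASCII */ }
        else if ((lead & 0xE0) == 0xC0) { extra = 1; cp = lead & 0x1F; minCp = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; minCp = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; minCp = 0x10000; }
        else                            { ok = false; }  // stray continuation or bad lead

        std::size_t used = 1;
        while (ok && used <= extra) {
            if (i + used >= in.size()) { ok = false; break; }
            const unsigned char c = static_cast<unsigned char>(in[i + used]);
            if ((c & 0xC0) != 0x80) { ok = false; break; }
            cp = (cp << 6) | (c & 0x3F);
            ++used;
        }
        // Reject overlong forms, surrogates and values beyond U+10FFFF.
        if (ok && (cp < minCp || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))) ok = false;
        if (!ok) {
            out.push_back(static_cast<char16_t>(0xFFFD));  // assumption: replace, don't throw
            i += 1;
            continue;
        }
        if (cp >= 0x10000) {
            cp -= 0x10000;  // encode as a surrogate pair
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        } else {
            out.push_back(static_cast<char16_t>(cp));
        }
        i += used;
    }
    return out;
}
```

Replacing invalid sequences instead of throwing is only one possible policy; it is chosen here because legacy data with invalid encodings is exactly what breaks the engine's compile-time UTF-8 switch.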

Ultimately all calls to JS_GetStringBytes must be replaced with the new function, which is less efficient because of the string copy with decode.
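
For the output direction, a replacement for JS_GetStringBytes has to copy the engine's 16-bit characters into a new byte string. The sketch below assumes a hypothetical helper `utf16ToCString` and uses `std::u16string` as a stand-in for the engine string; it makes no SpiderMonkey calls, and replacing a lone surrogate with U+FFFD is an assumption.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: copy 16-bit characters out to a byte string.
// utf8Mode == false truncates to 8 bits (old behaviour); true encodes UTF-8.
std::string utf16ToCString(const std::u16string& in, bool utf8Mode) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        std::uint32_t cp = in[i];
        if (!utf8Mode) {
            out.push_back(static_cast<char>(cp & 0xFF));
            continue;
        }
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size() &&
            in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            // Combine a surrogate pair into one code point.
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        } else if (cp >= 0xD800 && cp <= 0xDFFF) {
            cp = 0xFFFD;  // assumption: a lone surrogate is replaced
        }
        if (cp < 0x80) {
            out.push_back(static_cast<char>(cp));
        } else if (cp < 0x800) {
            out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else if (cp < 0x10000) {
            out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else {
            out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
    }
    return out;
}
```

Returning a fresh `std::string` is where the extra cost comes from: every call copies and re-encodes the string.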

From SpiderMonkey 1.8.5 onwards, though, JS_GetStringBytes disappears, so this would also be a stepping stone to more recent versions of the engine. It also reduces the amount of volatile SpiderMonkey API used outside of XmlScriptIf.cxx.

Thoughts?

Related

Feature Requests: #64

Discussion

  • Steve Davies

    Steve Davies - 2014-09-18

    I agree that the above is a reasonable way forward as an immediate solution to the problem at hand.

    On the other hand I would personally be reluctant to now make any big push towards supporting newer versions of SpiderMonkey given that (as I understand it) SM is no longer supported as a separate project from the Mozilla browser engine, so 1.8.5 is likely to be the last version that can be integrated.

    ECMA 6 (proper support for classes in JS) is on the horizon, and persisting with SM would likely result in Whitebeam being left behind.

     
  • Peter Wilson

    Peter Wilson - 2014-09-18

    Thanks Steve. I do have a version of Whitebeam that does the above. I also replaced JS_GetStringBytes(Z) with an interim function so I have all input and all output strings being suitably encoded. A reasonably large number of files changed, so I don't really want to check this into the main branch. I could do with creating a Git branch as you did for the 64-bit work. Is there a quick one-line summary of how to do that? :-)

    I do get somewhat depressed each time I think about SpiderMonkey. The source code is still open so in theory we could stick with it. It was never 'officially' released outside of Firefox. The API will have to settle down soon or they're not going to be able to reliably implement anything themselves.

    Alternatives? I've looked at V8 a few times, although not recently. It is all C++ templates and, at the time, had very little documentation. It doesn't have a 'toSource()' implementation either, so that would require changes both in Whitebeam and to applications.

    JavaScriptCore looks more promising in that it seems to have a reasonably sane 'C' API that's not a million miles from SpiderMonkey. Not much documentation around though. I've found:

    https://www.webkit.org/projects/javascript/

    and

    http://uselessbyte.blogspot.co.uk/2009/12/adding-custom-javascript-bindings-to.html

    Not found any information on whether it's even possible to compile outside of WebKit though...

     
    • Steve Davies

      Steve Davies - 2014-09-18

      To create a branch "utf" locally to yourself only, based on master:

      git checkout master
      git checkout -b utf

      To sync that branch up on the SF servers, I believe that this works:

      git push -u origin utf

      then when you push, the simplest way is

      git push --all
      git push --tags

      which syncs all branches and all tags respectively.

      If that's not right, let me know and I'll create a branch from here for
      you to checkout, patch and push back.

      V8 is the "most respected" JS engine at present, but I appreciate that
      there would be a need to provide an enhanced uneval()/toSource(), and
      that the API is utterly different. I did get the impression that
      different might be better documented and more consistent though!

      Cheers,
      Steve

      [feature-requests:#64]
      http://sourceforge.net/p/whitebeam/feature-requests/64 UTF8 mode

      Status: open
      Group: Next Release (example)
      Labels: spidermokney utf8
      Created: Tue Sep 16, 2014 06:40 PM UTC by Peter Wilson
      Last Updated: Thu Sep 18, 2014 03:50 PM UTC
      Owner: Peter Wilson

  • Steve Davies

    Steve Davies - 2014-09-18

    To create a branch "utf" locally to yourself only, based on master:

    git checkout master
    git checkout -b utf

    To sync that branch up on the SF servers, I believe that this works:

    git push -u origin utf

    then when you push, the simplest way is

    git push --all
    git push --tags

    which syncs all branches and all tags respectively.

    If that's not right, let me know and I'll create a branch from here for you to checkout, patch and push back.

    V8 is the "most respected" JS engine at present, but I appreciate that there would be a need to provide an enhanced uneval()/toSource(), and that the API is utterly different. I did get the impression that different might be better documented and more consistent though!

     
    • Peter Wilson

      Peter Wilson - 2014-09-29

      Actually I think going the V8 route would be a good opportunity to convert templates to use JSON for serialising data rather than toSource/uneval. This would then open up the possibility of using Postgres JSON data operations to search fields within those structures. I think the latest version of Postgres even stores JSON data types in a structured, compact binary format.

       
      • Steve Davies

        Steve Davies - 2014-09-29

        Sounds interesting - It would certainly require work to migrate a WB database and its site(s), but in the long run it would be far more buzz-word compliant :). JSON is also a much simpler format, and is supported cross-language so would make my Perl-to-template parser library simpler.

         
  • Peter Wilson

    Peter Wilson - 2014-09-29

    Thanks Steve, will give it a try when I get a moment

     
  • Peter Wilson

    Peter Wilson - 2014-09-29

    Additional info: I've been writing some tests for UTF8 compatibility. Interestingly, some of the tests I expected to fail with existing Whitebeam were passing. Specifically, writing UTF8 data to a template and then reading it back should end up with corrupted data.

    It turns out that toSource() - or uneval, which we actually use when serialising objects - replaces all characters with codes >127 with \uXXXX Unicode escape sequences. This seems like fairly bizarre behaviour given it doesn't happen anywhere else.
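
    The escaping described above can be illustrated with a small sketch. The helper `escapeNonAscii` is hypothetical and is not the engine's implementation: it only handles characters above 127, whereas a real uneval also has to quote the string and escape quotes, backslashes and control characters. The upper-case hex digits are an assumption.

    ```cpp
    #include <cstdio>
    #include <string>

    // Hypothetical helper: emit ASCII unchanged and every 16-bit unit
    // above 127 as a \uXXXX escape, so the result is pure ASCII.
    std::string escapeNonAscii(const std::u16string& in) {
        std::string out;
        for (char16_t u : in) {
            if (u < 0x80) {
                out.push_back(static_cast<char>(u));
            } else {
                char buf[7];  // backslash, 'u', four hex digits, terminator
                std::snprintf(buf, sizeof buf, "\\u%04X", static_cast<unsigned>(u));
                out += buf;
            }
        }
        return out;
    }
    ```

    Because the serialised text contains only ASCII, truncating it to 8 bits on output loses nothing.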

    So writing a UTF8 string to contact.customData works with no corruption; however,
    writing the same data to contact.description corrupts it.

     
  • Peter Wilson

    Peter Wilson - 2014-09-30

    New experimental branch pushed: utf8

    Default mode is as before - although there are now more string copies in places.

    Generally a whole VirtualHost would be put into UTF8 mode by adding:
    RButf8 true
    to the relevant VirtualHost - or to httpd.conf/whitebeam.conf if you want all hosts to default to UTF8.

    While probably not useful in production:
    rb.page.utf8() returns a boolean, current UTF8 processing state
    rb.page.utf8(bool) sets the current processing state. More useful for testing probably

    Attached are two files that demonstrate the changes. One is a simple UTF8 string that contains multiple-byte characters. The second is a test script. Put both files in the root of a VirtualHost and run.

     