UTF-8 support in Whitebeam needs improving. Currently the default interface to SpiderMonkey simply treats all 'C' strings as ASCII. On input the interface pads each character to 16 bits by clearing the upper 8 bits, and on output it truncates each character back to 8 bits.
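To make the effect of that pad-and-truncate behaviour concrete, here is a minimal sketch. The function names `inflateBytes` and `deflateUnits` are hypothetical stand-ins, not the actual Whitebeam or SpiderMonkey code; they only model the behaviour described above.

```cpp
#include <string>

// Hypothetical model of the input path: each byte becomes one
// 16-bit unit whose upper 8 bits are clear.
std::u16string inflateBytes(const std::string& bytes) {
    std::u16string out;
    out.reserve(bytes.size());
    for (unsigned char b : bytes) {
        out.push_back(static_cast<char16_t>(b));
    }
    return out;
}

// Hypothetical model of the output path: each 16-bit unit is
// truncated to its low 8 bits.
std::string deflateUnits(const std::u16string& units) {
    std::string out;
    out.reserve(units.size());
    for (char16_t u : units) {
        out.push_back(static_cast<char>(u & 0xFF));
    }
    return out;
}
```

Under this model the two UTF-8 bytes of "é" (0xC3 0xA9) reach the script as two separate characters, and a script-side character such as U+20AC loses its upper byte on the way out.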
The JS engine can be compiled to make all strings UTF-8.
Simply turning this on doesn't work because errors are thrown for all invalid UTF-8 encodings stored anywhere in the system.
It is also a compile-time, all-or-nothing switch: there is no way to leave existing sites working as they do while using UTF-8 on new sites.
Proposal: Add a Whitebeam UTF8 mode that can be configured in httpd.conf. Default to 'off'.
A new function in XmlScriptIf.cxx is responsible for converting a string from 'C' to UTF-16 if the mode is on, or using the old behaviour if off.
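As a rough sketch of what such a mode-aware conversion could look like (the name `cStringToUtf16`, its signature, and the choice to substitute U+FFFD for invalid input are all assumptions for illustration, not the code in XmlScriptIf.cxx):

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Append one code point as UTF-16, using a surrogate pair above U+FFFF.
static void appendUtf16(std::u16string& out, std::uint32_t cp) {
    if (cp >= 0x10000) {
        cp -= 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
        out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
    } else {
        out.push_back(static_cast<char16_t>(cp));
    }
}

// Hypothetical helper: name and signature are illustrative only.
std::u16string cStringToUtf16(const std::string& in, bool utf8Mode) {
    std::u16string out;
    if (!utf8Mode) {
        // Old behaviour: one byte becomes one 16-bit unit.
        for (unsigned char b : in) out.push_back(static_cast<char16_t>(b));
        return out;
    }
    const std::size_t n = in.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        std::size_t len;      // total bytes in this sequence
        std::uint32_t cp;     // code point being assembled
        std::uint32_t minCp;  // smallest value allowed for this length
        if (b0 < 0x80)                { len = 1; cp = b0;        minCp = 0; }
        else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; minCp = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; minCp = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; minCp = 0x10000; }
        else {
            // Stray continuation byte or invalid lead byte.
            out.push_back(char16_t(0xFFFD));
            ++i;
            continue;
        }
        bool ok = (i + len <= n);
        for (std::size_t k = 1; ok && k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(in[i + k]);
            if ((b & 0xC0) != 0x80) ok = false;
            else cp = (cp << 6) | (b & 0x3F);
        }
        // Reject overlong forms, surrogates and values beyond U+10FFFF.
        if (ok && (cp < minCp || cp > 0x10FFFF ||
                   (cp >= 0xD800 && cp <= 0xDFFF))) {
            ok = false;
        }
        if (ok) { appendUtf16(out, cp); i += len; }
        else    { out.push_back(char16_t(0xFFFD)); ++i; }
    }
    return out;
}
```

Substituting U+FFFD rather than raising an error is one possible way to cope with invalid sequences already stored in the system; whether that is the right policy for Whitebeam is an open design choice.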
Ultimately all calls to JS_GetStringBytes must be replaced with the less efficient function; it is less efficient because of the string copy with decode.
In SpiderMonkey 1.8.5 onwards though JS_GetStringBytes disappears and so this would also be a stepping stone to more recent versions of the engine. It also reduces the amount of volatile SpiderMonkey API outside of XmlScriptIf.cxx.
Thoughts?
I agree that the above is a reasonable way forward as an immediate solution to the problem at hand.
On the other hand I would personally be reluctant to now make any big push towards supporting newer versions of SpiderMonkey given that (as I understand it) SM is no longer supported as a separate project from the Mozilla browser engine, so 1.8.5 is likely to be the last version that can be integrated.
ECMA 6 (proper support for classes in JS) is on the horizon, and persisting with SM would likely result in Whitebeam being left behind.
Thanks Steve. I do have a version of Whitebeam that does the above. I also replaced JS_GetStringBytes(Z) with an interim function so I have all input and all output strings being suitably encoded. A reasonably large number of files changed so I don't really want to check this into the main branch. I could do with creating a Git branch as you did with the 64bit work you did. Is there a quick one-line summary of how to do that? :-)
I do get somewhat depressed each time I think about SpiderMonkey. The source code is still open so in theory we could stick with it. It was never 'officially' released outside of Firefox. The API will have to settle down soon or they're not going to be able to reliably implement anything themselves.
Alternatives? I've looked at V8 a few times, although not recently. It is all C++ templates and, at the time, had very little documentation. It doesn't have a 'toSource()' implementation either, so that would require changes both in Whitebeam and to applications.
JavaScriptCore looks more promising in that it seems to have a reasonably sane 'C' API that's not a million miles from SpiderMonkey. Not much documentation around though. I've found:
https://www.webkit.org/projects/javascript/
and
http://uselessbyte.blogspot.co.uk/2009/12/adding-custom-javascript-bindings-to.html
Not found any information on whether it's even possible to compile outside of WebKit though...
To create a branch "utf" locally to yourself only, based on master:
git checkout master
git checkout -b utf
To sync that branch up on the SF servers, I believe that this works:
git push -u origin utf
then when you push, the simplest way is
git push --all
git push --tags
which syncs all branches and all tags respectively.
If that's not right, let me know and I'll create a branch from here for you to checkout, patch and push back.
V8 is the "most respected" JS engine at present, but I appreciate that there would be a need to provide an enhanced uneval()/toSource(), and that the API is utterly different. I did get the impression that different might be better documented and more consistent though!
Cheers,
Steve
On 18/09/14 17:24, Peter Wilson wrote:
Related
Feature Requests: #64
Actually I think going the V8 route would be a good opportunity to convert templates to use JSON for serialising data rather than toSource/uneval. This would then open up the possibility of using Postgres JSON data operations to search fields within those structures. I think the latest version of Postgres even stores JSON data types in a structured, compact binary format.
Sounds interesting - It would certainly require work to migrate a WB database and its site(s), but in the long run it would be far more buzz-word compliant :). JSON is also a much simpler format, and is supported cross-language so would make my Perl-to-template parser library simpler.
Thanks Steve, will give it a try when I get a moment
Additional info: I've been writing some tests for UTF8 compatibility. Interestingly, some of the tests I expected to fail with existing Whitebeam were passing. Specifically, writing UTF8 data to a template then reading it back should end up with corrupted data.
It turns out that toSource() (or uneval, which we actually use when serialising objects) replaces all characters with codes >127 with \uXXXX unicode escape sequences. This seems like fairly bizarre behaviour given it doesn't happen anywhere else.
So writing a UTF8 string to contact.customData works with no corruption; however, writing the same data to contact.description corrupts it.
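The escaping described above can be modelled with a small sketch. This is not SpiderMonkey's implementation; `escapeNonAscii` is a hypothetical name, and the upper-case hex digits are an arbitrary choice. It only shows why the escaped form survives a byte-oriented path: every character left in the output is plain ASCII.

```cpp
#include <cstdio>
#include <string>

// Hypothetical model of the escaping step: any 16-bit unit with a code
// above 127 is written as a six-character \uXXXX escape sequence.
std::string escapeNonAscii(const std::u16string& units) {
    std::string out;
    for (char16_t u : units) {
        if (u < 0x80) {
            out.push_back(static_cast<char>(u));
        } else {
            char buf[7];  // backslash, 'u', four hex digits, terminator
            std::snprintf(buf, sizeof buf, "\\u%04X", static_cast<unsigned>(u));
            out += buf;
        }
    }
    return out;
}
```

That would be consistent with the serialised field coming back intact while the plain string field does not, although the exact path each field takes through Whitebeam is not shown here.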
New experimental branch pushed: utf8
Default mode is as before - although there are now more string copies in places.
Generally a whole VirtualHost would be put into UTF8 mode by adding:
RButf8 true
to the relevant VirtualHost - or to httpd.conf/whitebeam.conf if you want all hosts to default to UTF8.
While probably not useful in production:
rb.page.utf8() returns a boolean, current UTF8 processing state
rb.page.utf8(bool) sets the current processing state. More useful for testing probably
Attached are two files that demonstrate the changes. One is a simple UTF8 string that contains multiple-byte characters. The second is a test script. Put both files in the root of a VirtualHost and run.