Thread: [Ocaml-lib-devel] second proposal of UChar, UTF8 modules

From: Yamagata Y. <yor...@mb...> - 2003-06-21 08:38:31

Attachments: uChar.mli uChar.ml uTF8.mli uTF8.ml

The second proposal for the UChar and UTF8 modules is attached.  The
improvements are:

 * Better documentation
 * Error reporting
 * Performance improvement
 * Code clean up.

As before, I did some random tests.

--
Yamagata Yoriyuki

Re: [Ocaml-lib-devel] second proposal of UChar, UTF8 modules

From: John M. S. <sk...@oz...> - 2003-06-21 23:40:11

Yamagata Yoriyuki wrote:


Still has the redundant tests here.

> let add_uchar buf u =
>   let masq = 0b111111 in
>   let k = int_of_uchar u in
>   if k >= 0 && k <= 0x7f then
>     Buffer.add_char buf (Char.chr k)
>   else if k >= 0x80 && k <= 0x7ff then begin
>     Buffer.add_char buf (Char.chr (0xc0 lor (k lsr 6)));
>     Buffer.add_char buf (Char.chr (0x80 lor (k land masq)))
>   end else if k >= 0x800 && k <= 0xffff then begin
>     Buffer.add_char buf (Char.chr (0xe0 lor (k lsr 12)));
>     Buffer.add_char buf (Char.chr (0x80 lor ((k lsr 6) land masq)));
>     Buffer.add_char buf (Char.chr (0x80 lor (k land masq)));
>   end else if k >= 0x10000 && k <= 0x1fffff then begin
>     Buffer.add_char buf (Char.chr (0xf0 + (k lsr 18)));
>     Buffer.add_char buf (Char.chr (0x80 lor ((k lsr 12) land masq)));
>     Buffer.add_char buf (Char.chr (0x80 lor ((k lsr 6) land masq)));
>     Buffer.add_char buf (Char.chr (0x80 lor (k land masq)));
>   end  else if k >= 0x200000 && k <= 0x3ffffff then begin
>     Buffer.add_char buf (Char.chr (0xf8 + (k lsr 24)));
>     Buffer.add_char buf (Char.chr (0x80 lor ((k lsr 18) land masq)));
>     Buffer.add_char buf (Char.chr (0x80 lor ((k lsr 12) land masq)));
>     Buffer.add_char buf (Char.chr (0x80 lor ((k lsr 6) land masq)));
>     Buffer.add_char buf (Char.chr (0x80 lor (k land masq)));
>   end else begin
>     Buffer.add_char buf (Char.chr (0xfc + (k lsr 30)));
>     Buffer.add_char buf (Char.chr (0x80 lor ((k lsr 24) land masq))); 
>     Buffer.add_char buf (Char.chr (0x80 lor ((k lsr 18) land masq)));
>     Buffer.add_char buf (Char.chr (0x80 lor ((k lsr 12) land masq)));
>     Buffer.add_char buf (Char.chr (0x80 lor ((k lsr 6) land masq)));
>     Buffer.add_char buf (Char.chr (0x80 lor (k land masq)));
>   end
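
One reading of "redundant tests": in an else-if chain, each lower bound (k >= 0x80, k >= 0x800, ...) is already implied by the failure of the previous upper-bound test, provided negative and out-of-range values are dealt with first. A minimal sketch of that idea, using a hypothetical utf8_len helper that works on plain ints and only returns the byte count (it is not part of the attached modules):

```ocaml
(* Hypothetical helper, not from uTF8.ml: number of bytes the old
   31-bit UTF-8 scheme needs for code point k.  Negative k is treated
   like the largest range, as in the quoted code's final branch. *)
let utf8_len k =
  if k < 0 || k >= 0x4000000 then 6
  else if k <= 0x7f then 1       (* k >= 0 is already known here *)
  else if k <= 0x7ff then 2      (* k >= 0x80 is already known here *)
  else if k <= 0xffff then 3
  else if k <= 0x1fffff then 4
  else 5
```

Each branch now costs one comparison instead of two.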


-- 
John Max Skaller, mailto:sk...@oz...
snail:10/1 Toxteth Rd, Glebe, NSW 2037, Australia.
voice:61-2-9660-0850

Re: [Ocaml-lib-devel] second proposal of UChar, UTF8 modules

From: Yamagata Y. <yor...@mb...> - 2003-06-22 09:57:46

Attachments: uTF8.ml

From: John Max Skaller <sk...@oz...>
Subject: Re: [Ocaml-lib-devel] second proposal of UChar, UTF8 modules
Date: Sun, 22 Jun 2003 09:39:52 +1000

> Still has the redundant tests here.

Fixed.  Thanks.  The file is attached below.

--
Yamagata Yoriyuki

Re: [Ocaml-lib-devel] second proposal of UChar, UTF8 modules

From: Yamagata Y. <yor...@mb...> - 2003-06-22 21:42:09

Attachments: uChar.ml uTF8.ml

I found some problems in uChar.ml.  (chr does not check its argument
properly, and some functions raise the wrong exception.)  The corrected
version is attached below.  Also, I tried to optimize uTF8.ml further,
using unsafe operations.  I do not actually like this kind of trick,
but it seems to squeeze out a speed-up of about 30-40%.

Are there other things to do for the inclusion?
--
Yamagata Yoriyuki

Re: [Ocaml-lib-devel] second proposal of UChar, UTF8 modules

From: John M. S. <sk...@oz...> - 2003-06-23 03:25:13

Yamagata Yoriyuki wrote:

> let add_uchar buf u =
>   let masq = 0b111111 in
>   let k = int_of_uchar u in
>   if k < 0 || k >= 0x4000000 then begin
>     Buffer.add_char buf (Char.chr (0xfc + (k lsr 30)));
>     Buffer.add_char buf (Char.unsafe_chr (0x80 lor ((k lsr 24) land masq))); 

You might try replacing 'masq' with the actual literal value,
though the compiler should be able to determine it's a constant,
it may not.

The other thing to do here is reorder the tests:
test k>=0 && k<=0x7f first, then k<=0x7ff ..., the reason being
that the first case is 99.99% of cases. The second
is 99.99% of the rest of cases.

-- 
John Max Skaller, mailto:sk...@oz...

Re: [Ocaml-lib-devel] second proposal of UChar, UTF8 modules

From: Nicolas C. <war...@fr...> - 2003-06-23 08:56:51

> > let add_uchar buf u =
> >   let masq = 0b111111 in
> >   let k = int_of_uchar u in
> >   if k < 0 || k >= 0x4000000 then begin
> >     Buffer.add_char buf (Char.chr (0xfc + (k lsr 30)));
> >     Buffer.add_char buf (Char.unsafe_chr (0x80 lor ((k lsr 24) land masq)));
>
> You might try replacing 'masq' with the actual literal value,
> though the compiler should be able to determine it's a constant,
> it may not.

I'm pretty sure it is doing it.
Since masq is non-mutable, this is an easy optimisation for the compiler.
BTW, you can check the native output code by running ocamlopt with -S

> The other thing to do here is reorder the tests:
> test k>=0 && k<=0x7f first, then k<=0x7ff ..., the reason being
> that the first case is 99.99% of cases. The second
> is 99.99% of the rest of cases.

This one the compiler can't :-)

Nicolas Cannasse

Re: [Ocaml-lib-devel] second proposal of UChar, UTF8 modules

From: Yamagata Y. <yor...@mb...> - 2003-06-23 15:37:44

From: "Nicolas Cannasse" <war...@fr...>
Subject: Re: [Ocaml-lib-devel] second proposal of UChar, UTF8 modules
Date: Mon, 23 Jun 2003 17:55:16 +0900

> I'm pretty sure it is doing it.
> Since masq is non-mutable, this is an easy optimisation for the compiler.
> BTW, you can check the native output code by running ocamlopt with -S

It does.

> The other thing to do here is reorder the tests:
> test k>=0 && k<=0x7f first, then k<=0x7ff ..., the reason being
> that the first case is 99.99% of cases. The second
> is 99.99% of the rest of cases.

k could be negative.  A better way is

  if k >= 0 then
     if k <= 0x7f then ... else
     if k <= 0x7ff then ... else
     ...
     if k <= 0x3ffffff then ... else
     (*)
  else
     (*)

but then the code (*) is duplicated.  I don't think an extra integer
comparison is a big deal.
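
The duplication of the (*) case in the nested form can be limited to a single call by putting that case in a local function. The following is only a sketch under several assumptions: it takes a plain int instead of a UChar.t, the helper names add, cont and six_bytes are made up for illustration, and the lead byte of the six-byte form is masked so that the code also runs on 64-bit ints (the quoted code relies on 31-bit ints).

```ocaml
(* Sketch only, not the attached uTF8.ml.  Encodes code point k with
   the old 31-bit UTF-8 scheme, testing the ASCII range first. *)
let add_uchar buf k =
  let add n = Buffer.add_char buf (Char.unsafe_chr n) in
  (* Continuation byte carrying the six bits of k starting at bit [shift]. *)
  let cont shift = add (0x80 lor ((k lsr shift) land 0x3f)) in
  (* Shared fallback: six bytes, used for k < 0 and k >= 0x4000000. *)
  let six_bytes () =
    add (0xfc lor ((k lsr 30) land 1));
    cont 24; cont 18; cont 12; cont 6; cont 0
  in
  if k >= 0 then begin
    if k <= 0x7f then add k
    else if k <= 0x7ff then begin
      add (0xc0 lor (k lsr 6)); cont 0
    end else if k <= 0xffff then begin
      add (0xe0 lor (k lsr 12)); cont 6; cont 0
    end else if k <= 0x1fffff then begin
      add (0xf0 lor (k lsr 18)); cont 12; cont 6; cont 0
    end else if k <= 0x3ffffff then begin
      add (0xf8 lor (k lsr 24)); cont 18; cont 12; cont 6; cont 0
    end else six_bytes ()
  end else six_bytes ()
```

With this layout an ASCII character costs two comparisons (k >= 0 and k <= 0x7f), and the "magic" numbers of the six-byte case appear only once. Whether the extra closure calls cost more than the saved comparison would have to be measured.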

Further optimization would also be possible with unsafe operations.
For example, iter could use unsafe_next (which does not check whether
i is valid).

let rec iter_aux proc s i =
  if i >= String.length s then () else
  let u = look s i in
  proc u;
  iter_aux proc s (next s i)

But that duplicates code (unsafe_next and next; if we implement next
using unsafe_next, then we add one extra function call for the safe
operation, which is IMO not desirable) and makes the code more
error-prone.  (In this case, we implicitly assume that iter_aux is
never called with i < 0.)
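
The trade-off described here can be made concrete with a small sketch. None of the following is taken from uTF8.ml: seq_len is a hypothetical helper, and next, unsafe_next and length are simplified stand-ins that only look at the lead byte of each sequence.

```ocaml
(* Hypothetical helper: byte length of the sequence introduced by lead
   byte c, following the old 31-bit UTF-8 scheme (up to 6 bytes). *)
let seq_len c =
  let n = Char.code c in
  if n < 0x80 then 1
  else if n < 0xc0 then invalid_arg "seq_len: continuation byte"
  else if n < 0xe0 then 2
  else if n < 0xf0 then 3
  else if n < 0xf8 then 4
  else if n < 0xfc then 5
  else if n < 0xfe then 6
  else invalid_arg "seq_len: invalid lead byte"

(* Unchecked: the caller must guarantee 0 <= i < String.length s. *)
let unsafe_next s i = i + seq_len (String.unsafe_get s i)

(* Checked version built on the unchecked one.  This is the "one extra
   function call" unless the compiler inlines unsafe_next. *)
let next s i =
  if i < 0 || i >= String.length s then invalid_arg "next: bad index"
  else unsafe_next s i

(* The loop starts at 0 and only moves forward, so i < 0 cannot occur
   and the upper bound is tested once per character by the loop itself. *)
let length s =
  let rec loop i n =
    if i >= String.length s then n else loop (unsafe_next s i) (n + 1)
  in
  loop 0 0
```

The unchecked variant is only safe because the loop controls every index it passes; exposing unsafe_next to callers would move that obligation onto them.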

My opinion is that uTF8.ml is already well optimized; I would not tune
it further unless we have a good benchmark.

--
Yamagata Yoriyuki

Re: [Ocaml-lib-devel] second proposal of UChar, UTF8 modules

From: John M. S. <sk...@oz...> - 2003-06-23 17:30:34

Yamagata Yoriyuki wrote:

> k could be negative.  A better way is
> 
>   if k >= 0 then
>      if k <= 0x7f then ... else
>      if k <= 0x7ff then ... else
>      ...
>      if k <= 0x3ffffff then ... else
>      (*)
>   else
>      (*)
> 
> but then the code (*) is duplicated.  I don't think an extra integer
> comparison is a big deal.

I do, because I may UTF-8 every input file to my compiler.

Since lexical analysis is the slowest part of compilation,
and this routine is handling every character individually,
blinding speed is important. Adding 50% more comparisons
to handle an ASCII character may slow the lexer, and thereby
the whole compilation process, by a significant amount.
I'm already thinking to replace Ocamllex, since the
space compaction on the lookup tables costs performance :-)
Also tempted to mmap the input file, to eliminate the
check for end of buffer needed on each char.

After all, the core of a scanner is ultra fast:

	while(state = matrix[state][*p++]);

which should outperform memory easily.
Well, if I go i18n, I want the decoder function
as fast as possible (the encoder is less critical).
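
For readers less used to the one-line C loop above, here is roughly the same table-driven scan in OCaml. The two-state table is a made-up example (state 1 loops on ASCII digits, state 0 stops), and unlike the C version this sketch also checks for the end of the input instead of relying on a terminating byte.

```ocaml
(* Hypothetical transition table: matrix.(state).(byte) is the next
   state.  State 1 accepts ASCII digits; everything else goes to 0. *)
let matrix =
  Array.init 2 (fun st ->
    Array.init 256 (fun c ->
      if st = 1 && c >= Char.code '0' && c <= Char.code '9' then 1 else 0))

(* Run the automaton from state [start] and return the index just past
   the last byte read, like p after the C loop terminates. *)
let scan matrix s start =
  let state = ref start and p = ref 0 in
  while !state <> 0 && !p < String.length s do
    state := matrix.(!state).(Char.code (String.unsafe_get s !p));
    incr p
  done;
  !p
```

The extra end-of-input test in the loop condition is the per-character check that mmap-ing the file with a sentinel byte would remove.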

-- 
John Max Skaller, mailto:sk...@oz...

Re: [Ocaml-lib-devel] second proposal of UChar, UTF8 modules

From: Yamagata Y. <yor...@mb...> - 2003-06-24 13:15:13

From: John Max Skaller <sk...@oz...>
Subject: Re: [Ocaml-lib-devel] second proposal of UChar, UTF8 modules
Date: Tue, 24 Jun 2003 03:30:06 +1000

> Well, if I go i18n, I want the decoder function
> as fast as possible (the encoder is less critical).

The part we talked about is the encoder, not the decoder.  (The
decoder is the look function.)

I ran some benchmarks.  While the test repeating "buf.add_uchar 'a'
buf" and "buf.clear buf" shows a 25% speed-up, the more realistic test
that repeatedly puts 1K uchars into a buffer shows only a 3% speed-up.
Not a big deal, really.
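
A benchmark of the second kind might look like the sketch below. It is not the harness used for the numbers above: the names fill and time are made up, and the standard library's Buffer.add_utf_8_uchar (OCaml 4.06 or later) stands in for the add_uchar under discussion.

```ocaml
(* Stand-in encoder: the standard library's UTF-8 writer, used here in
   place of the thread's add_uchar.  Fills buf with n copies of 'a'. *)
let fill buf n =
  Buffer.clear buf;
  for _i = 1 to n do
    Buffer.add_utf_8_uchar buf (Uchar.of_char 'a')
  done

(* CPU seconds taken by reps calls of f. *)
let time reps f =
  let t0 = Sys.time () in
  for _i = 1 to reps do f () done;
  Sys.time () -. t0

let () =
  let buf = Buffer.create 1024 in
  let t = time 10_000 (fun () -> fill buf 1024) in
  Printf.printf "1K uchars x 10000: %.3fs\n" t
```

Reusing one buffer keeps allocation out of the measurement, so the loop mostly times the encoder and Buffer.add_char; results will still vary between machines and compiler versions.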

--
Yamagata Yoriyuki

Re: [Ocaml-lib-devel] second proposal of UChar, UTF8 modules

From: John M. S. <sk...@oz...> - 2003-06-25 02:47:54

Yamagata Yoriyuki wrote:

>>Well, if I go i18n, I want the decoder function
>>as fast as possible (the encoder is less critical).
>>
> 
> The part we talked about is the encoder, not the decoder.  (The
> decoder is the look function.)

I know.

> I ran some benchmarks.  While the test repeating "buf.add_uchar 'a'
> buf" and "buf.clear buf" shows a 25% speed-up, the more realistic test
> that repeatedly puts 1K uchars into a buffer shows only a 3% speed-up.
> Not a big deal, really.

Yes it is. It is a big deal. Let me ask you something:

would you give up 3% interest on an investment?
would you give 3% of a year extra to your employer
instead of holidays? That's about 11 days of holidays,
which is two whole weeks .. around here that's the whole
of your Xmas holiday .. and over half your total holidays.

You might think the comparison is unfair.
But software systems compete on margins, like
anything else. 3% off an overhead is a good improvement.
Get it fast enough and the case for recoding
in C -- or even assembler -- will diminish.

-- 
John Max Skaller, mailto:sk...@oz...

Re: [Ocaml-lib-devel] second proposal of UChar, UTF8 modules

From: Nicolas C. <war...@fr...> - 2003-06-25 02:57:51

> > I ran some benchmarks.  While the test repeating "buf.add_uchar 'a'
> > buf" and "buf.clear buf" shows a 25% speed-up, the more realistic test
> > that repeatedly puts 1K uchars into a buffer shows only a 3% speed-up.
> > Not a big deal, really.
>
> Yes it is. It is a big deal. Let me ask you something:
>
> would you give up 3% interest on an investment?
> would you give 3% of a year extra to your employer
> instead of holidays? That's about 11 days of holidays,
> which is two whole weeks .. around here that's the whole
> of your Xmas holiday .. and over half your total holidays.

The thing here is that if we're sure to get a 3% improvement, then it's
perhaps worth it (if the code doesn't get bigger - since for some people
space is more an issue than time - and if the source doesn't get ugly so it
can still be maintained, modified, etc.).
But since hardware differs, and given memory effects and imprecise process
timing, you can be pretty sure that this 3% is not relevant: a +3%
on Yamagata-san's computer can be a -10% on my Windows box, for example :-)
Always be careful with micro benchmarks, and watch the generated assembly
code (using ocamlopt -S) to be sure that this is actually an optimisation!

Nicolas Cannasse

Re: [Ocaml-lib-devel] second proposal of UChar, UTF8 modules

From: Yamagata Y. <yor...@mb...> - 2003-06-26 05:05:08

From: "Nicolas Cannasse" <war...@fr...>
Subject: Re: [Ocaml-lib-devel] second proposal of UChar, UTF8 modules
Date: Wed, 25 Jun 2003 11:56:12 +0900

> if the code doesn't get bigger - since for some people
> space is more an issue than time - and if the source doesn't get ugly so it
> can still be maintained, modified, etc.

The code gets slightly bigger (160 bytes).  I am most concerned about
code duplication.  The code converter has several "magic" numbers and
formulas in it.  We had better keep them in one place, so that we
can easily review and modify them.

In my experience (and judging from the tales of woe caused by
broken converters), writing a correct code converter is not easy.
Erring on the safe side is a better decision than reckless
optimization.

--
Yamagata Yoriyuki

Re: [Ocaml-lib-devel] second proposal of UChar, UTF8 modules

From: Nicolas C. <war...@fr...> - 2003-06-23 08:53:23

> Are there other things to do for the inclusion?

Once everybody is happy with it, I will commit it to ExtLib.

Nicolas Cannasse

Re: [Ocaml-lib-devel] second proposal of UChar, UTF8 modules

From: Yamagata Y. <yor...@mb...> - 2003-06-23 15:37:47

Attachments: uTF8.diff

Argh, I found a bug in uTF8.ml.  Here is a fix.