From: Florent M. <fmo...@li...> - 2008-05-29 16:19:01
Hi,

It would be possible to make write_double more than twice as fast (close to 3 times faster). The tradeoff is that there is additional C code.

Cheers
From: Florent M. <fmo...@li...> - 2008-06-01 11:07:02
Florent Monnier wrote:
> Hi,
> It would be possible to make write_double more than twice as fast
> (close to 3 times faster). The tradeoff is that there is additional
> C code.
>
> Cheers

Have you decided to make the extlib with only OCaml code, without any external C?

Anyway, if you wish to test write_double_opt, grab it here:
http://www.linux-nantes.org/~fmonnier/TMP/extlib-1.5.1_write_double_opt.tar.bz2

To test, just unpack the tarball and type:

  sh test_write_double.sh

Here are the elapsed times that I get:

  0.44 sec. -- the implementation in the current ExtLib
  0.16 sec. -- proposal for inclusion in the ExtLib distribution
  0.14 sec. -- with a higher-order function

--
Cheers
Florent
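The test script itself is in the tarball; as a rough illustration of the kind of harness involved (not the actual test.ml), a timing loop like the following, using Sys.time and writing to /dev/null, is enough to produce comparable figures:

  (* illustrative sketch: time n calls to IO.write_double on an output
     that discards the data, and print the elapsed CPU time *)
  let time_writes n =
    let out = IO.output_channel (open_out_bin "/dev/null") in
    let t0 = Sys.time () in
    for _i = 1 to n do
      IO.write_double out 3.14159
    done;
    IO.close_out out;
    Printf.printf "%.2f sec.\n" (Sys.time () -. t0)

  let () = time_writes 3_000_000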
From: Amit D. <ad...@du...> - 2008-06-01 12:57:15
Hiya,

On Sun, Jun 1, 2008 at 12:15 PM, Florent Monnier <fmo...@li...> wrote:
> Have you decided to make the extlib with only OCaml code,
> without any external C?

Yes, I think that was the original idea.

Best,
-Amit
From: Richard J. <ri...@an...> - 2008-06-02 17:33:15
On Sun, Jun 01, 2008 at 01:15:28PM +0200, Florent Monnier wrote:
> Florent Monnier wrote:
> > Hi,
> > It would be possible to make write_double more than twice as fast
> > (close to 3 times faster). The tradeoff is that there is additional
> > C code.
> >
> > Cheers
>
> Have you decided to make the extlib with only OCaml code,
> without any external C?

That's the policy - to make it easier to port extlib to any platform which supports OCaml.

Is it possible to match the speed-ups using pure OCaml code? E.g. by carefully looking at the generated assembler (ocamlopt -S) and studying why it might be slow?

Rich.

--
Richard Jones
Red Hat
From: Florent M. <fmo...@li...> - 2008-06-02 18:49:16
> > Have you decided to make the extlib with only OCaml code,
> > without any external C?
>
> That's the policy - to make it easier to port extlib to any platform
> which supports OCaml.

Maybe write_double_opt could be shipped alongside extlib as a patch plus a README explaining the difference.

> Is it possible to match the speed-ups using pure OCaml code? E.g. by
> carefully looking at the generated assembler (ocamlopt -S) and
> studying why it might be slow?

Yes, I have compared the gas code of the extlib version with that of the mixed OCaml/C one, and even without it, just reading the original code makes it easy to see where the difference comes from:

* The mixed OCaml/C version copies the double into an OCaml string buffer (always the same one, so no allocation per call) with the C function memcpy, then writes the string buffer once. This is minimal, with no extra operations:

let buf_str = "01234567"
external double_cast: buf_str:string -> float -> unit = "double_cast"

let write_double_opt_native ch d =
  double_cast ~buf_str d;
  nwrite ch buf_str

CAMLprim value double_cast( value str, value d )
{
  memcpy((char *)str, (double *)d, sizeof(double));
  return Val_unit;
}
/* as you see there is only one line of C code */

(We have "Int64.bits_of_float" but no "String.buf_of_float" :,( I have tried with the Marshal module; the binary part is at the end, but for some particular floats the total number of bytes is wrong.)

; the assembly code of the C part:
double_cast:
    pushl %ebp
    movl %esp, %ebp
    movl 12(%ebp), %edx
    movl 8(%ebp), %ecx
    movl (%edx), %eax
    movl %eax, (%ecx)
    movl 4(%edx), %eax
    movl %eax, 4(%ecx)
    movl $1, %eax
    popl %ebp
    ret

; and the gas code of the OCaml part:
camlIO__write_double_opt_native_422:
    subl $4, %esp
.L557:
    movl %eax, 0(%esp)
    pushl %ebx
    pushl camlIO + 296
    movl $double_cast, %eax
    call caml_c_call
.L558:
    addl $8, %esp
    movl camlIO + 296, %ebx
    movl 0(%esp), %eax
    addl $4, %esp
    jmp camlIO__nwrite_140

* The original write_double makes a lot of shifts and conversions. Here is the list of the extra operations performed:
  - 1 x Int64.bits_of_float
  - 3 x Int64.to_int32
  - 1 x Int64.shift_right_logical
  - 4 x Int32.to_int
  - 2 x Int32.shift_right_logical
  - 4 x lsr
  - 8 x write_byte
  (There is no surprise in the related gas code, which simply lists all these operations; if you are interested I have put it at the end of this email, because it is quite long.)

* Even without additional C, it would be possible to make the implementation a bit more concise, but it barely improves the speed (only very slightly, even with 1 million calls):

let write_double_ext_native ch f =
  let bin = Int64.bits_of_float f in
  let b7 = Int64.to_int bin in
  let b6 = b7 lsr 8
  and b5 = b7 lsr 16
  and b4 = Int64.to_int (Int64.shift_right_logical bin 24) in
  let b3 = b4 lsr 8
  and b2 = b4 lsr 16
  and b1 = Int64.to_int (Int64.shift_right_logical bin 48) in
  let b0 = b1 lsr 8 in
  write_byte ch b7;
  write_byte ch b6;
  write_byte ch b5;
  write_byte ch b4;
  write_byte ch b3;
  write_byte ch b2;
  write_byte ch b1;
  write_byte ch b0

This version is also included in the test tarball provided in my previous email, and you can easily compare it with all the other implementations by adding this line to the test script 'test_write_double.sh':

  time ./test.opt /dev/null -ext

--
With Regards
Florent

--
; gas code of the current write_double:
camlIO__write_double_391:
    subl $4, %esp
.L527:
    movl %eax, 0(%esp)
    pushl %ebx
    movl $caml_int64_bits_of_float, %eax
    call caml_c_call
.L528:
    addl $4, %esp
    movl %eax, %ebx
    movl 0(%esp), %eax
    addl $4, %esp
    jmp camlIO__write_i64_385

; ....
camlIO__write_i64_385:
    subl $8, %esp
.L520:
    movl %eax, 4(%esp)
    movl %ebx, 0(%esp)
    pushl %ebx
    movl $caml_int64_to_int32, %eax
    call caml_c_call
.L521:
    addl $4, %esp
    movl %eax, %ebx
    movl 4(%esp), %eax
    call camlIO__write_real_i32_380
.L522:
    pushl $65
    movl 4(%esp), %eax
    pushl %eax
    movl $caml_int64_shift_right_unsigned, %eax
    call caml_c_call
.L523:
    addl $8, %esp
    pushl %eax
    movl $caml_int64_to_int32, %eax
    call caml_c_call
.L524:
    addl $4, %esp
    movl %eax, %ebx
    movl 4(%esp), %eax
    addl $8, %esp
    jmp camlIO__write_real_i32_380

; ...
camlIO__write_real_i32_380:
    subl $12, %esp
.L516:
    movl %eax, %ecx
    movl %ecx, 8(%esp)
    movl 4(%ebx), %eax
    sall $1, %eax
    orl $1, %eax
    movl %eax, 0(%esp)
    movl 4(%ebx), %ebx
    shrl $24, %ebx
    sall $1, %ebx
    orl $1, %ebx
    movl %ebx, 4(%esp)
    andl $511, %eax
    movl (%ecx), %ebx
    movl (%ebx), %ecx
    call *%ecx
.L517:
    movl 0(%esp), %eax
    shrl $8, %eax
    orl $1, %eax
    andl $511, %eax
    movl 8(%esp), %ebx
    movl (%ebx), %ebx
    movl (%ebx), %ecx
    call *%ecx
.L518:
    movl 0(%esp), %eax
    shrl $16, %eax
    orl $1, %eax
    andl $511, %eax
    movl 8(%esp), %ebx
    movl (%ebx), %ebx
    movl (%ebx), %ecx
    call *%ecx
.L519:
    movl 4(%esp), %eax
    andl $511, %eax
    movl 8(%esp), %ebx
    movl (%ebx), %ebx
    movl (%ebx), %ecx
    addl $12, %esp
    jmp *%ecx
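For readers without the ExtLib sources at hand: pieced together from the operation list and the assembly above, the existing code path looks roughly like the following sketch (a paraphrase, not the verbatim IO.ml source; write_byte is IO's byte writer):

  (* rough paraphrase: the float is reinterpreted as an Int64, split into
     two boxed Int32 halves, and each half is emitted as four separate
     write_byte calls *)
  let write_real_i32 ch n =
    let base = Int32.to_int n in
    let big = Int32.to_int (Int32.shift_right_logical n 24) in
    write_byte ch base;
    write_byte ch (base lsr 8);
    write_byte ch (base lsr 16);
    write_byte ch big

  let write_i64 ch n =
    write_real_i32 ch (Int64.to_int32 n);
    write_real_i32 ch (Int64.to_int32 (Int64.shift_right_logical n 32))

  let write_double ch f =
    write_i64 ch (Int64.bits_of_float f)

Under that reading, each call allocates a boxed Int64 plus two boxed Int32 values before a single byte is written, on top of the eight write_byte calls.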
From: Janne H. <jjh...@gm...> - 2008-06-02 22:05:38
> > Is it possible to match the speed-ups using pure OCaml code? E.g. by
> > carefully looking at the generated assembler (ocamlopt -S) and
> > studying why it might be slow?
>
> Yes, I have compared the gas code of the extlib version with that of
> the mixed OCaml/C one, and even without it, just reading the original
> code makes it easy to see where the difference comes from:

If the IO module is used to write out to a file, I would expect the overhead of writing to the output to far outweigh the benefit of tighter assembly code for writing out doubles. Wouldn't calling write_byte eight times be much more expensive than the few shift instructions?

It also looks like your C/OCaml implementation with the OCaml-side string is not thread safe. Perhaps this doesn't happen with the current OCaml run-time, but it looks like if two threads entered double_cast at the same time, you'd corrupt buf_str:

let buf_str = "01234567"
external double_cast: buf_str:string -> float -> unit = "double_cast"

Janne
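One illustrative way to keep the single shared buffer yet avoid the race would be to serialize access with a lock from the threads library, though the locking cost might well cancel the speedup; a sketch, assuming it lives inside IO next to nwrite like the original:

  (* sketch: one global 8-byte buffer protected by a mutex; the buffer is
     only touched while the lock is held *)
  let buf_str = "01234567"
  let buf_lock = Mutex.create ()

  external double_cast : buf_str:string -> float -> unit = "double_cast"

  let write_double_opt_locked ch d =
    Mutex.lock buf_lock;
    (try
       double_cast ~buf_str d;
       nwrite ch buf_str
     with e -> Mutex.unlock buf_lock; raise e);
    Mutex.unlock buf_lock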
From: Janne H. <jjh...@gm...> - 2008-06-02 23:37:45
> If the IO module is used to write out to a file, I would expect the
> overhead of writing to the output to far outweigh the benefit of tighter
> assembly code for writing out doubles. Wouldn't calling write_byte eight
> times be much more expensive than the few shift instructions?

The situation looks more complicated than what I said above. I wrote this simple benchmark:

let () =
  let io_out = IO.output_string () in
  for i = 0 to 65535 do
    for j = 0 to 127 do
      (* j upper bound needs to be 16 when writing out 8 bytes per double! *)
      IO.write_double io_out (float_of_int i)
    done;
  done

which just writes floats to a string. I verified via the .s files that IO.write_double actually gets called and doesn't get inlined.

If I implement write_i64 (which write_double uses) as a dummy function that does nothing:

let write_i64 ch i = ()

the program can write out 14.5 million write_doubles per second. On my 1.8 GHz laptop this means roughly 125 cycles per write_double. Although not exactly blazingly fast for almost a no-op, it doesn't feel completely off, given that the program does a lot of conversions from ints to floats and allocates a boxed float on each iteration (see the assembly right below):

.L103:
    call caml_alloc2
.L107:
    leal 4(%eax), %ebx
    movl $2301, -4(%ebx)
    movl 4(%esp), %eax
    sarl $1, %eax
    pushl %eax
    fildl (%esp)
    addl $4, %esp
    fstpl (%ebx)
    movl 8(%esp), %eax
    call camlIO__write_double_327
.L108:
    movl 12(%esp), %eax
    movl %eax, %ebx
    addl $2, %eax
    movl %eax, 12(%esp)
    cmpl $255, %ebx
    jne .L103

Now if I go and modify write_i64 to be only slightly more complex:

let write_i64 ch i =
  let ilo = Int64.to_int32 i in
  ()

the write_double rate drops to 9.3 million ops per second. Slightly complicating it again, like so:

let write_i64 ch i =
  let ilo = Int64.to_int32 i in
  let ihi = Int64.to_int32 (Int64.shift_right_logical i 32) in
  ()

the performance almost halves, now at 5.3 million ops per second. I tend to believe that this performance drop is due to more allocation, as both the ilo and ihi variables are of type Int32.t and hence boxed 32-bit ints.

If I implement the write_i64 function with something that actually does something:

let write_i64 ch i =
  let ilo = Int64.to_int32 i in
  let ihi = Int64.to_int32 (Int64.shift_right_logical i 32) in
  let s = String.create 8 in
  let ilo_nat = Int32.to_int ilo in
  s.[0] <- Char.unsafe_chr ilo_nat;
  s.[1] <- Char.unsafe_chr (ilo_nat lsr 8);
  s.[2] <- Char.unsafe_chr (ilo_nat lsr 16);
  s.[3] <- Char.unsafe_chr (Int32.to_int (Int32.shift_right_logical ilo 24));
  let ihi_nat = Int32.to_int ihi in
  s.[4] <- Char.unsafe_chr ihi_nat;
  s.[5] <- Char.unsafe_chr (ihi_nat lsr 8);
  s.[6] <- Char.unsafe_chr (ihi_nat lsr 16);
  s.[7] <- Char.unsafe_chr (Int32.to_int (Int32.shift_right_logical ihi 24));
  nwrite ch s

I now get only about 2 million ops per second. Commenting out the byte-extraction code (i.e., writing out garbage):

let write_i64 ch i =
  let ilo = Int64.to_int32 i in
  let ihi = Int64.to_int32 (Int64.shift_right_logical i 32) in
  let s = String.create 8 in
  nwrite ch s

With this change, I still get roughly 2 million ops per second (only slightly faster). This leads me to believe that the time is not spent in ALU ops but rather in the garbage collector or in data cache misses.

Feeling optimistic that I had made an optimization over the original write_i64 by calling I/O writes less often (i.e., once as opposed to 8 times), I benchmarked the original write_i64 version. Well, when writing to a string output, the 8x write_byte version is faster. When writing to a real file, the single 8-byte nwrite version ends up slightly faster.

Isn't there a way to perform the same double_cast using some GC/OCaml object-structure magic like the Obj module? Ideally write_double should also try to avoid calling write_i64, as calling write_i64 causes the allocation of an Int64.t item.

How fast does a write_double have to be?

Janne
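One way to test the allocation hypothesis directly would be to sample the GC counters around the loop; a sketch (Gc.allocated_bytes reports the bytes allocated by the program so far):

  (* sketch: measure average bytes allocated per call of some function f *)
  let allocation_per_call f n =
    let before = Gc.allocated_bytes () in
    for _i = 1 to n do f () done;
    (Gc.allocated_bytes () -. before) /. float_of_int n

  let () =
    let io_out = IO.output_string () in
    let bytes =
      allocation_per_call (fun () -> IO.write_double io_out 12345.678) 100_000
    in
    Printf.printf "%.1f bytes allocated per write_double\n" bytes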
From: Florent M. <fmo...@li...> - 2008-06-03 18:29:38
> With this change, I still get roughly 2 million ops per second (only
> slightly faster). This leads me to believe that the time is not spent
> in ALU ops but rather in the garbage collector or in data cache misses.

Have you tried my write_double_opt version in your benchmarks, to compare? How big was the difference? If the time is spent mostly in allocation and the GC, that also explains why write_double_opt is 3 times faster: it doesn't allocate anything.

> Isn't there a way to perform the same double_cast using some GC/OCaml
> object-structure magic like the Obj module?

I haven't found any OCaml function equivalent to the C function memcpy(), even in the Obj module.

> How fast does a write_double have to be?

I happily admit that this is a geekish issue :) but we are geeks and love trying to optimise code ;-)
More seriously, if I had found a speedup of only 5% or 10% I wouldn't have bothered you. I wrote because an improvement of close to 3 times faster seemed worthwhile. With a String.create (to make it thread safe), the improvement is only about 2.5 times.

--
With Regards
Florent
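For reference, the String.create variant mentioned above would presumably look something like this (a sketch; one fresh 8-byte buffer per call, thread safe at the cost of a small allocation):

  (* sketch: allocate a private buffer on each call instead of sharing one;
     double_cast is the same one-line C stub as before *)
  external double_cast : buf_str:string -> float -> unit = "double_cast"

  let write_double_opt_safe ch d =
    let buf_str = String.create 8 in
    double_cast ~buf_str d;
    nwrite ch buf_str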
From: Richard J. <ri...@an...> - 2008-06-08 09:18:11
On Mon, Jun 02, 2008 at 08:58:14PM +0200, Florent Monnier wrote:
> > > Have you decided to make the extlib with only OCaml code,
> > > without any external C?
> >
> > That's the policy - to make it easier to port extlib to any platform
> > which supports OCaml.
>
> Maybe write_double_opt could be shipped alongside extlib as a patch
> plus a README explaining the difference.

It may be possible to have optimized versions of extlib functions which are only switched in if they can be compiled, with OCaml versions as backups. I'll see what others think about that, though.

> > Is it possible to match the speed-ups using pure OCaml code? E.g. by
> > carefully looking at the generated assembler (ocamlopt -S) and
> > studying why it might be slow?
>
> Yes, I have compared the gas code of the extlib version with that of
> the mixed OCaml/C one, and even without it, just reading the original
> code makes it easy to see where the difference comes from:

This is the code we're discussing:

CAMLprim value double_cast( value str, value d )
{
  memcpy((char *)str, (double *)d, sizeof(double));
  return Val_unit;
}

external double_cast: buf_str:string -> float -> unit = "double_cast"

let buf_str = "01234567"

let write_double_opt_native ch d =
  double_cast ~buf_str d;
  nwrite ch buf_str

As was pointed out in another reply, I don't think this is thread safe. Anyhow, you can get the same effect in pure OCaml with this hairy bit of Obj magic:

open Printf

let hairy_string_of_float (d : float) =
  let r = Obj.repr d in
  Obj.set_tag r Obj.string_tag;
  (Obj.obj r : string)

let print_bytes s n =
  for i = 0 to n-1 do
    printf "%02x " (Char.code (String.unsafe_get s i))
  done;
  printf "\n"

let () =
  print_bytes (hairy_string_of_float 1.0) 8;
  print_bytes (hairy_string_of_float 3.0) 8;
  print_bytes (hairy_string_of_float 1e38) 8

Output:

00 00 00 00 00 00 f0 3f
00 00 00 00 00 00 08 40
b1 a1 16 2a d3 ce d2 47

Note that the string returned from hairy_string_of_float isn't a well-formed OCaml string, so it's not safe to call anything except String.unsafe_get on it. E.g. functions such as String.length will definitely fail.

I haven't tested the performance, but I did look at the assembly code. On my x86-64 it's unfortunate that the compiler didn't inline the call to Obj.set_tag (instead it's a C call, even though the C function is a two-liner). You could probably replace it with a call to String.unsafe_set at a negative offset to modify the tag directly, and with luck the generated code should be faster than your C implementation.

Rich.

--
Richard Jones
Red Hat
From: Richard J. <ri...@an...> - 2008-06-08 09:32:25
On Sun, Jun 08, 2008 at 10:17:59AM +0100, rich wrote:
> let hairy_string_of_float (d : float) =
>   let r = Obj.repr d in
>   Obj.set_tag r Obj.string_tag;
>   (Obj.obj r : string)

In fact, thinking about this a bit more clearly, there's no need to set the tag at all if all you do is call 'String.unsafe_get', so this works:

let hairy_string_of_float (d : float) =
  (Obj.magic d : string)

Be really careful that you don't let the returned "string" escape into the wild though :-) It still looks like a double to the garbage collector and it's not a well-formed string either, so any function other than 'String.unsafe_get' is highly likely to fail.

Rich.

--
Richard Jones
Red Hat
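In practice the cast would be kept local so the fake string never escapes; inside IO it could be used along these lines (a sketch; write is IO's char-writing function):

  (* sketch: the fake "string" never leaves this function; only its first
     8 bytes are read with String.unsafe_get and emitted one char at a time *)
  let write_double_hairy ch (d : float) =
    let s = (Obj.magic d : string) in
    for i = 0 to 7 do
      write ch (String.unsafe_get s i)
    done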
From: Florent M. <fmo...@li...> - 2008-06-08 19:30:09
> In fact, thinking about this a bit more clearly, there's no need to
> set the tag at all if all you do is call 'String.unsafe_get', so this
> works:
>
> let hairy_string_of_float (d : float) =
>   (Obj.magic d : string)
>
> Be really careful that you don't let the returned "string" escape into
> the wild though :-) It still looks like a double to the garbage
> collector and it's not a well-formed string either, so any function
> other than 'String.unsafe_get' is highly likely to fail.

Hi,
this hairy version is as fast as the one with the external one-liner C. Here are the elapsed times (from Sys.time) for 3_000_000 floats written to /dev/null:

  0.51 sec. -- with the external one-liner C
  0.50 sec. -- the hairy one wins, well done!

--
Cheers
Florent
From: Richard J. <ri...@an...> - 2008-06-10 08:48:54
On Sun, Jun 08, 2008 at 09:39:39PM +0200, Florent Monnier wrote:
> > In fact, thinking about this a bit more clearly, there's no need to
> > set the tag at all if all you do is call 'String.unsafe_get', so this
> > works:
> >
> > let hairy_string_of_float (d : float) =
> >   (Obj.magic d : string)
> >
> > Be really careful that you don't let the returned "string" escape into
> > the wild though :-) It still looks like a double to the garbage
> > collector and it's not a well-formed string either, so any function
> > other than 'String.unsafe_get' is highly likely to fail.
>
> Hi,
> this hairy version is as fast as the one with the external one-liner C.
> Here are the elapsed times (from Sys.time) for 3_000_000 floats written
> to /dev/null:
>
>   0.51 sec. -- with the external one-liner C
>   0.50 sec. -- the hairy one wins, well done!

That's to be expected. Obj.magic is just the %identity compiler primitive, which the compiler optimizes away completely, so the only overhead is (maybe) the function call and (maybe) the boxing up of the float argument.

Rich.

--
Richard Jones
Red Hat
From: Florent M. <fmo...@li...> - 2008-06-10 16:03:32
...
> > Hi,
> > this hairy version is as fast as the one with the external one-liner C.
> > Here are the elapsed times (from Sys.time) for 3_000_000 floats written
> > to /dev/null:
> >
> >   0.51 sec. -- with the external one-liner C
> >   0.50 sec. -- the hairy one wins, well done!
>
> That's to be expected. Obj.magic is just the %identity compiler
> primitive, which the compiler optimizes away completely, so the only
> overhead is (maybe) the function call and (maybe) the boxing up of the
> float argument.

Yes, Obj.magic / %identity just lets a value pass through type inference; it produces no assembly at all. The version with the external C should be noticeably slower, because there is a call to memcpy (while there is no copy at all in the hairy one), and moreover the C function call cannot be inlined. The reason the difference is so thin, I think, is that the hairy one makes 8 char outputs while the C one-liner makes a single string output; that point seems to balance things out. I don't see any other reason why it's so close.

So I have just tried to hack it to make a single string output. The hack is to use an unsafe_output, just as the previous hairy one used String.unsafe_get. The function unsafe_output is defined in Pervasives but is not exposed in the .mli file; however it is still possible to use it from any file if one copies its definition:

external unsafe_output : out_channel -> string -> int -> int -> unit = "caml_ml_output"

This avoids the call to string_length, which would crash on the fake string.

There is a significant gain:

  0.50 sec. -- rich's hairy one
  0.35 sec. -- with the single unsafe output
  1.34 sec. -- (for recall) the one from the current ExtLib

___________________________

To make it fit into ExtLib's IO, I have added the unsafe output to the type "output":

type 'a output = {
  mutable out_write : char -> unit;
  mutable out_output : string -> int -> int -> int;
  mutable out_output_unsafe : string -> int -> int -> int;
  mutable out_close : unit -> 'a;
  mutable out_flush : unit -> unit;
}

let output_channel ch =
  {
    out_write = (fun c -> output_char ch c);
    out_output = (fun s p l -> Pervasives.output ch s p l; l);
    out_output_unsafe = (fun s p l -> unsafe_output ch s p l; l);
    out_close = (fun () -> Pervasives.close_out ch);
    out_flush = (fun () -> Pervasives.flush ch);
  }

(* in other places I've just copy-pasted out_output to out_output_unsafe *)

let output_unsafe o s p l = o.out_output_unsafe s p l

let write_double_once ch (d : float) =
  ignore (output_unsafe ch (Obj.magic d : string) 0 8)

Maybe there could be a more elegant way...

--
With Regards
Florent
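A minimal usage example of the patched interface might look like this (illustrative only; it assumes the patched IO exposes the function as IO.write_double_once and that the output wraps a binary channel):

  (* sketch: write a couple of doubles through the single-output path *)
  let () =
    let ch = IO.output_channel (open_out_bin "doubles.bin") in
    IO.write_double_once ch 1.0;
    IO.write_double_once ch 3.14159;
    IO.close_out ch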
From: Richard J. <ri...@an...> - 2008-06-16 00:44:25
I may have missed it, but did you have an actual patch which incorporates the latest/fastest version which we could review?

Rich.

--
Richard Jones
Red Hat
From: Florent M. <fmo...@li...> - 2008-06-19 16:02:27
> I may have missed it, but did you have an actual patch which
> incorporates the latest/fastest version which we could review?

Here is a patch (3.8K) or a tarball (69K):

http://www.linux-nantes.org/~fmonnier/temp/extlib-1.5.1_write_double_once.patch
http://www.linux-nantes.org/~fmonnier/temp/extlib-1.5.1_write_double_once.tar.gz

Both are equivalent and include a simple 'test.ml' which shows the differences.

(The problem, as explained in the previous email, is that I had to add a new field to the type 'a output for the unsafe output.)

--
Florent
From: Florent M. <fmo...@li...> - 2008-08-30 14:43:52
Hi,
although this code works and can be tested to compare performance, it cannot be included as is, because of the unsafe output, which does not exist in the IO system. So could someone enlighten me about how this could be integrated elegantly?

--

Florent Monnier wrote:
> > I may have missed it, but did you have an actual patch which
> > incorporates the latest/fastest version which we could review?
>
> Here is a patch (3.8K) or a tarball (69K):
>
> http://www.linux-nantes.org/~fmonnier/temp/extlib-1.5.1_write_double_once.patch
> http://www.linux-nantes.org/~fmonnier/temp/extlib-1.5.1_write_double_once.tar.gz
>
> Both are equivalent and include a simple 'test.ml' which shows the
> differences.
>
> (The problem, as explained in the previous email, is that I had to add
> a new field to the type 'a output for the unsafe output.)
>
> --
> Florent