Re: [Ocaml-lib-devel] write_double

> If IO module is used to write out to a file, it would sound like the
> overhead of writing to output would far outweigh benefits of tighter
> assembly code for writing out doubles.  Wouldn't calling write_byte eight
> times be much more expensive than the few shift instructions?
>

Looks like the situation is more complicated than what I said above.

I wrote this simple benchmark:

let () =
  let io_out = IO.output_string () in

  for i = 0 to 65535 do
    (* The j upper bound needs to be 16 when writing out 8 bytes per double! *)
    for j = 0 to 127 do
      IO.write_double io_out (float_of_int i)
    done;
  done

which just writes floats to a string.  I verified via .s files that
IO.write_double actually gets called and doesn't get inlined.
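
For anyone wanting to reproduce the ops / sec figures without the IO module, here is a rough timing sketch.  It is not the harness used for the numbers in this message: write_double below is a hypothetical stand-in that appends to a Buffer.t (assuming OCaml 4.08 or later for Buffer.add_int64_le), and ops_per_sec is an illustrative helper name.

```ocaml
(* Hypothetical stand-in for IO.write_double: appends the 8 little-endian
   bytes of [x] to a Buffer.t.  Assumes OCaml >= 4.08. *)
let write_double buf x = Buffer.add_int64_le buf (Int64.bits_of_float x)

(* Call [f] [n] times and return calls per second of CPU time. *)
let ops_per_sec n f =
  let t0 = Sys.time () in
  for i = 0 to n - 1 do
    f i
  done;
  float_of_int n /. (Sys.time () -. t0)

let () =
  (* 65536 * 16 doubles = 8 MB of output, matching the smaller loop bound. *)
  let n = 65536 * 16 in
  let buf = Buffer.create (8 * n) in
  let rate = ops_per_sec n (fun i -> write_double buf (float_of_int i)) in
  Printf.printf "%.1f million write_doubles / sec\n" (rate /. 1e6)
```

Sys.time reports processor time, so the result should be mostly insensitive to other load on the machine, but absolute numbers will differ from the ones quoted here.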

If I implement write_i64 (which write_double uses) as a dummy function that
does nothing:

let write_i64 ch i =
  ()

the program can do 14.5 million write_doubles / sec.  On my 1.8GHz
laptop this means roughly 125 cycles per write_double.  Although not exactly
blazingly fast for what is almost a no-op, it doesn't feel completely off
given that the program converts an int to a float and allocates memory for
the boxed float on each iteration (see assembly right below):

.L103:
        call    caml_alloc2
.L107:
        leal    4(%eax), %ebx
        movl    $2301, -4(%ebx)
        movl    4(%esp), %eax
        sarl    $1, %eax
        pushl   %eax
        fildl   (%esp)
        addl    $4, %esp
        fstpl   (%ebx)
        movl    8(%esp), %eax
        call    camlIO__write_double_327
.L108:
        movl    12(%esp), %eax
        movl    %eax, %ebx
        addl    $2, %eax
        movl    %eax, 12(%esp)
        cmpl    $255, %ebx
        jne     .L103
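
The cycle estimate is simply clock rate divided by throughput.  A tiny sketch of that arithmetic (cycles_per_op is an illustrative name):

```ocaml
(* Cycles per operation = clock frequency / operations per second. *)
let cycles_per_op ~clock_hz ~ops_per_sec = clock_hz /. ops_per_sec

let () =
  (* 1.8 GHz clock, 14.5 million write_doubles / sec. *)
  Printf.printf "%.0f cycles per write_double\n"
    (cycles_per_op ~clock_hz:1.8e9 ~ops_per_sec:14.5e6)
```

This comes out at about 124, in line with the rough figure of 125 quoted above.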

Now if I go and modify the write_i64 to be only slightly more complex:

let write_i64 ch i =
  let ilo = Int64.to_int32 i in
  ()

the write_double rate drops to 9.3 million ops / sec.

Slightly complicating it again like so:

let write_i64 ch i =
  let ilo = Int64.to_int32 i in
  let ihi = Int64.to_int32 (Int64.shift_right_logical i 32) in
  ()

the performance almost halves, now at 5.3 million ops / sec.  I'd tend to
believe that this performance drop is due to more allocation, as both ilo
and ihi variables would be of type Int32.t and hence boxed 32-bit ints.
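
The boxing claim can be checked from within OCaml.  The sketch below uses Obj purely for inspection; is_boxed is an illustrative helper, and it says nothing about whether a given compiler manages to keep a particular intermediate value unboxed inside a function.

```ocaml
(* True if the value is represented as a pointer to a heap block rather
   than as an immediate (tagged) integer. *)
let is_boxed (x : 'a) : bool = Obj.is_block (Obj.repr x)

let () =
  Printf.printf "int   boxed: %b\n" (is_boxed 42);
  Printf.printf "int32 boxed: %b\n" (is_boxed (Int32.of_int 42));
  Printf.printf "int64 boxed: %b\n" (is_boxed (Int64.of_int 42));
  Printf.printf "float boxed: %b\n" (is_boxed (float_of_int 42))
```

Only the plain int is an immediate value; Int32.t, Int64.t and float values passed around generically are heap blocks.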

If I implement the write_i64 function with something that actually does
something:

let write_i64 ch i =
  let ilo = Int64.to_int32 i in
  let ihi = Int64.to_int32 (Int64.shift_right_logical i 32) in
  let s = String.create 8 in
  let ilo_nat = Int32.to_int ilo in
  s.[0] <- Char.unsafe_chr ilo_nat;
  s.[1] <- Char.unsafe_chr (ilo_nat lsr 8);
  s.[2] <- Char.unsafe_chr (ilo_nat lsr 16);
  s.[3] <- Char.unsafe_chr
    (Int32.to_int (Int32.shift_right_logical ilo 24));
  let ihi_nat = Int32.to_int ihi in
  s.[4] <- Char.unsafe_chr ihi_nat;
  s.[5] <- Char.unsafe_chr (ihi_nat lsr 8);
  s.[6] <- Char.unsafe_chr (ihi_nat lsr 16);
  s.[7] <- Char.unsafe_chr
    (Int32.to_int (Int32.shift_right_logical ihi 24));
  nwrite ch s

I now get only about 2 million ops / sec.  Next I commented out the byte
extraction code (i.e., wrote out garbage):

let write_i64 ch i =
  let ilo = Int64.to_int32 i in
  let ihi = Int64.to_int32 (Int64.shift_right_logical i 32) in
  let s = String.create 8 in
  nwrite ch s

With this change I still get roughly 2 million ops / sec (only slightly
faster).  This leads me to believe that the time is not spent doing ALU ops
but rather in the garbage collector or in data cache misses.
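
For reference, the same little-endian byte extraction can be written against Bytes, which newer OCaml releases use in place of String.create and in-place string mutation.  This is only a sketch: bytes_of_i64 is a hypothetical name, not part of the IO module, and the cross-check assumes OCaml 4.08 or later for Bytes.set_int64_le.

```ocaml
(* Hypothetical helper: encode an int64 as 8 fresh little-endian bytes,
   one byte at a time, in the spirit of the write_i64 above. *)
let bytes_of_i64 (i : int64) : Bytes.t =
  let b = Bytes.create 8 in
  for k = 0 to 7 do
    (* Shift the wanted byte down to the low 8 bits, then mask. *)
    let byte = Int64.to_int (Int64.shift_right_logical i (8 * k)) land 0xff in
    Bytes.unsafe_set b k (Char.unsafe_chr byte)
  done;
  b

let () =
  (* Cross-check against the standard library's own encoder. *)
  let i = Int64.bits_of_float 1.0 in
  let reference = Bytes.create 8 in
  Bytes.set_int64_le reference 0 i;
  assert (Bytes.equal (bytes_of_i64 i) reference)
```

The shifts are done on the Int64.t directly, so the top byte does not need the separate Int32 detour that the native int width forces in the version above.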

Feeling optimistic that I had made an optimization over the original
write_i64 by calling I/O writes less often (i.e., once as opposed to 8
times), I benchmarked the original write_i64 version.  Well, when writing to
a string output, the 8x write_byte is faster.  When writing to a real file,
the nwrite 8 version ends up slightly faster.

Isn't there a way to perform the same double_cast using some GC/OCaml object
structure magic like the Obj module?  Ideally write_double should try to
avoid calling write_i64 as well, as calling write_i64 will cause the
allocation of an Int64.t item.
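
One way to get at the bits of a double without any Obj tricks is Int64.bits_of_float.  The sketch below assumes OCaml 4.08 or later (for Buffer.add_int64_le and Bytes.get_int64_le) and writes to a Buffer.t rather than an IO channel; write_double and read_double here are hypothetical stand-ins, not the IO functions.  Whether the intermediate Int64.t is actually allocated depends on the compiler and on inlining, so this makes no claim about speed.

```ocaml
(* Hypothetical write_double over a Buffer.t: reinterpret the float's bits
   as an int64 and append them in little-endian order. *)
let write_double (buf : Buffer.t) (x : float) : unit =
  Buffer.add_int64_le buf (Int64.bits_of_float x)

(* Inverse operation, used only to round-trip the encoding. *)
let read_double (b : Bytes.t) (pos : int) : float =
  Int64.float_of_bits (Bytes.get_int64_le b pos)

let () =
  let buf = Buffer.create 16 in
  write_double buf 3.14;
  write_double buf (-0.0);
  let b = Buffer.to_bytes buf in
  assert (Bytes.length b = 16);
  assert (read_double b 0 = 3.14);
  (* Compare bit patterns so that the sign of zero is checked too. *)
  assert (Int64.bits_of_float (read_double b 8) = Int64.bits_of_float (-0.0))
```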

How fast does a write_double have to be?

Janne