If the IO module is used to write out to a file, it would seem that the overhead of writing the output would far outweigh the benefit of tighter assembly code for writing out doubles.  Wouldn't calling write_byte eight times be much more expensive than a few shift instructions?

It looks like the situation is more complicated than what I said above.

I wrote this simple benchmark:

let () =
  let io_out = IO.output_string () in
  for i = 0 to 65535 do
    for _j = 0 to 127 do (* _j's upper bound needs to be 16 when writing out 8 bytes per double! *)
      IO.write_double io_out (float_of_int i)
    done
  done

which just writes floats to a string.  I verified via the generated .s files that IO.write_double actually gets called and doesn't get inlined.

If I implement write_i64 (which write_double uses) as a dummy function that does nothing:

let write_i64 _ch _i = ()

the program can write out 14.5 million write_doubles / sec.  On my 1.8 GHz laptop this means roughly 125 cycles per write_double.  Although not exactly blazingly fast for what is almost a no-op, it doesn't feel completely off, given that the program converts an int to a float and allocates a boxed float on every iteration (see the assembly right below):

        call    caml_alloc2
        leal    4(%eax), %ebx
        movl    $2301, -4(%ebx)
        movl    4(%esp), %eax
        sarl    $1, %eax
        pushl   %eax
        fildl   (%esp)
        addl    $4, %esp
        fstpl   (%ebx)
        movl    8(%esp), %eax
        call    camlIO__write_double_327
        movl    12(%esp), %eax
        movl    %eax, %ebx
        addl    $2, %eax
        movl    %eax, 12(%esp)
        cmpl    $255, %ebx
        jne     .L103

Now if I modify write_i64 to be only slightly more complex:

let write_i64 _ch i =
  let ilo = Int64.to_int32 i in
  ignore ilo

the write_double rate drops to 9.3 million ops / sec.

Slightly complicating it again like so:

let write_i64 _ch i =
  let ilo = Int64.to_int32 i in
  let ihi = Int64.to_int32 (Int64.shift_right_logical i 32) in
  ignore ilo; ignore ihi

the performance drops by almost half again, now at 5.3 million ops / sec.  I tend to believe this drop is due to more allocation, as both ilo and ihi are of type Int32.t and hence boxed 32-bit ints.
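As a quick sanity check on the boxing claim, the Obj module can show that an Int32.t is a heap-allocated custom block while a plain int is an immediate value.  A minimal sketch (inspection only, no unsafe casts; not part of the benchmark):

```ocaml
(* Sketch: Int32.t values are boxed custom blocks on the heap,
   while plain ints are immediate and cost no allocation. *)
let () =
  let boxed = Int32.of_int 42 in
  assert (Obj.is_block (Obj.repr boxed));             (* lives on the heap *)
  assert (Obj.tag (Obj.repr boxed) = Obj.custom_tag); (* a custom block *)
  assert (Obj.is_int (Obj.repr 42));                  (* plain int: immediate *)
  print_endline "Int32.t is boxed; int is immediate"
```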

If I implement the write_i64 function with code that actually does the work:

let write_i64 ch i =
  let ilo = Int64.to_int32 i in
  let ihi = Int64.to_int32 (Int64.shift_right_logical i 32) in
  let s = String.create 8 in
  let ilo_nat = Int32.to_int ilo in
  s.[0] <- Char.unsafe_chr ilo_nat;
  s.[1] <- Char.unsafe_chr (ilo_nat lsr 8);
  s.[2] <- Char.unsafe_chr (ilo_nat lsr 16);
  (* top byte is extracted on the Int32 side: on a 32-bit platform the
     native int has only 31 bits, so bit 31 would be lost in ilo_nat *)
  s.[3] <- Char.unsafe_chr (Int32.to_int (Int32.shift_right_logical ilo 24));
  let ihi_nat = Int32.to_int ihi in
  s.[4] <- Char.unsafe_chr ihi_nat;
  s.[5] <- Char.unsafe_chr (ihi_nat lsr 8);
  s.[6] <- Char.unsafe_chr (ihi_nat lsr 16);
  s.[7] <- Char.unsafe_chr (Int32.to_int (Int32.shift_right_logical ihi 24));
  nwrite ch s

I now get only about 2 million ops / sec.  If I comment out the byte-extraction code (i.e., write out garbage):

let write_i64 ch i =
  let ilo = Int64.to_int32 i in
  let ihi = Int64.to_int32 (Int64.shift_right_logical i 32) in
  ignore ilo; ignore ihi; (* keep the allocations, skip the byte extraction *)
  let s = String.create 8 in
  nwrite ch s

with this change, I still get roughly 2 million ops / sec (only slightly faster).  This leads me to believe that the time is spent not in ALU ops but rather in the garbage collector or in data-cache misses.
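The GC half of that hypothesis can be tested directly by counting minor-heap words allocated per call.  A sketch, where words_per_call is a hypothetical helper (Gc.minor_words needs a reasonably recent OCaml; older versions can use (Gc.quick_stat ()).Gc.minor_words instead):

```ocaml
(* Sketch: measure minor-heap allocation per call of a function.
   The count includes block headers and assumes the compiler does not
   optimize the allocation away. *)
let words_per_call f n =
  let before = Gc.minor_words () in
  for _i = 1 to n do f () done;
  (Gc.minor_words () -. before) /. float_of_int n

let () =
  (* each Int64.to_int32 here should allocate one boxed Int32.t *)
  let w = words_per_call (fun () -> ignore (Int64.to_int32 123456789L)) 100_000 in
  Printf.printf "%.1f words allocated per call\n" w
```

Comparing this number across the write_i64 variants above would show whether the throughput drops track allocation volume.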

Feeling optimistic that I had improved on the original write_i64 by issuing I/O writes less often (i.e., once instead of eight times), I benchmarked the original write_i64 version.  Well: when writing to a string output, the eight-write_byte version is faster; when writing to a real file, the single-nwrite version ends up slightly faster.

Isn't there a way to perform the same double_cast using some GC/OCaml object-representation magic like the Obj module?  Ideally write_double should also try to avoid calling write_i64, as calling write_i64 causes the allocation of an Int64.t.
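For the cast itself, no Obj magic should be needed: the standard library's Int64.bits_of_float returns the IEEE-754 bit pattern directly (presumably what write_double already uses), though the Int64.t it returns is still boxed.  A sketch that at least skips the two intermediate Int32.t boxes by shifting the int64 and truncating with Int64.to_int, using a Buffer as a stand-in for the IO output (on a 32-bit platform Int64.to_int keeps only the low 31 bits, which is fine here since the land 0xff mask only needs the low byte):

```ocaml
(* Sketch: serialize a float little-endian via its IEEE-754 bit pattern,
   without going through Int32.t. *)
let write_double_bits buf f =
  let bits = Int64.bits_of_float f in
  for k = 0 to 7 do
    (* shift byte k down, truncate to a native int, mask to one byte *)
    let byte = Int64.to_int (Int64.shift_right_logical bits (8 * k)) land 0xff in
    Buffer.add_char buf (Char.unsafe_chr byte)
  done

let () =
  let b = Buffer.create 8 in
  write_double_bits b 1.0;
  (* 1.0 is 0x3FF0000000000000, so little-endian bytes are 00..00 f0 3f *)
  assert (Buffer.contents b = "\x00\x00\x00\x00\x00\x00\xf0\x3f")
```

Note the caveat: each Int64 shift can still allocate a boxed intermediate unless the compiler unboxes it, so this removes the Int32.t pairs but does not fully answer the allocation concern.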

How fast does a write_double have to be?