> Is it possible to match the speed-ups using pure OCaml code?  eg.  by
> carefully looking at the generated assembler (ocamlopt -S) and
> studying why it might be slow?

Yes. I have compared the gas output of the extlib version with that of the
mixed OCaml/C one, and even without that, just reading the original code makes
it easy to see where the difference comes from:

If the IO module is used to write to a file, it sounds like the overhead of the output itself would far outweigh the benefit of tighter assembly code for writing out doubles.  Wouldn't calling write_byte eight times be much more expensive than the few shift instructions?
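
To make the question concrete, here is a minimal sketch of the kind of pure OCaml code being discussed: the float is turned into its 64-bit pattern, and each byte is extracted with a shift and a mask and passed to a per-byte writer.  The names write_double_bytes and encode_double are hypothetical and this is not extlib's actual source; the least-significant-byte-first order is an assumption, and a Buffer stands in for the real output.

```ocaml
(* Hypothetical helper, not extlib's code: emit the 8 bytes of a float,
   least significant byte first (assumed order), through a callback. *)
let write_double_bytes (write_byte : int -> unit) (x : float) : unit =
  let bits = Int64.bits_of_float x in
  for i = 0 to 7 do
    (* One shift and one mask per byte, then one write_byte call. *)
    let shifted = Int64.shift_right_logical bits (8 * i) in
    write_byte (Int64.to_int (Int64.logand shifted 0xFFL))
  done

(* Collect the bytes in a Buffer, standing in for a real output. *)
let encode_double (x : float) : string =
  let buf = Buffer.create 8 in
  write_double_bytes (fun b -> Buffer.add_char buf (Char.chr b)) x;
  Buffer.contents buf
```

Each double costs eight Int64 shift/mask operations plus eight calls to the writer, which is the per-byte overhead the question is about.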

It looks like your C/OCaml implementation with the OCaml-side string is not thread safe?  Perhaps this cannot happen with the current OCaml run-time, but if two threads entered double_cast at the same time, they would corrupt buf_str?

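(* 8-byte scratch string, shared by every caller of double_cast *)
let buf_str = "01234567"
(* C stub; it is given buf_str and the float, and returns unit *)
external double_cast: buf_str:string -> float -> unit = "double_cast"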
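
One way to sidestep the shared buffer, sketched here in pure OCaml rather than with the C stub, is to allocate a fresh 8-byte buffer on every call so that no two callers ever touch the same storage.  The name double_to_bytes is hypothetical, the little-endian order is an assumption, and Bytes.set_int64_le requires OCaml 4.08 or later; whether this matches the speed of the C version is not something this sketch shows.

```ocaml
(* Hypothetical alternative to a global scratch string: each call owns
   its buffer, so concurrent callers cannot corrupt each other's data. *)
let double_to_bytes (x : float) : bytes =
  let buf = Bytes.create 8 in
  (* Store the IEEE 754 bit pattern, little-endian (assumed order). *)
  Bytes.set_int64_le buf 0 (Int64.bits_of_float x);
  buf
```

The price is one small allocation per double, which is the cost the global buf_str was presumably meant to avoid.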