From: Florent M. <fmo...@li...> - 2008-05-29 16:19:01
Hi,

It would be possible to make write_double more than twice as fast (close to 3 times faster). The tradeoff is that there is additional C code.

Cheers
From: Florent M. <fmo...@li...> - 2008-06-01 11:07:02
Florent Monnier wrote:
> Hi,
> It would be possible to make write_double more than twice as fast
> (close to 3 times faster). The tradeoff is that there is additional
> C code.
>
> Cheers

Have you decided to make the extlib with only OCaml code, without any external C?

Anyway, if you wish to test write_double_opt, grab it here:
http://www.linux-nantes.org/~fmonnier/TMP/extlib-1.5.1_write_double_opt.tar.bz2

To test, just unpack the tarball and type:

  sh test_write_double.sh

Here are the elapsed times that I get:

  0.44 sec. -- the implementation in the current ExtLib
  0.16 sec. -- proposal for inclusion in the ExtLib distribution
  0.14 sec. -- with a higher-order function

--
Cheers
Florent
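The test script itself is in the tarball; as a rough illustration of the kind of harness involved (not the actual test.ml), a timing loop like the following, using Sys.time and writing to /dev/null, is enough to produce comparable figures:

  (* illustrative sketch: time n calls to IO.write_double on an output
     that discards the data, and print the elapsed CPU time *)
  let time_writes n =
    let out = IO.output_channel (open_out_bin "/dev/null") in
    let t0 = Sys.time () in
    for _i = 1 to n do
      IO.write_double out 3.14159
    done;
    IO.close_out out;
    Printf.printf "%.2f sec.\n" (Sys.time () -. t0)

  let () = time_writes 3_000_000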
From: Amit D. <ad...@du...> - 2008-06-01 12:57:15
Hiya,

On Sun, Jun 1, 2008 at 12:15 PM, Florent Monnier <fmo...@li...> wrote:
> Have you decided to make the extlib with only OCaml code,
> without any external C?

Yes, I think that was the original idea.

Best,
-Amit
From: Richard J. <ri...@an...> - 2008-06-02 17:33:15
On Sun, Jun 01, 2008 at 01:15:28PM +0200, Florent Monnier wrote:
> Florent Monnier wrote:
> > Hi,
> > It would be possible to make write_double more than twice as fast
> > (close to 3 times faster). The tradeoff is that there is additional
> > C code.
> >
> > Cheers
>
> Have you decided to make the extlib with only OCaml code,
> without any external C?

That's the policy - to make it easier to port extlib to any platform which supports OCaml.

Is it possible to match the speed-ups using pure OCaml code? E.g. by carefully looking at the generated assembler (ocamlopt -S) and studying why it might be slow?

Rich.

--
Richard Jones
Red Hat
From: Florent M. <fmo...@li...> - 2008-06-02 18:49:16
> > Have you decided to make the extlib with only OCaml code,
> > without any external C?
>
> That's the policy - to make it easier to port extlib to any platform
> which supports OCaml.

Maybe write_double_opt could be shipped alongside extlib as a patch plus a README explaining the difference.

> Is it possible to match the speed-ups using pure OCaml code? E.g. by
> carefully looking at the generated assembler (ocamlopt -S) and
> studying why it might be slow?

Yes, I have compared the gas code of the extlib version with that of the mixed OCaml/C one, and even without it, just reading the original code makes it easy to see where the difference comes from:

* The mixed OCaml/C version copies the double into an OCaml string buffer (always the same one, so no allocation per call) with the C function memcpy, then writes the string buffer once. This is minimal, with no extra operations:

let buf_str = "01234567"
external double_cast: buf_str:string -> float -> unit = "double_cast"

let write_double_opt_native ch d =
  double_cast ~buf_str d;
  nwrite ch buf_str

CAMLprim value double_cast( value str, value d )
{
  memcpy((char *)str, (double *)d, sizeof(double));
  return Val_unit;
}
/* as you see there is only one line of C code */

(We have "Int64.bits_of_float" but no "String.buf_of_float" :,( I have tried with the Marshal module; the binary part is at the end, but for some particular floats the total number of bytes is wrong.)

; the assembly code of the C part:
double_cast:
    pushl %ebp
    movl %esp, %ebp
    movl 12(%ebp), %edx
    movl 8(%ebp), %ecx
    movl (%edx), %eax
    movl %eax, (%ecx)
    movl 4(%edx), %eax
    movl %eax, 4(%ecx)
    movl $1, %eax
    popl %ebp
    ret

; and the gas code of the OCaml part:
camlIO__write_double_opt_native_422:
    subl $4, %esp
.L557:
    movl %eax, 0(%esp)
    pushl %ebx
    pushl camlIO + 296
    movl $double_cast, %eax
    call caml_c_call
.L558:
    addl $8, %esp
    movl camlIO + 296, %ebx
    movl 0(%esp), %eax
    addl $4, %esp
    jmp camlIO__nwrite_140

* The original write_double makes a lot of shifts and conversions. Here is the list of the extra operations performed:
  - 1 x Int64.bits_of_float
  - 3 x Int64.to_int32
  - 1 x Int64.shift_right_logical
  - 4 x Int32.to_int
  - 2 x Int32.shift_right_logical
  - 4 x lsr
  - 8 x write_byte
  (There is no surprise in the related gas code, which simply lists all these operations; if you are interested I have put it at the end of this email, because it is quite long.)

* Even without additional C, it would be possible to make the implementation a bit more concise, but it barely improves the speed (only very slightly, even with 1 million calls):

let write_double_ext_native ch f =
  let bin = Int64.bits_of_float f in
  let b7 = Int64.to_int bin in
  let b6 = b7 lsr 8
  and b5 = b7 lsr 16
  and b4 = Int64.to_int (Int64.shift_right_logical bin 24) in
  let b3 = b4 lsr 8
  and b2 = b4 lsr 16
  and b1 = Int64.to_int (Int64.shift_right_logical bin 48) in
  let b0 = b1 lsr 8 in
  write_byte ch b7;
  write_byte ch b6;
  write_byte ch b5;
  write_byte ch b4;
  write_byte ch b3;
  write_byte ch b2;
  write_byte ch b1;
  write_byte ch b0

This version is also included in the test tarball provided in my previous email, and you can easily compare it with all the other implementations by adding this line to the test script 'test_write_double.sh':

  time ./test.opt /dev/null -ext

--
With Regards
Florent

--
; gas code of the current write_double:
camlIO__write_double_391:
    subl $4, %esp
.L527:
    movl %eax, 0(%esp)
    pushl %ebx
    movl $caml_int64_bits_of_float, %eax
    call caml_c_call
.L528:
    addl $4, %esp
    movl %eax, %ebx
    movl 0(%esp), %eax
    addl $4, %esp
    jmp camlIO__write_i64_385

; ....
camlIO__write_i64_385:
    subl $8, %esp
.L520:
    movl %eax, 4(%esp)
    movl %ebx, 0(%esp)
    pushl %ebx
    movl $caml_int64_to_int32, %eax
    call caml_c_call
.L521:
    addl $4, %esp
    movl %eax, %ebx
    movl 4(%esp), %eax
    call camlIO__write_real_i32_380
.L522:
    pushl $65
    movl 4(%esp), %eax
    pushl %eax
    movl $caml_int64_shift_right_unsigned, %eax
    call caml_c_call
.L523:
    addl $8, %esp
    pushl %eax
    movl $caml_int64_to_int32, %eax
    call caml_c_call
.L524:
    addl $4, %esp
    movl %eax, %ebx
    movl 4(%esp), %eax
    addl $8, %esp
    jmp camlIO__write_real_i32_380

; ...
camlIO__write_real_i32_380:
    subl $12, %esp
.L516:
    movl %eax, %ecx
    movl %ecx, 8(%esp)
    movl 4(%ebx), %eax
    sall $1, %eax
    orl $1, %eax
    movl %eax, 0(%esp)
    movl 4(%ebx), %ebx
    shrl $24, %ebx
    sall $1, %ebx
    orl $1, %ebx
    movl %ebx, 4(%esp)
    andl $511, %eax
    movl (%ecx), %ebx
    movl (%ebx), %ecx
    call *%ecx
.L517:
    movl 0(%esp), %eax
    shrl $8, %eax
    orl $1, %eax
    andl $511, %eax
    movl 8(%esp), %ebx
    movl (%ebx), %ebx
    movl (%ebx), %ecx
    call *%ecx
.L518:
    movl 0(%esp), %eax
    shrl $16, %eax
    orl $1, %eax
    andl $511, %eax
    movl 8(%esp), %ebx
    movl (%ebx), %ebx
    movl (%ebx), %ecx
    call *%ecx
.L519:
    movl 4(%esp), %eax
    andl $511, %eax
    movl 8(%esp), %ebx
    movl (%ebx), %ebx
    movl (%ebx), %ecx
    addl $12, %esp
    jmp *%ecx
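For readers without the ExtLib sources at hand: pieced together from the operation list and the assembly above, the existing code path looks roughly like the following sketch (a paraphrase, not the verbatim IO.ml source; write_byte is IO's byte writer):

  (* rough paraphrase: the float is reinterpreted as an Int64, split into
     two boxed Int32 halves, and each half is emitted as four separate
     write_byte calls *)
  let write_real_i32 ch n =
    let base = Int32.to_int n in
    let big = Int32.to_int (Int32.shift_right_logical n 24) in
    write_byte ch base;
    write_byte ch (base lsr 8);
    write_byte ch (base lsr 16);
    write_byte ch big

  let write_i64 ch n =
    write_real_i32 ch (Int64.to_int32 n);
    write_real_i32 ch (Int64.to_int32 (Int64.shift_right_logical n 32))

  let write_double ch f =
    write_i64 ch (Int64.bits_of_float f)

Under that reading, each call allocates a boxed Int64 plus two boxed Int32 values before a single byte is written, on top of the eight write_byte calls.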
From: Janne H. <jjh...@gm...> - 2008-06-02 22:05:38
> > Is it possible to match the speed-ups using pure OCaml code? E.g. by
> > carefully looking at the generated assembler (ocamlopt -S) and
> > studying why it might be slow?
>
> Yes, I have compared the gas code of the extlib version with that of
> the mixed OCaml/C one, and even without it, just reading the original
> code makes it easy to see where the difference comes from:

If the IO module is used to write out to a file, I would expect the overhead of writing to the output to far outweigh the benefit of tighter assembly code for writing out doubles. Wouldn't calling write_byte eight times be much more expensive than the few shift instructions?

It also looks like your C/OCaml implementation with the OCaml-side string is not thread safe. Perhaps this doesn't happen with the current OCaml run-time, but it looks like if two threads entered double_cast at the same time, you'd corrupt buf_str:

let buf_str = "01234567"
external double_cast: buf_str:string -> float -> unit = "double_cast"

Janne
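One illustrative way to keep the single shared buffer yet avoid the race would be to serialize access with a lock from the threads library, though the locking cost might well cancel the speedup; a sketch, assuming it lives inside IO next to nwrite like the original:

  (* sketch: one global 8-byte buffer protected by a mutex; the buffer is
     only touched while the lock is held *)
  let buf_str = "01234567"
  let buf_lock = Mutex.create ()

  external double_cast : buf_str:string -> float -> unit = "double_cast"

  let write_double_opt_locked ch d =
    Mutex.lock buf_lock;
    (try
       double_cast ~buf_str d;
       nwrite ch buf_str
     with e -> Mutex.unlock buf_lock; raise e);
    Mutex.unlock buf_lock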
From: Janne H. <jjh...@gm...> - 2008-06-02 23:37:45
> If the IO module is used to write out to a file, I would expect the
> overhead of writing to the output to far outweigh the benefit of tighter
> assembly code for writing out doubles. Wouldn't calling write_byte eight
> times be much more expensive than the few shift instructions?

The situation looks more complicated than what I said above. I wrote this simple benchmark:

let () =
  let io_out = IO.output_string () in
  for i = 0 to 65535 do
    for j = 0 to 127 do
      (* j upper bound needs to be 16 when writing out 8 bytes per double! *)
      IO.write_double io_out (float_of_int i)
    done;
  done

which just writes floats to a string. I verified via the .s files that IO.write_double actually gets called and doesn't get inlined.

If I implement write_i64 (which write_double uses) as a dummy function that does nothing:

let write_i64 ch i = ()

the program can write out 14.5 million write_doubles per second. On my 1.8 GHz laptop this means roughly 125 cycles per write_double. Although not exactly blazingly fast for almost a no-op, it doesn't feel completely off, given that the program does a lot of conversions from ints to floats and allocates a boxed float on each iteration (see the assembly right below):

.L103:
    call caml_alloc2
.L107:
    leal 4(%eax), %ebx
    movl $2301, -4(%ebx)
    movl 4(%esp), %eax
    sarl $1, %eax
    pushl %eax
    fildl (%esp)
    addl $4, %esp
    fstpl (%ebx)
    movl 8(%esp), %eax
    call camlIO__write_double_327
.L108:
    movl 12(%esp), %eax
    movl %eax, %ebx
    addl $2, %eax
    movl %eax, 12(%esp)
    cmpl $255, %ebx
    jne .L103

Now if I go and modify write_i64 to be only slightly more complex:

let write_i64 ch i =
  let ilo = Int64.to_int32 i in
  ()

the write_double rate drops to 9.3 million ops per second. Slightly complicating it again, like so:

let write_i64 ch i =
  let ilo = Int64.to_int32 i in
  let ihi = Int64.to_int32 (Int64.shift_right_logical i 32) in
  ()

the performance almost halves, now at 5.3 million ops per second. I tend to believe that this performance drop is due to more allocation, as both the ilo and ihi variables are of type Int32.t and hence boxed 32-bit ints.

If I implement the write_i64 function with something that actually does something:

let write_i64 ch i =
  let ilo = Int64.to_int32 i in
  let ihi = Int64.to_int32 (Int64.shift_right_logical i 32) in
  let s = String.create 8 in
  let ilo_nat = Int32.to_int ilo in
  s.[0] <- Char.unsafe_chr ilo_nat;
  s.[1] <- Char.unsafe_chr (ilo_nat lsr 8);
  s.[2] <- Char.unsafe_chr (ilo_nat lsr 16);
  s.[3] <- Char.unsafe_chr (Int32.to_int (Int32.shift_right_logical ilo 24));
  let ihi_nat = Int32.to_int ihi in
  s.[4] <- Char.unsafe_chr ihi_nat;
  s.[5] <- Char.unsafe_chr (ihi_nat lsr 8);
  s.[6] <- Char.unsafe_chr (ihi_nat lsr 16);
  s.[7] <- Char.unsafe_chr (Int32.to_int (Int32.shift_right_logical ihi 24));
  nwrite ch s

I now get only about 2 million ops per second. Commenting out the byte-extraction code (i.e., writing out garbage):

let write_i64 ch i =
  let ilo = Int64.to_int32 i in
  let ihi = Int64.to_int32 (Int64.shift_right_logical i 32) in
  let s = String.create 8 in
  nwrite ch s

With this change, I still get roughly 2 million ops per second (only slightly faster). This leads me to believe that the time is not spent in ALU ops but rather in the garbage collector or in data cache misses.

Feeling optimistic that I had made an optimization over the original write_i64 by calling I/O writes less often (i.e., once as opposed to 8 times), I benchmarked the original write_i64 version. Well, when writing to a string output, the 8x write_byte version is faster. When writing to a real file, the single 8-byte nwrite version ends up slightly faster.

Isn't there a way to perform the same double_cast using some GC/OCaml object-structure magic like the Obj module? Ideally write_double should also try to avoid calling write_i64, as calling write_i64 causes the allocation of an Int64.t item.

How fast does a write_double have to be?

Janne
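One way to test the allocation hypothesis directly would be to sample the GC counters around the loop; a sketch (Gc.allocated_bytes reports the bytes allocated by the program so far):

  (* sketch: measure average bytes allocated per call of some function f *)
  let allocation_per_call f n =
    let before = Gc.allocated_bytes () in
    for _i = 1 to n do f () done;
    (Gc.allocated_bytes () -. before) /. float_of_int n

  let () =
    let io_out = IO.output_string () in
    let bytes =
      allocation_per_call (fun () -> IO.write_double io_out 12345.678) 100_000
    in
    Printf.printf "%.1f bytes allocated per write_double\n" bytes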
From: Florent M. <fmo...@li...> - 2008-06-03 18:29:38
> With this change, I still get roughly 2 million ops per second (only
> slightly faster). This leads me to believe that the time is not spent
> in ALU ops but rather in the garbage collector or in data cache misses.

Have you tried my write_double_opt version in your benchmarks, to compare? How big was the difference? If the time is spent mostly in allocation and the GC, that also explains why write_double_opt is 3 times faster: it doesn't allocate anything.

> Isn't there a way to perform the same double_cast using some GC/OCaml
> object-structure magic like the Obj module?

I haven't found any OCaml function equivalent to the C function memcpy(), even in the Obj module.

> How fast does a write_double have to be?

I happily admit that this is a geekish issue :) but we are geeks and love trying to optimise code ;-)
More seriously, if I had found a speedup of only 5% or 10% I wouldn't have bothered you. I wrote because an improvement of close to 3 times faster seemed worthwhile. With a String.create (to make it thread safe), the improvement is only about 2.5 times.

--
With Regards
Florent
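For reference, the String.create variant mentioned above would presumably look something like this (a sketch; one fresh 8-byte buffer per call, thread safe at the cost of a small allocation):

  (* sketch: allocate a private buffer on each call instead of sharing one;
     double_cast is the same one-line C stub as before *)
  external double_cast : buf_str:string -> float -> unit = "double_cast"

  let write_double_opt_safe ch d =
    let buf_str = String.create 8 in
    double_cast ~buf_str d;
    nwrite ch buf_str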
From: Richard J. <ri...@an...> - 2008-06-08 09:18:11
On Mon, Jun 02, 2008 at 08:58:14PM +0200, Florent Monnier wrote:
> > > Have you decided to make the extlib with only OCaml code,
> > > without any external C?
> >
> > That's the policy - to make it easier to port extlib to any platform
> > which supports OCaml.
>
> Maybe write_double_opt could be shipped alongside extlib as a patch
> plus a README explaining the difference.

It may be possible to have optimized versions of extlib functions which are only switched in if they can be compiled, with OCaml versions as backups. I'll see what others think about that, though.

> > Is it possible to match the speed-ups using pure OCaml code? E.g. by
> > carefully looking at the generated assembler (ocamlopt -S) and
> > studying why it might be slow?
>
> Yes, I have compared the gas code of the extlib version with that of
> the mixed OCaml/C one, and even without it, just reading the original
> code makes it easy to see where the difference comes from:

This is the code we're discussing:

CAMLprim value double_cast( value str, value d )
{
  memcpy((char *)str, (double *)d, sizeof(double));
  return Val_unit;
}

external double_cast: buf_str:string -> float -> unit = "double_cast"

let buf_str = "01234567"

let write_double_opt_native ch d =
  double_cast ~buf_str d;
  nwrite ch buf_str

As was pointed out in another reply, I don't think this is thread safe. Anyhow, you can get the same effect in pure OCaml with this hairy bit of Obj magic:

open Printf

let hairy_string_of_float (d : float) =
  let r = Obj.repr d in
  Obj.set_tag r Obj.string_tag;
  (Obj.obj r : string)

let print_bytes s n =
  for i = 0 to n-1 do
    printf "%02x " (Char.code (String.unsafe_get s i))
  done;
  printf "\n"

let () =
  print_bytes (hairy_string_of_float 1.0) 8;
  print_bytes (hairy_string_of_float 3.0) 8;
  print_bytes (hairy_string_of_float 1e38) 8

Output:

00 00 00 00 00 00 f0 3f
00 00 00 00 00 00 08 40
b1 a1 16 2a d3 ce d2 47

Note that the string returned from hairy_string_of_float isn't a well-formed OCaml string, so it's not safe to call anything except String.unsafe_get on it. E.g. functions such as String.length will definitely fail.

I haven't tested the performance, but I did look at the assembly code. On my x86-64 it's unfortunate that the compiler didn't inline the call to Obj.set_tag (instead it's a C call, even though the C function is a two-liner). You could probably replace it with a call to String.unsafe_set at a negative offset to modify the tag directly, and with luck the generated code should be faster than your C implementation.

Rich.

--
Richard Jones
Red Hat
From: Richard J. <ri...@an...> - 2008-06-08 09:32:25
On Sun, Jun 08, 2008 at 10:17:59AM +0100, rich wrote:
> let hairy_string_of_float (d : float) =
>   let r = Obj.repr d in
>   Obj.set_tag r Obj.string_tag;
>   (Obj.obj r : string)

In fact, thinking about this a bit more clearly, there's no need to set the tag at all if all you do is call 'String.unsafe_get', so this works:

let hairy_string_of_float (d : float) =
  (Obj.magic d : string)

Be really careful that you don't let the returned "string" escape into the wild though :-) It still looks like a double to the garbage collector and it's not a well-formed string either, so any function other than 'String.unsafe_get' is highly likely to fail.

Rich.

--
Richard Jones
Red Hat
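In practice the cast would be kept local so the fake string never escapes; inside IO it could be used along these lines (a sketch; write is IO's char-writing function):

  (* sketch: the fake "string" never leaves this function; only its first
     8 bytes are read with String.unsafe_get and emitted one char at a time *)
  let write_double_hairy ch (d : float) =
    let s = (Obj.magic d : string) in
    for i = 0 to 7 do
      write ch (String.unsafe_get s i)
    done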
From: Florent M. <fmo...@li...> - 2008-06-08 19:30:09
> In fact, thinking about this a bit more clearly, there's no need to
> set the tag at all if all you do is call 'String.unsafe_get', so this
> works:
>
> let hairy_string_of_float (d : float) =
>   (Obj.magic d : string)
>
> Be really careful that you don't let the returned "string" escape into
> the wild though :-) It still looks like a double to the garbage
> collector and it's not a well-formed string either, so any function
> other than 'String.unsafe_get' is highly likely to fail.

Hi,
this hairy version is as fast as the one with the external one-liner C. Here are the elapsed times (from Sys.time) for 3_000_000 floats written to /dev/null:

  0.51 sec. -- with the external one-liner C
  0.50 sec. -- the hairy one wins, well done!

--
Cheers
Florent
From: Richard J. <ri...@an...> - 2008-06-10 08:48:54
On Sun, Jun 08, 2008 at 09:39:39PM +0200, Florent Monnier wrote:
> > In fact, thinking about this a bit more clearly, there's no need to
> > set the tag at all if all you do is call 'String.unsafe_get', so this
> > works:
> >
> > let hairy_string_of_float (d : float) =
> >   (Obj.magic d : string)
> >
> > Be really careful that you don't let the returned "string" escape into
> > the wild though :-) It still looks like a double to the garbage
> > collector and it's not a well-formed string either, so any function
> > other than 'String.unsafe_get' is highly likely to fail.
>
> Hi,
> this hairy version is as fast as the one with the external one-liner C.
> Here are the elapsed times (from Sys.time) for 3_000_000 floats written
> to /dev/null:
>
>   0.51 sec. -- with the external one-liner C
>   0.50 sec. -- the hairy one wins, well done!

That's to be expected. Obj.magic is just the %identity compiler primitive, which the compiler optimizes away completely, so the only overhead is (maybe) the function call and (maybe) the boxing up of the float argument.

Rich.

--
Richard Jones
Red Hat
From: Florent M. <fmo...@li...> - 2008-06-10 16:03:32
...
> > Hi,
> > this hairy version is as fast as the one with the external one-liner C.
> > Here are the elapsed times (from Sys.time) for 3_000_000 floats written
> > to /dev/null:
> >
> >   0.51 sec. -- with the external one-liner C
> >   0.50 sec. -- the hairy one wins, well done!
>
> That's to be expected. Obj.magic is just the %identity compiler
> primitive, which the compiler optimizes away completely, so the only
> overhead is (maybe) the function call and (maybe) the boxing up of the
> float argument.

Yes, Obj.magic / %identity just lets a value pass through type inference; it produces no assembly at all. The version with the external C should be noticeably slower, because there is a call to memcpy (while there is no copy at all in the hairy one), and moreover the C function call cannot be inlined. The reason the difference is so thin, I think, is that the hairy one makes 8 char outputs while the C one-liner makes a single string output; that point seems to balance things out. I don't see any other reason why it's so close.

So I have just tried to hack it to make a single string output. The hack is to use an unsafe_output, just as the previous hairy one used String.unsafe_get. The function unsafe_output is defined in Pervasives but is not exposed in the .mli file; however it is still possible to use it from any file if one copies its definition:

external unsafe_output : out_channel -> string -> int -> int -> unit = "caml_ml_output"

This avoids the call to string_length, which would crash on the fake string.

There is a significant gain:

  0.50 sec. -- rich's hairy one
  0.35 sec. -- with the single unsafe output
  1.34 sec. -- (for recall) the one from the current ExtLib

___________________________

To make it fit into ExtLib's IO, I have added the unsafe output to the type "output":

type 'a output = {
  mutable out_write : char -> unit;
  mutable out_output : string -> int -> int -> int;
  mutable out_output_unsafe : string -> int -> int -> int;
  mutable out_close : unit -> 'a;
  mutable out_flush : unit -> unit;
}

let output_channel ch =
  {
    out_write = (fun c -> output_char ch c);
    out_output = (fun s p l -> Pervasives.output ch s p l; l);
    out_output_unsafe = (fun s p l -> unsafe_output ch s p l; l);
    out_close = (fun () -> Pervasives.close_out ch);
    out_flush = (fun () -> Pervasives.flush ch);
  }

(* in other places I've just copy-pasted out_output to out_output_unsafe *)

let output_unsafe o s p l = o.out_output_unsafe s p l

let write_double_once ch (d : float) =
  ignore (output_unsafe ch (Obj.magic d : string) 0 8)

Maybe there could be a more elegant way...

--
With Regards
Florent
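A minimal usage example of the patched interface might look like this (illustrative only; it assumes the patched IO exposes the function as IO.write_double_once and that the output wraps a binary channel):

  (* sketch: write a couple of doubles through the single-output path *)
  let () =
    let ch = IO.output_channel (open_out_bin "doubles.bin") in
    IO.write_double_once ch 1.0;
    IO.write_double_once ch 3.14159;
    IO.close_out ch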
From: Richard J. <ri...@an...> - 2008-06-16 00:44:25
I may have missed it, but did you have an actual patch which incorporates the latest/fastest version which we could review?

Rich.

--
Richard Jones
Red Hat
From: Florent M. <fmo...@li...> - 2008-06-19 16:02:27
> I may have missed it, but did you have an actual patch which
> incorporates the latest/fastest version which we could review?

Here is a patch (3.8K) or a tarball (69K):

http://www.linux-nantes.org/~fmonnier/temp/extlib-1.5.1_write_double_once.patch
http://www.linux-nantes.org/~fmonnier/temp/extlib-1.5.1_write_double_once.tar.gz

Both are equivalent and include a simple 'test.ml' which shows the differences.

(The problem, as explained in the previous email, is that I had to add a new field to the type 'a output for the unsafe output.)

--
Florent
From: Florent M. <fmo...@li...> - 2008-08-30 14:43:52
Hi,
although this code works and can be tested to compare performance, it cannot be included as is, because of the unsafe output, which does not exist in the IO system. So could someone enlighten me about how this could be integrated elegantly?

--

Florent Monnier wrote:
> > I may have missed it, but did you have an actual patch which
> > incorporates the latest/fastest version which we could review?
>
> Here is a patch (3.8K) or a tarball (69K):
>
> http://www.linux-nantes.org/~fmonnier/temp/extlib-1.5.1_write_double_once.patch
> http://www.linux-nantes.org/~fmonnier/temp/extlib-1.5.1_write_double_once.tar.gz
>
> Both are equivalent and include a simple 'test.ml' which shows the
> differences.
>
> (The problem, as explained in the previous email, is that I had to add
> a new field to the type 'a output for the unsafe output.)
>
> --
> Florent