From: Jeff E. <je...@un...> - 2011-07-27 13:00:05
Having thought about this some more, I think your suggestion to have rtapi_clocks_to_ns (and possibly rtapi_ns_to_clocks) makes sense. Encouraging the use of delta times mitigates any rollovers that may be inherent in the ns<->clock conversions.

Computing nanosecond time from the TSC suffers a discontinuity at least when rdtsc() wraps, but now I think the RTAI implementation may have a discontinuity much more frequently: every time (int)rdtsc() wraps. The comment on llimd says that it /* Returns (long long)ll = (int)ll*(int)(mult)/(int)div. */, so the discontinuity actually happens whenever the TSC crosses a 2^31 (2^32?) boundary, not only when the 64-bit quantity wraps around back to 0.

Better would be a routine that takes u64 a, u32 b, and u8 s, and calculates the lower 64 bits of the arbitrary-precision (a * b) >> s. gcc is able to generate efficient code for this on x86 (two integer multiplies, about 21 cycles per invocation in a tight loop on a Core 2 CPU). This algorithm should have a discontinuity only at full TSC rollover, not at 32-bit rollovers. It's also faster by a factor of 10 or so than today's RTAI implementation.
The code (note: I fixed the doc comment to say ullms rather than ullms32, and capped s at 32 in get_scale_factor so that ullms's s <= 32 precondition holds even for small num/denom ratios, e.g. tsc2ns on CPUs faster than 2 GHz):

//----------------------------------------------------------------------
static inline uint64_t mul_32x32_64(uint32_t a, uint32_t b)
    __attribute__((always_inline));

static inline uint64_t mul_32x32_64(uint32_t a, uint32_t b)
{
    /* gcc is able to do this with a single 32x32 -> 64 multiply on x86 */
    return ((uint64_t)a) * b;
}

/**
 * Compute the lower 64 bits of '(a * b) >> s', s <= 32.
 * The temporary (a*b) is 96 bits, not truncated to 64 bits.
 */
static inline uint64_t ullms(uint64_t a, uint32_t b, uint8_t s)
{
    uint32_t hi = (a >> 32), lo = a & UINT32_C(0xffffffff);
    uint64_t mul_hi = mul_32x32_64(hi, b), mul_lo = mul_32x32_64(lo, b);
    return (mul_hi << (32 - s)) + (mul_lo >> s);
}

/**
 * b = get_scale_factor(num, denom, &s):
 * Compute 'b' and 's' so that ullms(a,b,s) is approximately (a * num / denom).
 *
 * When using the same num and denom repeatedly, this is much more
 * efficient than an implementation that actually performs the
 * division.  (In 2011 on x86, a single integer division still takes
 * about 10x the time of a single multiplication.)
 *
 * However, get_scale_factor itself is not particularly efficient (this
 * implementation uses fp arithmetic), so it should only be used to
 * compute b and s for "constant" num/denom pairs.
 */
uint32_t get_scale_factor(uint32_t num, uint32_t denom, uint8_t *scale)
{
    double d = (double) num / denom;
    uint8_t s = 0;
    /* cap s at 32 so ullms()'s '32 - s' shift stays in range */
    while (d < 2147483647 && s < 32) { d *= 2; s++; }
    *scale = s;
    return (uint32_t)(round(d));
}
//----------------------------------------------------------------------

Then the rtapi code would look like this:

//----------------------------------------------------------------------
// globals to rtapi.ko
uint32_t tsc2ns_factor, ns2tsc_factor;
uint8_t tsc2ns_shift, ns2tsc_shift;

// somewhere in setup code
{
    tsc2ns_factor = get_scale_factor(1000000, cpu_khz, &tsc2ns_shift);
    ns2tsc_factor = get_scale_factor(cpu_khz, 1000000, &ns2tsc_shift);
}

uint64_t rtapi_clocks_to_ns(uint64_t clocks)
{
    return ullms(clocks, tsc2ns_factor, tsc2ns_shift);
}

uint64_t rtapi_ns_to_clocks(uint64_t ns)
{
    return ullms(ns, ns2tsc_factor, ns2tsc_shift);
}
//----------------------------------------------------------------------

Jeff