From: Charles R H. <cha...@gm...> - 2006-09-19 21:21:16
|
On 9/19/06, Tim Hochberg <tim...@ie...> wrote: > > Charles R Harris wrote: > > > > > > On 9/19/06, *Tim Hochberg* <tim...@ie... > > <mailto:tim...@ie...>> wrote: > > > > A. M. Archibald wrote: > > > On 19/09/06, Tim Hochberg <tim...@ie... > > <mailto:tim...@ie...>> wrote: > > > > > >> Keith Goodman wrote: > > >> > > >>> In what order would you like argsort to sort the values -inf, > > nan, inf? > > >>> > > >>> > > >> Ideally, -inf should sort first, inf should sort last and nan > > should > > >> raise an exception if present. > > >> > > >> -tim > > >> > > > > > > Mmm. Somebody who's working with NaNs has more or less already > > decided > > > they don't want to be pestered with exceptions for invalid data. > > Do you really think so? In my experience NaNs are nearly always > > just an > > indication of a mistake somewhere that didn't get trapped for one > > reason > > or another. > > > > > I'd > > > be happy if they wound up at either end, but I'm not sure it's > worth > > > hacking up the sort algorithm when a simple isnan() can pull > > them out. > > > > > Moving them to the end seems to be the worst choice to me. Leaving > > them > > alone is fine with me. Or raising an exception would be fine. Or > doing > > one or the other depending on the error mode settings would be even > > better if it is practical. > > > > > What's happening now is that NaN<a, NaN==a, and NaN>a are all > > false, > > > which rather confuses the sorting algorithm. But as long as it > > doesn't > > > actually *break* (that is, leave some of the non-NaNs incorrectly > > > sorted) I don't care. > > > > > Is that true? Are all of numpy's sorting algorithms robust against > > nontransitive objects laying around? The answer to that appears to > be > > no. Try running this a couple of times to see what I mean: > > > > n = 10 > > for i in range(10): > > for kind in 'quicksort', 'heapsort', 'mergesort': > > a = rand(n) > > b = a.copy() > > a[n//2] = nan > > a.sort(kind=kind) > > b.sort(kind=kind) > > assert alltrue(a[:n//2] == b[:n//2]), (kind, a, b) > > > > > > The values don't correctly cross the inserted NaN and the sort is > > incorrect. > > > > > > Except for heapsort those are doing insertion sorts because n is so > > small. Heapsort is trickier to understand because of the way the heap > > is formed and values pulled off. > I'm not sure where the breakpoint is, but I was seeing failures for all > three sort types with N as high as 10000. I suspect that they're all > broken in the presence of NaNs. I further suspect you'd need some > punishingly slow n**2 algorithm to be robust in the presence of NaNs. Quicksort and Mergesort are divide and conquer types. When the pieces get below a certain length they finish up with insertion sort as it is more efficient for small bits. In numpy the breakover points are 15 and 20 respectively. I believe microsoft did a project on this and ended up with a number somewhere around 7, but I didn't see that when doing the tuning for numarray way back when. Not that I did a *lot* of careful tuning work ;) > I'll look into that. I bet searchsorted gets messed up by nan's. Do > > you think it worthwhile to worry about it? > > No. But that's because it's my contention is that any sequence with NaNs > in it is *not sorted* and thus isn't suitable input for searchsorted. If this sort of thing can cause unexpected errors I wonder if it would be worth it to have a global debugging flag that essentially causes isnan to be called before any function applications. Chuck |