From: Jeremy F. <je...@go...> - 2002-10-23 23:57:17
On Wed, 2002-10-23 at 15:58, Julian Seward wrote:

> A busy day at goop.org, I see.  Where in the physical universe are you
> located, just out of curiosity?

Yeah, not quite sure what's been happening; maybe one of the lists went
mad.  I'm an Australian living in San Francisco (so the notion of a
physical universe is a little ill-defined).

> I've just merged

Quick work!  I was just about to upload a new patch.  (Now uploaded:
20-hg-lockgraph-report.)

BTW, is it possible to set up a SourceForge list which carries CVS
checkins?  It would be useful.

> 13-fix-printf
> 13-kill-1ifroot
> 14-hg-mmap-magic-virgin
> 02-sysv-msg (I assume that 16-function-intercept makes this safe)

Well, it was never unsafe, but I think it should work reliably.

> 16-function-intercept
> 18-recv-nonblock
> 19-hg-lockgraph
>
> I also peered (again) at 09-rdtsc-calibration and think I'll merge that
> too, although not this evening.

Any thoughts about 00-lazyfp or 01-partial-mul?  And I'd quite like
vgprof to be a core skin.

> Thanks for this hackery.  pth_threadpool (my canonical small threading
> test) now runs with fewer and fewer errors.

Yes, the only error I'm getting out of it now is the initial hit on
_dl_num_relocations.

> Am amused to see you had a go at weird_LockSet_equals().  Nick wrote the
> first version; I found it buggy and completely rewrote it; and so on ...
> Hopefully 3rd time lucky.  Am a bit surprised; I spent ages trying to
> convince myself my version was right :)

Yes, I poked about in the CVS history and noticed it had already been
beaten up a bit.  I hope you'll agree that my version is obviously
correct (in the usual highly qualified sense of "obvious").  The two bugs
were: 1) it returned equal if 'a' and 'b' were the same and missing_mutex
was greater than the last element in 'a' and 'b', and 2) the check which
ignored missing_mutex if it was already in 'a' broke removal (though that
doesn't sound right now that I describe it).
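For the archives, here's a sketch of the semantics as I understand them
from the bug descriptions above: the check should succeed exactly when
sorted set 'b' is sorted set 'a' with missing_mutex inserted.  The
function name, array representation, and Mutex typedef below are all made
up for illustration; the real code walks helgrind's LockSet lists.

```c
#include <stdbool.h>
#include <stddef.h>

typedef unsigned long Mutex;   /* stand-in for helgrind's mutex pointer */

/* Return true iff the sorted array 'b' (length nb) equals the sorted
   array 'a' (length na) with 'missing_mutex' inserted.  A sketch only:
   this is a reconstruction of the intended semantics, not the real
   weird_LockSet_equals(). */
static bool lockset_equals_with_extra(const Mutex *a, size_t na,
                                      const Mutex *b, size_t nb,
                                      Mutex missing_mutex)
{
    size_t ia = 0, ib = 0;
    bool seen_extra = false;

    /* 'b' must have exactly one more element than 'a'; this also rules
       out the case where missing_mutex is already present in 'a'. */
    if (nb != na + 1)
        return false;

    while (ib < nb) {
        if (!seen_extra && b[ib] == missing_mutex) {
            seen_extra = true;              /* consume the extra element */
            ib++;
        } else if (ia < na && a[ia] == b[ib]) {
            ia++; ib++;                     /* common element */
        } else {
            return false;                   /* mismatch */
        }
    }
    /* Bug 1 above amounted to forgetting this: missing_mutex must
       actually have been consumed, even if it sorts after every
       element of 'a' and 'b'. */
    return seen_extra && ia == na;
}
```

The merge-walk keeps it O(na + nb), which matters given how often the
lockset machinery gets called.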
> Anyway:
>
> I wonder if I can encourage you to use mozilla-1.0 (standard binary
> install from mozilla.org) as a stress test?  I'm doing
>
>   valgrind -v --error-limit=no --skin=helgrind --trace-children=yes mozilla
>
> and just exiting it as soon as it comes up.

I'll give it a go.

> It generates thousands of errors, and I'm suspicious -- if I trust
> anyone to make a large threaded app and do it right, it's the mozilla
> people.

That's trusting.

> One very common thing is this (kmail apologises for making it ugly):
>
> ==12966== Possible data race writing variable at 0x438A868C
> ==12966==    at 0x439BCA35: inflate_blocks (in /mnt/globe/Apps/mozilla-1.0/libmozz.so)
> ==12966==    by 0x439BBDF1: inflate (in /mnt/globe/Apps/mozilla-1.0/libmozz.so)
> ==12966==    by 0x44FC6E49: (within /mnt/globe/Apps/mozilla-1.0/components/libjar50.so)
> ==12966==    by 0x44FC5F5B: (within /mnt/globe/Apps/mozilla-1.0/components/libjar50.so)
> ==12966==  Previous state: shared RW, locked by: 0x45540E90
> ==12966== Address 0x438A868C is 1044 bytes inside a block of size 1280 alloc'd by thread 1 at
> ==12966==    at 0x4009AAED: calloc (vg_clientfuncs.c:242)
> ==12966==    by 0x4027C7A1: nsRecyclingAllocator::Malloc(unsigned int, int) (in /mnt/globe/Apps/mozilla-1.0/libxpcom.so)
> ==12966==    by 0x44FC5CAE: (within /mnt/globe/Apps/mozilla-1.0/components/libjar50.so)
> ==12966==    by 0x439BC812: inflate_blocks (in /mnt/globe/Apps/mozilla-1.0/libmozz.so)
>
> My question is: memory allocated by calloc(), what state should it start
> out in?  Surely it should be in some non-shared state (not shared RW as
> claimed here)?

No.  It should be exclusive to the thread which did the allocation (or
maybe magic virgin).  That said, the name "nsRecyclingAllocator::Malloc"
suggests that the memory has been used by someone else before, and has
been released back into a pool before being used by
inflate/inflate_blocks; that would need a client call to tell helgrind to
reset the memory state.
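To make the "client call" idea concrete, here's a toy one-slot recycling
pool, loosely in the spirit of nsRecyclingAllocator.  This is a sketch
only: VALGRIND_HG_CLEAN_MEMORY is the client-request macro that later
Valgrind releases ship in helgrind.h for exactly this purpose (whether we
grow one with that name is an open question), and the HAVE_HELGRIND_H
guard with its no-op fallback is just a convenience so the sketch builds
without Valgrind installed.

```c
#include <stdlib.h>
#include <string.h>

/* Hedged: the real macro lives in <valgrind/helgrind.h> in later
   Valgrind releases; the fallback below is a do-nothing stand-in so
   this compiles anywhere. */
#if defined(HAVE_HELGRIND_H)
#  include <valgrind/helgrind.h>
#else
#  define VALGRIND_HG_CLEAN_MEMORY(addr, len) ((void)(addr), (void)(len))
#endif

/* Hypothetical one-slot recycling pool for illustration. */
static void  *pool_slot = NULL;
static size_t pool_size = 0;

void *pool_alloc(size_t n)
{
    if (pool_slot != NULL && pool_size >= n) {
        void *p = pool_slot;
        pool_slot = NULL;
        /* The block inherits whatever sharing state its last user left
           it in (e.g. "shared RW, locked by ...").  Tell helgrind to
           forget that history so the new owner starts from a clean
           state instead of triggering spurious race reports. */
        VALGRIND_HG_CLEAN_MEMORY(p, n);
        memset(p, 0, n);        /* preserve calloc-like zeroing */
        return p;
    }
    pool_size = n;
    return calloc(1, n);
}

void pool_free(void *p)
{
    if (pool_slot != NULL)
        free(pool_slot);        /* pool full: really release */
    pool_slot = p;              /* otherwise recycle */
}
```

Without the reset call, every block handed out twice would carry its old
"shared RW" state, which matches the report above exactly.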
If a lot of Moz's memory has been got from that allocator, I'd expect a
lot of spurious errors.

> Mozilla has just finished -- generating 13975 errors in 13924 contexts --
> mostly duplicates of a small handful of errors.  Incidentally, 1164
> different lock sets were required, so I raised M_LOCKSET_TABLE to 5000
> (and committed it).  Most of the locksets were small (1, 2 or 3 elems),
> but some got quite large; from a quick scan the biggest is
>
>   [1124] = { 0x4365A748 0x4365E424 0x436650F4 0x4366B6C8 0x4374088C
>              0x437F94E8 0x438214B4 0x43826378 0x4382696C 0x4384CF20
>              0x438A28E8 0x438A3424 0x45377038 0x45377130 0x45377228
>              0x45377394 0x462D8E10 0x46C2CB98 0x46C9F474 0x473A20FC
>              0x473A407C }
>
> Dunno if that's remotely interesting or useful, but anyway.

Yes, it is.  A near-term plan I have is to reimplement the lockset stuff
entirely: replace the table with a structural hash, and replace the lists
with arrays.  That should speed things up quite a bit (but I wouldn't
have bothered if there weren't many sets, or if they didn't get large).
I think the lock cycle test makes quite large locksets, because it keeps
computing the union of a thread's current lockset and the mutex's
previous dependent locks.

> Oh yes, one other thing.  I can't figure out the numbering scheme for
> your patches.  At first I thought they were a simple patch counter, but
> then the appearance of (16-function-intercept.patch, 16-ld-nodelete.patch)
> and (18-hg-err-reporting.patch, 18-recv-nonblock.patch) made me discard
> that idea.  Or are you reusing numbers for patches I've merged?

Basically.  For any given set of unmerged patches, I just number them to
keep their lexical order the same as the order they should be applied in.
Generally when I start a new patch I pick the next number up, but
occasionally I get started and decide I need a prerequisite patch, so I
choose an available lower number (which may well recycle a number).

J