From: <don...@is...> - 2003-12-06 20:04:23
|
I'm still trying to figure out how the test posted by Bruno works. Can someone explain? Below is a version that's improved for purposes of adding an entry to the test suite. See the comments therein. ;; no longer used (defun hash-table-keys (ht) (loop :for kk :being :each :hash-key :of ht :collect kk)) ;; no longer used - see next (defun check-hash-unique (ht) (let ((hash-ok t)) (do* ((keys (hash-table-keys ht) (cdr keys)) (key (car keys) (car keys)) (other-keys (cdr keys) (cdr keys))) ((null keys) hash-ok) (when (member key other-keys) (format t "ERROR! key ~s occurs multiple times!~%" key) (setf hash-ok nil))))) ;; if you really want to find the duplicate entries, this is much faster ;; if you just want to test it's not necessary (defun check-hash-unique2 (ht) (let ((h2 (make-hash-table :test 'equal))) (loop :for kk :being :each :hash-key :of ht do (if (gethash kk h2) (format t "ERROR! key ~s occurs multiple times!~%" kk) (setf (gethash kk h2) t))))) (defun do-hash-test (ht) (clrhash ht) (loop for countval upfrom 1 upto 15000 for key = (cons "HT" countval) ;; (format nil "HT-~D" countval) ;; same problem occurs with cons but a lot faster than format do ;;(gethash key ht) (setf (gethash key ht) t) (setf (gethash key ht) t) ;; was countval ;; The gethash seems unnecessary in the sense that the failure ;; occurs without it. Similarly, it occurs whether the two ;; assignments use the same value or not. ;; But two assignments are needed. ;; This part I don't understand. ;; Is the problem related to gc between the two? ;; I tried putting a gc between the two and the problem went away. ;; If there's no gc between the two then I'd expect the second ;; to be a noop. If the problem is a gc during one then why do ;; we need two? finally (format t "~A entries~%" (loop :for kk :being :each :hash-key :of ht :count t)) ;; as a test it suffices to count the entries (above) ;; (check-hash-unique2 ht) ;; only if you want to see the duplicates )) (loop :repeat 2 :do (loop :with ht = (make-hash-table :size 1000) ;; the size doesn't seem to matter ;; I get the bug with no size or 20000 or 100000 :for i :from 1 :to 10 :do (format t "~& --- ~d ---~%" i) (do-hash-test ht))) |
From: Sam S. <sd...@gn...> - 2003-12-06 20:22:22
|
> * Don Cohen <qba...@vf...3-vap.pbz> [2003-12-06 12:00:35 -0800]: > > I'm still trying to figure out how the test posted by Bruno works. Bruno? > (defun check-hash-unique2 (ht) > (let ((h2 (make-hash-table :test 'equal))) > (loop :for kk :being :each :hash-key :of ht do > (if (gethash kk h2) > (format t "ERROR! key ~s occurs multiple times!~%" kk) > (setf (gethash kk h2) t))))) you are using HT to check HT. no good. > ;; But two assignments are needed. > ;; This part I don't understand. > ;; Is the problem related to gc between the two? the problem was with hashcode caching. hashcode is (mod raw-hash-code ht-size). raw-hash-code depends only on the object, but for EQ&EQL HTs it can change after a GC. when we add an element to the HT, we may need to expand the HT, which may trigger GC. raw-hash-code calculation may be expensive, especially for user-defined hash-tables, so we should cache it while expanding HT. we cannot cache the EQ&EQL raw-hash-code (GC may change it). caching of the EQ&EQL raw-hash-code was the problem. see ChangeLog & hashtabl.d for details. -- Sam Steingold (http://www.podval.org/~sds) running w2k <http://www.camera.org> <http://www.iris.org.il> <http://www.memri.org/> <http://www.mideasttruth.com/> <http://www.honestreporting.com> There's always free cheese in a mouse trap. |
From: <don...@is...> - 2003-12-06 23:06:10
|
Sam Steingold writes: > > * Don Cohen <qba...@vf...3-vap.pbz> [2003-12-06 12:00:35 -0800]: > > > > I'm still trying to figure out how the test posted by Bruno works. > Bruno? > > > (defun check-hash-unique2 (ht) > > (let ((h2 (make-hash-table :test 'equal))) > > (loop :for kk :being :each :hash-key :of ht do > > (if (gethash kk h2) > > (format t "ERROR! key ~s occurs multiple times!~%" kk) > > (setf (gethash kk h2) t))))) > > you are using HT to check HT. no good. Only to iterate over it, just as the original did. Try it. > > ;; But two assignments are needed. > > ;; This part I don't understand. > > ;; Is the problem related to gc between the two? > > the problem was with hashcode caching. > hashcode is (mod raw-hash-code ht-size). > raw-hash-code depends only on the object, but for EQ&EQL HTs it can > change after a GC. But that doesn't explain why you need two assignments. > when we add an element to the HT, we may need to expand the HT, which > may trigger GC. Expanding HT was not the problem, since the same problem occurs even with a very large initial size. > raw-hash-code calculation may be expensive, especially for user-defined > hash-tables, so we should cache it while expanding HT. That I understood. So where do you cache it? If you can save such data on an arbitrary object, why not go ahead and define that as the raw hash code. Then you'll never have to rehash! So this means that every EQ/EQL hashtable has to be totally rehashed whenever it's accessed and GC has occurred since the last access? That would seem very bad. Or perhaps only if the key is of one of the types that move? That doesn't seem like a big improvement. Or maybe you can tell the last time a given object moved and compare that to the last time the table was rehashed? > we cannot cache the EQ&EQL raw-hash-code (GC may change it). > caching of the EQ&EQL raw-hash-code was the problem. > see ChangeLog & hashtabl.d for details. I did see that much, but as you can tell, there are lots of questions remaining. |
From: Sam S. <sd...@gn...> - 2003-12-07 05:20:27
|
> * Don Cohen <qba...@vf...3-vap.pbz> [2003-12-06 15:02:21 -0800]: > > > you are using HT to check HT. no good. > Only to iterate over it, just as the original did. so? you just don't use check something with itself (even if it sometimes catches an error). > > the problem was with hashcode caching. > > hashcode is (mod raw-hash-code ht-size). > > raw-hash-code depends only on the object, but for EQ&EQL HTs it can > > change after a GC. > But that doesn't explain why you need two assignments. you probably don't. you just discover that the thing you put into the HT is not there. > > when we add an element to the HT, we may need to expand the HT, which > > may trigger GC. > Expanding HT was not the problem, since the same problem occurs even > with a very large initial size. see rehash() in hashtabl.d > > raw-hash-code calculation may be expensive, especially for user-defined > > hash-tables, so we should cache it while expanding HT. > That I understood. So where do you cache it? hash_prepare_store() in hashtabl.d > So this means that every EQ/EQL hashtable has to be totally rehashed > whenever it's accessed and GC has occurred since the last access? of course. I hope Bruno will chime in here. The issues seems to be - how to make the test I added faster (it is now one of the slowest tests in the testsuite) - general CLISP hash table implementation issues. -- Sam Steingold (http://www.podval.org/~sds) running w2k <http://www.camera.org> <http://www.iris.org.il> <http://www.memri.org/> <http://www.mideasttruth.com/> <http://www.honestreporting.com> XFM: Exit file manager? [Continue] [Cancel] [Abort] |
From: <don...@is...> - 2003-12-07 07:16:23
|
Sam Steingold writes: > > * Don Cohen <qba...@vf...3-vap.pbz> [2003-12-06 15:02:21 -0800]: > > > > > you are using HT to check HT. no good. > > Only to iterate over it, just as the original did. > > so? you just don't use check something with itself (even if it > sometimes catches an error). Try check-hash-unique and check-hash-unique2 and you'll see they report the same things. They both use the original ht in the same way, but the second one is much more efficient, linear instead of quadratic. > > > the problem was with hashcode caching. > > > hashcode is (mod raw-hash-code ht-size). > > > raw-hash-code depends only on the object, but for EQ&EQL HTs it can > > > change after a GC. > > But that doesn't explain why you need two assignments. > > you probably don't. This is not true. As my comments report, if you remove one assignment there is no failure. > you just discover that the thing you put into the HT is not there. > > > > when we add an element to the HT, we may need to expand the HT, which > > > may trigger GC. > > Expanding HT was not the problem, since the same problem occurs even > > with a very large initial size. > > see rehash() in hashtabl.d Is this supposed to convince me that the bug is related to growing the hashtable? I admit I was hoping to understand this issue without understanding (or even reading) all the code. > > > raw-hash-code calculation may be expensive, especially for user-defined > > > hash-tables, so we should cache it while expanding HT. hmm, is there any guarantee that user defined hash functions remain the same across GC ? That doesn't appear in the spec in impnotes. Also I'd have expected that hash functions should return integers. > > That I understood. So where do you cache it? > > hash_prepare_store() in hashtabl.d I think you're telling me where in the code the caching is done. I meant where is the data stored. In the hash table? In the object? In a temporary table? > > So this means that every EQ/EQL hashtable has to be totally rehashed > > whenever it's accessed and GC has occurred since the last access? > > of course. If you can associate a hash code with arbitrary objects (or even only those that can move and have this affect their raw hash code), then by leaving that data alone you could avoid all rehashing. > I hope Bruno will chime in here. > The issues seems to be > - how to make the test I added faster (it is now one of the slowest > tests in the testsuite) If you mean the code I just sent you, it's a whole lot faster than the original. > - general CLISP hash table implementation issues. I want to understand both this and the test that demonstrates the failure recently fixed. |