On 2011-06-30, at 18:35, Faré <fahree@...> wrote:
>> Most other runtimes instrument writes instead. As a few papers from the '90s note, as execution speeds grow and interrupts become more expensive, the situations in which it is beneficial to mprotect instead of instrumenting become more and more rare. Another potentially significant downside is that coarsening the protection granularity makes GC more expensive. Generally speaking, mprotect tricks also make SBCL behave less like the C programs modern OSes are optimised for, so we don't benefit (as well) from, e.g., transparent merging of 4k pages into 2M huge pages.
> Speaking of granularity — how do we determine the correct granularity
> for such a "card marking" scheme? [...] If marking based on a range of bytes, can we
> make that a compile-time option so we could benchmark our applications
> based on several settings and find the best setting?
Currently, I'm assuming byte ranges, and it's a build-time option, much like current gencgc.
> Presumably, when you write, you know the type of the object you're
> modifying, and can do the proper thing to have object granularity, if
> that helps (might not if you have large objects, e.g. non-specialized
I'm not sure I follow. We do have access to the destination object's type, more or less precisely. How would you use that to have object granularity, when the object is smaller than the card size? I guess it would be possible to try and use one bit from the header of non-cons objects, but that would double the number of writes.
> Something else that might or might not help would be reading
> the flag and only setting it if it was clear, as the saving in writes
> may (or may not) save more than the test costs.
I'll test that, but the main attraction of card marking is its simplicity, which makes it less likely for such tricks to pay off.
> Finally, aren't there
> bit-diddling instructions such as BTS and BTR that you could use on
> x86(64)? Are they slower than a MOV byte? Did you time them?
There are. In addition to potential slowdowns (I'll have to try and measure that as well), I see two issues with trying to work at the bit level:
1. Threads: byte writes are individually atomic, but BTS and BTR are read-modify-write sequences and aren't (without a LOCK prefix). That can be solved by thread-local cards, which might or might not be needed with byte maps anyway (cacheline ping-ponging).
2. Computing the bit offset (in addition to the word's address) is, relative to the current sequence, mildly expensive. However, it may be possible to use the extra bits for some other kind of information that's more easily computed at compile time. Lowtag info isn't that useful, since nearly every (mutable) thing is an other pointer. A rough idea of the offset of the mutated slot in the object, maybe.