From: mingming c. <cm...@us...> - 2002-10-24 21:54:30
Attachments:
ipclock-2544mm4.patch
Hi Andrew,

Here is the updated ipc lock patch:

- It greatly reduces lock contention by having one lock per id: the global spinlock is removed and a spinlock is added to the kern_ipc_perm structure.
- Uses Read-Copy-Update in grow_ary() for lock-free resizing.
- In the places where ipc_rmid() is called, delays calling ipc_free() until the RCU callback. This prevents ipc_lock() from returning an invalid pointer after ipc_rmid(). In addition, uses a workqueue to enable RCU freeing of vmalloc'ed entries.

Also some other changes:

- Removed the redundant ipc_lockall()/ipc_unlockall().
- ipc_unlock() now takes the IPC ID pointer directly as its argument, avoiding an extra lookup of the array.

The changes are based on input from Hugh Dickins, Manfred Spraul and Dipankar Sarma. In addition, Cliff White has run OSDL's dbt1 test on a 2-way against the earlier version of this patch. Results show about a 2-6% improvement in the average number of transactions per second. Here is the summary of his tests:

                             2.5.42-mm2   2.5.42-mm2-ipclock
    --------------------------------------------------------
    Average over 5 runs        85.0 BT        89.8 BT
    Std Deviation, 5 runs       7.4 BT         1.0 BT
    Average over 4 best       88.15 BT        90.2 BT
    Std Deviation, 4 best       2.8 BT         0.5 BT

Full details of the tests can be found here:
http://www.osdl.org/projects/dbt1prfrns/results/mingming/index.html

The patch is against 2.5.44-mm4. Please include or give any feedback.

Thanks,
Mingming Cao
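[For orientation, the core structural change discussed in the rest of this thread can be sketched as follows. This is a minimal sketch assuming 2.5-era field names, not the patch itself; the exact layout may differ. Each kern_ipc_perm gains its own spinlock, plus a deleted flag that lets ipc_lock() detect an id that ipc_rmid() removed while the lock was being acquired.]

	/* include/linux/ipc.h (sketch, not the patch itself) */
	struct kern_ipc_perm {
		spinlock_t lock;  /* per-id lock, replacing the one global ipc lock */
		int deleted;      /* set by ipc_rmid(); re-checked under the lock   */
		key_t key;
		uid_t uid;
		gid_t gid;
		uid_t cuid;
		gid_t cgid;
		mode_t mode;
		unsigned long seq;
	};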
From: Rick L. <ric...@us...> - 2002-10-25 00:37:10
slightly offtopic ...

> > There is an insane amount of inlining in the ipc code. I
> > couldn't keep my paws off it.
>
> I agree tempting: I thought you might like that in a subsequent patch,
> yes? Mingming was splitting locks, not doing a cleanup of inlines.

There was a time when "inline" was a very cool tool, because it had been judged that the overhead of actually calling a function was just too heinous to contemplate. From comments in this and other discussions, is it safe to say that the pendulum has now swung the other way? I see a lot of people concerned about code size and apparently returning to the axiom of "if you use it more than once, make it a function." Are we as a community coming around to using inlining only on very tight, very critical functions?

Rick
From: Robert L. <rm...@te...> - 2002-10-25 01:08:05
On Thu, 2002-10-24 at 20:35, Rick Lindsley wrote:
> There was a time when "inline" was a very cool tool because it had been
> judged that the overhead of actually calling a function was just too
> heinous to contemplate. From comments in this and other discussions,
> is it safe to say that the pendulum has now swung the other way? I see
> a lot of people concerned about code size and apparently returning to
> the axiom of "if you use it more than once, make it a function." Are
> we as a community coming around to using inlining only on very tight,
> very critical functions?

I think so; at least Andrew is championing us in that direction. But I agree. Somewhere it became the notion that if a function is small, it should automatically be inlined. I suspect Andrew has even stricter criteria than me (I think super-small functions should be inlined), but the general "it's only a couple of lines" or "it could be a macro" are not sufficient criteria for inlining.

So, my thoughts on suitable criteria would be (an example follows below):

- used only once, and only ever in that one spot (i.e. it really could be part of the caller, but it was pulled out for cleanliness; keep it inline so that the cleanliness does not cause a performance degradation, however small)
- small functions, where small is so small that the function-call overhead is nearly the same size - stuff that might not even do anything but return a permutation of an argument, etc.
- very, very time-critical functions in time-critical places

So that removes the previous criterion of "the function is N lines or smaller", where N is some number less than 100 :)

Robert Love
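[To make the second criterion concrete, here is an illustrative pair - my example with hypothetical names, not code from the ipc patch. The first function is barely more than its own call overhead, so inlining it costs nothing; the second does real work from many call sites, so inlining it only bloats the icache.]

	/* A reasonable inline candidate: one instruction's worth of work. */
	static inline int ipcid_to_index(int id)
	{
		return id % 32768;      /* a trivial permutation of the argument */
	}

	/* A poor inline candidate: a loop, called from many places. Make it
	 * a real function so the code exists exactly once in the image. */
	int count_used_slots(void **slots, int n)
	{
		int i, used = 0;

		for (i = 0; i < n; i++)
			if (slots[i] != NULL)
				used++;
		return used;
	}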
From: Cliff W. <cl...@os...> - 2002-10-25 00:38:57
> mingming cao wrote:
> >
> > Hi Andrew,
> >
> > Here is the updated ipc lock patch:
>
> Well I can get you a bit of testing and attention, but I'm afraid
> my knowledge of the IPC code is negligible.
>
> So to be able to commend this change to Linus I'd have to rely on
> assurances from people who _do_ understand IPC (Hugh?) and on lots
> of testing.
>
> So yes, I'll include it, and would solicit success reports from
> people who are actually exercising that code path, thanks.
>
> > http://www.osdl.org/projects/dbt1prfrns/results/mingming/index.html
>
> DBT1 is really interesting, and I'm glad the OSDL team have
> put it together. If people would only stop sending me patches
> I'd be using it ;)

Thank you very much for that :)

> Could someone please help explain the results? Comparing, say,
> http://www.osdl.org/projects/dbt1prfrns/results/mingming/run.2cpu.42-mm2.r5/index.html
> and
> http://www.osdl.org/projects/dbt1prfrns/results/mingming/run.18.r5/index.html
>
> It would appear that 2.5 completely smoked 2.4 on response time,
> yet the overall bogotransactions/sec is significantly lower.
> What should we conclude from this?

Whoa - we ran these 5 times for an average. The 2.5 run you picked was the 'off' run - it has the worst results. You will notice that on this run there are a large number of errors which didn't happen on the other runs - this lowered the BT/sec number. Use one of the other 2.5 ones and you'll see something more sensible (say, 42-mm2.r3). Unfortunately, on average, 2.4 still beats 2.5 on both response time and BTs:

                             2.5.42-mm2   2.5.42-mm2-ipclock   2.4.18
    Average over 5 runs        85.0 BT         89.8 BT         96.92 BT
    Std Deviation, 5 runs       7.4 BT          1.0 BT          2.07 BT
    Average of best 4 runs     88.15 BT        90.2 BT          97.2 BT
    Std Deviation, best 4       2.8 BT          0.5 BT           2.3 BT

One other place to start comparing is the system information at the bottom of the page. Some points (might be minor):

CPU statistics:
    2.4.18     - %idle averages around 1.5%, %system swings between 3-7%,
                 %nice steady at ~3.6%
    2.5.42-mm2 - %idle 0.0 all through the run, %system steady at ~6%,
                 %nice up at ~5.5%

Swap (sar -r): very slight differences - we consumed ~98% of the memory in both cases; 2.4 swapped a little bit more (28%) than 2.5 (26%).

We also include profile data for both the load and run phases (profile=2).

> Also I see:
>
> 14.7 minute duration
> and
> Time for DBT run 19:36
>
> What is the 14.7 minutes referring to?

The 14.7 minute time comes from the workload driver logs, which are parsed to get the response numbers. The 'Time for' stamps come from the master driver script, and include some of the workload startup and shutdown time. The workload driver waits a bit to be sure things are stable before the official run data is collected. The script timestamp waits until the run clients are dead. So there's always a bit of a delta between the two.

> Also:
>
> 2.5: Time for key creation 1:27
> 2.4: Time for key creation 14:24
> versus:
> 2.5: Time for table creation 16:48
> 2.4: Time for table creation 8:58

This is a Mystery Question - we don't have an answer; we were hoping _you_ would see something :)

Table creation involves sequential inserts of data from a flat file into an SAP DB B-tree on a devspace. Our devspace is a raw device, so we're doing raw I/O, plus some processing. This op is write-intensive.

'Key creation' is establishing a foreign key column constraint on various tables. For each table, it examines every row in the table and looks up (does a B-tree index lookup) the column value in a second table to find a specific primary key that matches the column value in the first table. So again, some I/O, a bit of processing. Key creation (foreign key) is read-intensive.

Also interesting is the delta in index creation:

    2.5 Time for index creation 27:58
    2.4 Time for index creation 17:21

Index creation requires a read of the table, a sort, then creation of a B-tree index. Both the index and table creates build a B-tree for SAP DB (both run slower on 2.5) - the table creation does no sorting.

We also notice that the times for both index and key creation vary a bit more across runs with the -mm2 kernel, as shown by the standard deviation across the runs. mingming and 2.4.18 are a bit more consistent. (We threw out -mm2 run 5 for this average, due to the errors.) Results are: average time [std dev]

    Action         2.4.18         2.5.42-mm2     2.5.42-mm2-ipclock
    table create    8:55 [0:04]   19:03 [2:40]   19:39 [0:50]
    index create   17:17 [0:11]   25:19 [5:31]   28:05 [0:02]
    key create     14:23 [0:16]   15:21 [6:37]   18:46 [0:17]

Also interesting is -mm2 run 2 - foreign key creation took 5:26 and the run completed with no errors... why so fast, only one time? It is an ongoing mystery. We Just Don't Know Why Right Now. We are working on better data capture of db/run errors, and we'd love to hear suggestions on improving the instrumentation.

> So it's all rather confusing. Masses of numbers usually _are_
> confusing. What really adds tons of value to such an exercise is
> for the person who ran the test to write up some conclusions.

Yes, agreed. We don't yet know enough to map from test results to an exact kernel area. We just added a database expert to staff (Mary Edie Meredith), so we intend to get better. We'll probably be nagging you a bit, and again we very much appreciate all suggestions.

> To tell the developers what went well, what went poorly, what areas
> to focus on, etc. To use your own judgement to tell us what to
> zoom in on.
>
> Is that something which could be added?

It is something we are working on adding.

cliffw
From: mingming c. <cm...@us...> - 2002-10-25 20:22:01
Paul Larson wrote:
>
> I haven't seen this test fail before but I'll be happy to do more testing
> with your patch to see if I can reproduce it. You may also want to
> consider updating LTP to the newest version. I'm fairly certain that
> shmctl01 hasn't been changed since the version you have, but just to be
> consistent you may want to do that.

Ha! Sorry about the confusion. I re-installed the LTP test suite and the error is gone. My old test install must have been dirty.

Mingming
From: Andrew M. <ak...@di...> - 2002-10-24 22:29:53
mingming cao wrote:
>
> Hi Andrew,
>
> Here is the updated ipc lock patch:

Well I can get you a bit of testing and attention, but I'm afraid my knowledge of the IPC code is negligible.

So to be able to commend this change to Linus I'd have to rely on assurances from people who _do_ understand IPC (Hugh?) and on lots of testing.

So yes, I'll include it, and would solicit success reports from people who are actually exercising that code path, thanks.

> http://www.osdl.org/projects/dbt1prfrns/results/mingming/index.html

DBT1 is really interesting, and I'm glad the OSDL team have put it together. If people would only stop sending me patches I'd be using it ;)

Could someone please help explain the results? Comparing, say,
http://www.osdl.org/projects/dbt1prfrns/results/mingming/run.2cpu.42-mm2.r5/index.html
and
http://www.osdl.org/projects/dbt1prfrns/results/mingming/run.18.r5/index.html

It would appear that 2.5 completely smoked 2.4 on response time, yet the overall bogotransactions/sec is significantly lower. What should we conclude from this?

Also I see:

    14.7 minute duration
and
    Time for DBT run 19:36

What is the 14.7 minutes referring to?

Also:

    2.5: Time for key creation 1:27
    2.4: Time for key creation 14:24
versus:
    2.5: Time for table creation 16:48
    2.4: Time for table creation 8:58

So it's all rather confusing. Masses of numbers usually _are_ confusing. What really adds tons of value to such an exercise is for the person who ran the test to write up some conclusions. To tell the developers what went well, what went poorly, what areas to focus on, etc. To use your own judgement to tell us what to zoom in on.

Is that something which could be added?
From: Hugh D. <hu...@ve...> - 2002-10-24 22:56:01
On Thu, 24 Oct 2002, Andrew Morton wrote:
> mingming cao wrote:
> >
> > Hi Andrew,
> >
> > Here is the updated ipc lock patch:
>
> Well I can get you a bit of testing and attention, but I'm afraid
> my knowledge of the IPC code is negligible.
>
> So to be able to commend this change to Linus I'd have to rely on
> assurances from people who _do_ understand IPC (Hugh?) and on lots
> of testing.
>
> So yes, I'll include it, and would solicit success reports from
> people who are actually exercising that code path, thanks.

Manfred and I have both reviewed the patch (or the 2.5.44 version) and we both recommend it highly (well, let Manfred speak for himself). I can't claim great expertise on IPC (none on msg, but some on shm and sem), but (unless there's an error we've missed) there's no change to IPC functionality here - it's an exercise in "self-evidently" better locking (there used to be just one spinlock covering all e.g. sems), with RCU to avoid the dirty-cacheline bouncing of an earlier version. And I rarely exercise IPC paths, except when testing if I change something: I do hope someone else can vouch for it in practice; we believe Mingming has devised a fine patch here.

Hugh
From: Andrew M. <ak...@di...> - 2002-10-24 23:30:41
Hugh Dickins wrote:
>
> ...
> Manfred and I have both reviewed the patch (or the 2.5.44 version)
> and we both recommend it highly (well, let Manfred speak for himself).

OK, thanks.

So I took a look. Wish I hadn't :( The locking rules in there are outrageously uncommented. You must be brave people.

What about this code?

	void ipc_rcu_free(void* ptr, int size)
	{
		struct rcu_ipc_free* arg;

		arg = (struct rcu_ipc_free *) kmalloc(sizeof(*arg), GFP_KERNEL);
		if (arg == NULL)
			return;
		arg->ptr = ptr;
		arg->size = size;
		call_rcu(&arg->rcu_head, ipc_free_callback, arg);
	}

Are we sure that it's never called under locks?

And it seems that if the kmalloc fails, we decide to leak some memory, yes?

If so it would be better to use GFP_ATOMIC there. Avoids any locking problems and also increases the chance of the allocation succeeding. (With an explanatory comment, naturally :)).

Even better: is it possible to embed the rcu_ipc_free inside the object-to-be-freed? Perhaps not?

Stylistically, it is best to not typecast the return value from kmalloc, btw. You should never typecast the return value of anything which returns a void *, because it weakens your compile-time checking. Example:

	foo *bar = (foo *)zot();

The compiler will swallow that, regardless of what zot() returns. Someone could go and change zot() to return a reiserfs_inode * and you would never know about it. Whereas:

	foo *bar = zot();

says to the compiler "zot() must return a foo * or a void *", which is much tighter checking, yes?

There is an insane amount of inlining in the ipc code. I couldn't keep my paws off it.

Before:

	mnm:/usr/src/25> size ipc/*.o
	   text    data     bss     dec     hex filename
	  28346     224     192   28762    705a ipc/built-in.o
	   7390      20      64    7474    1d32 ipc/msg.o
	  11236      16      64   11316    2c34 ipc/sem.o
	   8136     160      64    8360    20a8 ipc/shm.o
	   1584       0       0    1584     630 ipc/util.o

After:

	mnm:/usr/src/25> size ipc/*.o
	   text    data     bss     dec     hex filename
	  19274     224     192   19690    4cea ipc/built-in.o
	   4846      20      64    4930    1342 ipc/msg.o
	   7636      16      64    7716    1e24 ipc/sem.o
	   4808     160      64    5032    13a8 ipc/shm.o
	   1984       0       0    1984     7c0 ipc/util.o

--- 25/ipc/util.h~ipc-akpm	Thu Oct 24 16:03:32 2002
+++ 25-akpm/ipc/util.h	Thu Oct 24 16:08:25 2002
@@ -54,63 +54,11 @@
 void* ipc_alloc(int size);
 void ipc_free(void* ptr, int size);
 void ipc_rcu_free(void* arg, int size);
 
-extern inline struct kern_ipc_perm* ipc_get(struct ipc_ids* ids, int id)
-{
-	struct kern_ipc_perm* out;
-	int lid = id % SEQ_MULTIPLIER;
-	if(lid >= ids->size)
-		return NULL;
-	rmb();
-	out = ids->entries[lid].p;
-	return out;
-}
-
-extern inline struct kern_ipc_perm* ipc_lock(struct ipc_ids* ids, int id)
-{
-	struct kern_ipc_perm* out;
-	int lid = id % SEQ_MULTIPLIER;
-
-	rcu_read_lock();
-	if(lid >= ids->size) {
-		rcu_read_unlock();
-		return NULL;
-	}
-	rmb();
-	out = ids->entries[lid].p;
-	if(out == NULL) {
-		rcu_read_unlock();
-		return NULL;
-	}
-	spin_lock(&out->lock);
-
-	/* ipc_rmid() may have already freed the ID while ipc_lock
-	 * was spinning: here verify that the structure is still valid
-	 */
-	if (out->deleted) {
-		spin_unlock(&out->lock);
-		rcu_read_unlock();
-		return NULL;
-	}
-	return out;
-}
-
-extern inline void ipc_unlock(struct kern_ipc_perm* perm)
-{
-	spin_unlock(&perm->lock);
-	rcu_read_unlock();
-}
-
-extern inline int ipc_buildid(struct ipc_ids* ids, int id, int seq)
-{
-	return SEQ_MULTIPLIER*seq + id;
-}
-
-extern inline int ipc_checkid(struct ipc_ids* ids, struct kern_ipc_perm* ipcp, int uid)
-{
-	if(uid/SEQ_MULTIPLIER != ipcp->seq)
-		return 1;
-	return 0;
-}
+struct kern_ipc_perm* ipc_get(struct ipc_ids* ids, int id);
+struct kern_ipc_perm* ipc_lock(struct ipc_ids* ids, int id);
+void ipc_unlock(struct kern_ipc_perm* perm);
+int ipc_buildid(struct ipc_ids* ids, int id, int seq);
+int ipc_checkid(struct ipc_ids* ids, struct kern_ipc_perm* ipcp, int uid);
 
 void kernel_to_ipc64_perm(struct kern_ipc_perm *in, struct ipc64_perm *out);
 void ipc64_perm_to_ipc_perm(struct ipc64_perm *in, struct ipc_perm *out);
--- 25/ipc/util.c~ipc-akpm	Thu Oct 24 16:07:07 2002
+++ 25-akpm/ipc/util.c	Thu Oct 24 16:07:51 2002
@@ -359,6 +359,61 @@ void ipc64_perm_to_ipc_perm (struct ipc6
 	out->seq = in->seq;
 }
 
+struct kern_ipc_perm* ipc_get(struct ipc_ids* ids, int id)
+{
+	struct kern_ipc_perm* out;
+	int lid = id % SEQ_MULTIPLIER;
+	if(lid >= ids->size)
+		return NULL;
+	rmb();
+	out = ids->entries[lid].p;
+	return out;
+}
+
+struct kern_ipc_perm* ipc_lock(struct ipc_ids* ids, int id)
+{
+	struct kern_ipc_perm* out;
+	int lid = id % SEQ_MULTIPLIER;
+
+	rcu_read_lock();
+	if(lid >= ids->size)
+		goto fail;
+	rmb();
+	out = ids->entries[lid].p;
+	if (out == NULL)
+		goto fail;
+	spin_lock(&out->lock);
+
+	/* ipc_rmid() may have already freed the ID while ipc_lock
+	 * was spinning: here verify that the structure is still valid
+	 */
+	if (!out->deleted)
+		return out;
+
+	spin_unlock(&out->lock);
+fail:
+	rcu_read_unlock();
+	return NULL;
+}
+
+void ipc_unlock(struct kern_ipc_perm* perm)
+{
+	spin_unlock(&perm->lock);
+	rcu_read_unlock();
+}
+
+int ipc_buildid(struct ipc_ids* ids, int id, int seq)
+{
+	return SEQ_MULTIPLIER*seq + id;
+}
+
+int ipc_checkid(struct ipc_ids* ids, struct kern_ipc_perm* ipcp, int uid)
+{
+	if(uid/SEQ_MULTIPLIER != ipcp->seq)
+		return 1;
+	return 0;
+}
+
 #ifndef __ia64__
 
 /**
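[Andrew's typecast point can be checked in isolation - my standalone demo, reusing his hypothetical foo/zot names. The cast silences exactly the warning that would have caught the changed return type.]

	/* cast-demo.c -- try: gcc -Wall -c cast-demo.c */
	typedef struct { int x; } foo;
	typedef struct { int y; } reiserfs_inode;

	/* Suppose zot() once returned void * and someone changed it: */
	reiserfs_inode *zot(void);

	void demo(void)
	{
		foo *a = (foo *)zot();  /* compiles silently: the cast hides the bug  */
		foo *b = zot();         /* warning: initialization from incompatible
		                         * pointer type -- the bug is caught          */
		(void)a;
		(void)b;
	}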
From: Hugh D. <hu...@ve...> - 2002-10-24 23:58:15
On Thu, 24 Oct 2002, Andrew Morton wrote:
> Hugh Dickins wrote:
> >
> > ...
> > Manfred and I have both reviewed the patch (or the 2.5.44 version)
> > and we both recommend it highly (well, let Manfred speak for himself).
>
> OK, thanks.
>
> So I took a look. Wish I hadn't :( The locking rules in there
> are outrageously uncommented. You must be brave people.

Ah, we all like to criticize the lack of comments in others' code.

> What about this code?
>
> 	void ipc_rcu_free(void* ptr, int size)
> 	{
> 		struct rcu_ipc_free* arg;
>
> 		arg = (struct rcu_ipc_free *) kmalloc(sizeof(*arg), GFP_KERNEL);
> 		if (arg == NULL)
> 			return;
> 		arg->ptr = ptr;
> 		arg->size = size;
> 		call_rcu(&arg->rcu_head, ipc_free_callback, arg);
> 	}
>
> Are we sure that it's never called under locks?

Yes.

> And it seems that if the kmalloc fails, we decide to leak some
> memory, yes?

Yes, but why would it fail? And what do you think the alternative should be?

> If so it would be better to use GFP_ATOMIC there. Avoids any
> locking problems and also increases the chance of the allocation
> succeeding. (With an explanatory comment, naturally :)).

There are no locking doubts here. GFP_ATOMIC would _reduce_ the chance of the allocation succeeding: GFP_KERNEL does include the __GFP_WAIT flag, GFP_ATOMIC does not.

> Even better: is it possible to embed the rcu_ipc_free inside the
> object-to-be-freed? Perhaps not?

It would certainly be possible (I did suggest it as a maybe), but it's unclear whether it's worthwhile wasting the extra memory long-term like that. Mingming chose not to embed; I see no reason to overrule.

> Stylistically, it is best to not typecast the return value
> from kmalloc, btw. You should never typecast the return
> value of anything which returns a void *, because it weakens
> your compile-time checking. Example:
>
> 	foo *bar = (foo *)zot();
>
> The compiler will swallow that, regardless of what zot() returns.
> Someone could go and change zot() to return a reiserfs_inode *
> and you would never know about it. Whereas:
>
> 	foo *bar = zot();
>
> says to the compiler "zot() must return a foo * or a void *",
> which is much tighter checking, yes?

You have too much time on your hands, Andrew :-)

> There is an insane amount of inlining in the ipc code. I
> couldn't keep my paws off it.

I agree tempting: I thought you might like that in a subsequent patch, yes? Mingming was splitting locks, not doing a cleanup of inlines.

Hugh
From: mingming c. <cm...@us...> - 2002-10-25 00:11:59
Andrew Morton wrote:
>
> What about this code?
>
> 	void ipc_rcu_free(void* ptr, int size)
> 	{
> 		struct rcu_ipc_free* arg;
>
> 		arg = (struct rcu_ipc_free *) kmalloc(sizeof(*arg), GFP_KERNEL);
> 		if (arg == NULL)
> 			return;
> 		arg->ptr = ptr;
> 		arg->size = size;
> 		call_rcu(&arg->rcu_head, ipc_free_callback, arg);
> 	}
>
> Are we sure that it's never called under locks?

Did you see any place where this is called with lock(s) held? Maybe there is one, but I could not see it. These functions are used by the IPC code only. Inside IPC there is only the per-ID spinlock and the sem_undo lock, and neither of them is held when ipc_rcu_free() is called.

> And it seems that if the kmalloc fails, we decide to leak some
> memory, yes?

Yes.

> If so it would be better to use GFP_ATOMIC there. Avoids any
> locking problems and also increases the chance of the allocation
> succeeding. (With an explanatory comment, naturally :)).

Good point. I agree GFP_ATOMIC fits better here.

> Even better: is it possible to embed the rcu_ipc_free inside the
> object-to-be-freed? Perhaps not?

Are you saying we should have a static RCU header structure in the object-to-be-freed? I think it's possible. It fits well in the rmid case, where the object to be freed is a kern_ipc_perm structure. But for the grow_ary() case, the object to be freed is an array of struct ipc_id, so it needs a little more change there. Maybe add a new structure, ipc_entries, which includes the RCU header structure and the pointer to the entries array, then have ipc_ids->entries point to the ipc_entries. My only small concern is that this way we add an extra dereference when looking up an IPC ID in the array (see the sketch below).
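[A minimal sketch of the ipc_entries wrapper Mingming describes - hypothetical names and layout, not code from any posted patch. The rcu_head travels with the array, at the price of one extra pointer hop on lookup.]

	/* Sketch: bundle the resizable id array with its own RCU head, so
	 * grow_ary() can retire the old array via call_rcu() without any
	 * allocation on the free path. */
	struct ipc_entries {
		struct rcu_head rcu;       /* used when this array is retired    */
		int size;                  /* number of slots in entries[]       */
		struct ipc_id entries[0];  /* the array itself, allocated inline */
	};

	/* A lookup then becomes:  out = ids->entries->entries[lid].p;
	 * -- the extra dereference Mingming is concerned about. */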
From: Andrew M. <ak...@di...> - 2002-10-25 00:24:16
mingming cao wrote:
>
> > Even better: is it possible to embed the rcu_ipc_free inside the
> > object-to-be-freed? Perhaps not?
>
> Are you saying we should have a static RCU header structure in the
> object-to-be-freed? I think it's possible. It fits well in the rmid
> case, where the object to be freed is a kern_ipc_perm structure. But
> for the grow_ary() case, the object to be freed is an array of struct
> ipc_id, so it needs a little more change there. Maybe add a new
> structure, ipc_entries, which includes the RCU header structure and the
> pointer to the entries array, then have ipc_ids->entries point to the
> ipc_entries. My only small concern is that this way we add an extra
> dereference when looking up an IPC ID in the array.

This is a place where a mempool is appropriate. The objects have a "guaranteed to be returned if you wait for long enough" lifecycle.

But Hugh's right here. The chance of the single-page GFP_KERNEL allocation failing is tiny, the exact probability depending upon the VM-of-the-day. Let's leave it be.
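[For reference, Andrew's mempool idea might look roughly like this - my sketch, not part of any posted patch; it assumes the 2.5-era mempool API with the slab helpers mempool_alloc_slab()/mempool_free_slab().]

	#include <linux/mempool.h>
	#include <linux/slab.h>

	static kmem_cache_t *rcu_free_cachep;
	static mempool_t *rcu_free_pool;

	static int __init ipc_rcu_pool_init(void)
	{
		rcu_free_cachep = kmem_cache_create("rcu_ipc_free",
				sizeof(struct rcu_ipc_free), 0, 0, NULL, NULL);
		if (!rcu_free_cachep)
			return -ENOMEM;
		/* keep a reserve of elements so the free path can always proceed */
		rcu_free_pool = mempool_create(16, mempool_alloc_slab,
				mempool_free_slab, rcu_free_cachep);
		return rcu_free_pool ? 0 : -ENOMEM;
	}

	/* ipc_rcu_free() would then use mempool_alloc(rcu_free_pool, GFP_KERNEL),
	 * which waits for an element to be returned rather than failing. */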
From: Rusty R. <ru...@ru...> - 2002-10-25 04:22:51
On Thu, 24 Oct 2002 16:30:32 -0700
Andrew Morton <ak...@di...> wrote:
> Hugh Dickins wrote:
> >
> > ...
> > Manfred and I have both reviewed the patch (or the 2.5.44 version)
> > and we both recommend it highly (well, let Manfred speak for himself).
>
> OK, thanks.
>
> So I took a look. Wish I hadn't :( The locking rules in there
> are outrageously uncommented. You must be brave people.

Agreed. Here's my brief audit:

>+	int max_id = ids->max_id;
>
>-	for (id = 0; id <= ids->max_id; id++) {
>+	read_barrier_depends();
>+	for (id = 0; id <= max_id; id++) {

That needs to be a rmb(), not a read_barrier_depends(). And like all barriers, it *requires* a comment:

	/* We must read max_id before reading any entries */

I can't see the following in the patch posted, but:

> void ipc_rcu_free(void* ptr, int size)
> {
> 	struct rcu_ipc_free* arg;
>
> 	arg = (struct rcu_ipc_free *) kmalloc(sizeof(*arg), GFP_KERNEL);
> 	if (arg == NULL)
> 		return;
> 	arg->ptr = ptr;
> 	arg->size = size;
> 	call_rcu(&arg->rcu_head, ipc_free_callback, arg);
> }

This is unacceptable crap, sorry. You *must* allocate the resources required to free the object *at the time you allocate the object*, since freeing must not fail.

> Even better: is it possible to embed the rcu_ipc_free inside the
> object-to-be-freed? Perhaps not?

Yes, this must be done.

Rusty.
--
there are those who do and those who hang on and you don't see too
many doers quoting their contemporaries. -- Larry McVoy
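[The allocate-at-allocation-time pattern Rusty is demanding can be sketched like this - my illustration of the idea, assuming the 2.5-era three-argument call_rcu(); not the patch that was eventually merged. Prefix every RCU-freed IPC object with its own rcu_head, so the free path has nothing left to allocate and can never fail.]

	struct ipc_rcu_hdr {
		struct rcu_head rcu;
		int size;                  /* remembered for the free path */
	};

	void *ipc_rcu_alloc(int size)
	{
		struct ipc_rcu_hdr *hdr;

		hdr = kmalloc(sizeof(*hdr) + size, GFP_KERNEL);
		if (hdr == NULL)
			return NULL;       /* fail at alloc time, never at free time */
		hdr->size = size;
		return hdr + 1;            /* callers see only the payload */
	}

	static void ipc_free_callback(void *arg)
	{
		kfree(arg);                /* header and payload freed together */
	}

	void ipc_rcu_free(void *ptr)
	{
		struct ipc_rcu_hdr *hdr = (struct ipc_rcu_hdr *)ptr - 1;

		call_rcu(&hdr->rcu, ipc_free_callback, hdr);
	}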
From: Manfred S. <ma...@co...> - 2002-10-25 05:36:54
Andrew Morton wrote:

>Hugh Dickins wrote:
>
>>...
>>Manfred and I have both reviewed the patch (or the 2.5.44 version)
>>and we both recommend it highly (well, let Manfred speak for himself).
>
>OK, thanks.
>
>So I took a look. Wish I hadn't :( The locking rules in there
>are outrageously uncommented. You must be brave people.

Ahm. No idea who wrote the current locking. But the patch is very nice: it reduces the lock contention without increasing the number of spinlock calls.

--
    Manfred
From: mingming c. <cm...@us...> - 2002-10-24 23:28:00
Andrew Morton wrote:
>
> mingming cao wrote:
> >
> > Hi Andrew,
> >
> > Here is the updated ipc lock patch:
>
> Well I can get you a bit of testing and attention, but I'm afraid
> my knowledge of the IPC code is negligible.
>
> So to be able to commend this change to Linus I'd have to rely on
> assurances from people who _do_ understand IPC (Hugh?) and on lots
> of testing.

Thanks for your quick feedback. I ran the LTP tests on it and it passed (well, I saw a failure in shmctl(), but that failure has been there since the 2.5.43 kernel). I will do more stress tests on it soon.

Mingming
From: Paul L. <pl...@li...> - 2002-10-25 14:31:17
On Thu, 2002-10-24 at 18:23, mingming cao wrote:
> Thanks for your quick feedback. I ran the LTP tests on it and it
> passed (well, I saw a failure in shmctl(), but that failure has been
> there since the 2.5.43 kernel). I will do more stress tests on it soon.

Which shmctl() test is this? To my knowledge, there are no current known issues with the shmctl tests. There is, however, one with sem02 in semctl() that, last I heard, has been partially fixed in the kernel and still needs to be fixed in glibc. Is that the one you are referring to, or is there really some other shmctl test in LTP that is failing?

Thanks,
Paul Larson
From: mingming c. <cm...@us...> - 2002-10-25 17:22:59
Paul Larson wrote:
>
> On Thu, 2002-10-24 at 18:23, mingming cao wrote:
> > Thanks for your quick feedback. I ran the LTP tests on it and it
> > passed (well, I saw a failure in shmctl(), but that failure has been
> > there since the 2.5.43 kernel). I will do more stress tests on it soon.
> Which shmctl() test is this? To my knowledge, there are no current
> known issues with the shmctl tests. There is, however, one with sem02 in
> semctl() that, last I heard, has been partially fixed in the kernel and
> still needs to be fixed in glibc. Is that the one you are referring to,
> or is there really some other shmctl test in LTP that is failing?

Here is the failure I saw in the LTP test. The one that failed is
/ltp-20020807/testcases/kernel/syscalls/ipc/shmctl/shmctl01

<<<test_start>>>
tag=shmctl01 stime=1035475025
cmdline="shmctl01"
contacts=""
analysis=exit
initiation_status="ok"
<<<test_output>>>
shmctl01    0  INFO  :  shmdt() failed - 22
shmctl01    0  INFO  :  shmdt() failed - 22
shmctl01    0  INFO  :  shmdt() failed - 22
shmctl01    1  PASS  :  pid, size, # of attaches and mode are correct - pass #1
shmctl01    0  INFO  :  shmdt() failed - 22
shmctl01    1  PASS  :  pid, size, # of attaches and mode are correct - pass #1
shmctl01    0  INFO  :  shmdt() failed - 22
shmctl01    1  PASS  :  pid, size, # of attaches and mode are correct - pass #1
shmctl01    2  PASS  :  pid, size, # of attaches and mode are correct - pass #2
shmctl01    0  INFO  :  shmdt() failed - 22
shmctl01    1  PASS  :  pid, size, # of attaches and mode are correct - pass #1
shmctl01    2  PASS  :  pid, size, # of attaches and mode are correct - pass #2
shmctl01    0  INFO  :  shmdt() failed - 22
shmctl01    1  PASS  :  pid, size, # of attaches and mode are correct - pass #1
shmctl01    2  PASS  :  pid, size, # of attaches and mode are correct - pass #2
shmctl01    0  INFO  :  shmdt() failed - 22
shmctl01    1  PASS  :  pid, size, # of attaches and mode are correct - pass #1
shmctl01    2  PASS  :  pid, size, # of attaches and mode are correct - pass #2
shmctl01    3  FAIL  :  # of attaches is incorrect - 0
shmctl01    4  PASS  :  new mode and change time are correct
<<<execution_status>>>
duration=1 termination_type=exited termination_id=1 corefile=no
cutime=0 cstime=0
<<<test_end>>>
From: Paul L. <pl...@li...> - 2002-10-25 18:30:07
On Fri, 2002-10-25 at 12:17, mingming cao wrote:
>
> shmctl01    3  FAIL  :  # of attaches is incorrect - 0

I guess you are running it with -i2? I just tried shmctl01 -i2 on a 2.5.44 kernel and did not get this error.

shmctl01    1  PASS  :  pid, size, # of attaches and mode are correct - pass #1
shmctl01    0  INFO  :  shmdt() failed - 22
shmctl01    0  INFO  :  shmdt() failed - 22
shmctl01    0  INFO  :  shmdt() failed - 22
shmctl01    0  INFO  :  shmdt() failed - 22
shmctl01    2  PASS  :  pid, size, # of attaches and mode are correct - pass #2
shmctl01    0  INFO  :  shmdt() failed - 22
shmctl01    0  INFO  :  shmdt() failed - 22
shmctl01    0  INFO  :  shmdt() failed - 22
shmctl01    0  INFO  :  shmdt() failed - 22
shmctl01    3  PASS  :  new mode and change time are correct
shmctl01    4  PASS  :  shared memory appears to be removed
shmctl01    1  PASS  :  pid, size, # of attaches and mode are correct - pass #1
shmctl01    0  INFO  :  shmdt() failed - 22
shmctl01    0  INFO  :  shmdt() failed - 22
shmctl01    0  INFO  :  shmdt() failed - 22
shmctl01    0  INFO  :  shmdt() failed - 22
shmctl01    2  PASS  :  pid, size, # of attaches and mode are correct - pass #2
shmctl01    0  INFO  :  shmdt() failed - 22
shmctl01    0  INFO  :  shmdt() failed - 22
shmctl01    0  INFO  :  shmdt() failed - 22
shmctl01    0  INFO  :  shmdt() failed - 22
shmctl01    3  PASS  :  new mode and change time are correct
shmctl01    4  PASS  :  shared memory appears to be removed

If I can find some time, I'll try to grab your patch and see if I can reproduce the error on my machine.

-Paul Larson
From: mingming c. <cm...@us...> - 2002-10-25 18:57:39
Paul Larson wrote:
>
> On Fri, 2002-10-25 at 12:17, mingming cao wrote:
> >
> > shmctl01    3  FAIL  :  # of attaches is incorrect - 0
> I guess you are running it with -i2?

No, I did not use -i2. What I did is just run ./shmctl01

> I just tried shmctl01 -i2 on a
> 2.5.44 kernel and did not get this error.

Sorry, Paul. Could you try 2.5.44-mm4? I saw the error on a clean 2.5.44-mm4 (without my patch). And I remember I saw this on 2.5.42-mm2 also.

Here is what I saw:

[root@elm3b83 shmctl]# ./shmctl01
shmctl01    1  PASS  :  pid, size, # of attaches and mode are correct - pass #1
shmctl01    0  INFO  :  shmdt() failed - 22
shmctl01    0  INFO  :  shmdt() failed - 22
shmctl01    0  INFO  :  shmdt() failed - 22
shmctl01    0  INFO  :  shmdt() failed - 22
shmctl01    2  PASS  :  pid, size, # of attaches and mode are correct - pass #2
shmctl01    0  INFO  :  shmdt() failed - 22
shmctl01    0  INFO  :  shmdt() failed - 22
shmctl01    0  INFO  :  shmdt() failed - 22
shmctl01    0  INFO  :  shmdt() failed - 22
shmctl01    3  FAIL  :  # of attaches is incorrect - 0
shmctl01    0  INFO  :  shmdt() failed - 22
shmctl01    0  INFO  :  shmdt() failed - 22
shmctl01    0  INFO  :  shmdt() failed - 22
shmctl01    0  INFO  :  shmdt() failed - 22
shmctl01    4  PASS  :  new mode and change time are correct

[root@elm3b83 shmctl]# ./shmctl01 -i2
shmctl01    1  PASS  :  pid, size, # of attaches and mode are correct - pass #1
shmctl01    0  INFO  :  shmdt() failed - 22
shmctl01    0  INFO  :  shmdt() failed - 22
shmctl01    0  INFO  :  shmdt() failed - 22
shmctl01    0  INFO  :  shmdt() failed - 22
shmctl01    2  PASS  :  pid, size, # of attaches and mode are correct - pass #2
shmctl01    0  INFO  :  shmdt() failed - 22
shmctl01    0  INFO  :  shmdt() failed - 22
shmctl01    0  INFO  :  shmdt() failed - 22
shmctl01    0  INFO  :  shmdt() failed - 22
shmctl01    3  FAIL  :  # of attaches is incorrect - 0
shmctl01    0  INFO  :  shmdt() failed - 22
shmctl01    0  INFO  :  shmdt() failed - 22
shmctl01    0  INFO  :  shmdt() failed - 22
shmctl01    0  INFO  :  shmdt() failed - 22
shmctl01    4  PASS  :  new mode and change time are correct
shmctl01    1  BROK  :  couldn't create the shared memory segment
shmctl01    2  BROK  :  Remaining cases broken
shmctl01    3  BROK  :  Remaining cases broken
shmctl01    4  BROK  :  Remaining cases broken
From: Paul L. <pl...@li...> - 2002-10-25 19:16:32
On Fri, 2002-10-25 at 13:51, mingming cao wrote:
> Paul Larson wrote:
> >
> > On Fri, 2002-10-25 at 12:17, mingming cao wrote:
> > >
> > > shmctl01    3  FAIL  :  # of attaches is incorrect - 0
> > I guess you are running it with -i2?
> No, I did not use -i2.

Maybe I just read it wrong.

> What I did is just run ./shmctl01
>
> > I just tried shmctl01 -i2 on a
> > 2.5.44 kernel and did not get this error.
> Sorry, Paul. Could you try 2.5.44-mm4? I saw the error on a clean
> 2.5.44-mm4 (without my patch). And I remember I saw this on 2.5.42-mm2
> also.
>
> Here is what I saw:

I still have my results from testing 2.5.44-mm4; here's a cut and paste from shmctl01:

<<<test_start>>>
tag=shmctl01 stime=1035486589
cmdline="shmctl01"
contacts=""
analysis=exit
initiation_status="ok"
<<<test_output>>>
shmctl01    0  INFO  :  shmdt() failed - 22
shmctl01    0  INFO  :  shmdt() failed - 22
shmctl01    0  INFO  :  shmdt() failed - 22
shmctl01    0  INFO  :  shmdt() failed - 22
shmctl01    1  PASS  :  pid, size, # of attaches and mode are correct - pass #1
shmctl01    0  INFO  :  shmdt() failed - 22
shmctl01    1  PASS  :  pid, size, # of attaches and mode are correct - pass #1
shmctl01    0  INFO  :  shmdt() failed - 22
shmctl01    1  PASS  :  pid, size, # of attaches and mode are correct - pass #1
shmctl01    0  INFO  :  shmdt() failed - 22
shmctl01    1  PASS  :  pid, size, # of attaches and mode are correct - pass #1
shmctl01    0  INFO  :  shmdt() failed - 22
shmctl01    1  PASS  :  pid, size, # of attaches and mode are correct - pass #1
shmctl01    2  PASS  :  pid, size, # of attaches and mode are correct - pass #2
shmctl01    3  PASS  :  new mode and change time are correct
shmctl01    4  PASS  :  shared memory appears to be removed
<<<execution_status>>>
duration=1 termination_type=exited termination_id=0 corefile=no
cutime=0 cstime=1
<<<test_end>>>

I haven't seen this test fail before, but I'll be happy to do more testing with your patch to see if I can reproduce it. You may also want to consider updating LTP to the newest version. I'm fairly certain that shmctl01 hasn't been changed since the version you have, but just to be consistent you may want to do that.

Thanks,
Paul Larson
From: Manfred S. <ma...@co...> - 2002-10-25 20:24:21
Attachments:
patch-ltp
mingming cao wrote:

>Here is what I saw:
>
>[root@elm3b83 shmctl]# ./shmctl01
>shmctl01    1  PASS  :  pid, size, # of attaches and mode are correct - pass #1
>shmctl01    0  INFO  :  shmdt() failed - 22
>shmctl01    0  INFO  :  shmdt() failed - 22
>shmctl01    0  INFO  :  shmdt() failed - 22
>shmctl01    0  INFO  :  shmdt() failed - 22
>

These failures are caused by a bug in the ltp test. See the attached patch.

>shmctl01    2  PASS  :  pid, size, # of attaches and mode are correct - pass #2
>shmctl01    0  INFO  :  shmdt() failed - 22
>shmctl01    0  INFO  :  shmdt() failed - 22
>shmctl01    0  INFO  :  shmdt() failed - 22
>shmctl01    0  INFO  :  shmdt() failed - 22
>shmctl01    3  FAIL  :  # of attaches is incorrect - 0
>

This one is odd. The testcase contains races, but they can only increase the # of attaches. Could you strace shmctl01? The testcase with shmat(), then fork(), fails.

--
    Manfred
From: mingming c. <cm...@us...> - 2002-10-25 05:59:22
Rusty Russell wrote:
>
> Here's my brief audit:
>
> >+	int max_id = ids->max_id;
> >
> >-	for (id = 0; id <= ids->max_id; id++) {
> >+	read_barrier_depends();
> >+	for (id = 0; id <= max_id; id++) {
>
> That needs to be a rmb(), not a read_barrier_depends().

Thanks for spending some time reviewing the barriers for me. While I was thinking about why a rmb() is needed here, I found that maybe we don't need a barrier here at all. ipc_findkey() (the code above) and grow_ary() are both protected by ipc_ids.sem (documentation for this is missing), so both the max_id and the entries array seen by ipc_findkey() should be the latest ones. Also, I think it's safe to remove the rmb() in ipc_get() for the same reason: ipc_get() is only used by shm_get_stat() through shm_get(), and is called with shm_ids.sem held. (Maybe ipc_get() should be removed entirely?)

> And like all
> barriers, it *requires* a comment:
> 	/* We must read max_id before reading any entries */

Sure. I will add such comments in all the places where barriers are being used. I will do as much as I can to add more comments in the code about which lock/sem must be held before/after each function is called. :-)

> I can't see the following in the patch posted, but:
> > void ipc_rcu_free(void* ptr, int size)
> > {
> > 	struct rcu_ipc_free* arg;
> >
> > 	arg = (struct rcu_ipc_free *) kmalloc(sizeof(*arg), GFP_KERNEL);
> > 	if (arg == NULL)
> > 		return;
> > 	arg->ptr = ptr;
> > 	arg->size = size;
> > 	call_rcu(&arg->rcu_head, ipc_free_callback, arg);
> > }
>
> This is unacceptable crap, sorry. You *must* allocate the resources
> required to free the object *at the time you allocate the object*,
> since freeing must not fail.
>
> > Even better: is it possible to embed the rcu_ipc_free inside the
> > object-to-be-freed? Perhaps not?
>
> Yes, this must be done.

I thought before about embedding rcu_ipc_free inside the ipc_ids structure. But there could be a problem if grow_ary() is called again before the old array from the previous grow_ary() has been freed. I see a need to do that now, as you made a very good point. I will make the changes tomorrow.

Thanks a lot for your comments.

Mingming
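[The sort of locking documentation being promised above might read like this - my sketch, summarizing rules stated in this thread rather than text from the patch.]

	/*
	 * Locking rules (sketch):
	 *
	 * ipc_ids.sem     - taken for any update of an ipc_ids (grow_ary(),
	 *                   id allocation); also held around ipc_findkey()
	 *                   and shm_get_stat(), so those readers see a
	 *                   consistent max_id and entries array with no
	 *                   extra barriers.
	 * rcu_read_lock() - taken by ipc_lock() before dereferencing
	 *                   ids->entries, allowing grow_ary() to RCU-free
	 *                   a replaced array.
	 * perm->lock      - the per-id spinlock. ipc_lock() returns with it
	 *                   held; ipc_unlock() releases it. perm->deleted
	 *                   must be re-checked after acquiring the lock,
	 *                   since ipc_rmid() may have run while spinning.
	 */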
From: Bill H. <ha...@au...> - 2002-10-31 17:55:29
Andrew Morton wrote:
>
> mingming cao wrote:
> >
> > Hi Andrew,
> >
> > Here is the updated ipc lock patch:
>
> So to be able to commend this change to Linus I'd have to rely on
> assurances from people who _do_ understand IPC (Hugh?) and on lots
> of testing.
>
> So yes, I'll include it, and would solicit success reports from
> people who are actually exercising that code path, thanks.

Andrew,

I tested Mingming's RCU ipc lock patch using a *new* microbenchmark - semopbench. semopbench was written to test the performance of Mingming's patch. I also ran a 3-hour stress test, which completed successfully. An explanation of the microbenchmark is below the results. Here is a link to the microbenchmark source:

http://www-124.ibm.com/developerworks/opensource/linuxperf/semopbench/semopbench.c

SUT: 8-way 700 MHz PIII

I tested 2.5.44-mm2 and 2.5.44-mm2 + RCU ipc patch.

    >semopbench -g 64 -s 16 -n 16384 -r > sem.results.out
    >readprofile -m /boot/System.map | sort -n +0 -r > sem.profile.out

The metric is seconds per repetition. Lower is better.

    kernel               run 1     run 2
                         seconds   seconds
    ==================   =======   =======
    2.5.44-mm2            515.1     515.4
    2.5.44-mm2+rcu-ipc     46.7      46.7

With Mingming's patch, the test completes 10X faster.

-----

The 2.5.44-mm2 readprofile shows 70% of 8 CPUs spinning on .text.lock.sem:
http://www-124.ibm.com/developerworks/opensource/linuxperf/semopbench/sem.profile.1.out

2.5.44-mm2 + Mingming's patch shows that the spin on .text.lock.sem is gone:
http://www-124.ibm.com/developerworks/opensource/linuxperf/semopbench/sem.rcu.profile.1.out

Here are the semopbench results for 2.5.44-mm2:
http://www-124.ibm.com/developerworks/opensource/linuxperf/semopbench/sem.results.1.out

Here are the semopbench results for 2.5.44-mm2 + Mingming's patch:
http://www-124.ibm.com/developerworks/opensource/linuxperf/semopbench/sem.rcu.results.1.out

-----

Here is some info on how the microbenchmark works:

    >semopbench -g 64 -s 16 -n 16384 -r

-g 64 creates 64 sema4 groups: group0, group1, ... group63

-s 16 creates 16 sema4s in each group:

    group0  - sem0, sem1, ... sem15
    group1  - sem0, sem1, ... sem15
    ...
    group63 - sem0, sem1, ... sem15

For each of the 1024 (64*16) sema4s, a process is forked that sleeps on its own sema4. When the test starts, the master process posts the sema4 for the 1st process in each group. When the 1st process in each group wakes up, it:

(a) resets its own sema4
(b) posts the sema4 for the next process in the group
(c) waits on its own sema4

-n 16384 runs through each sema4 group in the above manner 16384 times.

semopbench reports:

(1) the average microseconds it takes each process to complete its repetitions
(2) CPU utilization

-d turns on debug printfs
-v turns on per-process times
-r does a readprofile -r (reset of the profile buffer) before the test starts

Bill Hartner
IBM Linux Technology Center Performance Team
http://www-124.ibm.com/developerworks/oss/linux
ha...@au...
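[For readers who want the flavor of the benchmark without fetching the source, here is a minimal userspace sketch of the post/wait ring Bill describes - my illustration only; semopbench itself is at the URL above and does considerably more (groups, timing, profiling).]

	/* semring.c - each child sleeps on its own SysV semaphore, then posts
	 * the next one, hammering the kernel's semaphore lock paths.
	 * Assumes Linux's behavior of initializing new semaphores to zero. */
	#include <stdio.h>
	#include <stdlib.h>
	#include <sys/types.h>
	#include <sys/ipc.h>
	#include <sys/sem.h>
	#include <sys/wait.h>

	static void sem_op(int semid, int num, int op)
	{
		struct sembuf sb = { .sem_num = num, .sem_op = op, .sem_flg = 0 };

		if (semop(semid, &sb, 1) == -1) {
			perror("semop");
			exit(1);
		}
	}

	int main(void)
	{
		int nsems = 16, reps = 1000, i, r;
		int semid = semget(IPC_PRIVATE, nsems, IPC_CREAT | 0600);

		if (semid == -1) { perror("semget"); exit(1); }

		for (i = 0; i < nsems; i++) {
			if (fork() == 0) {
				int next = (i + 1) % nsems;

				for (r = 0; r < reps; r++) {
					sem_op(semid, i, -1);    /* wait on my own semaphore  */
					sem_op(semid, next, 1);  /* post the next in the ring */
				}
				exit(0);
			}
		}
		sem_op(semid, 0, 1);             /* master kicks off the ring */
		for (i = 0; i < nsems; i++)
			wait(NULL);              /* reap all children         */
		semctl(semid, 0, IPC_RMID);      /* remove the semaphore set  */
		return 0;
	}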