From: SourceForge.net <no...@so...> - 2012-06-07 18:55:08
Bugs item #3137873, was opened at 2010-12-15 04:45
Message generated for change (Comment added) made by randyhb
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=684730&aid=3137873&group_id=119701

Please note that this message will contain a full copy of the comment thread, including the initial issue submission, for this request, not just the latest update.

Category: External Functions
Group: Next Release
Status: Pending
Resolution: Fixed
Priority: 7
Private: No
Submitted By: Henning G (ubihga)
Assigned to: Rick McGuire (bigrixx)
Summary: rxapi looping

Initial Comment:
Hi
I have recently installed Rexx version 4.1 on one of my AIXes.
AIX version: 61-05-011016_SP, running on POWER7 hardware; the same problem is seen on POWER5 and POWER6 hardware.
Rexx works fine, but after some time I see that rxapi starts looping. I have seen the same behaviour in Rexx 4.0.1, but never got a solution for it.
It looks like it happens more frequently if I add more CPUs to my AIX. With only one CPU it works for 5-7 days before the looping starts. If I have 2 or more CPUs on the system, the loop usually starts within 1 day.
Is there anything I can do to collect some documentation for you when the loop is happening?
I hope you can help me find this problem, or I am stuck at version 3.2, which is working fine.
/Henning Gammelmark

----------------------------------------------------------------------

Comment By: Randy (randyhb)
Date: 2012-06-07 11:55

Message:
WHEN a fix will be available (I meant)...

----------------------------------------------------------------------

Comment By: Randy (randyhb)
Date: 2012-06-07 11:54

Message:
I opened a new bug report, but was really trying to find out when a fix for this will be available for 64-bit AIX. Thanks!!

----------------------------------------------------------------------

Comment By: Mark Miesfeld (miesfeld)
Date: 2012-03-27 08:01

Message:
Rainer will have to do the build; none of the other developers have access to an AIX box. I can't say when he might do that. You could try doing a build yourself from source; it should be relatively simple.

----------------------------------------------------------------------

Comment By: Henning G (ubihga)
Date: 2012-03-27 04:21

Message:
Hi
Can you say anything about when 4.1.1 will be available on AIX?
/HGA

----------------------------------------------------------------------

Comment By: Mark Miesfeld (miesfeld)
Date: 2012-02-17 06:46

Message:
Henning,
The 32-bit and 64-bit versions of ooRexx use the same code, so the fix is in either version.

----------------------------------------------------------------------

Comment By: Henning G (ubihga)
Date: 2012-02-17 00:55

Message:
Hi.
Thanks for the info; I will test the new version as soon as possible. What about the 32-bit version, any news in that area?
/Henning

----------------------------------------------------------------------

Comment By: Mark Miesfeld (miesfeld)
Date: 2012-02-16 09:44

Message:
Committed revision 7539.
This commit back-ports the fix Rick did in trunk, r6571.
Henning - this fix, which you said worked for your 64-bit system with the test build, will be in the next bug fix release. Please be sure to test the 4.1.1 beta when we are in the beta period.

----------------------------------------------------------------------

Comment By: Henning G (ubihga)
Date: 2011-05-05 00:09

Message:
Hi
We still have the problem in the 32-bit version, so the problem has not been fixed. What do you need for further investigation?
/Henning

----------------------------------------------------------------------

Comment By: Henning G (ubihga)
Date: 2011-03-10 00:29

Message:
In January you built a 64-bit version of Rexx 4.1 for me with the "rxapi-looping fix" included, but I cannot remember if you also built a 32-bit version with the patch built in. All of my 64-bit implementations work, but I still have the looping bug on the LPARs where the application requires 32-bit Rexx.
Can I get you to build a 32-bit Rexx 4.1 with this fix included?

----------------------------------------------------------------------

Comment By: Henning G (ubihga)
Date: 2011-01-16 23:09

Message:
I have been running stress tests on my system during the weekend, and I can no longer create the problem. So I would say the problem has been fixed.
Final question: when can I have the official fix, so I can deploy Rexx 4.1 (incl. fix) to all my systems?
/HGA

----------------------------------------------------------------------

Comment By: Henning G (ubihga)
Date: 2011-01-14 03:01

Message:
I have applied the new rxapi to my system. After the fix was applied, I have not (yet) been able to recreate the rxapi loop problem. I have just finished test #100K.
Not only does the rxapi loop seem to be solved; the "REX121: Storage for data queues is exhausted" and "REX124: The queue 'quename' does not exist." problems also seem to be solved.
I will continue my testing.
/Henning

----------------------------------------------------------------------

Comment By: Henning G (ubihga)
Date: 2011-01-13 04:56

Message:
Observations and questions.
If my Rexx looks like this:

#! /usr/bin/rexx
CALL init_prog
CALL doit
CALL prog_end
EXIT prog_rc /* end main */

init_prog: /* define rexx queue */
newq = rxqueue('c')
oldq = rxqueue('s',newq)
RETURN /* end init_prog */

doit:
'hostname | rxqueue ' newq
RETURN /* */

prog_end: /* del tmp queue */
rx = rxqueue('s',oldq)
ry = rxqueue('d',newq)
RETURN /* prog_end */

and it is started several times (i.e. 10000 times from another Rexx), I get "REX121: Storage for data queues is exhausted.", or I can recreate the original rxapi loop problem.
If I add this to procedure "doit:":

DO QUEUED()
  PULL data
END

then it looks like, with this minor change, I can't recreate the problem, and "REX121: Storage for data queues is exhausted." does not happen.
I also tried to recreate the problem with "newq = rxqueue('c')" ... "rxqueue('d',newq)" removed from the Rexx; this also works without any problems.
This made me consider why I had added the "rxqueue('c') - rxqueue('d',newq)" to our Rexxes, and the story is: 10 years ago we had a problem on our AIXes if we had several Rexxes running simultaneously. The problem was that the Rexxes would "snatch" data from each other's rxqueue. I think we used Rexx version 1 or 2 at that time.
So the question is: has this problem been solved, so I can avoid the massive use of "rxqueue('c')" and thus reduce the need for serialization in rxapi? The "rxapi loop" problem will still exist, but I will have eliminated (or reduced) the trigger in my environment?

----------------------------------------------------------------------

Comment By: Henning G (ubihga)
Date: 2011-01-12 23:19

Message:
Hi
Now I am 100% sure: the problem is NOT fixed. This morning I had the loop without running my stress Rexxes.
This is the status:

(dbx) thread
thread state-k wchan state-u k-tid mode held scope function
>$t9681 run blocked 34406595 k no sys _global_lock_common
$t9702 run running 50921699 u no sys tolower
(dbx) thread current 9702
(dbx) listi
0x100015038 (tolower+0x10) 7c0903a6 mtctr r0
0x10001503c (tolower+0x14) 4e800420 bctr
0x100015040 (tolower+0x18) 00000000 Invalid opcode.
0x100015044 (tolower+0x1c) 000ca000 Invalid opcode.
0x100015048 (tolower+0x20) 00000000 Invalid opcode.
0x10001504c (tolower+0x24) 00000018 Invalid opcode.
0x100015050 (toupper) e98203d8 ld r12,0x3d8(r2)
0x100015054 (toupper+0x4) f8410028 std r2,0x28(r1)
0x100015058 (toupper+0x8) e80c0000 ld r0,0x0(r12)
0x10001505c (toupper+0xc) e84c0008 ld r2,0x8(r12)
(dbx) func glink64.tolower
(dbx) list
no source file
(dbx)

/Henning

----------------------------------------------------------------------

Comment By: Henning G (ubihga)
Date: 2011-01-12 04:12

Message:
Hi
Supplemental to the last comment: it looks like it is easier to provoke the error if SMT=on and the CPU count is > 1.
I can recreate the problem on a 1-CPU system, but the Rexx runs 3-10 times longer before the problem shows up. But then, only one CPU also means less possibility for concurrency!
/Henning

----------------------------------------------------------------------

Comment By: Henning G (ubihga)
Date: 2011-01-12 03:26

Message:
Hi
The fix was received, applied and tested. It is better, but I am still able to recreate the problem, and I also get other errors. Now it looks like rxapi fails to clean up memory. I get:

REX121: Storage for data queues is exhausted.

To fix this I do:
- stop/kill all running Rexx programs
- kill rxapi
- wait until all Rexxes have terminated.

Further documentation:

(dbx) thread
thread state-k wchan state-u k-tid mode held scope function
>$t5 run blocked 35258983 k no sys _global_lock_common
$t12026 run running 58261573 u no sys isupper
(dbx) chread current t12026
chread current t12026
^ unrecognized command
(dbx) thread current 12026
(dbx) listi
0x90000000021b880 (isupper+0x60) 7d8803a6 mtlr r12
0x90000000021b884 (isupper+0x64) 4e800020 blr
0x90000000021b888 (isupper+0x68) 00000000 Invalid opcode.
0x90000000021b88c (isupper+0x6c) 00002041 Invalid opcode.
0x90000000021b890 (isupper+0x70) 80000100 lwz r0,0x100(r0)
0x90000000021b894 (isupper+0x74) 00000000 Invalid opcode.
0x90000000021b898 (isupper+0x78) 00000068 Invalid opcode.
0x90000000021b89c (isupper+0x7c) 00076973 Invalid opcode.
0x90000000021b8a0 (isupper+0x80) 75707065 andis. r16,r11,0x7065
0x90000000021b8a4 (isupper+0x84) 72000000 andi. r0,r16,0x0
(dbx) func isupper
(dbx) func isupper

REX121: Storage for data queues is exhausted.

Retest:
Looping 1200
Looping 1300
Looping 1400
Looping 1500
REX121: Storage for data queues is exhausted.
REX121: Storage for data queues is exhausted.

At this point my test goes "bananas" (Danish for out of control):

root@aix031:/rxtest-> ps -ef | grep rxapi
nobody 3342576 3538944 120 10:29:31 - 0:23 /opt/ooRexx/bin/rxapi
root 56361044 2294156 1 10:56:51 pts/2 0:00 grep rxapi
root@aix031:/rxtest-> dnx -a 3342576
ksh: dnx: not found.
root@aix031:/rxtest-> dbx -a 3342576
Waiting to attach to process 3342576 ...
Successfully attached to rxapi.
warning: Directory containing rxapi could not be determined.
Apply 'use' command to initialize source path.
Type 'help' for help.
reading symbolic information ...
stopped in recv at 0x9000000002eef68 ($t235)
0x9000000002eef68 (recv+0xc8) e8410028 ld r2,0x28(r1)
(dbx) func recv
(dbx) listi
0x9000000002eef68 (recv+0xc8) e8410028 ld r2,0x28(r1)
0x9000000002eef6c (recv+0xcc) 2c030000 cmpi cr0,0x0,r3,0x0
0x9000000002eef70 (recv+0xd0) 60000000 ori r0,r0,0x0
0x9000000002eef74 (recv+0xd4) 7c7c07b4 Invalid opcode.
0x9000000002eef78 (recv+0xd8) 4180002c blt 0x9000000002eefa4 (recv+0x104)
0x9000000002eef7c (recv+0xdc) 63830000 ori r3,r28,0x0
0x9000000002eef80 (recv+0xe0) eb210088 ld r25,0x88(r1)
0x9000000002eef84 (recv+0xe4) e98100d0 ld r12,0xd0(r1)
0x9000000002eef88 (recv+0xe8) 382100c0 addi r1,0xc0(r1)
0x9000000002eef8c (recv+0xec) 7d8803a6 mtlr r12
(dbx) thread
thread state-k wchan state-u k-tid mode held scope function
>$t235 run running 26738801 k no sys recv
$t1775 run running 11600205 u no sys Utilities::strCaselessCompare(const char*,const char*)
(dbx) thread current t235
pthread current t235
^ syntax error
(dbx) thread current 1775
(dbx) thread
thread state-k wchan state-u k-tid mode held scope function
*$t235 run running 26738801 k no sys recv
>$t1775 run running 11600205 u no sys Utilities::strCaselessCompare(const char*,const char*)
(dbx) listi
0x100014f38 (Utilities::strCaselessCompare(const char*,const char*)+0x58) e8410028 ld r2,0x28(r1)
0x100014f3c (Utilities::strCaselessCompare(const char*,const char*)+0x5c) 607f0000 ori r31,r3,0x0
0x100014f40 (Utilities::strCaselessCompare(const char*,const char*)+0x60) e86100b8 ld r3,0xb8(r1)
0x100014f44 (Utilities::strCaselessCompare(const char*,const char*)+0x64) 88630000 lbz r3,0x0(r3)
0x100014f48 (Utilities::strCaselessCompare(const char*,const char*)+0x68) 480000e1 bl 0x100015028 (tolower)
0x100014f4c (Utilities::strCaselessCompare(const char*,const char*)+0x6c) e8410028 ld r2,0x28(r1)
0x100014f50 (Utilities::strCaselessCompare(const char*,const char*)+0x70) 63e50000 ori r5,r31,0x0
0x100014f54 (Utilities::strCaselessCompare(const char*,const char*)+0x74) 60640000 ori r4,r3,0x0
0x100014f58 (Utilities::strCaselessCompare(const char*,const char*)+0x78) 60a30000 ori r3,r5,0x0
0x100014f5c (Utilities::strCaselessCompare(const char*,const char*)+0x7c) 7c032000 cmp cr0,0x0,r3,r4

----------------------------------------------------------------------

Comment By: Rainer Tammer (tammer)
Date: 2011-01-11 08:51

Message:
Hello,
I will build a fix on the base of 4.1.0 for that.
Bye
Rainer

----------------------------------------------------------------------

Comment By: Mark Miesfeld (miesfeld)
Date: 2011-01-11 06:55

Message:
Henning,
We'll try to get you a build to test, but it may take a little while. Be a little patient here.

----------------------------------------------------------------------

Comment By: Henning G (ubihga)
Date: 2011-01-11 06:30

Message:
Hi
If you can send me the fix (or give a link where I can download it), I can apply it ASAP. The AIX I am using is a "crash-and-burn" LPAR, so I have no problem with applying a non-official fix.
To the stack trace question: I need some help to do that, because I am not a dbx expert. But if someone can give me the method/steps, I can supply the needed data.
/Henning

----------------------------------------------------------------------

Comment By: Rick McGuire (bigrixx)
Date: 2011-01-11 06:13

Message:
Committed revision 6571.
I'm about 99% certain I have a fix for this problem.
It indeed was a synchronization issue that can occur when new named queues are frequently created and deleted. This problem could have shown up on any of the platforms, but for some reason we've only seen it on AIX so far. I've fixed this in the trunk. It would be nice to see if we can get a version of 4.1.0 created with this fix applied, to see if the problem has really gone away.

----------------------------------------------------------------------

Comment By: Rick McGuire (bigrixx)
Date: 2011-01-11 02:44

Message:
No, what I really need is the callback chain for that thread, not the instructions at the point where this is running. The previous post showed what method this is looping in, but I need to understand how we got to that method and what request is currently being executed.

----------------------------------------------------------------------

Comment By: Henning G (ubihga)
Date: 2011-01-11 02:38

Message:
Is this what you wanted?

(dbx) pthread
thread state-k wchan state-u k-tid mode held scope function
$t1 wait 0xf100050014616a08 running 78512381 k no sys
>$t45 run blocked 49807565 k no sys _global_lock_common
$t544 run running 15991017 u no sys __towlower_std
$t534 wait 0xf1000500001b08c8 running 62783673 k no sys
$t536 wait 0xf1000500007e28c8 running 37027857 k no sys
$t539 wait 0xf1000500003540c8 running 37093621 k no sys
$t545 wait 0x0000000110011730 blocked 71565447 k no sys _mutex_lock
$t546 wait 0x0000000110011730 blocked 61800563 k no sys _mutex_lock
$t547 wait 0x0000000110011730 blocked 19071095 k no sys _mutex_lock
thrad current 544
(dbx) pthread
thread state-k wchan state-u k-tid mode held scope function
$t1 wait 0xf100050014616a08 running 78512381 k no sys
*$t45 run blocked 49807565 k no sys _global_lock_common
>$t544 run running 15991017 u no sys __towlower_std
$t534 wait 0xf1000500001b08c8 running 62783673 k no sys
$t536 wait 0xf1000500007e28c8 running 37027857 k no sys
(dbx) listi
0x90000000022a2a0 (__towlower_std) f8610030 std r3,0x30(r1)
0x90000000022a2a4 (__towlower_std+0x4) 7cc307b4 Invalid opcode.
0x90000000022a2a8 (__towlower_std+0x8) e8810030 ld r4,0x30(r1)
0x90000000022a2ac (__towlower_std+0xc) 8004003c lwz r0,0x3c(r4)
0x90000000022a2b0 (__towlower_std+0x10) 7c060040 cmpl cr0,0x0,r6,r0
0x90000000022a2b4 (__towlower_std+0x14) 4d810020 bgtlr
0x90000000022a2b8 (__towlower_std+0x18) 78651764 Invalid opcode.
0x90000000022a2bc (__towlower_std+0x1c) 80040038 lwz r0,0x38(r4)
0x90000000022a2c0 (__towlower_std+0x20) 7c060040 cmpl cr0,0x0,r6,r0
0x90000000022a2c4 (__towlower_std+0x24) 4d800020 bltlr
(dbx)

----------------------------------------------------------------------

Comment By: Henning G (ubihga)
Date: 2011-01-11 02:00

Message:
Hi
Nor am I a dbx expert, so could you please find someone who could guide me to do the right stack trace.
/Hga

----------------------------------------------------------------------

Comment By: Rick McGuire (bigrixx)
Date: 2011-01-11 01:42

Message:
I don't have access to an AIX system myself, and I'm not the most literate in using dbx either. The thread I'd like to see a stack trace for is this one:

$t9257 run running 14221547 u no sys QueueTable::locate(const char*)

That sort of confirms my suspicions on what might be going wrong, but I'd like to see a stack trace so I can see what operation is in process.

----------------------------------------------------------------------

Comment By: Henning G (ubihga)
Date: 2011-01-11 00:13

Message:
Hi
I have re-created the problem and run some more dbx commands. I hope you can use the output; if not, please do as I requested in the previous comment (describe what I have to do).
Question: have you tried to recreate the problem with rx4 and rx5 on your system?
/Henning

The dbx output:

(dbx) listi
0x9000000000027dc (_global_lock_common+0x4fc) 60210000 ori r1,r1,0x0
0x9000000000027e0 (_global_lock_common+0x500) 7c421378 mr r2,r2
0x9000000000027e4 (_global_lock_common+0x504) e8610088 ld r3,0x88(r1)
0x9000000000027e8 (_global_lock_common+0x508) 38800000 li r4,0x0
0x9000000000027ec (_global_lock_common+0x50c) 63850000 ori r5,r28,0x0
0x9000000000027f0 (_global_lock_common+0x510) 48003423 bla 0x0000000000003420
0x9000000000027f4 (_global_lock_common+0x514) 2c030000 cmpi cr0,0x0,r3,0x0
0x9000000000027f8 (_global_lock_common+0x518) e8610088 ld r3,0x88(r1)
0x9000000000027fc (_global_lock_common+0x51c) 4182fba0 beq 0x90000000000239c (_global_lock_common+0xbc)
0x900000000002800 (_global_lock_common+0x520) 3c802000 lis r4,0x2000
(dbx) func _global_lock_common
(dbx) proc
{
-Identification/Authentication Info---------------------------
pi_pid: 2162768
pi_sid: 2162768
pi_ppid: 2097256
pi_pgrp: -2
pi_uid: -2
pi_suid: -2
---------------Controlling TTY Info---------------------------
pi_ttyp: 0
pi_ttyd: 0x0000000000000000
pi_ttympx: 0x0000000000000000
-----------------------------Scheduler Information------------
pi_nice: 0x00000014
pi_state: SSTOP
pi_flags: SLOAD | STRC | SWTED | SORPHANPGRP | SEXECED
pi_flags2: S64BIT | SPTHREADED | SPROJFIXED
pi_thcount: 406
pi_cpu: 0
pi_pri: 60
---------------------------------------------File Management--
pi_maxofile: 0x00000192
pi_cmask: 0x0022
pi_cdir: 0xf1000102813fbc20
pi_rdir: 0x0000000000000000
pi_comm: "rxapi"
----------------------------------Memory----------------------
pi_adspace: 0x00000017001f0480
pi_majflt: 0x000000000000008d
pi_minflt: 0x00000000000015cc
pi_repage: 0x0000000000000000
pi_size: 0x0000000000000c40
pi_utime: N/A
pi_stime: N/A
-------Credentials, Accounting, Profiling & Resource Limits---
pi_cred: (use proc cred)
pi_ru: (use proc ru)
pi_cru: (use proc cru)
pi_ioch: 0x000000000007c34d
pi_irss: 0x000000001201d5e0
pi_start: Tue Jan 11 08:37:41 2011
pi_rlimit: (use proc rlimit)
-Memory Usage-------------------------------------------------
pi_drss: 0x0000000000000c26
pi_trss: 0x0000000000000065
pi_dvm: 0x0000000000000c26
pi_pi_prm: 0x0000000000000000
pi_tsize: 0x000000000001a757
pi_dsize: 0x0000000003155b80
pi_sdsize: 0x0000000000000000
------------------Signal Management---------------------------
pi_signal: (use proc signal)
pi_sigflags: (use proc sigflags)
pi_sig: <none>
---------------------------------WLM Information--------------
pi_classname: Unclassified
pi_tag: <none>
pi_chk_utime: N/A
pi_chk_ctime: N/A
}
(dbx) pthread
thread state-k wchan state-u k-tid mode held scope function
$t1 wait 0xf10005001455d208 running 8454163 k no sys
>$t45 run blocked 27000849 k no sys _global_lock_common
$t9267 wait 0xf1000500001c30c8 running 31326369 k no sys
$t9263 wait 0x0000000110011730 blocked 33554473 k no sys _mutex_lock
$t9254 wait 0xf1000500002bf0c8 running 35192941 k no sys
$t9264 wait 0xf1000500009438c8 running 33292391 k no sys
$t9257 run running 14221547 u no sys QueueTable::locate(const char*)
$t9265 wait 0xf1000500007dd0c8 running 35586283 k no sys
$t9273 wait 0x0000000110011730 blocked 24510523 k no sys _mutex_lock
$t9269 wait 0xf1000500004530c8 running 23003283 k no sys
$t9274 wait 0x0000000110011730 blocked 32964753 k no sys _mutex_lock
$t9275 wait 0x0000000110011730 blocked 37159101 k no sys _mutex_lock
$t9276 wait 0x0000000110011730 blocked 20709531 k no sys _mutex_lock
$t9277 wait 0x0000000110011730 blocked 37486597 k no sys _mutex_lock
$t9278 wait 0x0000000110011730 blocked 38076475 k no sys _mutex_lock
$t9279 wait 0x0000000110011730 blocked 33882203 k no sys _mutex_lock
$t9280 wait 0x0000000110011730 blocked 33751211 k no sys _mutex_lock
$t9281 wait 0x0000000110011730 blocked 35258479 k no sys _mutex_lock

----------------------------------------------------------------------

Comment By: Henning G (ubihga)
Date: 2011-01-10 23:32

Message:
Hi
Can you (in detail) describe how you want the stack trace? Then I can recreate the problem and give you the documentation you request. Something like this:
a. recreate the problem
b. start dbx: dbx -a nnn
c. dbx command
d. dbx command
e.
f.
g.
/Henning

----------------------------------------------------------------------

Comment By: Rick McGuire (bigrixx)
Date: 2011-01-10 08:11

Message:
The dbx information is only showing the state for a single thread, not for the thread that's doing the looping. It appears that somehow the queue manager is stuck in an infinite loop for a previous request, so all subsequent requests are locked out because the looping thread is holding the mutex. If we could get a stack trace showing where the loop is occurring, this might give us a better idea.

----------------------------------------------------------------------

Comment By: Rick McGuire (bigrixx)
Date: 2011-01-10 08:05

Message:
No, I disagree with this assessment. There is only one close() method on this class. The other close call is for the system API, to close the socket handle that's contained in the instance variable. In any event, if the application runs out of sockets, this should show up as an exception, not as a looping condition.

----------------------------------------------------------------------

Comment By: David Ashley (wdashley)
Date: 2011-01-10 07:48

Message:
I think I have a handle on this, but I am very unsure, and Rick will probably need to look into this. In SysCSStream.cpp and .hpp there are references to the method "close". Some references call the method with no arguments, and some call it with a single argument (a socket handle). The problem is that there is no method "close" defined that takes a single argument; there is a "close" method defined that takes zero arguments. So what I think is happening is that sockets are never being closed, and eventually the application runs out of sockets and then loops trying to get new ones. Again, this is all conjecture at this point, but it seems logical to me.
----------------------------------------------------------------------

Comment By: Henning G (ubihga)
Date: 2011-01-10 06:53

Message:
My cut/paste has "destroyed" the topas layout. If you want my txt file, please ask by mail.
/Henning

----------------------------------------------------------------------

Comment By: Henning G (ubihga)
Date: 2011-01-10 06:48

Message:
I am now able to reproduce the "rxapi loop" error.
I have created 2 small Rexx programs (rx4 and rx5). Rx4 just does "hostname | rxqueue". Rx5 calls rx4 10000 times. After approx. 2000-5000 calls, rxapi stops responding to "rxqueue('c')". When this happens, all subsequent Rexxes "hang", waiting for a response. The attached document (rxapi_loop_doc.txt) documents this situation. The documentation includes the dbx stack when we have the loop.
Rx4 and rx5 were only created to show the problem. In my real environment the problem is seen with far more complex Rexxes. The typical situation is: my LPAR runs OK for 1-10 days, then suddenly rxapi starts using 100% CPU (on a 1-CPU system). When this is happening I see 2-4 rxqueue commands hanging (documented in rxapi_loop_doc.txt). rx1, rx2 and rx3 can show a more production-like situation, but sometimes you have to run rx3 50-100 times before the problem can be seen. rx4 and rx5 are ONLY to reproduce the problem.
To me it looks like a serialization problem in the rxapi environment.
/Henning Gammelmark

I cannot attach files, so I cut/pasted instead.

rxapi_loop_doc.txt:

Topas Monitor for host: aix031 EVENTS/QUEUES FILE/TTY Mon Jan 10 13:29:37 2011 Interval: 2 Cswitch 152 Readch 13294 Syscall 301 Writech 3625 CPU User% Kern% Wait% Idle% Physc Entc Reads 10 Rawin 0 ALL 62.8 1.1 0.0 36.1 1.02 254.3 Writes 4 Ttyout 868 Forks 0 Igets 0 Network KBPS I-Pack O-Pack KB-In KB-Out Execs 0 Namei 28 Total 1.1 1.0 2.0 0.0 1.0 Runqueue 1.0 Dirblk 0 Waitqueue 0.0 Disk Busy% KBPS TPS KB-Read KB-Writ MEMORY Total 0.0 0.0 0.0 0.0 0.0 PAGING Real,MB 40960 Faults 149 % Comp 25 FileSystem KBPS TPS KB-Read KB-Writ Steals 0 % Noncomp 0 Total 11.2 8.5 11.2 0.0 PgspIn 0 % Client 0 PgspOut 0 Name PID CPU% PgSp Owner PageIn 0 PAGING SPACE rxapi 48037908 62.7 10.7 nobody PageOut 0 Size,MB 7168 vmmd 393232 0.5 0.2 root Sios 0 % Used 1 topas 721218 0.1 8.9 root % Free 99 gil 1507388 0.1 0.1 root AME nfs_stat 2490700 0.0 0.8 root TMEM,MB 20480 WPAR Activ 0 random 2818232 0.0 0.1 root CMEM,MB 141 WPAR Total 0 sshd 6619344 0.0 0.7 bduhga EF[T/A] 2.0/2.0 Press: "h"-help tecad_lo 5243094 0.0 1.7 root CI:0.0 CO:0.0 "q"-quit java 6881428 0.0 53.4 root swapper 264 0.0 0.1 root

root@aix031:/home/bduhga/rxapi-> ps -ef | grep rexx ( 357 lines )
root 2949242 1 0 13:27:42 pts/0 0:00 /usr/bin/rexx /home/bduhga/rxapi/rx4
root 4063410 1 0 13:27:42 pts/0 0:00 /usr/bin/rexx /home/bduhga/rxapi/rx4
root 4129008 1 0 13:27:42 pts/0 0:00 /usr/bin/rexx /home/bduhga/rxapi/rx4
root 4325486 1 0 13:27:42 pts/0 0:00 /usr/bin/rexx /home/bduhga/rxapi/rx4
root 4849864 1 0 13:27:42 pts/0 0:00 /usr/bin/rexx /home/bduhga/rxapi/rx4
root 5374178 1 0 13:27:42 pts/0 0:00 /usr/bin/rexx /home/bduhga/rxapi/rx4
root 5505062 1 0 13:27:36 pts/0 0:00 /usr/bin/rexx /home/bduhga/rxapi/rx4
root 5570582 1 0 13:27:42 pts/0 0:00 /usr/bin/rexx /home/bduhga/rxapi/rx4
root 5898336 1 0 13:27:35 pts/0 0:00 /usr/bin/rexx /home/bduhga/rxapi/rx4
root 6160392 1 0 13:27:42 pts/0 0:00 /usr/bin/rexx /home/bduhga/rxapi/rx4
root 6357210 1 0 13:27:35 pts/0 0:00 /usr/bin/rexx /home/bduhga/rxapi/rx4
root 6815832 1 0 13:27:37 pts/0 0:00 /usr/bin/rexx /home/bduhga/rxapi/rx4
root 7667788 1 0 13:27:42 pts/0 0:00 /usr/bin/rexx /home/bduhga/rxapi/rx4
root 7995590 1 0 13:27:42 pts/0 0:00 /usr/bin/rexx /home/bduhga/rxapi/rx4
root 8061016 1 0 13:27:42 pts/0 0:00 /usr/bin/rexx /home/bduhga/rxapi/rx4
root 8257742 1 0 13:27:42 pts/0 0:00 /usr/bin/rexx /home/bduhga/rxapi/rx4
root 9240792 1 0 13:27:42 pts/0 0:00 /usr/bin/rexx /home/bduhga/rxapi/rx4
root 9306186 1 0 13:27:41 pts/0 0:00 /usr/bin/rexx /home/bduhga/rxapi/rx4
root 9437372 1 0 13:27:41 pts/0 0:00 /usr/bin/rexx /home/bduhga/rxapi/rx4
root 9568314 1 0 13:27:41 pts/0 0:00 /usr/bin/rexx /home/bduhga/rxapi/rx4
root 9633924 1 0 13:27:41 pts/0 0:00 /usr/bin/rexx /home/bduhga/rxapi/rx4
root 9830420 1 0 13:27:41 pts/0 0:00 /usr/bin/rexx /home/bduhga/rxapi/rx4
root 10092688 1 0 13:27:41 pts/0 0:00 /usr/bin/rexx /home/bduhga/rxapi/rx4
root 10420402 1 0 13:27:41 pts/0 0:00 /usr/bin/rexx /home/bduhga/rxapi/rx4
root 10682378 1 0 13:27:41 pts/0 0:00 /usr/bin/rexx /home/bduhga/rxapi/rx4
root@aix031:/home/bduhga/rxapi-> ps -ef | grep rxq | grep -v grep
root 6029554 48103518 0 13:27:34 pts/0 0:00 rxqueue S2de005eQ11003ec90
root 38273254 22348242 0 13:27:34 pts/0 0:00 rxqueue S15501d2Q11003f010
root 24707444 40108058 0 13:27:34 pts/0 0:00 rxqueue S264001aQ11003ef30
root@aix031:/home/bduhga/rxapi-> ps -ef | grep 48103518 | grep -v grep
root 6029554 48103518 0 13:27:34 pts/0 0:00 rxqueue S2de005eQ11003ec90
root 48103518 1 0 13:27:34 pts/0 0:00 /usr/bin/rexx /home/bduhga/rxapi/rx4
root@aix031:/home/bduhga/rxapi-> ps -ef | grep 22348242 | grep -v grep
root 38273254 22348242 0 13:27:34 pts/0 0:00 rxqueue S15501d2Q11003f010
root 22348242 1 0 13:27:34 pts/0 0:00 /usr/bin/rexx /home/bduhga/rxapi/rx4
root@aix031:/home/bduhga/rxapi-> ps -ef | grep 40108058 | grep -v grep
root 40108058 1 0 13:27:34 pts/0 0:00 /usr/bin/rexx /home/bduhga/rxapi/rx4
root 24707444 40108058 0 13:27:34 pts/0 0:00 rxqueue S264001aQ11003ef30
root@aix031:/home/bduhga/rxapi->
root@aix031:/home/bduhga/rxapi-> ps -ef | grep -c rexx
357

dbx -a <pid of rxapi>
(dbx) where
_global_lock_common(??, ??, ??) at 0x9000000000027b4
_mutex_lock(??, ??, ??) at 0x90000000000f8bc
APIServer.SysMutex::request()(0x110013e28), line 81 in "SysSemaphore.hpp"
Lock::Lock(SysMutex&)(this = 0x000000011005c2d0, l = &(...)), line 55 in "SynchronizedBlock.hpp"
unnamed block in ServerQueueManager::dispatch(ServiceMessage&)(this = 0x0000000110013e18, message = &(...)), line 848 in "QueueManager.cpp"
ServerQueueManager::dispatch(ServiceMessage&)(this = 0x0000000110013e18, message = &(...)), line 848 in "QueueManager.cpp"
APIServerInstance::dispatch(ServiceMessage&)(this = 0x0000000110013e10, message = &(...)), line 63 in "APIServerInstance.cpp"
unnamed block in APIServer::processMessages(SysServerConnection*)(this = 0x0000000110001be8, connection = 0x0000000110014270), line 173 in "APIServer.cpp"
APIServer::processMessages(SysServerConnection*)(this = 0x0000000110001be8, connection = 0x0000000110014270), line 173 in "APIServer.cpp"
APIServerThread::dispatch()(this = 0x0000000110013d90), line 56 in "APIServerThread.cpp"
call_thread_function(void*)(argument = 0x0000000110013d90), line 124 in "SysThread.cpp"

rx4:

#! /usr/bin/rexx
CALL init_prog
CALL doit
CALL prog_end
EXIT prog_rc /* end main */

init_prog: /* define rexx queue */
newq = rxqueue('c')
oldq = rxqueue('s',newq)
'hostname | rxqueue' newq
PARSE PULL my_name
RETURN /* end init_prog */

doit:
'hostname | rxqueue ' newq
RETURN /* */

prog_end: /* del tmp queue */
rx = rxqueue('s',oldq)
ry = rxqueue('d',newq)
RETURN /* prog_end */

rx5:

#! /usr/bin/rexx
CALL init_prog
CALL doit
CALL prog_end
EXIT prog_rc /* end main */

init_prog: /* define rexx queue */
newq = rxqueue('c')
oldq = rxqueue('s',newq)
'hostname | rxqueue' newq
PARSE PULL my_name
RETURN /* end init_prog */

doit:
i = 0
DO i = 1 TO 10000
  IF i // 100 = 0 THEN SAY "Looping " i
  '/home/bduhga/rxapi/rx4 &'
END
CALL clrq
RETURN /* */

prog_end: /* delete tmp queue */
rx = rxqueue('s',oldq)
ry = rxqueue('d',newq)
RETURN /* prog_end */

clrq:
DO QUEUED()
  PULL smat
END
RETURN

If you want rx1-3, please contact me and I will forward them to you.
/HGA

----------------------------------------------------------------------

Comment By: Henning G (ubihga)
Date: 2011-01-03 01:50

Message:
Hi
Any news on this? The problem prevents me from upgrading from Rexx 3.2...
/HGA

----------------------------------------------------------------------

Comment By: Henning G (ubihga)
Date: 2010-12-15 06:31

Message:
Hi
By looping I mean that the rxapi program (nobody 3211508 2228346 0 Dec 14 - 0:05 /opt/ooRexx/bin/rxapi) uses all CPU resources on the AIX, i.e. topas shows 0% idle CPU. No other applications are running at this time.
I don't know how to recreate the problem, because the system is idle (no applications running); only my monitoring Rexxes are running, i.e. a Rexx to monitor filesystem usage, a Rexx to monitor processes, and so on. These Rexxes run every 5 minutes, and run for days (5-10-20 days) before the problem arises; then suddenly rxapi starts using 100% CPU resources. The problem has not been seen on systems using ooRexx 3.2. That is why I ask for help and a method (tools) to document the error. Can rxapi be started in debug mode so we can see what is actually going on, or is there any other way to debug this?
I would be glad to provide you with some documentation, but until I can get some useful data from the looping rxapi, I can't see what I can send to you. I can send you our Rexxes, but they work fine for several days, so I can't see how this can help.
Regards
Henning Gammelmark

----------------------------------------------------------------------

Comment By: Mark Miesfeld (miesfeld)
Date: 2010-12-15 05:59

Message:
Rainer will probably have to take a look at this. I don't understand what you mean by rxapi is "looping." Could you try to explain better what is happening?
You'll either need to explain in detail how to reproduce this and/or attach an example program that causes the problem.
Thanks.

----------------------------------------------------------------------

You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=684730&aid=3137873&group_id=119701