From: SourceForge.net <no...@so...> - 2011-01-11 14:13:57
|
Bugs item #3137873, was opened at 2010-12-15 07:45 Message generated for change (Settings changed) made by bigrixx You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=684730&aid=3137873&group_id=119701 Please note that this message will contain a full copy of the comment thread, including the initial issue submission, for this request, not just the latest update. >Category: External Functions >Group: v4.2.0 >Status: Pending >Resolution: Fixed Priority: 7 Private: No Submitted By: Henning G (ubihga) >Assigned to: Rick McGuire (bigrixx) Summary: rxapi looping Initial Comment: Hi I have recently installed rexx version 4.1 on one off my AIX'es. AIX version: 61-05-011016_SP, runs on POWER7 hardware, same problem seen on POWER5 and POWER6 hardware. rexx works fine but after some time I se that rxapi starts looping. I have seen same behaviour in rexx 4.0.1, but never got a solution for it. It looks like it happens more frequently if I add more CPU's to my AIX, With only one CPU it works for 5-7 days before looping starts. If I have 2 or more CPU's on the system the loop usualy starts within 1 day. Is there anything I can do to collect some documentation to you when the loops is happening ? I hope you can help me finding this problem or I am stuck at version 3.2 that is working fine. /Henning Gammelmark ---------------------------------------------------------------------- >Comment By: Rick McGuire (bigrixx) Date: 2011-01-11 09:13 Message: Committed revision 6571. I'm about 99% certain I have a fix for this problem. It indeed was a synchronization issue that can occur when new named queues are frequently created and deleted. This problem could have shown up on any of the platforms, but for some reason we've only seen this on AIX so far. I've fixed this in the trunk. It would be nice to see if we can get a version of 4.1.0 created with this fix applied to see if it has really gone away. ---------------------------------------------------------------------- Comment By: Rick McGuire (bigrixx) Date: 2011-01-11 05:44 Message: No, what I really need is the callback chain for that thread, not the instructions at the point where this is running. The previous post showed what method this is looping in, but I need to understand how we got at that method and what request is currently being executed. ---------------------------------------------------------------------- Comment By: Henning G (ubihga) Date: 2011-01-11 05:38 Message: Is this what you wanted ? (dbx) pthread thread state-k wchan state-u k-tid mode held scope function $t1 wait 0xf100050014616a08 running 78512381 k no sys >$t45 run blocked 49807565 k no sys _global_lock_common $t544 run running 15991017 u no sys __towlower_std $t534 wait 0xf1000500001b08c8 running 62783673 k no sys $t536 wait 0xf1000500007e28c8 running 37027857 k no sys $t539 wait 0xf1000500003540c8 running 37093621 k no sys $t545 wait 0x0000000110011730 blocked 71565447 k no sys _mutex_lock $t546 wait 0x0000000110011730 blocked 61800563 k no sys _mutex_lock $t547 wait 0x0000000110011730 blocked 19071095 k no sys _mutex_lock thrad current 544 (dbx) pthread thread state-k wchan state-u k-tid mode held scope function $t1 wait 0xf100050014616a08 running 78512381 k no sys *$t45 run blocked 49807565 k no sys _global_lock_common >$t544 run running 15991017 u no sys __towlower_std $t534 wait 0xf1000500001b08c8 running 62783673 k no sys $t536 wait 0xf1000500007e28c8 running 37027857 k no sys (dbx) listi 0x90000000022a2a0 (__towlower_std) f8610030 std r3,0x30(r1) 0x90000000022a2a4 (__towlower_std+0x4) 7cc307b4 Invalid opcode. 0x90000000022a2a8 (__towlower_std+0x8) e8810030 ld r4,0x30(r1) 0x90000000022a2ac (__towlower_std+0xc) 8004003c lwz r0,0x3c(r4) 0x90000000022a2b0 (__towlower_std+0x10) 7c060040 cmpl cr0,0x0,r6,r0 0x90000000022a2b4 (__towlower_std+0x14) 4d810020 bgtlr 0x90000000022a2b8 (__towlower_std+0x18) 78651764 Invalid opcode. 0x90000000022a2bc (__towlower_std+0x1c) 80040038 lwz r0,0x38(r4) 0x90000000022a2c0 (__towlower_std+0x20) 7c060040 cmpl cr0,0x0,r6,r0 0x90000000022a2c4 (__towlower_std+0x24) 4d800020 bltlr (dbx) ---------------------------------------------------------------------- Comment By: Henning G (ubihga) Date: 2011-01-11 05:00 Message: Hi Nor am I an dbx expert, so could you please finde someone who could guide me to do the right stack trace /Hga ---------------------------------------------------------------------- Comment By: Rick McGuire (bigrixx) Date: 2011-01-11 04:42 Message: I don't have access to an AIX system myself and I'm not the most literate in using dbx either. The thread in question I'd like to see a stack trace for is this one: $t9257 run running 14221547 u no sys QueueTable::locate(const char*) That sort of confirms my suspicions on what might be going wrong, but I'd like to see a stack trace so I can see what operation is in process. ---------------------------------------------------------------------- Comment By: Henning G (ubihga) Date: 2011-01-11 03:13 Message: Hi I have re-created the problem and done some more dbx commands. I hope uou can use the output, if not please do as I requested in the previous comment (describe what I have to do). Question: have you tryed to recreate the problem with rx4 and rx5 on your system ? /Henning The dbx output: (dbx) listi 0x9000000000027dc (_global_lock_common+0x4fc) 60210000 ori r1,r1,0x0 0x9000000000027e0 (_global_lock_common+0x500) 7c421378 mr r2,r2 0x9000000000027e4 (_global_lock_common+0x504) e8610088 ld r3,0x88(r1) 0x9000000000027e8 (_global_lock_common+0x508) 38800000 li r4,0x0 0x9000000000027ec (_global_lock_common+0x50c) 63850000 ori r5,r28,0x0 0x9000000000027f0 (_global_lock_common+0x510) 48003423 bla 0x0000000000003420 0x9000000000027f4 (_global_lock_common+0x514) 2c030000 cmpi cr0,0x0,r3,0x0 0x9000000000027f8 (_global_lock_common+0x518) e8610088 ld r3,0x88(r1) 0x9000000000027fc (_global_lock_common+0x51c) 4182fba0 beq 0x90000000000239c (_global_lock_common+0xbc) 0x900000000002800 (_global_lock_common+0x520) 3c802000 lis r4,0x2000 (dbx) (dbx) func _global_lock_common (dbx) (dbx) proc { -Identification/Authentication Info--------------------------- pi_pid: 2162768 pi_sid: 2162768 pi_ppid: 2097256 pi_pgrp: -2 pi_uid: -2 pi_suid: -2 ---------------Controlling TTY Info--------------------------- pi_ttyp: 0 pi_ttyd: 0x0000000000000000 pi_ttympx: 0x0000000000000000 -----------------------------Scheduler Information------------ pi_nice: 0x00000014 pi_state: SSTOP pi_flags: SLOAD | STRC | SWTED | SORPHANPGRP | SEXECED pi_flags2: S64BIT | SPTHREADED | SPROJFIXED pi_thcount: 406 pi_cpu: 0 pi_pri: 60 ---------------------------------------------File Management-- pi_maxofile: 0x00000192 pi_cmask: 0x0022 pi_cdir: 0xf1000102813fbc20 pi_rdir: 0x0000000000000000 pi_comm: "rxapi" ----------------------------------Memory---------------------- pi_adspace: 0x00000017001f0480 pi_majflt: 0x000000000000008d pi_minflt: 0x00000000000015cc pi_repage: 0x0000000000000000 pi_size: 0x0000000000000c40 pi_utime: N/A pi_stime: N/A -------Credentials, Accounting, Profiling & Resource Limits--- pi_cred: (use proc cred) pi_ru: (use proc ru) pi_cru: (use proc cru) pi_ioch: 0x000000000007c34d pi_irss: 0x000000001201d5e0 pi_start: Tue Jan 11 08:37:41 2011 pi_rlimit: (use proc rlimit) -Memory Usage------------------------------------------------- pi_drss: 0x0000000000000c26 pi_trss: 0x0000000000000065 pi_dvm: 0x0000000000000c26 pi_pi_prm: 0x0000000000000000 pi_tsize: 0x000000000001a757 pi_dsize: 0x0000000003155b80 pi_sdsize: 0x0000000000000000 ------------------Signal Management--------------------------- pi_signal: (use proc signal) pi_sigflags: (use proc sigflags) pi_sig: <none> ---------------------------------WLM Information-------------- pi_classname: Unclassified pi_tag: <none> pi_chk_utime: N/A pi_chk_ctime: N/A } (dbx) pthread thread state-k wchan state-u k-tid mode held scope function $t1 wait 0xf10005001455d208 running 8454163 k no sys >$t45 run blocked 27000849 k no sys _global_lock_common $t9267 wait 0xf1000500001c30c8 running 31326369 k no sys $t9263 wait 0x0000000110011730 blocked 33554473 k no sys _mutex_lock $t9254 wait 0xf1000500002bf0c8 running 35192941 k no sys $t9264 wait 0xf1000500009438c8 running 33292391 k no sys $t9257 run running 14221547 u no sys QueueTable::locate(const char*) $t9265 wait 0xf1000500007dd0c8 running 35586283 k no sys $t9273 wait 0x0000000110011730 blocked 24510523 k no sys _mutex_lock $t9269 wait 0xf1000500004530c8 running 23003283 k no sys $t9274 wait 0x0000000110011730 blocked 32964753 k no sys _mutex_lock $t9275 wait 0x0000000110011730 blocked 37159101 k no sys _mutex_lock $t9276 wait 0x0000000110011730 blocked 20709531 k no sys _mutex_lock $t9277 wait 0x0000000110011730 blocked 37486597 k no sys _mutex_lock $t9278 wait 0x0000000110011730 blocked 38076475 k no sys _mutex_lock $t9279 wait 0x0000000110011730 blocked 33882203 k no sys _mutex_lock $t9280 wait 0x0000000110011730 blocked 33751211 k no sys _mutex_lock $t9281 wait 0x0000000110011730 blocked 35258479 k no sys _mutex_lock ---------------------------------------------------------------------- Comment By: Henning G (ubihga) Date: 2011-01-11 02:32 Message: Hi Can you (in detail) describe how you want the stack trace, Then I can recreate the problem and give you the documentation you request. Somthing like this" a. recreate the problen b. start dbx dbx -a nnn c. dbx command d. dbx command e. f. g. /Henning ---------------------------------------------------------------------- Comment By: Rick McGuire (bigrixx) Date: 2011-01-10 11:11 Message: The dbx informaton is only showing the state for a single thread, not for the thread that's doing the looping. It appears that somehow the queue manager is stuck in an infinite loop for a previous request, so all subsequent requests are locked out because the looping thread is holding the mutex. If we could get a stack trace showing where the loop is occurring, this might give use a better idea. ---------------------------------------------------------------------- Comment By: Rick McGuire (bigrixx) Date: 2011-01-10 11:05 Message: No, I disagree with this assessment. There is only one close() method on this class. The other close call is for the system API to close the socket handled that's contained in the instance variable. In any event, if the application runs out of sockets, then this should be reflected as an exception, not as a looping condition. ---------------------------------------------------------------------- Comment By: David Ashley (wdashley) Date: 2011-01-10 10:48 Message: I think I have a handle on this, but I am very unsure and Rick will probably need to look into this. In SysCSStream.cpp and .hpp there are references to the method "close". Some references call the method with no arguments and some references call the method with a single argument (a socket handle). The problem is that there is no method "close" defined that takes a single argument. There is a "close" method defined that takes zero arguments. So what I think is happening is that sockets are never being closed and eventually the application runs out of sockets and then loops trying to get new ones. Again, this is all conjecture at this point but it seems logical to me. ---------------------------------------------------------------------- Comment By: Henning G (ubihga) Date: 2011-01-10 09:53 Message: My cut/past has "destroyed" topas layout. If you want my txt file, please ask by mail. /Henning ---------------------------------------------------------------------- Comment By: Henning G (ubihga) Date: 2011-01-10 09:48 Message: I am now able to reproduce the “rxapi-loop” error. I have created 2 small rexx programs (rx4 and rx5) Rx4 is just d “hostname | rxqueue” Rx5 is calls rx4 10000 times. After approx 2000 -5000 calls the rxapi stops respomding to “rxqueue('c')” When this happens, all subsequent rexx’s is “hanging” waiting for response . The document attached (rxapi_loop_doc.txt) document this situation. The documentations include dbx stack when we have the loop. Rx4 and rx5 is only created to show the problem. In my real environment the problem is seen by far more complex rexx’s. The typical situations is my LPAR is running OK for 1-10 days, then suddenly the “rxapi” starts using 100% CPU (on a 1 CPU system). When this is happening I see 2-3-4 rxqueue commands hanging (documented inrxa-pi_loop_doc.txt). rx1,rx2 and rx3 can show a more “production like situation, but sometime you have to run rx3 50-100 times before the problem can be seen. rx4 and rx5 is ONLY to reproduce the problem. To me it looks like a “serialization” problem in the rxapi environment. /Henning Gammelmark Can not attach files, so utp/pasted insted. rxapi_loop_doc.tst: Topas Monitor for host: aix031 EVENTS/QUEUES FILE/TTY Mon Jan 10 13:29:37 2011 Interval: 2 Cswitch 152 Readch 13294 Syscall 301 Writech 3625 CPU User% Kern% Wait% Idle% Physc Entc Reads 10 Rawin 0 ALL 62.8 1.1 0.0 36.1 1.02 254.3 Writes 4 Ttyout 868 Forks 0 Igets 0 Network KBPS I-Pack O-Pack KB-In KB-Out Execs 0 Namei 28 Total 1.1 1.0 2.0 0.0 1.0 Runqueue 1.0 Dirblk 0 Waitqueue 0.0 Disk Busy% KBPS TPS KB-Read KB-Writ MEMORY Total 0.0 0.0 0.0 0.0 0.0 PAGING Real,MB 40960 Faults 149 % Comp 25 FileSystem KBPS TPS KB-Read KB-Writ Steals 0 % Noncomp 0 Total 11.2 8.5 11.2 0.0 PgspIn 0 % Client 0 PgspOut 0 Name PID CPU% PgSp Owner PageIn 0 PAGING SPACE rxapi 48037908 62.7 10.7 nobody PageOut 0 Size,MB 7168 vmmd 393232 0.5 0.2 root Sios 0 % Used 1 topas 721218 0.1 8.9 root % Free 99 gil 1507388 0.1 0.1 root AME nfs_stat 2490700 0.0 0.8 root TMEM,MB 20480 WPAR Activ 0 random 2818232 0.0 0.1 root CMEM,MB 141 WPAR Total 0 sshd 6619344 0.0 0.7 bduhga EF[T/A] 2.0/2.0 Press: "h"-help tecad_lo 5243094 0.0 1.7 root CI:0.0 CO:0.0 "q"-quit java 6881428 0.0 53.4 root swapper 264 0.0 0.1 root root@aix031:/home/bduhga/rxapi-> ps -ef | grep rexx ( 357 lines ) root 2949242 1 0 13:27:42 pts/0 0:00 /usr/bin/rexx /home/bduhga/rxapi/rx4 root 4063410 1 0 13:27:42 pts/0 0:00 /usr/bin/rexx /home/bduhga/rxapi/rx4 root 4129008 1 0 13:27:42 pts/0 0:00 /usr/bin/rexx /home/bduhga/rxapi/rx4 root 4325486 1 0 13:27:42 pts/0 0:00 /usr/bin/rexx /home/bduhga/rxapi/rx4 root 4849864 1 0 13:27:42 pts/0 0:00 /usr/bin/rexx /home/bduhga/rxapi/rx4 root 5374178 1 0 13:27:42 pts/0 0:00 /usr/bin/rexx /home/bduhga/rxapi/rx4 root 5505062 1 0 13:27:36 pts/0 0:00 /usr/bin/rexx /home/bduhga/rxapi/rx4 root 5570582 1 0 13:27:42 pts/0 0:00 /usr/bin/rexx /home/bduhga/rxapi/rx4 root 5898336 1 0 13:27:35 pts/0 0:00 /usr/bin/rexx /home/bduhga/rxapi/rx4 root 6160392 1 0 13:27:42 pts/0 0:00 /usr/bin/rexx /home/bduhga/rxapi/rx4 root 6357210 1 0 13:27:35 pts/0 0:00 /usr/bin/rexx /home/bduhga/rxapi/rx4 root 6815832 1 0 13:27:37 pts/0 0:00 /usr/bin/rexx /home/bduhga/rxapi/rx4 root 7667788 1 0 13:27:42 pts/0 0:00 /usr/bin/rexx /home/bduhga/rxapi/rx4 root 7995590 1 0 13:27:42 pts/0 0:00 /usr/bin/rexx /home/bduhga/rxapi/rx4 root 8061016 1 0 13:27:42 pts/0 0:00 /usr/bin/rexx /home/bduhga/rxapi/rx4 root 8257742 1 0 13:27:42 pts/0 0:00 /usr/bin/rexx /home/bduhga/rxapi/rx4 root 9240792 1 0 13:27:42 pts/0 0:00 /usr/bin/rexx /home/bduhga/rxapi/rx4 root 9306186 1 0 13:27:41 pts/0 0:00 /usr/bin/rexx /home/bduhga/rxapi/rx4 root 9437372 1 0 13:27:41 pts/0 0:00 /usr/bin/rexx /home/bduhga/rxapi/rx4 root 9568314 1 0 13:27:41 pts/0 0:00 /usr/bin/rexx /home/bduhga/rxapi/rx4 root 9633924 1 0 13:27:41 pts/0 0:00 /usr/bin/rexx /home/bduhga/rxapi/rx4 root 9830420 1 0 13:27:41 pts/0 0:00 /usr/bin/rexx /home/bduhga/rxapi/rx4 root 10092688 1 0 13:27:41 pts/0 0:00 /usr/bin/rexx /home/bduhga/rxapi/rx4 root 10420402 1 0 13:27:41 pts/0 0:00 /usr/bin/rexx /home/bduhga/rxapi/rx4 root 10682378 1 0 13:27:41 pts/0 0:00 /usr/bin/rexx /home/bduhga/rxapi/rx4 root@aix031:/home/bduhga/rxapi-> ps -ef | grep rxq |grep -v grep root 6029554 48103518 0 13:27:34 pts/0 0:00 rxqueue S2de005eQ11003ec90 root 38273254 22348242 0 13:27:34 pts/0 0:00 rxqueue S15501d2Q11003f010 root 24707444 40108058 0 13:27:34 pts/0 0:00 rxqueue S264001aQ11003ef30 root@aix031:/home/bduhga/rxapi-> oot@aix031:/home/bduhga/rxapi-> ps -ef | grep 48103518 |grep -v grep root 6029554 48103518 0 13:27:34 pts/0 0:00 rxqueue S2de005eQ11003ec90 root 48103518 1 0 13:27:34 pts/0 0:00 /usr/bin/rexx /home/bduhga/rxapi/rx4 root@aix031:/home/bduhga/rxapi-> ps -ef | grep 22348242 |grep -v grep root 38273254 22348242 0 13:27:34 pts/0 0:00 rxqueue S15501d2Q11003f010 root 22348242 1 0 13:27:34 pts/0 0:00 /usr/bin/rexx /home/bduhga/rxapi/rx4 root@aix031:/home/bduhga/rxapi-> ps -ef | grep 40108058 |grep -v grep root 40108058 1 0 13:27:34 pts/0 0:00 /usr/bin/rexx /home/bduhga/rxapi/rx4 root 24707444 40108058 0 13:27:34 pts/0 0:00 rxqueue S264001aQ11003ef30 root@aix031:/home/bduhga/rxapi-> root@aix031:/home/bduhga/rxapi-> ps -ef | grep -c rexx 357 dbx -a <pid off rxapi> (dbx) where _global_lock_common(??, ??, ??) at 0x9000000000027b4 _mutex_lock(??, ??, ??) at 0x90000000000f8bc APIServer.SysMutex::request()(0x110013e28), line 81 in "SysSemaphore.hpp" Lock::Lock(SysMutex&)(this = 0x000000011005c2d0, l = &(...)), line 55 in "SynchronizedBlock.hpp" unnamed block in ServerQueueManager::dispatch(ServiceMessage&)(this = 0x0000000110013e18, message = &(...)), line 848 in "QueueManager.cpp" ServerQueueManager::dispatch(ServiceMessage&)(this = 0x0000000110013e18, message = &(...)), line 848 in "QueueManager.cpp" APIServerInstance::dispatch(ServiceMessage&)(this = 0x0000000110013e10, message = &(...)), line 63 in "APIServerInstance.cpp" unnamed block in APIServer::processMessages(SysServerConnection*)(this = 0x0000000110001be8, connection = 0x0000000110014270), line 173 in "APIServer.cpp" APIServer::processMessages(SysServerConnection*)(this = 0x0000000110001be8, connection = 0x0000000110014270), line 173 in "APIServer.cpp" APIServerThread::dispatch()(this = 0x0000000110013d90), line 56 in "APIServerThread.cpp" call_thread_function(void*)(argument = 0x0000000110013d90), line 124 in "SysThread.cpp" rx: #! /usr/bin/rexx CALL init_prog CALL doit CALL prog_end EXIT prog_rc /* end main */ init_prog: /* define rexx queue */ newq = rxqueue('c') oldq = rxqueue('s',newq) 'hostname | rxqueue' newq PARSE PULL my_name RETURN /* end init_prog */ doit: 'hostname | rxqueue ' newq RETURN /* */ prog_end: /* del tmp queue */ rx = rxqueue('s',oldq) ry = rxqueue('d',newq) RETURN /* prog_end */ rx5: #! /usr/bin/rexx CALL init_prog CALL doit CALL prog_end EXIT prog_rc /* end main */ init_prog: /* define rexx queue */ newq = rxqueue('c') oldq = rxqueue('s',newq) 'hostname | rxqueue' newq PARSE PULL my_name RETURN /* end init_prog */ doit: i = 0 DO i = 1 TO 10000 IF i // 100 = 0 THEN SAY "Looping " i '/home/bduhga/rxapi/rx4 &' END CALL clrq RETURN /* */ prog_end: prog_end: /* slet tmp queue */ rx = rxqueue('s',oldq) ry = rxqueue('d',newq) RETURN /* prog_end */ clrq: DO QUEUED() PULL smat END RETURN If you want rx1-3 pleas contact me, and I will forward to you. /HGA ---------------------------------------------------------------------- Comment By: Henning G (ubihga) Date: 2011-01-03 04:50 Message: Hi Any newa on this ? The problem prevent me from upgrading from rexx 3.2... /HGA ---------------------------------------------------------------------- Comment By: Henning G (ubihga) Date: 2010-12-15 09:31 Message: Hi By looping I mean that the rxapi program (nobody 3211508 2228346 0 Dec 14 - 0:05 /opt/ooRexx/bin/rxapi) uses all CPU resources on the AIX. i.e. topas show 0% idle CPU. No other applications is running at this time. I don't know how to recreate the problem, because the system is idle (no applications running), only my monitoring rexx's are running i.e. rexx to monitor filssystem usage, rexx to monitor processes and so on. theese rexx'es runs evry 5 minuts, and runs for days (5-10-20 days) before the problem arise, then suddently the rxapi starts using 100% CPU resources. The problem has not been seen on systems using oorexx 3.2. That is why I ask for help and method (tools) to document the error. Can rxapi be started in debug mode so we can se what it actualy is going on, or is ther any other way to debug this. I would be glad to provide you with some documentation, but before I can get some usefull "data" from the looping rxapi I can's se what I can to send to you. I can send you our rexx'es but they work fine for several days, so I cant se how this can help. Regards Henning Gammelmark ---------------------------------------------------------------------- Comment By: Mark Miesfeld (miesfeld) Date: 2010-12-15 08:59 Message: Rainer will have to take a look at this, probably. I don't understand what you mean by rxapi is "looping." Could you try to explain better what is happening? You'll either need to explain in detail how to reproduce this and / or attach an example program that causes the problem. Thanks. ---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=684730&aid=3137873&group_id=119701 |