#963 rxapi looping

Milestone: 4.1.1
Status: closed
Owner: Rick McGuire
Priority: 7
Updated: 2013-06-11
Created: 2010-12-15
Creator: Henning G
Private: No

Hi

I have recently installed rexx version 4.1 on one of my AIXes.
AIX version: 61-05-011016_SP, running on POWER7 hardware; the same problem is seen on POWER5 and POWER6 hardware.

rexx works fine, but after some time I see that rxapi starts looping.
I have seen the same behaviour in rexx 4.0.1, but never got a solution for it.
It looks like it happens more frequently if I add more CPUs to my AIX.

With only one CPU it works for 5-7 days before the looping starts.
If I have 2 or more CPUs on the system, the loop usually starts within 1 day.

Is there anything I can do to collect some documentation for you when the loop is happening?
I hope you can help me find this problem; otherwise I am stuck at version 3.2, which works fine.
/Henning Gammelmark

Discussion

  • Mark Miesfeld
    2010-12-15

    Rainer will have to take a look at this, probably.

    I don't understand what you mean by rxapi is "looping." Could you try to explain better what is happening?

    You'll either need to explain in detail how to reproduce this and / or attach an example program that causes the problem. Thanks.

     
  • Henning G
    2010-12-15

    Hi

    By looping I mean that the rxapi process (nobody 3211508 2228346 0 Dec 14 - 0:05 /opt/ooRexx/bin/rxapi) uses all CPU resources on the AIX, i.e. topas shows 0% idle CPU. No other applications are running at this time.

    I don't know how to recreate the problem, because the system is idle (no applications running); only my monitoring rexxes are running, i.e. a rexx to monitor filesystem usage, a rexx to monitor processes, and so on. These rexxes run every 5 minutes, and run for days (5-10-20 days) before the problem arises; then suddenly rxapi starts using 100% CPU resources.

    The problem has not been seen on systems using ooRexx 3.2.

    That is why I ask for help and a method (tools) to document the error. Can rxapi be started in debug mode so we can see what is actually going on, or is there any other way to debug this?

    I would be glad to provide you with some documentation, but until I can get some useful "data" from the looping rxapi I can't see what I could send to you.

    I can send you our rexxes, but they work fine for several days, so I can't see how that would help.

    Regards
    Henning Gammelmark

     
  • Henning G
    2011-01-03

    Hi

    Any news on this?

    The problem prevents me from upgrading from rexx 3.2...

    /HGA

     
  • Henning G
    2011-01-10

    I am now able to reproduce the “rxapi loop” error.

    I have created 2 small rexx programs (rx4 and rx5).
    Rx4 just does “hostname | rxqueue”.
    Rx5 calls rx4 10000 times.

    After approx 2000-5000 calls, rxapi stops responding to “rxqueue('c')”.

    When this happens, all subsequent rexxes are “hanging”, waiting for a response.

    The attached document (rxapi_loop_doc.txt) documents this situation.
    The documentation includes the dbx stack when we have the loop.

    Rx4 and rx5 were only created to show the problem. In my real environment the problem is seen with far more complex rexxes. The typical situation is that my LPAR runs OK for 1-10 days, then suddenly rxapi starts using 100% CPU (on a 1-CPU system).
    When this is happening I see 2-3-4 rxqueue commands hanging (documented in rxapi_loop_doc.txt).
    rx1, rx2 and rx3 can show a more “production-like” situation, but sometimes you have to run rx3 50-100 times before the problem can be seen.
    rx4 and rx5 are ONLY meant to reproduce the problem.

    To me it looks like a “serialization” problem in the rxapi environment.

    /Henning Gammelmark
    Cannot attach files, so cut/pasted instead.

    rxapi_loop_doc.txt:
    Topas Monitor for host: aix031 EVENTS/QUEUES FILE/TTY
    Mon Jan 10 13:29:37 2011 Interval: 2 Cswitch 152 Readch 13294
    Syscall 301 Writech 3625
    CPU User% Kern% Wait% Idle% Physc Entc Reads 10 Rawin 0
    ALL 62.8 1.1 0.0 36.1 1.02 254.3 Writes 4 Ttyout 868
    Forks 0 Igets 0
    Network KBPS I-Pack O-Pack KB-In KB-Out Execs 0 Namei 28
    Total 1.1 1.0 2.0 0.0 1.0 Runqueue 1.0 Dirblk 0
    Waitqueue 0.0
    Disk Busy% KBPS TPS KB-Read KB-Writ MEMORY
    Total 0.0 0.0 0.0 0.0 0.0 PAGING Real,MB 40960
    Faults 149 % Comp 25
    FileSystem KBPS TPS KB-Read KB-Writ Steals 0 % Noncomp 0
    Total 11.2 8.5 11.2 0.0 PgspIn 0 % Client 0
    PgspOut 0
    Name PID CPU% PgSp Owner PageIn 0 PAGING SPACE
    rxapi 48037908 62.7 10.7 nobody PageOut 0 Size,MB 7168
    vmmd 393232 0.5 0.2 root Sios 0 % Used 1
    topas 721218 0.1 8.9 root % Free 99
    gil 1507388 0.1 0.1 root AME
    nfs_stat 2490700 0.0 0.8 root TMEM,MB 20480 WPAR Activ 0
    random 2818232 0.0 0.1 root CMEM,MB 141 WPAR Total 0
    sshd 6619344 0.0 0.7 bduhga EF[T/A] 2.0/2.0 Press: "h"-help
    tecad_lo 5243094 0.0 1.7 root CI:0.0 CO:0.0 "q"-quit
    java 6881428 0.0 53.4 root
    swapper 264 0.0 0.1 root

    root@aix031:/home/bduhga/rxapi-> ps -ef | grep rexx ( 357 lines )
    root 2949242 1 0 13:27:42 pts/0 0:00 /usr/bin/rexx /home/bduhga/rxapi/rx4
    root 4063410 1 0 13:27:42 pts/0 0:00 /usr/bin/rexx /home/bduhga/rxapi/rx4
    root 4129008 1 0 13:27:42 pts/0 0:00 /usr/bin/rexx /home/bduhga/rxapi/rx4
    root 4325486 1 0 13:27:42 pts/0 0:00 /usr/bin/rexx /home/bduhga/rxapi/rx4
    root 4849864 1 0 13:27:42 pts/0 0:00 /usr/bin/rexx /home/bduhga/rxapi/rx4
    root 5374178 1 0 13:27:42 pts/0 0:00 /usr/bin/rexx /home/bduhga/rxapi/rx4
    root 5505062 1 0 13:27:36 pts/0 0:00 /usr/bin/rexx /home/bduhga/rxapi/rx4
    root 5570582 1 0 13:27:42 pts/0 0:00 /usr/bin/rexx /home/bduhga/rxapi/rx4
    root 5898336 1 0 13:27:35 pts/0 0:00 /usr/bin/rexx /home/bduhga/rxapi/rx4
    root 6160392 1 0 13:27:42 pts/0 0:00 /usr/bin/rexx /home/bduhga/rxapi/rx4
    root 6357210 1 0 13:27:35 pts/0 0:00 /usr/bin/rexx /home/bduhga/rxapi/rx4
    root 6815832 1 0 13:27:37 pts/0 0:00 /usr/bin/rexx /home/bduhga/rxapi/rx4
    root 7667788 1 0 13:27:42 pts/0 0:00 /usr/bin/rexx /home/bduhga/rxapi/rx4
    root 7995590 1 0 13:27:42 pts/0 0:00 /usr/bin/rexx /home/bduhga/rxapi/rx4
    root 8061016 1 0 13:27:42 pts/0 0:00 /usr/bin/rexx /home/bduhga/rxapi/rx4
    root 8257742 1 0 13:27:42 pts/0 0:00 /usr/bin/rexx /home/bduhga/rxapi/rx4
    root 9240792 1 0 13:27:42 pts/0 0:00 /usr/bin/rexx /home/bduhga/rxapi/rx4
    root 9306186 1 0 13:27:41 pts/0 0:00 /usr/bin/rexx /home/bduhga/rxapi/rx4
    root 9437372 1 0 13:27:41 pts/0 0:00 /usr/bin/rexx /home/bduhga/rxapi/rx4
    root 9568314 1 0 13:27:41 pts/0 0:00 /usr/bin/rexx /home/bduhga/rxapi/rx4
    root 9633924 1 0 13:27:41 pts/0 0:00 /usr/bin/rexx /home/bduhga/rxapi/rx4
    root 9830420 1 0 13:27:41 pts/0 0:00 /usr/bin/rexx /home/bduhga/rxapi/rx4
    root 10092688 1 0 13:27:41 pts/0 0:00 /usr/bin/rexx /home/bduhga/rxapi/rx4
    root 10420402 1 0 13:27:41 pts/0 0:00 /usr/bin/rexx /home/bduhga/rxapi/rx4
    root 10682378 1 0 13:27:41 pts/0 0:00 /usr/bin/rexx /home/bduhga/rxapi/rx4

    root@aix031:/home/bduhga/rxapi-> ps -ef | grep rxq |grep -v grep
    root 6029554 48103518 0 13:27:34 pts/0 0:00 rxqueue S2de005eQ11003ec90
    root 38273254 22348242 0 13:27:34 pts/0 0:00 rxqueue S15501d2Q11003f010
    root 24707444 40108058 0 13:27:34 pts/0 0:00 rxqueue S264001aQ11003ef30
    root@aix031:/home/bduhga/rxapi->

    root@aix031:/home/bduhga/rxapi-> ps -ef | grep 48103518 |grep -v grep
    root 6029554 48103518 0 13:27:34 pts/0 0:00 rxqueue S2de005eQ11003ec90
    root 48103518 1 0 13:27:34 pts/0 0:00 /usr/bin/rexx /home/bduhga/rxapi/rx4

    root@aix031:/home/bduhga/rxapi-> ps -ef | grep 22348242 |grep -v grep
    root 38273254 22348242 0 13:27:34 pts/0 0:00 rxqueue S15501d2Q11003f010
    root 22348242 1 0 13:27:34 pts/0 0:00 /usr/bin/rexx /home/bduhga/rxapi/rx4

    root@aix031:/home/bduhga/rxapi-> ps -ef | grep 40108058 |grep -v grep
    root 40108058 1 0 13:27:34 pts/0 0:00 /usr/bin/rexx /home/bduhga/rxapi/rx4
    root 24707444 40108058 0 13:27:34 pts/0 0:00 rxqueue S264001aQ11003ef30
    root@aix031:/home/bduhga/rxapi->

    root@aix031:/home/bduhga/rxapi-> ps -ef | grep -c rexx
    357

    dbx -a <pid of rxapi>
    (dbx) where
    _global_lock_common(??, ??, ??) at 0x9000000000027b4
    _mutex_lock(??, ??, ??) at 0x90000000000f8bc
    APIServer.SysMutex::request()(0x110013e28), line 81 in "SysSemaphore.hpp"
    Lock::Lock(SysMutex&)(this = 0x000000011005c2d0, l = &(...)), line 55 in "SynchronizedBlock.hpp"
    unnamed block in ServerQueueManager::dispatch(ServiceMessage&)(this = 0x0000000110013e18, message = &(...)), line 848 in "QueueManager.cpp"
    ServerQueueManager::dispatch(ServiceMessage&)(this = 0x0000000110013e18, message = &(...)), line 848 in "QueueManager.cpp"
    APIServerInstance::dispatch(ServiceMessage&)(this = 0x0000000110013e10, message = &(...)), line 63 in "APIServerInstance.cpp"
    unnamed block in APIServer::processMessages(SysServerConnection*)(this = 0x0000000110001be8, connection = 0x0000000110014270), line 173 in "APIServer.cpp"
    APIServer::processMessages(SysServerConnection*)(this = 0x0000000110001be8, connection = 0x0000000110014270), line 173 in "APIServer.cpp"
    APIServerThread::dispatch()(this = 0x0000000110013d90), line 56 in "APIServerThread.cpp"
    call_thread_function(void*)(argument = 0x0000000110013d90), line 124 in "SysThread.cpp"

    rx4:

    #! /usr/bin/rexx

    CALL init_prog
    CALL doit
    CALL prog_end
    EXIT prog_rc                /* end main */

    init_prog:
    /* define rexx queue */
    newq = rxqueue('c')
    oldq = rxqueue('s',newq)
    'hostname | rxqueue' newq
    PARSE PULL my_name
    RETURN                      /* end init_prog */

    doit:
    'hostname | rxqueue ' newq
    RETURN                      /* */

    prog_end:
    /* del tmp queue */
    rx = rxqueue('s',oldq)
    ry = rxqueue('d',newq)
    RETURN                      /* prog_end */

    rx5:

    #! /usr/bin/rexx

    CALL init_prog
    CALL doit
    CALL prog_end
    EXIT prog_rc                /* end main */

    init_prog:
    /* define rexx queue */
    newq = rxqueue('c')
    oldq = rxqueue('s',newq)
    'hostname | rxqueue' newq
    PARSE PULL my_name
    RETURN                      /* end init_prog */

    doit:
    i = 0
    DO i = 1 TO 10000
    IF i // 100 = 0 THEN SAY "Looping " i
    '/home/bduhga/rxapi/rx4 &'
    END
    CALL clrq
    RETURN                      /* */

    prog_end:
    /* delete tmp queue */
    rx = rxqueue('s',oldq)
    ry = rxqueue('d',newq)
    RETURN                      /* prog_end */

    clrq:
    DO QUEUED()
    PULL smat
    END
    RETURN

    If you want rx1-rx3, please contact me and I will forward them to you.
    /HGA

     
  • Henning G
    2011-01-10

    My cut/paste has "destroyed" the topas layout.
    If you want my txt file, please ask by mail.
    /Henning

     
  • David Ashley
    2011-01-10

    I think I have a handle on this, but I am very unsure and Rick will probably need to look into this.

    In SysCSStream.cpp and .hpp there are references to the method "close". Some references call the method with no arguments and some references call the method with a single argument (a socket handle). The problem is that there is no method "close" defined that takes a single argument. There is a "close" method defined that takes zero arguments.

    So what I think is happening is that sockets are never being closed and eventually the application runs out of sockets and then loops trying to get new ones.

    Again, this is all conjecture at this point but it seems logical to me.

     
  • Rick McGuire
    2011-01-10

    No, I disagree with this assessment. There is only one close() method on this class. The other close call is for the system API to close the socket handle that's contained in the instance variable.

    In any event, if the application runs out of sockets, then this should be reflected as an exception, not as a looping condition.

     
  • Rick McGuire
    2011-01-10

    The dbx information is only showing the state for a single thread, not for the thread that's doing the looping. It appears that somehow the queue manager is stuck in an infinite loop for a previous request, so all subsequent requests are locked out because the looping thread is holding the mutex. If we could get a stack trace showing where the loop is occurring, this might give us a better idea.

     
  • Henning G
    2011-01-11

    Hi

    Can you describe (in detail) how you want the stack trace? Then I can recreate the problem and give you the documentation you request.

    Something like this:
    a. recreate the problem
    b. start dbx: dbx -a nnn
    c. dbx command
    d. dbx command
    e.
    f.
    g.
    /Henning

     
  • Henning G
    2011-01-11

    Hi
    I have re-created the problem and run some more dbx commands.
    I hope you can use the output; if not, please do as I requested in the previous comment (describe what I have to do).

    Question: have you tried to recreate the problem with rx4 and rx5 on your system?

    /Henning

    The dbx output:
    (dbx) listi
    0x9000000000027dc (_global_lock_common+0x4fc) 60210000 ori r1,r1,0x0
    0x9000000000027e0 (_global_lock_common+0x500) 7c421378 mr r2,r2
    0x9000000000027e4 (_global_lock_common+0x504) e8610088 ld r3,0x88(r1)
    0x9000000000027e8 (_global_lock_common+0x508) 38800000 li r4,0x0
    0x9000000000027ec (_global_lock_common+0x50c) 63850000 ori r5,r28,0x0
    0x9000000000027f0 (_global_lock_common+0x510) 48003423 bla 0x0000000000003420
    0x9000000000027f4 (_global_lock_common+0x514) 2c030000 cmpi cr0,0x0,r3,0x0
    0x9000000000027f8 (_global_lock_common+0x518) e8610088 ld r3,0x88(r1)
    0x9000000000027fc (_global_lock_common+0x51c) 4182fba0 beq 0x90000000000239c (_global_lock_common+0xbc)
    0x900000000002800 (_global_lock_common+0x520) 3c802000 lis r4,0x2000
    (dbx)

    (dbx) func
    _global_lock_common
    (dbx)

    (dbx) proc
    {
    -Identification/Authentication Info---------------------------
    pi_pid: 2162768 pi_sid: 2162768
    pi_ppid: 2097256 pi_pgrp: -2
    pi_uid: -2 pi_suid: -2

    ---------------Controlling TTY Info---------------------------
    pi_ttyp: 0 pi_ttyd: 0x0000000000000000
    pi_ttympx: 0x0000000000000000

    -----------------------------Scheduler Information------------
    pi_nice: 0x00000014 pi_state: SSTOP
    pi_flags: SLOAD | STRC | SWTED | SORPHANPGRP |
    SEXECED
    pi_flags2: S64BIT | SPTHREADED | SPROJFIXED
    pi_thcount: 406 pi_cpu: 0
    pi_pri: 60

    ---------------------------------------------File Management--
    pi_maxofile: 0x00000192 pi_cmask: 0x0022
    pi_cdir: 0xf1000102813fbc20 pi_rdir: 0x0000000000000000
    pi_comm: "rxapi"

    ----------------------------------Memory----------------------
    pi_adspace: 0x00000017001f0480
    pi_majflt: 0x000000000000008d pi_minflt: 0x00000000000015cc
    pi_repage: 0x0000000000000000 pi_size: 0x0000000000000c40
    pi_utime: N/A pi_stime: N/A

    -------Credentials, Accounting, Profiling & Resource Limits---
    pi_cred: (use proc cred)
    pi_ru: (use proc ru)
    pi_cru: (use proc cru)
    pi_ioch: 0x000000000007c34d pi_irss: 0x000000001201d5e0
    pi_start: Tue Jan 11 08:37:41 2011
    pi_rlimit: (use proc rlimit)

    -Memory Usage-------------------------------------------------
    pi_drss: 0x0000000000000c26 pi_trss: 0x0000000000000065
    pi_dvm: 0x0000000000000c26 pi_pi_prm: 0x0000000000000000
    pi_tsize: 0x000000000001a757 pi_dsize: 0x0000000003155b80
    pi_sdsize: 0x0000000000000000

    ------------------Signal Management---------------------------
    pi_signal: (use proc signal) pi_sigflags: (use proc sigflags)
    pi_sig:

    ---------------------------------WLM Information--------------
    pi_classname: Unclassified
    pi_tag:
    pi_chk_utime: N/A pi_chk_ctime: N/A
    }

    (dbx) pthread
    thread state-k wchan state-u k-tid mode held scope function
    $t1 wait 0xf10005001455d208 running 8454163 k no sys

    $t45 run blocked 27000849 k no sys _global_lock_common
    $t9267 wait 0xf1000500001c30c8 running 31326369 k no sys
    $t9263 wait 0x0000000110011730 blocked 33554473 k no sys _mutex_lock
    $t9254 wait 0xf1000500002bf0c8 running 35192941 k no sys
    $t9264 wait 0xf1000500009438c8 running 33292391 k no sys
    $t9257 run running 14221547 u no sys QueueTable::locate(const char*)
    $t9265 wait 0xf1000500007dd0c8 running 35586283 k no sys
    $t9273 wait 0x0000000110011730 blocked 24510523 k no sys _mutex_lock
    $t9269 wait 0xf1000500004530c8 running 23003283 k no sys
    $t9274 wait 0x0000000110011730 blocked 32964753 k no sys _mutex_lock
    $t9275 wait 0x0000000110011730 blocked 37159101 k no sys _mutex_lock
    $t9276 wait 0x0000000110011730 blocked 20709531 k no sys _mutex_lock
    $t9277 wait 0x0000000110011730 blocked 37486597 k no sys _mutex_lock
    $t9278 wait 0x0000000110011730 blocked 38076475 k no sys _mutex_lock
    $t9279 wait 0x0000000110011730 blocked 33882203 k no sys _mutex_lock
    $t9280 wait 0x0000000110011730 blocked 33751211 k no sys _mutex_lock
    $t9281 wait 0x0000000110011730 blocked 35258479 k no sys _mutex_lock

     
  • Rick McGuire
    2011-01-11

    I don't have access to an AIX system myself and I'm not the most literate in using dbx either. The thread in question I'd like to see a stack trace for is this one:

    $t9257 run running 14221547 u no sys QueueTable::locate(const char*)

    That sort of confirms my suspicions on what might be going wrong, but I'd like to see a stack trace so I can see what operation is in process.

     
  • Henning G
    2011-01-11

    Hi

    Nor am I a dbx expert, so could you please find someone who can guide me through producing the right stack trace?

    /Hga

     
  • Henning G
    2011-01-11

    Is this what you wanted ?

    (dbx) pthread
    thread state-k wchan state-u k-tid mode held scope function
    $t1 wait 0xf100050014616a08 running 78512381 k no sys

    $t45 run blocked 49807565 k no sys _global_lock_common
    $t544 run running 15991017 u no sys __towlower_std
    $t534 wait 0xf1000500001b08c8 running 62783673 k no sys
    $t536 wait 0xf1000500007e28c8 running 37027857 k no sys
    $t539 wait 0xf1000500003540c8 running 37093621 k no sys
    $t545 wait 0x0000000110011730 blocked 71565447 k no sys _mutex_lock
    $t546 wait 0x0000000110011730 blocked 61800563 k no sys _mutex_lock
    $t547 wait 0x0000000110011730 blocked 19071095 k no sys _mutex_lock

    thread current 544

    (dbx) pthread
    thread state-k wchan state-u k-tid mode held scope function
    $t1 wait 0xf100050014616a08 running 78512381 k no sys
    *$t45 run blocked 49807565 k no sys _global_lock_common

    $t544 run running 15991017 u no sys __towlower_std
    $t534 wait 0xf1000500001b08c8 running 62783673 k no sys
    $t536 wait 0xf1000500007e28c8 running 37027857 k no sys

    (dbx) listi
    0x90000000022a2a0 (__towlower_std) f8610030 std r3,0x30(r1)
    0x90000000022a2a4 (__towlower_std+0x4) 7cc307b4 Invalid opcode.
    0x90000000022a2a8 (__towlower_std+0x8) e8810030 ld r4,0x30(r1)
    0x90000000022a2ac (__towlower_std+0xc) 8004003c lwz r0,0x3c(r4)
    0x90000000022a2b0 (__towlower_std+0x10) 7c060040 cmpl cr0,0x0,r6,r0
    0x90000000022a2b4 (__towlower_std+0x14) 4d810020 bgtlr
    0x90000000022a2b8 (__towlower_std+0x18) 78651764 Invalid opcode.
    0x90000000022a2bc (__towlower_std+0x1c) 80040038 lwz r0,0x38(r4)
    0x90000000022a2c0 (__towlower_std+0x20) 7c060040 cmpl cr0,0x0,r6,r0
    0x90000000022a2c4 (__towlower_std+0x24) 4d800020 bltlr
    (dbx)

     
  • Rick McGuire
    2011-01-11

    No, what I really need is the callback chain for that thread, not the instructions at the point where this is running. The previous post showed what method this is looping in, but I need to understand how we got to that method and what request is currently being executed.
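
    A minimal dbx recipe that should produce that callback chain (a sketch only; the thread numbers change from run to run, so substitute whichever thread is currently shown in state "run ... u", e.g. $t544 in the listing above):

    a. recreate the problem
    b. attach to the server: dbx -a <pid of rxapi>
    c. (dbx) thread - list all threads and note the number of the one running in user mode ("run ... u")
    d. (dbx) thread current 544 - switch to that thread (drop the $t prefix)
    e. (dbx) where - print the callback chain for that thread
    f. (dbx) detach - detach and let rxapi continue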

     
  • Rick McGuire
    2011-01-11

    Committed revision 6571.

    I'm about 99% certain I have a fix for this problem. It indeed was a synchronization issue that can occur when new named queues are frequently created and deleted. This problem could have shown up on any of the platforms, but for some reason we've only seen this on AIX so far. I've fixed this in the trunk. It would be nice to see if we can get a version of 4.1.0 created with this fix applied to see if it has really gone away.

     
  • Henning G
    2011-01-11

    Hi

    If you can send me the fix (or give me a link where I can download it), I can apply it ASAP.

    The AIX I am using is a "crash-and-burn" LPAR, so I have no problem applying a "non-official fix".

    To the stack trace question:
    I need some help to do that because I am not a dbx expert.
    But if someone can give me the method/steps, I can supply the needed data.

    /Henning

     
  • Mark Miesfeld
    2011-01-11

    Henning,

    We'll try to get you a build to test, but it may take a little while. Be a little patient here.

     
  • Henning G
    2011-01-12

    Hi

    The fix has been received, applied and tested.

    It is better, but I am still able to recreate the problem, and I also get other errors. Now it looks like rxapi fails to clean up its memory.
    I get:
    REX121: Storage for data queues is exhausted.

    To fix this I do:
    stop/kill all running rexx programs
    kill rxapi
    wait until all rexx processes have terminated.

    further documentation:
    (dbx) thread run
    thread state-k wchan state-u k-tid mode held scope function

    $t5 run blocked 35258983 k no sys _global_lock_common
    $t12026 run running 58261573 u no sys isupper
    (dbx) chread current t12026
    chread current t12026
    ^ unrecognized command
    (dbx) thread current 12026
    (dbx) listi
    0x90000000021b880 (isupper+0x60) 7d8803a6 mtlr r12
    0x90000000021b884 (isupper+0x64) 4e800020 blr
    0x90000000021b888 (isupper+0x68) 00000000 Invalid opcode.
    0x90000000021b88c (isupper+0x6c) 00002041 Invalid opcode.
    0x90000000021b890 (isupper+0x70) 80000100 lwz r0,0x100(r0)
    0x90000000021b894 (isupper+0x74) 00000000 Invalid opcode.
    0x90000000021b898 (isupper+0x78) 00000068 Invalid opcode.
    0x90000000021b89c (isupper+0x7c) 00076973 Invalid opcode.
    0x90000000021b8a0 (isupper+0x80) 75707065 andis. r16,r11,0x7065
    0x90000000021b8a4 (isupper+0x84) 72000000 andi. r0,r16,0x0
    (dbx) func
    isupper
    (dbx) func
    isupper

    REX121: Storage for data queues is exhausted.

    retest
    Looping 1200
    Looping 1300
    Looping 1400
    Looping 1500

    REX121: Storage for data queues is exhausted.

    REX121: Storage for data queues is exhausted.

    At this point my tests go "bananas" (Danish for out of control).

    root@aix031:/rxtest-> ps -ef | grep rxapi
    nobody 3342576 3538944 120 10:29:31 - 0:23 /opt/ooRexx/bin/rxapi
    root 56361044 2294156 1 10:56:51 pts/2 0:00 grep rxapi
    root@aix031:/rxtest-> dnx -a 3342576
    ksh: dnx: not found.
    root@aix031:/rxtest-> dbx -a 3342576
    Waiting to attach to process 3342576 ...
    Successfully attached to rxapi.
    warning: Directory containing rxapi could not be determined.
    Apply 'use' command to initialize source path.

    Type 'help' for help.
    reading symbolic information ...
    stopped in recv at 0x9000000002eef68 ($t235)
    0x9000000002eef68 (recv+0xc8) e8410028 ld r2,0x28(r1)
    (dbx) func
    recv
    (dbx) listi
    0x9000000002eef68 (recv+0xc8) e8410028 ld r2,0x28(r1)
    0x9000000002eef6c (recv+0xcc) 2c030000 cmpi cr0,0x0,r3,0x0
    0x9000000002eef70 (recv+0xd0) 60000000 ori r0,r0,0x0
    0x9000000002eef74 (recv+0xd4) 7c7c07b4 Invalid opcode.
    0x9000000002eef78 (recv+0xd8) 4180002c blt 0x9000000002eefa4 (recv+0x104)
    0x9000000002eef7c (recv+0xdc) 63830000 ori r3,r28,0x0
    0x9000000002eef80 (recv+0xe0) eb210088 ld r25,0x88(r1)
    0x9000000002eef84 (recv+0xe4) e98100d0 ld r12,0xd0(r1)
    0x9000000002eef88 (recv+0xe8) 382100c0 addi r1,0xc0(r1)
    0x9000000002eef8c (recv+0xec) 7d8803a6 mtlr r12
    (dbx) thread run
    thread state-k wchan state-u k-tid mode held scope function

    $t235 run running 26738801 k no sys recv
    $t1775 run running 11600205 u no sys Utilities::strCaselessCompare(const char*,const char*)
    (dbx) thread current t235
    pthread current t235
    ^ syntax error
    (dbx) thread current 1775
    (dbx) thread run
    thread state-k wchan state-u k-tid mode held scope function
    $t235 run running 26738801 k no sys recv
    $t1775 run running 11600205 u no sys Utilities::strCaselessCompare(const char*,const char*)
    (dbx) listi
    0x100014f38 (Utilities::strCaselessCompare(const char*,const char*)+0x58) e8410028 ld r2,0x28(r1)
    0x100014f3c (Utilities::strCaselessCompare(const char*,const char*)+0x5c) 607f0000 ori r31,r3,0x0
    0x100014f40 (Utilities::strCaselessCompare(const char*,const char*)+0x60) e86100b8 ld r3,0xb8(r1)
    0x100014f44 (Utilities::strCaselessCompare(const char*,const char*)+0x64) 88630000 lbz r3,0x0(r3)
    0x100014f48 (Utilities::strCaselessCompare(const char*,const char*)+0x68) 480000e1 bl 0x100015028 (tolower)
    0x100014f4c (Utilities::strCaselessCompare(const char*,const char*)+0x6c) e8410028 ld r2,0x28(r1)
    0x100014f50 (Utilities::strCaselessCompare(const char*,const char*)+0x70) 63e50000 ori r5,r31,0x0
    0x100014f54 (Utilities::strCaselessCompare(const char*,const char*)+0x74) 60640000 ori r4,r3,0x0
    0x100014f58 (Utilities::strCaselessCompare(const char*,const char*)+0x78) 60a30000 ori r3,r5,0x0
    0x100014f5c (Utilities::strCaselessCompare(const char*,const char*)+0x7c) 7c032000 cmp cr0,0x0,r3,r4

     
  • Henning G
    2011-01-12

    Hi
    Supplemental to my last comment.

    It looks like it is easier to provoke the error if SMT=on and the CPU count is > 1.
    I can recreate the problem on a 1-CPU system, but the rexxes run 3-10 times longer before the problem shows up. But then, only one CPU also means less opportunity for concurrency!
    /Henning

     
  • Henning G
    2011-01-13

    Hi
    Now I am 100% sure the problem is NOT fixed.

    This morning I had the loop without running my stress rexxes.

    This is the status:
    (dbx) thread run
    thread state-k wchan state-u k-tid mode held scope function

    $t9681 run blocked 34406595 k no sys _global_lock_common
    $t9702 run running 50921699 u no sys tolower
    (dbx) thread current 9702
    (dbx) listi
    0x100015038 (tolower+0x10) 7c0903a6 mtctr r0
    0x10001503c (tolower+0x14) 4e800420 bctr
    0x100015040 (tolower+0x18) 00000000 Invalid opcode.
    0x100015044 (tolower+0x1c) 000ca000 Invalid opcode.
    0x100015048 (tolower+0x20) 00000000 Invalid opcode.
    0x10001504c (tolower+0x24) 00000018 Invalid opcode.
    0x100015050 (toupper) e98203d8 ld r12,0x3d8(r2)
    0x100015054 (toupper+0x4) f8410028 std r2,0x28(r1)
    0x100015058 (toupper+0x8) e80c0000 ld r0,0x0(r12)
    0x10001505c (toupper+0xc) e84c0008 ld r2,0x8(r12)
    (dbx) func
    glink64.tolower
    (dbx) list
    no source file
    (dbx)

    /Henning

     
  • Henning G
    2011-01-13

    Observations and questions.

    If my rexx looks like this:

    #! /usr/bin/rexx

    CALL init_prog
    CALL doit
    CALL prog_end
    EXIT prog_rc                /* end main */

    init_prog:
    /* define rexx queue */
    newq = rxqueue('c')
    oldq = rxqueue('s',newq)
    RETURN                      /* end init_prog */

    doit:
    'hostname | rxqueue ' newq
    RETURN                      /* */

    prog_end:
    /* del tmp queue */
    rx = rxqueue('s',oldq)
    ry = rxqueue('d',newq)
    RETURN                      /* prog_end */

    and it is started several times (i.e. 10000 times from another rexx),
    I get "REX121: Storage for data queues is exhausted."
    or I can recreate the original rxapi loop problem.

    If I add the following to the "doit:" procedure:
    DO QUEUED()
    PULL data
    END

    With this minor change it looks like I can't recreate the problem, and "REX121: Storage for data queues is exhausted." no longer happens.
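
    For clarity, this is roughly what the modified doit: routine looks like with the drain loop added (a sketch based on the listing above, not the exact production code):

    doit:
    'hostname | rxqueue ' newq
    DO QUEUED()                 /* drain whatever is left on the private queue */
    PULL data
    END
    RETURN                      /* doit */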

    I also tried to recreate the problem after removing "newq = rxqueue('c') ... rxqueue('d',newq)" from the rexx; this also works without any problems.

    This made me consider why I had added the "rxqueue('c') ... rxqueue('d',newq)" calls to our rexxes in the first place, and the story was:
    10 years ago we had a problem with rexx on our AIXes when we had several rexxes running simultaneously.
    The problem was that the rexxes would "snatch" data from each other's rxqueue.
    I think we were using rexx version 1 or 2 at that time.

    So the question is: has this problem been solved, so that I can avoid the massive use of "rxqueue('c')" and thus reduce the need for serialization in rxapi?

    The "rxapi loop" problem will still exist, but have I then eliminated (or reduced) the reason for it in my environment?
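
    For reference, a minimal sketch of the variant without private queues, i.e. with the rxqueue('c')/rxqueue('d') calls removed so that only the default session queue is used (an illustration of the setup described above as running cleanly, not code from this ticket):

    #! /usr/bin/rexx
    /* push the hostname onto the default session queue and read it back */
    'hostname | rxqueue'
    PARSE PULL my_name          /* the external data queue is read before stdin */
    SAY my_name
    EXIT 0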

     
  • Henning G
    2011-01-14

    I have applied the new rxapi to my system.

    Since the fix was applied I have not (yet) been able to recreate the rxapi loop problem.

    I have just finished test #100K.

    Not only does the rxapi loop seem to be solved; the "REX121: Storage for data queues is exhausted" and "REX124: The queue 'quename' does not exist." problems also seem to be solved.

    I will continue my testing.

    /Henning

     
  • Henning G
    2011-01-17

    I have been running stress tests on my system over the weekend, and I can no longer recreate the problem.

    So I would say the problem has been fixed.

    Final question: when can I get the official fix so I can deploy rexx 4.1 (incl. the fix) to all my systems?
    /HGA

     
  • Henning G
    2011-03-10

    In January you built a 64-bit rexx 4.1 for me with the "rxapi looping" fix included.

    But I cannot remember whether you also built a 32-bit version with the patch built in. All of my 64-bit installations work, but I still have the looping bug on the LPARs where the application requires 32-bit rexx.
    Can I get you to build a 32-bit rexx 4.1 with this fix included?

     