#18 JOB Command uses up File Descriptors

closed-fixed
None
4
2002-01-11
2001-06-07
Bob Isch
No

GT.M seems to generate duplicate file descriptors for the database
and mutex files when jobs are created with the JOB command.

If an application "jobs" a process that in turn jobs another
process, and so on repeatedly, a process eventually runs out of
open file descriptors, typically at around 1024 or whatever the
"ulimit -n" value is. This seems to be the per-process limit, not
the system-wide limit.
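
For readers who want to check which limit applies, here is a minimal sketch (in Python, not part of the original report) that reads the per-process descriptor limit, the same value "ulimit -n" reports as the soft limit. The function name fd_limits is made up for illustration.

```python
import resource


def fd_limits():
    """Return the (soft, hard) per-process limit on open file descriptors."""
    # RLIMIT_NOFILE is the per-process limit; the soft value is what
    # "ulimit -n" prints in most shells.
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft, hard


if __name__ == "__main__":
    soft, hard = fd_limits()
    print(f"soft limit: {soft}, hard limit: {hard}")
```

A process fails to open new files once it reaches the soft value, no matter how many descriptors are still free system-wide.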

Anyway, the process will eventually fail, usually with one of
the following errors:

%GTM-E-ZLINKFILE, Error while zlinking "myfile.o"
%SYSTEM-E-ENO23, Too many open files in system

-or-

%GTM-F-KILLBYSIGSINFO1, GT.M process 6942 has been killed by a signal 11 at address 0x808540A
(vaddr 0x4)

This problem is demonstrated by the following sample program.
To test this, execute the following from the command line:

> k ^foo d ^bugfd

The values in ^foo("fdcnt",...) are the number of open
file descriptors for a given iteration.

Don't try this as root; some systems do not limit
root in this way.

See the 'lsof' output below for an example of
the numerous file descriptors referencing open
copies of the database "mutex" files.

Fixes or work-arounds most welcome.

Bob

------------------------------------

bugfd ; Demonstrate problem with ever-duplicating file descriptors
 ;
 s $ztrap="g err"
 lock +^foo
 s (n,^foo)=$g(^foo)+1 ; n = generation number of this process
 s fn="/tmp/bugfd."_$j
 zsystem "/usr/sbin/lsof -p "_$j_" | wc -l >"_fn ; count lsof lines for this process
 ;
 o fn:(readonly)
 u fn
 r fdcnt
 c fn:(delete)
 ;
 s fdcnt=$tr(fdcnt," ")
 s ^foo("fdcnt",n)=fdcnt
 ;
 ;i n>100 halt ; To test without hitting hard limit
 ;
 job ^bugfd ; next generation waits on the lock until we release it
 ;
 lock -^foo
 q ; >>> bugfd

err ;
 zshow "*"
 halt

------------------------------------

tst> zwr ^foo
^foo=509
^foo("fdcnt",1)=18
^foo("fdcnt",2)=20
^foo("fdcnt",3)=22
^foo("fdcnt",4)=24
^foo("fdcnt",5)=26
^foo("fdcnt",6)=28
^foo("fdcnt",7)=30
^foo("fdcnt",8)=32
^foo("fdcnt",9)=34
^foo("fdcnt",10)=36
^foo("fdcnt",11)=38
^foo("fdcnt",12)=40
^foo("fdcnt",13)=42
^foo("fdcnt",14)=44
^foo("fdcnt",15)=46
^foo("fdcnt",16)=48
^foo("fdcnt",17)=50
...
^foo("fdcnt",495)=1007
^foo("fdcnt",496)=1009
^foo("fdcnt",497)=1011
^foo("fdcnt",498)=1013
^foo("fdcnt",499)=1015
^foo("fdcnt",500)=1017
^foo("fdcnt",501)=1019
^foo("fdcnt",502)=1021
^foo("fdcnt",503)=1023
^foo("fdcnt",504)=1025
^foo("fdcnt",505)=1027
^foo("fdcnt",506)=1029
^foo("fdcnt",507)=1031
^foo("fdcnt",508)=0
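
As a reading aid (not part of the original report), the sketch below works through the arithmetic on the visible entries. It assumes each count is the raw "lsof | wc -l" line total, and that every generation carries the same 13 non-descriptor lines seen in the sample lsof listing below (1 header line, then cwd, rtd, txt and 9 mem rows). The names HEAD, TAIL and NON_FD_LINES are made up for illustration.

```python
# lsof line counts copied from the visible zwr entries (generation -> count).
HEAD = {1: 18, 2: 20, 3: 22, 4: 24, 5: 26}
TAIL = {503: 1023, 504: 1025, 505: 1027, 506: 1029, 507: 1031}

# Assumption: 1 header line + cwd, rtd, txt + 9 mem rows, as in the
# sample lsof listing; none of these are numbered file descriptors.
NON_FD_LINES = 13


def growth_per_generation(samples):
    """Differences between consecutive generations' line counts."""
    gens = sorted(samples)
    return [samples[b] - samples[a] for a, b in zip(gens, gens[1:])]


def estimated_open_fds(line_count):
    """Estimate open descriptors by removing the non-descriptor lines."""
    return line_count - NON_FD_LINES


if __name__ == "__main__":
    print(growth_per_generation(HEAD))    # [2, 2, 2, 2]
    print(growth_per_generation(TAIL))    # [2, 2, 2, 2]
    print(estimated_open_fds(HEAD[1]))    # 5
    print(estimated_open_fds(TAIL[507]))  # 1018
```

Under that assumption, each generation adds two descriptors (one database file and one mutex socket), and the counts can pass 1024 only because they include lines that are not descriptors.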

------------------------------------

COMMAND PID USER FD TYPE DEVICE SIZE NODE NAME
mumps 4949 oper cwd DIR 72,8 16384 783361 /u/tst/prg
mumps 4949 oper rtd DIR 72,6 4096 2 /
mumps 4949 oper txt REG 72,8 1423904 391692 /u/gtm/mumps
mumps 4949 oper mem REG 72,6 340663 375363 /lib/ld-2.1.3.so
mumps 4949 oper mem REG 72,6 262884 244838 /usr/lib/libncurses.so.4.0
mumps 4949 oper mem REG 72,6 527442 375381 /lib/libm-2.1.3.so
mumps 4949 oper mem REG 72,6 75131 375379 /lib/libdl-2.1.3.so
mumps 4949 oper mem REG 72,6 4101324 375370 /lib/libc-2.1.3.so
mumps 4949 oper mem REG 72,6 246652 375401 /lib/libnss_files-2.1.3.so
mumps 4949 oper mem REG 72,6 252234 375407 /lib/libnss_nisplus-2.1.3.so
mumps 4949 oper mem REG 72,6 370141 375383 /lib/libnsl-2.1.3.so
mumps 4949 oper mem REG 72,6 255963 375405 /lib/libnss_nis-2.1.3.so
mumps 4949 oper 0r CHR 1,3 179543 /dev/null
mumps 4949 oper 1w REG 72,8 0 784192 /u/tst/prg/bugfd.mjo
mumps 4949 oper 2w REG 72,8 0 784193 /u/tst/prg/bugfd.mje
mumps 4949 oper 3u REG 72,8 52560384 1615736 /u/tst/db/mumps_tmp.dat
mumps 4949 oper 4u unix 0xcd6c2840 1622783 /tmp/gtm_mutex00000F29
mumps 4949 oper 5u REG 72,8 52560384 1615736 /u/tst/db/mumps_tmp.dat
mumps 4949 oper 6u unix 0xcd6c3080 1622793 /tmp/gtm_mutex00000F2F
mumps 4949 oper 7u REG 72,8 52560384 1615736 /u/tst/db/mumps_tmp.dat
mumps 4949 oper 8u unix 0xcd6c2000 1622803 /tmp/gtm_mutex00000F35
mumps 4949 oper 9u REG 72,8 52560384 1615736 /u/tst/db/mumps_tmp.dat
mumps 4949 oper 10u unix 0xcd6c2b00 1622813 /tmp/gtm_mutex00000F3B
mumps 4949 oper 11u REG 72,8 52560384 1615736 /u/tst/db/mumps_tmp.dat
mumps 4949 oper 12u unix 0xcd6c3340 1622823 /tmp/gtm_mutex00000F41
mumps 4949 oper 13u REG 72,8 52560384 1615736 /u/tst/db/mumps_tmp.dat
mumps 4949 oper 14u unix 0xcd6c38c0 1622833 /tmp/gtm_mutex00000F47
mumps 4949 oper 15u REG 72,8 52560384 1615736 /u/tst/db/mumps_tmp.dat
mumps 4949 oper 16u unix 0xcd6c2580 1622843 /tmp/gtm_mutex00000F4D
mumps 4949 oper 17u REG 72,8 52560384 1615736 /u/tst/db/mumps_tmp.dat
mumps 4949 oper 18u unix 0xcd6c2dc0 1622853 /tmp/gtm_mutex00000F53
mumps 4949 oper 19u REG 72,8 52560384 1615736 /u/tst/db/mumps_tmp.dat
mumps 4949 oper 20u unix 0xcd6c22c0 1622863 /tmp/gtm_mutex00000F59
mumps 4949 oper 21u REG 72,8 52560384 1615736 /u/tst/db/mumps_tmp.dat
mumps 4949 oper 22u unix 0xd0cff780 1622873 /tmp/gtm_mutex00000F5F
...
mumps 4949 oper 360u unix 0xd0217100 1624563 /tmp/gtm_mutex00001355

END

Discussion

  • Steven Estes

    Steven Estes - 2001-06-07

    Hi Bob, a most excellent bug with wonderful supporting docs
    -- makes my job a lot easier. Thank you!

    When entering the orphaned jobbed off process, the code in
    question (ojstartchild.h) does close all of the open "flat"
    files that the M user may have opened but for some
    inexplicable reason does not close the database files or the
    mutex socket which it will no longer be using. I have a fix
    for this which will hopefully make it into the next release
    if it can make it through the testing/review regimen in
    time. If you need it before then and can do your own builds
    I can send you the new version. Note this has not yet been
    tested to make sure it doesn't have bad side effects like
    closing files in the parent, etc.

    Steve
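
The behaviour described above can be reproduced outside GT.M. The following is a minimal sketch in Python, not GT.M's actual code and not the fix in ojstartchild.h: a forked child inherits every descriptor its parent has open unless it closes them itself. The helper names and the scan limit of 256 are assumptions chosen for illustration.

```python
import os


def count_open_fds(limit=256):
    """Count descriptors in [0, limit) that are open in this process."""
    count = 0
    for fd in range(limit):
        try:
            os.fstat(fd)
        except OSError:
            continue  # fd is not open
        count += 1
    return count


def child_fd_count(close_inherited, limit=256):
    """Fork a child and return how many descriptors it has open.

    If close_inherited is true, the child first closes everything above
    stderr except the pipe it uses to report back.
    """
    report_r, report_w = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child process: never return into the caller's code.
        try:
            os.close(report_r)
            if close_inherited:
                for fd in range(3, limit):
                    if fd == report_w:
                        continue
                    try:
                        os.close(fd)
                    except OSError:
                        pass  # fd was not open
            os.write(report_w, str(count_open_fds(limit)).encode())
        finally:
            os._exit(0)
    # Parent process: collect the child's report.
    os.close(report_w)
    data = os.read(report_r, 32)
    os.close(report_r)
    os.waitpid(pid, 0)
    return int(data)


if __name__ == "__main__":
    # Stand-ins for the parent's database file and mutex socket.
    extras = [os.open(os.devnull, os.O_RDONLY) for _ in range(5)]
    print("child keeps inherited fds: ", child_fd_count(False))
    print("child closes inherited fds:", child_fd_count(True))
    for fd in extras:
        os.close(fd)
```

When each child keeps what it inherits and then opens its own database file and mutex socket, the total grows with every generation, which matches the pattern in the report.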

     
  • Bob Isch

    Bob Isch - 2001-06-12

    Steve: Use of "most excellent" and "bug" together is a new
    one... I'm sure someone will come up with one of
    those "easier to document than to fix" bugs soon enough.
    Sounds like I can wait for the next release which must be
    sometime soon? Otherwise, I'd be too tempted to "test" it
    in "production..."

    Thanks for looking into it,
    -bi

     
  • Steven Estes

    Steven Estes - 2002-01-11

    Bob, this was fixed in V4.3. Can you do the needful to
    close it out? Thanks.

     
  • K.S. Bhaskar

    K.S. Bhaskar - 2002-01-11
    • priority: 5 --> 4
    • assigned_to: nobody --> estess
    • status: open --> closed-fixed
     
