#10 accept() socket errors under load (linux)

v0.1.14
closed
3
2004-07-14
2004-07-13
No

Here are the system specs for this bug:

# uname -a
Linux dill 2.4.19-64GB-SMP #1 SMP Mon Oct 21 18:48:05
UTC 2002 i686 unknown
# cat /etc/issue
Welcome to SuSE SLES 8 (powered by UnitedLinux 1.0) (i586)
Kernel \r (\l).
# top
CPU0 states: 0.0% user, 0.1% system, 0.0% nice,
99.0% idle
CPU1 states: 0.0% user, 0.0% system, 0.0% nice,
100.0% idle
Mem: 2327776K av, 1412280K used, 915496K free,
0K shrd, 197740K buff
Swap: 1052616K av, 496K used, 1052120K free
709304K cached

Under heavy testing load, the dk-filter was unable to
accept some connections, while others continued
successfully.

This problem is repeatable under load. Seems to be
happening regularly under load on Linux. dk-filter
is happily chugging along:

Jul 12 16:58:24 dill sm-mta[13276]: i6CNw8fJ013276:
from=<sender@othercompany.com>, size=1075, class=0,
nrcpts=1,
msgid=<12345678.1@nowhere>, proto=ESMTP, daemon=MTA,
relay=basil.ps-lab.sendmail.com [10.211.32.151]
Jul 12 16:58:24 dill sm-mta[13276]: i6CNw8fJ013276:
Milter insert (1): header:
DomainKey-Signature: a=rsa-sha1; s=thomtest;
d=othercompany.com; c=simple; q=;
b=E\n\tlnrAeF0G5dQVk/HcnA3eZ+gzfkTYaH2rBUNXfLhpVuGAEyVa5PTdGlXbnMxz76R
Jul 12 16:58:24 dill sm-mta[13268]: i6CNw63w013268:
to=<u570@othercompany.com>,
delay=00:00:00, xdelay=00:00:00, mailer=esmtp, pri=31075,
relay=thyme.othercompany.com. [10.211.32.156],
dsn=2.0.0, stat=Sent (Ok)

And then suddenly:

Jul 12 16:58:25 dill sm-mta[13318]: i6CNwPAm013318:
Milter (dk-filter): write(O) returned -1, expected 17:
Broken pipe
Jul 12 16:58:25 dill sm-mta[13318]: i6CNwPAm013318:
Milter (dk-filter): to error state
Jul 12 16:58:25 dill sm-mta[13318]: i6CNwPAm013318:
Milter (dk-filter): init failed to open
Jul 12 16:58:25 dill sm-mta[13318]: i6CNwPAm013318:
Milter (dk-filter): to error state
Jul 12 16:58:25 dill dk-filter[12795]: Sendmail
DomainKeys Filter: accept() returned invalid socket
(Numerical result out of range), try again
Jul 12 16:58:25 dill sm-mta[13318]: i6CNwPAm013318:
Milter: initialization failed, temp failing commands

Of particular interest is this error:

Jul 12 16:58:25 dill dk-filter[12795]: Sendmail
DomainKeys Filter: accept() returned invalid socket
(Numerical result out of range), try again

After the failure begain occurring, it continued to
occur on some but not all threads (some threads
completed successfully):

After which some but not all threads fail, while some
seem to keep on working (note this is approximately 8
minutes later, but the same test has continued running
without any processes being restarted):

Jul 12 17:06:54 dill sm-mta[14717]: i6D05o3e014717:
from=<sender@othercompany.com>, size=1075, class=0,
nrcpts=1, msgid=<12345678.1@nowhere>, proto=ESMTP,
daemon=MTA, relay=basil.mydomain.com [10.211.32.151]
Jul 12 17:06:54 dill sm-mta[14717]: i6D05o3e014717:
Milter insert (1): header: DomainKey-Signature:
a=rsa-sha1; s=thomtest; d=othercompany.com; c=simple; q=;
b=E\n\tlnrAeF0G5dQVk/HcnA3eZ+gzfkTYaH2rBUNXfLhpVuGAEyVa5PTdGlXbnMxz76R
Jul 12 17:06:54 dill sm-mta[14717]: i6D05o3e014717:
to=<u93@othercompany.com>, delay=00:00:00,
xdelay=00:00:00, mailer=esmtp, pri=31075,
relay=thyme.othercompany.com. [10.211.32.156],
dsn=2.0.0, stat=Sent (Ok)
Jul 12 17:06:54 dill sm-mta[14717]: i6D05o3f014717:
from=<sender@othercompany.com>, size=1075, class=0,
nrcpts=1, msgid=<12345678.1@nowhere>, proto=ESMTP,
daemon=MTA, relay=basil.mydomain.com [10.211.32.151]
Jul 12 17:06:54 dill sm-mta[14717]: i6D05o3f014717:
Milter insert (1): header: DomainKey-Signature:
a=rsa-sha1; s=thomtest; d=othercompany.com; c=simple; q=;
b=E\n\tlnrAeF0G5dQVk/HcnA3eZ+gzfkTYaH2rBUNXfLhpVuGAEyVa5PTdGlXbnMxz76R
Jul 12 17:06:54 dill sm-mta[14717]: i6D05o3f014717:
to=<u18@othercompany.com>,
delay=00:00:00, xdelay=00:00:00, mailer=esmtp,
pri=31075, relay=thyme.othercompany.com.
[10.211.32.156], dsn=2.0.0, stat=Sent (Ok)
Jul 12 17:06:56 dill dk-filter[12795]: Sendmail
DomainKeys Filter: accept() returned invalid socket
(Numerical result out of range), try again
Jul 12 17:06:56 dill sm-mta[14724]: i6D05tmk014724:
Milter (dk-filter): read returned -1: Connection reset
by basil.mydomain.com
Jul 12 17:06:56 dill sm-mta[14724]: i6D05tmk014724:
Milter (dk-filter): to error state
Jul 12 17:06:56 dill sm-mta[14724]: i6D05tmk014724:
Milter (dk-filter): init failed to open
Jul 12 17:06:56 dill sm-mta[14724]: i6D05tmk014724:
Milter (dk-filter): to error state
Jul 12 17:06:56 dill sm-mta[14724]: i6D05tmk014724:
Milter: initialization failed, temp failing commands
Jul 12 17:06:56 dill sm-mta[14724]: i6D05tmk014724:
basil.mydomain.com [10.211.32.151] did not issue
MAIL/EXPN/VRFY/ETRN during connection to MTA

Discussion

  • Anonymous - 2004-07-13

    Logged In: YES
    user_id=1048957

    This is probably not a bug. That error is reported when
    (among other possibilities) accept() returns a descriptor that
    is too large to be used with select(). This is a safety
    mechanism, because forcing use of such a descriptor would
    corrupt memory.

    You can remedy this by compiling libmilter with
    _FFR_USE_POLL (8.12.x) and recompiling dk-filter. I can't
    remember if 8.13.0 has this code enabled or not; if it doesn't,
    try the same thing.

    I'll leave this open for now; please let me know how that
    works.

     
  • Anonymous - 2004-07-13
    • priority: 5 --> 3
    • assigned_to: nobody --> sm-msk
     
  • Thom OConnor

    Thom OConnor - 2004-07-13

    Logged In: YES
    user_id=1082229

    Okay, I'm working on this now.

    I think the POLL build option is "-DSM_CONF_POLL". Here is
    the site.config.m4 option I'm using:

    APPENDDEF(`confENVDEF', `-DSM_CONF_POLL=1')dnl

    The libmilter.a library that I build dk-filter with looks to
    use only poll:

    # nm -A libmilter.a | egrep 'poll|select'
    libmilter.a:listener.o: U poll
    libmilter.a:comm.o: U poll

    However, the resulting dk-filter binary appears to contain
    both the poll and select symbols still:

    # nm -A dk-filter | egrep 'poll|select'
    dk-filter: U poll@@GLIBC_2.0
    dk-filter: U select@@GLIBC_2.0
    dk-filter:08062bb0 B selector

    Using this new binary, I'm still getting the same error
    messages:

    Jul 13 14:08:18 dill dk-filter[4662]: Sendmail DomainKeys
    Filter: accept() returned invalid socket (Numerical result
    out of range), try again
    Jul 13 14:08:18 dill sm-mta[4730]: i6DL87PH004730:
    to=<u778@othercompany.com>, d
    elay=00:00:10, xdelay=00:00:00, mailer=esmtp, pri=31075,
    relay=thyme.othercompany.com. [10.211.32.156], dsn=2.0.0,
    stat=Sent (Ok)
    Jul 13 14:08:18 dill sm-mta[4702]: i6DL87YY004702:
    to=<u781@othercompany.com>, d
    elay=00:00:00, xdelay=00:00:00, mailer=esmtp, pri=31075,
    relay=thyme.othercompan
    y.com. [10.211.32.156], dsn=2.0.0, stat=Sent (Ok)
    Jul 13 14:08:18 dill sm-mta[5354]: i6DL8Hq1005354: Milter
    (dk-filter): read returned -1: Connection reset by
    basil.mydomain.com
    Jul 13 14:08:18 dill sm-mta[5354]: i6DL8Hq1005354: Milter
    (dk-filter): to error state
    Jul 13 14:08:18 dill sm-mta[5354]: i6DL8Hq1005354: Milter
    (dk-filter): init failed to open
    Jul 13 14:08:18 dill sm-mta[5354]: i6DL8Hq1005354: Milter
    (dk-filter): to error state
    Jul 13 14:08:18 dill sm-mta[5354]: i6DL8Hq1005354: Milter:
    initialization failed, temp failing commands

    In using linux truss (strace) on the various threads, I see
    both poll() and select() syscalls. First, here are the
    threads as shown by ps (with no messages running through it):

    # ps ax | grep dk-filter | less
    7056 pts/3 S 0:00 /usr/local/sendmail/bin/dk-filter
    -l -d mycompany.com othercompany.com -i
    /etc/mail/dk-filter-ilist -p
    /var/sendmail/dk-filter/dk-filter.sock -s
    /var/sendmail/dk-filter/thomtest.key.pem -S thomtest -A
    7057 ? S 0:00 /usr/local/sendmail/bin/dk-filter
    -l -d mycompany.com,othercompany.com -i
    /etc/mail/dk-filter-ilist -p
    /var/sendmail/dk-filter/dk-filter.sock -s
    /var/sendmail/dk-filter/thomtest.key.pem -S thomtest -A
    7059 ? S 0:00 /usr/local/sendmail/bin/dk-filter
    -l -d mycompany.com othercompany.com -i
    /etc/mail/dk-filter-ilist -p
    /var/sendmail/dk-filter/dk-filter.sock -s
    /var/sendmail/dk-filter/thomtest.key.pem -S thomtest -A
    7061 ? S 0:00 /usr/local/sendmail/bin/dk-filter
    -l -d mycompany.com othercompany.com -i
    /etc/mail/dk-filter-ilist -p
    /var/sendmail/dk-filter/dk-filter.sock -s
    /var/sendmail/dk-filter/thomtest.key.pem -S thomtest -A

    strace on each of these threads shows the following:

    dill:/var/tmp # strace -p 7056
    wait4(-1, <unfinished ...>

    dill:/var/tmp # strace -p 7057
    select(4, [3], NULL, [3], {4, 740000}) = 0 (Timeout)
    select(4, [3], NULL, [3], {5, 0} <unfinished ...>

    dill:/var/tmp # strace -p 7059
    getppid() = 7057
    poll([{fd=5, events=POLLIN}], 1, 2000) = 0
    getppid() = 7057
    poll( <unfinished ...>

    dill:/var/tmp # strace -p 7061

    So it looks like one thread is still using select(). Looking
    at the source, I see a select() calls in libsm and libar:

    libar/ar.c: status = select(maxfd + 1, &rfds,
    &wfds, NULL, NULL);
    libar/ar.c: status = select(maxfd + 1, NULL, &wfds,
    NULL, &stimeout);
    libsm/clock.c: r = select(0, NULL, NULL,
    NULL, &sm_io_to);
    libsm/refill.c:** This #define uses a select() to wait for
    the 'fd' to become readable.
    libsm/refill.c:** The select() can be active for up to 'To'
    time. The select() may not
    libsm/refill.c:** BSD-based/like systems the timeout for a
    select() is updated for the
    libsm/refill.c:** fd -- raw file descriptor
    (from 'fp') to use for select()
    libsm/refill.c:** sel_ret -- the return value
    from the select()
    libsm/refill.c: (sel_ret) = select((fd) + 1, &sm_io_to_mask,
    NULL, \ libsm/refill.c: ** something available to be read (via
    select()).
    libsm/stdio.c: if (select(fp->f_file + 1,
    FDSET_CAST &readfds,

    So, I'm a bit puzzled here. The accept() error I'm seeing is
    from a portion of libmilter.c which is not inside a "#if
    !SM_CONF_POLL" section:

    sendmail-8.13.0
    libmilter/libmilter.c, line 844:
    smi_log(SMI_LOG_ERR,
    "%s: accept() returned invalid socket (%s), %s",
    smfi->xxfi_name, sm_errstring(save_errno),
    acnt >= MAX_FAILS_A ? "abort" : "try again");

    Thanks, thoughts welcome.

    Thom O'Connor

     
  • Thom OConnor

    Thom OConnor - 2004-07-13

    Logged In: YES
    user_id=1082229

    Okay, please disregard the last followup. The build process
    was working correctly, the problem was a startup script
    issue (starting up the wrong binary location).

    After correcting the situation, the new binary appears to
    solely be using poll():

    # ps ax | grep dk-filter
    29785 pts/1 S 0:00 /usr/local/sendmail/bin/dk-filter
    -l -d mycompany.com othercompany.com -i
    /etc/mail/dk-filter-ilist -p
    /var/sendmail/dk-filter/dk-filter.sock -s
    /var/sendmail/dk-filter/thomtest.key.pem -S thomtest -A
    29786 ? S 0:00 /usr/local/sendmail/bin/dk-filter
    -l -d mycompany.com othercompany.com -i
    /etc/mail/dk-filter-ilist -p
    /var/sendmail/dk-filter/dk-filter.sock -s
    /var/sendmail/dk-filter/thomtest.key.pem -S thomtest -A
    29789 ? S 0:00 /usr/local/sendmail/bin/dk-filter
    -l -d mycompany.com othercompany.com -i
    /etc/mail/dk-filter-ilist -p
    /var/sendmail/dk-filter/dk-filter.sock -s
    /var/sendmail/dk-filter/thomtest.key.pem -S thomtest -A
    29790 ? S 0:00 /usr/local/sendmail/bin/dk-filter
    -l -d mycompany.com othercompany.com -i
    /etc/mail/dk-filter-ilist -p
    /var/sendmail/dk-filter/dk-filter.sock -s
    /var/sendmail/dk-filter/thomtest.key.pem -S thomtest -A

    dill:/usr/bin # strace -p 29785
    wait4(-1, <unfinished ...>

    dill:/usr/bin # strace -p 29786
    poll( <unfinished ...>

    dill:/usr/bin # strace -p 29789
    getppid() = 29786
    poll( <unfinished ...>

    dill:/usr/bin # strace -p 29790

    Continuing to test, then, I'll let you know how the
    performance test turns out.

    Thanks,

    -t

     
  • Thom OConnor

    Thom OConnor - 2004-07-14

    Logged In: YES
    user_id=1082229

    Okay, just to finalize and close this bug.

    The key is to building a dk-fillter binary built for very
    high performance is two-fold:

    1. Build libmilter with poll() rather than select(). This is
    accomplished by modifying your sendmail build file
    (site.config.m4) to contain this:

    dnl # use poll() instead of select() in libmilter
    APPENDDEF(`confENVDEF', `-DSM_CONF_POLL=1')dnl

    Then build dk-filter with the resulting libmilter.a

    2. Make sure to build dk-filter using the asynchronous
    (ARLIB) resolver. This is accomplished by modifying the
    Makefile.m4 in libdk:

    bldPUSH_SMLIB(`ar')
    APPENDDEF(`confENVDEF', `-DUSE_ARLIB ')
    APPENDDEF(`confINCDIRS', `-I../libar/ ')

    Obviously, overall throughput performance is then based on
    factors such as CPU resource available, DNS performance,
    etc. However, in benchmark tests using 100% message signing
    or verification, on a 2-cpu linux box, I'm getting
    consistent rates of over 100 messages signed or verified per
    second (32kB average message size), and upwards of 200
    messages signed/verified per second when using 1kB messages.

    Thanks!

    Thom O'Connor

     
  • Thom OConnor

    Thom OConnor - 2004-07-14

    Logged In: YES
    user_id=1082229

    Okay, just to finalize and close this bug.

    The key is to building a dk-fillter binary built for very
    high performance is two-fold:

    1. Build libmilter with poll() rather than select(). This is
    accomplished by modifying your sendmail build file
    (site.config.m4) to contain this:

    dnl # use poll() instead of select() in libmilter
    APPENDDEF(`confENVDEF', `-DSM_CONF_POLL=1')dnl

    Then build dk-filter with the resulting libmilter.a

    2. Make sure to build dk-filter using the asynchronous
    (ARLIB) resolver. This is accomplished by modifying the
    Makefile.m4 in libdk:

    bldPUSH_SMLIB(`ar')
    APPENDDEF(`confENVDEF', `-DUSE_ARLIB ')
    APPENDDEF(`confINCDIRS', `-I../libar/ ')

    Obviously, overall throughput performance is then based on
    factors such as CPU resource available, DNS performance,
    etc. However, in benchmark tests using 100% message signing
    or verification, on a 2-cpu linux box, I'm getting
    consistent rates of over 100 messages signed or verified per
    second (32kB average message size), and upwards of 200
    messages signed/verified per second when using 1kB messages.

    Thanks!

    Thom O'Connor

     
  • Anonymous - 2004-07-14
    • status: open --> closed
     
  • Anonymous - 2004-07-14

    Logged In: YES
    user_id=1048957

    Thanks for the post. I'll put your comments in the
    project's README.

     

Get latest updates about Open Source Projects, Conferences and News.

Sign up for the SourceForge newsletter:





No, thanks