#20 Hang on select()

Status: closed
Owner: nobody
Labels: None
Priority: 5
Updated: 2004-11-01
Created: 2003-11-25
Creator: Anonymous
Private: No

For some users, SpamProbe is working perfectly. But
for one user, it hangs on select().

(gdb) r summarize -T MakeSpam.realbegova
Starting program: /usr/local/bin/spamprobe summarize -T MakeSpam.realbegova

Program received signal SIGINT, Interrupt.
0x401e641e in select () from /lib/libc.so.6
(gdb) bt
#0 0x401e641e in select () from /lib/libc.so.6
#1 0x400a941c in __DTOR_END__ () from /lib/libdb-3.3.so
#2 0x4008fd77 in __os_yield () from /lib/libdb-3.3.so
#3 0x4002d4af in __db_tas_mutex_lock () from /lib/libdb-3.3.so
#4 0x4006dcf3 in __db_e_attach () from /lib/libdb-3.3.so
#5 0x4006b2bc in __dbenv_open () from /lib/libdb-3.3.so
#6 0x0805a11f in FrequencyDBImpl_bdb::open (this=0x8084e60, arg_filename=@0xbffff660, read_only=true, create_mode=384) at /usr/include/g++-3/std/bastring.h:343
#7 0x0805b952 in FrequencyDBImpl_cache::open (this=0x8084e48, filename=@0xbffff660, read_only=true, create_mode=384) at NewPtr.h:53
#8 0x08057c78 in FrequencyDB::open (this=0x8084e10, arg_filename=@0xbffff6d0, read_only=true) at NewPtr.h:53
#9 0x080652e6 in SpamFilter::open (this=0x8084e10, db_file=@0xbffffa30, read_only=true) at File.h:57
#10 0x08050182 in main (argc=4, argv=0xbffffb84) at NewPtr.h:53
#11 0x401271c4 in __libc_start_main () from /lib/libc.so.6
(gdb)

I assume it's waiting on a lockfile that it can never
acquire.
This is on Red Hat 7.3 with the db3-3.3.11-6 RPM.
If I can render any further help, email
abentley@panoramicfeedback.com
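
Frames #0 to #3 of the backtrace show __db_tas_mutex_lock calling __os_yield, which in turn sits in select(). One plausible reading (an assumption, not confirmed against the Berkeley DB source) is a test-and-set lock that sleeps between retries. If the previous holder died without releasing the lock, and the lock state persists outside the dead process, an unbounded retry loop never returns. The sketch below illustrates that pattern with a bounded loop so the failure is visible; yield_with_select and try_lock_with_backoff are hypothetical names, not Berkeley DB functions.

```cpp
#include <atomic>
#include <sys/select.h>

// Sleep for the given time by calling select() with no file descriptors,
// the same call shape that appears in frame #0 of the backtrace.
inline void yield_with_select(long seconds, long microseconds) {
    timeval tv;
    tv.tv_sec = seconds;
    tv.tv_usec = microseconds;
    select(0, nullptr, nullptr, nullptr, &tv);
}

// Hypothetical bounded test-and-set lock (illustration only).
// Returns true if the lock was acquired, false if it was still held
// after max_attempts tries.
inline bool try_lock_with_backoff(std::atomic_flag &lock, int max_attempts) {
    for (int attempt = 0; attempt < max_attempts; ++attempt) {
        if (!lock.test_and_set(std::memory_order_acquire)) {
            return true;  // flag was clear, so we now own the lock
        }
        yield_with_select(0, 1000);  // pause 1 ms before retrying
    }
    // The holder never released the lock. Without the attempt limit,
    // this loop would spin in select() forever.
    return false;
}
```

With a lock that is already set and never cleared, try_lock_with_backoff returns false once the attempts are used up; remove the limit and the caller hangs in select() in the way the backtrace shows.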

Discussion

  • Logged In: YES
    user_id=508249

    I am seeing this too (using db-4.2.52). It usually happens
    after I interrupt a spamprobe run over several files, such as

    spamprobe score folder/*

    It seems that when spamprobe "cleans up", it breaks the
    database. The result is that further spamprobes hang
    indefinitely in select(). Running db_recover fixes the
    database, and spamprobe processes started afterwards run as
    usual (the ones already hanging keep hanging until killed).
    Sometimes the blocked spamprobes need to be killed with
    SIGKILL because they hang in the "cleaning up" stage of a
    previous interrupt.

    Other observations:

    1. When the database is screwed up, spamprobe of multiple
    messages doesn't always hang immediately. Sometimes it
    classifies the first message and then hangs.

    2. After running spamprobe there's always a lock file in the
    database directory, whether or not spamprobe was aborted.
    As spamprobe is the first program I have that uses db,
    I'm not sure if this is normal behaviour.

     
  • Logged In: YES
    user_id=508249

    I'd just like to add that this is fairly reproducible. It
    usually needs no more than a couple of interrupted

    spamprobe folder/*

    commands to get the problem to appear on my system.

     
  • Logged In: YES
    user_id=122024

    I'm having the same problem. Here's the curious part -- it
    used to work for me, then I upgraded and it stopped. I'm
    using v0.9h with Berkeley DB db-4.2.52 on a virgin Slackware
    10 install. All my scripts that used to work now break in
    exactly the way the submitter described over a year ago
    -- things work fine, then they hang, the database is corrupt,
    db_recover fixes it, and things soon break again.

    Here's an strace with commentary:
    ...lots of GOOD and BAD spam detection happen.
    write(1, "SPAM 0.9999990 59275b00878f54270"..., 48
    SPAM 0.9999990 59275b00878f5427067fa4e413cb6a78
    ) = 48
    pread(4,
    "\0\0\0\0\1\0\0\0\321/\0\0\0\0\0\0\0\0\0\0a\0d\7\2\3\350"...,
    4096, 50139136) = 4096
    pread(4,
    "\0\0\0\0\1\0\0\0\5\36\0\0\372\0\0\0n\v\0\0L\0\30\7\1\5"...,
    4096, 31477760) = 4096

    ...memory release and files close
    brk(0x8118000) = 0x8118000
    close(6) = 0
    fcntl64(5, F_SETLK, {type=F_UNLCK, whence=SEEK_SET,
    start=0, len=0}) = 0
    close(5) = 0

    ...from here on out the process hangs in a select
    select(0, NULL, NULL, NULL, {1, 0}) = 0 (Timeout)
    select(0, NULL, NULL, NULL, {1, 0}) = 0 (Timeout)
    select(0, NULL, NULL, NULL, {1, 0}) = 0 (Timeout)
    ...process must be killed.

     
  • Logged In: YES
    user_id=122024

    Here's a little bit more weirdness. After trying the command a
    second time, I got a different ending -- could there be a race
    condition?

    close(6) = 0
    fcntl64(5, F_SETLK, {type=F_UNLCK, whence=SEEK_SET,
    start=0, len=0}) = 0
    close(5) = 0
    fsync(4) = 0
    fsync(4) = 0
    fsync(4) = 0
    fsync(4) = 0
    close(4) = 0
    munmap(0x40359000, 368640) = 0
    munmap(0x40317000, 270336) = 0
    munmap(0x40015000, 8192) = 0
    fcntl64(3, F_SETLK, {type=F_UNLCK, whence=SEEK_SET,
    start=0, len=0}) = 0
    close(3) = 0
    munmap(0x40017000, 4096) = 0
    semget(IPC_PRIVATE, 4096, IPC_CREAT|0x4030f140|0400) =
    -1 ENOSYS (Function not implemented)
    _exit(0) = ?

    Note that I have not rebuilt my own kernel yet, but the
    semget call for a private semaphore set fails with ENOSYS
    (Function not implemented). Could spamprobe be calling on a
    feature that is not compiled into the kernel?
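
One way to check the question above independently of spamprobe is to ask the kernel for a private System V semaphore set directly and look at errno. This is a minimal sketch; probe_sysv_semaphores and SysVSemSupport are hypothetical names introduced here, and the probe says nothing about which library issued the semget seen in the strace.

```cpp
#include <cerrno>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <sys/types.h>

enum class SysVSemSupport { Available, NotImplemented, OtherError };

// Try to create one private semaphore set, then remove it again.
inline SysVSemSupport probe_sysv_semaphores() {
    int id = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);
    if (id == -1) {
        // ENOSYS matches the "Function not implemented" result in the strace.
        return errno == ENOSYS ? SysVSemSupport::NotImplemented
                               : SysVSemSupport::OtherError;
    }
    semctl(id, 0, IPC_RMID);  // clean up the temporary set
    return SysVSemSupport::Available;
}
```

A NotImplemented result would be consistent with a kernel built without System V IPC support; Available would suggest the ENOSYS in the trace has some other cause.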

     
  • Brian Burton
    2004-11-01

    • status: open --> closed
     
  • Brian Burton
    2004-11-01

    Logged In: YES
    user_id=294579

    Thanks for the comments about DB corruption and the example
    of how to reproduce it. I'll investigate whether I'm failing
    to close the DB safely when exiting on a signal.
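
A common pattern for shutting down safely on a signal is to have the handler do nothing but set a flag, and let the main loop notice the flag and close the database through the normal code path. The sketch below illustrates that pattern only; it is not SpamProbe's code, and FakeDatabase, request_stop, and process_items are hypothetical stand-ins.

```cpp
#include <csignal>
#include <cstddef>

// Written by the signal handler; volatile sig_atomic_t is the type the
// C and C++ standards allow a handler to write safely.
static volatile std::sig_atomic_t g_stop_requested = 0;

// The handler only records the request. It does not touch the database.
extern "C" void request_stop(int) { g_stop_requested = 1; }

// Hypothetical stand-in for a real database handle.
struct FakeDatabase {
    bool is_open = true;
    void close() { is_open = false; }
};

// Process up to `count` items, stopping early once a stop was requested.
// The database is closed on every return path, outside the handler.
// Returns the number of items processed.
inline std::size_t process_items(FakeDatabase &db, std::size_t count) {
    std::size_t done = 0;
    while (done < count && !g_stop_requested) {
        ++done;  // a real program would score one message here
    }
    db.close();
    return done;
}
```

The handler would be installed with std::signal(SIGINT, request_stop) before the loop starts. Because the close happens in ordinary code rather than inside the handler, an interrupt cannot cut a half-finished database operation short.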