#144 Badness in sk_del_node_init at include/net/sock.h:343

v1.9.2
closed-fixed
nobody
5
2007-08-12
2007-07-31
Anonymous
No

Hi,

On Debian (OpenSSI 1.9.3), when I start my applications, which make heavy use of shared memory and the network, I constantly get this on the non-init nodes until they freeze:

Badness in sk_del_node_init at include/net/sock.h:343
[<c01068be>] dump_stack+0x1e/0x30
[<c0412de9>] __unix_remove_socket+0x69/0x70
[<c0413204>] unix_release_sock+0x24/0x300
[<c041384a>] unix_release+0x3a/0x90
[<c039bdc9>] sock_release+0x79/0xc0
[<c039ca44>] sock_close+0x34/0x50
[<c016c791>] __fput+0x121/0x140
[<c016c669>] fput+0x19/0x20
[<c016acf7>] filp_close+0x57/0x90
[<c016ad9e>] sys_close+0x6e/0x90
[<c010596b>] syscall_call+0x7/0xb

Someone here http://www.ussg.iu.edu/hypermail/linux/kernel/0411.2/0735.html
experienced the same thing on SELinux.

-niklas

Discussion

  • Nobody/Anonymous

    Logged In: NO

    Hi,

    Did some more research. This happens when a process tries to migrate, and it sometimes appears together with
    reop_export_path: Can't export unlinked file /SYSV6d2012 (deleted)

    This is on Debian sarge, OpenSSI 1.9.3 debs, kernel 2.6.10.

     
  • John Hughes

    John Hughes - 2007-07-31

    Logged In: YES
    user_id=166336
    Originator: NO

    I assume you're talking about the repository at www.atlantech.com/~john/openssi-debian-1.9.3?

    Could you try with the 2.6.11 kernel?

    You say "heavy shared memory usage and network usage". Is there any way of quantifying that?

    Is there a simple recipe for duplicating the problem?

     
  • Nobody/Anonymous

    Logged In: NO

    Hi,

    I guess it is an error I get when a process using a socket tries to migrate, or at least it is related to it somehow.

    It is easily reproduced on my system, but I don't know of any easy, general way to reproduce it.

    The applications I use consist of Perl scripts controlling daemons that capture pictures, and other daemons that analyze them and write pictures to disk. One capturing daemon can easily use a 100M block of shared memory, and the analyzing daemons analyze the data in the same space.

    I will see if there is something I can do at the application level, e.g. starting the daemons on different nodes to even out the load and then not allowing migration. I will also try the 2.6.11 kernel. I just haven't figured out how to do that yet, but I will soon.

    I also get lots of these errors:
    kernel: reop_export_path: Can't export unlinked file /SYSV7a6d2001 (deleted)

     
  • John Hughes

    John Hughes - 2007-08-02

    Logged In: YES
    user_id=166336
    Originator: NO

    1. If you're using my repository, just run "apt-get install kernel-image-2.6.11-ssi-686-smp" to get 2.6.11 (ssi 1.9.3 prerelease).
    2. Are you using Unix domain sockets?
    3. SYSV shared memory?

     
  • John Hughes

    John Hughes - 2007-08-02

    Logged In: YES
    user_id=166336
    Originator: NO

    I can reproduce this as follows: on one node, run a script that waits for a connection and then exits; on another node, run a script that connects to the server, waits a bit, and then exits. If the client exits before the server, we get the oops.

    Server.pl:

    use Socket;
    # Create a Unix domain stream socket.
    socket SERVER, PF_UNIX, SOCK_STREAM, 0
        or die "Can't create socket: $!\n";
    my $name = 'socket';
    my $addr = pack_sockaddr_un $name;
    # Remove any stale socket file left by a previous run.
    unlink $name;
    bind SERVER, $addr
        or die "Can't bind socket: $!\n";
    listen SERVER, 1
        or die "Can't set socket for listening: $!\n";
    # Wait for a single connection, then exit straight away.
    accept INCOMING, SERVER
        or die "Can't accept: $!\n";
    exit 0;
    ===============================================

    Client.pl:

    use Socket;
    # Create a Unix domain stream socket.
    socket CLIENT, PF_UNIX, SOCK_STREAM, 0
        or die "Can't create socket: $!\n";
    my $name = 'socket';
    my $addr = pack_sockaddr_un $name;
    connect CLIENT, $addr
        or die "Can't connect to $name: $!\n";
    # Hold the connection open for a while before exiting.
    sleep 10;
    ================================================

     
  • John Hughes

    John Hughes - 2007-08-02

    Logged In: YES
    user_id=166336
    Originator: NO

    Sorry, I meant if server exits before client then we get the oops.

     
  • John Hughes

    John Hughes - 2007-08-02

    Logged In: YES
    user_id=166336
    Originator: NO

    Can duplicate the "kernel: reop_export_path: Can't export unlinked file /SYSV7a6d2001 (deleted)" message by:

    On node 1 create shared memory segment (shmget)

    On node 2 attach to seg, then migrate to node 1.
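The two shared-memory steps above can be sketched in C as follows. This is only an illustration under assumptions: the helper names (create_segment, attach_segment, remove_segment) are made up for this sketch, the 0600 permissions are arbitrary, and the OpenSSI migration step itself is not shown. On Linux the backing file of a SYSV segment appears to be named "SYSV" followed by the key in hex, which would make /SYSV7a6d2001 the segment with key 0x7a6d2001.

```c
#include <stddef.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/shm.h>

/* Node 1 step: create (or open) a segment for the given key.
 * Returns the segment id, or -1 on error (errno is set by shmget). */
int create_segment(key_t key, size_t size)
{
    return shmget(key, size, IPC_CREAT | 0600);
}

/* Node 2 step: attach to an existing segment by id.
 * Returns the mapped address, or NULL on error. */
void *attach_segment(int shmid)
{
    void *p = shmat(shmid, NULL, 0);
    return p == (void *)-1 ? NULL : p;
}

/* Cleanup: mark the segment for removal once every user has detached. */
int remove_segment(int shmid)
{
    return shmctl(shmid, IPC_RMID, NULL);
}
```

To follow the recipe, the process on node 1 would call create_segment, and the process on node 2 would look the segment up with the same key, call attach_segment, and then be migrated to node 1.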

     
  • John Hughes

    John Hughes - 2007-08-02

    Logged In: YES
    user_id=166336
    Originator: NO

    Reported the "reop_export_path: Can't export unlinked file /SYSV7a6d2001 (deleted)" problem as a separate bug, #1766275.

     
  • Nobody/Anonymous

    Logged In: NO

    Hi,

    Wow! You were quick to patch the reop_export_path thing.

    Tried to compile the sources from cvs to try it out but when I make fullkern I get:
    debian:~/openssicvs/openssi# make fullkern
    Cleaning sandboxes
    Copying ../linux to ../linux-ssi
    Applying UML patch to ../linux-ssi
    Copying CI code into ../linux-ssi
    Copying OpenSSI code into ../linux-ssi
    Applying i386 patches to ../linux-ssi
    >>> Applying kdb-i386-ssi.patch
    Applying common patches to ../linux-ssi
    >>> Applying bugfixes.patch
    >>> Applying ipvs-bugfixes.patch
    >>> Applying kdb-common-ssi.patch
    kdb-common-ssi.patch, patch failed
    make: *** [fullkern] Error 1
    debian:~/openssicvs/openssi#

    And then a reject file:
    debian:~/openssicvs/openssi# cat /root/openssicvs/linux-2.6.10/cluster/clms/clms_client.c.rej
    ***************
    *** 37,44 ****
    #include <cluster/icsgen.h>
    #include <cluster/node_monitor.h>
    #include <linux/timer.h>
    #include <linux/sched.h>

    #include <cluster/gen/ics_clms_macros_gen.h>
    #include <cluster/gen/ics_clms_protos_gen.h>

    --- 37,47 ----
    #include <cluster/icsgen.h>
    #include <cluster/node_monitor.h>
    #include <linux/timer.h>
    #include <linux/sched.h>
    + #ifdef CONFIG_KDB
    + #include <linux/kdb.h>
    + #endif /* CONFIG_KDB */

    #include <cluster/gen/ics_clms_macros_gen.h>
    #include <cluster/gen/ics_clms_protos_gen.h>

    debian:~/openssicvs/openssi#

    I think I could modify kdb-common-ssi.patch somehow to make it work, but if someone else is seeing the same thing then maybe it could be fixed in CVS. This is a fresh CVS checkout, and I tried it on both the 2.6.10 and 2.6.11 source trees.

    -niklas

     
  • John Hughes

    John Hughes - 2007-08-03

    Logged In: YES
    user_id=166336
    Originator: NO

    Niklas, please don't keep adding stuff that has nothing to do with this bug - use the ssic-linux-devel mailing list.

     
  • Nobody/Anonymous

    Logged In: NO

    Sorry. Will do.

     
  • John Hughes

    John Hughes - 2007-08-03

    Logged In: YES
    user_id=166336
    Originator: NO

    What is happening is that there is one too many sock_put's somewhere. sk_del_node_init warns if sk_refcnt is one just before calling sock_put 'cos it doesn't want to free the socket (a good thing too 'cos unix_release_sock still has quite a lot of work to do after calling unix_remove_socket (which calls sk_del_node_init)).
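The reference-count check described above can be modelled in user space. This is a toy sketch, not the kernel code: struct toy_sock, toy_sock_put and toy_del_node_init are hypothetical names invented for illustration, and the model assumes that the list holds exactly one reference and that removal only decrements the count without freeing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Toy stand-in for struct sock: only the fields this sketch needs. */
struct toy_sock {
    int  refcnt;  /* models sk_refcnt */
    bool hashed;  /* true while the socket is on the lookup list */
    bool freed;   /* set when the last reference is dropped */
};

/* Drop one reference; "free" the socket when the count reaches zero. */
static void toy_sock_put(struct toy_sock *sk)
{
    assert(sk->refcnt > 0);
    if (--sk->refcnt == 0)
        sk->freed = true;
}

/* Take the socket off the list and drop the list's reference.
 * Warns if that reference is the last one, because the caller still
 * expects to hold its own. Returns true if the warning fired. */
static bool toy_del_node_init(struct toy_sock *sk)
{
    bool warned = false;

    if (sk->hashed) {
        sk->hashed = false;
        if (sk->refcnt == 1) {
            fprintf(stderr, "Badness: list reference is the last one\n");
            warned = true;
        }
        sk->refcnt--;  /* decrement only; the toy never frees here */
    }
    return warned;
}
```

With one reference for the list and one for the caller (refcnt == 2), toy_del_node_init stays quiet. If a stray toy_sock_put runs first, the count is already 1 when the socket is unhashed, and the warning fires.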

    So, where is the naughty sock_put? I'll try instrumenting things.

    More news later (assuming Roger doesn't fix it first :-))

     
  • John Hughes

    John Hughes - 2007-08-05

    Logged In: YES
    user_id=166336
    Originator: NO

    Same bug as seen by Vladimir Razgulin in 2006, reported in message <bf0eedee0608311539v5b6cca5cn7dc8492d9d64067c@mail.gmail.com>, fix checked in by Roger but then backed out due to a confusing comment.

    (Fix to get rid of an extra sock_put in rmtunixsvr_release_rmtpair).

     
  • Roger Tsang

    Roger Tsang - 2007-08-12
    • status: open --> open-fixed
     
  • Roger Tsang

    Roger Tsang - 2007-08-12
    • milestone: --> v1.9.2
     
  • Roger Tsang

    Roger Tsang - 2007-08-12
    • status: open-fixed --> closed-fixed
     
  • Nobody/Anonymous

    Logged In: NO

    Hi,

    I still see this on a system checked out from cvs today.
    Aug 13 06:43:17 localhost kernel: Assertion failed! rerror != -66, cluster/ssi/ipc/rmtunix.c, rmtunix_release_rmtpair, line=850
    Aug 13 06:47:01 localhost kernel: Assertion failed! rerror != -66, cluster/ssi/ipc/rmtunix.c, rmtunix_shutdown_rmtpair, line=930

    I see it in the syslog when I try to connect to a MySQL server running on the init node via mysqld.sock, which is symlinked (ln -s) to another location in the filesystem.

    -niklas

     
