Pgl 2.2.4 and Raspberry Pi

Help
2014-06-04
2014-07-16
  • Hi,

    I've compiled pgl 2.2.4 on my Raspberry Pi (running Raspbian) and it's running mostly fine, except that on list refresh, pgld fails with an error:

    Unbinding from queue '23552', recv returned No buffer space available

    I've read that this is known to happen under load and have seen possible workarounds: pgl is compiled with the LOWMEM option, but I haven't tried nicing it to a higher priority yet.

    Is there anything else that can be done to refresh pgld reliably?

    Pgld was configured like this:

    ./configure --prefix=/usr --mandir=/usr/share/man --datadir=/usr/share --sysconfdir=/etc --localstatedir=/var --with-lsb=/lib/lsb/init-functions --enable-cron --disable-dbus --enable-logrotate --enable-networkmanager --enable-zlib --without-qt4 --enable-lowmem

    Thank you,
    Marko

     
  • jre-phoenix
    2014-06-04

    First some general, perhaps unrelated, questions:

    How much RAM does your Raspberry have?
    Did you change the NFQUEUE number to 23552? The default is 92.
    Do the lines in your /var/lib/pgl/master_blocklist.p2p also contain descriptions or only IP ranges?

    I didn't write that part of the code, but as far as I understand it:
    You get an error message that comes from handling the packets (normal operation of pgld). So the problem is not the reloading of the new blocklist itself, but that less memory is available for normal operation while it loads.

    Setting a lower nice level (at the extreme, NICE_LEVEL="-20") might help; try it out.

    A workaround would be to stop pgl, then update and then start again. Of course this is not the best solution and leaves you some time unprotected.

    But probably the best tip (the original info was found by user dogg):
    Increase the default receive/send window:
    sysctl -w net.core.rmem_default=8388608
    sysctl -w net.core.wmem_default=8388608
    If this helps, make it permanent by putting the above commands either somewhere in the system startup or, e.g., in /etc/pgl/insert.sh.

    Hope this helps, please report back.

     
  • Hi,

    Pi is a Model B, so 512M RAM.

    I didn't change the NFQUEUE number (wouldn't know how :).

    Master blocklist contains only IPs (I believe that's the --enable-lowmem config option)

    I've set the nice level to -10, we'll see how that goes.

    I'll also try setting receive/send windows and report back...

    Thank you,
    Marko

     
  • Oh, one more thing... Here's a sample of pgld's log when it fails:

    Jun 4 06:25:53 INFO: Reopened logfile: /var/log/pgl/pgld.log
    Jun 4 06:26:11 INFO: ASCII: 227555 entries loaded from "/var/lib/pgl/master_blocklist.p2p"
    Jun 4 06:26:12 INFO: Blocking 227555 IP ranges (2896894064 IPs).
    Jun 4 06:26:12 INFO: Blocklist(s) reloaded.
    Jun 4 06:26:12 ERROR: Unbinding from queue '23552', recv returned No buffer space available
    Jun 4 06:30:26 INFO: Started.
    Jun 4 06:30:37 INFO: ASCII: 218625 entries loaded from "/var/lib/pgl/master_blocklist.p2p"
    Jun 4 06:30:37 INFO: Blocking 218625 IP ranges (2963431412 IPs).
    Jun 4 06:30:37 INFO: NFQUEUE: binding to queue 92
    Jun 4 06:30:37 INFO: ACCEPT mark: 20
    Jun 4 06:30:37 INFO: REJECT mark: 10
    ...

    I checked yesterday's log and it's essentially the same: the error message mentions queue 23552, and when pgld gets restarted it binds to queue 92.

    I just ran pglcmd update with pgld's nice level set to -10 and it worked. I'll let it run now and check it tomorrow morning.

     
  • Ok, pgld updated two days in a row without problems with nice level set to -10. Today it failed. I have increased send and receive windows, we'll see how that goes.

    Marko

     
  • Cader
    2014-06-09

    I updated pgld to make the error less confusing (I hope)
    For some reason (probably due to big/little endian issues or the way the nfq bind worked) I was converting the input value of the queue number from host to network byte order. Then whenever I used queue_num I converted it back from network to host order. I'm not really sure why I did that.
    I cleaned up the log messages so the numbers should now be right. So the 23552 is really 92, just in the wrong byte order.

    As far as the no buffer space - I am unsure.
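
The byte-order explanation above can be checked with a few lines of C. This is only an illustrative sketch: `swap16` is a hypothetical helper, not a function from pgld, and it does by hand what htons()/ntohs() do on a little-endian host.

```c
#include <stdint.h>

/* Hypothetical helper: swap the two bytes of a 16-bit value.
 * On a little-endian host, htons() and ntohs() have the same effect. */
uint16_t swap16(uint16_t v)
{
    return (uint16_t)((v << 8) | (v >> 8));
}
```

92 is 0x005C; with the bytes swapped it becomes 0x5C00, which is 23552, the number that showed up in the old log message.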

     
  • jre-phoenix
    2014-06-10

    I think this is what happens:

    While the master blocklist gets reloaded pgld obviously can't handle new traffic. Therefore the traffic sent to pgld/NFQUEUE accumulates in the buffer during the reload.
    Now the raspberry pi is probably slow and needs some time to reload the master blocklist. So the buffer fills up before the new master blocklist is fully loaded.

    Therefore we currently have probably two working solutions/workarounds:
    - Increase the priority of pgld, so that the blocklist gets loaded quicker (and pgld can kick in again before the buffer is full)
    - Increase the buffer size (so that we have more time to load the master blocklist before the buffer is full)

    Further things that should be done:
    - Improve efficiency of the blocklist loading code (probably hard to do, don't know)
    - If the buffer is full simply flush it/reject new incoming traffic instead of having pgld unbind from nfqueue. (Currently the traffic waiting in the buffer for pgld is lost anyway if this happens.)

    @Cader:
    What do you think? Can you implement the latter thing (flush/reject)?

    @Marko Bozikovic:
    What's your experience with the increased windows? (Do you apply this permanently, by putting the commands in a file like I proposed?)

     
  • Cader
    2014-06-10

    That sounds like it could very well be an issue.
    I was poking around the netfilter site again the other day and saw a queue length option.
    I didn't see what the default is, so I have to find that, and then maybe increasing it would help. One thing I still have to verify: I think modifying the kernel queue from user space requires kernel 2.6.25+ or so. So that may be an issue.

    int nfq_set_queue_maxlen(struct nfq_q_handle *qh, u_int32_t queuelen)
    nfq_set_queue_maxlen - Set kernel queue maximum length parameter

    Parameters:
    qh Netfilter queue handle obtained by call to nfq_create_queue().
    queuelen the length of the queue
    Sets the size of the queue in kernel. This fixes the maximum number of packets the kernel will store before internally before dropping upcoming packets.

    Returns:
    -1 on error; >=0 otherwise.
    Definition at line 610 of file libnetfilter_queue.c.

     
  • Cader
    2014-06-10

    I am thinking of adding a -Q option that, if set, will set the queue length.
    That way you can tune it to the needs of your machine rather than hard code something.
    Still looking for default and if there is a /proc or /sys file that shows current size.

    Thoughts?

     
  • Cader
    2014-06-10

    I added a -Q option to pgld to allow for tuning the kernel packet queue in the latest git.
    @JRE do you want to add that as an option to pglcmd?

    I have no clue (yet) what the default is. So if a value is set, pass it with -Q; otherwise don't pass -Q to pgld and it will use whatever the default is.

    Again this requires a 2.6.20+ kernel but guessing that should be an issue.
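
A minimal sketch of how a -Q argument might be validated before it is handed to nfq_set_queue_maxlen(), which takes a u_int32_t. This is not pgld's actual option parsing; `parse_queue_len` is a hypothetical helper and the strict "digits only" rule is an assumption.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: parse a decimal queue length into a uint32_t.
 * Returns 0 on success, -1 if the string is empty, does not start with
 * a digit, has trailing characters, or does not fit in 32 bits. */
int parse_queue_len(const char *arg, uint32_t *out)
{
    char *end;
    unsigned long v;

    if (arg == NULL || *arg < '0' || *arg > '9')
        return -1;
    errno = 0;
    v = strtoul(arg, &end, 10);
    if (errno != 0 || *end != '\0' || v > UINT32_MAX)
        return -1;
    *out = (uint32_t)v;
    return 0;
}
```

With this helper, "4096" and "4294967295" are accepted, while "-1", "12x" and "4294967296" are rejected.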

     
  • jre-phoenix
    2014-06-11

    Sorry, I answered by email but that doesn't seem to work, so I'm reposting 2 older posts and a current one:

    Will that be an issue to require 2.6.20? That is fairly old...

    No problem at all, 2.6.20 is from 2007. We already have a requirement of
    Linux kernel >= 2.6.13 for NFQUEUE support, so we'll just increase that
    (edit the INSTALL file in the requirements section).

    Besides that, currently I don't fully understand you. Remember: "I'm not a
    programmer".
    Just go, we'll test afterwards.

     
  • jre-phoenix
    2014-06-11

    On 06/10/2014 04:38 PM, Cader wrote:

    I am thinking of adding a -Q option that, if set, will set the queue length.
    That way you can tune it to the needs of your machine rather than hard code something.
    Still looking for default and if there is a /proc or /sys file that shows current size.

    Thoughts?

    -Q option sounds good (perhaps add a hint to it in the relevant error
    message).

    A quick grep for some keywords in /proc and /sys didn't show me any
    results (but for other kernel modules I always found something in
    /proc/net).

     
  • jre-phoenix
    2014-06-11

    Hi

    On 06/10/2014 09:10 PM, Cader wrote:

    I added a -Q option to pgld to allow for tuning the kernel packet queue in the latest git.
    @JRE do you want to add that as an option to pglcmd?

    Great, for sure I will do this (today or tomorrow).

    I have no clue (yet) what the default is. So if a value is set, pass it with -Q; otherwise don't pass -Q to pgld and it will use whatever the default is.

    Yes, that's how I'll implement that. (But anyway I'm still searching the
    web to better understand this stuff)

    Again this requires a 2.6.20+ kernel but guessing that should be an issue.

    NOT an issue

    Now the big question is, whether this is really the right solution for
    the original problem. And if yes, what values should be used - I assume
    the valid range (because of uint32) is 0 - 4294967295, right?

    In a previous message I recommended these commands:

    sysctl -w net.core.rmem_default=8388608
    sysctl -w net.core.wmem_default=8388608
    

    Do they do the same thing? Can we recommend "pgld -Q 8388608"?

    @Marko Bozikovic: Did you try the sysctl commands?

    Further thoughts to improve this:

    I found this:

    "Too slow verdict [from the user space application] will result in a
    full queue. Kernel will then drop incoming packets instead of en-queuing
    them."
    (http://home.regit.org/netfilter-en/using-nfqueue-and-libnetfilter_queue/)

    So if the queue buffer is full, new packets will simply be dropped, right!?

    So in pgld.c in the nfqueue_loop function in that part (line 550):

        int err = errno;
        do_log(LOG_ERR, "ERROR: Unbinding from queue '%hu', recv returned %s", queue_num, strerror(err));
        if (err == ENOBUFS) {
            /* close and return, nfq_destroy_queue() won't work as we've no buffers */
            nfq_close(nfqueue_h);
            exit(1);
        } else {
            nfqueue_unbind();
            exit(0);
        }
    

    ... I don't understand why to close the queue and exit. Instead pgld
    might simply continue working and the kernel would just drop packets if
    the buffer is full because pgld is too slow.

    So I propose to remove the close & exit part.

    Independently I propose to add a log message here like "The queue
    buffer is full, consider increasing its size with pgld -Q ..."

     
  • Cader
    2014-06-11

    sysctl -w net.core.rmem_default=8388608
    sysctl -w net.core.wmem_default=8388608

    These are for the interface buffers in bytes. If the issue is the netfilter buffer these sysctl commands won't help

    I made the change to not exit on ENOBUFS and log it with the "use -Q option" so we can see if we are on the right track here.
    I also added a log message for anything other than ENOBUFS

    while ((rv = recv(fd, buf, sizeof(buf), 0)) >= 0) {
        nfq_handle_packet(nfqueue_h, buf, rv);
    }
    int err=errno;
    if ( err == ENOBUFS ) {
        do_log(LOG_ERR, "ERROR: ENOBUFS error on queue '%hu'. Use -Q to increase buffers, recv returned %s", queue_num, strerror(err));
    } else {
        do_log(LOG_ERR, "ERROR: Error on queue '%hu', recv returned %s", queue_num, strerror(err));
        nfqueue_unbind();
        exit(0);
    }
    
     
    Last edit: Cader 2014-06-11
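
The decision discussed in this post (keep going on ENOBUFS, give up on anything else) can be isolated in a small function. This is a sketch under stated assumptions, not pgld's code: `recv_action` and `classify_recv_error` are hypothetical names, and it assumes the netlink socket remains usable after ENOBUFS. To actually keep processing packets, the caller would presumably also have to re-enter the recv() loop after a RECV_CONTINUE result.

```c
#include <errno.h>

/* Hypothetical result type for the decision below. */
enum recv_action {
    RECV_CONTINUE,        /* log the problem and call recv() again */
    RECV_UNBIND_AND_EXIT  /* unbind from the queue and terminate */
};

/* Decide what to do after recv() returned -1 with the given errno.
 * ENOBUFS is treated as "the kernel dropped packets because we were
 * too slow" (assumption: the socket itself is still usable). */
enum recv_action classify_recv_error(int err)
{
    if (err == ENOBUFS)
        return RECV_CONTINUE;
    return RECV_UNBIND_AND_EXIT;
}
```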
  • jre-phoenix
    2014-06-12

    1.)

    On 06/11/2014 04:14 PM, Cader wrote:

    sysctl -w net.core.rmem_default=8388608
    sysctl -w net.core.wmem_default=8388608

    These are for the interface buffers in bytes. If the issue is the netfilter buffer these sysctl commands won't help

    I /try/ to understand that stuff. But for me the following two sound quite similar (of course this doesn't mean they are for the same thing):

    sysctl man page:
    sysctl is used to modify kernel parameters at runtime. The parameters available are those listed under /proc/sys/.

    http://www.netfilter.org/projects/libnetfilter_queue/doxygen/group__Queue.html:
    nfq_set_queue_maxlen - Set kernel queue maximum length parameter
    Sets the size of the queue in kernel. This fixes the maximum number of packets the kernel will store before internally before dropping upcoming packets.

    2.)

    I implemented the stuff in pglcmd. Set it with NFQUEUE_MAXLEN="value" in pglcmd.conf.
    For now I chose as valid values 0 - 2147483647 (2^31 - 1). For higher values I got stuff like this in the pgld.log:

    Jun 12 02:29:17 INFO: Kernel queue maximum length: 4294967295
    Jun 12 02:29:17 INFO: ACCEPT mark: 20
    Jun 12 02:29:17 INFO: REJECT mark: 10
    Jun 12 02:29:17 INFO: Set netfilter queue length to -1 packets
    

    Is this correct or doesn't pgld show the correct number with %d?

    3.)

    I started pgld with -Q 0. But this seems to be a noop, at least nothing in pgld.log. But "pglcmd status" shows the option was passed correctly:

    PID: 12343    CMD: /usr/sbin/pgld -l /var/log/pgl/pgld.log -d -p /var/run/pgld.pid -q 92 -Q 0 -r 10 -a 20 /var/lib/pgl/master_blocklist.p2p
    

    4.)

    I started pgld with -Q 1. But even if I reload a bunch of websites at the same time I just get hundreds of blocks in a few seconds shown in pgld.log. But no errors.

    Shouldn't such a small value trigger your new error messages?

     
    • Cader
      2014-06-12

      1) yes sysctl controls many kernel params but the buffer for rmem and wmem are for the interface. This kernel queue should be the netfilter queue. so a different queue.

      2) the queue_length is an unsigned int so you can go to 4,294,967,295 packets.
      %d is for signed int. Use %u for unsigned int with printf

      3) if (queue_length) is what is doing it - if queue_length is 0 that means false therefore wouldn't print or even go into the nfq_set_queue_maxlen block since I have that as "if ( queue_length > 0) {".

      4) Yeah I tried to set to 1 as well and was hoping to get the error too but didn't. I don't have a Pi or anything really slow to test with so I am just stabbing at the error. I wonder if there is some sanity in the function to not go below a certain packet threshold as a buffer of 1 wouldn't be very sane.

      Just did a quick search and see the error in other stuff and the explanation is what we think is the issue - no kernel buffer left. So I think we are on the right track at least. Reproducing it is the hard part now.
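
The %d versus %u point from item 2 can be sketched as follows. `format_queue_len` is a hypothetical helper, not pgld's do_log(); the message text is copied from the log line quoted earlier in the thread.

```c
#include <stdio.h>

/* Hypothetical helper: format the queue length with %u, the conversion
 * that matches an unsigned int. With %d, a value such as 4294967295
 * would typically be shown as -1, as seen in the pgld.log excerpt. */
int format_queue_len(char *buf, size_t len, unsigned int queue_length)
{
    return snprintf(buf, len, "Set netfilter queue length to %u packets",
                    queue_length);
}
```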

       
  • Cader
    2014-06-12

    ahh cripes that log line was my mistake - not sure why I did %d and not %u.
    fixed it so it should log right
    I removed the debug set lines too since it should log correctly now.

     
  • Hi guys,

    Sorry for a late answer... I've been running pgld on RPi with increased buffers, and had no errors in the past 5 days, even with high network traffic.

    @jre-phoenix I haven't applied the change permanently yet. I'll try to test this over the weekend. Also, I can try playing with the git version, but I can't promise exactly when.

    FWIW, I like the idea of including a command line option, or even a config option. Just like the NICE option, it will probably only be used on low-end hardware, like RPi.

    Cheers,
    Marko

     
  • jre-phoenix
    2014-06-12

    I adapted pglcmd to the correct range values. So from my point of view we just need some testing now in order to give correct advice for people with

    @Marko: When you find time to test the current version, then please reset the nice setting and go with the default sysctl buffers. Just use the new option for now. I suggest to start with the values that already seem to work with the sysctl commands.
    So just set in pglcmd.conf NFQUEUE_MAXLEN="8388608". Then restart your machine so that everything else goes back to its defaults.
    If you experience errors then you may increase the value up to 4,294,967,295 (although I think that this is too high on your machine with 512 MB RAM).

    Generally I assume all these values are in bytes and are limited first by the unsigned int (4,294,967,295) and second by the available RAM (in this case something below 512,000,000).

    I'd still like to know the current default for maxlen and where to find it in /proc

     
  • Cader
    2014-06-12

    I would say that 8388608 is way too high.
    This number is in packets not bytes.
    "Sets the size of the queue in kernel. This fixes the maximum number of packets the kernel will store before internally before dropping upcoming packets"
    Since each packet could be up to 1500 bytes you would be looking at 12GB of mem.

    I would start with 2048 or 4096.
    The default might be 1000

    I am looking more to find a /proc or /sys file with the value
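
The memory estimate above is simple arithmetic; a small sketch makes it easy to try other values. `queue_bytes` is a hypothetical helper, and it only counts packet bytes, ignoring any per-packet kernel overhead (an assumption).

```c
#include <stdint.h>

/* Hypothetical helper: worst-case number of packet bytes held by a
 * completely full queue. The multiplication is done in 64 bits so that
 * large queue lengths do not overflow. */
uint64_t queue_bytes(uint32_t max_packets, uint32_t bytes_per_packet)
{
    return (uint64_t)max_packets * bytes_per_packet;
}
```

For 8388608 packets of 1500 bytes this gives 12,582,912,000 bytes (roughly the 12 GB mentioned above), while 4096 packets of 1500 bytes come to 6,144,000 bytes, about 6 MB.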

     
  • @jre-phoenix are the changes on master, or another branch?

     
  • Cader
    2014-06-13

    The changes should all be in master.

     
  • Ok, I ran autogen.sh and configure successfully, I get the following error on make:
    /home/pi/smece/peerguardian-code/pgl/pgld/src/pgld.c:418: undefined reference to `nfq_set_verdict2'
    /home/pi/smece/peerguardian-code/pgl/pgld/src/pgld.c:396: undefined reference to `nfq_set_verdict2'
    /home/pi/smece/peerguardian-code/pgl/pgld/src/pgld.c:454: undefined reference to `nfq_set_verdict2'
    /home/pi/smece/peerguardian-code/pgl/pgld/src/pgld.c:443: undefined reference to `nfq_set_verdict2'
    /home/pi/smece/peerguardian-code/pgl/pgld/src/pgld.c:465: undefined reference to `nfq_set_verdict2'

    I have libnetfilter-queue-dev package installed, I think nfq_set_verdict2 function should be declared there...

     
  • jre-phoenix
    2014-06-14

    1.) @Marko Bozikovic
    I just reverted the change that introduced nfq_set_verdict2 and pushed it to a new branch "pgl_nfq_set_verdict". I hope I did it right since I had changed something else in the same code area in the meantime. At least "pglcmd test" worked here, though.

    2.) @Marko Bozikovic
    nfq_set_verdict_mark() was deprecated in favour of nfq_set_verdict2() on 2010-05-09. So I guess at least since libnetfilter-queue 1.0.0 this function was available. Marko, which version of this library do you have installed?

    I never had a problem compiling with the new version here.

    3.)
    I found a page about the ENOBUFS:
    http://www.netfilter.org/projects/libnetfilter_queue/doxygen/index.html

    There are some tips to avoid them (e.g. set NICE to -20), but also:

    • increase the default socket buffer size by means of nfnl_rcvbufsiz()
      --> I guess that's something similar to what we try with maxlen, but it's not the same

    • set NETLINK_NO_ENOBUFS socket option to avoid receiving ENOBUFS errors (requires Linux kernel >= 2.6.30).
      --> Sounds interesting, but I got no real clue what that means exactly and how to implement
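
For the NETLINK_NO_ENOBUFS tip, a minimal sketch of what setting the option could look like. This is not code from pgld: `disable_enobufs` is a hypothetical helper, the fallback #defines are only there for older headers, and in pgld the descriptor would presumably be the one returned by nfq_fd() for the queue handle.

```c
#include <sys/socket.h>
#ifdef __linux__
#include <linux/netlink.h>
#endif

/* Fallbacks for headers that do not define these (values as in Linux). */
#ifndef SOL_NETLINK
#define SOL_NETLINK 270
#endif
#ifndef NETLINK_NO_ENOBUFS
#define NETLINK_NO_ENOBUFS 5
#endif

/* Hypothetical helper: ask the kernel not to report ENOBUFS on the
 * given netlink socket. Returns 0 on success, -1 on failure with
 * errno set by setsockopt(). */
int disable_enobufs(int netlink_fd)
{
    int on = 1;
    return setsockopt(netlink_fd, SOL_NETLINK, NETLINK_NO_ENOBUFS,
                      &on, sizeof(on));
}
```

With the option set, recv() should no longer fail with ENOBUFS; packets that do not fit are still dropped by the kernel, just without the error being reported.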

     
  • jre-phoenix
    2014-06-14

    Raspbian has 2 versions of libnetfilter-queue in its archives (see
    http://archive.raspbian.org/raspbian/pool/main/libn/libnetfilter-queue/):
    - 0.0.17-1 which lacks nfq_set_verdict2
    - 1.0.2-2 which has this function

    So Marko either has to update his system or use the branch I created.

    We definitely should keep the verdict2 function in the master branch, since the other is officially deprecated. I don't know whether it is possible to code the old verdict function as backup code in pgl. Otherwise I propose to keep master as it is and I'll add the nfq_set_verdict branch to a backport branch, which I always rebase on current master for releases.

     
  • Ah, raspbian is based on wheezy, which has only 0.0.17-1 available. I'll have to add unstable to my list of sources...

    I'll let you know guys when I have something...

     
  • jre-phoenix
    2014-06-15

    Yeah, I just noticed I would have to make a revert of this commit for the backports for the Debian Wheezy and Ubuntu Lucid packages anyway.

    So I added this to the git branch pgl_backport and published it (replacing the pgl_debian-backport and the pgl_nfq_set_verdict branches).

    Since the branch pgl_backport contains also other changes for Wheezy I recommend you use that one, instead of updating single packages on your system.

     
  • Ok, I built pgl_backport branch and it's installed and running. NFQUEUE_MAXLEN is set to 4096 packets, NICE_LEVEL not set.

    I tried a manual pglcmd update under network load and it went fine. There was a line in syslog:

    raspberry kernel: [ 1048.907106] TCP: TCP: Possible SYN flooding on port 51000. Sending cookies. Check SNMP counters.

    I'll let it run for a few days and let you know how it goes. I can also upgrade to jessie and test master.

     
    Last edit: Marko Bozikovic 2014-06-15
  • jre-phoenix
    2014-06-16

    Good.

    Was that warning logged directly during the update? Can you reproduce it?

     
  • Cader
    2014-06-16

    That is a "normal" message when there are more incoming connections than your kernel is set up to handle; the SYN cookies protect you from DDoS:
    http://blog.dubbelboer.com/2012/04/09/syn-cookies.html

    It can be fixed by increasing net.ipv4.tcp_max_syn_backlog and net.core.somaxconn.
    If you were testing pgl by creating a ton of new connections to your machine, this is probably unnecessary since the test caused an abnormal number of connections.

     
  • Ok, today pgld logged an error message. Here's the relevant part of the log:

    Jun 19 06:25:42 INFO: Reopened logfile: /var/log/pgl/pgld.log
    Jun 19 06:26:00 INFO: ASCII: 192002 entries loaded from "/var/lib/pgl/master_blocklist.p2p"
    Jun 19 06:26:01 INFO: Blocking 192002 IP ranges (2963402476 IPs).
    Jun 19 06:26:01 INFO: Blocklist(s) reloaded.
    Jun 19 06:26:01 ERROR: ENOBUFS error on queue '92'. Use pgld -Q option or set in pglcmd.conf NFQUEUE_MAXLEN to increase
    buffers, recv returned No buffer space available
    Jun 19 06:30:18 INFO: Started.
    Jun 19 06:30:27 INFO: ASCII: 192002 entries loaded from "/var/lib/pgl/master_blocklist.p2p"
    Jun 19 06:30:27 INFO: Blocking 192002 IP ranges (2963402476 IPs).
    Jun 19 06:30:27 INFO: Binding to queue 92
    Jun 19 06:30:27 INFO: ACCEPT mark: 20
    Jun 19 06:30:27 INFO: REJECT mark: 10
    Jun 19 06:30:27 INFO: Set netfilter queue length to 4096 packets
    

    It seems that pgld crashed/stopped after logging the error, since pglcmd log says this:

    Blocklists updated.
    Problematic daemon status: 1
    pgld is not running ... failed!
    2014-06-19 06:29:15 BST Begin: pglcmd restart_not_wd
    Deleting iptables ...
    ..Executing iptables remove script /var/lib/pgl/.pglcmd.iptables.remove.sh.
    ..Removing iptables remove script /var/lib/pgl/.pglcmd.iptables.remove.sh.
    Iptables deleted.
    Stopping pgld/sbin/start-stop-daemon: warning: failed to kill 28677: No such process
    .
    Building blocklist ...
    WARN: No valid ASCII blocklist format line:
    INFO: ASCII: 683536 entries loaded from "STDIN"
    INFO: Merged 491534 of 683536 entries.
    INFO: Blocking 192002 IP ranges (2963402476 IPs).
    Blocklist built.
    

    I've set NFQUEUE_MAXLEN to 8192 now, still without setting NICE_LEVEL.

     
    Last edit: Marko Bozikovic 2014-06-19
  • jre-phoenix
    2014-06-19

    Thanks.

    The results so far are (please correct me if I'm wrong):

    • no change --> crash every day
    • NICE=-10 --> 2 days, then crash
    • Increased send/receive window=8388608 (see note 1 below) --> 5 days without crash, then stopped the test
    • MAXLEN=4096 --> 3 days, then crash

    So now we can either continue to test with increased NFQUEUE_MAXLEN=8192 or we try a mix of MAXLEN, NICE and an increased receive/send window:

    sysctl -w net.core.rmem_default=8388608
    sysctl -w net.core.wmem_default=8388608

    For now I'd say one more test with only MAXLEN, then go for a mix.

    I just reenabled the queue close and return after ENOBUFS in pgld. We had disabled that because I thought pgld could just ignore the ENOBUFS. But now we know that this definitely requires a restart of pgld (at least the pglcmd watchdog perfectly takes care of this). You may recompile from pgl_backport branch.

    [Note 1] Note Cader's theoretical analysis in https://sourceforge.net/p/peerguardian/discussion/446997/thread/0df72ba6/#1130: "These are for the interface buffers in bytes. If the issue is the netfilter buffer these sysctl commands won't help". So this was either coincidence or we don't understand the inner workings completely (but who actually does, hehe??).

     
    Last edit: jre-phoenix 2014-06-19
  • jre-phoenix
    2014-06-19

    OK, revisiting the netfilter page on ENOBUFS I'd say forget about the sysctl commands (although they seem to help, they are not in the upstream recommendation list. But heck, MAXLEN isn't there either):

    ENOBUFS errors in recv()

    recv() may return -1 and errno is set to ENOBUFS in case that your application is not fast enough to retrieve the packets from the kernel. In that case, you can increase the socket buffer size by means of nfnl_rcvbufsiz(). Although this delays ENOBUFS errors, you may hit it again sooner or later. The next section provides some hints on how to obtain the best performance for your application.

    Performance

    To improve your libnetfilter_queue application in terms of performance, you may consider the following tweaks:

    1. increase the default socket buffer size by means of nfnl_rcvbufsiz().
    2. set nice value of your process to -20 (maximum priority).
    3. set the CPU affinity of your process to a spare core that is not used to handle NIC interruptions.
    4. set NETLINK_NO_ENOBUFS socket option to avoid receiving ENOBUFS errors (requires Linux kernel >= 2.6.30).
    5. see --queue-balance option in NFQUEUE target for multi-threaded apps (it requires Linux kernel >= 2.6.31).

    For us this means:

    1. could be solved similar to MAXLEN in pgld, with configuration option in pglcmd. IIUC nfq_set_queue_maxlen and nfnl_rcvbufsiz do quite similar things.
    2. can already be done in pglcmd. Has a negative impact on the rest of the system.
    3. don't know whether this requires coding in pgl.
    4. last resort or best solution? I think in Marko's case simply ignoring ENOBUFS and having the kernel drop some packets would be ok. Again needs a change in pgld and an option in pglcmd.
    5. already on the TODO for a long time. Up to now I thought --queue-balance requires multiple pgld processes running. Now I see the solution with a multi-threaded pgld, but don't know how difficult this is. Requires changes in pgld and some adaptations in pglcmd.
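
For item 1, the general mechanism is a larger socket receive buffer. The sketch below uses plain SO_RCVBUF on an arbitrary socket as a stand-in; `set_rcvbuf` is a hypothetical helper, and the assumption that nfnl_rcvbufsiz() does something comparable for the netlink socket is not confirmed in this thread. Note that this size is in bytes, unlike the packet count taken by nfq_set_queue_maxlen().

```c
#include <sys/socket.h>

/* Hypothetical helper: request a receive buffer of the given size in
 * bytes on a socket. Returns 0 on success, -1 on failure. The kernel
 * may clamp the request (on Linux, to net.core.rmem_max). */
int set_rcvbuf(int fd, int bytes)
{
    return setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bytes, sizeof(bytes));
}
```

If net.core.rmem_default really is the default receive buffer size for newly created sockets, that would offer one possible explanation for why the sysctl commands appeared to help in the tests above; this is a guess, not something established here.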
     
  • Cader
    2014-06-19

    I was able to hit ENOBUFS by starting a ton of torrents at once with the default queue length. I also noticed that pgld either died or unbound from the queue and didn't process any more packets.
    I will look at that NETLINK_NO_ENOBUFS option and I think I saw something about rebinding to the queue.
    I'll look at a couple options.

    For #3 - does a Pi have multiple CPUs/cores to even do that?

     
  • Hi all,

    @jre: your list of my test results is correct.

    Pi has a single core.

    I hit ENOBUFS today again with MAXLEN 8192. I'll try to build pgl_backport changes today. I'll try setting nice level to -10 and keeping MAXLEN at 8192.

    One, possibly silly suggestion: would it be possible for pgld to reload the block list in a separate thread (or using async I/O) and then swap it with the currently active one? Even on single core machines, a thread that loads stuff from storage mostly waits for I/O to finish, so the main thread can have most of the CPU time...
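
The "load in the background, then swap" idea can be sketched with a single atomic pointer. Everything here is hypothetical (`struct blocklist`, `publish_blocklist`, `current_blocklist`); pgld's real data structures are not shown in this thread, and the loader thread itself is left out. The sketch only shows the swap step, using C11 atomics.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for a loaded blocklist. */
struct blocklist {
    size_t count; /* number of IP ranges; the ranges would live here */
};

/* The list currently used for packet verdicts (NULL until published). */
static _Atomic(struct blocklist *) active_list;

/* Called by the loader once a new list is completely built. Returns
 * the previous list so the caller can free it when no reader still
 * uses it (that reclamation is not handled in this sketch). */
struct blocklist *publish_blocklist(struct blocklist *fresh)
{
    return atomic_exchange(&active_list, fresh);
}

/* Called by the packet-handling code for every lookup. */
struct blocklist *current_blocklist(void)
{
    return atomic_load(&active_list);
}
```

The packet handler never waits for a reload in this scheme: it keeps using the old list until the new one is published in a single pointer exchange.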

     
  • jre-phoenix
    2014-06-22

    @Cader: Setting NICE="20" might help to trigger an ENOBUFS.

    @Marko: That might be a very good idea for your problem.

    Generally to prevent ENOBUFS we should make pgld multithreaded (which I think it is not currently, again I'm no expert here) in order to be able to use multiple QUEUES with --queue-balance, but also to reload in the background.

    But I also think we definitely need a better solution for the case that ENOBUFS happens, see Cader's last mail.

     
  • Hi all,

    Finally found some time to build the latest commit to pgl_backport... I'm running it now with NFQUEUE_MAXLEN set to 8192 and no NICE level set. Again, I'll let it run for a few days (or until it crashes).

     
  • jre-phoenix
    2014-06-25

    @Marko: The last relevant change for your problem was on 2014-06-12. (Since then we only added error messages, did code cleanups and other stuff.) So you can continue your testing with MAXLEN=8192 and NICE=-10.

    Current results:
    no change: crash every day
    NICE=-10: crash after 2 days
    Increased send/receive window=8388608: no crash for 5 days
    MAXLEN=4096: crash after 3 days
    MAXLEN=8192: crash after 3 days
    MAXLEN=8192, NICE_LEVEL=-10: no crash for 3 days since 2014-06-26

    Am I right with the last entry? I'm not sure what NICE_LEVEL you used the last three days.

     
    Last edit: jre-phoenix 2014-06-27
  • Hi,

    I've just set the NICE option to -10 since I forgot to uncomment it the other day :)

     
  • Hi all,

    Just to let you know that everything's been working fine since 26/06...

     
  • Hi all,

    Pgld failed with ENOBUFS this morning.