Pgl 2.2.4 and Raspberry Pi

Help
2014-06-04
2014-07-16
  • jre-phoenix
    2014-06-14

    Raspbian has 2 versions of libnetfilter-queue in its archives (see
    http://archive.raspbian.org/raspbian/pool/main/libn/libnetfilter-queue/):
    - 0.0.17-1 which lacks nfq_set_verdict2
    - 1.0.2-2 which has this function

    So Marko either has to update his system or use the branch I created.

    We definitely should keep the verdict2 function in the master branch, since the other is officially deprecated. I don't know whether it is possible to code the old verdict function as backup code in pgl. Otherwise I propose to keep master as it is and I'll add the nfq_set_verdict branch to a backport branch, which I always rebase on current master for releases.

     
  • Ah, raspbian is based on wheezy, which has only 0.0.17-1 available. I'll have to add unstable to my list of sources...

    I'll let you guys know when I have something...

     
  • jre-phoenix
    2014-06-15

    Yeah, I just noticed I would have to make a revert of this commit for the backports for the Debian Wheezy and Ubuntu Lucid packages anyway.

    So I added this to the git branch pgl_backport and published it (replacing the pgl_debian-backport and the pgl_nfq_set_verdict branches).

    Since the branch pgl_backport also contains other changes for Wheezy, I recommend you use that one instead of updating single packages on your system.

     
  • Ok, I built the pgl_backport branch and it's installed and running. NFQUEUE_MAXLEN is set to 4096 packets, NICE_LEVEL is not set.

    I tried a manual pglcmd update under network load and it went fine. There was a line in syslog:

    raspberry kernel: [ 1048.907106] TCP: TCP: Possible SYN flooding on port 51000. Sending cookies. Check SNMP counters.

    I'll let it run for a few days and let you know how it goes. I can also upgrade to jessie and test master.

     
    Last edit: Marko Bozikovic 2014-06-15
  • jre-phoenix
    2014-06-16

    Good.

    Did that warning appear directly during the update? Can you reproduce it?

     
  • Cader
    2014-06-16

    That is a "normal" message when there are too many connections for your kernel to handle; the kernel sends SYN cookies to protect you from a DDoS:
    http://blog.dubbelboer.com/2012/04/09/syn-cookies.html

    It can be fixed by increasing net.ipv4.tcp_max_syn_backlog and net.core.somaxconn.
    If you were testing pgl by creating a ton of new connections to your machine this is probably unnecessary since the test caused an abnormal amount of connections.
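
Before raising the two sysctls named above, it can help to see what they are currently set to. The following is only a minimal sketch in Python: the helper name `read_sysctl` is made up for illustration, and it assumes the usual Linux layout where a sysctl key maps to a file under /proc/sys with the dots replaced by slashes.

```python
from pathlib import Path


def read_sysctl(name, root="/proc/sys"):
    """Return the integer value of a single-valued sysctl such as net.core.somaxconn."""
    # Assumption: the key maps to a file path, e.g. net.core.somaxconn -> net/core/somaxconn.
    path = Path(root) / name.replace(".", "/")
    return int(path.read_text().strip())


if __name__ == "__main__":
    for key in ("net.ipv4.tcp_max_syn_backlog", "net.core.somaxconn"):
        try:
            print(key, "=", read_sysctl(key))
        except OSError as exc:
            # Not Linux, or /proc is not mounted or not readable here.
            print(key, "not readable:", exc)
```

The `root` parameter exists only so the helper can be pointed at a test directory; on a real system the default should be what you want.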

     
  • Ok, today pgld logged an error message. Here's the relevant part of the log:

    Jun 19 06:25:42 INFO: Reopened logfile: /var/log/pgl/pgld.log
    Jun 19 06:26:00 INFO: ASCII: 192002 entries loaded from "/var/lib/pgl/master_blocklist.p2p"
    Jun 19 06:26:01 INFO: Blocking 192002 IP ranges (2963402476 IPs).
    Jun 19 06:26:01 INFO: Blocklist(s) reloaded.
    Jun 19 06:26:01 ERROR: ENOBUFS error on queue '92'. Use pgld -Q option or set in pglcmd.conf NFQUEUE_MAXLEN to increase buffers, recv returned No buffer space available
    Jun 19 06:30:18 INFO: Started.
    Jun 19 06:30:27 INFO: ASCII: 192002 entries loaded from "/var/lib/pgl/master_blocklist.p2p"
    Jun 19 06:30:27 INFO: Blocking 192002 IP ranges (2963402476 IPs).
    Jun 19 06:30:27 INFO: Binding to queue 92
    Jun 19 06:30:27 INFO: ACCEPT mark: 20
    Jun 19 06:30:27 INFO: REJECT mark: 10
    Jun 19 06:30:27 INFO: Set netfilter queue length to 4096 packets
    

    It seems that pgld crashed/stopped after logging the error, since pglcmd log says this:

    Blocklists updated.
    Problematic daemon status: 1
    pgld is not running ... failed!
    2014-06-19 06:29:15 BST Begin: pglcmd restart_not_wd
    Deleting iptables ...
    ..Executing iptables remove script /var/lib/pgl/.pglcmd.iptables.remove.sh.
    ..Removing iptables remove script /var/lib/pgl/.pglcmd.iptables.remove.sh.
    Iptables deleted.
    Stopping pgld/sbin/start-stop-daemon: warning: failed to kill 28677: No such process
    .
    Building blocklist ...
    WARN: No valid ASCII blocklist format line:
    INFO: ASCII: 683536 entries loaded from "STDIN"
    INFO: Merged 491534 of 683536 entries.
    INFO: Blocking 192002 IP ranges (2963402476 IPs).
    Blocklist built.
    

    I've set NFQUEUE_MAXLEN to 8192 now, still without setting NICE_LEVEL.
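
To compare test runs, it may be useful to measure how long pgld was down after each ENOBUFS error. The sketch below is in Python, with hypothetical helper names, and assumes the pgld.log timestamp format shown above (month, day, time, no year); the year is passed in separately. The sample lines are taken from the log excerpt above, with the ENOBUFS line shortened.

```python
from datetime import datetime


def parse_stamp(line, year=2014):
    """Parse the 'Jun 19 06:26:01' prefix of a pgld log line."""
    stamp = line.strip()[:15]
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")


def downtimes(lines, year=2014):
    """Return the seconds between each ENOBUFS error and the next 'Started.' line."""
    gaps = []
    failed_at = None
    for line in lines:
        if "ERROR: ENOBUFS" in line:
            if failed_at is None:          # keep the first error of an outage
                failed_at = parse_stamp(line, year)
        elif line.rstrip().endswith("INFO: Started.") and failed_at is not None:
            delta = parse_stamp(line, year) - failed_at
            gaps.append(int(delta.total_seconds()))
            failed_at = None
    return gaps


sample = [
    "Jun 19 06:26:01 INFO: Blocklist(s) reloaded.",
    "Jun 19 06:26:01 ERROR: ENOBUFS error on queue '92'.",  # shortened from the log above
    "Jun 19 06:30:18 INFO: Started.",
]
print(downtimes(sample))  # [257]
```

For the excerpt above this gives a gap of 257 seconds between the error and the watchdog-triggered restart.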

     
    Last edit: Marko Bozikovic 2014-06-19
  • jre-phoenix
    2014-06-19

    Thanks.

    The results so far are (please correct me if I'm wrong):

    • no change --> crash every day
    • NICE=-10 --> 2 days, then crash
    • Increased send/receive window=8388608 (see note 1 below) --> 5 days without crash, then stopped the test
    • MAXLEN=4096 --> 3 days, then crash

    So now we can either continue to test with increased NFQUEUE_MAXLEN=8192 or we try a mix of MAXLEN, NICE and an increased receive/send window:

    sysctl -w net.core.rmem_default=8388608
    sysctl -w net.core.wmem_default=8388608

    For now I'd say one more test with only MAXLEN, then go for a mix.

    I just reenabled the queue close and return after ENOBUFS in pgld. We had disabled that because I thought pgld could just ignore the ENOBUFS. But now we know that this definitely requires a restart of pgld (at least the pglcmd watchdog perfectly takes care of this). You may recompile from pgl_backport branch.

    [Note 1] See https://sourceforge.net/p/peerguardian/discussion/446997/thread/0df72ba6/#1130 and note Cader's theoretical analysis: "These are for the interface buffers in bytes. If the issue is the netfilter buffer these sysctl commands won't help". So this was either coincidence or we don't understand the inner workings completely (but who actually does, hehe?).

     
    Last edit: jre-phoenix 2014-06-19
  • jre-phoenix
    2014-06-19

    OK, revisiting the netfilter page on ENOBUFS I'd say forget about the sysctl commands (although they seem to help, they are not in the upstream recommendation list; but heck, MAXLEN isn't there either):

    ENOBUFS errors in recv()

    recv() may return -1 and errno is set to ENOBUFS in case that your application is not fast enough to retrieve the packets from the kernel. In that case, you can increase the socket buffer size by means of nfnl_rcvbufsiz(). Although this delays ENOBUFS errors, you may hit it again sooner or later. The next section provides some hints on how to obtain the best performance for your application.

    Performance

    To improve your libnetfilter_queue application in terms of performance, you may consider the following tweaks:

    1. increase the default socket buffer size by means of nfnl_rcvbufsiz().
    2. set nice value of your process to -20 (maximum priority).
    3. set the CPU affinity of your process to a spare core that is not used to handle NIC interruptions.
    4. set NETLINK_NO_ENOBUFS socket option to avoid receiving ENOBUFS errors (requires Linux kernel >= 2.6.30).
    5. see --queue-balance option in NFQUEUE target for multi-threaded apps (it requires Linux kernel >= 2.6.31).

    For us this means:

    1. could be solved similarly to MAXLEN in pgld, with a configuration option in pglcmd. IIUC nfq_set_queue_maxlen and nfnl_rcvbufsiz do quite similar things.
    2. can already be done in pglcmd. Has a negative impact on the rest of the system.
    3. don't know whether this requires coding in pgl.
    4. last resort or best solution? I think in Marko's case simply ignoring ENOBUFS and having the kernel drop some packets would be ok. Again needs a change in pgld and an option in pglcmd.
    5. already on the TODO for a long time. Up to now I thought --queue-balance requires multiple pgld processes running. Now I see the solution with a multi-threaded pgld, but don't know how difficult this is. Requires changes in pgld and some adaptations in pglcmd.
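
Items 2 and 3 of the list above are both per-process settings that a program can apply to itself. As a rough illustration only, here is how that looks in Python; pgld itself is written in C and would presumably use the underlying setpriority(2) and sched_setaffinity(2) calls instead. The helper names `set_nice` and `pin_to_cpu` are hypothetical and not part of pgl.

```python
import os


def set_nice(nice_value):
    """Try to set this process's nice value; return True on success."""
    try:
        os.setpriority(os.PRIO_PROCESS, 0, nice_value)  # 0 means "this process"
        return True
    except PermissionError:
        # Lowering the nice value (raising priority) normally needs root.
        return False


def pin_to_cpu(cpu):
    """Restrict this process to a single CPU; return False if it is not available."""
    if cpu not in os.sched_getaffinity(0):
        return False
    os.sched_setaffinity(0, {cpu})
    return True
```

A call such as `set_nice(-10)` only succeeds with sufficient privileges. On a machine with a single core there is no spare CPU to pin to, so `pin_to_cpu` brings nothing there.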
     
  • Cader
    2014-06-19

    I was able to hit ENOBUFS starting a ton of torrents at once with default queue length. I also noticed that pgld either died or unbound from the queue and didn't process any more packets.
    I will look at that NETLINK_NO_ENOBUFS option and I think I saw something about rebinding to the queue.
    I'll look at a couple options.

    For #3 (CPU affinity) - does a Pi have multiple CPUs/cores to even do that?

     
  • Hi all,

    @jre: your list of my test results is correct.

    Pi has a single core.

    I hit ENOBUFS today again with MAXLEN 8192. I'll try to build pgl_backport changes today. I'll try setting nice level to -10 and keeping MAXLEN at 8192.

    One possibly silly suggestion: would it be possible for pgld to reload the blocklist in a separate thread (or using async I/O) and then swap it with the currently active one? Even on single-core machines, a thread that loads stuff from storage mostly waits for I/O to finish, so the main thread can have most of the CPU time...
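
The load-then-swap idea above could look roughly like this. It is a sketch in Python, not pgld code: the class and function names are invented, the `name:start-end` line format is assumed from the ASCII (.p2p) blocklist mentioned in the logs, and the lookup assumes the ranges are already merged so that they do not overlap.

```python
import bisect
import ipaddress
import threading


def parse_p2p(lines):
    """Parse 'name:start-end' lines into a sorted list of (start, end) integers."""
    ranges = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        _, _, span = line.rpartition(":")   # the name itself may contain colons
        start, _, end = span.partition("-")
        ranges.append((int(ipaddress.IPv4Address(start)),
                       int(ipaddress.IPv4Address(end))))
    ranges.sort()
    return ranges


class Blocklist:
    def __init__(self, ranges=()):
        # The list is replaced as a whole on reload and never modified in place.
        self._ranges = list(ranges)

    def is_blocked(self, ip):
        ranges = self._ranges               # take one consistent snapshot
        value = int(ipaddress.IPv4Address(ip))
        # Index of the last range whose start is <= value (assumes no overlaps).
        i = bisect.bisect_right(ranges, (value, float("inf"))) - 1
        return i >= 0 and ranges[i][0] <= value <= ranges[i][1]

    def reload_async(self, load):
        """Load and parse in a background thread, then swap in the new list."""
        def worker():
            new_ranges = parse_p2p(load())  # slow part; lookups keep using the old list
            self._ranges = new_ranges       # the swap is a single assignment

        thread = threading.Thread(target=worker, daemon=True)
        thread.start()
        return thread
```

Usage would be along the lines of `bl.reload_async(lambda: open(path).readlines())`, with lookups continuing against the old list until the swap. In C the swap would need an atomic pointer exchange or a short lock, and the old list can only be freed once no lookup is still using it.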

     
  • jre-phoenix
    2014-06-22

    @Cader: Setting NICE="20" might help to trigger an ENOBUFS.

    @Marko: That might be a very good idea for your problem.

    Generally to prevent ENOBUFS we should make pgld multithreaded (which I think it is not currently, again I'm no expert here) in order to be able to use multiple QUEUES with --queue-balance, but also to reload in the background.

    But I also think we definitely need a better solution for the case that ENOBUFS happens, see Cader's last mail.

     
  • Hi all,

    Finally found some time to build the latest commit to pgl_backport... I'm running it now with NFQUEUE_MAXLEN set to 8192 and no NICE level set. Again, I'll let it run for a few days (or until it crashes).

     
  • jre-phoenix
    2014-06-25

    @Marko: The last relevant change for your problem was on 2014-06-12. (Since then we only added error messages, did code cleanups and other stuff.) So you can continue your testing with MAXLEN=8192 and NICE=-10.

    Current results:
    no change: crash every day
    NICE=-10: crash after 2 days
    Increased send/receive window=8388608: no crash for 5 days
    MAXLEN=4096: crash after 3 days
    MAXLEN=8192: crash after 3 days
    MAXLEN=8192, NICE_LEVEL=-10: no crash for 3 days since 2014-06-26

    Am I right with the last entry? I'm not sure what NICE_LEVEL you used the last three days.

     
    Last edit: jre-phoenix 2014-06-27
  • Hi,

    I've just set the NICE option to -10 since I forgot to uncomment it the other day :)

     
  • Hi all,

    Just to let you know that everything's been working fine since 26/06...

     
  • Hi all,

    Pgld failed with ENOBUFS this morning.

     