Raspbian has two versions of libnetfilter-queue in its archives:
- 0.0.17-1 which lacks nfq_set_verdict2
- 1.0.2-2 which has this function
So Marko either has to update his system or use the branch I created.
We should definitely keep the verdict2 function in the master branch, since the other one is officially deprecated. I don't know whether it is possible to code the old verdict function as fallback code in pgl. Otherwise I propose to keep master as it is, and I'll add the nfq_set_verdict branch to a backport branch, which I always rebase on current master for releases.
Ah, Raspbian is based on wheezy, which has only 0.0.17-1 available. I'll have to add unstable to my list of sources...
I'll let you guys know when I have something...
Yeah, I just noticed I would have to make a revert of this commit for the backports for the Debian Wheezy and Ubuntu Lucid packages anyway.
So I added this to the git branch pgl_backport and published it (replacing the pgl_debian-backport and the pgl_nfq_set_verdict branches).
Since the branch pgl_backport also contains other changes for Wheezy, I recommend you use that one instead of updating single packages on your system.
Ok, I built the pgl_backport branch and it's installed and running. NFQUEUE_MAXLEN is set to 4096 packets, NICE_LEVEL not set.
I tried a manual pglcmd update under network load and it went fine. There was a line in syslog:
raspberry kernel: [ 1048.907106] TCP: TCP: Possible SYN flooding on port 51000. Sending cookies. Check SNMP counters.
I'll let it run for a few days and let you know how it goes. I can also upgrade to jessie and test master.
Was that warning logged directly during the update? Can you reproduce it?
That is a "normal" message when there are too many connections for your kernel to handle. The kernel sends SYN cookies to protect you from DDoS.
It can be fixed by increasing net.ipv4.tcp_max_syn_backlog and net.core.somaxconn.
If you were testing pgl by creating a ton of new connections to your machine, this is probably unnecessary, since the test caused an abnormal number of connections.
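Before changing anything it can help to see what the two values currently are. A minimal sketch, assuming the standard Linux layout where a dotted sysctl key maps to a file under /proc/sys; the helper names are made up for illustration and are not part of pgl:

```python
from pathlib import Path

# The two settings named above. The /proc/sys mapping is the usual Linux
# layout for sysctl keys; the helper names below are hypothetical.
SYN_BACKLOG_KEYS = ("net.ipv4.tcp_max_syn_backlog", "net.core.somaxconn")


def sysctl_path(key, root="/proc/sys"):
    """Map a dotted sysctl key to its file under /proc/sys."""
    return Path(root).joinpath(*key.split("."))


def read_sysctl(key, root="/proc/sys"):
    """Return the current value as an int, or None if it cannot be read."""
    try:
        return int(sysctl_path(key, root).read_text().split()[0])
    except (OSError, ValueError, IndexError):
        return None


if __name__ == "__main__":
    for key in SYN_BACKLOG_KEYS:
        print(f"{key} = {read_sysctl(key)}")
```

This only reads the values. Raising them is still done with sysctl -w (or a sysctl.conf entry) as root.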
Ok, today pgld logged an error message. Here's the relevant part of the log:
Jun 19 06:25:42 INFO: Reopened logfile: /var/log/pgl/pgld.log
Jun 19 06:26:00 INFO: ASCII: 192002 entries loaded from "/var/lib/pgl/master_blocklist.p2p"
Jun 19 06:26:01 INFO: Blocking 192002 IP ranges (2963402476 IPs).
Jun 19 06:26:01 INFO: Blocklist(s) reloaded.
Jun 19 06:26:01 ERROR: ENOBUFS error on queue '92'. Use pgld -Q option or set in pglcmd.conf NFQUEUE_MAXLEN to increase buffers, recv returned No buffer space available
Jun 19 06:30:18 INFO: Started.
Jun 19 06:30:27 INFO: ASCII: 192002 entries loaded from "/var/lib/pgl/master_blocklist.p2p"
Jun 19 06:30:27 INFO: Blocking 192002 IP ranges (2963402476 IPs).
Jun 19 06:30:27 INFO: Binding to queue 92
Jun 19 06:30:27 INFO: ACCEPT mark: 20
Jun 19 06:30:27 INFO: REJECT mark: 10
Jun 19 06:30:27 INFO: Set netfilter queue length to 4096 packets
It seems that pgld crashed/stopped after logging the error, since the pglcmd log says this:
Problematic daemon status: 1
pgld is not running ... failed!
2014-06-19 06:29:15 BST Begin: pglcmd restart_not_wd
Deleting iptables ...
..Executing iptables remove script /var/lib/pgl/.pglcmd.iptables.remove.sh.
..Removing iptables remove script /var/lib/pgl/.pglcmd.iptables.remove.sh.
Stopping pgld/sbin/start-stop-daemon: warning: failed to kill 28677: No such process
Building blocklist ...
WARN: No valid ASCII blocklist format line:
INFO: ASCII: 683536 entries loaded from "STDIN"
INFO: Merged 491534 of 683536 entries.
INFO: Blocking 192002 IP ranges (2963402476 IPs).
I've set NFQUEUE_MAXLEN to 8192 now, still without setting NICE_LEVEL.
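For anyone reading the numbers in that log: 683536 entries loaded minus 491534 merged leaves 192002, which matches the "Blocking 192002 IP ranges" line. A minimal sketch of how overlapping ranges can be merged and counted, assuming inclusive start/end ranges; this is only an illustration of the idea, not pgl's actual merge code, and the sample addresses are made up:

```python
import ipaddress


def to_int(addr):
    """Convert a dotted IPv4 string to an integer."""
    return int(ipaddress.IPv4Address(addr))


def merge_ranges(ranges):
    """Merge overlapping or directly adjacent inclusive (start, end) ranges."""
    merged = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1] + 1:
            # Overlaps or touches the previous range: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def count_ips(ranges):
    """Number of addresses covered by non-overlapping inclusive ranges."""
    return sum(end - start + 1 for start, end in ranges)


if __name__ == "__main__":
    # Hypothetical sample entries, not taken from any real blocklist.
    raw = [
        ("10.0.0.0", "10.0.0.255"),
        ("10.0.0.128", "10.0.1.255"),
        ("192.168.1.1", "192.168.1.1"),
    ]
    merged = merge_ranges((to_int(a), to_int(b)) for a, b in raw)
    print(f"Merged {len(raw) - len(merged)} of {len(raw)} entries.")
    print(f"Blocking {len(merged)} IP ranges ({count_ips(merged)} IPs).")
```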
The results so far are (please correct me if I'm wrong):
So now we can either continue to test with increased NFQUEUE_MAXLEN=8192, or we try a mix of MAXLEN, NICE and an increased receive/send window:
sysctl -w net.core.rmem_default=8388608
sysctl -w net.core.wmem_default=8388608
For now I'd say one more test with only MAXLEN, then go for a mix.
I just reenabled the queue close and return after ENOBUFS in pgld. We had disabled that because I thought pgld could just ignore the ENOBUFS. But now we know that this definitely requires a restart of pgld (at least the pglcmd watchdog perfectly takes care of this). You may recompile from pgl_backport branch.
[Note 1] See https://sourceforge.net/p/peerguardian/discussion/446997/thread/0df72ba6/#1130 and note Cader's theoretical analysis: "These are for the interface buffers in bytes. If the issue is the netfilter buffer these sysctl commands won't help". So this was either coincidence or we don't understand the inner workings completely (but who actually does, hehe??).
OK, revisiting the netfilter page on ENOBUFS I'd say forget about the sysctl commands (although they seem to help, they are not in the upstream recommendation list. But heck, MAXLEN isn't there either):
recv() may return -1 and errno is set to ENOBUFS in case that your application is not fast enough to retrieve the packets from the kernel. In that case, you can increase the socket buffer size by means of nfnl_rcvbufsiz(). Although this delays ENOBUFS errors, you may hit it again sooner or later. The next section provides some hints on how to obtain the best performance for your application.
To improve your libnetfilter_queue application in terms of performance, you may consider the following tweaks:
I was able to hit ENOBUFS by starting a ton of torrents at once with the default queue length. I also noticed that pgld either died or unbound from the queue and didn't process any more packets.
I will look at that NETLINK_NO_ENOBUFS option and I think I saw something about rebinding to the queue.
I'll look at a couple options.
For #2 - does a pi have multiple cpus/cores to even do that?
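To make the ENOBUFS case concrete, here is a minimal sketch of a receive loop that tells a buffer overrun apart from other socket errors, so the caller can decide whether to rebind or restart. It is not pgld's code (pgld is not written in Python), it does not use the NETLINK_NO_ENOBUFS option, and ScriptedSocket is a hypothetical stand-in so the example runs without a real netfilter queue:

```python
import errno


def read_packets(sock, handle, bufsize=65536):
    """Feed packets from sock to handle() until the socket closes or overruns.

    Returns "closed" when recv() returns no data and "overrun" when recv()
    fails with ENOBUFS. Any other OSError propagates unchanged.
    """
    while True:
        try:
            data = sock.recv(bufsize)
        except OSError as exc:
            if exc.errno == errno.ENOBUFS:
                return "overrun"
            raise
        if not data:
            return "closed"
        handle(data)


class ScriptedSocket:
    """Hypothetical stand-in for a queue socket: replays a fixed script."""

    def __init__(self, script):
        self._script = list(script)

    def recv(self, bufsize):
        item = self._script.pop(0) if self._script else b""
        if isinstance(item, Exception):
            raise item
        return item


if __name__ == "__main__":
    seen = []
    sock = ScriptedSocket(
        [b"pkt1", b"pkt2", OSError(errno.ENOBUFS, "No buffer space available")]
    )
    status = read_packets(sock, seen.append)
    print(status, seen)
```

Returning a status instead of silently continuing leaves the decision (rebind, restart, or just log) to the caller.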
@jre: your list of my test results is correct.
Pi has a single core.
I hit ENOBUFS today again with MAXLEN 8192. I'll try to build pgl_backport changes today. I'll try setting nice level to -10 and keeping MAXLEN at 8192.
One possibly silly suggestion: would it be possible for pgld to reload the blocklist in a separate thread (or using async I/O) and then swap it with the currently active one? Even on single-core machines, a thread that loads stuff from storage mostly waits for I/O to finish, so the main thread can have most of the CPU time...
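A rough sketch of that load-then-swap idea, just to show the pattern: the slow load runs in a worker thread without holding any lock, and only the final reference swap is locked. The class and function names are invented for illustration, and pgld itself would have to do this in its own language with its own data structures:

```python
import threading


class BlocklistHolder:
    """Holds the active blocklist and lets a loader thread replace it."""

    def __init__(self, ranges):
        self._lock = threading.Lock()
        self._ranges = tuple(ranges)

    def is_blocked(self, ip):
        with self._lock:
            ranges = self._ranges  # grab a reference, then release the lock
        return any(start <= ip <= end for start, end in ranges)

    def swap(self, new_ranges):
        with self._lock:
            self._ranges = new_ranges


def reload_in_background(holder, load):
    """Run load() in a worker thread and swap its result in when it finishes."""

    def worker():
        new_ranges = tuple(load())  # slow part runs without holding the lock
        holder.swap(new_ranges)

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return thread


if __name__ == "__main__":
    holder = BlocklistHolder([(1, 5)])
    worker = reload_in_background(holder, lambda: [(10, 20)])
    worker.join()
    print(holder.is_blocked(3), holder.is_blocked(15))
```

The old list stays fully usable until the new one is complete, so packet checks never see a half-loaded blocklist.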
@Cader: Setting NICE="20" might help to trigger an ENOBUFS.
@Marko: That might be a very good idea for your problem.
Generally to prevent ENOBUFS we should make pgld multithreaded (which I think it is not currently, again I'm no expert here) in order to be able to use multiple QUEUES with --queue-balance, but also to reload in the background.
But I also think we definitely need a better solution for the case that ENOBUFS happens, see Cader's last mail.
Finally found some time to build the latest commit to pgl_backport... I'm running it now with MAXLEN set to 8192 and no NICE level set. Again, I'll let it run for a few days (or until it crashes).
@Marko: The last relevant change for your problem was on 2014-06-12. (Since then we only added error messages, did code cleanups and other stuff.) So you can continue your testing with MAXLEN=8192 and NICE=-10.
- no change: crash every day
- NICE=-10: crash after 2 days
- Increased send/receive window=8388608: no crash for 5 days
- MAXLEN=4096: crash after 3 days
- MAXLEN=8192: crash after 3 days
- MAXLEN=8192, NICE_LEVEL=-10: no crash for 3 days since 2014-06-26
Am I right with the last entry? I'm not sure what NICE_LEVEL you used the last three days.
I've just set the NICE option to -10 since I forgot to uncomment it the other day :)
Just to let you know that everything's been working fine since 26/06...
pgld failed with ENOBUFS this morning.