I use Calimero since a long time and updated recently to version 2.2.1. Since then I exprerienced the problem that the library stopps receiving telegrams after some time. Some facts.
-I see the telegrams coming in with Wireshark properly, the link works but Calimero is just silent, it doesn't fire any events.
-There is no obvious situation when this happens, I can't find any reason so far. The time seems random, sometimes days, sometimes minutes.
-After some time Calimero starts receiving telegrams again, as well this time is random but usually about 10 - 30 minutes.
-There is definitely nothing blocked or any Exception thrown. All is properly catched.
-There is no Thread blocked, I did take ThreadDumps
-During the blocked period, Telegrams could be still sent to the bus without any problem.
-OS Windows 7 / Java 1.8 / Calimero 2.2.1
I wasn't able to activate debug or trace log level for Calimero in the log4j configuration. Is there somewhere any instruction about this?
Any advice how to debug this strange situation is highly appreciated.
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
What Calimero version did you use before?
You connect via KNXnet/IP tunneling, right?
Calimero itself does not look for log4j configurations, nor depend on log4j. So I cannot really know how the logging setup of your application works. Do you see warn/info etc. messages from Calimero using your configuration, or nothing at all?
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
Yes, KNXnet/IP tunneling, sorry forgot to mention.
I don't see any log output from calimero, nowhere. Actually I didn't found how to configure loglevel for calimero yet.
I used for several years version 1.4.0 without problems. Since migrating to 2.2.1 I experienced this behaviour.
I tested now as well with 2.3-beta, same behaviour.
Are there any limitations in speed of sending telegrams, I saw in the logs that in the cases where it locks up there are telegrams sent in a fast order (eg. 3 telegrams within just a few ms)?
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
You can see a small code tutorial on how logging works here.
There are several ways Calimero provides its log output (file, network, etc.), but I think that you did not pipe the output into log4j at all. Could that be? For log4j 2, you might simply do:
Add your log4j writer in Calimero, telling it to capture everything:
LogManager.getManager().addWriter("", new Log4jWriter());
Now you can configure the complete output with log4j properties etc.
From v1.4, the changes are considerable, so I would start with checking the log first.
Re. speed limitations: all tunneled knx telegrams are acknowledged, so in some sense the receiver processing speed tells the sender how fast to send. I get around 400 us for a single write (using process communication), so your values seem quite ok.
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
It recovers as soon as the rcv-seq hits again the right value, so exactly after 255 telegrams recevied after the problem appeared (It overflows at 255 and starts from 0 again).
So why does it get async and shouldn't there be a recovery mechanism which resolved immediately?
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
If I check the source of calimero, it does handle just the other way around but not if calimeros counter is behind. I just ask myself if disconnecting really helps in this situation, it obviously missed a telegram (or didn't increment the counter) and this won't come again. So in my meaning it could just follow the counter. The open question for me is if Calimero has a possible place confirming a telegram without performing the sequence counter increment properly.
BTW: I have a few Chromecasts in the network running and I have the impression that if the same are running, the problem arises more often than if not. Probably the UDP traffic is influenced what could cause more packet losts and therefore provoke this situation.
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
[calimero] does handle just the other way around but not if calimeros counter is behind
yes, a repeated frame can be ignored, a missed frame not.
I will rephrase my previous answer: the server SHALL send a disconnect in that situation; it is not optional. All subsequent requests (up to the expected one) are not further considered.
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
I patched Calimero with a simple provisory workaround to treat exactly this situation. The trap since then logs/triggers about 5 to 10 times a day but just works without any service interruption.
protected boolean handleServiceType(final KNXnetIPHeader h, final byte[] data, final int offset,
final InetAddress src, final int port) throws KNXFormatException, IOException
{
if (super.handleServiceType(h, data, offset, src, port))
return true;
final int svc = h.getServiceType();
if (svc != serviceRequest)
return false;
final ServiceRequest req = getServiceRequest(h, data, offset);
if (!checkChannelId(req.getChannelID(), "request"))
return true;
final int seq = req.getSequenceNumber();
// ************** Start Workaround to treat missed telegram problem
if (seq==((getSeqRcv()+1) & 0xFF)){
// ups, we are one behind, don't know why, we missed it somewhere to increment or lost it.
// Fix this by just following the real counter again
logger.error("tunneling request with larger rcv-seq " + seq + ", expected "
+ getSeqRcv() + " -> fix this nasty situation");
incSeqRcv();
}
// ************** End Workaround to treat missed telegram problem
if (seq == getSeqRcv() || ((seq + 1) & 0xFF) == getSeqRcv()) {
Last edit: Kusis 2016-12-04
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
It would be good to know why that whole situation (which requires the workaround) occurs (besides that there is some bug that needs to be fixed). Is it specific to some hardware/setup etc. ?
You are aware that such workaround basically subverts reliable in-order delivery of cEMI frames of the tunneling protocol (and is outside the specification).
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
Yes, I am aware of that. But a missed frame is a missed frame. If it doesn't come it is lost and even a reconnection will not make it appear again.
I don't think this is a special situation but just a use case with a lot of UDP traffice on the network and there might be from time to time a conflict and the workaround at least fixes this single situation and makes the best of the situation for the moment. Reconnect would make it worth because the telegram which came in with the "invalid" sequence number would be lost as well.
In theory this situation shouldn't be possible, so my assumption is that Calimero doesn't increase the counter somewhere where it should (or my Siemens Instabus IP Interface is faulty).
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
If anyone is wondering why this was only integrated into Calimero as an opt-in feature, here are the clauses from the KNX specifications that Boris was referring to (Chapter 3/8/4 "Tunnelling" v01.05.03 AS, Section 2.6):
"If a KNXnet/IP Tunnelling Client receives a frame with a sequence number that is not equal to the expected sequence number and not equal to one less than the expected sequence number, the KNXnet/IP Tunnelling Client shall not reply and shall discard the received frame."
"If the KNXnet/IP device does not receive a TUNNELLING_ACK frame within the TUNNELLING_REQUEST_TIMEOUT (= 1 second) [...], the sending device shall repeat the TUNNELLING_REQUEST frame once and then terminate the connection by sending a DISCONNECT_REQUEST frame to the other device’s control endpoint."
For all we know, Calimero adheres to these rules (unless this feature toggle is activated!). Only an actual network trace of the situation that the OP observed could give proof, however.
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
I use Calimero since a long time and updated recently to version 2.2.1. Since then I exprerienced the problem that the library stopps receiving telegrams after some time. Some facts.
-I see the telegrams coming in with Wireshark properly, the link works but Calimero is just silent, it doesn't fire any events.
-There is no obvious situation when this happens, I can't find any reason so far. The time seems random, sometimes days, sometimes minutes.
-After some time Calimero starts receiving telegrams again, as well this time is random but usually about 10 - 30 minutes.
-There is definitely nothing blocked or any Exception thrown. All is properly catched.
-There is no Thread blocked, I did take ThreadDumps
-During the blocked period, Telegrams could be still sent to the bus without any problem.
-OS Windows 7 / Java 1.8 / Calimero 2.2.1
I wasn't able to activate debug or trace log level for Calimero in the log4j configuration. Is there somewhere any instruction about this?
Any advice how to debug this strange situation is highly appreciated.
What Calimero version did you use before?
You connect via KNXnet/IP tunneling, right?
Calimero itself does not look for log4j configurations, nor depend on log4j. So I cannot really know how the logging setup of your application works. Do you see warn/info etc. messages from Calimero using your configuration, or nothing at all?
Yes, KNXnet/IP tunneling, sorry forgot to mention.
I don't see any log output from calimero, nowhere. Actually I didn't found how to configure loglevel for calimero yet.
I used for several years version 1.4.0 without problems. Since migrating to 2.2.1 I experienced this behaviour.
I tested now as well with 2.3-beta, same behaviour.
Are there any limitations in speed of sending telegrams, I saw in the logs that in the cases where it locks up there are telegrams sent in a fast order (eg. 3 telegrams within just a few ms)?
You can see a small code tutorial on how logging works here.
There are several ways Calimero provides its log output (file, network, etc.), but I think that you did not pipe the output into log4j at all. Could that be? For log4j 2, you might simply do:
Add your log4j writer in Calimero, telling it to capture everything:
Now you can configure the complete output with log4j properties etc.
From v1.4, the changes are considerable, so I would start with checking the log first.
Re. speed limitations: all tunneled knx telegrams are acknowledged, so in some sense the receiver processing speed tells the sender how fast to send. I get around 400 us for a single write (using process communication), so your values seem quite ok.
Thanks, with this info I was now able to get it logging properly.
2016-12-02 11:11:41:007 calimero.link.192.168.200.14:3671: indication from 1.2.29 /
2016-12-02 11:11:44:448 calimero.link.192.168.200.14:3671: indication from 1.2.29 /
2016-12-02 11:11:53:316 calimero.link.192.168.200.14:3671: indication from 1.2.29 /
2016-12-02 11:11:53:521 calimero.link.192.168.200.14:3671: indication from 1.2.5 /
2016-12-02 11:12:06:363 KNXnet/IP Tunneling 192.168.200.14:3671: tunneling request with invalid rcv-seq 117, expected 116 /
2016-12-02 11:12:06:363 KNXnet/IP Tunneling 192.168.200.14:3671: skip tunneling request with rcv-seq 117 (already received) /
2016-12-02 11:12:07:363 KNXnet/IP Tunneling 192.168.200.14:3671: tunneling request with invalid rcv-seq 117, expected 116 /
2016-12-02 11:12:07:363 KNXnet/IP Tunneling 192.168.200.14:3671: skip tunneling request with rcv-seq 117 (already received) /
2016-12-02 11:12:12:498 KNXnet/IP Tunneling 192.168.200.14:3671: tunneling request with invalid rcv-seq 118, expected 116 /
2016-12-02 11:12:12:498 KNXnet/IP Tunneling 192.168.200.14:3671: skip tunneling request with rcv-seq 118 (already received) /
2016-12-02 11:12:13:055 KNXnet/IP Tunneling 192.168.200.14:3671: sending connection state request, attempt 1 /
2016-12-02 11:12:13:498 KNXnet/IP Tunneling 192.168.200.14:3671: tunneling request with invalid rcv-seq 118, expected 116 /
2016-12-02 11:12:13:498 KNXnet/IP Tunneling 192.168.200.14:3671: skip tunneling request with rcv-seq 118 (already received) /
2016-12-02 11:12:15:499 KNXnet/IP Tunneling 192.168.200.14:3671: tunneling request with invalid rcv-seq 119, expected 116 /
2016-12-02 11:12:15:499 KNXnet/IP Tunneling 192.168.200.14:3671: skip tunneling request with rcv-seq 119 (already received) /
[ SKIPPED up to recovery section]
2016-12-02 11:51:31:361 KNXnet/IP Tunneling 192.168.200.14:3671: skip tunneling request with rcv-seq 113 (already received) /
2016-12-02 11:51:32:360 KNXnet/IP Tunneling 192.168.200.14:3671: tunneling request with invalid rcv-seq 113, expected 116 /
2016-12-02 11:51:32:361 KNXnet/IP Tunneling 192.168.200.14:3671: skip tunneling request with rcv-seq 113 (already received) /
2016-12-02 11:51:33:361 KNXnet/IP Tunneling 192.168.200.14:3671: tunneling request with invalid rcv-seq 114, expected 116 /
2016-12-02 11:51:33:361 KNXnet/IP Tunneling 192.168.200.14:3671: skip tunneling request with rcv-seq 114 (already received) /
2016-12-02 11:51:34:360 KNXnet/IP Tunneling 192.168.200.14:3671: tunneling request with invalid rcv-seq 114, expected 116 /
2016-12-02 11:51:34:361 KNXnet/IP Tunneling 192.168.200.14:3671: skip tunneling request with rcv-seq 114 (already received) /
2016-12-02 11:51:52:597 KNXnet/IP Tunneling 192.168.200.14:3671: skip tunneling request with rcv-seq 115 (already received) /
2016-12-02 11:51:52:631 calimero.link.192.168.200.14:3671: indication from 1.2.69 /
It recovers as soon as the rcv-seq hits again the right value, so exactly after 255 telegrams recevied after the problem appeared (It overflows at 255 and starts from 0 again).
So why does it get async and shouldn't there be a recovery mechanism which resolved immediately?
Looking at that log, the server should have sent a disconnect in any case, not continue to send tunneling requests.
If I check the source of calimero, it does handle just the other way around but not if calimeros counter is behind. I just ask myself if disconnecting really helps in this situation, it obviously missed a telegram (or didn't increment the counter) and this won't come again. So in my meaning it could just follow the counter. The open question for me is if Calimero has a possible place confirming a telegram without performing the sequence counter increment properly.
BTW: I have a few Chromecasts in the network running and I have the impression that if the same are running, the problem arises more often than if not. Probably the UDP traffic is influenced what could cause more packet losts and therefore provoke this situation.
yes, a repeated frame can be ignored, a missed frame not.
I will rephrase my previous answer: the server SHALL send a disconnect in that situation; it is not optional. All subsequent requests (up to the expected one) are not further considered.
I patched Calimero with a simple provisory workaround to treat exactly this situation. The trap since then logs/triggers about 5 to 10 times a day but just works without any service interruption.
Last edit: Kusis 2016-12-04
It would be good to know why that whole situation (which requires the workaround) occurs (besides that there is some bug that needs to be fixed). Is it specific to some hardware/setup etc. ?
You are aware that such workaround basically subverts reliable in-order delivery of cEMI frames of the tunneling protocol (and is outside the specification).
Yes, I am aware of that. But a missed frame is a missed frame. If it doesn't come it is lost and even a reconnection will not make it appear again.
I don't think this is a special situation but just a use case with a lot of UDP traffice on the network and there might be from time to time a conflict and the workaround at least fixes this single situation and makes the best of the situation for the moment. Reconnect would make it worth because the telegram which came in with the "invalid" sequence number would be lost as well.
In theory this situation shouldn't be possible, so my assumption is that Calimero doesn't increase the counter somewhere where it should (or my Siemens Instabus IP Interface is faulty).
If anyone hitting this issue: seems to be fixed in v2.3 (see Github) using the workaround.
https://github.com/calimero-project/calimero-core/issues/39
If anyone is wondering why this was only integrated into Calimero as an opt-in feature, here are the clauses from the KNX specifications that Boris was referring to (Chapter 3/8/4 "Tunnelling" v01.05.03 AS, Section 2.6):
"If a KNXnet/IP Tunnelling Client receives a frame with a sequence number that is not equal to the expected sequence number and not equal to one less than the expected sequence number, the KNXnet/IP Tunnelling Client shall not reply and shall discard the received frame."
"If the KNXnet/IP device does not receive a TUNNELLING_ACK frame within the TUNNELLING_REQUEST_TIMEOUT (= 1 second) [...], the sending device shall repeat the TUNNELLING_REQUEST frame once and then terminate the connection by sending a DISCONNECT_REQUEST frame to the other device’s control endpoint."
For all we know, Calimero adheres to these rules (unless this feature toggle is activated!). Only an actual network trace of the situation that the OP observed could give proof, however.