Menu

Calimero stopps receiving telegrams for some time

Community
Kusis
2016-11-27
2016-12-04
  • Kusis

    Kusis - 2016-11-27

    I use Calimero since a long time and updated recently to version 2.2.1. Since then I exprerienced the problem that the library stopps receiving telegrams after some time. Some facts.
    -I see the telegrams coming in with Wireshark properly, the link works but Calimero is just silent, it doesn't fire any events.
    -There is no obvious situation when this happens, I can't find any reason so far. The time seems random, sometimes days, sometimes minutes.
    -After some time Calimero starts receiving telegrams again, as well this time is random but usually about 10 - 30 minutes.
    -There is definitely nothing blocked or any Exception thrown. All is properly catched.
    -There is no Thread blocked, I did take ThreadDumps
    -During the blocked period, Telegrams could be still sent to the bus without any problem.
    -OS Windows 7 / Java 1.8 / Calimero 2.2.1

    I wasn't able to activate debug or trace log level for Calimero in the log4j configuration. Is there somewhere any instruction about this?

    Any advice how to debug this strange situation is highly appreciated.

     
    • Boris Malinowsky

      What Calimero version did you use before?
      You connect via KNXnet/IP tunneling, right?

      Calimero itself does not look for log4j configurations, nor depend on log4j. So I cannot really know how the logging setup of your application works. Do you see warn/info etc. messages from Calimero using your configuration, or nothing at all?

       
      • Kusis

        Kusis - 2016-11-28

        Yes, KNXnet/IP tunneling, sorry forgot to mention.

        I don't see any log output from calimero, nowhere. Actually I didn't found how to configure loglevel for calimero yet.

        I used for several years version 1.4.0 without problems. Since migrating to 2.2.1 I experienced this behaviour.
        I tested now as well with 2.3-beta, same behaviour.

        Are there any limitations in speed of sending telegrams, I saw in the logs that in the cases where it locks up there are telegrams sent in a fast order (eg. 3 telegrams within just a few ms)?

         
        • Boris Malinowsky

          You can see a small code tutorial on how logging works here.

          There are several ways Calimero provides its log output (file, network, etc.), but I think that you did not pipe the output into log4j at all. Could that be? For log4j 2, you might simply do:

          import org.apache.logging.log4j.Level;
          import org.apache.logging.log4j.LogManager;
          
          import tuwien.auto.calimero.log.LogLevel;
          import tuwien.auto.calimero.log.LogWriter;
          
          // Writer that forwards everything to log4j 2
          class Log4jWriter extends LogWriter {
              @Override public void write(final String logService, final LogLevel level, final String msg) {
                  write(logService, level, msg, null);
              }
          
              @Override public void write(final String logService, final LogLevel level, final String msg, final Throwable t) {
                  LogManager.getLogger(logService).log(Level.toLevel(level.toString(), Level.INFO), msg, t);
              }
          
              @Override public void flush() {}
              @Override public void close() {}
          }
          

          Add your log4j writer in Calimero, telling it to capture everything:

          LogManager.getManager().addWriter("", new Log4jWriter());
          

          Now you can configure the complete output with log4j properties etc.

          From v1.4, the changes are considerable, so I would start with checking the log first.

          Re. speed limitations: all tunneled knx telegrams are acknowledged, so in some sense the receiver processing speed tells the sender how fast to send. I get around 400 us for a single write (using process communication), so your values seem quite ok.

           
  • Kusis

    Kusis - 2016-12-02

    Thanks, with this info I was now able to get it logging properly.

    2016-12-02 11:11:41:007 calimero.link.192.168.200.14:3671: indication from 1.2.29 /
    2016-12-02 11:11:44:448 calimero.link.192.168.200.14:3671: indication from 1.2.29 /
    2016-12-02 11:11:53:316 calimero.link.192.168.200.14:3671: indication from 1.2.29 /
    2016-12-02 11:11:53:521 calimero.link.192.168.200.14:3671: indication from 1.2.5 /
    2016-12-02 11:12:06:363 KNXnet/IP Tunneling 192.168.200.14:3671: tunneling request with invalid rcv-seq 117, expected 116 /
    2016-12-02 11:12:06:363 KNXnet/IP Tunneling 192.168.200.14:3671: skip tunneling request with rcv-seq 117 (already received) /
    2016-12-02 11:12:07:363 KNXnet/IP Tunneling 192.168.200.14:3671: tunneling request with invalid rcv-seq 117, expected 116 /
    2016-12-02 11:12:07:363 KNXnet/IP Tunneling 192.168.200.14:3671: skip tunneling request with rcv-seq 117 (already received) /
    2016-12-02 11:12:12:498 KNXnet/IP Tunneling 192.168.200.14:3671: tunneling request with invalid rcv-seq 118, expected 116 /
    2016-12-02 11:12:12:498 KNXnet/IP Tunneling 192.168.200.14:3671: skip tunneling request with rcv-seq 118 (already received) /
    2016-12-02 11:12:13:055 KNXnet/IP Tunneling 192.168.200.14:3671: sending connection state request, attempt 1 /
    2016-12-02 11:12:13:498 KNXnet/IP Tunneling 192.168.200.14:3671: tunneling request with invalid rcv-seq 118, expected 116 /
    2016-12-02 11:12:13:498 KNXnet/IP Tunneling 192.168.200.14:3671: skip tunneling request with rcv-seq 118 (already received) /
    2016-12-02 11:12:15:499 KNXnet/IP Tunneling 192.168.200.14:3671: tunneling request with invalid rcv-seq 119, expected 116 /
    2016-12-02 11:12:15:499 KNXnet/IP Tunneling 192.168.200.14:3671: skip tunneling request with rcv-seq 119 (already received) /

    [ SKIPPED up to recovery section]

    2016-12-02 11:51:31:361 KNXnet/IP Tunneling 192.168.200.14:3671: skip tunneling request with rcv-seq 113 (already received) /
    2016-12-02 11:51:32:360 KNXnet/IP Tunneling 192.168.200.14:3671: tunneling request with invalid rcv-seq 113, expected 116 /
    2016-12-02 11:51:32:361 KNXnet/IP Tunneling 192.168.200.14:3671: skip tunneling request with rcv-seq 113 (already received) /
    2016-12-02 11:51:33:361 KNXnet/IP Tunneling 192.168.200.14:3671: tunneling request with invalid rcv-seq 114, expected 116 /
    2016-12-02 11:51:33:361 KNXnet/IP Tunneling 192.168.200.14:3671: skip tunneling request with rcv-seq 114 (already received) /
    2016-12-02 11:51:34:360 KNXnet/IP Tunneling 192.168.200.14:3671: tunneling request with invalid rcv-seq 114, expected 116 /
    2016-12-02 11:51:34:361 KNXnet/IP Tunneling 192.168.200.14:3671: skip tunneling request with rcv-seq 114 (already received) /
    2016-12-02 11:51:52:597 KNXnet/IP Tunneling 192.168.200.14:3671: skip tunneling request with rcv-seq 115 (already received) /
    2016-12-02 11:51:52:631 calimero.link.192.168.200.14:3671: indication from 1.2.69 /

    It recovers as soon as the rcv-seq hits again the right value, so exactly after 255 telegrams recevied after the problem appeared (It overflows at 255 and starts from 0 again).

    So why does it get async and shouldn't there be a recovery mechanism which resolved immediately?

     
    • Boris Malinowsky

      Looking at that log, the server should have sent a disconnect in any case, not continue to send tunneling requests.

       
  • Kusis

    Kusis - 2016-12-04

    If I check the source of calimero, it does handle just the other way around but not if calimeros counter is behind. I just ask myself if disconnecting really helps in this situation, it obviously missed a telegram (or didn't increment the counter) and this won't come again. So in my meaning it could just follow the counter. The open question for me is if Calimero has a possible place confirming a telegram without performing the sequence counter increment properly.

    BTW: I have a few Chromecasts in the network running and I have the impression that if the same are running, the problem arises more often than if not. Probably the UDP traffic is influenced what could cause more packet losts and therefore provoke this situation.

     
    • Boris Malinowsky

      [calimero] does handle just the other way around but not if calimeros counter is behind

      yes, a repeated frame can be ignored, a missed frame not.

      I will rephrase my previous answer: the server SHALL send a disconnect in that situation; it is not optional. All subsequent requests (up to the expected one) are not further considered.

       
  • Kusis

    Kusis - 2016-12-04

    I patched Calimero with a simple provisory workaround to treat exactly this situation. The trap since then logs/triggers about 5 to 10 times a day but just works without any service interruption.

        protected boolean handleServiceType(final KNXnetIPHeader h, final byte[] data, final int offset,
            final InetAddress src, final int port) throws KNXFormatException, IOException
        {
            if (super.handleServiceType(h, data, offset, src, port))
                return true;
            final int svc = h.getServiceType();
            if (svc != serviceRequest)
                return false;
    
            final ServiceRequest req = getServiceRequest(h, data, offset);
            if (!checkChannelId(req.getChannelID(), "request"))
                return true;
    
            final int seq = req.getSequenceNumber();
            // ************** Start Workaround to treat missed telegram problem
            if (seq==((getSeqRcv()+1) & 0xFF)){
                // ups, we are one behind, don't know why, we missed it somewhere to increment or lost it.
                // Fix this by just following the real counter again
                logger.error("tunneling request with larger rcv-seq " + seq + ", expected "
                        + getSeqRcv() + " -> fix this nasty situation");
                incSeqRcv();
            }
            // ************** End Workaround to treat missed telegram problem       
            if (seq == getSeqRcv() || ((seq + 1) & 0xFF) == getSeqRcv()) {
    
     

    Last edit: Kusis 2016-12-04
    • Boris Malinowsky

      It would be good to know why that whole situation (which requires the workaround) occurs (besides that there is some bug that needs to be fixed). Is it specific to some hardware/setup etc. ?

      You are aware that such workaround basically subverts reliable in-order delivery of cEMI frames of the tunneling protocol (and is outside the specification).

       
      • Kusis

        Kusis - 2016-12-04

        Yes, I am aware of that. But a missed frame is a missed frame. If it doesn't come it is lost and even a reconnection will not make it appear again.
        I don't think this is a special situation but just a use case with a lot of UDP traffice on the network and there might be from time to time a conflict and the workaround at least fixes this single situation and makes the best of the situation for the moment. Reconnect would make it worth because the telegram which came in with the "invalid" sequence number would be lost as well.

        In theory this situation shouldn't be possible, so my assumption is that Calimero doesn't increase the counter somewhere where it should (or my Siemens Instabus IP Interface is faulty).

         
  • D.L.

    D.L. - 2017-03-01

    If anyone hitting this issue: seems to be fixed in v2.3 (see Github) using the workaround.
    https://github.com/calimero-project/calimero-core/issues/39

     
    • Georg Neugschwandtner

      If anyone is wondering why this was only integrated into Calimero as an opt-in feature, here are the clauses from the KNX specifications that Boris was referring to (Chapter 3/8/4 "Tunnelling" v01.05.03 AS, Section 2.6):

      "If a KNXnet/IP Tunnelling Client receives a frame with a sequence number that is not equal to the expected sequence number and not equal to one less than the expected sequence number, the KNXnet/IP Tunnelling Client shall not reply and shall discard the received frame."

      "If the KNXnet/IP device does not receive a TUNNELLING_ACK frame within the TUNNELLING_REQUEST_TIMEOUT (= 1 second) [...], the sending device shall repeat the TUNNELLING_REQUEST frame once and then terminate the connection by sending a DISCONNECT_REQUEST frame to the other device’s control endpoint."

      For all we know, Calimero adheres to these rules (unless this feature toggle is activated!). Only an actual network trace of the situation that the OP observed could give proof, however.

       

Log in to post a comment.