I've found a problem when using AXIS together with HALUI, but it seems to be a general problem whenever two or more user interfaces are used at the same time.
If I use HALUI heavily, I've found that AXIS is sometimes blocked. Even more problematic for me is that in some cases the HALUI action gets lost. After investigating the involved code I've found the reason. I'll try to describe my thoughts:
The communication between UI and EMCTASK is done via shared memory buffers (cmd + status). If a command is issued, the command object is copied to the cmd buffer (including a command serial number). Then the sender (the UI) waits for the reception and optionally for the completion of the command. The wait is realized by waiting for the issued serial number to appear in echo_serial_number of the status buffer. The serial number generator is local to the UI process, and no synchronization (except for a semaphore on the buffer read/write functions) is involved. This causes two problems:
As I needed a quick (hopefully not (so) dirty) fix, I've created a patch that generates the serial number atomically on the CMS write and also configures emcCommand as a queue instead of a buffer.
I'm looking forward to your feedback.
Sascha
Sascha,
if your analysis is correct (and I do not doubt it is), your work identifies and removes a fundamental design defect from intertask communication in LinuxCNC.
I did run into the problem before in a different context: I tried to implement a remote procedure call over NML using the existing (fishy) interaction pattern, and that failed for the reasons you describe; at the time I did not drill down like you did - congratulations on the stamina in identifying and fixing this!
To verify: is the scenario as follows?
B's command gets executed and 'acknowledged'
If this is the case, then I would see more than one option to address the issue:
I am positive you are onto something fundamental here. Before adopting the patch my suggestion would be:
we need to create a test case which identifies the current problem, even if only for regression tests. This might be a bit involved given all the moving parts. It might be easier to replicate the code sequence into small test programs and verify that the overwrite happens.
the underlying problem is an unprotected write to a shared resource, and there might be more than one way to fix it (see above); it might make sense to adopt a method which is less of a wholesale change (I don't suggest one exists; just saying it might make sense to investigate whether there is a single fix to the issue at a different level)
I'm interested in how you see this.
best regards, and congratulations
Michael
*) I would think the queue alone would protect against overwriting of commands; NB the possible clash of non-unique serial numbers still exists: meaning the combination of a single command buffer, write protection while a command is present, and serial/unique-originator tuples might do the trick too. As a matter of style, I think your patch - making the command input a proper queue - is more to the point though.
Hi Michael,
yes, the scenario you describe is correct. But there is another one:
That is the reason I've chosen a global serial number. In this case A and B can wait for echo_serial_number >= local_serial_number (with respect to datatype overflow). But this is also a compromise, since you can only get the error state of the last command.
In summary we have two problems:
The correct solution would be some kind of (logical) bidirectional 1:n connection (like a TCP connection, for example). You can send a command, and only the sender gets its acknowledgement. Global state information like the current position could still reside in a global buffer, or a current copy could be requested by a special command via the connection.
I've tried to create a minimally invasive patch for the issue, so I didn't have to change the existing concept too much. The intention was to fix the current version, not to create a new approach (which might be needed). Most of the code changes are dedicated to cleaning up the command-send code. There were so many scattered serialNumber++ statements and the like that I decided to clean them up and implement the logic in single functions.
If you keep the buffer instead of the queue and implement some kind of locking (could be done with write_if_read I think) then you also need to implement some kind of retry logic. Implementing a queue on the other hand just requires a configuration change and replacing the peek() with a read().
Zmq and protobuf seem to be pretty cool components for implementing a clean solution. I don't know them in depth (yet), but what I've seen on the internet looks very promising. Zmq can build a lot of different communication relationships.
How could we proceed with the test case?
regards
Sascha
Hi Sascha,
the more I think of it, the more the lost-ack problem you describe looks like the harder of the two issues.
I am not sure this is actually possible to do with responses going through an 'NML buffer' (pretty much a shared memory region). Also, I would like to solve the problem without assuming shared memory between producers, which may or may not be the case, and could turn out rather restrictive eventually.
A clean solution not requiring a globally unique id would be:
Would you agree this solves the problem in principle? I'm not suggesting at this point it actually be done this way, I am interested in a clean design.
If that is the case, the next question is: can we coordinate consumers of acks in such a way that a response is not read by anyone other than the legitimate originator? I think it is possible: producers just peek until 'their' producer id comes up, then read.
This would shift the issue to ack production by the command consumer: an ack cannot be written to the buffer until the legitimate producer who caused the previous ack has read 'his' last ack. The write_if_read() method might be the tool for that.
How do you see this option? In terms of throughput the dual-queue solution is better because there would be no lock-step between ack production and consumption, but I am not sure it matters much. Also, EMC_STAT is a sizable structure and updated a lot, so this might have some performance impact as now several EMC_STAT copies would be in-flight in the response queue.
--
re test case: I hope a volunteer comes up, it would be great to have an independent pair of eyeballs on the issue. It's a very interesting issue, but I'm a bit pressed for time to do that right away. Since it is such a fundamental problem, and very few folks really understand the somewhat convoluted semantics of NML, it would be relevant not just as a test case but as a tutorial ;)
--
in a zmq scenario, this would be rather simple to solve: acks are pushed into a PUBLISH channel (including originator id), and any producer interested in acks subscribes to that channel. The semantics of the PUBLISH operation guarantees that no acks are lost.
Let's establish a baseline first - by applying your patch to the proper base version and making sure all regression tests run OK.
I took your patch and applied it to the current origin/v2.5_branch
the result is here: https://github.com/mhaberler/linuxcnc/tree/sittner-patch-onto-6ff9451b
I build with --enable-simulator and did a runtests
tests/linuxcncrsh hangs (the linuxcncrsh binary actually) in a busy loop:
#0 emcCommandWaitDone () at emc/usr_intf/shcom.cc:282
#1 0x080531a4 in sendManual () at emc/usr_intf/shcom.cc:501
#2 0x0804a9f0 in setMode (s=0x8b43919 "MANUAL", context=0x8b438a8) at emc/usr_intf/emcrsh.cc:885
#3 0x0804b8e1 in commandSet (context=0x8b438a8) at emc/usr_intf/emcrsh.cc:1243
#4 0x08050f8b in parseCommand (context=0x8b438a8) at emc/usr_intf/emcrsh.cc:2625
#5 0x080511b5 in readClient (arg=0x8b438a8) at emc/usr_intf/emcrsh.cc:2691
#6 0xb76b096e in start_thread (arg=0xb6c0eb70) at pthread_create.c:300
#7 0xb74e53fe in clone () at ../sysdeps/unix/sysv/linux/i386/clone.S:130
It hangs in the for loop, at the first continue:

int emcCommandWaitDone()
{
    double end;
    for (end = 0.0; emcTimeout <= 0.0 || end < emcTimeout; end += EMC_COMMAND_DELAY) {
        int serial_diff = emcStatus->echo_serial_number - emcCommandSerialNumber;
        if (serial_diff < 0) {
            continue;
        }
        emcTimeout = 0.
This result is not unexpected, as there have been various patches to linuxcncrsh recently.
Maybe we had better start from the exact same commit - which sha1 did you base this patch on?
No, it was not caused by other patches. Stupid me had just killed a call to updateStatus() and disabled the sleep while waiting when I implemented the overflow handling. Then I tested halui and axis and missed doing a runtests. Sorry. Please find the fixed patch attached.
The patch is against the released 2.5.3 (I've used this tarball: http://linuxcnc.org/emc2/dists/lucid/linuxcnc2.5/source/linuxcnc_2.5.3.tar.gz)
I've run the tests on sim + rt and they look OK.
As you pointed out, a clean solution would require a queue on both sides (I think a queue is more natural for this problem than polling a single buffer with some kind of locking).
I could imagine something like this:
I think a global ack queue would be problematic, since if one of the consumers fails to pick up its message for some reason, the whole system will block. But the method I mentioned above is also problematic: how do we create the queues on demand when a new client comes up? And when do we remove them, and how do we sync with the partner?
And both of them have two problems in common:
Sascha
Hi Michael,
I just read the posts on the devel list about this topic, and only now do I understand the problem with the global serial number in conjunction with tcpmem. Until now my picture was that tcpmem transparently does an atomic read/modify/write of the remote buffer, but investigating the code shows that this is wrong. From a technical point of view it seems very easy to make this work (please see my attached patch, just as a base for discussion; it's completely untested).
Since it changes the protocol, a finished patch has to handle the different versions correctly, if possible.
But I've seen some inconsistent use of datatypes (int/long) and some missing usage of network byte order macros in tcp_srv.cc. Maybe we should have a closer look at this.
Should I rename all my added serial_number vars to something like user_serial_number (just to avoid confusion with the serial_number used internally by cms/nml)?
Sascha
Michael,
I tried to set up a test environment for testing my patch from yesterday, but can't get it working (even without any patch). Are there any open issues with NML over TCP?
My first attempt show this on server side:
libnml/cms/tcp_srv.cc 1494: Unrecognized request type received.(65536)
libnml/cms/tcp_srv.cc 1494: Unrecognized request type received.(131072)
and this on client side:
emc/usr_intf/xemc.cc 4701: can't establish communication with emc
Both hosts are 2.6.32-122-rtai (i386)
My first bet: this looks like a byte order problem to me.
Let me try to write down some (woolly?) Sunday morning thoughts on possible future implementations:
What do you think about this?
regards
Sascha
Hallo Sascha,
I agree that a decent RPC pattern is what we need to plug this. And we need to plug this - it is such a fundamental flaw.
The thing I am not sure about - and I sense similar ideas in your post: does it make sense to blunder down the NML lane even further? It might be coerced to do what we need, but the solution won't live longer than a year at most; by then I should be done ripping out NML completely and have the result usable and merged. That will entail replacing both transport and serialisation.
It depends a bit on the timeframe when this should be fixed. If we can hold off for say 3-6 months and you can live with your patch for the time being my preferred solution would be this:
This could be done like so:
The reason why I said 'introduce zmq early' is that I had planned it the other way around: complete the encoding/serialisation/deserialisation of the protobuf messages and go from there. But the more I think of it, retaining the NML message format for the time being and switching transports underneath first is a better incremental step, which can be regression-tested piecemeal.
Interesting - a bug report mutating into a migration strategy ;)
I wouldn't like to switch transports to yet something else at this point, like a RESTful API; I get your idea, but I think the protobuf method will work out better in the end because it can work end-to-end (I have demo RT modules which can decode/encode protobuf messages even in-kernel if needed, and that gets rid of the silly transcoding in usrmotintf.cc completely). Note that at the UI side, we have automatic, schema-driven protobuf->JSON->protobuf translation in place, and bindings for C, C++, and Python. So it should be rather easy to slap on some JS Web UI to generate and understand the JSON formats of the future message encoding.
So really question #1 is - can you hold out without that patch going in as-is until fall, maybe winter?
Last edit: Michael Haberler 2013-08-05
Since this discussion has clearly moved from the original bug report (which may not have a simple and clear solution) to brainstorming linuxcnc design, please consider moving this discussion to the developers mailing list. I can't guarantee that it will generate more discussion but I suspect hardly anybody is subscribed to this bug and reading this discussion.
You are right Jeff, we've mixed up discussion about the bug and brainstorming. I'll post my thoughts to the list.
To close out this bug, let me just publish the final version of my patch.
It's certainly OK for me if the patch doesn't go into the official code, because I've integrated it as one of a series of patches in my build system. I'd just like to post it here in case it could help anybody else.
Thank you guys, this last patch is working perfectly. We had this problem for a long time and couldn't finish our control panel. I'm sending you a photo of our CNC control panel.
I have rebased the original SF#328 patch onto the tip of the LinuxCNC
master branch.
This work can be found on the tip of the branch 'jepler/master/sf328' on
git.linuxcnc.org.
My patch should be like 'fix-emccommand-queue-fix1.patch', though I
started with 'fix-emccommand-queue.patch' and found and re-added the
missing call myself.
I did not attempt to incorporate any form of the tcp serial number
fixes.
For me, this resolved the #395 bug as well, and all runtests still pass.
If you're affected by #328 or #395 please test with my branch and report
your results.
Since so few people are using NML over TCP in LinuxCNC, I believe that
we should incorporate a version of this patch in our master branch even
if it totally breaks NML over TCP, because 328/395 affect ALL of our
users.
(I would have said 'no users' but back in April, W. Martinjak (matsche)
indicated that he was currently using it successfully between 32-bit
machines and with limited success after some changes between 32-bit and
64-bit machines)
In a separate e-mail I've brought EBo (who wrote those patches) and
matsche (who reported using NML-over-TCP) into the loop about the #328
patch. I haven't tried the later patch which intends to add reliable
global serial numbers to TCP.
Jeff
Even though the patch causes known problems for users of nml-over-tcp, we have incorporated it into the 2.7 branch of linuxcnc. Thank you again for your work on this patch.