I've found a problem when using AXIS together with HALUI, but it seems to be a general problem whenever two or more user interfaces are used at the same time.
If I use HALUI heavily, I've found that AXIS is sometimes blocked. Even more problematic for me is that in some cases the HALUI action gets lost. After investigating the involved code I've found the reason. I'll try to describe my thoughts:
The communication between UI and EMCTASK is done via shared memory buffers (cmd + status). If a command is issued, the command object is copied to the cmd buffer (including a command serial number). Then the sender (the UI) waits for the reception and optionally for the completion of the command. The wait is realized by waiting for the issued serial number to appear in echo_serial_number of the status buffer. The serial number generator is local to the UI process, and no synchronization (except for a semaphore on the buffer read/write functions) is involved. This causes two problems:
As I needed a quick (hopefully not (so) dirty) fix, I've created a patch that generates the serial number atomically on the CMS write and also configures emcCommand as a queue instead of a buffer.
I'm looking forward to your feedback.
Sascha
Sascha,
if your analysis is correct (and I do not doubt it is), your work identifies and removes a fundamental design defect from intertask communication in LinuxCNC.
I did run into the problem before in a different context: I tried to implement a remote procedure call over NML using the existing (fishy) interaction pattern, and that failed for the reasons you describe; at the time I did not drill down like you did - congratulations on the stamina in identifying and fixing this!
To verify: is the scenario as follows?
B's command gets executed and 'acknowledged'
If this is the case, then I would see more than one option to address the issue:
I am positive you are onto something fundamental here. Before adopting the patch my suggestion would be:
we need to create a test case which identifies the current problem, even if only for regression tests. This might be a bit involved given all the moving parts. It might be easier to replicate the code sequence into small test programs and verify that the overwrite happens.
the underlying problem is an unprotected write to a shared resource, and there might be more than one way to fix it (see above); it might make sense to adopt a method which is less of a wholesale change (I don't suggest one exists; just saying it might make sense to investigate whether there is a single fix to the issue at a different level)
I'm interested in how you see this.
best regards, and congratulations
Michael
*) I would think the queue alone would protect against overwriting of commands; NB the possible clash of non-unique serial numbers still exists: meaning the combination of a single command buffer, write protection while a command is present, and serial/unique-originator tuples might do the trick too. As a matter of style, I think your patch - making the command input a proper queue - is more to the point though.
Hi Michael,
yes, the scenario you describe is correct. But there is another one:
That is the reason I've chosen a global serial number. In this case A and B can wait for echo_serial_number >= local_serial_number (with respect to datatype overflow). But this is also a compromise, since you can only get the error state of the last command.
In summary we have two problems:
The correct solution would be some kind of (logical) bidirectional 1:n connection (like a TCP connection, for example). You can send a command, and only the sender gets its acknowledgement. Global state information like the current position could still reside in a global buffer, or a current copy could be requested by a special command via the connection.
I've tried to create a minimally invasive patch for the issue, so I didn't have to change the existing concept too much. The intention was to fix the current version, not to create a new approach (which might be needed). Most of the code changes are dedicated to cleaning up the command-send code. There were so many scattered serialNumber++ statements and the like that I decided to clean them up and implement the logic in single functions.
If you keep the buffer instead of the queue and implement some kind of locking (could be done with write_if_read I think) then you also need to implement some kind of retry logic. Implementing a queue on the other hand just requires a configuration change and replacing the peek() with a read().
Zmq and protobuf seem to be pretty cool components for implementing a clean solution. I don't know them in depth (yet), but what I've seen on the internet looks very promising. Zmq can build a lot of different communication relationships.
How could we proceed with the test case?
regards
Sascha
Hi Sascha,
the more I think of it, the more the lost-ack problem you describe looks like the harder of the two issues.
I am not sure this is actually possible to do with responses going through an 'NML buffer' (pretty much a shared memory region). Also, I would like to solve the problem without assuming shared memory between producers, which may or may not be the case, and could turn out rather restrictive eventually.
A clean solution not requiring a globally unique id would be:
Would you agree this solves the problem in principle? I'm not suggesting at this point it actually be done this way, I am interested in a clean design.
If that is the case, the next question is: can we coordinate consumers of acks in such a way that a response is not read by anyone other than the legitimate originator? I think it is possible: producers just peek until 'their' producer id comes up, then read.
This would shift the issue to ack production by the command consumer: an ack cannot be written to the buffer until the legitimate producer who caused the previous ack has read 'his' last ack. The write_if_read() method might be the tool for that.
How do you see this option? In terms of throughput the dual-queue solution is better because there would be no lock-step between ack production and consumption, but I am not sure it matters much. Also, EMC_STAT is a sizable structure and updated a lot, so this might have some performance impact as now several EMC_STAT copies would be in-flight in the response queue.
--
re test case: I hope a volunteer comes up, it would be great to have an independent pair of eyeballs on the issue. It's a very interesting issue, but I'm a bit pressed for time to do that right away. Since it is such a fundamental problem, and very few folks really understand the somewhat convoluted semantics of NML, it would be relevant not just as a test case but as a tutorial ;)
--
in a zmq scenario, this would be rather simple to solve: acks are pushed into a PUBLISH channel (including originator id), and any producer interested in acks subscribes to that channel. The semantics of the PUBLISH operation guarantees that no acks are lost.
Let's establish a baseline first - by applying your patch to the proper base version and making sure all regression tests run OK.
I took your patch and applied it to the current origin/v2.5_branch
the result is here: https://github.com/mhaberler/linuxcnc/tree/sittner-patch-onto-6ff9451b
I build with --enable-simulator and did a runtests
tests/linuxcncrsh hangs (the linuxcncrsh binary actually) in a busy loop:
#0 emcCommandWaitDone () at emc/usr_intf/shcom.cc:282
#1 0x080531a4 in sendManual () at emc/usr_intf/shcom.cc:501
#2 0x0804a9f0 in setMode (s=0x8b43919 "MANUAL", context=0x8b438a8) at emc/usr_intf/emcrsh.cc:885
#3 0x0804b8e1 in commandSet (context=0x8b438a8) at emc/usr_intf/emcrsh.cc:1243
#4 0x08050f8b in parseCommand (context=0x8b438a8) at emc/usr_intf/emcrsh.cc:2625
#5 0x080511b5 in readClient (arg=0x8b438a8) at emc/usr_intf/emcrsh.cc:2691
#6 0xb76b096e in start_thread (arg=0xb6c0eb70) at pthread_create.c:300
#7 0xb74e53fe in clone () at ../sysdeps/unix/sysv/linux/i386/clone.S:130
It hangs in the for loop, at the first continue:

int emcCommandWaitDone()
{
    double end;
    for (end = 0.0; emcTimeout <= 0.0 || end < emcTimeout; end += EMC_COMMAND_DELAY) {
        int serial_diff = emcStatus->echo_serial_number - emcCommandSerialNumber;
        if (serial_diff < 0) {
            continue;
        }
        emcTimeout = 0.
This result is not unexpected, as there have been various patches to linuxcncrsh recently.
Maybe we had better start from the exact same commit - which sha1 did you base this patch on?
No, it was not caused by other patches. Stupid me had just killed a call to updateStatus() and disabled the sleep while waiting when I implemented the overflow handling. Then I tested halui and axis and missed doing a runtests. Sorry. Please find the fixed patch attached.
The patch is against the released 2.5.3 (I've used this tarball: http://linuxcnc.org/emc2/dists/lucid/linuxcnc2.5/source/linuxcnc_2.5.3.tar.gz)
I've run the tests on sim + rt and they look OK.
As you pointed out, a clean solution would require a queue on both sides (I think a queue is more natural for this problem than polling a single buffer with some kind of locking).
I could imagine something like this:
I think a global ack queue would be problematic, since if one of the consumers fails to pick up its message for some reason, the whole system will block. But the method I mentioned above is also problematic: how do we create the queues on demand when a new client comes up? And when do we remove them, and how do we sync with the partner?
And both of them have two problems in common:
Sascha
Hi Michael,
I just read the posts on the devel list about this topic, and only now do I understand the problem with the global serial number in conjunction with tcpmem. Until now my picture was that tcpmem transparently does an atomic read/modify/write of the remote buffer, but investigating the code shows that this is wrong. From a technical point of view it seems very easy to make this work (please see my attached patch, just as a base for discussion; it's completely untested).
Since it changes the protocol, a finished patch has to handle the different versions correctly, if possible.
But I've seen some inconsistent use of datatypes (int/long) and some missing usage of network byte order macros in tcp_srv.cc. Maybe we should have a closer look at this.
Should I rename all my added serial_number vars to something like user_serial_number (just to avoid confusion with the serial_number used internally by cms/nml)?
Sascha
Michael,
I tried to set up a test environment for testing my patch from yesterday, but can't get it working (even without any patch). Are there any open issues with NML over TCP?
My first attempt show this on server side:
libnml/cms/tcp_srv.cc 1494: Unrecognized request type received.(65536)
libnml/cms/tcp_srv.cc 1494: Unrecognized request type received.(131072)
and this on client side:
emc/usr_intf/xemc.cc 4701: can't establish communication with emc
Both hosts are 2.6.32-122-rtai (i386)
My first bet: this looks like a byte order problem to me.
Let me try to write down some (woolly?) Sunday morning thoughts on possible future implementations:
What do you think about this?
regards
Sascha
Hallo Sascha,
I agree that a decent RPC pattern is what we need to plug this. And we need to plug this - it is such a fundamental flaw.
The thing I am not sure about - and I sense similar ideas in your post: does it make sense to blunder down the NML lane even further? It might be coerced to do what we need, but the solution won't live longer than a year at most; by then I should be done ripping out NML completely and have the result usable and merged. That will entail replacing both transport and serialisation.
It depends a bit on the timeframe when this should be fixed. If we can hold off for say 3-6 months and you can live with your patch for the time being my preferred solution would be this:
This could be done like so:
The reason why I said 'introduce zmq early' is that I had planned it the other way around: complete the encoding/serialisation/deserialisation of the protobuf messages and go from there. But the more I think of it, retaining the NML message format for the time being and switching transports underneath first is a better incremental step, which can be regression-tested piecemeal.
Interesting - a bug report mutating into a migration strategy ;)
I wouldn't like to switch transports to yet something else at this point, like a RESTful API; I get your idea, but I think the protobuf method will work out better in the end because it can work end-to-end (I have demo RT modules which can decode/encode protobuf messages even in-kernel if needed, and that gets rid of the silly transcoding in usrmotintf.cc completely). Note that at the UI side, we have automatic, schema-driven protobuf->JSON->protobuf translation in place, and bindings for C, C++, and Python. So it should be rather easy to slap on some JS Web UI to generate and understand the JSON formats of the future message encoding.
So really question #1 is - can you hold out without that patch going in as-is until fall, maybe winter?
Last edit: Michael Haberler 2013-08-05
Since this discussion has clearly moved from the original bug report (which may not have a simple and clear solution) to brainstorming linuxcnc design, please consider moving this discussion to the developers mailing list. I can't guarantee that it will generate more discussion but I suspect hardly anybody is subscribed to this bug and reading this discussion.
You are right Jeff, we've mixed up discussion about the bug and brainstorming. I'll post my thoughts to the list.
To close out this bug, let me just publish the final version of my patch.
It's certainly OK for me if the patch doesn't go into the official code, because I've integrated it as one of a series of patches in my build system. I'd just like to post it here in case it could help anybody else.
Thank you guys, this last patch is working perfectly. We had this problem for a long time and couldn't finish our control panel. I'm sending you a photo of our CNC control panel.
I have rebased the original SF#328 patch onto the tip of the LinuxCNC
master branch.
This work can be found on the tip of the branch 'jepler/master/sf328' on
git.linuxcnc.org.
My patch should be like 'fix-emccommand-queue-fix1.patch', though I
started with 'fix-emccommand-queue.patch' and found and re-added the
missing call myself.
I did not attempt to incorporate any form of the tcp serial number
fixes.
For me, this resolved the #395 bug as well, and all runtests still pass.
If you're affected by #328 or #395 please test with my branch and report
your results.
Since so few people are using NML over TCP in LinuxCNC, I believe that
we should incorporate a version of this patch in our master branch even
if it totally breaks NML over TCP, because 328/395 affect ALL of our
users.
(I would have said 'no users' but back in April, W. Martinjak (matsche)
indicated that he was currently using it successfully between 32-bit
machines and with limited success after some changes between 32-bit and
64-bit machines)
In a separate e-mail I've brought EBo (who wrote those patches) and
matsche (who reported using NML-over-TCP) into the loop about the #328
patch. I haven't tried the later patch which intends to add reliable
global serial numbers to TCP.
Jeff
Even though the patch causes known problems for users of nml-over-tcp, we have incorporated it into the 2.7 branch of linuxcnc. Thank you again for your work on this patch.