From: Jim H. <ji...@ov...> - 2023-09-22 13:35:54
Hi Everyone,

Spoiler: Proposed fix for flaky behavior of an Agilent 82357B. Testers wanted.

Under unusual conditions the 82357B returns an EBUS timeout for all commands until it is unplugged and plugged back in. In particular, the problem depends on the last character of the last message received. It doesn't happen in the normal case, where the read buffer is large enough to hold the whole response and the last character is a newline. The accompanying test program reads a byte at a time and can recreate the problem.

If you have an Agilent 82357B and an instrument that responds to commands, please run the attached program and send the output, to help determine whether this is a common problem or one specific to my board.

The long version:

* The problem *

I was trying to use my HP 82357B to read the calibration data from my HP3478A meter. It failed with EBUS. I posted here on July 26th and have been working with Dave Penkler to debug the problem since then. It wasn't clear whether this was flaky hardware or a problem with the linux-gpib agilent_82357a driver.

The calibration data of the HP3478A can be read a nibble at a time by sending a two-character command consisting of a 'W' followed by a byte with the 8-bit address of the location to be read from the RAM. Tom Verbeure has a good description of the process here:
https://tomverbeure.github.io/2022/12/02/HP3478A-Multimeter-Calibration-Data-Backup-and-Battery-Replacement.html

The calibration data is returned as a single byte with values in the range 0x40-0x4f, where the lower 4 bits are the value of the addressed nibble.

The EBUS failure only happened for some addresses. A 'W1' command always worked, but sending a 'W4' command caused the next command to receive an EBUS failure. Once the system returned an EBUS, the only way to recover was to unplug the HP 82357B USB cable and plug it back in. Further testing showed that it was the returned nibble that predicted the failure.
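
The calibration read described above (a 'W' plus an address byte, answered by one byte in 0x40-0x4f) can be sketched in a few lines of Python. This is only an illustration of the encoding; the helper names build_read_command and decode_nibble are made up for this sketch and are not part of linux-gpib or of the attached programs.

```python
def build_read_command(address: int) -> bytes:
    """Build the two-byte command: 'W' followed by the 8-bit RAM address."""
    if not 0 <= address <= 0xFF:
        raise ValueError("address must fit in 8 bits")
    return b"W" + bytes([address])


def decode_nibble(response: bytes) -> int:
    """Return the 4-bit value carried by a single-byte reply in 0x40-0x4f."""
    if len(response) != 1 or not 0x40 <= response[0] <= 0x4F:
        raise ValueError(f"unexpected response: {response!r}")
    return response[0] & 0x0F


# 'W4' is 'W' followed by address 0x34 (ASCII '4').
print(build_read_command(0x34))   # b'W4'
# A reply of 'F' (0x46) carries the nibble value 6.
print(decode_nibble(b"F"))        # 6
```
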
Here is the table:

    @ABC HIJK - The next read works.
    DEFG LMNO - The next read fails with EBUS.

So it was the DI[2] bit (value 0x04 of the returned character) which determined the result.

* The Debug *

I had fun chasing this problem. I started with printks in the driver and determined that the EBUS was the result of agilent_82357a_take_control failing. This function is supposed to assert ATN to signal that the bytes which follow are command bytes. This is done automatically when a command is written, or it can be done with a user call to ibcac(). The take_control function sends a command to the GPIB bus control logic in the 82357B and then polls, waiting for ATN to be asserted. In the failure case it timed out waiting for ATN.

I used my logic analyzer to capture the GPIB activity, Wireshark and usbmon to capture the USB traffic, and Linux ftrace because I got tired of adding printks. Finally I hooked the logic analyzer to the TMS9914 chip which implements the GPIB control logic in the 82357B.

* History related problems *

We knew that there had been problems with the agilent_82350b driver which involved the take_control path. The issue involves the synchronous option to take_control. This option determines whether the GPIB controller in the 82357B waits for (synchronizes with) a pending I/O before asserting ATN. If sync is 1 the controller waits. If sync is 0 the controller just asserts ATN.

The default command path calls ibcac() in the common portion of the gpib driver with sync=1. There is logic to try to recover from a failure to assert ATN by trying a second time with sync=0. It also checks whether the ATN bit is already set in ibsta, and if so avoids calling the take_control function.

The problem with the 82350B, and potentially all TMS9914-based designs, is that once it has tried to do a take_control synchronously it is stuck. A subsequent call with sync=0 doesn't work. The workaround for the 82350B is to have the take_control function return a timeout error if it is called with sync=1.
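
To make the retry path easier to follow, here is a toy Python model of the logic just described. It is not the driver's code: model_ibcac and ToyChip are invented names, and the chip behaviour (a synchronous attempt never asserts ATN and wedges later attempts) is an assumption taken from the description of the 82350B above.

```python
from typing import Callable, List, Tuple


def model_ibcac(atn_already_set: bool,
                take_control: Callable[[bool], bool]) -> Tuple[List[str], bool]:
    """Toy model of the retry logic; returns the attempts made and success."""
    attempts: List[str] = []
    if atn_already_set:
        return attempts, True            # ATN bit set in ibsta: skip take_control
    attempts.append("sync=1")
    if take_control(True):
        return attempts, True
    attempts.append("sync=0")            # recovery: retry asynchronously
    return attempts, take_control(False)


class ToyChip:
    """Assumed behaviour: a synchronous attempt fails and wedges the chip."""

    def __init__(self) -> None:
        self.wedged = False

    def take_control(self, sync: bool) -> bool:
        if sync:
            self.wedged = True
            return False
        return not self.wedged

    def take_control_with_workaround(self, sync: bool) -> bool:
        if sync:
            return False                 # report a timeout, leave the chip alone
        return self.take_control(False)


# Without the workaround the asynchronous retry also fails.
print(model_ibcac(False, ToyChip().take_control))
# With the workaround the retry with sync=0 succeeds.
print(model_ibcac(False, ToyChip().take_control_with_workaround))
```

In this model the workaround helps only because the synchronous request never reaches the chip, so the asynchronous retry still finds it in a usable state.
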
ibcac() then calls back with sync=0, avoiding the problem. If the call comes from the user-space library ibcac(), it does not fall back to asynchronous (contrary to the documentation); a request with sync=1 simply fails. Does it return failure if the ATN bit is falsely already set in ibsta? No, because if ATN is not asserted by the controller, the firmware will initiate an asynchronous take control (TCA) on the chip before sending the command bytes.

* Keysight to the rescue *

I also tried to reproduce the problem with the Keysight IO libraries. They have a version for Linux. It works with PyVISA, and I was able to send the 'W1' and 'W4' commands to the HP3478A, but it always returned the front panel display value rather than the single-byte calibration RAM response I had expected. I put some effort into diagnosing this new problem and believe that it was the result of the IO libraries always sending a secondary address of 0. So the PyVISA identifier GPIB::22::INSTR was interpreted as GPIB::22::0::INSTR. I could see this on the GPIB bus using my logic analyzer, and I could see it in the usbmon trace using Wireshark.

The Keysight IO libraries consist of a proprietary user space and an open source device driver. I had hoped to see whether the Keysight software initialized the 82357B the same way linux-gpib does. I was able to capture the initialization from the USB bus using Wireshark. I tried using the same initialization sequence in the linux-gpib driver and still had the same EBUS failure.

I wrote a Python script which allowed me to use the Agilent driver directly, bypassing the IO libraries. This let me send the 'W4' command to the HP3478A. Subsequent commands worked. I also found that the Keysight code always used the asynchronous take control command to assert ATN.

* Enlightenment *

I shared the captured traces with Dave, and he figured out that the character being read determined whether the take_control function was being called with sync=0 or sync=1.
This is the result of another shortcut in the common gpib ibcac() function, which only uses the sync option if the board is a listener on the GPIB bus. It determines this by checking the LACS status bit. The agilent_82357a driver gets this status from the TMS9914 ADSR register. In the case of my Agilent 82357B, this register was corrupted with the last character of the last message received.

In the failure case with my HP3478A, issuing a 'W4' command results in location 0x34 being read, and the single character 'F' is returned. This character was also being returned for reads of the ADSR register. This corruption of the ADSR register caused the common gpib ibcac() to call agilent_82357a_take_control() with sync=1. This triggers a failure similar to the problem previously seen with the Agilent 82350B: the AUX_TCS is sent but the ATN signal is never asserted.

* Can you reproduce the failure at home? *

From the start I was not sure whether this was a broken Agilent 82357B or a problem with the driver. We still don't know. We would like other Agilent 82357 owners to test.

You don't need an HP3478A to reproduce this problem. Since the problem depends on the last character received, it is usually masked, because a '\n' doesn't cause the failure. In normal use the receive buffer is large enough to hold the whole response and the problem is avoided. The problem is easy to reproduce if you read a character at a time. The attached Python script takes two arguments: the device number and a command. Here is an example using my HP6642 power supply:

    jhouston@linux-gpib:~$ python3 byte_read.py 4 volt?
    b'0' 256
    b'.' 256
    b'0' 256
    b'E' 256
    InternalReceiveSetup: command failed
    Traceback (most recent call last):
      File "byte_read.py", line 12, in <module>
        ch = inst.read(1)
      File "/usr/local/lib/python3.6/dist-packages/Gpib.py", line 59, in read
        self.res = gpib.read(self.id,len)
    gpib.GpibError: read() failed: An attempt to write command bytes to the bus has timed out.
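
For readers without the attachment, a byte-at-a-time reader along the same lines might look like the sketch below. It is not the attached byte_read.py. It assumes the Gpib class from the linux-gpib Python bindings, with write(), read(n) and ibsta() methods and a Gpib(board, pad) constructor; the function names read_bytewise and open_instrument are made up here. The third value in each tuple marks characters with bit 2 (0x04) set, which are the ones that preceded a failure on my board.

```python
def read_bytewise(inst, command, max_bytes=64):
    """Send command, then read one byte at a time until newline or max_bytes.

    inst is any object with write(), read(n) and ibsta(); the Gpib class from
    the linux-gpib Python bindings is assumed to provide these.
    """
    inst.write(command)
    received = []
    for _ in range(max_bytes):
        ch = inst.read(1)
        status = inst.ibsta()
        print(ch, status)
        # Bit 2 set in the last byte read is what preceded EBUS on my board.
        received.append((ch, status, bool(ch[0] & 0x04)))
        if ch == b"\n":
            break
    return received


def open_instrument(pad, board=0):
    """Open a device by primary address; needs linux-gpib's Gpib module."""
    import Gpib  # imported here so read_bytewise can be tried without hardware
    return Gpib.Gpib(board, pad)
```

With hardware attached, something like read_bytewise(open_instrument(4), "volt?") would correspond to the run shown above. Depending on the version of the bindings, the command may need to be passed as bytes rather than str.
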

In this case the 'E' as the last character read triggers the failure.

Dave Penkler ran the attached C program on his Beiming 82357B clone with an HP34401 DMM without errors. But since the clone has its own firmware, this does not rule out a common problem with the Agilents.

    $ onebyte -m2 -d7 "MEAS:VOLT?"
    - 0x0164
    1 0x0164
    . 0x0164
    9 0x0164
    4 0x0164
    8 0x0164
    E 0x0164
    - 0x0164
    0 0x0164
    6 0x0164
      0x0164
    $

* The Hardware *

The Agilent 82357B consists of a Cypress EZ-USB chip, a GPIB controller chip, a Xilinx XC9536 CPLD and bus transceivers. On my 82357B the GPIB controller is labeled Agilent 1822-0639. Google searches find this part in the BOM for several HP/Agilent/Keysight products as a TMS9914.

It would be nice to know the difference between the 82357A and 82357B. From pictures of the boards posted on the internet, they have the same major parts. Mine has the same revision label on the Xilinx XC9536 as a rev A board pictured on the Sigrok website here:
https://sigrok.org/wiki/Agilent_82357A

The firmware for the Cypress EZ-USB is downloaded using a udev script which runs fxload. The A and B revs of the 82357 have different firmware files from Agilent. Again, we don't know why the firmware is different. The agilent_82357a driver is used for both and treats them the same.

This might be a hardware problem unique to my 82357B. In addition to the failure with the ADSR register corruption, I have seen other strange behavior: bursts of traffic which include very fast DMA transfers that do not match any request from the driver. There is always a chance that connecting a logic analyzer may affect the circuit under test. I'm using an HP16702A and connecting to the TMS9914 using a PLCC clip and flying leads. The GPIB is probed using an HP10342. I don't have a good ground path for the flying lead probes, but the traces make sense, so I think that the DMA transfers are real.
It would be nice to instrument another 82357, but it is hard to justify spending the $100 which it is likely to cost.

* The right fix *

I had hoped that this problem was the same as the take_control problem in the agilent_82350b driver. I tried changing take_control to only use AUX_TCA. This fix worked, and I was able to read the calibration RAM of the HP3478A.

Once we found that the problem was the result of reads of the TMS9914 ADSR register being corrupted, we tested various commands and found that only AUX_TCS and AUX_TCA would restore the ADSR register to correct operation. We tried using the clear LON (listener only) command. This does not restore the correct function of the ADSR register, but it seems to make AUX_TCS work if it is used by the take_control function to assert ATN.

Calling take_control from the agilent_82357a_read function would restore the correct ADSR register behavior. I have tested this and it solves my problems. Dave Penkler is concerned that users of the low-level interface may be surprised that the ATN signal has been asserted. Having ATN asserted works well for most high-level operations, because the next operation is likely to have to send commands to set up listen and talk addresses.

The other alternative is to understand when we can trust the ADSR register, and to provide a reasonable guess of the GPIB state when we know that the ADSR is corrupt.

Jim Houston