Re: [Linux-gpib-general] Agilent 82357B repeatable hard failure


Hi Everyone,

Spoiler:
Proposed fix for flakey behavior of an Agilent 82357B.  Testers wanted.
Under unusual conditions the 82357B returns an EBUS timeout for all
commands until it is unplugged and plugged back in.  In particular the
problem depends on the last character of the last message received.
The problem doesn't happen in the normal case where the read buffer
is large enough to hold the whole response and the last character is
newline.  The accompanying test program reads characters a byte at a
time and can recreate the problem.  If you have an Agilent 82357B and
an instrument that responds to commands please run the attached program
and send the output to help determine whether this is a common problem
or only specific to my board.

TLDR;

* The problem *
I was trying to use my HP 82357B to read the calibration data
from my HP3478A meter.  It failed with EBUS.

I posted here on July 26th and have been working with Dave Penkler to
debug this problem since then.  It wasn't clear if this was flakey
hardware or a problem with the Linux-gpib agilent_82357a driver.

The calibration data of the HP3478A can be read a nibble at a time by
sending a two-character command consisting of a 'W' followed by a byte
with the 8-bit address of the location to be read from the RAM.  Tom
Verbeure has a good description of the process here:
https://tomverbeure.github.io/2022/12/02/HP3478A-Multimeter-Calibration-Data-Backup-and-Battery-Replacement.html
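As a minimal sketch of the command format just described (the helper
name cal_read_command is made up for illustration and is not part of
any library), the two bytes can be built like this:

```python
def cal_read_command(address: int) -> bytes:
    """Build the two-byte command: 'W' followed by the raw 8-bit address.

    Hypothetical helper for illustration only.
    """
    if not 0 <= address <= 0xFF:
        raise ValueError("address must fit in 8 bits")
    return b"W" + bytes([address])


# The text commands 'W1' and 'W4' are this encoding with the address
# bytes 0x31 and 0x34 (the ASCII codes of '1' and '4').
print(cal_read_command(0x31))
print(cal_read_command(0x34))
```

Addresses outside the printable ASCII range work the same way, they
are just awkward to type as text.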

The calibration data is returned as a single byte with values in the
range 0x40-0x4f where the lower 4 bits are the value of the addressed
nibble.  The EBUS failure only happened for some addresses.  A 'W1'
command always worked but sending a 'W4' command caused the next
command to receive an EBUS failure.  Once the system returned an EBUS
the only way to recover was to unplug the HP 82357B USB cable and plug
it back in.  Further testing showed that it was the returned nibble
that predicted the failure.  Here is the table:

@ABC HIJK - The next read works.
DEFG LMNO - The next read fails with EBUS

So it was the DI[2] bit which determined the result.
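The table can be reproduced from the byte values alone.  The sketch
below decodes the returned byte into its nibble and splits the sixteen
possible responses on bit 2 (mask 0x04, counting bits from zero).  The
function names are mine and only illustrate the observation above.

```python
def decode_nibble(response: int) -> int:
    """Return the calibration nibble carried in the low 4 bits."""
    if not 0x40 <= response <= 0x4F:
        raise ValueError("expected a byte in the range 0x40-0x4f")
    return response & 0x0F


def next_read_fails(response: int) -> bool:
    """True when bit 2 of the returned byte is set (observed failure case)."""
    return bool(response & 0x04)


works = "".join(chr(b) for b in range(0x40, 0x50) if not next_read_fails(b))
fails = "".join(chr(b) for b in range(0x40, 0x50) if next_read_fails(b))
print(works)  # @ABCHIJK
print(fails)  # DEFGLMNO
```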

* The Debug *
I had fun chasing this problem.  I started with printks in the driver
and determined that the EBUS was the result of agilent_82357a_take_control
failing.  This function is supposed to assert ATN to signal that
the bytes which follow are command bytes.  This is done automatically
when a command is written or can be done with a user call to ibcac().
The take_control function sends a command to the GPIB bus control
logic in the 82357B and then polls waiting for ATN to be asserted.
In the failure case it timed out waiting for ATN to be asserted.

I used my logic analyzer to capture the GPIB activity, Wireshark and
usbmon to capture the USB traffic, and Linux ftrace because I got tired
of adding printk's.  Finally I hooked the logic analyzer to the
TMS9914 chip which implements the GPIB control logic in the 82357B.

* History related problems *
We knew that there had been problems with the agilent_82350b driver
which involved the take_control path.  The issue involves the
synchronous option to take_control.  This option determines if
the GPIB controller in the 82357B waits for (synchronizes with) a
pending IO before asserting ATN.  If the sync option is 1 the
controller waits.  If sync is zero the controller just asserts ATN.
The default command path calls ibcac() in the common portion of
the gpib driver with sync=1.  There is logic to try to recover from
a failure to assert ATN by trying a second time with sync=0.
It also checks if the ATN bit is already set in ibsta and will
avoid calling the take_control function.  The problem with the 82350B and
potentially all TMS9914 based designs is that once it has tried to
do a take_control synchronously it is stuck.  A subsequent call
with sync=0 doesn't work.  The workaround for the 82350B is
to have the take_control function return a timeout error if it is
called with sync=1.  The ibcac() will then call back with sync=0, avoiding
the problem.  If the call comes from the user-space library ibcac(), it
does not fall back to asynchronous (contrary to the documentation); a
request with sync=1 simply fails.  Does it return failure if the ATN
bit is falsely already set in ibsta?  No, because if ATN is not asserted
by the controller the firmware will initiate an asynchronous take control
(TCA) on the chip before sending the command bytes.
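The retry logic and the workaround can be modelled in a few lines.
This is a toy model, not the driver code: the class and function names
are invented, and it assumes the failure scenario in which the
synchronous attempt never completes and leaves the chip stuck.

```python
class Timeout(Exception):
    """Stands in for the driver's timeout error."""


class Tms9914Model:
    """Toy chip: a failed synchronous attempt blocks later async attempts."""

    def __init__(self):
        self.atn = False    # stands in for the ATN bit seen in ibsta
        self.stuck = False

    def aux_tcs(self):
        # Failure scenario assumed here: sync take control never completes.
        self.stuck = True

    def aux_tca(self):
        if not self.stuck:
            self.atn = True


def take_control_naive(chip, sync):
    """Use the sync command when asked, then check that ATN came up."""
    if sync:
        chip.aux_tcs()
    else:
        chip.aux_tca()
    if not chip.atn:
        raise Timeout("ATN not asserted")


def take_control_workaround(chip, sync):
    """82350B-style workaround: refuse sync so the caller retries async."""
    if sync:
        raise Timeout("synchronous take control refused")
    take_control_naive(chip, sync=False)


def ibcac(chip, take_control, sync=True):
    """Model of the common path: skip if ATN is set, retry with sync=0."""
    if chip.atn:
        return
    try:
        take_control(chip, sync)
    except Timeout:
        if not sync:
            raise
        take_control(chip, False)
```

With take_control_naive the retry fails because the chip is already
stuck.  With take_control_workaround the synchronous command is never
issued and the asynchronous retry succeeds.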

* Keysight to the rescue *
I also tried to reproduce the problem with the Keysight IO libraries.
They have a version for Linux.  It works with PyVISA and I was able
to send the 'W1' and 'W4' commands to the HP3478A but it always returned
the front panel display value rather than the calibration ram single
byte response which I had expected.  I put some effort into diagnosing
this new problem and believe that it was the result of the IO libraries
always sending a secondary address of 0.  So the PyVISA identifier
GPIB::22::INSTR was interpreted as GPIB::22::0::INSTR.  I could see this
on the GPIB bus using my logic analyzer, and I could see it in the
usbmon trace using Wireshark.  The Keysight IO libraries consist of a
proprietary user space and an open source device driver.  I had hoped
to see if the Keysight software initialized the 82357B the same as
the linux-gpib.  I was able to capture the initialization from the
USB bus using Wireshark.  I tried using the same initialization sequence
in the linux-gpib driver and still had the same EBUS failure.
I wrote a Python script which allowed me to use the Agilent driver
bypassing the IO libraries.  This let me send the 'W4' command to the HP3478A.
Subsequent commands worked.  I also found that the Keysight code always
used the asynchronous take control command to assert ATN.

* Enlightenment *
I shared the captured traces with Dave, and he figured out that the
character being read determined if the take_control function was
being called with sync=0 or 1.  This is the result of another shortcut in
the common gpib ibcac() function that will only use the sync option
if the board is a listener on the GPIB bus.  It does this by checking the
LACS status bit.  The agilent_82357a driver gets this status from
the TMS9914 ADSR register.  In the case of my Agilent 82357B this
register was corrupted with the last character of the last message
received.
In the failure case with my HP 3478A, issuing a 'W4' command results
in location 0x34 being read and the single character 'F' being returned.
This character was also being returned for reads of the ADSR register.
This corruption of the ADSR register caused the common gpib ibcac()
to call the agilent_82357a_take_control() with sync=1.
This triggers a failure similar to the problem previously seen
with the Agilent 82350B.  The AUX_TCS is sent but the ATN signal
is never asserted.
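To make the mechanism concrete, here is a small model that treats the
last received character as the value read back from the ADSR.  The bit
masks are an assumption based on my reading of the TMS9914 register
layout (listener-addressed at 0x04, ATN at 0x20), and the function name
is invented.  It combines the LACS check with the "ATN already set"
shortcut mentioned in the history section.

```python
HR_LA = 0x04    # assumed position of the listener-addressed bit in ADSR
HR_ATN = 0x20   # assumed position of the ATN bit in ADSR


def predicted_path(adsr: int) -> str:
    """Model which take-control path the common code would choose."""
    if adsr & HR_ATN:
        return "skip"    # ATN appears set, take_control is not called
    if adsr & HR_LA:
        return "sync"    # board appears to be a listener, sync=1 is used
    return "async"       # otherwise sync=0 is used


for ch in b"F@\n":
    print(repr(chr(ch)), hex(ch), predicted_path(ch))
```

Under these assumptions 'F' (0x46) selects the synchronous path that
fails, while '@' and newline select the asynchronous path.  Characters
with bit 5 set, such as digits, would take the skip path in this model.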

* Can you reproduce the failure at home?  *
From the start I was not sure if this was a broken Agilent 82357B
or a problem with the driver.  We still don't know.  We would like
other Agilent 82357 owners to test.  You don't need an HP3478A
to reproduce this problem.  Since the problem depends on the last
character received it would often be masked since a '\n' doesn't
cause the failure.  In normal use the receive buffer is large
enough to hold the whole response and the problem is avoided.
The problem is easy to reproduce if you read a character at a time.
The attached Python script takes two arguments: the device number
and a command.  Here is an example using my HP6642 power supply:

jhouston@linux-gpib:~$ python3 byte_read.py 4 volt?
b'0' 256
b'.' 256
b'0' 256
b'E' 256
InternalReceiveSetup: command failed
Traceback (most recent call last):
   File "byte_read.py", line 12, in <module>
     ch = inst.read(1)
   File "/usr/local/lib/python3.6/dist-packages/Gpib.py", line 59, in read
     self.res = gpib.read(self.id,len)
gpib.GpibError: read() failed: An attempt to write command bytes to the 
bus has timed out.
jhouston@linux-gpib:~$

In this case the 'E' as the last character read triggers the failure.
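For readers without the attachment, the idea of the test is a loop like
the one below.  This is not the attached script, only a sketch of the
technique.  It assumes an instrument object with write(), read(n) and
ibsta() methods, which is how I understand the Gpib class in the
linux-gpib Python bindings to behave; the ReplayInstrument class is a
made-up stand-in so the sketch runs without hardware.

```python
def read_bytewise(inst, command, max_bytes=64):
    """Send a command, then read the reply one byte at a time."""
    inst.write(command)
    result = []
    for _ in range(max_bytes):
        ch = inst.read(1)          # one byte per read is what exposes the bug
        status = inst.ibsta()
        print(ch, status)
        result.append((ch, status))
        if ch == b"\n":            # stop at the end of the response
            break
    return result


class ReplayInstrument:
    """Hypothetical stand-in that replays a canned reply (no GPIB needed)."""

    def __init__(self, reply):
        self._reply = reply
        self._pos = 0

    def write(self, command):
        self._pos = 0

    def read(self, length):
        chunk = self._reply[self._pos:self._pos + length]
        self._pos += length
        return chunk

    def ibsta(self):
        return 0x100               # placeholder status value


# With real hardware the stand-in would be replaced by something like
# Gpib.Gpib(0, 4) for board 0, primary address 4 (assumed usage).
read_bytewise(ReplayInstrument(b"0.0E+0\n"), b"volt?")
```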

Dave Penkler ran the attached C program on his Beiming 82357B clone
with an HP34401 DMM without errors.  But since the clone has its own
firmware, this does not prove that there is no common problem with the
Agilents.
$ onebyte -m2 -d7 "MEAS:VOLT?"
- 0x0164
1 0x0164
. 0x0164
9 0x0164
4 0x0164
8 0x0164
E 0x0164
- 0x0164
0 0x0164
6 0x0164

  0x0164
$
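Assuming the second column in both listings is the ibsta status word,
the values can be decoded with the standard ibsta bit assignments.  The
helper name is mine.

```python
IBSTA_BITS = {
    0x8000: "ERR", 0x4000: "TIMO", 0x2000: "END", 0x1000: "SRQI",
    0x0800: "RQS", 0x0400: "SPOLL", 0x0200: "EVENT", 0x0100: "CMPL",
    0x0080: "LOK", 0x0040: "REM", 0x0020: "CIC", 0x0010: "ATN",
    0x0008: "TACS", 0x0004: "LACS", 0x0002: "DTAS", 0x0001: "DCAS",
}


def decode_ibsta(status: int) -> list:
    """Return the names of the bits set in an ibsta value, high bit first."""
    return [name for mask, name in IBSTA_BITS.items() if status & mask]


print(decode_ibsta(256))     # ['CMPL']
print(decode_ibsta(0x0164))  # ['CMPL', 'REM', 'CIC', 'LACS']
```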

* The Hardware *
The Agilent 82357B consists of
a Cypress EZ-USB chip, a GPIB controller chip, a Xilinx XC9536 CPLD
and bus transceivers.  On my 82357B the GPIB controller is labeled
Agilent 1822-0639.  Google searches find this in the BOM for several
HP/Agilent/Keysight products as a TMS9914.  It would be nice to know
the difference between the 82357A and 82357B.  From pictures of the
boards posted on the internet they have the same major parts.  Mine
has the same revision label on the Xilinx XC9536 as a rev A board
picture posted on the Sigrok website here:
https://sigrok.org/wiki/Agilent_82357A

The firmware for the Cypress EZ-USB is downloaded using a udev script
which runs fxload.  The A/B revs of the 82357 have different firmware
files from Agilent.  Again we don't know why the firmware is different.
The agilent_82357a driver is used for both and treats them the same.

This might be a hardware problem unique to my 82357B.  In
addition to the failure with the ADSR register corruption, I have
seen other strange behavior.  I have seen bursts of traffic which
include very fast DMA transfers which do not match any request
from the driver.  There is always a chance that connecting a
logic analyzer may affect the circuit under test.  I'm using an
HP16702A and connecting to the TMS9914 using a PLCC clip and flying
leads.  The GPIB is probed using an HP10342.  I don't have
a good ground path for the flying lead probes, but the traces make
sense so I think that the DMA transfers are real.
It would be nice to instrument another 82357 but it is hard to
justify spending the $100 which it is likely to cost.

* The right fix *
I had hoped that this problem was the same as the take_control problem
in the agilent_82350b driver.  I tried changing take_control to
only use AUX_TCA.  This fix worked and I was able to read the
calibration ram of the HP3478A.

Once we found that the problem was the result of reads of the
TMS9914 ADSR register being corrupted, we tested various commands
and found that only the AUX_TCS and AUX_TCA would restore the
ADSR register to correct operation.  We tried using the clear
LON (Listener only) command.  This does not restore the correct
function of the ADSR register but seems to make AUX_TCS work
if it is used by the take_control function to assert ATN.

Calling take_control from the agilent_82357a_read function
would restore the correct ADSR register behavior.  I have
tested this and it solves my problems.  Dave Penkler is
concerned that users of the low-level interface
may be surprised that the ATN signal has been asserted.
Having the ATN asserted works well for most high level operations
because the next operation is likely to have to send commands
to setup listen and talk addresses.

The other alternative is to understand when we can trust the
ADSR register and to provide a reasonable guess of the GPIB
state when we know that the ADSR is corrupt.

Jim Houston
