Brief description: ATM adapter fails to transmit
packets, and eventually hard hangs the system.
RedHat Linux 7.1 + current errata
Kernel version 2.4.3-12smp (non-enterprise version)
ATM code version 0.78
IBM Netfinity 5500 M10
two 500 MHz Pentium III CPUs
899616 KB ram
one Forerunner HE 155 atm adapter
one AMD pcnet32 100Mb ethernet
one IBM 16/4 Token Ring II (unused)
one IBM serveraid 4H raid adapter
we've verified that we're using the latest firmware
for the system and adapters
155 Mbit atm
atm mtu set to 4096
100Mbit ethernet used to contact the system
when the atm fails
This system's primary function is as an ftp server.
Transmit load on the atm appears to be a necessary but
not sufficient condition for the onset of failure mode.
By observation, the atm does not fail when the transmit
load is below 15Mbit. We have seen the atm transmit at
over 80Mbit for a few minutes without failure. The
failures seem to occur when the transmit load is
sustained above 15Mbit for two to three hours.
In our environment that works out to one to two
failures per day.
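For anyone trying to reproduce the load figures above, here is a rough sketch of how a transmit rate in Mbit/s can be derived from two readings of an interface byte counter. The function names and sample values are illustrative assumptions, not part of our tooling, and the wrap handling assumes a 32-bit counter.

```python
# Sketch: derive transmit load in Mbit/s from two byte-counter readings.
# All names and sample values here are illustrative assumptions.

COUNTER_MODULUS = 2 ** 32  # assumed 32-bit counter that can wrap around


def tx_mbit_per_s(bytes_before, bytes_after, seconds):
    """Return the average transmit rate in Mbit/s over the interval."""
    if seconds <= 0:
        raise ValueError("interval must be positive")
    # The modulo keeps the delta correct across a single counter wrap.
    delta = (bytes_after - bytes_before) % COUNTER_MODULUS
    return delta * 8 / 1_000_000 / seconds


def sustained_above(rates, threshold_mbit):
    """Return True if there is at least one sample and all exceed the threshold."""
    return len(rates) > 0 and all(r > threshold_mbit for r in rates)


# 120,000,000 bytes sent in 60 seconds works out to 16 Mbit/s.
print(tx_mbit_per_s(1_000_000, 121_000_000, 60))
```

Feeding sustained_above() a list of per-minute rates covering two to three hours, with a threshold of 15, would flag the load condition we see before a failure.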
First, syslog reports
ftp3-atm kernel: he0: bad isw = 0x8?
Second, syslog reports
ftp3-atm kernel: clip_start_xmit: XOFF->XOFF transition
Third, after a few XOFF->XOFF messages the atm adapter
will stop transmitting packets
Fourth, if we do not reboot the system within ten to
fifteen minutes of atm transmit loss, the system will
hard hang, requiring a power cycle to recover.
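The ordering of the first two symptoms can be checked mechanically in a log. The following is only a sketch under the assumption that the two kernel messages appear in the form quoted above; the function names are hypothetical and this is not something we run in production.

```python
import re

# Patterns based on the two kernel messages quoted in this report.
BAD_ISW = re.compile(r"kernel: he0: bad isw")
XOFF = re.compile(r"kernel: clip_start_xmit:")


def first_occurrences(lines):
    """Return (index of first "bad isw" line, index of first clip_start_xmit line).

    Either value is None if that message never appears.
    """
    first_isw = None
    first_xoff = None
    for i, line in enumerate(lines):
        if first_isw is None and BAD_ISW.search(line):
            first_isw = i
        if first_xoff is None and XOFF.search(line):
            first_xoff = i
    return first_isw, first_xoff


def matches_failure_pattern(lines):
    """True if a "bad isw" message is logged before the first XOFF message."""
    first_isw, first_xoff = first_occurrences(lines)
    return (
        first_isw is not None
        and first_xoff is not None
        and first_isw < first_xoff
    )
```

It would be called with the lines of /var/log/messages, for example matches_failure_pattern(open("/var/log/messages").read().splitlines()).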
Here's an excerpt from /var/log/messages showing a
situation where we forced a reboot prior to a hard hang:
Aug 15 11:12:29 ftp3-atm mrx: Monitoring [atm0]
Aug 15 11:12:29 ftp3-atm mrx: atm0 state is NORMAL
Aug 15 11:28:32 ftp3-atm kernel: he0: bad isw = 0x28?
Aug 15 11:29:32 ftp3-atm last message repeated 2 times
Aug 15 11:29:33 ftp3-atm kernel: he0: bad isw = 0x8?
Aug 15 11:30:11 ftp3-atm kernel: he0: bad isw = 0x8?
Aug 15 12:00:31 ftp3-atm mrx: atm0 state is NORMAL
Aug 15 12:11:34 ftp3-atm kernel: clip_start_xmit:
Aug 15 12:14:03 ftp3-atm last message repeated 68 times
Aug 15 12:15:34 ftp3-atm last message repeated 4 times
Aug 15 12:16:37 ftp3-atm last message repeated 30 times
Aug 15 12:18:19 ftp3-atm last message repeated 19 times
Aug 15 12:19:55 ftp3-atm last message repeated 18 times
Aug 15 12:20:59 ftp3-atm last message repeated 9 times
Aug 15 12:21:21 ftp3-atm last message repeated 3 times
Aug 15 12:25:05 ftp3-atm rc: Stopping keytable: succeeded
Aug 15 12:25:06 ftp3-atm rc: Stopping atsar: succeeded
Aug 15 12:25:06 ftp3-atm proftpd: proftpd shutdown
Aug 15 12:25:06 ftp3-atm sshd: sshd -TERM succeeded
- the statements from "mrx:" are from a watchdog script
that verifies packet transmission by monitoring
/proc/net/dev and by pinging the default router
- the atm was transmitting between 15 and 20Mbit/s
when we received the first "bad isw"
- at 12:25:05 there had been at least three minutes of no
atm packet transmission, and we rebooted the system before
it could hard hang.
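The mrx script itself is not reproduced here. As a rough illustration of the /proc/net/dev half of such a check (the ping half is omitted), the sketch below reads the transmit packet counter for one interface and treats an unchanged counter between two samples as a stall. The function names and the sample counter values are made up for illustration.

```python
# Sketch of a transmit-stall check based on /proc/net/dev contents.
# SAMPLE uses invented counter values laid out in the /proc/net/dev format.
SAMPLE = """\
Inter-|   Receive                                                |  Transmit
 face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed
    lo:    1000      10    0    0    0     0          0         0     1000      10    0    0    0     0       0          0
  atm0:  500000    4000    0    0    0     0          0         0  9000000    7000    0    0    0     0       0          0
"""


def parse_tx_packets(proc_net_dev_text, interface):
    """Return the transmit packet counter for `interface`, or None if absent."""
    for line in proc_net_dev_text.splitlines():
        if ":" not in line:
            continue  # the two header lines contain no colon
        name, _, counters = line.partition(":")
        if name.strip() != interface:
            continue
        fields = counters.split()
        # Eight receive fields come first, so transmit bytes is field 8
        # and transmit packets is field 9 (counting from zero).
        return int(fields[9])
    return None


def is_stalled(previous_tx_packets, current_tx_packets):
    """True if no packets were transmitted between the two samples."""
    return current_tx_packets == previous_tx_packets
```

In use, the text of /proc/net/dev would be read twice, some seconds apart, and the two parse_tx_packets(text, "atm0") results passed to is_stalled().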
Steps we've taken to narrow the problem:
1) RH 2.4.3-12smp kernel with atm (he.o) module
2) RH 2.4.3-12smp kernel with atm in a monolithic kernel
3) RH 2.4.3-12uni kernel with modular atm (he.o)
4) moving the atm adapter from its own pci bus
to the one the raid adapter is on.
No joy: the failure occurred in every one of these combinations.
I will split this report into two cases if you
feel that "bad isw" and "XOFF->XOFF" are two
different problems, since they come from different
C modules. But the failure mode always follows
the same pattern of bad isw, followed by XOFF, followed
by transmit loss, followed by hard hang.
Any help or advice would be welcomed.