#2 atm fails to transmit packets then hangs

open
nobody
None
5
2001-08-16
2001-08-16
Gary Gaydos
No

Brief description: ATM adapter fails to transmit
packets, and eventually hard hangs the system.

Software environment:
RedHat Linux 7.1 + current errata
Kernel version 2.4.3-12smp (non-enterprise version)
ATM code version 0.78

Hardware Environment:
IBM Netfinity 5500 M10
two 500 MHz Pentium III CPUs
899616 KB ram
one Forerunner HE 155 atm adapter
one AMD pcnet32 100Mb ethernet
one IBM 16/4 Token Ring II (unused)
one IBM serveraid 4H raid adapter
we've verified that we're using the latest firmware
for the system and adapters

Network Environment:
155 Mbit atm
Classic IP
atm mtu set to 4096

100Mbit ethernet used to contact the system
when the atm fails

Failure scenario:
This system's primary function is as an ftp server.
Transmit load on the atm appears to be a necessary but
not sufficient condition for the onset of failure mode.
By observation, the atm does not fail when the transmit
load is below 15Mbit/s. We have seen the atm transmit at
over 80Mbit/s for a few minutes without failure. The
failures seem to occur when the transmit load is
sustained above 15Mbit/s for two to three hours.
In our environment that works out to one to two
failures per day.
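
For anyone trying to reproduce the load condition, here is a minimal sketch of how the transmit rate could be derived from two samples of the interface's transmit byte counter. The function names, the sampling approach, and the 32-bit counter width are assumptions for illustration, not part of our setup; the 15Mbit/s threshold is the one observed above.

```python
def tx_mbit_per_s(bytes_before, bytes_after, seconds, counter_bits=32):
    """Average transmit rate in Mbit/s between two tx byte-counter samples."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    # Assumption: the counter is counter_bits wide and wraps at most once
    # between samples, so the modulo recovers the true delta.
    delta = (bytes_after - bytes_before) % (1 << counter_bits)
    return delta * 8 / 1_000_000 / seconds


def longest_run_above(rates, threshold=15.0):
    """Length of the longest consecutive run of samples above threshold."""
    best = run = 0
    for rate in rates:
        run = run + 1 if rate > threshold else 0
        best = max(best, run)
    return best
```

With one sample per minute, a run of 120 to 180 samples above the threshold would correspond to the two to three hours of sustained load described above.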

First, syslog reports
ftp3-atm kernel: he0: bad isw = 0x8?

Second, syslog reports
ftp3-atm kernel: clip_start_xmit: XOFF->XOFF transition

Third, after a few XOFF->XOFF messages the atm adapter
will stop transmitting packets.

Fourth, if we do not reboot the system within ten to
fifteen minutes of atm transmit loss, the system will
hard hang, requiring a power cycle to recover.
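
The sequence above can be spotted in a log with a short scan. This is only a sketch: the function name and the regular expressions are hypothetical and match just the two kernel messages quoted in this report.

```python
import re

# Patterns taken from the two syslog messages quoted in this report.
BAD_ISW = re.compile(r"kernel: he0: bad isw")
XOFF = re.compile(r"kernel: clip_start_xmit:")


def find_failure_onset(lines):
    """Return (first 'bad isw' line, first XOFF line after it), or None."""
    first_bad = None
    for line in lines:
        if first_bad is None:
            if BAD_ISW.search(line):
                first_bad = line.rstrip("\n")
        elif XOFF.search(line):
            return first_bad, line.rstrip("\n")
    return None
```

Passing the open /var/log/messages file object as `lines` should work, since the function only iterates over lines. It returns None when no "bad isw" message is followed by a clip_start_xmit message.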

Here's an excerpt from /var/log/messages showing a
situation where we forced a reboot prior to a hard hang:

Aug 15 11:12:29 ftp3-atm mrx: Monitoring [atm0]
Aug 15 11:12:29 ftp3-atm mrx: atm0 state is NORMAL
Aug 15 11:28:32 ftp3-atm kernel: he0: bad isw = 0x28?
Aug 15 11:29:32 ftp3-atm last message repeated 2 times
Aug 15 11:29:33 ftp3-atm kernel: he0: bad isw = 0x8?
Aug 15 11:30:11 ftp3-atm kernel: he0: bad isw = 0x8?
Aug 15 12:00:31 ftp3-atm mrx: atm0 state is NORMAL
Aug 15 12:11:34 ftp3-atm kernel: clip_start_xmit: XOFF->XOFF transition
Aug 15 12:14:03 ftp3-atm last message repeated 68 times
Aug 15 12:15:34 ftp3-atm last message repeated 4 times
Aug 15 12:16:37 ftp3-atm last message repeated 30 times
Aug 15 12:18:19 ftp3-atm last message repeated 19 times
Aug 15 12:19:55 ftp3-atm last message repeated 18 times
Aug 15 12:20:59 ftp3-atm last message repeated 9 times
Aug 15 12:21:21 ftp3-atm last message repeated 3 times
Aug 15 12:25:05 ftp3-atm rc: Stopping keytable: succeeded
Aug 15 12:25:06 ftp3-atm rc: Stopping atsar: succeeded
Aug 15 12:25:06 ftp3-atm proftpd: proftpd shutdown succeeded
Aug 15 12:25:06 ftp3-atm sshd: sshd -TERM succeeded
notes:
- the statements from "mrx:" are from a watchdog script
that verifies packet transmission by monitoring
/proc/net/dev and by pinging the default router

- the atm was transmitting between 15 and 20Mbit/s
when we received the first "bad isw"

- at 12:25:05 there had been at least three minutes of no
atm packet transmission and we rebooted the system prior
to it hard hanging.
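
For reference, the /proc/net/dev part of a check like the "mrx" watchdog can be sketched as follows. This is not the actual script; the function names are hypothetical, and it assumes the usual /proc/net/dev layout in which the tenth numeric field after the interface name is the transmitted packet count.

```python
def parse_tx_packets(proc_net_dev_text, iface):
    """Return the transmitted-packet counter for iface from /proc/net/dev text."""
    for line in proc_net_dev_text.splitlines():
        # Interface lines look like "  atm0: <16 numbers>"; some kernels
        # omit the space after the colon, so split on the colon itself.
        name, sep, rest = line.partition(":")
        if sep and name.strip() == iface:
            fields = rest.split()
            # Fields 0-7 are receive counters; 8 is tx bytes, 9 is tx packets.
            return int(fields[9])
    raise KeyError(iface)


def transmit_stalled(before_text, after_text, iface="atm0"):
    """True if the tx packet counter did not advance between two snapshots."""
    return parse_tx_packets(after_text, iface) == parse_tx_packets(before_text, iface)
```

A counter that does not advance only indicates a stall if traffic was actually offered between the two snapshots, which is why the watchdog also pings the default router.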

Steps we've taken to narrow the problem:
1) RH 2.4.3-12smp kernel with atm (he.o) module
2) RH 2.4.3-12smp kernel with atm in a monolithic kernel
3) RH 2.4.3-12uni kernel with modular atm (he.o)
4) moving the atm adapter from its own pci bus
to the one the raid adapter is on.
No joy in any of the combinations.

I will split this report into two cases if you
feel that "bad isw" and "XOFF->XOFF" are two
different problems since they come from different
C modules. But the failure mode always follows
the same pattern of bad isw, followed by XOFF, followed
by transmit loss, followed by hard hang.
Any help or advice would be welcomed.

Regards, Gary
