From: Gary W. S. <ga...@pr...> - 2009-03-12 21:32:20
Jesse,

Looks better. Transferring 50GB to/from the server and I'm not getting the errors in the log now. Very large pings (ping vcsoaknas01 -t -l 30000 -w 7000) are occasionally timing out BUT I haven't lost connectivity to the SSH session as of yet and the file transfer is still going. dstat is also running consistently (no random TX hangs like before).

dstat:

  0   3  92   0   1   4|4224k   13M|7338k 4642k|   0     0 | 10k   12k
  0   1  98   0   0   1| 936k 3872k|1951k 1040k|   0     0 |3649  3403
  0   1  96   0   0   2|1496k 8879k|4638k 1700k|   0     0 |7378  8853
  0   4  91   0   1   4|  13M 3678k|2382k   14M|   0     0 |9188  7267
  0   3  93   0   1   4|4352k   15M|7864k 4877k|   0     0 | 11k   13k
  0   2  95   0   1   3| 384k   14M|7389k  516k|   0     0 |9990   12k
----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw
  0   2  95   0   0   2|2816k 8104k|4327k 3098k|   0     0 |7075  7510
  0   2  93   0   0   4|5696k 9120k|4918k 6176k|   0     0 |8478  8300
  0   2  95   0   0   3|3968k 6720k|3610k 4306k|   0     0 |6425  6107
  0   2  95   0   0   3|4736k 7616k|4081k 5135k|   0     0 |7242  6974
  0   2  95   0   1   3|4224k 6816k|3687k 4582k|   0     0 |6589  6344
  0   2  95   0   0   3|4096k 7016k|3748k 4445k|   0     0 |6546  6311
  0   1  96   0   0   2|3136k 5288k|2852k 3402k|   0     0 |5251  4936

We have 50GB on an iscsi share (or 500GB) that we are copying to/from over the wire for this test. During the writing of this email we have already copied about 1.3 GB without any problem as of yet.

So my next question is regarding the 4GB patch. Does this have any negative impact that I need to be aware of?

Gary

________________________________

From: Brandeburg, Jesse [mailto:jes...@in...]
Sent: Thu 3/12/2009 1:59 PM
To: Gary W. Smith
Cc: e10...@li...
Subject: RE: [E1000-devel] Detected Tx Unit Hang

re-added the list for tracking...

I think I see the issue: you have more than 4GB of RAM, and it appears that your system doesn't handle dual address cycles correctly, or our adapter doesn't work quite right for some reason.
Force the OS to never allow addresses > 4GB to our hardware using this patch:
https://sourceforge.net/tracker2/download.php?group_id=42302&atid=447449&file_id=283326&aid=2007017

it's the e1000_disable_dac.patch file.

________________________________

From: Gary W. Smith [mailto:ga...@pr...]
Sent: Thursday, March 12, 2009 12:55 PM
To: Brandeburg, Jesse
Subject: RE: [E1000-devel] Detected Tx Unit Hang

Jesse,

Included is the messages log with the debug patch. It only took a couple of seconds to trigger the problem, even with the modprobe.conf changes.

options e1000 TxDescriptorStep=4,4
alias eth0 e1000
alias eth1 e1000

Anyway, I did update the BIOS about a month back to see if that would resolve the problem, but it did not. It does have the latest.

We saw a similar problem under Windows 2003 with SP1+ but ruled it as being part of the TCP offload / DoS patch bug they had, and I didn't think much of it (as it affected several other servers). The problem under Windows existed whether or not we used the onboard NIC. In fact, we used a separate Broadcom 1GB adapter (thinking it was the TCP offload) and it didn't resolve it either.

I'm really hoping that this isn't a hardware issue (as it's not a warrantied box), but if it is then we will just have to deal with that separately.

Thanks for all of the help,

Gary

________________________________

From: Brandeburg, Jesse [mailto:jes...@in...]
Sent: Thu 3/12/2009 9:33 AM
To: Gary W. Smith
Cc: e10...@li...
Subject: RE: [E1000-devel] Detected Tx Unit Hang

sorry, go to the home page http://sourceforge.net/projects/e1000
click Tracker
click patches
click tx hang debug code (all releases) - 1460945
download the e1000_806_dump.patch, it should apply with fuzz to your e1000 driver directory with the command

download file.patch...
patch -d e1000-8.0.* -p1 < file.patch

here is the download link
https://sourceforge.net/tracker2/download.php?group_id=42302&atid=447451&file_id=298629&aid=1460945

________________________________

From: Gary W. Smith [mailto:ga...@pr...]
Sent: Thursday, March 12, 2009 9:16 AM
To: Brandeburg, Jesse
Cc: e10...@li...
Subject: RE: [E1000-devel] Detected Tx Unit Hang

Excuse my ignorance, but which patches? ;) There's a lot of stuff on the download page. I assume you are talking about the I/OAT driver & kernel patch but I want to make sure before doing it.

>
> Mar 11 18:50:01 vcsoaknas01 kernel: e1000: eth0: e1000_clean_tx_irq:
> Detected Tx Unit Hang
> Mar 11 18:50:01 vcsoaknas01 kernel: Tx Queue <0>
> Mar 11 18:50:01 vcsoaknas01 kernel: TDH <f7>
> Mar 11 18:50:01 vcsoaknas01 kernel: TDT <f7>
> Mar 11 18:50:01 vcsoaknas01 kernel: next_to_use <f7>
> Mar 11 18:50:01 vcsoaknas01 kernel: next_to_clean <24>
> Mar 11 18:50:01 vcsoaknas01 kernel: buffer_info[next_to_clean]
> Mar 11 18:50:01 vcsoaknas01 kernel: time_stamp <1004de0b1>
> Mar 11 18:50:01 vcsoaknas01 kernel: next_to_watch <24>
> Mar 11 18:50:01 vcsoaknas01 kernel: jiffies <1004dec18>
> Mar 11 18:50:01 vcsoaknas01 kernel: next_to_watch.status <0>

this really indicates that the adapter is finishing all the work, but that the descriptor write-back indicating the work was completed is not making it back to main memory. We have seen this a lot with AMD systems, in particular ones with VIA chipsets. There is a bad bug in those machines when an IO device and the processor both write to the same cache line.

also, if the above workaround doesn't help we'll want you to install the dump patch from the patches section of e1000.sourceforge.net and send us the output when you get a tx hang.

hope this helps,
Jesse
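
For readers following along, here is a small sketch of how the hex fields in the quoted Tx hang dump can be read against Jesse's explanation. It is only an illustration: the helper names (parse_dump, summarize) and the result keys are made up for this example and are not part of the e1000 driver or its debug patch. The values are copied from the log above. Converting the jiffies difference to seconds depends on the kernel's HZ setting, which the thread does not state, so it is left in jiffies.

```python
import re

# Fields copied from the "Detected Tx Unit Hang" dump quoted in the thread.
DUMP = """\
TDH <f7>
TDT <f7>
next_to_use <f7>
next_to_clean <24>
time_stamp <1004de0b1>
jiffies <1004dec18>
next_to_watch.status <0>
"""


def parse_dump(text):
    """Return {field_name: int} for every 'name <hex>' pair in the dump."""
    pairs = re.findall(r"([\w.]+)\s+<([0-9a-fA-F]+)>", text)
    return {name: int(value, 16) for name, value in pairs}


def summarize(fields):
    """Hypothetical helper: restate the dump in terms of Jesse's diagnosis."""
    return {
        # Head == tail: the adapter has consumed every descriptor it was given.
        "hw_ring_drained": fields["TDH"] == fields["TDT"],
        # The driver's cleanup index has not caught up with what it queued.
        "driver_behind": fields["next_to_clean"] != fields["next_to_use"],
        # Status 0: the completion write-back never showed up in memory.
        "writeback_missing": fields["next_to_watch.status"] == 0,
        # How long the oldest pending buffer has waited, in jiffies.
        "stalled_jiffies": fields["jiffies"] - fields["time_stamp"],
    }


summary = summarize(parse_dump(DUMP))
for key, value in summary.items():
    print(f"{key}: {value}")
```

With this dump, the hardware head and tail match (both 0xf7) while next_to_clean is still at 0x24 and the descriptor status is 0, which is the pattern Jesse describes: the work finished on the adapter, but the completion never became visible to the driver.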