From: Steve W. <sw...@ar...> - 2003-03-29 21:32:10
|
I'm using smartmontools-5.1-9 on PowerPC running Linux 2.1.24 (smartmontools required a few changes to deal with endianness issues, and I'll gladly post the patches after getting this issue fixed). Everything is working fine up until it tries to read the Error&Self-Test logs, which causes an I/O error. In /var/log/kernel, these two lines appear:

Mar 29 21:14:46 (none) kernel: hdb: drive_cmd: status=0x51 { DriveReady SeekComplete Error }
Mar 29 21:14:46 (none) kernel: hdb: drive_cmd: error=0x10 { SectorIdNotFound }, secCnt=6, LBAsect=12734249

On Linux/x86 this works without issue on this exact drive, so I know the drive is not the culprit.

I don't see how the changes I made to accommodate big endian would be the cause, as they just byte-swap the data before sending it back, and this error is happening at the ioctl in ataReadErrorLog() and ataReadSelfTestLog() prior to where I would byte-swap the data. i.e. in atacmds.c, my modifications focus mainly on these lines:

    memcpy(data,buf+HDIO_DRIVE_CMD_HDR_SIZE,ATA_SMART_SEC_SIZE);

I've checked the homepage and scanned through the support archives, and have not seen this same issue. Admittedly I didn't read every message in the archive, so I certainly apologize if I overlooked where this had been covered before.

And for what it's worth, here is the output from smartctl:

$ /mnt/smartctl -a /dev/hdb
smartctl version 5.1-9 Copyright (C) 2002-3 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model:     QUANTUM FIREBALLlct15 22
Serial Number:    313019116552
Firmware Version: A01.0F00
ATA Version is:   5
ATA Standard is:  ATA/ATAPI-5 T13 1321D revision 1
Local Time is:    Sat Mar 29 21:14:45 2003 localtime
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Off-line data collection status: (0x00)
    Offline data collection activity was never started.
Self-test execution status: ( 0)
    The previous self-test routine completed without error
    or no self-test has ever been run.
Total time to complete off-line data collection: ( 2) seconds.
Offline data collection capabilities: (0x1b)
    SMART execute Offline immediate.
    Automatic timer ON/OFF support.
    Suspend Offline collection upon new command.
    Offline surface scan supported.
    Self-test supported.
SMART capabilities: (0x0003)
    Saves SMART data before entering power-saving mode.
    Supports SMART auto save timer.
Error logging capability: (0x01)
    Error logging supported.
Short self-test routine recommended polling time: ( 1) minutes.
Extended self-test routine recommended polling time: ( 23) minutes.

SMART Attributes Data Structure revision number: 11
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG   VALUE WORST THRESH TYPE    WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x0029 100   253   020    Old_age -           0
  3 Spin_Up_Time            0x0027 073   070   020    Old_age -           3476
  4 Start_Stop_Count        0x0032 100   100   008    Old_age -           444
  5 Reallocated_Sector_Ct   0x0033 100   100   020    Old_age -           0
  7 Seek_Error_Rate         0x000b 100   100   023    Old_age -           0
  9 Power_On_Hours          0x0012 100   100   001    Old_age -           210
 10 Spin_Retry_Count        0x0026 100   100   000    Old_age -           0
 11 Calibration_Retry_Count 0x0013 100   100   020    Old_age -           0
 12 Power_Cycle_Count       0x0032 100   100   008    Old_age -           417
 13 Read_Soft_Error_Rate    0x000b 100   100   023    Old_age -           0
195 Hardware_ECC_Recovered  0x001a 100   100   000    Old_age -           2061
196 Reallocated_Event_Count 0x0010 100   253   020    Old_age -           0
197 Current_Pending_Sector  0x0032 100   100   020    Old_age -           0
198 Offline_Uncorrectable   0x0010 100   253   000    Old_age -           0
199 UDMA_CRC_Error_Count    0x001a 196   196   000    Old_age -           10

Error SMART Error Log Read failed: Input/output error
Smartctl: SMART Errorlog Read Failed
Error SMART Error Self-Test Log Read failed: Input/output error
Smartctl: SMART Self Test Log Read Failed
$

Cheers,
Steve
|
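[Editor's sketch] The two kernel log lines in the message above pair a raw register value with the names of the bits that are set. As a rough illustration of that mapping, here is a minimal C sketch that decodes the ATA status and error registers. The bit names are the ones printed in the log; the bit tables are an assumption based on the classic ATA register layout, and decode_status() and decode_error() are hypothetical helper names, not kernel or smartmontools functions.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* One register bit and the label used for it in the kernel log. */
struct bit_name {
    unsigned char mask;
    const char *name;
};

/* Assumed layout of the ATA status register, highest bit first. */
static const struct bit_name status_bits[] = {
    {0x80, "Busy"},         {0x40, "DriveReady"},
    {0x20, "DeviceFault"},  {0x10, "SeekComplete"},
    {0x08, "DataRequest"},  {0x04, "CorrectedError"},
    {0x02, "Index"},        {0x01, "Error"},
};

/* Assumed layout of the ATA error register, highest bit first. */
static const struct bit_name error_bits[] = {
    {0x80, "BadSector"},         {0x40, "UncorrectableError"},
    {0x10, "SectorIdNotFound"},  {0x04, "DriveStatusError"},
    {0x02, "TrackZeroNotFound"}, {0x01, "AddrMarkNotFound"},
};

/* Write the space-separated names of all set bits into out. */
static void decode_bits(unsigned char value, const struct bit_name *table,
                        size_t count, char *out, size_t outlen)
{
    size_t i;

    if (outlen == 0)
        return;
    out[0] = '\0';
    for (i = 0; i < count; i++) {
        if (value & table[i].mask) {
            size_t used = strlen(out);
            snprintf(out + used, outlen - used, "%s%s",
                     used ? " " : "", table[i].name);
        }
    }
}

void decode_status(unsigned char value, char *out, size_t outlen)
{
    decode_bits(value, status_bits,
                sizeof status_bits / sizeof status_bits[0], out, outlen);
}

void decode_error(unsigned char value, char *out, size_t outlen)
{
    decode_bits(value, error_bits,
                sizeof error_bits / sizeof error_bits[0], out, outlen);
}
```

Called with 0x51, decode_status() yields the same three names as the first log line, and decode_error() with 0x10 yields the single name from the second.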
From: Bruce A. <ba...@gr...> - 2003-03-30 03:38:29
|
Hi Steve,

> I'm using smartmontools-5.1-9 on PowerPC running Linux 2.1.24
> (smartmontools required a few changes to deal with endianness issues,
> and I'll gladly post the patches after getting this issue fixed).

I'm VERY excited to hear this. I have been wondering if someone would try to get smartmontools working on big-endian (I only have x86 little-endian boxes). It would be wonderful to get the code to work correctly on both, and also eventually to make sure it's 64-bit clean. It sounds as if you are already halfway there.

> Everything is working fine up until it tries to read the
> Error&Self-Test logs, which causes an I/O error. In /var/log/kernel,
> these two lines appear:
>
> Mar 29 21:14:46 (none) kernel: hdb: drive_cmd: status=0x51 { DriveReady SeekComplete Error }
> Mar 29 21:14:46 (none) kernel: hdb: drive_cmd: error=0x10 { SectorIdNotFound }, secCnt=6, LBAsect=12734249
>
> On Linux/x86 this works without issue on this exact drive, so I know
> the drive is not the culprit.

Is the kernel version number you gave above accurate (2.1.24)? Are you using similar/identical kernel versions in your comparison testing?

> I don't see how the changes I made to accommodate big endian would be
> the cause, as they just byte-swap the data before sending it back, and
> this error is happening at the ioctl in ataReadErrorLog() and
> ataReadSelfTestLog() prior to where I would byte-swap the data. i.e.
> in atacmds.c, my modifications focus mainly on these lines:
>
> memcpy(data,buf+HDIO_DRIVE_CMD_HDR_SIZE,ATA_SMART_SEC_SIZE);

Let's see... I think what you are doing makes sense. After all, before doing the ioctl you have for example:

    unsigned char buf[HDIO_DRIVE_CMD_HDR_SIZE+ATA_SMART_SEC_SIZE] = {WIN_SMART, 0x06, SMART_READ_LOG_SECTOR, 1,};

and since these first four quantities are ALL single bytes, no byte-swapping is needed. Then, as you say, just byte-swap the return structure. So it sounds to me like you are doing "the right thing".

> I've checked the homepage and scanned through the support archives, and
> have not seen this same issue. Admittedly I didn't read every message
> in the archive, so I certainly apologize if I overlooked where this had
> been covered before.

No apology necessary. It's not been covered before. The only person I know who has worked on a big-endian architecture is Peter Cassidy, who is doing a Darwin port. But he's using the Darwin native SMART commands, not a straight Linux kernel. So I think the byte swapping is done for him already. But let's see if Peter has anything to add. I am copying the developers list (see mail header) so he should get this too.

> And for what it's worth, here is the output from smartctl:

And for what it's worth, here are my comments (:-;)

> ATA Version is: 5
> ATA Standard is: ATA/ATAPI-5 T13 1321D revision 1
> Local Time is: Sat Mar 29 21:14:45 2003 localtime

Interesting timezone. Is this right? See utility.c for the relevant bits of code referencing tzname[]. Also man tzset.

> Total time to complete off-line
> data collection: ( 2) seconds.

This looks too short (but might be right, I suppose, if you printed this output just as some data collection was finishing??)

> 4 Start_Stop_Count 0x0032 100 100 008 Old_age - 444
> 9 Power_On_Hours 0x0012 100 100 001 Old_age - 210
> 12 Power_Cycle_Count 0x0032 100 100 008 Old_age - 417

Do these numbers look reasonable? You should be able to use hdparm -y to spin down and spin up the disk while the system is running and see the start stop count increment while the power cycle count stays fixed. [Is the large number of power cycles in 210 hours right??]

> 199 UDMA_CRC_Error_Count 0x001a 196 196 000 Old_age - 10

Does this count increment each time you get an IDE error like the one above?

> Error SMART Error Log Read failed: Input/output error
> Smartctl: SMART Errorlog Read Failed
> Error SMART Error Self-Test Log Read failed: Input/output error
> Smartctl: SMART Self Test Log Read Failed

OK -- just as you described.

At this point, my only good answer is "kernel developers mailing list". The point being that obviously the other SMART calls that return 512 byte structures succeeded. The fact that this one failed (and especially if the UDMA error count is correlated) might be a sign of a kernel driver bug. I wouldn't be surprised since I don't think that there is any other standard Linux code that uses this ioctl().

Steve, would you like to join the group of smartmontools developers so that you can integrate your changes into the body of the code? If so, let me know if you have a SourceForge user name, and if you are familiar with CVS. If not, I'll help you get started.

Cheers,
Bruce

[PS: I remember the first time I saw a Lisa. I was a postdoc, visiting a friend in Austin TX around 1984-5. He very proudly showed me his Lisa, one of the first ones out, on which he had blown at least 5 grand.]
|
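[Editor's sketch] The point made above is that the four-byte command header needs no swapping because every field is a single byte, while multi-byte fields in the returned 512-byte sector do. A minimal C sketch of both halves follows, under stated assumptions: the enum names are stand-ins for the macros named in the thread, the values 0xB0 and 0xD5 are, to my knowledge, the ATA SMART opcode and SMART READ LOG feature code, and build_read_log_cmd() and get_le16() are hypothetical helpers rather than smartmontools code.

```c
#include <stddef.h>
#include <string.h>

/* Stand-ins for the macros named in the thread (values are assumptions). */
enum {
    CMD_HDR_SIZE     = 4,    /* plays the role of HDIO_DRIVE_CMD_HDR_SIZE */
    SMART_SEC_SIZE   = 512,  /* plays the role of ATA_SMART_SEC_SIZE */
    CMD_SMART        = 0xB0, /* plays the role of WIN_SMART */
    FEATURE_READ_LOG = 0xD5  /* plays the role of SMART_READ_LOG_SECTOR */
};

/*
 * Fill the command buffer in the same order as the initializer quoted
 * above. Each header field is one byte wide, so the result is the same
 * on big-endian and little-endian hosts.
 */
void build_read_log_cmd(unsigned char *buf, unsigned char log_address)
{
    memset(buf, 0, CMD_HDR_SIZE + SMART_SEC_SIZE);
    buf[0] = CMD_SMART;        /* command */
    buf[1] = log_address;      /* e.g. 0x06 as in the quoted initializer */
    buf[2] = FEATURE_READ_LOG; /* feature */
    buf[3] = 1;                /* number of sectors to transfer */
}

/*
 * Read a 16-bit little-endian field from the returned sector data.
 * Assembling the value from individual bytes avoids any dependence on
 * the host byte order, which is one alternative to swapping in place.
 */
unsigned int get_le16(const unsigned char *p)
{
    return (unsigned int)p[0] | ((unsigned int)p[1] << 8);
}
```

Reading fields byte by byte, as get_le16() does, is a design choice that keeps one code path for both architectures; swapping the whole structure after the memcpy(), as described in the thread, is equally valid.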
From: Steve W. <sw...@ar...> - 2003-03-31 21:40:39
|
Hi Bruce,

> I'm VERY excited to hear this. I have been wondering if someone would
> try to get smartmontools working on big-endian (I only have x86 little
> endian boxes).

Glad to hear this work will (hopefully) be beneficial.

> Is the kernel version number you gave above accurate (2.1.24?). Are
> you using similar/identical kernel versions in your comparison testing?

Yes, it is 2.1.24. On the x86 side, I've used a very similar kernel (from the 2.2 series--don't recall the exact version as I have a few 2.2 kernels unpacked/built at the moment). The HDIO_DRIVE_CMD ioctl is identical between the 2.2.x and 2.1.24 kernels I've used, and the ide_do_drive_cmd() function looks similar--but I haven't done an in-depth comparison to note any specific differences. I am working on building a 2.1.24 kernel on x86 right now to test with, and will report back once that is complete.

>> Local Time is: Sat Mar 29 21:14:45 2003 localtime
> Interesting timezone. Is this right? See utility.c for the relevant
> bits of code referencing tzname[]. Also man tzset.

On this particular system, it is correct. Presently I'm running this on a very scaled down Linux system (entire filesystem is only a few megs). I would assume since the system lacks zoneinfo, libc is defaulting to the string 'localtime' for the timezone (as evidenced by GNU C Library v1.96, file time/tzfile.h). If I set the TZ environment variable, it does properly affect the output:

$ TZ="MST" /mnt/smartctl -a /dev/hdb
..snip..
Local Time is: Mon Mar 31 20:05:38 2003 MST
..snip..

So it's certainly no error of smartmontools. Just another wonderful artifact of this slightly demented box.

>> Total time to complete off-line
>> data collection: ( 2) seconds.
> This looks too short (but might be right, I suppose, if you printed
> this output just as some data collection was finishing??)

I've run smartctl (what seems like) a hundred times on this drive, and it's always that value. Also when I ran it on x86, it was the same value as well.

>> 4 Start_Stop_Count 0x0032 100 100 008 Old_age - 444
>> 9 Power_On_Hours 0x0012 100 100 001 Old_age - 210
>> 12 Power_Cycle_Count 0x0032 100 100 008 Old_age - 417
> Do these numbers look reasonable? You should be able to use hdparm -y
> to spin down and spin up the disk while the system is running and see
> the start stop count increment while the power cycle count stays fixed.
> [Is the large number of power cycles in 210 hours right??]

This drive was in use 24/7 for 2 years, if that helps shed some light on the numbers. The Power_Cycle_Count seems high to me, but it's hard to say. Using 'hdparm -y', the Start_Stop_Count value increments while the other two values stay static, as you had indicated.

>> 199 UDMA_CRC_Error_Count 0x001a 196 196 000 Old_age - 10
> Does this count increment each time you get an ide error like the one
> above?

No, it does not increment.

> At this point, my only good answer is "kernel developers mailing list".
> The point being that obviously the other SMART calls that return 512
> byte structures succeeded. The fact that this one failed (and
> especially if the UDMA error count is correlated) might be a sign of a
> kernel driver bug. I wouldn't be surprised since I don't think that
> there is any other standard linux code that uses this ioctl().

That's what I feared (kernel issue). As a sanity test, I'll run smartmontools on x86 using the same kernel (2.1.24), and if time permits will also test on a PowerMac running PPC/Linux with 2.1.24 (this will require me finding a spare drive to install on, but I need to do some benchmarking with PPC/Linux for another app so it's not a huge inconvenience). I'll report the status back here, and if the error occurs across all three, I'll take this issue up on a kernel dev list. This will hopefully isolate any errors specific to the hardware I'm currently using. If it does end up being something specific to the hardware, I will try to take it up with the manufacturer (who most likely will not be responsive to this issue).

> Steve, would you like to join the group of smartmontools developers so
> that you can integrate your changes into the body of the code? If so,
> let me know if you have a sourceforge user name, and if you are
> familiar with CVS. If not, I'll help you get started.

I'll need to create an account. I'll do so here shortly, and send you the details in a separate message. I am familiar with CVS, and while it's not my favorite, I should be able to handle it. I do appreciate the offer for help, and of course may have to take you up on it.

> [PS: I remember the first time I saw a lisa. I was a postdoc,
> visiting a friend in Austin TX around 1984-5. He very proudly showed
> me his lisa, one of the first ones out, on which he had blown at least
> 5 grand.]

That's very neat. Not to drift terribly off topic, but do you mind if I ask what he used it for (provided you still remember)? It's so very rare to encounter somebody that either owned one or had a friend that owned one, I'm always interested in hearing its original intended use. And FWIW, I was probably equally as proud as your friend when I finally got a Lisa in 2000. But that's just due to the Apple-geek blood that pumps through my veins.

Cheers,
Steve
|
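[Editor's sketch] The timezone exchange above turns on how the C library fills in tzname[] from the TZ environment variable. Here is a minimal C sketch of that mechanism, assuming a POSIX system; zone_name_for() is a hypothetical helper and is not taken from smartmontools' utility.c. The usage note uses the POSIX form "MST7" (name plus explicit offset) rather than the bare "MST" from the message, as an assumption that this form can be parsed without any zoneinfo files installed.

```c
#define _POSIX_C_SOURCE 200112L /* for setenv(), unsetenv(), tzset(), tzname */

#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/*
 * Set TZ to tz_value (or clear it when tz_value is NULL), let the C
 * library re-read it with tzset(), and copy the resulting standard-time
 * abbreviation from tzname[0] into out.
 * Returns 0 on success, -1 if TZ could not be set or out is too small.
 */
int zone_name_for(const char *tz_value, char *out, size_t outlen)
{
    size_t len;

    if (tz_value != NULL) {
        if (setenv("TZ", tz_value, 1) != 0)
            return -1;
    } else {
        unsetenv("TZ");
    }

    tzset(); /* refreshes tzname[] from the current TZ setting */

    len = strlen(tzname[0]);
    if (len + 1 > outlen)
        return -1;
    memcpy(out, tzname[0], len + 1); /* include the terminating NUL */
    return 0;
}
```

With zone_name_for("MST7", buf, sizeof buf) the abbreviation copied into buf should be MST, the same label that appeared at the end of the "Local Time is:" line once TZ was set.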
From: Bruce A. <ba...@gr...> - 2003-04-01 04:23:42
|
Hi Steve,

> >> Total time to complete off-line
> >> data collection: ( 2) seconds.
> > This looks too short (but might be right, I suppose, if you printed
> > this output just as some data collection was finishing??)
>
> I've run smartctl (what seems like) a hundred times on this drive, and
> it's always that value. Also when I ran it on x86, it was the same
> value as well.

OK. I'm not sure, but I have a memory that on some disks that I've seen (IBM) this time is the time to check a single cylinder. Anyway if it's the same on x86 it's probably right.

> >> 4 Start_Stop_Count 0x0032 100 100 008 Old_age - 444
> >> 9 Power_On_Hours 0x0012 100 100 001 Old_age - 210
> >> 12 Power_Cycle_Count 0x0032 100 100 008 Old_age - 417
> > Do these numbers look reasonable? You should be able to use hdparm -y
> > to spin down and spin up the disk while the system is running and see
> > the start stop count increment while the power cycle count stays fixed.
> > [Is the large number of power cycles in 210 hours right??]
>
> This drive was in use 24/7 for 2 years, if that helps shed some light
> on the numbers.

It does. It looks as if something may be wrong with the raw value of Attribute 9. This may be due to the endian change. For fun, try using the
    -v N,raw8
option to smartctl, to see the individual bytes.
    -v N,raw16
may also be useful.

> to say. Using 'hdparm -y', the Start_Stop_Count value increments while

Good! That sounds right.

> > At this point, my only good answer is "kernel developers mailing list".
> > The point being that obviously the other SMART calls that return 512
> > byte structures succeeded. The fact that this one failed (and
> > especially if the UDMA error count is correlated) might be a sign of a
> > kernel driver bug. I wouldn't be surprised since I don't think that
> > there is any other standard linux code that uses this ioctl().
>
> That's what I feared (kernel issue). As a sanity test, I'll run
> smartmontools on x86 using the same kernel (2.1.24), and if time
> permits will also test on a PowerMac running PPC/Linux with 2.1.24
> (this will require me finding a spare drive to install on, but I need
> to do some benchmarking with PPC/Linux for another app so it's not a
> huge inconvenience). I'll report the status back here, and if the
> error occurs across all three, I'll take this issue up on a kernel dev
> list.

This sounds like a good plan.

> This will hopefully isolate any errors specific to the hardware
> I'm currently using. If it does end up being something specific to the
> hardware, I will try to take it up with the manufacturer (who most
> likely will not be responsive to this issue).

I doubt it's specific to the hardware - Quantum has been pretty involved in SMART for quite some time, and I suspect their implementation in firmware is pretty solid.

> > [PS: I remember the first time I saw a lisa. I was a postdoc,
> > visiting a friend in Austin TX around 1984-5. He very proudly showed
> > me his lisa, one of the first ones out, on which he had blown at least
> > 5 grand.]
>
> That's very neat. Not to drift terribly off topic, but do you mind if
> I ask what he used it for (provided you still remember)?

Code development. He was a graduate student working on gravitational physics.

> It's so very rare to encounter somebody that either owned one or had a
> friend that owned one, I'm always interested in hearing its original
> intended use. And FWIW, I was probably equally as proud as your friend
> when I finally got a Lisa in 2000. But that's just due to the
> Apple-geek blood that pumps through my veins.

One of my colleagues gets his jollies by running dusty versions of BSD on his PDP-11. No kidding.

Bruce
|
From: Steve W. <sw...@ar...> - 2003-04-01 21:17:46
|
Hi Bruce,

>>> [Is the large number of power cycles in 210 hours right??]
>> This drive was in use 24/7 for 2 years, if that helps shed some light
>> on the numbers.
> It does. It looks as if something may be wrong with the raw value of
> Attribute 9. This may be due to the endian change. For fun, try using
> the
> -v N,raw8
> option to smartctl, to see the individual byte.
> -v N,raw16
> may also be useful.

Using smartctl on PowerPC, the following values are:

Normal output:
  9 Power_On_Hours 0x0012 100 100 001 Old_age - 248
With raw16:
  9 Power_On_Hours 0x0012 100 100 001 Old_age - 0 0 248
With raw8:
  9 Power_On_Hours 0x0012 100 100 001 Old_age - 0 0 0 0 0 248

Using smartctl on x86, the following values are:

Normal output:
  9 Power_On_Hours 0x0012 100 100 001 Old_age - 248
With raw16:
  9 Power_On_Hours 0x0012 100 100 001 Old_age - 0 0 248
With raw8:
  9 Power_On_Hours 0x0012 100 100 001 Old_age - 0 0 0 0 0 248

Everything looks identical there.

Cheers,
Steve
|
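[Editor's sketch] The raw8 and raw16 listings above are two views of the same 6-byte raw attribute value. A minimal C sketch of how such views can be produced independently of host byte order follows. It assumes the raw value is stored least-significant byte first and that the most significant unit is printed first, which is consistent with the "0 0 248" and "0 0 0 0 0 248" output shown; the function names are hypothetical and this is not the smartctl implementation.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Assemble the 48-bit raw value from six bytes, least significant first. */
uint64_t raw_value48(const unsigned char raw[6])
{
    uint64_t value = 0;
    int i;

    for (i = 5; i >= 0; i--)
        value = (value << 8) | raw[i];
    return value;
}

/* Print the value as three 16-bit words, most significant word first. */
int format_raw16(const unsigned char raw[6], char *out, size_t outlen)
{
    unsigned int w0 = (unsigned int)raw[0] | ((unsigned int)raw[1] << 8);
    unsigned int w1 = (unsigned int)raw[2] | ((unsigned int)raw[3] << 8);
    unsigned int w2 = (unsigned int)raw[4] | ((unsigned int)raw[5] << 8);

    return snprintf(out, outlen, "%u %u %u", w2, w1, w0);
}

/* Print the value as six single bytes, most significant byte first. */
int format_raw8(const unsigned char raw[6], char *out, size_t outlen)
{
    return snprintf(out, outlen, "%u %u %u %u %u %u",
                    (unsigned int)raw[5], (unsigned int)raw[4],
                    (unsigned int)raw[3], (unsigned int)raw[2],
                    (unsigned int)raw[1], (unsigned int)raw[0]);
}
```

Because every word is built from individual bytes, the same input bytes give the same text on PowerPC and x86, which matches the identical output reported above.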
From: Bruce A. <ba...@gr...> - 2003-04-01 21:23:42
|
> >>> [Is the large number of power cycles in 210 hours right??]
> >> This drive was in use 24/7 for 2 years, if that helps shed some light
> >> on the numbers.
> > It does. It looks as if something may be wrong with the raw value of
> > Attribute 9. This may be due to the endian change. For fun, try using
> > the
> > -v N,raw8
> > option to smartctl, to see the individual byte.
> > -v N,raw16
> > may also be useful.
>
> Using smartctl on PowerPC, the following values are:
> Normal output:
> 9 Power_On_Hours 0x0012 100 100 001 Old_age - 248
> With raw16:
> 9 Power_On_Hours 0x0012 100 100 001 Old_age - 0 0 248
> With raw8:
> 9 Power_On_Hours 0x0012 100 100 001 Old_age - 0 0 0 0 0 248
>
> Using smartctl on x86, the following values are:
> Normal output:
> 9 Power_On_Hours 0x0012 100 100 001 Old_age - 248
> With raw16:
> 9 Power_On_Hours 0x0012 100 100 001 Old_age - 0 0 248
> With raw8:
> 9 Power_On_Hours 0x0012 100 100 001 Old_age - 0 0 0 0 0 248
>
> Everything looks identical there.

OK, thanks for trying. It doesn't look like your code is doing anything "wrong".

I don't understand what Attribute 9 is for this disk. It's made by Quantum, right? Does the raw value of Attribute 9 change with time? [You can monitor it with the -R 9 Directive in /etc/smartd.conf.]

Cheers,
Bruce
|
From: Steve W. <sw...@ar...> - 2003-04-01 21:41:19
|
Hi Bruce,

> OK, thanks for trying. It doesn't look like your code is doing
> anything "wrong".

Heh. I so rarely hear that, it's quite refreshing. :)

> I don't understand what Attribute 9 is for this disk. It's made by
> Quantum, right? Does the raw value of Attribute 9 change with time?
> [You can monitor it with the -R 9 Directive in /etc/smartd.conf.]

Yes, it is made by Quantum. It is Quantum Part Number QML20000LC-A. The drive label also lists 'LC22AT LC22A3M1 REV01-A A010F'. The value is changing over time: when I first sent the message it was at 210, and presently it is at 248. I don't think that value is measured in hours, as I don't think I've had the drive powered up for 38 hours between the two readings. I will try out smartd after futzing with 2.1.24 on x86 today for the Error&Self Test log purposes (that is my goal for the afternoon).

Cheers,
Steve
|
From: Bruce A. <ba...@gr...> - 2003-04-01 21:34:51
|
Hi Steve,

> > I don't understand what Attribute 9 is for this disk. It's made by
> > Quantum, right? Does the raw value of Attribute 9 change with time?
> > [You can monitor it with the -R 9 Directive in /etc/smartd.conf.]
>
> Yes it is made by Quantum. It is Quantum Part Number QML20000LC-A.
> Also lists 'LC22AT LC22A3M1 REV01-A A010F' on the drive label. The
> value is changing over time- When I first sent the message it was at
> 210, and presently it is at 248. I don't think that value is reflected
> in hours, as I don't think I've had the drive powered up for 38 hours
> between them. I will try out smartd after futzing with 2.1.24 on x86
> today for the Error&Self Test log purposes (that is my goal for the
> afternoon).

Hmmm. This might be one of those drives that only starts recording power on hours after SMART is enabled. Is it possible that SMART was only enabled 248 hours ago?

Anyway, please see if you can figure out if this attribute raw value is counting time in hours, or something else...

Cheers,
Bruce
|
From: Steve W. <sw...@ar...> - 2003-04-01 21:48:25
|
> Hmmm. This might be one of those drives that only starts recording
> power on hours after SMART is enabled. Is it possible that SMART was
> only enabled 248 hours ago?

I would say the possibility is highly unlikely. When I initially started porting smartmontools over, SMART was already enabled on this drive. Prior to this, I can't think of any incident where I would've enabled it myself, so I would have to assume it came this way from the "factory". The only possibility I can think of where I would've enabled SMART (provided it came disabled from the factory) would have been running manufacturer diagnostic software on it, but I'm very certain that I have not done that to this drive as it's never given me any reason to.

> Anyway, please see if you can figure out if this attribute raw value is
> counting time in hours, or something else...

Will do.

Cheers,
Steve
|