#41 READ DMA errors and frozen hard disk

Status: closed-out-of-date
Owner: nobody
Labels: None
Priority: 5
Updated: 2014-12-21
Created: 2010-10-08
Creator: Anonymous
Private: No

I'm trying to find the cause of a 30-second freeze during boot that involves hdparm's APM option.
It happens with:
- kernel 2.6.32 but not 2.6.35-rc6 and above (versions between 2.6.32 and 2.6.35-rc6 not tested).
- hdparm 9.32 but not 9.27 (versions in between not tested).
- official Debian packages.
You can find more details here:
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=598862

Could you give me some advice on how to track down this problem?
Since the upcoming Debian stable release will ship with 2.6.32, it would be very helpful.

Thanks.

Cesare.

Discussion

  • Anonymous

    Anonymous - 2010-10-23

    I'm the bug submitter.
    Today, for the first time, the problem happened with a kernel other than 2.6.32: it was 2.6.36-rc6 (official kernel from Debian).

    I attach the output of "smartctl -a /dev/sda", where you can see that all the errors happened during the APM settings change. I also attach the relevant part of dmesg (plus some context), which shows what happened, including the 30-second freeze.
    In total I've seen this freeze 5 times, reported as UDMA_CRC_Error_Count by smartctl.

    Cesare.

     
    Last edit: Anonymous 2014-03-20
  • Anonymous

    Anonymous - 2010-10-23

    Since I'm not able to attach files, I'm posting the syslog errors inline:

    Oct 22 20:54:42 tommaso kernel: [ 8.512023] intel8x0_measure_ac97_clock: measured 55425 usecs (2670 samples)
    Oct 22 20:54:42 tommaso kernel: [ 8.512027] intel8x0: clocking to 48000
    Oct 22 20:54:42 tommaso kernel: [ 8.512777] Intel ICH Modem 0000:00:1f.6: PCI INT B -> Link[LNKB] -> GSI 9 (level, low) -> IRQ 9
    Oct 22 20:54:42 tommaso kernel: [ 8.512802] Intel ICH Modem 0000:00:1f.6: setting latency timer to 64
    Oct 22 20:54:42 tommaso kernel: [ 8.616119] MC'97 1 converters and GPIO not ready (0xff00)
    Oct 22 20:54:42 tommaso kernel: [ 39.904064] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
    Oct 22 20:54:42 tommaso kernel: [ 39.904141] ata1.00: failed command: READ DMA
    Oct 22 20:54:42 tommaso kernel: [ 39.904189] ata1.00: cmd c8/00:08:97:11:9c/00:00:00:00:00/e0 tag 0 dma 4096 in
    Oct 22 20:54:42 tommaso kernel: [ 39.904191] res 40/00:fe:00:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
    Oct 22 20:54:42 tommaso kernel: [ 39.904322] ata1.00: status: { DRDY }
    Oct 22 20:54:42 tommaso kernel: [ 44.944021] ata1: link is slow to respond, please be patient (ready=0)
    Oct 22 20:54:42 tommaso kernel: [ 49.928021] ata1: device not ready (errno=-16), forcing hardreset
    Oct 22 20:54:42 tommaso kernel: [ 49.928030] ata1: soft resetting link
    Oct 22 20:54:42 tommaso kernel: [ 50.109997] ata1.00: configured for UDMA/100
    Oct 22 20:54:42 tommaso kernel: [ 50.110004] ata1.00: device reported invalid CHS sector 0
    Oct 22 20:54:42 tommaso kernel: [ 50.110017] ata1: EH complete
    Oct 22 20:54:42 tommaso kernel: [ 50.611251] Adding 5245216k swap on /dev/sda3. Priority:-1 extents:1 across:5245216k
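
The stall shows up as a jump in the kernel timestamps of the excerpt above. As a rough sketch (the `largest_gap` helper and the `SAMPLE` selection are illustrative, not part of any tool), the gap can be measured by comparing consecutive `[ seconds ]` stamps:

```python
import re

# A few lines copied from the syslog excerpt above (not the full log).
SAMPLE = """\
Oct 22 20:54:42 tommaso kernel: [ 8.616119] MC'97 1 converters and GPIO not ready (0xff00)
Oct 22 20:54:42 tommaso kernel: [ 39.904064] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Oct 22 20:54:42 tommaso kernel: [ 44.944021] ata1: link is slow to respond, please be patient (ready=0)
Oct 22 20:54:42 tommaso kernel: [ 50.110017] ata1: EH complete
"""

# Matches the kernel uptime stamp, e.g. "kernel: [ 39.904064]".
STAMP = re.compile(r"kernel: \[\s*(\d+\.\d+)\]")


def largest_gap(text):
    """Return (gap_seconds, start, end) for the biggest jump between consecutive stamps."""
    stamps = [float(m.group(1)) for m in STAMP.finditer(text)]
    if len(stamps) < 2:
        raise ValueError("need at least two timestamped lines")
    return max((later - earlier, earlier, later)
               for earlier, later in zip(stamps, stamps[1:]))


gap, start, end = largest_gap(SAMPLE)
print(f"{gap:.3f} s without kernel messages between {start:.6f} and {end:.6f}")
```

On these lines the largest gap is about 31 seconds and ends at the READ DMA timeout, which is consistent with the libata command timeout expiring rather than with slow I/O.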

     
  • Anonymous

    Anonymous - 2010-10-23

    Then the smartctl -a output:

    smartctl 5.40 2010-07-12 r3124 [i686-pc-linux-gnu] (local build)
    Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

    === START OF INFORMATION SECTION ===
    Device Model: SAMSUNG HM160HC
    Serial Number: S12TJF0S982076
    Firmware Version: LQ100-10
    User Capacity: 160,041,885,696 bytes
    Device is: Not in smartctl database [for details use: -P showall]
    ATA Version is: 7
    ATA Standard is: ATA/ATAPI-7 T13 1532D revision 0
    Local Time is: Sat Oct 23 02:28:32 2010 CEST
    SMART support is: Available - device has SMART capability.
    SMART support is: Enabled

    === START OF READ SMART DATA SECTION ===
    SMART overall-health self-assessment test result: PASSED

    General SMART Values:
    Offline data collection status: (0x05) Offline data collection activity
    was aborted by an interrupting command from host.
    Auto Offline Data Collection: Disabled.
    Self-test execution status: ( 0) The previous self-test routine completed
    without error or no self-test has ever
    been run.
    Total time to complete Offline
    data collection: ( 55) seconds.
    Offline data collection
    capabilities: (0x5b) SMART execute Offline immediate.
    Auto Offline data collection on/off support.
    Suspend Offline collection upon new
    command.
    Offline surface scan supported.
    Self-test supported.
    No Conveyance Self-test supported.
    Selective Self-test supported.
    SMART capabilities: (0x0003) Saves SMART data before entering
    power-saving mode.
    Supports SMART auto save timer.
    Error logging capability: (0x01) Error logging supported.
    General Purpose Logging supported.
    Short self-test routine
    recommended polling time: ( 2) minutes.
    Extended self-test routine
    recommended polling time: ( 55) minutes.
    SCT capabilities: (0x003f) SCT Status supported.
    SCT Error Recovery Control supported.
    SCT Feature Control supported.
    SCT Data Table supported.

    SMART Attributes Data Structure revision number: 16
    Vendor Specific SMART Attributes with Thresholds:
    ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
      1 Raw_Read_Error_Rate     0x000f   100   100   051    Pre-fail  Always       -       0
      3 Spin_Up_Time            0x0007   252   252   025    Pre-fail  Always       -       2062
      4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       592
      5 Reallocated_Sector_Ct   0x0033   252   252   010    Pre-fail  Always       -       0
      7 Seek_Error_Rate         0x000e   252   252   051    Old_age   Always       -       0
      8 Seek_Time_Performance   0x0024   252   252   015    Old_age   Offline      -       0
      9 Power_On_Hours          0x0032   097   097   000    Old_age   Always       -       1663
     10 Spin_Retry_Count        0x0032   252   252   051    Old_age   Always       -       0
     12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       308
    191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       4233
    192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       29
    194 Temperature_Celsius     0x0022   124   097   000    Old_age   Always       -       38 (Lifetime Min/Max 15/47)
    195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always       -       0
    196 Reallocated_Event_Count 0x0032   252   252   000    Old_age   Always       -       0
    197 Current_Pending_Sector  0x0012   252   252   000    Old_age   Always       -       0
    198 Offline_Uncorrectable   0x0030   252   252   000    Old_age   Offline      -       0
    199 UDMA_CRC_Error_Count    0x0036   200   200   000    Old_age   Always       -       5
    200 Multi_Zone_Error_Rate   0x000a   100   100   000    Old_age   Always       -       0
    201 Soft_Read_Error_Rate    0x0032   252   252   000    Old_age   Always       -       0
    223 Load_Retry_Count        0x0032   100   100   000    Old_age   Always       -       6
    225 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       3530

    SMART Error Log Version: 1
    ATA Error Count: 5
    CR = Command Register [HEX]
    FR = Features Register [HEX]
    SC = Sector Count Register [HEX]
    SN = Sector Number Register [HEX]
    CL = Cylinder Low Register [HEX]
    CH = Cylinder High Register [HEX]
    DH = Device/Head Register [HEX]
    DC = Device Command Register [HEX]
    ER = Error register [HEX]
    ST = Status register [HEX]
    Powered_Up_Time is measured from power on, and printed as
    DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
    SS=sec, and sss=millisec. It "wraps" after 49.710 days.

    Error 5 occurred at disk power-on lifetime: 1584 hours (66 days + 0 hours)
    When the command that caused the error occurred, the device was in an unknown state.

    After command completion occurred, registers were:
    ER ST SC SN CL CH DH
    -- -- -- -- -- -- --
    84 51 f0 f7 44 9c e0 Error: ICRC, ABRT 240 sectors at LBA = 0x009c44f7 = 10241271

    Commands leading to the command that caused the error were:
    CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
    -- -- -- -- -- -- -- -- ---------------- --------------------
    c8 00 f0 f7 44 9c e0 00 00:00:56.687 READ DMA
    ef 05 fe 00 00 00 40 00 00:00:26.250 SET FEATURES [Enable APM]
    c8 00 20 6f c9 29 e0 00 00:00:26.250 READ DMA
    c8 00 30 c7 44 9c e0 00 00:00:26.250 READ DMA
    c8 00 08 5f 44 9c e0 00 00:00:26.250 READ DMA

    Error 4 occurred at disk power-on lifetime: 1550 hours (64 days + 14 hours)
    When the command that caused the error occurred, the device was in an unknown state.

    After command completion occurred, registers were:
    ER ST SC SN CL CH DH
    -- -- -- -- -- -- --
    84 51 30 97 3f 9f e0 Error: ICRC, ABRT 48 sectors at LBA = 0x009f3f97 = 10436503

    Commands leading to the command that caused the error were:
    CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
    -- -- -- -- -- -- -- -- ---------------- --------------------
    c8 00 30 97 3f 9f e0 00 00:01:40.062 READ DMA
    c8 00 28 ef 45 9c e0 00 00:01:09.500 READ DMA
    c8 00 f0 f7 44 9c e0 00 00:01:09.500 READ DMA
    ef 05 fe 00 00 00 40 00 00:01:09.312 SET FEATURES [Enable APM]
    c8 00 20 77 3f 9f e0 00 00:01:09.312 READ DMA

    Error 3 occurred at disk power-on lifetime: 1538 hours (64 days + 2 hours)
    When the command that caused the error occurred, the device was in an unknown state.

    After command completion occurred, registers were:
    ER ST SC SN CL CH DH
    -- -- -- -- -- -- --
    84 51 f0 f7 44 9c e0 Error: ICRC, ABRT 240 sectors at LBA = 0x009c44f7 = 10241271

    Commands leading to the command that caused the error were:
    CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
    -- -- -- -- -- -- -- -- ---------------- --------------------
    c8 00 f0 f7 44 9c e0 00 03:18:09.437 READ DMA
    ef 05 fe 00 00 00 40 00 03:17:38.937 SET FEATURES [Enable APM]
    c8 00 30 c7 44 9c e0 00 03:17:38.937 READ DMA
    c8 00 08 5f 44 9c e0 00 03:17:38.937 READ DMA
    c8 00 30 0f 44 9c e0 00 03:17:38.937 READ DMA

    Error 2 occurred at disk power-on lifetime: 1534 hours (63 days + 22 hours)
    When the command that caused the error occurred, the device was in an unknown state.

    After command completion occurred, registers were:
    ER ST SC SN CL CH DH
    -- -- -- -- -- -- --
    84 51 f0 f7 44 9c e0 Error: ICRC, ABRT 240 sectors at LBA = 0x009c44f7 = 10241271

    Commands leading to the command that caused the error were:
    CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
    -- -- -- -- -- -- -- -- ---------------- --------------------
    c8 00 f0 f7 44 9c e0 00 17:56:12.312 READ DMA
    c8 00 08 bf 05 a4 e0 00 17:55:42.062 READ DMA
    ef 05 fe 00 00 00 40 00 17:55:41.937 SET FEATURES [Enable APM]
    c8 00 30 c7 44 9c e0 00 17:55:41.875 READ DMA
    c8 00 08 5f 44 9c e0 00 17:55:41.875 READ DMA

    Error 1 occurred at disk power-on lifetime: 1534 hours (63 days + 22 hours)
    When the command that caused the error occurred, the device was in an unknown state.

    After command completion occurred, registers were:
    ER ST SC SN CL CH DH
    -- -- -- -- -- -- --
    84 51 30 97 3f 9f e0 Error: ICRC, ABRT 48 sectors at LBA = 0x009f3f97 = 10436503

    Commands leading to the command that caused the error were:
    CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
    -- -- -- -- -- -- -- -- ---------------- --------------------
    c8 00 30 97 3f 9f e0 00 17:36:25.000 READ DMA
    ef 05 fe 00 00 00 40 00 17:35:54.125 SET FEATURES [Enable APM]
    c8 00 20 77 3f 9f e0 00 17:35:54.125 READ DMA
    c8 00 10 2f 15 2c e0 00 17:35:54.062 READ DMA
    c8 00 58 77 15 2c e0 00 17:35:54.062 READ DMA
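
For readers decoding the register dumps in the error log above, here is a minimal sketch of how the "Error: ICRC, ABRT 240 sectors at LBA = ..." summaries follow from the raw bytes. The helper names (`lba28`, `error_flags`) are hypothetical, the bit table lists only a subset of the ATA error register bits, and 28-bit LBA addressing is assumed because the command is READ DMA (0xc8):

```python
# Subset of ATA error register bits, using the names smartctl prints.
ERROR_BITS = {0x80: "ICRC", 0x40: "UNC", 0x10: "IDNF", 0x04: "ABRT"}


def lba28(sn, cl, ch, dh):
    """Assemble a 28-bit LBA: the low nibble of DH holds bits 24-27."""
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn


def error_flags(er):
    """Return the names of the known bits set in the error register."""
    return [name for bit, name in ERROR_BITS.items() if er & bit]


# Register line of Error 5 above: ER ST SC SN CL CH DH
er, st, sc, sn, cl, ch, dh = (int(x, 16) for x in "84 51 f0 f7 44 9c e0".split())

lba = lba28(sn, cl, ch, dh)
flags = error_flags(er)
print(f"Error: {', '.join(flags)} {sc} sectors at LBA = 0x{lba:08x} = {lba}")
```

ICRC is an interface CRC error on the transfer between host and drive, not a media error, which fits the clean reallocation and pending-sector counters and the raw UDMA_CRC_Error_Count of 5.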

    SMART Self-test log structure revision number 1
    Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
    # 1 Short offline Completed without error 00% 1534 -
    # 2 Extended offline Completed without error 00% 1527 -
    # 3 Extended offline Completed without error 00% 18 -
    # 4 Short offline Completed without error 00% 17 -

    Note: selective self-test log revision number (0) not 1 implies that no selective self-test has ever been run
    SMART Selective self-test log data structure revision number 0
    Note: revision number not 1 implies that no selective self-test has ever been run
    SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
    1 0 0 Completed [00% left] (0-65535)
    2 0 0 Not_testing
    3 0 0 Not_testing
    4 0 0 Not_testing
    5 0 0 Not_testing
    Selective self-test flags (0x0):
    After scanning selected spans, do NOT read-scan remainder of disk.
    If Selective self-test is pending on power-up, resume after 0 minute delay.
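
The claim that every error follows an APM change can be checked against the error log entries in the smartctl output above. The sketch below is an illustration only: the timestamps are copied by hand from the five entries, and `to_seconds` is a hypothetical helper that assumes the Powered_Up_Time values carry no day prefix, as is the case here. It computes the delay between each SET FEATURES [Enable APM] command and the READ DMA that failed:

```python
def to_seconds(stamp):
    """Convert an hh:mm:ss.sss Powered_Up_Time value to seconds."""
    hours, minutes, seconds = stamp.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


# (error number, SET FEATURES [Enable APM] time, failing READ DMA time),
# copied from the SMART error log above.
EVENTS = [
    (5, "00:00:26.250", "00:00:56.687"),
    (4, "00:01:09.312", "00:01:40.062"),
    (3, "03:17:38.937", "03:18:09.437"),
    (2, "17:55:41.937", "17:56:12.312"),
    (1, "17:35:54.125", "17:36:25.000"),
]

delays = {
    number: round(to_seconds(failed) - to_seconds(apm), 3)
    for number, apm, failed in EVENTS
}

for number, delay in delays.items():
    print(f"Error {number}: READ DMA failed {delay:.3f} s after Enable APM")
```

All five delays fall between 30 and 31 seconds, which matches the length of the freeze. In each entry the SET FEATURES command carries SC = fe, that is APM level 254, presumably the value passed to hdparm's -B option at boot.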

     
  • Anonymous

    Anonymous - 2010-10-23

    I'm really sorry for the messed-up copy-paste of the logs, but the site doesn't let me attach anything. Maybe it's because I was not registered when I posted the bug.

    Tell me if you need more info or a less scrambled version.

    Cesare.

     
    Last edit: Anonymous 2013-11-19
  • Mark Lord

    Mark Lord - 2012-09-28
    • status: open --> closed-out-of-date
     
