I use ddrescue with your patch, and in many cases it speeds up the process, but sometimes it mishandles an ABRT error -- which in many cases is non-fatal -- and exits with "Passthrough error", whereas without passthrough the kernel handles it correctly as a device error, retries 6 times and goes on.
Here is the output (I turned on both ATA and SCSI passthrough just for information):
user@debian:~$ sudo ddrescue -d -f --scsi-passthrough --ata-passthrough --domain-logfile=/mnt/drsrv2/log/19914_5+246964064256_013_changed.domain /dev/sdb /dev/sda /mnt/drsrv2/log/19914_5+246964064256.log
GNU ddrescue 1.19
Press Ctrl-C to interrupt
Initial status (read from logfile)
(sizes below are limited to the domain 246964 MB to 500105 MB)
rescued: 180808 MB, errsize: 274 MB, errors: 5890
Current status
rescued: 180808 MB, errsize: 274 MB, current rate: 0 B/s
ipos: 378481 MB, errors: 5890, average rate: 0 B/s
opos: 378481 MB, run time: 0 s, successful read: 0 s ago
Scraping failed blocks... (forwards)
scsi sense key reports the command failed,
the command my not be supported, or something else went wrong
additional sense info: 0B 00 00
sense key '0B' indicates ABORTED_COMMAND
ATA return data:
descriptor= 09
additional length= 0C
extend= 01
error= 04
count= 0000
LBAhigh= 0000
LBAmid= 2C0F
LBAlow= A0C0
device= E0
status= 41
rescued: 180808 MB, errsize: 274 MB, current rate: 0 B/s
ipos: 378481 MB, errors: 5890, average rate: 0 B/s
opos: 378481 MB, run time: 3 s, successful read: 3 s ago
ddrescue: Passthrough error
There are already options for exiting on a timeout since the last successful read and on the error rate, so that can be set if necessary. But this behaviour just makes recovery impossible while it is still possible through the kernel routines. Thank you!
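For readers following the register dump above, here is a minimal Python sketch of how the status and error values can be decoded. The bit names are the standard ATA ones; the helper names `decode` and `lba_from_dump` are made up for illustration. How the patch splits the LBA across the three printed fields is an assumption: concatenating them as high:mid:low, with 512-byte sectors, happens to agree with the ipos of 378481 MB shown in the log.

```python
# Standard ATA status register bits (obsolete bits omitted).
STATUS_BITS = {0x80: "BSY", 0x40: "DRDY", 0x20: "DF", 0x08: "DRQ", 0x01: "ERR"}
# ATA error register bits mentioned in this thread, plus ICRC.
ERROR_BITS = {0x80: "ICRC", 0x40: "UNC", 0x10: "IDNF", 0x04: "ABRT", 0x01: "AMNF"}


def decode(value, names):
    """Return the names of all bits set in a register value."""
    return [name for bit, name in names.items() if value & bit]


def lba_from_dump(high, mid, low):
    """Join the three 16-bit fields as printed in the dump.

    ASSUMPTION: the fields concatenate as high:mid:low. This is inferred
    from the log (it matches ipos), not taken from the patch source.
    """
    return (high << 32) | (mid << 16) | low


status = decode(0x41, STATUS_BITS)   # drive ready, error flag set
error = decode(0x04, ERROR_BITS)     # command aborted
lba = lba_from_dump(0x0000, 0x2C0F, 0xA0C0)
print(status, error, lba, lba * 512)  # 512-byte sectors assumed
```

So the drive reports itself ready with the error flag set, and the only error bit is ABRT.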
That is very odd. The ata error of 04 is indeed an abort, meaning the drive basically said "No I can't do that". The reason I made the patch exit when that happened is for those disks that lock up and need to be power cycled. Otherwise ddrescue just keeps marking all the reads as failed.
So how do you think I should address this? Maybe after the first abort it could just move on to the next read and if that aborted then exit? I could make that part of the --mark-abnormal-error option, maybe as a count. Is it always the same sectors that will cause this?
Maybe I will do this when ddrescue 1.20 is released. Right now I am starting to work on a very big project. I am planning on trying to write my own cloning tool. Unfortunately I am not planning on it being free (but it should be affordable). But I am trying something crazy and using direct I/O, and by that I mean totally bypassing the kernel drivers. This gives me direct access to the soft reset bit and the ability to control how long to wait before moving on to the next read. A very basic test on a couple of drives showed the possibility of processing errors 8 times faster than with my passthrough patch (this will vary with different drives, and testing was very limited). Imagine a rescue that would take 30 days with ddrescue (older Linux kernel), maybe 5-6 days with a newer Linux kernel (3.6+) or my passthrough, and possibly done in a day or less with my new cloning tool. Or so I hope...
Last edit: maximus57 2015-04-01
Well, in practice ABRT can mean different things -- it depends on the drive firmware. According to the standards, ABRT is just a sign that the last command was rejected. I've seen (with software that shows the error and status registers during data transfer) that the same drive can show UNC, ABRT, IDNF or AMNF when trying to re-read the same LBA. As far as I have noticed, the really catastrophic conditions result either in immediately rejected commands -- which gives a very high error rate that ddrescue can catch -- or in a BSY that never ends and that ignores a hardware reset (a software reset is meaningless when the drive's BSY flag is set).
The software that I used for data recovery some time ago had timers that could be set as a sequence (say, TMR1 (SRST) = 5 ms, TMR2 (HRST) = 200 ms, TMR3 (PWR_CY) = 5000 ms, TMR4 (HALT) = 20000 ms). The values are programmable, of course. When a read command is issued, we read the data immediately if no timer has expired; otherwise we perform the action for the current timer: if the drive has not set DRQ after 5 ms we do a software reset, if the drive is still not ready after that then a hardware reset, if that does not help we cycle the power, and if the drive is still not ready 20 seconds after that, we turn the power off and stop the task. That first timer can really speed up the process when there are many thousands of bad sectors, and I think you are moving in the right direction with that. But I'm afraid that different controllers can create "register hell"; maybe it is worth starting with the most widely used one, AHCI for example? I don't know what the differences between AHCI controllers are, but if it is possible to reach the registers and reset the device in a uniform way, that would be very nice!
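The timer sequence described above could be sketched roughly like this in Python. It is only an illustration of the escalation idea, not the actual recovery software: the name `action_for` is hypothetical, and it assumes every threshold is measured from the moment the read command was issued (the post could also be read as each timer starting after the previous action).

```python
# Escalation schedule from the post: (timeout in ms, action to take).
# ASSUMPTION: all timeouts count from when the read command was issued.
ESCALATION = [
    (5, "SRST"),       # software reset
    (200, "HRST"),     # hardware reset
    (5000, "PWR_CY"),  # power cycle
    (20000, "HALT"),   # power off and stop the task
]


def action_for(elapsed_ms, schedule=ESCALATION):
    """Return the most severe action whose timer has expired, or None."""
    due = None
    for threshold_ms, action in schedule:
        if elapsed_ms >= threshold_ms:
            due = action
    return due


for t in (3, 5, 250, 6000, 25000):
    print(t, action_for(t))
```

With a schedule like this, a sector that does not answer within 5 ms costs a quick reset instead of a long kernel timeout, which is where the speedup on drives with thousands of bad sectors would come from.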
For ddrescue patch, I think I will add a count to --mark-abnormal-error option, so the option would be --mark-abnormal-error <n>, where n is the number of sequential abnormal errors to allow before exiting. If you wanted to use the error rate you could just set abnormal error count to a high number.
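A rough Python sketch of what such a sequential counter might look like. This is not the patch code; the class name `AbnormalErrorGuard` is invented, and two details are assumptions: any result that is not an abnormal error resets the streak, and the rescue stops once the streak exceeds n (so n abnormal errors in a row are still allowed).

```python
class AbnormalErrorGuard:
    """Track consecutive abnormal errors (hypothetical helper)."""

    def __init__(self, limit):
        self.limit = limit   # n: sequential abnormal errors to allow
        self.streak = 0      # current run of abnormal errors

    def record(self, abnormal):
        """Record one read result; return True when the rescue should exit."""
        if abnormal:
            self.streak += 1
        else:
            # ASSUMPTION: a good read or an ordinary read error resets the run.
            self.streak = 0
        return self.streak > self.limit


guard = AbnormalErrorGuard(limit=2)
results = [True, True, False, True, True, True]
print([guard.record(r) for r in results])
```

Setting the limit very high effectively disables the exit, leaving the error-rate option to decide when to stop, as suggested above.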
As for the direct I/O, it will not be done using AHCI. The computer BIOS must be set to IDE. Drive registers are very easy to access this way (in Linux). The object is to plug the drive in AFTER Linux boots up. When in IDE mode if the drive is not seen at bootup, it does not show up when plugged in. But I can see it with direct I/O. This will hopefully keep the OS drivers from interfering as it should not be trying to do anything to a drive that is not there. The biggest problem with this is that both the drive and the controller start out at a very low speed connection. I can easily change the drive, but the controller is another story...
As for AHCI, it is both more complicated and hot pluggable. Meaning that the OS will see the drive when it is plugged in and try to take it over, which would be real bad for what we are trying to do.
Yes, a counter of sequential abnormal events is OK, let it be so :) And about IDE mode: some time ago I made a NAND flash reader based on an onboard IDE controller and simple logic. The problem was that it worked in PIO mode only, which resulted in a transfer speed of about 2 MB/s. To achieve high read speeds it was necessary to write a special driver that turned on busmaster mode and DMA transfers, so I did not do that. I mean that you'll face the same problems. I did not dig into AHCI mode, but is it really impossible to patch a driver to get at the registers on the fly and/or at least issue a reset? Maybe a simpler way is to hide a controller from udev to prevent partitions recognition? I doubt that all the work that allows high-speed transfer is done by kernel drivers and it is better to modify it than write the own ones from scratch...
Have already performed a simple DMA command :) So I have proof of concept (but would like to point out that DMA is VERY complicated). My next issue will be getting the controller and the drive into the proper UDMA mode as they appear to come up in DMA multiword mode 2. Drive will be easy with a command, but I have not tried to change the controller timing yet. In theory I should be able to use the setpci command to do what I need, but still very complicated.
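On the drive side, the command in question is presumably SET FEATURES (EFh) with the "set transfer mode" subcommand (03h), where the count register carries the mode. A small Python sketch of how that mode value is built, using the standard ATA encoding; the function name `transfer_mode_value` is made up, and this only computes the byte, it does not talk to any hardware or touch the controller timing.

```python
SET_FEATURES = 0xEF        # ATA SET FEATURES command opcode
SET_TRANSFER_MODE = 0x03   # features register value: set transfer mode

# Upper bits of the count register select the transfer type.
MODE_BASE = {"pio": 0x08, "mwdma": 0x20, "udma": 0x40}


def transfer_mode_value(kind, mode):
    """Return the count-register value for SET FEATURES / set transfer mode."""
    if kind not in MODE_BASE:
        raise ValueError("unknown transfer type: %r" % kind)
    if not 0 <= mode <= 7:
        raise ValueError("mode number must fit in 3 bits")
    return MODE_BASE[kind] | mode


print(hex(transfer_mode_value("mwdma", 2)))  # multiword DMA mode 2
print(hex(transfer_mode_value("udma", 5)))   # UDMA mode 5
```

Which UDMA mode is actually usable depends on the drive, the cable and the controller, so the mode number passed in here is just an example.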
Perhaps I am not truly bypassing ALL kernel drivers, as I may be using the controller driver now. Something mapped the ports that I am using. Or maybe some of it was negotiated by the hardware. But for all practical purposes I am using commands at about the lowest level possible from an OS. Even though I am writing in C all the needed commands seem to be present to do assembly level access, or at least the access that I need (don't get me going on physical memory location and DMA from userspace without using a kernel level driver).
Maybe a simpler way is to hide a controller from udev
to prevent partitions recognition?
Not sure about that. I am trying to do something that does not require special things to be done to the operating system.
I doubt that all the work that allows high-speed
transfer is done by kernel drivers and it is better
to modify it than write the own ones from scratch...
That actually sounds like more work to me. Having to dig for the code, figure out how it is working, and then try to modify it without breaking it.
I am trying to do this with an executable that can be run from a Live CD if needed, with no special modifications to the operating system.
Edited because I had the second quote wrong...
Last edit: maximus57 2015-04-02