Menu

Dell XPS 9550: All nvme disk IO hangs upon switch from AC->battery (problem narrowed down to Veracrypt)

Aaron
2019-08-19
2022-12-01
1 2 > >> (Page 1 of 2)
  • Aaron

    Aaron - 2019-08-19

    Hello all,

    First off, many thanks to the Veracrypt maintainers for keeping Truecrypt alive and well! You are awesome.

    Secondly, you're about to dive into (what I think will be) a very obscure issue (i.e. slow descent into Hell). So, you may want to turn back now :-)


    System: Dell XPS 9550 with Samsung NVME pm951 1TB SSD. Running latest BIOS (1.11.2)
    OS: Windows 10 Pro x64 build 18362
    Veracrypt: 1.23-hotfix-2 (have confirmed problem still exists in latest 1.24 beta and exists as far back as 1.16)
    System partition is encrypted. Using legacy BIOS bootloader (not UEFI

    Problem: When one switches from AC power (either the dell dock or regular AC adapter) to battery, the system immediately starts experiencing lengthy hangs due to hung IO. After each hang period, the disk IO resumes for a few moments and the event log viewer shows "Reset to device, \Device\RaidPort0, was issued." when the IO recovers. The hung IO continues to repeat randomly (and frequently) until the system is rebooted. Oddly enough, booting from battery is fine. Also, going from battery -> AC is fine. But, no matter how you booted, once you are on AC, you will experience the issue if you go back to battery.

    I have been using Veracrypt in my current system's config for many years now; the issue has only shown up in the past 6-8 months. My best guess is that a Windows 10 OS update has changed something that has caused the evilness to appear. (I basically have rolled back everything else I can think of to try to find the 'trigger' that I could un-do. No luck.)

    I was utterly convinced that this was a hardware/firmware issue due to the association with power source change. I troubleshooted this extensively with Dell Premium support (not that premium lol). In the process, they asked me to reinstall the OS clean. Didn't think it would work, but gave in and did it.

    Turns out, the issue was GONE upon clean reinstall. That is, until I reinstalled Veracrypt AND encrypted my system partition. Then, problem is back. Decrypting the drive (keeping Veracrypt installed) made problem go away again. So, the problem only appears when system partition is encrypted with veracrypt.

    I'm guessing that likely the issue is not entirely veracrypt's fault (probably some shitty Dell firmware/driver code that interfaces badly with veracrypt), but I've definitely narrowed the problem to only showing up once Veracrypt is installed and OS Encryption is enabled.

    Detailed list of troubleshooting so far:
    1. Rolling back as much stuff as possible. Drivers, bios, etc. Didn't work so I upgraded everything again.

    1. Trying various different Veracrypt versions. All the ones I've tried have had the problem.

    2. Setting up CONTROL-SCROLL to trigger bluescreen (capture memory dump) during a period of hung IO. Then, boot up again and drop the file into WinDBG. I suck at kernel debugging, but it appears that there are lots of IRPs hung in veracrypt.sys. But hard to know for sure due to my lack of skill. And I can't get the feintest idea of WHY they are hung.

    3. Enabling "Driver Verifier" to run against veracrypt.sys as well as the Samsung NVMe driver (secnvme.sys). No bluescreens (no caught violations of Kernel mode rules, even when the hang happens).

    4. Enabling local kernel debugging and configuring the system to retain all !dbgprint output. This gave me some debugging info, but nothing from the Veracrypt.sys driver.

    5. I looked at veracrypt.sys's source code and noticed a bunch of Dump() calls that were basically compiled out due to the release build. So, I got enough of a compilation environment going to produce a debug build of veracrypt.sys for myself. So, now I see a bunch of veracrypt output when I run !dbgprint with WinDBG. Problem still shows up with my compiled driver. Of course, this requires me to now boot my OS with "disable driver signature enforcement" :-)

    Anyway, as you can tell I've tried a bunch of things but I'm getting a little out of my depth. Any suggestions of what else I should try to figure out what the heck is causing IO to hang upon power source switch?

    Thanks much :-)

     

    Last edit: Aaron 2019-08-19
  • Aaron

    Aaron - 2019-09-03

    No takers? :-/ Eh, I don't blame you :-)

    Still, help debugging this driver would be appreciated as I'm out of my depth.

    I discovered TC_TRACE_IO_QUEUE so I #define'd it and re-compiled. I also discovered Sysinternals's DbgView tool which allowed me to grab a kernel ring buffer capture right when the hang was happening. See attached.

    Observations: Take a look at the 20 or so events leading up to 00008395. Again, out of my depth here, but it seems like some kind of IO jobs might be queuing up in the driver? And then everything hangs until some-odd 30 seconds later. After this point, it looks like IO jobs are completing again.

    Any idea as to how I could track down what's causing IO to hang? Happy to tweak the code and recompile as necessary :-)

    Aaron

     
  • Aaron

    Aaron - 2019-10-11

    Bump. Sorry for the spam but I'm still praying someone smarter than I sees this thread, has pity on me, and chimes in with some advice. :-)

     
  • DDD

    DDD - 2020-05-05

    I had no idea you could do Control-ScrollLock(2x) to make a blue screen. You did some interesting checking. I think I've noticed in the Event Viewer that the speed of processor 7 in group 0 is being limited by system firmware, so maybe that's Veracrypt? And that there is this system args threading exception that's caught but isn't put into Windows Error Reporting. It isn't too much longer after a system being limited event that the system gets frozen. It's interesting that you've tried different versions of Veracrypt - how far back had you gone? Something with system power was supposed to be fixed, I think, in 1.24 (one of the versions before Update4). You'd have to look at the changelog. If you can compile Veracrypt in Windows, perhaps you could look more closely at what's causing it to hang because you'll have the debugging tools and can look at it more closely. I figured it had something to do with power, too, but I don't think I have NVMe on my system.

     
  • Aaron

    Aaron - 2020-05-05

    I think I had reverted back all the way to Veracrypt 1.18. It did not help. Again, it seemed to coincide with either a firmware update (but I tried rolling back my firmware) or a Windows 10 update. I'm thinking it might have been Win10. Hard to say.

    I"m currently running 1.24 update 4 and I still experience the issue.

    I'm happy to get the compiler and debugger working again for the most recent code but to be honest i'm out of my depth with the kernel mode code so I'm not entirely sure what logging to add, etc.

     
  • DDD

    DDD - 2020-05-05

    I thought for a while that it had to do with the .net framework, because Windows 7 seems to run with no hangups.

     
  • Aaron

    Aaron - 2020-05-06

    No....I'm pretty sure it's a kernel issue combined with some dumb firmware/os/driver glitch(es). The crash dump I take upon CTRL-SCROLL shows all the IRPs hung inside Veracrypt.sys for some reason....

     

    Last edit: Aaron 2020-05-06
  • DDD

    DDD - 2020-05-06

    I don't really have a clue what I'm talking about, but perhaps it's something with ACPI? You mentioned that the problem happens when going to battery power. I like that you really got into troubleshooting what the problem could be. I like how thorough you get.

     
    • Aaron

      Aaron - 2020-05-06

      If by "thorough" you mean "desparate" than yes I am desparate :-)

      Regarding being related to ACPI: Possibly. But while power state is the trigger, ultimately the problem is due to hung IO operations that occur after the laptop's firmware, OS, or drivers make poor decisions in response to that state change. And, those problems don't manifest themselves without veracrypt installed. So either the bug is in Veracrypt.sys itself (unlikely) or somehow Veracrypt.sys is making the bug in the other drivers/kernel/firmware appear (more likely imo).

       
  • DDD

    DDD - 2020-05-06

    One thing I can say, though, is that as of 2020-05-06, Windows Update doesn't really look for each and every driver to see if there's an update. You have to go in and manually check for each one.

     
    • Aaron

      Aaron - 2020-05-06

      I did try a variety of driver updates when this all started. Admittedly in the last few months I've not been as religious about updating them....

       
  • DDD

    DDD - 2020-05-07

    To be honest, I've recreated it with Veracrypt not installed, on a Windows 1909 system, with all the updates. I get a Windows hardware error, being a LiveKernelEvent , but Windows doesn't send the hardware error reports, but it used to. I think Veracrypt just exacerbates the problem. I think it's a problem with the kernel. I'm going to play around with it a bit, because when I went from being plugged in to battery, the o.s. will go into sleep mode after 10 min of inactivity, and I think it jolts it in a way to be more responsive. What I do is crash Firefox. I have AVG installed, so maybe it could be AVG that's a problem too. not sure. I'm going to try to crash it plugged in, and on battery, and see what happens. Do you have the hard disk policy set to write-cache or not to write-cache?

     
    • Aaron

      Aaron - 2020-05-07

      Ok, yeah this sounds like a different issue than mine...

      (Disk is set to write cache.)

       
      • DDD

        DDD - 2020-05-17

        Did you update your drivers? Did that help or solve the problem?

         
        • Aaron

          Aaron - 2020-05-18

          I haven't experimented with updating them in the last 6 months, but prior to that I had tried updating pretty much everything --- hard disk firmware, drivers, bios, etc.

           
  • DDD

    DDD - 2020-05-07

    I've got a LiveKernelEvent 1a8, and 193, not sure what those mean.

     
  • DDD

    DDD - 2020-05-07

    It's probably the same issue.

     
  • Aaron

    Aaron - 2021-02-16

    Ok, so out of annoyance due to this issue I've been trying to continue my debugging efforts from a few years ago and get smarter about how the driver works. I again compiled Veracrypt 1.24-update7, caused a hang, and captured a manually-initiated crash dump.

    I am nearly 100% positive the hang is here, in the IoThreadProc thread:

    THREAD ffffb1032c538040  Cid 0004.01a8  Teb: 0000000000000000 Win32Thread: 0000000000000000 WAIT: (Executive) KernelMode Non-Alertable
                ffff8902c7107920  NotificationEvent
            IRP List:
                ffffb1032eb4b890: (0006,0160) Flags: 00060000  Mdl: ffffb10336508990
            Not impersonating
            DeviceMap                 ffffda014fc47a20
            Owning Process            ffffb10323879080       Image:         System
            Attached Process          N/A            Image:         N/A
            Wait Start TickCount      38972          Ticks: 1860 (0:00:00:29.062)
            Context Switch Count      684844         IdealProcessor: 0             
            UserTime                  00:00:00.000
            KernelTime                00:00:06.625
            Win32 Start Address veracrypt!IoThreadProc (0xfffff8003a8a97e0)
            Stack Init ffff8902c7107b90 Current ffff8902c7107520
            Base ffff8902c7108000 Limit ffff8902c7101000 Call 0000000000000000
            Priority 16 BasePriority 8 PriorityDecrement 0 IoPriority 2 PagePriority 5
            Child-SP          RetAddr           Call Site
            ffff8902`c7107560 fffff800`36a65850 nt!KiSwapContext+0x76
            ffff8902`c71076a0 fffff800`36a64d7f nt!KiSwapThread+0x500
            ffff8902`c7107750 fffff800`36a64623 nt!KiCommitThreadWait+0x14f
            ffff8902`c71077f0 fffff800`3a8b62a4 nt!KeWaitForSingleObject+0x233
            ffff8902`c71078e0 fffff800`3a8b6316 veracrypt!TCReadWriteDevice+0x124 [c:\aaron\dev\veracrypt\src\driver\ntdriver.c @ 3765] 
            ffff8902`c7107970 fffff800`3a8a9773 veracrypt!TCReadDevice+0x36 [c:\aaron\dev\veracrypt\src\driver\ntdriver.c @ 3778] 
            ffff8902`c71079b0 fffff800`3a8a9cab veracrypt!TCCachedRead+0xf3 [c:\aaron\dev\veracrypt\src\driver\encryptedioqueue.c @ 343] 
            ffff8902`c7107a20 fffff800`36b17e25 veracrypt!IoThreadProc+0x4cb [c:\aaron\dev\veracrypt\src\driver\encryptedioqueue.c @ 449] 
            ffff8902`c7107b10 fffff800`36bfd0d8 nt!PspSystemThreadStartup+0x55
            ffff8902`c7107b60 00000000`00000000 nt!KiStartSystemThread+0x28
    

    In NtDriver.c::TCReadWriteDevice(), it's basically creating a new IRP with IoBuildSynchronousFsdRequest to the lower-level driver. It then waits for this IRP to complete (KeWaitForSingleObject) but my guess is this never happens. Thus, the IoThreadProc never is freed up to do further work. The system then gradually grinds to a halt.

    Examining the IRP it's generating further...

    First, find the relevant IRP:

    kd> .thread ffffb1032c538040
    ...
    kd> .frame 4
    04 ffff8902`c71078e0 fffff800`3a8b6316 veracrypt!TCReadWriteDevice+0x124 [c:\aaron\dev\veracrypt\src\driver\ntdriver.c @ 3765] 
    
    0: kd> dv /i/t/V
    prv param  ffff8902`c7107970 @rsp+0x0090       int write = 0n0
    prv param  ffff8902`c7107978 @rsp+0x0098       struct _DEVICE_OBJECT * deviceObject = 0xffffb103`2c549060 Device for "\Driver\disk"
    prv param  ffff8902`c7107980 @rsp+0x00a0       void * buffer = 0xffffb103`2c602000
    prv param  ffff8902`c7107988 @rsp+0x00a8       union _LARGE_INTEGER offset = {61057008128}
    prv param  ffff8902`c7107990 @rsp+0x00b0       unsigned long length = 0x2000
    prv local  ffff8902`c7107950 @rsp+0x0070       long status = 0n259
    prv local  ffff8902`c7107940 @rsp+0x0060       struct _IO_STATUS_BLOCK ioStatusBlock = struct _IO_STATUS_BLOCK
    prv local  ffff8902`c7107938 @rsp+0x0058       struct _IRP * irp = 0xffffb103`2eb4b890
    prv local  ffff8902`c7107920 @rsp+0x0040       struct _KEVENT completionEvent = struct _KEVENT
    

    Ok, so 0xffffb1032eb4b890. Let's inspect it:

    kd> !irp 0xffffb1032eb4b890 1
    Irp is active with 2 stacks 2 is current (= 0xffffb1032eb4b9a8)
     Mdl=ffffb10336508990: No System Buffer: Thread ffffb1032c538040:  Irp stack trace.  
    Flags = 00060000
    ThreadListEntry.Flink = ffffb1032c5384f0
    ThreadListEntry.Blink = ffffb1032c5384f0
    IoStatus.Status = 00000000
    IoStatus.Information = 00000000
    RequestorMode = 00000000
    Cancel = 00
    CancelIrql = 0
    ApcEnvironment = 00
    UserIosb = ffff8902c7107940
    UserEvent = ffff8902c7107920
    Overlay.AsynchronousParameters.UserApcRoutine = 00000000
    Overlay.AsynchronousParameters.UserApcContext = 00000000
    Overlay.AllocationSize = 00000000 - 00000000
    CancelRoutine = 00000000   
    UserBuffer = 00000000
    &Tail.Overlay.DeviceQueueEntry = ffffb1032eb4b908
    Tail.Overlay.Thread = ffffb1032c538040
    Tail.Overlay.AuxiliaryBuffer = 00000000
    Tail.Overlay.ListEntry.Flink = 00000000
    Tail.Overlay.ListEntry.Blink = 00000000
    Tail.Overlay.CurrentStackLocation = ffffb1032eb4b9a8
    Tail.Overlay.OriginalFileObject = 00000000
    Tail.Apc = 00000001
    Tail.CompletionKey = 00000001
         cmd  flg cl Device   File     Completion-Context
     [N/A(0), N/A(0)]
                0  0 00000000 00000000 00000000-00000000    
    
                Args: 00000000 00000000 00000000 00000000
    >[IRP_MJ_READ(3), N/A(34)]
                0  0 ffffb1032c549060 00000000 00000000-00000000    
               \Driver\disk
                Args: 00002000 00000000 e37480200 00000000
    

    Okay, so the IRP looks like it's dispatched to disk.sys but Lord knows why it's not being serviced.
    @idrassi any ideas as to how I could go about figuring out why the Veracrypt-generated IRP is never completing?

     
  • Aaron

    Aaron - 2021-02-16

    (P.S. Sorry my kernel debugging skills are such shit.)

     
    • DDD

      DDD - 2021-02-18

      It's refreshing to read your post. It's refreshing to take a look at debugging code. You're sure it's a problem with the Veracrypt code? (yes, I'm ignorant. :) )

       
      • Aaron

        Aaron - 2021-02-18

        Honestly I don't know for sure. As I said in my OP I am guessing the bug is elsewhere but Veracrypt is triggering the bug to surface. I HAVE confirmed these system hangs don't occur if I decrypt the drive and uninstall veracrypt.

        I had an idea about what could be going on but it is just a wild guess. The IoThreadProc appears to be popping work off of a shared queue and processing that work one job at a time. As part of that work, in this case, it's waiting for IOP completion. As a result, no other work is being done during that wait.

        But what if the lower-level logic servicing that IOP separately decides it wants to perform higher-level disk IO? E.g. maybe the lower level NVME driver is doing something dumb like calling a high-layer driver which THEN in turn attempts to do more IO via veracrypt.sys. And then that driver (maybe NVME driver) is waiting for veracrypt, which is in turn waiting on the nvme driver. I.e. a deadlock?

        To this end, my next goal was to attempt to inspect the IO job queue during the hang. I already did that a little and there was 1 queue entry waiting (MsMpEng.exe) for IoThreadProc to service next. But I still need to track down the original process that caused the current job to be sent to IoThreadProc.

         
        • DDD

          DDD - 2021-02-20

          With regards to your second paragraph, I've noticed that too. I've noticed that everything else gets put on hold, things dealt with one at a time, and then sometimes Windows will time out on certain things. I think that it may not even be restricted to just NVMe. Hey, keep at it!! It's really refreshing to read your posts, to see how you're attempting to debug it. I had asked Mounir how to do it, but no response yet, so it's nice to see that you're tackling it. I'm very much looking forward to anything else you will write. :)

          and just to add, I think maybe when there are multiple reads and writes, such as when a person uses qbittorrent, and the hard drive is heavily fragmented, then that just adds to the frustration with the veracrypt driver, because it seems to just cause things to slow down eventually to a stop. The response rate goes up and up. Right now, I'm seeing some files taking about 4 seconds to service, and I'm not using an ssd drive. I'm not sure if the veracrypt driver is using threads or what's happening.

          Right now, all my torrents are sitting at zero download speed for the past 15 minutes, but the hard drive is at red hot 100%, because veracrypt is probably trying to service them all, and then qbittorrent is waiting for that to happen until it starts downloading more.

           

          Last edit: DDD 2021-02-20
  • DDD

    DDD - 2021-02-21

    So after having used qbittorrent, my hard drive eventually did freeze. What happens is that the computer runs out of memory, then tries to access the page file on the encrypted drive, but then the system has to wait until the Veracrypt driver gets through it's single-file list of reads and writes to get through. Even pressing ctrl-alt-del to bring up the lock screen can make the system wait and wait and wait because Windows doesn't (as far as I know yet) just keep the lock screen app in RAM at all times, but will write that memory to the page file too. Mind you, Windows has so many little programs running in memory that they just take up too much space, like 8 GB is not enough or something. I think, too, that by the time the Veracrypt driver finally services some of that data, maybe it's already obsolete or expired. My system froze, and because Windows doesn't correctly recognise my keys on my built-in keyboard, I couldn't initiate a crash. If I were to attach a usb keyboard, Windows would still have to try to get through all its reads and writes, the page file etc just to be able to recognise the external keyboard as a valid input method.

    Has anybody else noticed that the system just has this odd buzzing sound when the system is frozen? Is that an infinite loop? I can hear that reads and writes are not being done, but sometimes just this buzzing sound. Not the normal read write sound though.

    @Aaron, I'm really looking forward to reading more of your posts and your troubleshooting. And I'm not even sure which email address to send a kernel dump to. Anybody know?

    Oh yes, and I think that what I will do is to maybe just use my computer for an hour at a time, then reboot, especially when doing so many reads and writes, so that the system doesn't become progressively slower until it hangs.

     

    Last edit: DDD 2021-02-21
  • Aaron

    Aaron - 2021-02-25

    IDK. I've tried my hardest to track down why the \Driver\disk IRP is not being serviced. I would have expected it to say "PENDING" next to it in the debugger output if Disk.sys had serviced it but was waiting for completion of a related (lower down) IRP. But I see nothing like that. So I don't get it.

    Anyone have any idea how disk.sys services IRP_MJ_READ under healthy conditions?


    I've tried using !irpfind but it takes >5 hours to run and ultimately spams a bunch of 'missing page' errors that cause me to lose the output up to that point. From the partial output, I see a bunch of IRPs with all zeros for associated driver. I have been able to grab some of the output before I lost it, however, and I see a few IRPs with all 0s as the associated device. I don't get how that's possible either:

    0: kd> !irp ffff800f17cf7c20
    Irp is active with 6 stacks 6 is current (= 0xffff800f17cf7e58)
     No Mdl: No System Buffer: Thread 00000000:  Irp stack trace.  Pending has been returned
         cmd  flg cl Device   File     Completion-Context
     [N/A(0), N/A(0)]
                0  0 00000000 00000000 00000000-00000000    
    
                Args: 00000000 00000000 00000000 00000000
     [N/A(0), N/A(0)]
                0  0 00000000 00000000 00000000-00000000    
    
                Args: 00000000 00000000 00000000 00000000
     [N/A(0), N/A(0)]
                0  0 00000000 00000000 00000000-00000000    
    
                Args: 00000000 00000000 00000000 00000000
     [IRP_MJ_POWER(16), IRP_MN_SET_POWER(2)]
                0 e0 ffff800f17c6c360 00000000 fffff8072aed11b0-fffff8072aeade40 Success Error Cancel 
               \Driver\pci  ACPI!ACPIDeviceIrpDeviceFilterRequest
                Args: 00000000 00000001 00000004 00000000
     [IRP_MJ_POWER(16), IRP_MN_SET_POWER(2)]
                0 e1 ffff800f153364d0 00000000 fffff80729988b90-ffff800f17cf7500 Success Error Cancel pending
               \Driver\ACPI nt!PopRequestCompletion
                Args: 00000000 00000001 00000004 00000000
    >[N/A(0), N/A(0)]
                0  0 00000000 00000000 00000000-ffff800f17cf7500    
    
                Args: 00000000 00000000 00000000 00000000
    

    At this point I'm kinda leaning toward giving up for now because I apparently still don't have the requisite knowledge/skills to debug this (unfortunately).

     
    • DDD

      DDD - 2021-02-25

      I'm sorry to hear that you're giving up on it for now. I really enjoy reading your posts. I know my system has crashed a few times because of driver_power_state_failure, and I had been wondering if it was connected to Veracrypt.

       
1 2 > >> (Page 1 of 2)

Log in to post a comment.

Want the latest updates on software, tech news, and AI?
Get latest updates about software, tech news, and AI from SourceForge directly in your inbox once a month.