
Client-side hang during BitTorrent restore

Siet
2021-08-11
2021-11-02
  • Siet

    Siet - 2021-08-11

    Hi,

    I'm having a new problem with restoring an image to multiple PCs in BitTorrent mode.
    My general setup is described here.

    While restoring, clients will sometimes hang while running ezio with the following symptoms:

    • The uploading seconds counter will stop, but the cursor can still blink
    • A newline gets printed when using CTRL+D or CTRL+C, no other response
    • I can switch to another VT (CTRL+ALT+Fx), but the login prompt on the new VT doesn't appear
    • If I open a different VT before cloning, no commands will work on that VT once the uploading seconds counter stops. Even an ls hangs indefinitely.
    • In rare cases, a command on the other VT will execute after a long time, for example ls sometimes returns something after 20 minutes
    • Due to this, it's hard to look at log files
    • If I start dmesg -w beforehand, no new messages appear in dmesg when the client hangs

    So all in all: Graphics output works, but every command, hotkey or executable hangs, producing no output.

    The cloning will fail because the steps that happen after ezio don't get executed.

    This looks like a Linux problem rather than an ezio problem, though ezio may be what triggers it.
    It doesn't happen when cloning with multicast.
    After restarting the cloning process, it sometimes works and sometimes doesn't.
    It doesn't seem to be tied to a specific machine or hardware setup.

    The behavior seems similar to how Linux behaves when it's stalled by I/O, but there are no "task ... blocked for more than 120 seconds" messages in dmesg. No oom messages either.
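
    One possible way to check the I/O-stall theory, as a minimal sketch: tasks blocked on I/O normally show up in uninterruptible sleep (state D). The function names below are made up for illustration, and the sketch assumes a procps-style ps is available in the live environment.

```shell
#!/bin/sh
# Sketch: spot tasks stuck in uninterruptible sleep (state D), the usual
# signature of an I/O stall. Function names are hypothetical.

# Filter "STATE PID COMMAND" rows on stdin, keeping only D-state tasks.
filter_d_state() {
    awk '$1 ~ /^D/'
}

# Snapshot the process table and report D-state tasks.
# Assumes procps ps; the "=" suffixes suppress the column headers.
list_d_state() {
    ps -eo stat=,pid=,comm= | filter_d_state
}
```

    Since commands stop responding once the hang begins, this would have to be started on a second VT before the restore, for example as "while :; do date +%T; list_d_state; sleep 5; done", so that the last sample stays on screen.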

    This is a new-ish problem. Clonezilla versions from 2020 don't cause it.
    The version I was using was 20210518-hirsute.

    Does someone have an idea about further steps to debug this?

     

    Last edit: Siet 2021-08-11
  • Steven Shiau

    Steven Shiau - 2021-08-11

    Please give the latest Clonezilla live a try:
    https://clonezilla.org/downloads.php
    Or even the testing one.

    Steven

     
  • Siet

    Siet - 2021-08-28

    Still happens with 20210817-hirsute.
    I didn't try with 20210817-impish, because the kernel inside it is actually older than 20210817-hirsute.

    I've also tested a bit more to narrow it down. Going by this list of stable releases, the problem seems to have been introduced somewhere between 20200703-focal (works) and 20210127-groovy, but I'm not sure yet.

     

    Last edit: Siet 2021-08-28
  • Steven Shiau

    Steven Shiau - 2021-09-01

    "I didn't try with 20210817-impish, because the kernel inside it is actually older than 20210817-hirsute." -> What did you mean?

    Steven

     
  • Siet

    Siet - 2021-09-01

    If you download 20210817-hirsute from here: https://sourceforge.net/projects/clonezilla/files/clonezilla_live_alternative/20210817-hirsute
    and 20210817-impish from here: https://sourceforge.net/projects/clonezilla/files/clonezilla_live_alternative_testing/20210817-impish
    then extract both and run the Linux command "file" on both kernels, for example:

    file clonezilla-live-20210817-*/live/vmlinuz
    clonezilla-live-20210817-hirsute-amd64/live/vmlinuz: Linux kernel x86 boot executable bzImage, version 5.11.0-25-generic (buildd@lgw01-amd64-044) #27-Ubuntu SMP Fri Jul 9 23:06:29 UTC 2021, RO-rootFS, swap_dev 0xE, Normal VGA
    clonezilla-live-20210817-impish-amd64/live/vmlinuz:  Linux kernel x86 boot executable bzImage, version 5.11.0-20-generic (buildd@lcy01-amd64-029) #21+21.10.1-Ubuntu SMP Wed Jun 9 15:08:14 UTC 2021, RO-rootFS, swap_dev 0xE, Normal VGA
    

    you'll see that the kernel in 20210817-hirsute is 5.11.0-25 built on July 9 and the one in 20210817-impish is 5.11.0-20 built on June 9.
    So the kernel in 20210817-impish is actually a month older, even though it should be the newer one: it is the testing version, and Ubuntu Impish should be on the 5.13 kernel.
    I didn't try the older kernel because this freeze can only really be a kernel issue or an ezio issue. The kernel in 20210817-impish was older and ezio is the same in both versions, so it made no sense to try it.

    20210829-impish has the newer kernel:

    file clonezilla-live-20210829-impish-amd64/live/vmlinuz 
    clonezilla-live-20210829-impish-amd64/live/vmlinuz: Linux kernel x86 boot executable bzImage, version 5.13.0-14-generic (buildd@lcy01-amd64-002) #14-Ubuntu SMP Mon Aug 2 12:43:35 UTC 2021, RO-rootFS, swap_dev 0x9, Normal VGA
    

    But this wasn't out yet when I wrote my last post.
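
    The kernel comparison above could be scripted roughly like this. It is only a sketch: the function name is hypothetical, and it assumes file(1) describes the kernel in the same "bzImage, version ..." form as the output quoted above.

```shell
#!/bin/sh
# Sketch: reduce file(1) output for bzImage kernels to "<path> <version>".
# Function name is hypothetical; lines that do not look like a bzImage
# description are silently dropped.
kernel_versions() {
    sed -n 's/^\([^:]*\):.*bzImage, version \([^ ]*\) .*/\1 \2/p'
}
```

    Usage would be along the lines of "file clonezilla-live-*/live/vmlinuz | kernel_versions", which prints one path and kernel version per extracted release.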

     

    Last edit: Siet 2021-09-01
  • Steven Shiau

    Steven Shiau - 2021-09-01

    I believe the version is the same, and the only difference is the compile time.
    Anyhow, please give Clonezilla live 20210829-impish a try. It comes with Linux kernel 5.13.0-14.

    Steven

     
  • Siet

    Siet - 2021-11-01

    It's still happening with 20211027-impish.

    The problem has now persisted across multiple kernel versions and never happens with multicast, so I'm beginning to think ezio is at fault. The working version, 20200703-focal, was one of the last to use the old ezio-static.

    /Edit: While testing I noticed that the ezio-static from 20200703-focal is also faster than the current one.

     

    Last edit: Siet 2021-11-01
    • Date Huang

      Date Huang - 2021-11-02

      Hi Siet

      I want to check something with you.
      Is ezio executed at all in the new version?
      It seems ezio gets stuck in the middle of cloning.
      Could you provide the hardware specs of the machines that get stuck?

      For now, I suspect a RAM shortage on your machines, because EZIO uses a lot of RAM.
      If the system runs out of RAM, all operations will get stuck.

      Let me work on this with Steven and give you a trial ISO to test again.
      Thanks a lot for your patience.

      I will also add some logging to help debug it.

      Regards,
      Date

       
      • Date Huang

        Date Huang - 2021-11-02

        By the way,
        @siet

        Could you help us answer the questions below?
        It will help us figure out the problem.

        1. Did the server get stuck as well?
        2. How many clients got stuck while cloning? How many clients were there in total?
        3. Is it always the same machines that get stuck, or random ones?
        4. Please list the hardware specs of the stuck machines.
        5. While a machine is stuck, is it possible to run free -h to show the RAM usage of the system? (If the system no longer responds, just ignore this one.)
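
        Because commands reportedly stop responding once a client hangs, one option for question 5 is to start sampling memory before the restore begins and leave the output visible on a second VT. The following is only a sketch with hypothetical function names; it reads /proc/meminfo directly rather than calling free.

```shell
#!/bin/sh
# Sketch: periodic memory samples that stay on screen if the system hangs.
# Function names are hypothetical.

# Print MemTotal, MemFree and MemAvailable from a meminfo-format file.
# $1: file to read (defaults to /proc/meminfo)
mem_sample() {
    awk '/^(MemTotal|MemFree|MemAvailable):/ {
             sub(":", "", $1)               # "MemFree:" -> "MemFree"
             out = out sep $1 "=" $2 "kB"
             sep = " "
         }
         END { print out }' "${1:-/proc/meminfo}"
}

# Print a timestamped sample every $1 seconds (default 5) until interrupted.
log_memory() {
    while :; do
        printf '%s %s\n' "$(date +%T)" "$(mem_sample)"
        sleep "${1:-5}"
    done
}
```

        Running "log_memory 5" on another VT before starting the restore should leave the last sample on screen when the counter stops, which would show whether available memory was close to zero at that point.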

        Thanks a lot

         
