Attempting to create an image failed twice, with the message:
watchdog: BUG: soft lockup - CPU#4 stuck for 234s!
(the reported times varied) and no progress later.
Once I left it hanging and saw that the message repeated every minute or so with an increased time.
Using versions 3.1.1-27 and 3.1.3-16 completed OK.
Version 3.2.0-5 was significantly slower (tested multiple times): 36m vs 24m.
The results compared OK (past the initial image version).
Saving a 500GB SSD disk over USB 3.0 to a network (1G) drive (TBs of free space).
The laptop is an "Acer Aspire 5" with an "i5-1135G7 @ 2.40GHz".
Clonezilla was always started from Ventoy. Everything stayed the same between tests.
Something is not right there.
Did you check the sha256sum for the clonezilla-live-3.2.0-5-amd64.iso?
I'm using clonezilla-live-3.2.0-5-amd64.iso with Ventoy 1.1.00 on a MacBook Pro (2010) without problems.
Last edit: czfan 2025-02-02
Yes. My files are dated 26/Dec/2024, when I downloaded it.
Later I managed to complete an image using 3.2.0-5 and it was identical to the others (just significantly slower).

Is this issue reproducible in the testing Clonezilla live? E.g., 3.2.1-3 or 20250209-*?
https://clonezilla.org//downloads.php
Steven
I downloaded 3.2.1-3 and dd'ed it to a USB disk which I rebooted on the same laptop as before.
So no ventoy this time, if it matters.
I did not mention before but I always select no compression for this job.
[edit: and no splitting of the backup either]
Now that it completed I see that it is also slow, taking 36 minutes.
Last edit: Eyal Lebedinsky 2025-02-10
No idea, since the rate here is quite normal.
If newer Linux kernel does not help, you can keep using the older version.
Or if you are sure the issue is on the Linux kernel itself, please report to our upstream, i.e., Debian Linux.
Steven
I am not sure what you are suggesting here.
Are you saying that this is an issue with the kernel that is used in the package?
I now started cz, then broke into a console and dd'ed the SSD (500GB) to the same /home/partimage that cz set up. It completed at a rate of 116MB/s, which is the full capacity of the 1G NIC.
In this test the source and destination computers are the same as with the above cz runs. Same hardware and software, EXCEPT that here cz is replaced with dd.
This suggests that the SSD, NIC and kernel are probably not the source of the slowdown.
Which component do you think needs investigation?
Regards,
Last edit: Eyal Lebedinsky 2025-02-10
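The dd check described above can be scaled down to a quick, repeatable sketch. This is only an illustration: DEST defaults to /tmp here and would be pointed at the mounted network share for the real test, and the size is deliberately small (the actual test streamed the whole 500GB SSD).

```shell
# Rough sequential-write throughput check against the image target.
DEST="${DEST:-/tmp}"   # assumption: /tmp stands in for the NFS mount here
SIZE_MB=256            # small test size for illustration only

# oflag=direct bypasses the page cache so the number reflects the link/storage,
# but it is not supported on every filesystem, so fall back without it.
dd if=/dev/zero of="$DEST/ddtest.bin" bs=4M count=$((SIZE_MB / 4)) oflag=direct 2>&1 \
  || dd if=/dev/zero of="$DEST/ddtest.bin" bs=4M count=$((SIZE_MB / 4)) 2>&1
rm -f "$DEST/ddtest.bin"
```

dd prints the achieved rate on completion; comparing that number against the ~116MB/s NIC ceiling separates the transport from the imaging tool.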
In that case, I totally have no idea. As I mentioned, here the rate for cloning ext4 is quite normal.
Or maybe you can enter expert mode, and choose the option "-q1" to test that?
https://clonezilla.org//clonezilla-live/doc/01_Save_disk_image/images/ocs-09-advanced-param-q.png
Steven
I now ran two tests without rebooting cz. One with q1 then another with q2 (the default as used before).
The q1 test ran fast, at about 7GB/m (max expected using the 1G NIC).
The q2 test ran slow, at about 5.2GB/m.
I am attaching some log files which may give a clue.
Last edit: Eyal Lebedinsky 2025-02-12
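For context, converting the quoted rates to MB/s shows the -q1 run is essentially at the NIC's practical limit. A small sketch (assuming GB here means 10^9 bytes, matching how partclone reports sizes):

```shell
# Convert partclone's GB/min rate to MB/s for comparison with the NIC limit.
gbpm_to_mbps() {
  # MB/s = GB/min * 1000 / 60
  awk -v r="$1" 'BEGIN { printf "%.1f", r * 1000 / 60 }'
}

echo "-q1: $(gbpm_to_mbps 7.0) MB/s (vs the 116 MB/s dd achieved on the 1G NIC)"
echo "-q2: $(gbpm_to_mbps 5.2) MB/s"
```

So -q1 saturates the link, while the -q2 run leaves roughly 30MB/s of the link idle.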
Sure, if you compare the rate, of course partclone.dd will be faster than partclone.ext4, because partclone.dd does not have to parse the file system metadata and search for the used blocks on the file system.
However, if you compare the time for imaging your /dev/sdb1:
partclone.dd:
File system: raw
Device size: 500.0 GB = 976502785 Blocks
Space in use: 500.0 GB = 976502785 Blocks
Free Space: 0 Byte = 0 Blocks
Block size: 512 Byte
Total block 976502785
Total Time: 01:11:47, Ave. Rate: 6.96GB/min, 100.00% completed!
partclone.ext4:
File system: EXTFS
Device size: 500.0 GB = 122062848 Blocks
Space in use: 168.5 GB = 41126354 Blocks
Free Space: 331.5 GB = 80936494 Blocks
Block size: 4096 Byte
Total block 122062848
Total Time: 00:36:46, Ave. Rate: 4.58GB/min, 100.00% completed!
Partclone.dd took 72 mins, while partclone.ext4 took 37 mins in this case. Apparently the latter works better.
Of course, if the used blocks in the file system are about 100%, partclone.dd is the best one.
Did you compare the rates by using partclone.dd in two different versions of Clonezilla live?
Steven
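The tradeoff described here is just total time = data / rate: partclone.dd streams every block at a higher rate, but partclone.ext4 copies only used blocks, so its wall time wins when the filesystem is mostly empty. A quick check using the numbers from the partclone output quoted above:

```shell
# minutes = data_GB / rate_GB_per_min
minutes() {
  awk -v d="$1" -v r="$2" 'BEGIN { printf "%.0f", d / r }'
}

echo "partclone.dd  : $(minutes 500.0 6.96) min"   # whole device (500 GB)
echo "partclone.ext4: $(minutes 168.5 4.58) min"   # used blocks only (168.5 GB)
```

This reproduces the 72 min vs 37 min comparison in the post above.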
I decided to run -q2 tests with the old version (3.1.3-16) and the new (3.2.1-3). Saw the same speed difference.
However, this time I was sitting near the computer and noticed something.
Looking at the network switch, where the source and destination computers are plugged,
the new cz had many pauses of 5-10 seconds each in the blinking lights, whereas the old cz had very few pauses, each less than 1 second, and only after 80% was copied.
BTW in these tests the SSD was not used at all apart from attaching it to the cz run.
Maybe this is the source of the slowness???
Eyal
Last edit: Eyal Lebedinsky 2025-02-15
Somehow it looks like the issue is related to the network storage.
It might be a protocol issue, or something else.
Do you have a spare network switch you can swap?
Steven
BTW, have you tried Ubuntu-based Clonezilla live? e.g., 20250218-plucky?
https://clonezilla.org/downloads.php
Steven
FYI: now tested with 3.2.1-7 with similar results.
This time (repeating both 3.1 and 3.2 versions) I ran 'top' in a second text screen. It looked very different.
3.1.3-16 had partclone.ext4 permanently at the top with reasonable %CPU, with nfs threads below it.
3.2.1-7 showed partclone.ext4 regularly dropping off the list for many seconds at a time, with nfs threads above it.
Clearly there was an important change, looking at the running threads, the network activity and the total run time. Maybe a different/changed scheduler?
Is there a way to collect more information to pinpoint the source of the difference? Maybe some perf data of sorts to see where time is spent?
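One low-tech way to capture the "drops off top" behaviour numerically is to sample top in batch mode. sample_cpu below is a hypothetical helper sketched for this purpose (the process name and sample count are placeholders), not anything shipped with Clonezilla; for the real test it would be pointed at partclone.ext4 and the NFS threads while an image is being saved:

```shell
# Sample a process's top line once per second; prints a fallback line for
# the samples where the process does not appear (i.e. is off-CPU/absent).
sample_cpu() {
  local name="$1" samples="$2"
  for i in $(seq 1 "$samples"); do
    # -b: batch mode, -n 1: a single iteration of top's output
    top -b -n 1 | grep -m1 "$name" || echo "sample $i: $name not running"
    sleep 1
  done
}

# Example: watch partclone.ext4 for a few seconds during a save.
sample_cpu "partclone.ext4" 3
```

Counting the fallback lines over a run gives a rough measure of how often the copier stalls, which could then be correlated with the network pauses.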
Thanks for your feedback.
We checked that the rate shown in Partclone was changed. The older version has some issues with the rate.
So it's better to compare the real time it takes.
Could you please check the real time it takes in these two different versions?
Is there really a big difference between them?
Steven
By "real time" you mean wall clock time, right? I already mentioned that the old version took 24m to complete and the new one 36m.
later: if you want me to measure the time on my wristwatch - the old one really took 24m, the new one 36m.
HTH
If you need something else then please tell me how.
version 3.1.3-16
version 3.2.1-7
Eyal
Last edit: Eyal Lebedinsky 2025-02-20
OK, thanks for your report.
Is it possible you could change:
"Saving a 500GB SSD disk over USB 3.0 to a network (1G) drive (TBs of free space)."
to
"Saving a 500GB SSD disk over USB type C to another SSD drive"?
Let's make the environment more isolated if you can.
Steven
Another thing you can try there is:
1. Boot Clonezilla live 3.2.1-7 amd64
2. In the command line, run: sudo -i
3. wget https://free.nchc.org.tw/drbl-core/pool/drbl/unstable/partclone/partclone_0.3.32-drbl-1_amd64.deb
4. dpkg -i partclone_0.3.32-drbl-1_amd64.deb
Then run the following command 2 or 3 times:
partclone.ext4 -z 10485760 -N -L /var/log/clonezilla//partclone.log -c -s /dev/sdb1 --output /home/partimag/u2204-3.2.1-7-2025-02-20-10-img/sdb1.ext4-ptcl-img.uncomp
(Replace the image dir, otherwise the existing one will be overwritten)
By doing so, it would be easier to identify whether the issue is in Linux/hardware support, or in Partclone itself.
Steven
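A small wrapper makes the "run it 2 or 3 times" step easier to record. run_trials is a hypothetical helper, and the placeholder command at the bottom would be replaced with the exact partclone.ext4 invocation above (with a fresh image dir per run, as noted):

```shell
# Time a command over N trials, printing the wall time of each run.
run_trials() {
  local n="$1"; shift
  for i in $(seq 1 "$n"); do
    local start end
    start=$(date +%s)
    "$@" > /dev/null 2>&1
    end=$(date +%s)
    echo "trial $i: $((end - start))s"
  done
}

# Placeholder command for illustration; substitute the real partclone run.
run_trials 2 sleep 1
```

Comparing the per-trial times from Partclone 0.3.32 vs 0.3.33 sidesteps any doubt about the rate shown in Partclone's own progress output.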
Results of first test "Saving a 500GB SSD disk over USB type C to another SSD drive"
Short story: took the same time.
Tomorrow I will attend to the second request.
Eyal
Last edit: Eyal Lebedinsky 2025-02-22
Planning for the second test (with a downloaded partclone):
Should I save to a network location or to the local SSD?
Eyal
Did it three times, saving to the local SSD.
Used the attached script for the tests.
Eyal
Last edit: Eyal Lebedinsky 2025-02-23
For Clonezilla live 3.1.3-16 running Partclone 0.3.32:
For Clonezilla live 3.2.1-7 running Partclone 0.3.33:
The time spent is almost the same in these two cases...
No idea why you had low results in the previous tests...
Steven
Yes, writing to disk is good both ways. However, there is an issue when using the network. Old version is faster, new version is slower, consistently.
I mentioned that I see long pauses (switch light stops showing activity) with the new version. Something changed for sure.
Is there a way to see a profile of the partclone run to see where time is spent?
Running 20250218-plucky over a network using a different switch.
- Note: right off the bat I see long network pauses in this run...
Total Time: 00:37:24, Ave. Rate: 4.65GB/min, 100.00% completed!
Running 20250218-plucky using my original switch.
- Note: right off the bat I see long network pauses in this run...
Total Time: 00:39:03, Ave. Rate: 4.45GB/min, 100.00% completed!
Attached syslog after these two runs.
later: ran using clonezilla-live-20240715-noble-amd64, which I think is from the same date as the 3.1.3 version. Also fast.
Total Time: 00:24:41, Ave. Rate: 7.05GB/min, 100.00% completed!
Attached log files of above three runs.
Eyal
Last edit: Eyal Lebedinsky 2025-02-27
Any progress? What is the plan now?
Note that I also tested the latest 3.2.1-9 with a similar slow result (37m).
Eyal
Last edit: Eyal Lebedinsky 2025-03-13
Since we are not able to reproduce the issue here, we are not actually able to fix this.
If it's Linux kernel related issue, maybe you can try the newer Clonezilla live in the future:
https://clonezilla.org//downloads.php
since it will come with newer kernel.
Or maybe you can try a different type of network switch? Try to see if it is a switch/kernel or network protocol related issue.
Steven