> so i will try unstable next
ok, now trying unstable. still seeing all the same dhcp problems, but still able to collect mac addresses and PXE boot to drbl. but then i get the same error:
> when i select either "clonezilla: save disk as
> image" or "ubuntu 11.10 drbl mode" on the client i want to clone,
> it dies with the following:
>
> ...
> [ 0.157167] ---[ end trace 5a5d197966b56a2f ]---
> [ 0.157167] Fixing recursive fault but reboot is needed!
> [ 0.160009] PCI: Using configuration type 1 for base access
> [ 0.160009] PCI: Using configuration type 1 for extended access
>
> i tried multiple reboots but get the same error (sometimes with, sometimes
> without the PCI lines).
in my bios, under 'power', support for both acpi 2.0 and acpi apic is 'enabled' by default. i thought i remembered that at least one of these had to be disabled for ghost solution suite to work (it turned out that was actually ahci, but i didn't realize that until i looked it up later [1]), so i tried disabling them. disabling acpi 2.0 doesn't help, but disabling acpi apic does (even with acpi 2.0 left enabled)! this is with*out* setting 'acpi=off' and/or 'noapic' after tabbing to the boot options in the drbl pxe startup screen; those options didn't help when i tried them (forum message where steven suggested them: http://sourceforge.net/projects/drbl/forums/forum/394008/topic/3915708).
that allowed me to collect the image from the template machine! it took about an hour for the ~50GB of used space on that machine's drive.
but then i tried to multicast it to 4 clients. everything went great for a while -- they all connected and started getting the image, and it said it should take an hour. i came back the next day, and it hadn't finished. two clients still looked healthy, but said they were 96% done after 17 hours and were down to 50MB/min. they were stuck at that point: the remaining time was increasing, the transfer rate was decreasing, and the percentage complete was not changing. the other two looked very unhealthy; their screens were all messed up (partially the blue download screen, partially error text output). one had a bunch of repeats of:
[ 3455.numbers]: ata3.00: failed command: WRITE FPDMA QUEUED
[ 3455.numbers]: ata3.00: cmd 61/00:hexnumbers/40 tag 30 ncq 524288 out
[ 3455.numbers]: res 40/00:hexnumbers/40 Emask 0x10 (ATA bus error)
[ 3455.numbers]: ata3.00: status { DRDY }
the other one seemed to die 27 minutes in: it said 27 minutes elapsed and 45 to go, rate 860 MB/min, 38% done. its error output was:
general protection fault: 0000 [#1] SMP
CPU0
lots of hexy/stack tracy stuff without anything illuminating, lots of references to RSP, RIP, DR3, DR0, CR2, etc
when i reboot them, they all say 'image loading failure. reload image!'
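a rough sanity check on the numbers above. this is only a sketch under two assumptions: that the rate shown on the restore screen is a running average (data restored so far divided by elapsed time), and that the image is the 62.6 GB the server stats report for /dev/sda2 further down. the function name is made up for illustration.

```python
def average_rate_mb_per_min(image_gb, fraction_done, elapsed_min):
    """Running-average rate in MB/min, ASSUMING rate = data restored / elapsed time."""
    restored_mb = image_gb * 1000 * fraction_done  # assumes 1 GB = 1000 MB
    return restored_mb / elapsed_min


# client that died at 38% after 27 minutes (screen said 860 MB/min)
died_early = average_rate_mb_per_min(62.6, 0.38, 27)

# clients stuck at 96% after 17 hours (screen said about 50 MB/min)
stalled = average_rate_mb_per_min(62.6, 0.96, 17 * 60)

print(f"died at 38%:   {died_early:.0f} MB/min")
print(f"stuck at 96%:  {stalled:.0f} MB/min")
```

under those assumptions this gives about 881 MB/min and about 59 MB/min, both in the neighborhood of what the screens showed. that would be consistent with the two "healthy" clients having stopped receiving data at 96% and the displayed rate simply decaying as time passed, rather than the transfer still crawling along.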
so then i tried multicasting just to the 2 that looked healthy. one of them actually worked! it booted windows, had all my data and applications, could browse the internet, etc. yay!
the other died at 96% again:
read CRC error: no such file or directory, please check your image file.
gzip: stdin: unexpected end of file
Partclone fail, please check /var/log/partclone.log !
>>> Time elapsed: 3392.29 sec (~56 mins)
>>> NOTE: The elapsed time...
Finished restoring image 2012-...-img to /dev/sda2.
***********************************
Informing the kernel the file system has been changed .......... done!
Preparing the next ... 9 8 7 6 5 4 3 2 1
****************************
Failed to restore partition image file /host/images/2012-...-img/sda2* to /dev/sda2! Maybe this image is corrupt or there is no /host/images/2012-...-img/sda2*! If you are restoring the image of partition to different partition, check the FAQ on Clonezilla website for how to make it. Press "Enter" to continue....
the server has:
Client 192.168.1.1 (54:04:a6:ef:67:26) finished cloning. Stats: Multicast restored 2012-02-04-19-img, /dev/sda1, success, 26.2 MB, .134 mins; /dev/sda2, success, 62.6 GB, 64.356 mins;
Client 192.168.1.3 (54:04:a6:ef:66:c9) finished cloning. Stats: Multicast restored 2012-02-04-19-img, /dev/sda1, success, 26.2 MB, 1.347 mins; /dev/sda2, ***FAIL***, 62.6 GB, 56.538 mins;
the server has no /var/log/partclone.log, and /var/log/partimage/ is empty. when i hit "Enter", the broken machine reboots and the default drbl option of booting to local OS if available actually tries to start windows, which realizes it has disk errors and attempts to repair them. eventually it fails, saying "Boot manager failed to find OS loader."
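since the client complains about a CRC error and gzip reports an unexpected end of file, one way to separate "the image on the server is bad" from "something went wrong on the way to the client" would be to test the compressed image on the server itself. below is a minimal sketch, assuming the sda2 image is stored as one gzip stream, possibly split into several files that are meant to be concatenated in sorted order. the function name and the path in the comment are placeholders, not anything drbl provides.

```python
import zlib


def gzip_stream_ok(paths, chunk_size=64 * 1024):
    """Return True if the files in `paths`, concatenated in order, form a
    complete gzip stream whose CRC trailers all check out."""
    decomp = zlib.decompressobj(wbits=31)  # 31 = expect gzip header and trailer
    saw_data = False
    try:
        for path in paths:
            with open(path, "rb") as fh:
                for chunk in iter(lambda: fh.read(chunk_size), b""):
                    saw_data = True
                    while chunk:
                        if decomp.eof:
                            # previous gzip member ended; start a new one
                            decomp = zlib.decompressobj(wbits=31)
                        decomp.decompress(chunk)  # output is discarded
                        # bytes after the end of a member belong to the next one
                        chunk = decomp.unused_data if decomp.eof else b""
    except zlib.error:
        return False  # corrupt data or CRC/length mismatch
    # a truncated stream raises no error but never reaches end-of-stream
    return saw_data and decomp.eof


# example (placeholder path; substitute the real image directory on the server):
#   import glob
#   print(gzip_stream_ok(sorted(glob.glob("/path/to/image-dir/sda2*"))))
```

if this returns True on the server, the stored image is at least a complete, self-consistent gzip stream, which would point toward the network or the client rather than the image. it says nothing about whether the partclone data inside is valid.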
so now i am trying unicast with one client. it gets 7% into the clone (2 mins, 25 to go) and fails. it was going at 2.3 GB/min, though, almost twice as fast as multicast -- is this expected? anyway, it says:
CRC error again at -202022912...
/opt/drbl/sbin/ocs-functions: line 4333: 2877 Exit 141
{ for img in $target_d/$img_file_prefix;
do
cat $img;
done }
2878 Broken pipe | $unzip_stdin_cmd
2879 Segmentation fault | LC_ALL=C partclone.${fs_} $PARTCLONE_RESTORE_OPT -L $partclone_img_info_tmp -s - -r -o $part
>>> time elapsed ...
and then the same message as above about "failed to restore... maybe this image is corrupt..."
this may have happened about the time the box where windows was trying to repair itself rebooted, which PXE'd into drbl and by default would have also selected the unicast restore (but i changed it to boot to local OS). does starting another drbl session during an ongoing unicast kill everything?
i tried to unicast to the box where windows tried and failed to repair itself, and that worked.
Client 192.168.1.3 (54:04:a6:ef:66:c9) finished cloning. Stats: Unicast restored 2012-02-04-19-img, /dev/sda1, success, 26.2 MB, .084 mins; /dev/sda2, success, 62.6 GB, 36.158 mins;
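to put numbers on the multicast-versus-unicast speed question, here is a small sketch that pulls the /dev/sda2 size and time out of the server's stats lines and works out GB per minute. the regular expression is only fitted to the three lines pasted in this message, so treat the format as an assumption; the function name is made up.

```python
import re


def partition_rate(line, partition="/dev/sda2"):
    """Return (ip, mode, status, GB per minute) for one stats line, or None
    if the line does not look like the ones pasted in this thread."""
    pattern = (
        r"Client (\S+) .*?Stats: (\w+) restored .*?"
        + re.escape(partition)
        + r", ([^,]+), ([\d.]+) GB, ([\d.]+) mins"
    )
    m = re.search(pattern, line)
    if m is None:
        return None
    ip, mode, status, gb, mins = m.groups()
    return ip, mode, status, float(gb) / float(mins)


lines = [
    "Client 192.168.1.1 (54:04:a6:ef:67:26) finished cloning. Stats: Multicast restored 2012-02-04-19-img, /dev/sda1, success, 26.2 MB, .134 mins; /dev/sda2, success, 62.6 GB, 64.356 mins;",
    "Client 192.168.1.3 (54:04:a6:ef:66:c9) finished cloning. Stats: Multicast restored 2012-02-04-19-img, /dev/sda1, success, 26.2 MB, 1.347 mins; /dev/sda2, ***FAIL***, 62.6 GB, 56.538 mins;",
    "Client 192.168.1.3 (54:04:a6:ef:66:c9) finished cloning. Stats: Unicast restored 2012-02-04-19-img, /dev/sda1, success, 26.2 MB, .084 mins; /dev/sda2, success, 62.6 GB, 36.158 mins;",
]

for line in lines:
    ip, mode, status, rate = partition_rate(line)
    if status == "success":  # a failed run's time only covers up to the failure
        print(f"{ip} {mode}: {rate:.2f} GB/min")
```

on these figures the successful multicast restore averaged about 0.97 GB/min and the successful unicast restore about 1.73 GB/min, so unicast to a single client was roughly 1.8 times as fast here.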
so now i'm trying a multicast restore using clonezilla live from drbl on the two machines that haven't worked yet. it's 5% in and working at the 2GB/min rate -- great! but then one of the clients failed, and the other one stopped progress at the same moment. the one that died just has a bunch of hexy stack-trace output that is uninformative. EIP, CR2, etc.
now trying a unicast restore from clonezilla live in drbl. i start one, let it go for a while, then start the other one -- it doesn't seem to interfere! they go great for a while, but then both appear to die at the same moment. one (A) has the WRITE FPDMA QUEUED error from above; the other (B) has just hexy stack trace stuff (but without the EIP, CR2, etc type codes). i reboot (B) to make it try again, but machine (A) actually seems to keep going: the blue restore screen is gone, but the numbers still appear overlaid on the error output and are counting down as if it is still downloading the image. it actually does get to the end at the expected time and reports that it has successfully cloned. the server says:
Client 192.168.1.6 (14:da:e9:71:d5:0e) finished cloning. Stats: Unicast restored 2012-02-04-19-img, /dev/sda1, success, 26.2 MB, .088 mins; /dev/sda2, success, 62.6 GB, 24.216 mins;
at this moment, (B)'s attempted retry fails, and it dumps out a bunch of uninformative hexy stack-trace output again. so i think you have some kind of problem in drbl where different clients communicating with the server can interfere with one another and cause jobs to fail.
meanwhile, when (A) tries to boot into its new local OS, windows fails to start. startup repair automatically runs, and finds "Unspecified changes to the system configuration might have caused the problem." this machine is the only one that has a slightly different configuration -- same motherboard, but a faster cpu, a bigger drive, and an optical drive. would running sysprep before taking the image address this?
the other machine still dies every time i try to unicast to it, sometimes early, sometimes late into the image download. the output is always that hexy uninformative stack trace-like stuff. so what is going wrong with it?
one thing i noticed is that while taking the image, the server machine becomes very unresponsive, and trying to use it for other tasks seems to really interfere with the speed of the transfer. however, this doesn't seem to be true when restoring the image. is this expected? why is it the case?
steven, can you help out at all with what is going wrong? also, why was acpi apic a problem -- is it documented anywhere that this needs to be turned off in the bios? and why did my virtualbox debian attempts result in dhcpd/eth0:1 breaking?
les, thanks for your help, but these problems don't look related to using a single nic, aliasing, or enabling nat -- don't you agree?
thanks,
-erik
[1] http://www.symantec.com/business/support/index?page=content&id=TECH109551&locale=en_US
details:
during /opt/drbl/sbin/drblsrv -i, after i accept the default N for:
Do you want to upgrade the operating system? [y/N]
i get:
*****************************************************.
2nd, installing the necessary files for DRBL...
*****************************************************.
...
Setting up isc-dhcp-server (4.1.1-P1-17ubuntu10.1) ...
Generating /etc/default/isc-dhcp-server...
* Starting ISC DHCP server dhcpd * check syslog for diagnostics.
[fail]
invoke-rc.d: initscript isc-dhcp-server, action "start" failed.
/var/log/syslog has:
Feb 4 14:46:50 ubuntu dhcpd: Internet Systems Consortium DHCP Server 4.1.1-P1
Feb 4 14:46:50 ubuntu dhcpd: Copyright 2004-2010 Internet Systems Consortium.
Feb 4 14:46:50 ubuntu dhcpd: All rights reserved.
Feb 4 14:46:50 ubuntu dhcpd: For info, please visit https://www.isc.org/software/dhcp/
Feb 4 14:46:50 ubuntu dhcpd: Internet Systems Consortium DHCP Server 4.1.1-P1
Feb 4 14:46:50 ubuntu dhcpd: Copyright 2004-2010 Internet Systems Consortium.
Feb 4 14:46:50 ubuntu dhcpd: All rights reserved.
Feb 4 14:46:50 ubuntu dhcpd: For info, please visit https://www.isc.org/software/dhcp/
Feb 4 14:46:50 ubuntu dhcpd: Wrote 0 leases to leases file.
Feb 4 14:46:50 ubuntu dhcpd:
Feb 4 14:46:50 ubuntu dhcpd: No subnet declaration for eth1 (no IPv4 addresses).
Feb 4 14:46:50 ubuntu dhcpd: ** Ignoring requests on eth1. If this is not what
Feb 4 14:46:50 ubuntu dhcpd: you want, please write a subnet declaration
Feb 4 14:46:50 ubuntu dhcpd: in your dhcpd.conf file for the network segment
Feb 4 14:46:50 ubuntu dhcpd: to which interface eth1 is attached. **
Feb 4 14:46:50 ubuntu dhcpd:
Feb 4 14:46:50 ubuntu dhcpd:
Feb 4 14:46:50 ubuntu dhcpd: Not configured to listen on any interfaces!
Feb 4 14:46:52 ubuntu kernel: [92208.410597] type=1400 audit(1328395612.820:23): apparmor="STATUS" operation="profile_replace" name="/usr/sbin/dhcpd" pid=2214 comm="apparmor_parser"
but /opt/drbl/sbin/drblpush -i still sees eth0 and eth0:1 correctly:
eth0: IP address 128.223.140.140, netmask 255.255.254.0
eth0:1: IP address 192.168.1.7, netmask 255.255.255.0
Configured ethernet card(s) found in your system: eth0 eth0:1
------------------------------------------------------
The ethernet port for Internet access is: eth0
The ethernet port(s) for DRBL environment: eth0:1
so why is dhcpd always talking about eth1?
and when i go to collect my mac addresses, again the dhcpd fails to stop:
Stopping isc-dhcp-server ...
* Stopping ISC DHCP server dhcpd [fail]
Stopping tftpd-hpa ...
Rather than invoking init scripts through /etc/init.d, use the service(8)
utility, e.g. service tftpd-hpa stop
Since the script you are attempting to invoke has been converted to an
Upstart job, you may also use the stop(8) utility, e.g. stop tftpd-hpa
*****************************************************.
anyway, i can still collect all my mac addresses, still including the
mysterious 00:D0:... one. it's definitely not on my side of the WAN; could
it be coming in from the WAN?
this is all still true:
> because of this note in the documentation:
> NOTE! This alias IP address will cause some problems if you do not provide
> static IP address to DRBL client via its MAC address. In this example, the DRBL
> server will lease IP address to any machine connected to eth0 if no MAC address
> is set in the DHCP service. Hence you'd better not to use alias IP if you do
> not know exactly what you are doing! Two or more NICs are recommended!
>
> i am saying i do want drbl's dhcp to give same ip based on mac address at
> eth0:1
> it agrees and gives me this:
>
> NIC NIC IP Clients
> +-----------------------------+
> | DRBL SERVER |
> | |
> | +-- [eth0] 128.223.140.140 +- to WAN
> | |
> | +-- [eth0:1] 192.168.1.7 +- to clients group 0:1 [ 6 clients, their IP
> | | from 192.168.1.1 - 192.168.1.6]
> +-----------------------------+
i use drbl ssi/clonezilla box mode.
it works whether i set Y or N here:
> Do you want to let DRBL server as a NAT server? If not, your DRBL client will
> NOT be able to access Internat.
> [Y/n]
this still happens:
> then, when it actually tries to set up the server, it again has trouble stopping
> dhcp, but it starts again ok?
>
> Now start the service: portmap isc-dhcp-server nis nfs-kernel-server tftpd-hpa
> drbl-clients-nat
> portmap stop/waiting
> portmap start/running, process 27221
> * Stopping ISC DHCP server
> dhcpd
> [fail]
> * Starting ISC DHCP server
> dhcpd
> [ OK ]