#4622 running "mknb ppc64" or installing pcm causes ppc64le boots to fail.

2.10
closed
None
dhcp
7
2015-07-06
2015-03-21
No

running "mknb ppc64" or installing pcm causes ppc64le boots to fail.

After installing pcm on our system we found that all our stateful ubuntu14.10 nodes failed to boot.

We see this in the console log file:

[Sat Mar 21 10:55:02 2015][    0.000000] Using PowerNV machine description
[Sat Mar 21 10:55:02 2015][    0.000000] Page sizes from device-tree:
....
The system is going down NOW!
[Sat Mar 21 11:10:58 2015]
Sent SIGTERM to all processes
[Sat Mar 21 11:10:59 2015]
Sent SIGKILL to all processes
[Sat Mar 21 11:11:06 2015] -> smp_release_cpus()
[Sat Mar 21 11:11:07 2015]spinning_secondaries = 191
[Sat Mar 21 11:11:07 2015] <- smp_release_cpus()
[Sat Mar 21 11:11:07 2015] <- setup_system()
[Sat Mar 21 11:11:11 2015]Done
[Sat Mar 21 11:11:47 2015]ERROR: BOOTIF missing, can't detect boot nic
[Sat Mar 21 11:11:48 2015]Could not load host key: /etc/ssh/ssh_host_ecdsa_key
[Sat Mar 21 11:11:48 2015]Generating private key...Done
[Sat Mar 21 11:11:48 2015]Setting IP via DHCP...
[Sat Mar 21 11:11:48 2015][   39.686852] bnx2x: [bnx2x_dcbnl_set_dcbx:2350(enP3p9s0f0)]Requested DCBX mode 5 is beyond advertised capabilities
[Sat Mar 21 11:11:48 2015][   39.717841] bnx2x: [bnx2x_dcbnl_set_dcbx:2350(enP3p9s0f1)]Requested DCBX mode 5 is beyond advertised capabilities
[Sat Mar 21 11:11:48 2015][   39.738944] bnx2x: [bnx2x_dcbnl_set_dcbx:2350(enP3p10s0f0)]Requested DCBX mode 5 is beyond advertised capabilities
[Sat Mar 21 11:11:50 2015]Acquiring network addresses..Acquired IPv4 address on enP3p9s0f0: 10.0.0.19/16
[Sat Mar 21 11:11:51 2015]Could not open device at /dev/ipmi0 or /dev/ipmi/0 or /dev/ipmidev/0: No such file or directory
[Sat Mar 21 11:11:51 2015]Could not open device at /dev/ipmi0 or /dev/ipmi/0 or /dev/ipmidev/0: No such file or directory
[Sat Mar 21 11:11:51 2015]Could not set IPMB address: Bad file descriptor
[Sat Mar 21 11:11:51 2015]Could not open device at /dev/ipmi0 or /dev/ipmi/0 or /dev/ipmidev/0: No such file or directory
[Sat Mar 21 11:11:51 2015]Get Device ID command failed
[Sat Mar 21 11:11:54 2015]rsyslogd: error: option -c is no longer supported - ignored
[Sat Mar 21 11:11:56 2015]                                                         
[Sat Mar 21 11:11:56 2015]                                                         
[Sat Mar 21 11:11:57 2015]boot
[Sat Mar 21 11:11:57 2015]Rebooting.
[Sat Mar 21 11:11:59 2015][   50.582279] reboot: Restarting system
...

And it keeps repeating this loop.

We traced this problem to the following settings:

In /etc/dhcp/dhcpd.conf:

[root@dccxcat /]# grep -C 5 conf-file /etc/dhcp/dhcpd.conf
#xCAT generated dhcp configuration
    } else if option client-architecture = 00:09 { #x86_64 uefi alternative id
       filename "xcat/xnba.efi";
    } else if option client-architecture = 00:02 { #ia64
       filename "elilo.efi";
    } else if option client-architecture = 00:0e { #OPAL-v3
       option conf-file = "http://10.0.0.2/tftpboot/pxelinux.cfg/p/10.0.0.0_16";
    } else if substring(filename,0,1) = null { #otherwise, provide yaboot if the client isn't specific
       filename "/yaboot";
    }
    range dynamic-bootp 10.0.0.201 10.0.0.254;
  } # 10.0.0.0/255.255.0.0 subnet_end

[root@dccxcat /]# cat /tftpboot/pxelinux.cfg/p/10.0.0.0_16

default xCAT
   label xCAT
   kernel http://10.0.0.2:80//tftpboot/xcat/genesis.kernel.ppc64
   initrd http://10.0.0.2:80//tftpboot/xcat/genesis.fs.ppc64.gz
   append "quiet xcatd=10.0.0.2:3001 "
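
The per-subnet file name above appears to be the network address joined to the prefix length with an underscore. This is a minimal sketch of that mapping, inferred only from the single example in this report (subnet 10.0.0.0/255.255.0.0 and file 10.0.0.0_16). The function names are hypothetical and are not xCAT APIs.

```python
import ipaddress


def subnet_config_name(network: str, netmask: str) -> str:
    """Return '<network>_<prefixlen>', e.g. '10.0.0.0_16' (inferred naming)."""
    net = ipaddress.ip_network(f"{network}/{netmask}", strict=True)
    return f"{net.network_address}_{net.prefixlen}"


def subnet_config_url(server: str, network: str, netmask: str) -> str:
    """Build the URL that the OPAL-v3 branch of dhcpd.conf hands out as conf-file."""
    name = subnet_config_name(network, netmask)
    return f"http://{server}/tftpboot/pxelinux.cfg/p/{name}"


print(subnet_config_url("10.0.0.2", "10.0.0.0", "255.255.0.0"))
```

For the values in this report, the sketch reproduces the conf-file URL shown in the dhcpd.conf excerpt.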

We can reproduce the problem with the following commands:

mknb
makedhcp -a

mknb creates the file /tftpboot/pxelinux.cfg/p/10.0.0.0_16, and makedhcp -a
causes any conf-file data in the dhcp leases file to be removed, for example:

        supersede conf-file = "http://10.0.0.2/tftpboot/petitboot/tulgpu009";

"grep tulgpu009 -C5 /var/lib/dhcpd/dhcpd.leases "
shows the following before running makedhcp -a:

host tulgpu009 {
  dynamic;
  hardware ethernet 00:0a:f7:73:88:20;
  fixed-address 10.0.0.19;
        supersede server.ddns-hostname = "tulgpu009";
        supersede host-name = "tulgpu009";
        supersede conf-file = "http://10.0.0.2/tftpboot/petitboot/tulgpu009";
}

And after running makedhcp -a it shows:

host tulgpu009 {
  dynamic;
  hardware ethernet 00:0a:f7:73:88:20;
  fixed-address 10.0.0.19;
        supersede server.ddns-hostname = "tulgpu009";
        supersede host-name = "tulgpu009";
}

Once this happens, to recover we need to remove the file served at http://10.0.0.2/tftpboot/pxelinux.cfg/p/10.0.0.0_16.

This may also imply potential problems with generating genesis boot images and using them on a ppc64 or ppc64el system.

There seem to be times when, once we generate the genesis pxelinux.cfg
files, we can't ever boot ppc64le again until those files are removed.
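
The behaviour described so far can be modelled as a simple fallback. This is only a sketch of what we observed, not of dhcpd internals: when the host entry in the leases file carries a conf-file, the node gets that one; otherwise it gets the subnet-wide conf-file from dhcpd.conf. The function name is hypothetical.

```python
from typing import Optional


def effective_conf_file(subnet_default: Optional[str],
                        host_supersede: Optional[str]) -> Optional[str]:
    """Model of the observed result: the host-level entry wins when present."""
    if host_supersede is not None:
        return host_supersede
    return subnet_default


SUBNET = "http://10.0.0.2/tftpboot/pxelinux.cfg/p/10.0.0.0_16"
HOST = "http://10.0.0.2/tftpboot/petitboot/tulgpu009"

print(effective_conf_file(SUBNET, HOST))  # per-node petitboot config
print(effective_conf_file(SUBNET, None))  # subnet-wide genesis config
```

In this model the second case is the failing one: the node falls back to the subnet file, which points at the genesis kernel.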

We recover from this problem by doing:

 rm -rf /tftpboot/pxelinux.cfg/
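
A narrower variant of this recovery, sketched here as an assumption rather than something we have run, is to move aside only the per-subnet file instead of deleting the whole directory. The function name and the ".disabled" suffix are hypothetical, and a later mknb run would presumably recreate the file.

```python
import os
from typing import Optional


def disable_subnet_config(tftpboot_root: str, subnet_name: str) -> Optional[str]:
    """Rename <root>/pxelinux.cfg/p/<subnet_name> so its URL returns 404.

    Returns the new path, or None if the file was not present.
    """
    path = os.path.join(tftpboot_root, "pxelinux.cfg", "p", subnet_name)
    if not os.path.isfile(path):
        return None
    disabled = path + ".disabled"  # hypothetical suffix, keeps a copy around
    os.replace(path, disabled)
    return disabled
```

For this system the call would be disable_subnet_config("/tftpboot", "10.0.0.0_16").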

I have attached some files:

tulgpu009.boot.log -- console log of the repeating boot after doing mknb
mknb.badboot.pcap -- tcpdump pcap trace showing boot sequence picking up the pxelinux.cfg.p.10.0.0.0_16 file.
mknb.goodboot.pcap -- tcpdump pcap trace showing boot after we remove the pxelinux.cfg.p.10.0.0.0_16

The pcap traces are viewable with Wireshark.

The badboot shows:

GET /tftpboot/pxelinux.cfg/p/10.0.0.0_16 HTTP/1.1
Host: 10.0.0.2
User-Agent: Wget
Connection: close

HTTP/1.1 200 OK
Date: Sat, 21 Mar 2015 15:10:45 GMT
Server: Apache/2.4.6 (Red Hat Enterprise Linux) PHP/5.4.16
Last-Modified: Sat, 21 Mar 2015 15:08:38 GMT
ETag: "c3-511cdcec732f0"
Accept-Ranges: bytes
Content-Length: 195
Connection: close

default xCAT
   label xCAT
   kernel http://10.0.0.2:80//tftpboot/xcat/genesis.kernel.ppc64
   initrd http://10.0.0.2:80//tftpboot/xcat/genesis.fs.ppc64.gz
   append "quiet xcatd=10.0.0.2:3001 "

It then proceeds to load the genesis kernel (for the wrong architecture), and the boot loop repeats.

and the goodboot shows:

GET /tftpboot/pxelinux.cfg/p/10.0.0.0_16 HTTP/1.1
Host: 10.0.0.2
User-Agent: Wget
Connection: close

HTTP/1.1 404 Not Found
Date: Sat, 21 Mar 2015 15:14:38 GMT
Server: Apache/2.4.6 (Red Hat Enterprise Linux) PHP/5.4.16
Content-Length: 233
Connection: close
Content-Type: text/html; charset=iso-8859-1

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>404 Not Found</title>
</head><body>
<h1>Not Found</h1>
<p>The requested URL /tftpboot/pxelinux.cfg/p/10.0.0.0_16 was not found on this server.</p>
</body></html>
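
To check what a node would pick up from a given subnet file, the kernel line can be inspected directly. This is a small sketch that assumes the file uses the "kernel <url>" form shown in the 200 response above and that the architecture is the suffix after "genesis.kernel." in that URL. Both function names are hypothetical.

```python
import re
from typing import Optional


def genesis_kernel_arch(config_text: str) -> Optional[str]:
    """Return the suffix after 'genesis.kernel.' on the first kernel line."""
    for line in config_text.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "kernel":
            match = re.search(r"genesis\.kernel\.([A-Za-z0-9_]+)$", parts[1])
            return match.group(1) if match else None
    return None


def is_arch_mismatch(config_text: str, node_arch: str) -> bool:
    """True when the config names a genesis kernel for a different arch."""
    arch = genesis_kernel_arch(config_text)
    return arch is not None and arch != node_arch


BAD_BOOT_CONFIG = """default xCAT
   label xCAT
   kernel http://10.0.0.2:80//tftpboot/xcat/genesis.kernel.ppc64
   initrd http://10.0.0.2:80//tftpboot/xcat/genesis.fs.ppc64.gz
   append "quiet xcatd=10.0.0.2:3001 "
"""

print(genesis_kernel_arch(BAD_BOOT_CONFIG))
print(is_arch_mismatch(BAD_BOOT_CONFIG, "ppc64le"))
```

With the body from the bad boot trace, the sketch reports a ppc64 genesis kernel, which does not match a ppc64le node.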

Discussion

  • ralph bellofatto

    I believe that a fix for this could be in the petitboot.pm file, where it makes the decision to NOT write the dhcp leases conf-file when the status is set to "boot"

    This logic in the petitboot.pm file seems to be preventing the conf-file from being written into the dhcp leases file.

      $normalnodes{$node}=1; #Assume a normal netboot (well, normal dhcp, 
                          #which is normally with a valid 'filename' field,
                          #but the typical ppc case will be 'special' makedhcp
                          #to clear the filename field, so the logic is a little
                          #opposite
      #  $sub_req->({command=>['makedhcp'], #This is currently batched elswhere
      #         node=>[$node]},$callback);  #It hopefully will perform correctly
      if ($cref and $cref->{currstate} eq "boot") {
        $breaknetbootnodes{$node}=1;
        delete $normalnodes{$node}; #Signify to omit this from one makedhcp command
    

    Given the possible presence of a genesis boot file in the tftpboot directory and the default conf-file field in the dhcpd.conf file, I think that this record needs to be in the leases file all the time.

     
  • zhao er tao

    zhao er tao - 2015-03-24
    • assigned_to: zhao er tao
     
  • zhao er tao

    zhao er tao - 2015-03-24

    Hi Ralph, running "mknb ppc64" alone won't cause this issue, right?
    I will fix this issue in "makedhcp -a", which creates a dhcp lease entry for every node based on its "provmethod" attribute.

     
  • ralph bellofatto

    Running mknb causes the file /tftpboot/pxelinux.cfg/p/10.0.0.0_16 to be created if it is not already there.

    If the dhcpd leases file does not have the conf-file entry in it (and it does not always have it), then we will have this problem immediately upon running mknb.

    host tulgpu009 {
    dynamic;
    hardware ethernet 00:0a:f7:73:88:20;
    fixed-address 10.0.0.19;
    supersede server.ddns-hostname = "tulgpu009";
    supersede host-name = "tulgpu009";
    supersede conf-file = "http://10.0.0.2/tftpboot/petitboot/tulgpu009";
    }

    I'm finding that xCAT is inconsistent about whether this conf-file entry gets filled in. I have found that makedhcp -a removes the conf-file entry. However, there may be other circumstances where it does not get put there either.

    To be reliable, the dhcp leases file for petitboot must ALWAYS have the conf-file entry. Otherwise, if the pxelinux.cfg file for that subnet exists, then the boots will fail.

    Please check that the makedhcp -a is the only place where this entry in the leases file can go missing.

     
  • ralph bellofatto

    Actually, I misspoke.

    After checking my notes:

    mknb ppc64 will cause the genesis files to be created. Once those files are created and anything is done on the system that results in a leases file not having a conf-file entry, the ppc64le systems won't boot.

     
  • Guang Cheng Li

    Guang Cheng Li - 2015-04-14
    • Priority: 5 --> 7
     
  • Guang Cheng Li

    Guang Cheng Li - 2015-04-14

    This is a problem we need to fix. PCM also complained about it for their HA configuration, and in a hierarchical environment, restarting the xcatd service on the SN also triggers this problem.

     
  • zhao er tao

    zhao er tao - 2015-05-13
    • status: open --> pending
     
  • zhao er tao

    zhao er tao - 2015-05-13

    Fixed in the master branch with git commits a84a9e655bb955ec3071bb3fed93478205008913 and ff20b96dba91feec3e506ae354877a85c89a6951. Could you please try an xCAT build later than today?

     
  • ralph bellofatto

    Will this fix be applied to 2.10 and 2.9, or only 2.10?

     
  • zhao er tao

    zhao er tao - 2015-05-14

    This fix is only in the 2.10 branch.

     
  • Guang Cheng Li

    Guang Cheng Li - 2015-07-06

    This bug has been in pending status for 2 months, so I am closing it out.

     
  • Guang Cheng Li

    Guang Cheng Li - 2015-07-06
    • status: pending --> closed
    • component: unknown --> dhcp
     