From: Mark P. <mp...@pc...> - 2016-04-21 22:45:00
Thinking about edge cases, maybe pull the Product Manufacturer and trim it to 4 characters if the field the MTM is pulled from doesn't exist. That should cover SNs not necessarily being unique between manufacturers. Have both nodediscovery and dodiscovery check the manufacturer: if it's Lenovo or IBM, continue as currently coded; if not, check for MTM, then Product Name, then Product Manufacturer. I'll see what I can come up with. I should have some hardware to test with soon, outside of these Intel nodes.

Sent from Outlook <https://aka.ms/f818qx>

On Thu, Apr 21, 2016 at 2:17 PM -0700, "Mark Potter" <mp...@pc...> wrote:

That's what I figured. While I have it working, my hack is, well, hacky. I can't come up with a scenario where serial alone wouldn't be unique enough for deployment. I'm working on having it not need the MTM if it's absent and still use it if it's present. That would seem to make MTMS work in just about every scenario.

Sent from Outlook <https://aka.ms/f818qx>

On Thu, Apr 21, 2016 at 2:09 PM -0700, "Jarrod Johnson" <jjo...@le...> wrote:

Yeah, that '[]' syntax is a specific thing set by IBM originally, and thus only IBM and Lenovo do that…. In general, the whole concept of a 'machine type' or 'model number' somehow distinct from the 'model name' isn't a universal one. E.g. two different model numbers can both map to 'x3650 M5' for various reasons…

From: Mark Potter [mailto:mp...@pc...]
Sent: Thursday, April 21, 2016 4:54 PM
To: xCAT Users Mailing list
Subject: Re: [xcat-user] MTMS Discovery Questions

I noticed that the discovered nodes don't have an MTM defined so I went digging. In dodiscovery the MTM is set with, at least for these nodes,

    72: MTM=`cat /sys/devices/virtual/dmi/id/product_name|awk -F'[' '{print $2}'|awk -F']' '{print $1}'`

This command does not produce output for multiple systems thus far. Dell R730s wound up with a blank MTM, as do the Intel nodes I'm testing.
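For illustration only, here is a minimal shell sketch of the fallback order floated at the top of this thread: use the bracketed machine type when the product name has one, otherwise the first word of the product name (an assumption here), otherwise the manufacturer trimmed to 4 characters. The function name derive_mtm and the example strings are hypothetical and are not part of xCAT; reading the manufacturer from /sys/devices/virtual/dmi/id/sys_vendor is likewise an assumption.

```shell
#!/bin/sh
# Hypothetical helper, NOT part of xCAT: derive an MTM-like key from DMI strings.
#   $1 = product name  (e.g. contents of /sys/devices/virtual/dmi/id/product_name)
#   $2 = manufacturer  (assumed source: /sys/devices/virtual/dmi/id/sys_vendor)
derive_mtm() {
    product_name=$1
    manufacturer=$2
    case $product_name in
        *\[*\]*)
            # IBM/Lenovo style: keep only the text between '[' and ']'.
            mtm=${product_name#*\[}
            mtm=${mtm%%\]*}
            ;;
        *)
            # No brackets: fall back to the first word of the product name.
            mtm=$(printf '%s\n' "$product_name" | awk '{print $1}')
            ;;
    esac
    # Still empty: use the first 4 characters of the manufacturer.
    if [ -z "$mtm" ]; then
        mtm=$(printf '%.4s' "$manufacturer")
    fi
    printf '%s\n' "$mtm"
}

# Example call on a live system (paths as assumed above):
#   derive_mtm "$(cat /sys/devices/virtual/dmi/id/product_name)" \
#              "$(cat /sys/devices/virtual/dmi/id/sys_vendor)"
```

With an illustrative bracketed name such as "Example Server -[8737AC1]-" the function yields the bracketed value; with "PowerEdge R730" it yields the first word; with an empty product name it falls back to the trimmed manufacturer.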
This is the cause of the discovery not working as expected. I don't have a lot of hardware to test with, but changing it to:

    72: #MTM=`cat /sys/devices/virtual/dmi/id/product_name|awk -F'[' '{print $2}'|awk -F']' '{print $1}'`
    73: MTM=`cat /sys/devices/virtual/dmi/id/product_name|awk '{ print $1 }'`

and rerunning mknb caused MTMS discovery to function properly for me with the Intel nodes I'm testing with. This method of finding the MTM is a little iffy for some manufacturers, based on the way that the MTM is pulled during bmcdiscovery:

(bmcdiscover.pm)

    870: my $fru0_cmd = "/opt/xcat/bin/ipmitool-xcat -I lanplus $bmcusername $bmcpassword -H $ip fru print 0";
    871: my @fru0_output_array = xCAT::Utils->runcmd($fru0_cmd, -1);
    872: my $fru0_output = join(" ", @fru0_output_array);
    873: if (($fru0_output =~ /Product Part Number :\s*(\S*).*Product Serial :\s*(\S*)/)) {
    874: $mtm = $1;
    875: $serial = $2;

which will pull only the first whitespace-delimited token of the Product Part Number. Dell hardware doesn't even have that field available, and the output of cat /sys/devices/virtual/dmi/id/product_name lines up with the Product Name field in the FRU output:

    Board Mfg Date        : Sat Oct 18 07:21:00 2014
    Board Mfg             : DELL
    Board Product         : PowerEdge R730
    Board Serial          : CN7792149P01WZ
    Board Part Number     : 0599V5A03
    Product Manufacturer  : DELL
    Product Name          : PowerEdge R730
    Product Version       : 01
    Product Serial        : 6Z0GD42

In the end I'm wondering what the necessity is for the machine type outside of IBM/Lenovo hardware. It seems that a serial number should be a unique enough identifier to key off of, and it would be easy enough to code a check to see if MTM is set and, if it's not, simply ignore it and key off of the serial alone for discovery.

Regards,
Mark L. Potter
________________________________
From: Mark Potter [mp...@pc...]
Sent: Wednesday, April 20, 2016 10:37 AM
To: xCAT Users Mailing list
Subject: Re: [xcat-user] MTMS Discovery Questions

I realized that I didn't include any logs in my last email.
I tried again by removing both the bmc nodes and the defined nodes from the stanza file, clearing dhcpd.leases, and running "makehosts -n, makedns -n, makedhcp -n, makedhcp -a" before starting so it wasn't running the makedhcp commands that caused the issue.

commands.log
---------
====================================================
[Date] 2016-04-20 10:28:31
[ClientType] cli
[Request] bmcdiscover -s nmap --range 172.1.252-253.1-253 -t -z -w
[Response]
node-intel-s2600wp-pcpc-intel-002:
    objtype=node
    groups=all
    bmc=172.1.252.5
    cons=ipmi
    mgt=ipmi
    mtm=INTEL-S2600WP
    serial=PCPC-INTEL-002
    nodetype=mp
    hwtype=bmc
node-intel-s2600wp-pcpc-intel-004:
    objtype=node
    groups=all
    bmc=172.1.252.10
    cons=ipmi
    mgt=ipmi
    mtm=INTEL-S2600WP
    serial=PCPC-INTEL-004
    nodetype=mp
    hwtype=bmc
node-intel-s2600wp-pcpc-intel-003:
    objtype=node
    groups=all
    bmc=172.1.252.14
    cons=ipmi
    mgt=ipmi
    mtm=INTEL-S2600WP
    serial=PCPC-INTEL-003
    nodetype=mp
    hwtype=bmc
====================================================
[Date] 2016-04-20 10:29:21
[ClientType] cli
[Request] mkdef -z
[Response]
3 object definitions have been created or modified.
====================================================
[Date] 2016-04-20 10:29:24
[ClientType] cli
[Request] nodels
[Response]
intel-r1c1n2
intel-r1c1n3
intel-r1c1n4
node-intel-s2600wp-pcpc-intel-002
node-intel-s2600wp-pcpc-intel-003
node-intel-s2600wp-pcpc-intel-004
xchead02
====================================================
[Date] 2016-04-20 10:29:29
[ClientType] cli
[Request] makehosts -n
[Response]
====================================================
[Date] 2016-04-20 10:29:32
[ClientType] cli
[Request] makedns -n
[Response]
Handling intel-r1c1n2 in /etc/hosts.
Handling intel-r1c1n4 in /etc/hosts.
Handling localhost in /etc/hosts.
Handling xchead02 in /etc/hosts.
Handling intel-r1c1n3 in /etc/hosts.
Getting reverse zones, this may take several minutes for a large cluster.
Completed getting reverse zones.
Updating zones.
Completed updating zones.
Restarting named
Restarting named complete
Updating DNS records, this may take several minutes for a large cluster.
Completed updating DNS records.
====================================================
[Date] 2016-04-20 10:29:49
[ClientType] cli
[Request] rsetboot node-intel-s2600wp-pcpc-intel-002-node-intel-s2600wp-pcpc-intel-004 net
[Response]
node-intel-s2600wp-pcpc-intel-004: Network
node-intel-s2600wp-pcpc-intel-002: Network
node-intel-s2600wp-pcpc-intel-003: Network
====================================================
[Date] 2016-04-20 10:29:56
[ClientType] cli
[Request] rpower node-intel-s2600wp-pcpc-intel-002-node-intel-s2600wp-pcpc-intel-004 reset
[Response]
node-intel-s2600wp-pcpc-intel-004: reset
node-intel-s2600wp-pcpc-intel-003: reset
node-intel-s2600wp-pcpc-intel-002: reset
====================================================
[Date] 2016-04-20 10:31:24
[ClientType]
[Request] getcredentials x509cert
[Response]
====================================================
[Date] 2016-04-20 10:31:24
[ClientType]
[Request] getdestiny
[Response]
====================================================
[Date] 2016-04-20 10:31:25
[ClientType]
[Request] getcredentials x509cert
[Response]
====================================================
[Date] 2016-04-20 10:31:25
[ClientType]
[Request] getdestiny
[Response]
====================================================
[Date] 2016-04-20 10:31:58
[ClientType]
[Request] getcredentials x509cert
[Response]
====================================================
[Date] 2016-04-20 10:31:58
[ClientType]
[Request] getdestiny
[Response]
---------

cluster.log
----------
Apr 20 10:29:56 xchead02 xcat[83701]: xCAT: Allowing rpower to node-intel-s2600wp-pcpc-intel-002-node-intel-s2600wp-pcpc-intel-004 reset for root from localhost
Apr 20 10:31:24 xchead02 xcat[83729]: xCAT: Allowing getcredentials x509cert from 172-1-252-7.lightspeed.nsvltn.sbcglobal.net
Apr 20 10:31:25 xchead02 xcat[83741]: xCAT: Allowing getcredentials x509cert from 172-1-252-9.lightspeed.nsvltn.sbcglobal.net
Apr 20 10:31:58 xchead02 xcat[83755]: xCAT: Allowing getcredentials x509cert from 172-1-252-11.lightspeed.nsvltn.sbcglobal.net
Apr 20 10:32:37 xchead02 xcat[62143]: xcatd: Processing discovery request from 172.1.252.7
Apr 20 10:32:37 xchead02 xcat[62143]: Discovery Error: Could not find any node.
Apr 20 10:32:37 xchead02 xcat[62143]: Discovery Error: Could not find any node.
Apr 20 10:32:39 xchead02 xcat[62143]: xcatd: Processing discovery request from 172.1.252.9
Apr 20 10:32:39 xchead02 xcat[62143]: Discovery Error: Could not find any node.
Apr 20 10:32:39 xchead02 xcat[62143]: Discovery Error: Could not find any node.
----------

Regards,
Mark L. Potter
________________________________
From: Mark Potter [mp...@pc...]
Sent: Wednesday, April 20, 2016 10:06 AM
To: xCAT Users Mailing list
Subject: Re: [xcat-user] MTMS Discovery Questions

As I gathered this information I realized that I ran "makedhcp -n" and "makedhcp -a", which aren't called for in the docs, just out of habit. Is that causing the nodes not to deploy in some way?
Here's the process I used (http://xcat-docs.readthedocs.org/en/latest/guides/admin-guides/manage_clusters/ppc64le/discovery/mtms/discovery_using_defined.html):

# bmcdiscover -s nmap --range "172.1.252-253.1-253" -t -z -w
# nodels
node-intel-s2600wp-pcpc-intel-002
node-intel-s2600wp-pcpc-intel-003
node-intel-s2600wp-pcpc-intel-004
# bmcdiscover -s nmap --range "172.1.252-253.1-253" -z > intel.stanza.1

<below cut to a single node for brevity>

# cat intel.stanza
node-intel-s2600wp-pcpc-intel-002:
    objtype=node
    groups=all
    bmc=172.1.252.5
    cons=ipmi
    mgt=ipmi
    mtm=INTEL-S2600WP
    serial=PCPC-INTEL-002

<edits made>

# cat intel.stanza
intel-r1c1n2:
    objtype=node
    groups=intel
    bmc=172.1.100.2
    cons=ipmi
    mgt=ipmi
    mtm=INTEL-S2600WP
    serial=PCPC-INTEL-002
    ip=172.1.1.2
# cat intel.stanza | mkdef -z
# nodels
intel-r1c1n2
intel-r1c1n3
intel-r1c1n4
node-intel-s2600wp-pcpc-intel-002
node-intel-s2600wp-pcpc-intel-003
node-intel-s2600wp-pcpc-intel-004
# lsdef node-intel-s2600wp-pcpc-intel-002
Object name: node-intel-s2600wp-pcpc-intel-002
    bmc=172.1.252.5
    cons=ipmi
    groups=all
    hwtype=bmc
    mgt=ipmi
    mtm=INTEL-S2600WP
    nodetype=mp
    postbootscripts=otherpkgs
    postscripts=syslog,remoteshell,syncfiles
    serial=PCPC-INTEL-002
    status=booting
    statustime=04-20-2016 09:47:03
# lsdef intel-r1c1n2
Object name: intel-r1c1n2
    bmc=172.1.100.2
    chain=runcmd=bmcsetup,osimage=node-netboot
    cons=ipmi
    groups=intel
    ip=172.1.1.2
    mgt=ipmi
    mtm=INTEL-S2600WP
    postbootscripts=otherpkgs
    postscripts=syslog,remoteshell,syncfiles
    serial=PCPC-INTEL-002

<add chain config - I used the group here instead of a single node>

# tabdump chain
#node,currstate,currchain,chain,ondiscover,comments,disable
"intel",,,"runcmd=bmcsetup,osimage=node-netboot",,,
# makehosts -n
# makedns -n
# makedhcp -n
# makedhcp -a
# rsetboot node-intel-s2600wp-pcpc-intel-002 net
# rpower node-intel-s2600wp-pcpc-intel-002 status
# rpower node-intel-s2600wp-pcpc-intel-002 status
node-intel-s2600wp-pcpc-intel-002: off
# rpower node-intel-s2600wp-pcpc-intel-002 on

Regards,
Mark L. Potter
________________________________
From: Mark Potter [mp...@pc...]
Sent: Wednesday, April 20, 2016 7:49 AM
To: xca...@li...; xca...@li...
Cc: xca...@li...
Subject: Re: [xcat-user] MTMS Discovery Questions

It did, although I think it was due to other reasons. Genesis did fail to boot but I ended up being able to boot a node. I ended up having to use nodediscoverdef to associate the node with a defined node. Using the stanza file generated from bmcdiscover didn't end up with the node deploying or even being recognized by discovery. Logs showed the discovery seeing the node but no node being identified. I'm going to be working on it more this morning, going back through the docs and seeing if I missed anything.

Sent from Outlook <https://aka.ms/f818qx>

On Tue, Apr 19, 2016 at 6:14 PM -0700, "Xiao Peng Wang" <wx...@cn...> wrote:

Hi Mark,

The error might happen in the code 10xcat-cmdline.sh, which is used to hack the genesis system. But you mentioned that the system dropped back to PXE because of this error. Do you mean that genesis failed to boot and then the server tried to boot from PXE again?

Thanks
Best Regards
----------------------------------------------------------------------
Wang Xiaopeng (王晓朋)
IBM China System Technology Laboratory
Tel: 86-10-82453455
Email: wx...@cn...
Address: 28,ZhongGuanCun Software Park,No.8 Dong Bei Wang West Road, Haidian District Beijing P.R.China 100193

----- Original message -----
From: Mark Potter <mp...@pc...>
To: xCAT Users Mailing list <xca...@li...>
Cc:
Subject: Re: [xcat-user] MTMS Discovery Questions
Date: Wed, Apr 20, 2016 3:40 AM

Using the new instructions I have a node booting with genesis but it fails with "sed: can't read /etc/passwd" and drops back to PXE. I can't find anything with that error when searching.
I hope it's just something I've overlooked because it seems I'm really close to having this working. I will have some comments on the docs, very few but I've got a couple, once I get this working!

Regards,
Mark L. Potter
________________________________
From: Victor Hu [vh...@us...]
Sent: Monday, April 18, 2016 10:09 AM
To: xca...@li...
Subject: Re: [xcat-user] MTMS Discovery Questions

Hi Mark,

I'm currently working on a re-factor/rewrite of the MTMS section to hopefully clear up the procedure a bit. You can track it by following: https://github.com/xcat2/xcat-core/pull/705

It's not yet merged, but fairly close... maybe by tomorrow. Once merged, you will see the new document in the "latest" branch of ReadTheDocs. http://xcat-docs.readthedocs.org/en/latest/guides/admin-guides/manage_clusters/ppc64le/discovery/mtms_discovery.html

Feedback is welcome, feel free to open GitHub issues to report

Regards,
VICTOR K. HU
HPC Software Development

----- Original message -----
From: Mark Potter <mp...@pc...>
To: "xca...@li..." <xca...@li...>, "xca...@li..." <xca...@li...>
Cc:
Subject: Re: [xcat-user] MTMS Discovery Questions
Date: Mon, Apr 18, 2016 9:29 AM

How would predefinition work? I'd like to be able to define the nodes, setting serial and machine type, and have them recognized using bmcdiscovery. Oddly, I hadn't tried that; I was busy figuring out how to get the white boxes I have in the lab to have serial/machine type set in fields xCAT would recognize.

Sent from Outlook <https://aka.ms/f818qx>

On Sun, Apr 17, 2016 at 6:52 PM -0700, "Xiao Peng Wang" <wx...@cn...> wrote:

xCAT does support using MTMS to discover nodes. You can predefine the node with MTMS manually or use the bmcdiscover command to scan the nodes automatically.
Refer to the doc: https://xcat-docs.readthedocs.org/en/2.11/guides/admin-guides/manage_clusters/ppc64le/discovery/mtms_discovery.html

Thanks
Best Regards
----------------------------------------------------------------------
Wang Xiaopeng (王晓朋)
IBM China System Technology Laboratory
Tel: 86-10-82453455
Email: wx...@cn...
Address: 28,ZhongGuanCun Software Park,No.8 Dong Bei Wang West Road, Haidian District Beijing P.R.China 100193

----- Original message -----
From: Mark Potter <mp...@pc...>
To: "xca...@li..." <xca...@li...>
Cc:
Subject: Re: [xcat-user] MTMS Discovery Questions
Date: Sat, Apr 16, 2016 12:48 AM

Actually it's not all that odd. The nodes I'm working with are white box nodes so the FRU stuff is blank on purpose. Luckily these are test nodes so I can take all the time I need. I set the right FRU fields and xCAT picked everything up without an issue.

I'm now reading through code to see how switch discovery works and whether I can get it to look for the serial and auto-assign based on that. The use case is that we do a lot of integration and testing and already scan serials. We don't leave an image on the machines, and the switches end up configured for the client. Since we already have the serials, in order, per rack, configuring top of rack switches twice can be quite a bit of added time that could be avoided.

I can generate stanza files off the serial scan and pre-add every node with the right name. If I can get discovery to auto-associate the nodes, then that's one less step. If I can't, then I'll write a script to go through a stanza file and fix it up nice and neat and go that route. I'm hoping for the former, but I'm not a Perl guru; I can read and modify, but that's with looking things up. Getting the serials set on my test nodes was a good step in the right direction.
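The stanza generation mentioned above can be sketched roughly as follows, assuming a plain-text scan file with one "<nodename> <serial>" pair per line. The function name make_stanza, the input layout, and the example group and machine type are hypothetical; only the attribute names (objtype, groups, mtm, serial) come from stanzas shown elsewhere in this thread.

```shell
#!/bin/sh
# Hypothetical helper, NOT part of xCAT: turn a serial scan into a stanza file.
#   stdin = one "<nodename> <serial>" pair per line (assumed layout)
#   $1    = group to assign to every node
#   $2    = machine type/model to record; leave empty to omit the mtm line
make_stanza() {
    group=$1
    mtm=$2
    while read -r node serial; do
        # Skip blank lines and '#' comment lines in the scan file.
        case $node in
            ''|'#'*) continue ;;
        esac
        printf '%s:\n' "$node"
        printf '    objtype=node\n'
        printf '    groups=%s\n' "$group"
        if [ -n "$mtm" ]; then
            printf '    mtm=%s\n' "$mtm"
        fi
        # Blank line after each stanza keeps the objects visually separate.
        printf '    serial=%s\n\n' "$serial"
    done
}
```

A possible use, with serials.txt standing in for the scan output, would be `make_stanza intel INTEL-S2600WP < serials.txt | mkdef -z`; whether discovery then matches on those attributes is exactly the open question in this thread.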

Sent from Outlook <https://aka.ms/f818qx>

On Fri, Apr 15, 2016 at 7:25 AM -0700, "Jarrod Johnson" <jjo...@le...> wrote:

Yeah, it can get a little weird digging into that part of it in a cross-vendor way. The switch discovery was intended as one of the more cross-vendor tolerant modes of getting the job done.

From: Mark Potter [mailto:mp...@pc...]
Sent: Thursday, April 14, 2016 6:06 PM
To: xca...@li...
Subject: Re: [xcat-user] MTMS Discovery Questions

And after getting everything working, the Intel nodes I'm testing with don't have a machine type or serial number that can be identified by xCAT. I've some thinking to do about this one now.

Sent from Outlook <https://aka.ms/f818qx>

On Thu, Apr 14, 2016 at 10:09 AM -0700, "Jarrod Johnson" <jjo...@le...> wrote:

For reference, one thought was that the gathering of the serial numbers could follow from the switch discovery, skipping the tag scan if that was desirable.

In the current circumstance:

[root@odin ~]# nodediscoverls
UUID                                  NODE   METHOD  MTM      SERIAL
5CD1216B-5C37-11E1-BA0C-5CF3FC6E49C8  undef  undef   8737AC1  23XXH29
# nodediscoverdef -u 5CD1216B-5C37-11E1-BA0C-5CF3FC6E49C8 -n n1

There are some other things coming….

From: Mark Potter [mailto:mp...@pc...]
Sent: Thursday, April 14, 2016 12:11 PM
To: xCAT Users Mailing list
Subject: [xcat-user] MTMS Discovery Questions

I am working on getting MTMS up and running as it fits our use case very well. We do a large volume of nodes that are shipped out to customers and all we are responsible for is testing. Scanning tags for serial numbers is already part of the process, so getting a list of serial/position is much easier than configuring top of rack switches and doing switch/port discovery. I am curious as to why this method can't be used with pre-defined nodes as switch/port discovery can.
What I'd like to be able to do is use the serial number/position file that I have to predefine the nodes and have them completely set up during discovery. Will this be an exercise in futility and will I be better off just developing scripts to handle the discovered nodes? I am guessing, from reading the docs, that I'll be scripting a chunk of this process but I thought it wouldn't hurt to ask the hive mind before I dived in and broke out the python.

Regards,
Mark L. Potter

------------------------------------------------------------------------------
Find and fix application performance issues faster with Applications Manager
Applications Manager provides deep performance insights into multiple tiers of
your business applications. It resolves application problems quickly and
reduces your MTTR. Get your free trial!
https://ad.doubleclick.net/ddm/clk/302982198;130105516;z
_______________________________________________
xCAT-user mailing list
xCA...@li...
https://lists.sourceforge.net/lists/listinfo/xcat-user