From: Jarrod J. <jjo...@le...> - 2024-01-25 21:34:28
What is the OS of the deployment server?

kill -USR1 $(cat /var/run/confluent/pid)

This should produce /var/log/confluent/hangtraces.

It would be interesting to see whether there are Ansible-related stacks in hangtraces that seem stuck...

________________________________
From: David Magda <dma...@ee...>
Sent: Thursday, January 25, 2024 4:25 PM
To: xCAT Users Mailing list <xca...@li...>
Subject: Re: [xcat-user] [External] Ansible and Confluent

First suggested command:

"""
# confluent_selfcheck
OS Deployment: Initialized
Confluent UUID: Consistent
Web Server: Running
Web Certificate: Traceback (most recent call last):
  File "/opt/confluent/bin/confluent_selfcheck", line 178, in <module>
    cert = certificates_missing_ips(conn)
  File "/opt/confluent/bin/confluent_selfcheck", line 57, in certificates_missing_ips
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
AttributeError: 'module' object has no attribute 'PROTOCOL_TLS_CLIENT'
"""

On the being-installed system, ignoring the typical Linux stuff, the output of 'ps -elfH' has:

"""
4 S root 1247 1 0 80 0 - 7499 do_pol 17:53 ? 00:00:00 /usr/bin/python3 /usr/bin/networkd-dispatcher --run-startup-triggers
4 S root 1248 1 0 80 0 - 58623 do_pol 17:53 ? 00:00:00 /usr/libexec/polkitd --no-debug
4 S syslog 1250 1 0 80 0 - 55600 do_sel 17:53 ? 00:00:00 /usr/sbin/rsyslogd -n -iNONE
4 S root 1252 1 0 80 0 - 385081 futex_ 17:53 ? 00:00:03 /usr/lib/snapd/snapd
4 S root 1253 1 0 80 0 - 3831 ep_pol 17:53 ? 00:00:00 /lib/systemd/systemd-logind
4 S root 1255 1 0 80 0 - 98198 do_pol 17:53 ? 00:00:02 /usr/libexec/udisks2/udisksd
4 S root 1283 1 0 80 0 - 26778 do_pol 17:53 ? 00:00:00 /usr/bin/python3 /usr/share/unattended-upgrades/unattended-upgrade-shutdown --wait-for-signal
4 S root 1291 1 0 80 0 - 61055 do_pol 17:53 ? 00:00:00 /usr/sbin/ModemManager
4 S root 2042 1 0 80 0 - 722 do_wai 17:53 ? 00:00:00 /bin/sh /snap/subiquity/5004/usr/bin/subiquity-server
4 S root 2086 2042 0 80 0 - 149574 ep_pol 17:53 ? 00:00:07 /snap/subiquity/5004/usr/bin/python3.10 -m subiquity.cmd.server
4 S root 27499 2086 0 80 0 - 722 do_wai 18:09 ? 00:00:00 sh -c /custom-installation/post.sh
4 S root 27501 27499 0 80 0 - 1150 do_wai 18:09 ? 00:00:00 /bin/bash /custom-installation/post.sh
4 S root 27588 27501 4 80 0 - 7403 do_pol 18:09 ? 00:03:16 /usr/bin/python3 /opt/confluent/bin/apiclient /confluent-api/self/remoteconfig/status -w 204
4 S root 2049 1 0 80 0 - 24167 ep_pol 17:53 tty1 00:00:05 /snap/subiquity/5004/usr/bin/python3.10 /snap/subiquity/5004/usr/bin/subiquity
4 S root 2137 1 0 80 0 - 3855 do_pol 17:53 ? 00:00:00 sshd: /usr/sbin/sshd -D [listener] 0 of 10-100 startups
4 S root 37842 2137 0 80 0 - 4310 - 19:15 ? 00:00:00 sshd: root@pts/0
4 S root 37952 37842 0 80 0 - 1543 do_wai 19:15 ? 00:00:00 -bash
4 R root 38032 37952 0 80 0 - 1911 - 19:16 ? 00:00:00 ps -elfH
4 S root 2206 1 0 80 0 - 3266 ep_pol 17:53 ? 00:00:00 /lib/netplan/netplan-dbus
4 S root 2570 1 0 80 0 - 73244 do_pol 17:53 ? 00:00:00 /usr/libexec/packagekitd
4 S root 37848 1 1 80 0 - 4301 ep_pol 19:15 ? 00:00:00 /lib/systemd/systemd --user
5 S root 37850 37848 0 80 0 - 26271 do_sig 19:15 ? 00:00:00 (sd-pam)
"""

While 'ps axf' produces (trimmed):

"""
 2042 ?        Ss     0:00 /bin/sh /snap/subiquity/5004/usr/bin/subiquity-server
 2086 ?        Sl     0:07  \_ /snap/subiquity/5004/usr/bin/python3.10 -m subiquity.cmd.server
27499 ?        S      0:00      \_ sh -c /custom-installation/post.sh
27501 ?        S      0:00          \_ /bin/bash /custom-installation/post.sh
27588 ?        S      3:21              \_ /usr/bin/python3 /opt/confluent/bin/apiclient /confluent-api/self/remoteconfig/status -w 204
 2049 tty1     Ss+    0:05 /snap/subiquity/5004/usr/bin/python3.10 /snap/subiquity/5004/usr/bin/subiquity
"""

Doing a "kill -9 27588" (on apiclient) causes the installation to 'finish'.

After the reboot, and after "firstboot.sh" does its thing, we have the following from 'ps axf':

"""
 1372 ?        Ss     0:00 /usr/bin/python3 /usr/bin/cloud-init modules --mode=final
 1376 ?        S      0:00  \_ /bin/sh -c tee -a /var/log/cloud-init-output.log
 1377 ?        S      0:00  |   \_ tee -a /var/log/cloud-init-output.log
 1378 ?        S      0:00  \_ /bin/sh /var/lib/cloud/instance/scripts/runcmd
 1379 ?        S      0:00      \_ /bin/bash /etc/confluent/firstboot.sh
 1429 ?        S      0:01          \_ /usr/bin/python3 /opt/confluent/bin/apiclient /confluent-api/self/remoteconfig/status -w 204
"""

This causes the "/var/log/httpd/ssl_access_log" to start filling up.

A subsequent reboot, where "firstboot.sh" is not run, has the system coming up without "apiclient" running, and so there's no longer 'spam' in "ssl_access_log".

Running "apiclient" manually from the CLI with the exact options causes a bunch of stuff in "ssl_access_log":

"""
fe80::[EUI-64] - - [25/Jan/2024:14:52:15 -0500] "GET /confluent-api/self/remoteconfig/status HTTP/1.1" 200 -
"""

At the same time as the above is being generated, there is nothing in "/var/log/confluent/trace" or "stderr".

On Thu, January 25, 2024 07:52, Jarrod Johnson wrote:
> Anything in /var/log/confluent/stderr or /var/log/confluent/trace? Also
> would be tempted to see if 'confluent_selfcheck' has any suggestions. You
> can also ssh into the node during that phase to confirm what it is doing
> while it is seemingly hung, e.g. looking at ps axf
> ________________________________
> From: David Magda <dma...@ee...>
> Sent: Wednesday, January 24, 2024 9:37 PM
> To: xCA...@li... <xCA...@li...>
> Subject: [External] [xcat-user] Ansible and Confluent
>
> Hello,
>
> I'm trying to get Ansible working with Confluent 3.8.0. (Using an older
> version due to legacy OS reasons.)
>
> In /var/lib/confluent/public/os/ I created a new profile called
> ubuntu-22.04.3-x86_64-test1/, and this seems to work just fine: I took the
> provided "autoinstall/user-data" file, added some partition stanzas, some
> packages, etc.
>
> Once I sorted out a 'basic' automated Ubuntu install I tried creating an
> "ansible/post.d/01-packages.yaml" file within the profile directory with
> the following contents:
>
> """
> - name: install chrony
>   apt:
>     pkg:
>       - chrony
> """
>
> The Ubuntu (subiquity) installer seems to 'hang' at:
>
> """
> start: subiquity/Late/run/command_1: /custom-installation/post.sh
> """
>
> which probably corresponds to this part of the "user-data" file:
>
> """
> late-commands:
>   - chroot /target apt-get -y -q purge snapd modemmanager
>   - /custom-installation/post.sh
> """
>
> When the 'hang' occurs the following starts filling up the
> "/var/log/httpd/ssl_access_log" file of the Confluent/xcat server:
>
> """
> fe80::[EUI-64] - - [24/Jan/2024:11:15:08 -0500] "GET /confluent-api/self/remoteconfig/status HTTP/1.1" 200 -
> fe80::[EUI-64] - - [24/Jan/2024:11:15:08 -0500] "GET /confluent-api/self/remoteconfig/status HTTP/1.1" 200 -
> fe80::[EUI-64] - - [24/Jan/2024:11:15:08 -0500] "GET /confluent-api/self/remoteconfig/status HTTP/1.1" 200 -
> fe80::[EUI-64] - - [24/Jan/2024:11:15:08 -0500] "GET /confluent-api/self/remoteconfig/status HTTP/1.1" 200 -
> fe80::[EUI-64] - - [24/Jan/2024:11:15:08 -0500] "GET /confluent-api/self/remoteconfig/status HTTP/1.1" 200 -
> fe80::[EUI-64] - - [24/Jan/2024:11:15:08 -0500] "GET /confluent-api/self/remoteconfig/status HTTP/1.1" 200 -
> """
>
> When I force a restart of the system/VM, it can boot off the disk, and
> goes through the regular start-up process, including a bunch of cloud-init
> stuff. Though after it runs "/etc/confluent/firstboot.sh", the
> "ssl_access_log" file once again starts filling with the
> "remoteconfig/status" stuff per above.
>
> Renaming "ansible/" to "ansible_off/" seems to make the problem go away.
> Similar behaviour with Ubuntu 20.04.
>
> I'm wondering what's going on with the 'hang' when "post.sh" is executed,
> and the flooding after "firstboot.sh".
>
> Regards,
> David

_______________________________________________
xCAT-user mailing list
xCA...@li...
https://lists.sourceforge.net/lists/listinfo/xcat-user
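
To put a number on the "ssl_access_log" flooding described in this thread, the repeated "GET /confluent-api/self/remoteconfig/status" entries can be tallied per client and per second. The following is only a rough sketch, not part of Confluent: the names LINE_RE, STATUS_PATH and count_status_polls are made up for illustration, and the regular expression assumes the log line layout shown in the samples quoted in the thread.

```python
import re
import sys
from collections import Counter

# Layout assumed from the sample lines in the thread:
#   client - - [timestamp] "METHOD path protocol" status size
# Other access-log formats are not handled by this sketch.
LINE_RE = re.compile(
    r'^(?P<client>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3})\b'
)

# Endpoint that apiclient is seen polling in the thread.
STATUS_PATH = "/confluent-api/self/remoteconfig/status"


def count_status_polls(lines):
    """Count polls of STATUS_PATH, keyed by (client, timestamp, HTTP status).

    Timestamps in the log have one-second resolution, so each count is
    effectively "requests per second" for that client and status code.
    """
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match and match.group("path") == STATUS_PATH:
            key = (match.group("client"), match.group("ts"),
                   int(match.group("status")))
            counts[key] += 1
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (hypothetical script name): python3 count_polls.py /var/log/httpd/ssl_access_log
    with open(sys.argv[1], errors="replace") as log:
        for (client, ts, status), n in count_status_polls(log).most_common(10):
            print(f"{n:6d}  {status}  {ts}  {client}")
```

Many requests within the same second, all answered with 200 while the client was started with "-w 204", would match the behaviour reported above; the script only measures the log and does not explain why 204 is never returned.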