From: Brian J. <mar...@gm...> - 2024-02-07 22:12:07
Hi,

I'm building a Confluent server for a development environment, and I can't reach hpc.lenovo.com to install the repo the traditional way (despite security telling me there were no blocks on hpc.lenovo.com that they were aware of). So I used the "local" repo method instead (I can download the package on my work laptop, just not at the data center), but I can't install packages due to the signature on the "lenovohpckey.pub" file:

    warning: Signature not supported. Hash Algorithm SHA1 not available

The Confluent host is running Rocky 9.3. Any suggestions?

Brian Joiner
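[Editor's note, not from the thread: on RHEL 9 derivatives such as Rocky 9.3, the system-wide crypto policy rejects SHA-1 signatures by default, which matches the warning above. A hedged sketch of one possible workaround, assuming the stock crypto-policies tooling; it weakens the system-wide policy, so it should be reverted immediately afterwards.]

```shell
# Assumption (not confirmed in the thread): the warning comes from the
# DEFAULT crypto policy on RHEL 9 / Rocky 9 disallowing SHA-1 signatures.
# Temporarily permit SHA-1, import the key, then restore the policy:
sudo update-crypto-policies --set DEFAULT:SHA1
sudo rpm --import lenovohpckey.pub
sudo update-crypto-policies --set DEFAULT
```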
From: David M. <dma...@ee...> - 2024-02-06 18:25:37
[Sorry for the delay.] "/etc/ssh/ssh_known_hosts" does not initially exist. Running that command creates it. Re-running "nodeapply -F dm-boot1" after that gives the same results. > On Jan 29, 2024, at 14:31, Jarrod Johnson <jjo...@le...> wrote: > > Oh, how does /etc/ssh/ssh_known_hosts look on the management node? > > Does osdeploy initialize -k > > Make it work? > > From: David Magda <dma...@ee...> > Sent: Monday, January 29, 2024 2:07 PM > To: xCAT Users Mailing list <xca...@li...> > Subject: Re: [xcat-user] [External] Ansible and Confluent > >> Yes, I was able to SSH in as root: >> >> """ >> # sudo -u confluent bash >> bash-4.2$ eval $(ssh-agent) >> Agent pid 216756 >> bash-4.2$ ssh-add /etc/confluent/ssh/automation >> Identity added: /etc/confluent/ssh/automation (Confluent Automation by mp01.example.com) >> bash-4.2$ ssh root@172.17.15.222 >> The authenticity of host '172.17.15.222 (172.17.15.222)' can't be established. >> ECDSA key fingerprint is SHA256:5Q4YF3R0Zd1uT9vtXvLdkI1BDn7gvqz9djMaeubezAU. >> ECDSA key fingerprint is MD5:c8:1d:85:bf:7f:51:29:95:48:79:94:6e:5a:99:45:83. >> Are you sure you want to continue connecting (yes/no)? yes >> Warning: Permanently added '172.17.15.222' (ECDSA) to the list of known hosts.
>> Welcome to Ubuntu 22.04.3 LTS (GNU/Linux 5.15.0-92-generic x86_64) >> […] >> root@dm-boot1:~# >> """ >> >> Trying to re-run 'nodeapply' didn't work (after the SSH host key is now known): >> >> """ >> # nodeapply -F dm-boot1 >> dm-boot1: >> dm-boot1: --------------------------------------------------------------------------- >> dm-boot1: Running python script 'syncfileclient' from https://[fe80::[EUI-64]%2]/confluent-public/os/ubuntu-22.04.3-x86_64-test1/scripts/ >> dm-boot1: Executing in /tmp/confluentscripts.ZSMiTTzcr >> dm-boot1: Traceback (most recent call last): >> dm-boot1: File "/usr/lib/python3.10/http/client.py", line 566, in _get_chunk_left >> dm-boot1: chunk_left = self._read_next_chunk_size() >> dm-boot1: File "/usr/lib/python3.10/http/client.py", line 533, in _read_next_chunk_size >> dm-boot1: return int(line, 16) >> dm-boot1: ValueError: invalid literal for int() with base 16: b'' >> dm-boot1: >> dm-boot1: During handling of the above exception, another exception occurred: >> dm-boot1: >> dm-boot1: Traceback (most recent call last): >> dm-boot1: File "/usr/lib/python3.10/http/client.py", line 583, in _read_chunked >> dm-boot1: chunk_left = self._get_chunk_left() >> dm-boot1: File "/usr/lib/python3.10/http/client.py", line 568, in _get_chunk_left >> dm-boot1: raise IncompleteRead(b'') >> dm-boot1: http.client.IncompleteRead: IncompleteRead(0 bytes read) >> dm-boot1: >> dm-boot1: During handling of the above exception, another exception occurred: >> dm-boot1: >> dm-boot1: Traceback (most recent call last): >> dm-boot1: File "/tmp/confluentscripts.ZSMiTTzcr/syncfileclient", line 286, in <module> >> dm-boot1: synchronize() >> dm-boot1: File "/tmp/confluentscripts.ZSMiTTzcr/syncfileclient", line 233, in synchronize >> dm-boot1: status, rsp = ac.grab_url_with_status('/confluent-api/self/remotesyncfiles') >> dm-boot1: File "/opt/confluent/bin/apiclient", line 405, in grab_url_with_status >> dm-boot1: return rsp.status, rsp.read() >> dm-boot1: File 
"/usr/lib/python3.10/http/client.py", line 460, in read >> dm-boot1: return self._read_chunked(amt) >> dm-boot1: File "/usr/lib/python3.10/http/client.py", line 598, in _read_chunked >> dm-boot1: raise IncompleteRead(b''.join(value)) >> dm-boot1: http.client.IncompleteRead: IncompleteRead(0 bytes read) >> dm-boot1: 'syncfileclient' exited with code 1 >> """ >> >> >> > On Jan 26, 2024, at 16:26, Jarrod Johnson <jjo...@le...> wrote: >> > >> > create the following as a python script: >> > import confluent.sshutil as ssh >> > print(ssh.get_passphrase()) >> > >> > >> > Then: >> > export PYTHONPATH=/opt/confluent/lib/python >> > python thatscript.py >> > >> > Then: >> > sudo -u confluent bash >> > eval $(ssh-agent) >> > ssh-add /etc/confluent/ssh/automation >> > >> > Then paste in the passphrase from above. >> > >> > Does that let confluent user ssh into the node? >> >> >> >> From: David Magda <dma...@ee...> >> >> Sent: Friday, January 26, 2024 4:22 PM >> >> To: xCAT Users Mailing list <xca...@li...> >> >> Subject: Re: [xcat-user] [External] Ansible and Confluent >> >> >> >> Yup: >> >> >> >> """ >> >> # sha1sum /var/lib/confluent/public/site/ssh/*pubkey /etc/confluent/ssh/automation.pub >> >> b88168467bf2920011f4a769d7cbd7aab0de0b35 /var/lib/confluent/public/site/ssh/mp01.example.com.automationpubkey >> >> 27574dd33ad3781bb588d7fcef2b8a6dd189d3cb /var/lib/confluent/public/site/ssh/mp01.example.com.rootpubkey >> >> b88168467bf2920011f4a769d7cbd7aab0de0b35 /etc/confluent/ssh/automation.pub >> >> “"” >> > […] >> >> >> _______________________________________________ >> xCAT-user mailing list >> xCA...@li... 
>> https://lists.sourceforge.net/lists/listinfo/xcat-user
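[Editor's note: the IncompleteRead traceback above is Python's http.client giving up when a chunked HTTP response is cut off before the terminating zero-length chunk arrives. A minimal, self-contained sketch that reproduces the same exception chain with a fake socket (this is a simulation, not Confluent's actual transport):]

```python
import http.client
import io

class FakeSocket:
    """Just enough of a socket for HTTPResponse: hands back canned bytes."""
    def __init__(self, data):
        self._data = data
    def makefile(self, *args, **kwargs):
        return io.BytesIO(self._data)

# A chunked response that ends after one chunk, with no terminating
# "0\r\n\r\n" -- i.e. the server closed the connection mid-stream.
raw = (b"HTTP/1.1 200 OK\r\n"
       b"Transfer-Encoding: chunked\r\n"
       b"\r\n"
       b"5\r\nhello\r\n")  # truncated: no final zero-length chunk

resp = http.client.HTTPResponse(FakeSocket(raw))
resp.begin()
try:
    resp.read()
except http.client.IncompleteRead as exc:
    # Same path as the trace above: int(b'', 16) fails inside
    # _read_next_chunk_size and surfaces as IncompleteRead.
    print(repr(exc.partial))  # prints b'hello'
```

In other words, the error points at the server side (the confluent HTTP endpoint) closing or truncating the /confluent-api/self/remotesyncfiles response, not at a client-side bug.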
From: Jarrod J. <jjo...@le...> - 2024-01-29 19:31:57
Oh, how does /etc/ssh/ssh_known_hosts look on the management node? Does osdeploy initialize -k Make it work? ________________________________ From: David Magda <dma...@ee...> Sent: Monday, January 29, 2024 2:07 PM To: xCAT Users Mailing list <xca...@li...> Subject: Re: [xcat-user] [External] Ansible and Confluent Yes, I was able to SSH in as root: """ # sudo -u confluent bash bash-4.2$ eval $(ssh-agent) Agent pid 216756 bash-4.2$ ssh-add /etc/confluent/ssh/automation Identity added: /etc/confluent/ssh/automation (Confluent Automation by mp01.example.com) bash-4.2$ ssh root@172.17.15.222 The authenticity of host '172.17.15.222 (172.17.15.222)' can't be established. ECDSA key fingerprint is SHA256:5Q4YF3R0Zd1uT9vtXvLdkI1BDn7gvqz9djMaeubezAU. ECDSA key fingerprint is MD5:c8:1d:85:bf:7f:51:29:95:48:79:94:6e:5a:99:45:83. Are you sure you want to continue connecting (yes/no)? yes Warning: Permanently added '172.17.15.222' (ECDSA) to the list of known hosts. Welcome to Ubuntu 22.04.3 LTS (GNU/Linux 5.15.0-92-generic x86_64) […] root@dm-boot1:~# """ Trying to re-run 'nodeapply' didn't work (after the SSH host key is now known): """ # nodeapply -F dm-boot1 dm-boot1: dm-boot1: --------------------------------------------------------------------------- dm-boot1: Running python script 'syncfileclient' from https://[fe80::[EUI-64]%2]/confluent-public/os/ubuntu-22.04.3-x86_64-test1/scripts/ dm-boot1: Executing in /tmp/confluentscripts.ZSMiTTzcr dm-boot1: Traceback (most recent call last): dm-boot1: File "/usr/lib/python3.10/http/client.py", line 566, in _get_chunk_left dm-boot1: chunk_left = self._read_next_chunk_size() dm-boot1: File "/usr/lib/python3.10/http/client.py", line 533, in _read_next_chunk_size dm-boot1: return int(line, 16) dm-boot1: ValueError: invalid literal for int() with base 16: b'' dm-boot1: dm-boot1: During handling of the above exception, another exception occurred: dm-boot1: dm-boot1: Traceback (most recent call last): dm-boot1: File 
"/usr/lib/python3.10/http/client.py", line 583, in _read_chunked dm-boot1: chunk_left = self._get_chunk_left() dm-boot1: File "/usr/lib/python3.10/http/client.py", line 568, in _get_chunk_left dm-boot1: raise IncompleteRead(b'') dm-boot1: http.client.IncompleteRead: IncompleteRead(0 bytes read) dm-boot1: dm-boot1: During handling of the above exception, another exception occurred: dm-boot1: dm-boot1: Traceback (most recent call last): dm-boot1: File "/tmp/confluentscripts.ZSMiTTzcr/syncfileclient", line 286, in <module> dm-boot1: synchronize() dm-boot1: File "/tmp/confluentscripts.ZSMiTTzcr/syncfileclient", line 233, in synchronize dm-boot1: status, rsp = ac.grab_url_with_status('/confluent-api/self/remotesyncfiles') dm-boot1: File "/opt/confluent/bin/apiclient", line 405, in grab_url_with_status dm-boot1: return rsp.status, rsp.read() dm-boot1: File "/usr/lib/python3.10/http/client.py", line 460, in read dm-boot1: return self._read_chunked(amt) dm-boot1: File "/usr/lib/python3.10/http/client.py", line 598, in _read_chunked dm-boot1: raise IncompleteRead(b''.join(value)) dm-boot1: http.client.IncompleteRead: IncompleteRead(0 bytes read) dm-boot1: 'syncfileclient' exited with code 1 """ > On Jan 26, 2024, at 16:26, Jarrod Johnson <jjo...@le...> wrote: > > create the following as a python script: > import confluent.sshutil as ssh > print(ssh.get_passphrase()) > > > Then: > export PYTHONPATH=/opt/confluent/lib/python > python thatscript.py > > Then: > sudo -u confluent bash > eval $(ssh-agent) > ssh-add /etc/confluent/ssh/automation > > Then paste in the passphrase from above. > > Does that let confluent user ssh into the node? 
>> >> From: David Magda <dma...@ee...> >> Sent: Friday, January 26, 2024 4:22 PM >> To: xCAT Users Mailing list <xca...@li...> >> Subject: Re: [xcat-user] [External] Ansible and Confluent >> >> Yup: >> >> """ >> # sha1sum /var/lib/confluent/public/site/ssh/*pubkey /etc/confluent/ssh/automation.pub >> b88168467bf2920011f4a769d7cbd7aab0de0b35 /var/lib/confluent/public/site/ssh/mp01.example.com.automationpubkey >> 27574dd33ad3781bb588d7fcef2b8a6dd189d3cb /var/lib/confluent/public/site/ssh/mp01.example.com.rootpubkey >> b88168467bf2920011f4a769d7cbd7aab0de0b35 /etc/confluent/ssh/automation.pub >> """ > […] _______________________________________________ xCAT-user mailing list xCA...@li... https://lists.sourceforge.net/lists/listinfo/xcat-user
From: David M. <dma...@ee...> - 2024-01-29 19:07:36
Yes, I was able to SSH in as root: """ # sudo -u confluent bash bash-4.2$ eval $(ssh-agent) Agent pid 216756 bash-4.2$ ssh-add /etc/confluent/ssh/automation Identity added: /etc/confluent/ssh/automation (Confluent Automation by mp01.example.com) bash-4.2$ ssh root@172.17.15.222 The authenticity of host '172.17.15.222 (172.17.15.222)' can't be established. ECDSA key fingerprint is SHA256:5Q4YF3R0Zd1uT9vtXvLdkI1BDn7gvqz9djMaeubezAU. ECDSA key fingerprint is MD5:c8:1d:85:bf:7f:51:29:95:48:79:94:6e:5a:99:45:83. Are you sure you want to continue connecting (yes/no)? yes Warning: Permanently added '172.17.15.222' (ECDSA) to the list of known hosts. Welcome to Ubuntu 22.04.3 LTS (GNU/Linux 5.15.0-92-generic x86_64) […] root@dm-boot1:~# """ Trying to re-run 'nodeapply' didn't work (after the SSH host key is now known): """ # nodeapply -F dm-boot1 dm-boot1: dm-boot1: --------------------------------------------------------------------------- dm-boot1: Running python script 'syncfileclient' from https://[fe80::[EUI-64]%2]/confluent-public/os/ubuntu-22.04.3-x86_64-test1/scripts/ dm-boot1: Executing in /tmp/confluentscripts.ZSMiTTzcr dm-boot1: Traceback (most recent call last): dm-boot1: File "/usr/lib/python3.10/http/client.py", line 566, in _get_chunk_left dm-boot1: chunk_left = self._read_next_chunk_size() dm-boot1: File "/usr/lib/python3.10/http/client.py", line 533, in _read_next_chunk_size dm-boot1: return int(line, 16) dm-boot1: ValueError: invalid literal for int() with base 16: b'' dm-boot1: dm-boot1: During handling of the above exception, another exception occurred: dm-boot1: dm-boot1: Traceback (most recent call last): dm-boot1: File "/usr/lib/python3.10/http/client.py", line 583, in _read_chunked dm-boot1: chunk_left = self._get_chunk_left() dm-boot1: File "/usr/lib/python3.10/http/client.py", line 568, in _get_chunk_left dm-boot1: raise IncompleteRead(b'') dm-boot1: http.client.IncompleteRead: IncompleteRead(0 bytes read) dm-boot1: dm-boot1: During handling of 
the above exception, another exception occurred: dm-boot1: dm-boot1: Traceback (most recent call last): dm-boot1: File "/tmp/confluentscripts.ZSMiTTzcr/syncfileclient", line 286, in <module> dm-boot1: synchronize() dm-boot1: File "/tmp/confluentscripts.ZSMiTTzcr/syncfileclient", line 233, in synchronize dm-boot1: status, rsp = ac.grab_url_with_status('/confluent-api/self/remotesyncfiles') dm-boot1: File "/opt/confluent/bin/apiclient", line 405, in grab_url_with_status dm-boot1: return rsp.status, rsp.read() dm-boot1: File "/usr/lib/python3.10/http/client.py", line 460, in read dm-boot1: return self._read_chunked(amt) dm-boot1: File "/usr/lib/python3.10/http/client.py", line 598, in _read_chunked dm-boot1: raise IncompleteRead(b''.join(value)) dm-boot1: http.client.IncompleteRead: IncompleteRead(0 bytes read) dm-boot1: 'syncfileclient' exited with code 1 """ > On Jan 26, 2024, at 16:26, Jarrod Johnson <jjo...@le...> wrote: > > create the following as a python script: > import confluent.sshutil as ssh > print(ssh.get_passphrase()) > > > Then: > export PYTHONPATH=/opt/confluent/lib/python > python thatscript.py > > Then: > sudo -u confluent bash > eval $(ssh-agent) > ssh-add /etc/confluent/ssh/automation > > Then paste in the passphrase from above. > > Does that let confluent user ssh into the node? >> >> From: David Magda <dma...@ee...> >> Sent: Friday, January 26, 2024 4:22 PM >> To: xCAT Users Mailing list <xca...@li...> >> Subject: Re: [xcat-user] [External] Ansible and Confluent >> >> Yup: >> >> """ >> # sha1sum /var/lib/confluent/public/site/ssh/*pubkey /etc/confluent/ssh/automation.pub >> b88168467bf2920011f4a769d7cbd7aab0de0b35 /var/lib/confluent/public/site/ssh/mp01.example.com.automationpubkey >> 27574dd33ad3781bb588d7fcef2b8a6dd189d3cb /var/lib/confluent/public/site/ssh/mp01.example.com.rootpubkey >> b88168467bf2920011f4a769d7cbd7aab0de0b35 /etc/confluent/ssh/automation.pub >> """ > […]
From: Jarrod J. <jjo...@le...> - 2024-01-26 21:26:56
create the following as a python script: import confluent.sshutil as ssh print(ssh.get_passphrase()) Then: export PYTHONPATH=/opt/confluent/lib/python python thatscript.py Then: sudo -u confluent bash eval $(ssh-agent) ssh-add /etc/confluent/ssh/automation Then paste in the passphrase from above. Does that let confluent user ssh into the node? ________________________________ From: David Magda <dma...@ee...> Sent: Friday, January 26, 2024 4:22 PM To: xCAT Users Mailing list <xca...@li...> Subject: Re: [xcat-user] [External] Ansible and Confluent Yup: """ # sha1sum /var/lib/confluent/public/site/ssh/*pubkey /etc/confluent/ssh/automation.pub b88168467bf2920011f4a769d7cbd7aab0de0b35 /var/lib/confluent/public/site/ssh/mp01.example.com.automationpubkey 27574dd33ad3781bb588d7fcef2b8a6dd189d3cb /var/lib/confluent/public/site/ssh/mp01.example.com.rootpubkey b88168467bf2920011f4a769d7cbd7aab0de0b35 /etc/confluent/ssh/automation.pub """ > On Jan 26, 2024, at 15:59, Jarrod Johnson <jjo...@le...> wrote: > >> Hmm, how odd.... >> >> # cat /var/lib/confluent/public/site/ssh/mp01.example.com.automationpubkey /etc/confluent/ssh/automation.pub >> >> Do they match? 
>> >> From: David Magda <dma...@ee...> >> Sent: Friday, January 26, 2024 3:50 PM >> To: xCAT Users Mailing list <xca...@li...> >> Subject: Re: [xcat-user] [External] Ansible and Confluent >> >> """ >> # ls -lrth /var/lib/confluent/public/site/ssh/*pubkey >> -rw-r--r-- 1 confluent root 410 Oct 31 12:05 /var/lib/confluent/public/site/ssh/mp01.example.com.rootpubkey >> -rw-r--r-- 1 confluent root 129 Oct 31 12:05 /var/lib/confluent/public/site/ssh/mp01.example.com.automationpubkey >> >> # ssh dm-boot1 'hostname -f; uptime' >> dm-boot1 >> 15:47:19 up 21 min, 0 users, load average: 0.00, 0.00, 0.00 >> """ >> >> >> > On Jan 26, 2024, at 15:43, Jarrod Johnson <jjo...@le...> wrote: >> > >> > # ls /var/lib/confluent/public/site/ssh/*pubkey >> > >> > >> > From: David Magda <dma...@ee...> >> > Sent: Friday, January 26, 2024 3:40 PM >> > To: xCAT Users Mailing list <xca...@li...> >> > Subject: Re: [xcat-user] [External] Ansible and Confluent >> > >> > There’s no “syncfiles” in the default Ubuntu profile, nor anything in the web docs on its format, but I found a template in "/opt/confluent/lib/osdeploy/el9/profiles/default/syncfiles”. 
>> > >> > Created a file with the line: >> > >> > /etc/hosts -> /etc/hosts_test >> > >> > With the results: >> > >> > """ >> > # nodeapply -F dm-boot1 >> > dm-boot1: >> > dm-boot1: --------------------------------------------------------------------------- >> > dm-boot1: Running python script 'syncfileclient' from https://apc01.safelinks.protection.outlook.com/?url=https%3A%2F%2F%5Bfe80%3A%3A749f%3A43ff%3Afe72%3A55e4%5D%2Fconfluent-public%2Fos%2Fubuntu-22.04.3-x86_64-test1%2Fscripts%2F&data=05%7C02%7Cjjohnson2%40lenovo.com%7C70eef24bd08f435e3e0508dc1eb518a7%7C5c7d0b28bdf8410caa934df372b16203%7C0%7C0%7C638419010399866636%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C0%7C%7C%7C&sdata=03TwWoks99f27Kd68e9O5QNzwHezT9%2Bf6LdO2GL4lDg%3D&reserved=0<https://[fe80::749f:43ff:fe72:55e4]/confluent-public/os/ubuntu-22.04.3-x86_64-test1/scripts/> >> > dm-boot1: Executing in /tmp/confluentscripts.HUGo3sMtt >> > dm-boot1: Traceback (most recent call last): >> > dm-boot1: File "/tmp/confluentscripts.HUGo3sMtt/syncfileclient", line 286, in <module> >> > dm-boot1: synchronize() >> > dm-boot1: File "/tmp/confluentscripts.HUGo3sMtt/syncfileclient", line 233, in synchronize >> > dm-boot1: status, rsp = ac.grab_url_with_status('/confluent-api/self/remotesyncfiles') >> > dm-boot1: File "/opt/confluent/bin/apiclient", line 413, in grab_url_with_status >> > dm-boot1: raise Exception(rsp.read()) >> > dm-boot1: Exception: b"500 - Command '['rsync', '-rvLD', '/tmp/tmpSUbmoD.synctodm-boot1/', 'root@[172.17.15.222]:/']' returned non-zero exit status 255" >> > dm-boot1: 'syncfileclient' exited with code 1 >> > """ >> > >> > In "/var/log/confluent/stderr” we have: >> > >> > """ >> > Jan 26 15:28:53 File "/usr/lib64/python2.7/traceback.py", line 13, in _print >> > file.write(str+terminator): Traceback (most recent call last): >> > Jan 26 15:28:53 File "/usr/lib64/python2.7/traceback.py", line 13, in _print >> > file.write(str+terminator): File 
"/usr/lib/python2.7/site-packages/eventlet/hubs/poll.py", line 111, in wait >> > Jan 26 15:28:53 File "/usr/lib64/python2.7/traceback.py", line 13, in _print >> > file.write(str+terminator): listener.cb(fileno) >> > Jan 26 15:28:53 File "/usr/lib64/python2.7/traceback.py", line 13, in _print >> > file.write(str+terminator): File "/usr/lib/python2.7/site-packages/eventlet/green/select.py", line 53, in on_read >> > Jan 26 15:28:53 File "/usr/lib64/python2.7/traceback.py", line 13, in _print >> > file.write(str+terminator): current.switch(([original], [], [])) >> > Jan 26 15:28:53 File "/usr/lib64/python2.7/traceback.py", line 13, in _print >> > file.write(str+terminator): File "/usr/lib/python2.7/site-packages/eventlet/greenthread.py", line 221, in main >> > Jan 26 15:28:53 File "/usr/lib64/python2.7/traceback.py", line 13, in _print >> > file.write(str+terminator): result = function(*args, **kwargs) >> > Jan 26 15:28:53 File "/usr/lib64/python2.7/traceback.py", line 13, in _print >> > file.write(str+terminator): File "/opt/confluent/lib/python/confluent/syncfiles.py", line 197, in sync_list_to_node >> > Jan 26 15:28:53 File "/usr/lib64/python2.7/traceback.py", line 13, in _print >> > file.write(str+terminator): ['rsync', '-rvLD', targdir + '/', 'root@[{}]:/'.format(targip)])[0] >> > Jan 26 15:28:53 File "/usr/lib64/python2.7/traceback.py", line 13, in _print >> > file.write(str+terminator): File "/opt/confluent/lib/python/confluent/util.py", line 45, in run >> > Jan 26 15:28:53 File "/usr/lib64/python2.7/traceback.py", line 13, in _print >> > file.write(str+terminator): raise subprocess.CalledProcessError(retcode, process.args, output=stdout) >> > Jan 26 15:28:53 File "/usr/lib64/python2.7/traceback.py", line 13, in _print >> > file.write(str+terminator): CalledProcessError: Command '['rsync', '-rvLD', '/tmp/tmpVbi9YY.synctodm-boot1/', 'root@[172.17.15.222]:/']' returned non-zero exit status 255 >> > Jan 26 15:28:53 File 
"/usr/lib/python2.7/site-packages/eventlet/hubs/hub.py", line 317, in squelch_exception >> > sys.stderr.write("Removing descriptor: %r\n" % (fileno,)): Removing descriptor: 65 >> > """ >> > >> > And in “trace” we have: >> > >> > """ >> > Jan 26 15:28:53 Traceback (most recent call last): >> > File "/opt/confluent/lib/python/confluent/httpapi.py", line 612, in resourcehandler >> > for rsp in resourcehandler_backend(env, start_response): >> > File "/opt/confluent/lib/python/confluent/httpapi.py", line 636, in resourcehandler_backend >> > for res in selfservice.handle_request(env, start_response): >> > File "/opt/confluent/lib/python/confluent/selfservice.py", line 502, in handle_request >> > status, output = syncfiles.get_syncresult(nodename) >> > File "/opt/confluent/lib/python/confluent/syncfiles.py", line 321, in get_syncresult >> > result = syncrunners[nodename].wait() >> > File "/usr/lib/python2.7/site-packages/eventlet/greenthread.py", line 181, in wait >> > return self._exit_event.wait() >> > File "/usr/lib/python2.7/site-packages/eventlet/event.py", line 132, in wait >> > current.throw(*self._exc) >> > File "/usr/lib/python2.7/site-packages/eventlet/greenthread.py", line 221, in main >> > result = function(*args, **kwargs) >> > File "/opt/confluent/lib/python/confluent/syncfiles.py", line 197, in sync_list_to_node >> > ['rsync', '-rvLD', targdir + '/', 'root@[{}]:/'.format(targip)])[0] >> > File "/opt/confluent/lib/python/confluent/util.py", line 45, in run >> > raise subprocess.CalledProcessError(retcode, process.args, output=stdout) >> > CalledProcessError: Command '['rsync', '-rvLD', '/tmp/tmpVbi9YY.synctodm-boot1/', 'root@[172.17.15.222]:/']' returned non-zero exit status 255 >> > >> > """ >> > >> > > On Jan 26, 2024, at 15:01, Jarrod Johnson <jjo...@le...> wrote: >> > > >> > > Ok, another track (trying to compensate for not being able to use selfcheck). 
>> > > >> > > Can you try sticking some file in the profile's syncfiles, then do: >> > > nodeapply -F <node> >> > > >> > > And see if any errors happen, either in output or in the /var/log/confluet area. >> > > >> > >> From: David Magda <dma...@ee...> >> > >> Sent: Friday, January 26, 2024 2:01 PM >> > >> To: xCAT Users Mailing list <xca...@li...> >> > >> Subject: Re: [xcat-user] [External] Ansible and Confluent >> > >> >> > >> We have Confluent installed on a RH/CentOS 7 system that originally had/has xCat installed for deployment of our Lenovo hardware/HPC solution. I just installed it there as it was/is our 'install server'. (We don't want to touch it too much, as it was a previous team of folks that set things up, and there's been a lot of team churn.) >> > >> >> > >> I've attached the "hangtraces" to this message; hopefully the mailing list software will pass it along. I noticed “ipmi” in some of the paths, and for the record this is a VM running under Proxmox, and does not have any LOM configured: >> > >> >> > >> """ >> > >> # nodeattrib dm-boot1 >> > >> dm-boot1: crypted.selfapikey: ******** >> > >> dm-boot1: deployment.apiarmed: >> > >> dm-boot1: deployment.pendingprofile: ubuntu-22.04.3-x86_64-test1 >> > >> dm-boot1: deployment.profile: >> > >> dm-boot1: deployment.sealedapikey: >> > >> dm-boot1: deployment.stagedprofile: >> > >> dm-boot1: deployment.state: >> > >> dm-boot1: deployment.state_detail: >> > >> dm-boot1: deployment.useinsecureprotocols: always >> > >> dm-boot1: dns.servers: 172.17.15.252,172.17.15.247,172.17.15.254 >> > >> dm-boot1: groups: everything >> > >> dm-boot1: net.hwaddr: 4e:78:df:d3:8d:59 >> > >> dm-boot1: net.ipv4_address: 172.17.15.222/21 >> > >> dm-boot1: net.ipv4_gateway: 172.17.8.254 >> > >> """ >> > >> >> > >> Running an strace(1) on the 'apiclient' that runs as part of the "post.sh" process, we have a continuous poll/read/write stream: >> > >> >> > >> """ >> > >> […] >> > >> write(3, 
"\27\3\3\0\371Sm2\233\337\222n\221\377vZs\21\22S\10\351\232\321I7Y$R\370]\312"..., 254) = 254 >> > >> read(3, 0x560b6949e8f3, 5) = -1 EAGAIN (Resource temporarily unavailable) >> > >> poll([{fd=3, events=POLLIN}], 1, 15000) = 1 ([{fd=3, revents=POLLIN}]) >> > >> read(3, "\27\3\3\0\226", 5) = 5 >> > >> read(3, "\0055\271\274&\2464\237\242h\341\30\231\274\327g\224\344g\306\313\206\326\355x\307\341\331C\366H\331"..., 150) = 150 >> > >> poll([{fd=3, events=POLLOUT}], 1, 15000) = 1 ([{fd=3, revents=POLLOUT}]) >> > >> write(3, "\27\3\3\0\371Sm2\233\337\222n\222\334e\336f\353u\343p\22\215:\264e\30a\3172\245\361"..., 254) = 254 >> > >> read(3, 0x560b6949e8f3, 5) = -1 EAGAIN (Resource temporarily unavailable) >> > >> poll([{fd=3, events=POLLIN}], 1, 15000) = 1 ([{fd=3, revents=POLLIN}]) >> > >> read(3, "\27\3\3\0\226", 5) = 5 >> > >> read(3, "\0055\271\274&\2464\240\326\202\347(\213\311\260|\333\230\372A\235\341\273U\201\223\2209ah\325J"..., 150) = 150 >> > >> poll([{fd=3, events=POLLOUT}], 1, 15000) = 1 ([{fd=3, revents=POLLOUT}]) >> > >> write(3, "\27\3\3\0\371Sm2\233\337\222n\223\240\341<\3602\323\177Y\311\317/\371\336P/s\301t8"..., 254) = 254 >> > >> read(3, 0x560b6949e8f3, 5) = -1 EAGAIN (Resource temporarily unavailable) >> > >> poll([{fd=3, events=POLLIN}], 1, 15000^Cstrace: Process 27477 detached >> > >> <detached ...> >> > >> """ >> > >> >> > >> Per lsof(1), FD 3 is: >> > >> >> > >> """ >> > >> python3 27477 root 3u IPv6 158157 0t0 TCP [fe80::[EUI-64_client]]:44800->[fe80::[EUI-64_server]]:https (ESTABLISHED) >> > >> """ >> > >> >> > >> >> > >> >> > >> On Thu, January 25, 2024 16:34, Jarrod Johnson wrote: >> > >> > What is the OS of the deployment server? >> > >> > >> > >> > kill -USR1 $(cat /var/run/confluent/pid) >> > >> > >> > >> > This should produce a /var/log/confluennt/hangtraces >> > >> > >> > >> > Would be interesting to see if there's ansible related stacks in >> > >> > hangtraces that seem stuck... 
>> > >> > >> > >> > >> > >> > ________________________________ >> > >> > From: David Magda <dma...@ee...> >> > >> > Sent: Thursday, January 25, 2024 4:25 PM >> > >> > To: xCAT Users Mailing list <xca...@li...> >> > >> > Subject: Re: [xcat-user] [External] Ansible and Confluent >> > >> > >> > >> > First suggested command: >> > >> > >> > >> > """ >> > >> > # confluent_selfcheck >> > >> > OS Deployment: Initialized >> > >> > Confluent UUID: Consistent >> > >> > Web Server: Running >> > >> > Web Certificate: Traceback (most recent call last): >> > >> > File "/opt/confluent/bin/confluent_selfcheck", line 178, in <module> >> > >> > cert = certificates_missing_ips(conn) >> > >> > File "/opt/confluent/bin/confluent_selfcheck", line 57, in >> > >> > certificates_missing_ips >> > >> > ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT) >> > >> > AttributeError: 'module' object has no attribute 'PROTOCOL_TLS_CLIENT' >> > >> > """ >> > >> > >> > >> > On the being-installed system, ignoring the typical Linux stuff, the >> > >> > output of 'ps -elfH' has: >> > >> > >> > >> > """ >> > >> > >> > >> > 4 S root 1247 1 0 80 0 - 7499 do_pol 17:53 ? >> > >> > 00:00:00 /usr/bin/python3 /usr/bin/networkd-dispatcher >> > >> > --run-startup-triggers >> > >> > 4 S root 1248 1 0 80 0 - 58623 do_pol 17:53 ? >> > >> > 00:00:00 /usr/libexec/polkitd --no-debug >> > >> > 4 S syslog 1250 1 0 80 0 - 55600 do_sel 17:53 ? >> > >> > 00:00:00 /usr/sbin/rsyslogd -n -iNONE >> > >> > 4 S root 1252 1 0 80 0 - 385081 futex_ 17:53 ? >> > >> > 00:00:03 /usr/lib/snapd/snapd >> > >> > 4 S root 1253 1 0 80 0 - 3831 ep_pol 17:53 ? >> > >> > 00:00:00 /lib/systemd/systemd-logind >> > >> > 4 S root 1255 1 0 80 0 - 98198 do_pol 17:53 ? >> > >> > 00:00:02 /usr/libexec/udisks2/udisksd >> > >> > 4 S root 1283 1 0 80 0 - 26778 do_pol 17:53 ? >> > >> > 00:00:00 /usr/bin/python3 >> > >> > /usr/share/unattended-upgrades/unattended-upgrade-shutdown >> > >> > --wait-for-signal >> > >> > 4 S root 1291 1 0 80 0 - 61055 do_pol 17:53 ? 
>> > >> > 00:00:00 /usr/sbin/ModemManager >> > >> > 4 S root 2042 1 0 80 0 - 722 do_wai 17:53 ? >> > >> > 00:00:00 /bin/sh /snap/subiquity/5004/usr/bin/subiquity-server >> > >> > 4 S root 2086 2042 0 80 0 - 149574 ep_pol 17:53 ? >> > >> > 00:00:07 /snap/subiquity/5004/usr/bin/python3.10 -m >> > >> > subiquity.cmd.server >> > >> > 4 S root 27499 2086 0 80 0 - 722 do_wai 18:09 ? >> > >> > 00:00:00 sh -c /custom-installation/post.sh >> > >> > 4 S root 27501 27499 0 80 0 - 1150 do_wai 18:09 ? >> > >> > 00:00:00 /bin/bash /custom-installation/post.sh >> > >> > 4 S root 27588 27501 4 80 0 - 7403 do_pol 18:09 ? >> > >> > 00:03:16 /usr/bin/python3 /opt/confluent/bin/apiclient >> > >> > /confluent-api/self/remoteconfig/status -w 204 >> > >> > 4 S root 2049 1 0 80 0 - 24167 ep_pol 17:53 tty1 >> > >> > 00:00:05 /snap/subiquity/5004/usr/bin/python3.10 >> > >> > /snap/subiquity/5004/usr/bin/subiquity >> > >> > 4 S root 2137 1 0 80 0 - 3855 do_pol 17:53 ? >> > >> > 00:00:00 sshd: /usr/sbin/sshd -D [listener] 0 of 10-100 startups >> > >> > 4 S root 37842 2137 0 80 0 - 4310 - 19:15 ? >> > >> > 00:00:00 sshd: root@pts/0 >> > >> > 4 S root 37952 37842 0 80 0 - 1543 do_wai 19:15 ? >> > >> > 00:00:00 -bash >> > >> > 4 R root 38032 37952 0 80 0 - 1911 - 19:16 ? >> > >> > 00:00:00 ps -elfH >> > >> > 4 S root 2206 1 0 80 0 - 3266 ep_pol 17:53 ? >> > >> > 00:00:00 /lib/netplan/netplan-dbus >> > >> > 4 S root 2570 1 0 80 0 - 73244 do_pol 17:53 ? >> > >> > 00:00:00 /usr/libexec/packagekitd >> > >> > 4 S root 37848 1 1 80 0 - 4301 ep_pol 19:15 ? >> > >> > 00:00:00 /lib/systemd/systemd --user >> > >> > 5 S root 37850 37848 0 80 0 - 26271 do_sig 19:15 ? >> > >> > 00:00:00 (sd-pam) >> > >> > """ >> > >> > >> > >> > While 'ps axf' produces (trimmed): >> > >> > >> > >> > """ >> > >> > 2042 ? Ss 0:00 /bin/sh >> > >> > /snap/subiquity/5004/usr/bin/subiquity-server >> > >> > 2086 ? Sl 0:07 \_ /snap/subiquity/5004/usr/bin/python3.10 -m >> > >> > subiquity.cmd.server >> > >> > 27499 ? 
S 0:00 \_ sh -c /custom-installation/post.sh >> > >> > 27501 ? S 0:00 \_ /bin/bash >> > >> > /custom-installation/post.sh >> > >> > 27588 ? S 3:21 \_ /usr/bin/python3 >> > >> > /opt/confluent/bin/apiclient /confluent-api/self/remoteconfig/status -w >> > >> > 204 >> > >> > 2049 tty1 Ss+ 0:05 /snap/subiquity/5004/usr/bin/python3.10 >> > >> > /snap/subiquity/5004/usr/bin/subiquity >> > >> > """ >> > >> > >> > >> > Doing a "kill -9 27588" (on apiclient) causes the installation to >> > >> > 'finish'. After the reboot, and after "firstboot.sh" does its thing, we >> > >> > have the following from 'ps axf': >> > >> > >> > >> > """ >> > >> > 1372 ? Ss 0:00 /usr/bin/python3 /usr/bin/cloud-init modules >> > >> > --mode=final >> > >> > 1376 ? S 0:00 \_ /bin/sh -c tee -a >> > >> > /var/log/cloud-init-output.log >> > >> > 1377 ? S 0:00 | \_ tee -a /var/log/cloud-init-output.log >> > >> > 1378 ? S 0:00 \_ /bin/sh >> > >> > /var/lib/cloud/instance/scripts/runcmd >> > >> > 1379 ? S 0:00 \_ /bin/bash /etc/confluent/firstboot.sh >> > >> > 1429 ? S 0:01 \_ /usr/bin/python3 >> > >> > /opt/confluent/bin/apiclient /confluent-api/self/remoteconfig/status -w >> > >> > 204 >> > >> > """ >> > >> > >> > >> > This causes the "/var/log/httpd/ssl_access_log" to start filling up. A >> > >> > subsequent reboot, where "firstboot.sh" is not run, has the system >> > >> > coming up without "apiclient" running, and so there's no longer 'spam' in >> > >> > "ssl_access_log". >> > >> > >> > >> > Running "apiclient" manually from the CLI with the exact same options causes a >> > >> > bunch of entries in "ssl_access_log": >> > >> > >> > >> > """ >> > >> > fe80::[EUI-64] - - [25/Jan/2024:14:52:15 -0500] "GET >> > >> > /confluent-api/self/remoteconfig/status HTTP/1.1" 200 - >> > >> > """ >> > >> > >> > >> > At the same time as the above is being generated, there is nothing in >> > >> > "/var/log/confluent/trace" or "stderr".
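The "-w 204" argument suggests apiclient long-polls the status URL until the server answers HTTP 204 ("configuration complete"); while the server-side work is still pending it keeps answering 200, which would explain both the apparent hang and the steady stream of GETs in "ssl_access_log", and why "kill -9" lets the install 'finish' (it simply abandons the wait). A minimal sketch of that wait-for-status pattern (the "poll_until" helper is hypothetical, not Confluent's actual code):

```python
import time

def poll_until(fetch_status, want=204, interval=0.0, max_polls=None):
    """Repeatedly call fetch_status() until it returns `want`.

    Returns the number of polls made. With no cap, a server that never
    answers `want` keeps this loop (and the access log) busy forever.
    """
    polls = 0
    while max_polls is None or polls < max_polls:
        polls += 1
        if fetch_status() == want:
            return polls
        time.sleep(interval)
    raise TimeoutError("gave up after %d polls" % polls)

# Simulated server: answers 200 twice (work pending), then 204 (done).
responses = iter([200, 200, 204])
print(poll_until(lambda: next(responses)))  # 3
```

Without a poll cap or timeout, a status that never reaches 204 keeps such a loop running indefinitely, matching the "ssl_access_log" flood described above.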
>> > >> > >> > >> > >> > >> > On Thu, January 25, 2024 07:52, Jarrod Johnson wrote: >> > >> >> Anything in /var/log/confluent/stderr or /var/log/confluent/trace? Also >> > >> >> would be tempted to see if 'confluent_selfcheck' has any suggestions. >> > >> >> You >> > >> >> can also ssh into the node during that phase to confirm what it is doing >> > >> >> while it is seemingly hung, e.g. looking at ps axf >> > >> >> ________________________________ >> > >> >> From: David Magda <dma...@ee...> >> > >> >> Sent: Wednesday, January 24, 2024 9:37 PM >> > >> >> To: xCA...@li... <xCA...@li...> >> > >> >> Subject: [External] [xcat-user] Ansible and Confluent >> > >> >> >> > >> >> Hello, >> > >> >> >> > >> >> I'm trying to get Ansible working with Confluent 3.8.0. (Using an older >> > >> >> version due to legacy OS reasons.) >> > >> >> >> > >> >> In /var/lib/confluent/public/os/ I created a new profile called >> > >> >> ubuntu-22.04.3-x86_64-test1/, and this seems to work just fine: I took >> > >> >> the >> > >> >> provided "autoinstall/user-data" file, added some partition stanzas, >> > >> >> some >> > >> >> packages, etc. 
>> > >> >> >> > >> >> Once I sorted out a 'basic' automated Ubuntu install, I tried creating a >> > >> >> "ansible/post.d/01-packages.yaml" file within the profile directory >> > >> >> with >> > >> >> the following contents: >> > >> >> >> > >> >> """ >> > >> >> - name: install chrony >> > >> >> apt: >> > >> >> pkg: >> > >> >> - chrony >> > >> >> """ >> > >> >> >> > >> >> The Ubuntu (subiquity) installer seems to 'hang' at: >> > >> >> >> > >> >> """ >> > >> >> start: subiquity/Late/run/command_1: /custom-installation/post.sh >> > >> >> """ >> > >> >> >> > >> >> which probably corresponds to this part of the "user-data" file: >> > >> >> >> > >> >> """ >> > >> >> late-commands: >> > >> >> - chroot /target apt-get -y -q purge snapd modemmanager >> > >> >> - /custom-installation/post.sh >> > >> >> """ >> > >> >> >> > >> >> When the 'hang' occurs, the following starts filling up the >> > >> >> "/var/log/httpd/ssl_access_log" file of the Confluent/xcat server: >> > >> >> >> > >> >> """ >> > >> >> fe80::[EUI-64] - - [24/Jan/2024:11:15:08 -0500] "GET >> > >> >> /confluent-api/self/remoteconfig/status HTTP/1.1" 200 - >> > >> >> fe80::[EUI-64] - - [24/Jan/2024:11:15:08 -0500] "GET >> > >> >> /confluent-api/self/remoteconfig/status HTTP/1.1" 200 - >> > >> >> fe80::[EUI-64] - - [24/Jan/2024:11:15:08 -0500] "GET >> > >> >> /confluent-api/self/remoteconfig/status HTTP/1.1" 200 - >> > >> >> fe80::[EUI-64] - - [24/Jan/2024:11:15:08 -0500] "GET >> > >> >> /confluent-api/self/remoteconfig/status HTTP/1.1" 200 - >> > >> >> fe80::[EUI-64] - - [24/Jan/2024:11:15:08 -0500] "GET >> > >> >> /confluent-api/self/remoteconfig/status HTTP/1.1" 200 - >> > >> >> fe80::[EUI-64] - - [24/Jan/2024:11:15:08 -0500] "GET >> > >> >> /confluent-api/self/remoteconfig/status HTTP/1.1" 200 - >> > >> >> """ >> > >> >> >> > >> >> When I force a restart of the system/VM, it can boot off the disk, and >> > >> >> goes through the regular start-up process, including a bunch of >> > >> >> cloud-init >> > >> >> 
stuff. Though after it runs "/etc/confluent/firstboot.sh", the >> > >> >> "ssl_access_log" file once again starts filling with the >> > >> >> "remoteconfig/status" stuff per above. >> > >> >> >> > >> >> Renaming "ansible/" to "ansible_off/" seems to make the problem go away. >> > >> >> Similar behaviour with Ubuntu 20.04. >> > >> >> >> > >> >> I'm wondering what's going on with the 'hang' when "post.sh" is executed, >> > >> >> and >> > >> >> the flooding after "firstboot.sh". >> > >> >> >> > >> >> Regards, >> > >> >> David _______________________________________________ xCAT-user mailing list xCA...@li... https://lists.sourceforge.net/lists/listinfo/xcat-user |
From: David M. <dma...@ee...> - 2024-01-26 21:22:27
|
Yup: """ # sha1sum /var/lib/confluent/public/site/ssh/*pubkey /etc/confluent/ssh/automation.pub b88168467bf2920011f4a769d7cbd7aab0de0b35 /var/lib/confluent/public/site/ssh/mp01.example.com.automationpubkey 27574dd33ad3781bb588d7fcef2b8a6dd189d3cb /var/lib/confluent/public/site/ssh/mp01.example.com.rootpubkey b88168467bf2920011f4a769d7cbd7aab0de0b35 /etc/confluent/ssh/automation.pub """ > On Jan 26, 2024, at 15:59, Jarrod Johnson <jjo...@le...> wrote: > >> Hmm, how odd.... >> >> # cat /var/lib/confluent/public/site/ssh/mp01.example.com.automationpubkey /etc/confluent/ssh/automation.pub >> >> Do they match? >> >> From: David Magda <dma...@ee...> >> Sent: Friday, January 26, 2024 3:50 PM >> To: xCAT Users Mailing list <xca...@li...> >> Subject: Re: [xcat-user] [External] Ansible and Confluent >> >> """ >> # ls -lrth /var/lib/confluent/public/site/ssh/*pubkey >> -rw-r--r-- 1 confluent root 410 Oct 31 12:05 /var/lib/confluent/public/site/ssh/mp01.example.com.rootpubkey >> -rw-r--r-- 1 confluent root 129 Oct 31 12:05 /var/lib/confluent/public/site/ssh/mp01.example.com.automationpubkey >> >> # ssh dm-boot1 'hostname -f; uptime' >> dm-boot1 >> 15:47:19 up 21 min, 0 users, load average: 0.00, 0.00, 0.00 >> """ >> >> >> > On Jan 26, 2024, at 15:43, Jarrod Johnson <jjo...@le...> wrote: >> > >> > # ls /var/lib/confluent/public/site/ssh/*pubkey >> > >> > >> > From: David Magda <dma...@ee...> >> > Sent: Friday, January 26, 2024 3:40 PM >> > To: xCAT Users Mailing list <xca...@li...> >> > Subject: Re: [xcat-user] [External] Ansible and Confluent >> > >> > There’s no “syncfiles” in the default Ubuntu profile, nor anything in the web docs on its format, but I found a template in "/opt/confluent/lib/osdeploy/el9/profiles/default/syncfiles”. 
>> > >> > Created a file with the line: >> > >> > /etc/hosts -> /etc/hosts_test >> > >> > With the results: >> > >> > """ >> > # nodeapply -F dm-boot1 >> > dm-boot1: >> > dm-boot1: --------------------------------------------------------------------------- >> > dm-boot1: Running python script 'syncfileclient' from https://[fe80::749f:43ff:fe72:55e4]/confluent-public/os/ubuntu-22.04.3-x86_64-test1/scripts/ >> > dm-boot1: Executing in /tmp/confluentscripts.HUGo3sMtt >> > dm-boot1: Traceback (most recent call last): >> > dm-boot1: File "/tmp/confluentscripts.HUGo3sMtt/syncfileclient", line 286, in <module> >> > dm-boot1: synchronize() >> > dm-boot1: File "/tmp/confluentscripts.HUGo3sMtt/syncfileclient", line 233, in synchronize >> > dm-boot1: status, rsp = ac.grab_url_with_status('/confluent-api/self/remotesyncfiles') >> > dm-boot1: File "/opt/confluent/bin/apiclient", line 413, in grab_url_with_status >> > dm-boot1: raise Exception(rsp.read()) >> > dm-boot1: Exception: b"500 - Command '['rsync', '-rvLD', '/tmp/tmpSUbmoD.synctodm-boot1/', 'root@[172.17.15.222]:/']' returned non-zero exit status 255" >> > dm-boot1: 'syncfileclient' exited with code 1 >> > """ >> > >> > In "/var/log/confluent/stderr" we have: >> > >> > """ >> > Jan 26 15:28:53 File "/usr/lib64/python2.7/traceback.py", line 13, in _print >> > file.write(str+terminator): Traceback (most recent call last): >> > Jan 26 15:28:53 File "/usr/lib64/python2.7/traceback.py", line 13, in _print >> > file.write(str+terminator): File "/usr/lib/python2.7/site-packages/eventlet/hubs/poll.py", line 111, in wait >> > Jan 26 15:28:53 File 
"/usr/lib64/python2.7/traceback.py", line 13, in _print >> > file.write(str+terminator): listener.cb(fileno) >> > Jan 26 15:28:53 File "/usr/lib64/python2.7/traceback.py", line 13, in _print >> > file.write(str+terminator): File "/usr/lib/python2.7/site-packages/eventlet/green/select.py", line 53, in on_read >> > Jan 26 15:28:53 File "/usr/lib64/python2.7/traceback.py", line 13, in _print >> > file.write(str+terminator): current.switch(([original], [], [])) >> > Jan 26 15:28:53 File "/usr/lib64/python2.7/traceback.py", line 13, in _print >> > file.write(str+terminator): File "/usr/lib/python2.7/site-packages/eventlet/greenthread.py", line 221, in main >> > Jan 26 15:28:53 File "/usr/lib64/python2.7/traceback.py", line 13, in _print >> > file.write(str+terminator): result = function(*args, **kwargs) >> > Jan 26 15:28:53 File "/usr/lib64/python2.7/traceback.py", line 13, in _print >> > file.write(str+terminator): File "/opt/confluent/lib/python/confluent/syncfiles.py", line 197, in sync_list_to_node >> > Jan 26 15:28:53 File "/usr/lib64/python2.7/traceback.py", line 13, in _print >> > file.write(str+terminator): ['rsync', '-rvLD', targdir + '/', 'root@[{}]:/'.format(targip)])[0] >> > Jan 26 15:28:53 File "/usr/lib64/python2.7/traceback.py", line 13, in _print >> > file.write(str+terminator): File "/opt/confluent/lib/python/confluent/util.py", line 45, in run >> > Jan 26 15:28:53 File "/usr/lib64/python2.7/traceback.py", line 13, in _print >> > file.write(str+terminator): raise subprocess.CalledProcessError(retcode, process.args, output=stdout) >> > Jan 26 15:28:53 File "/usr/lib64/python2.7/traceback.py", line 13, in _print >> > file.write(str+terminator): CalledProcessError: Command '['rsync', '-rvLD', '/tmp/tmpVbi9YY.synctodm-boot1/', 'root@[172.17.15.222]:/']' returned non-zero exit status 255 >> > Jan 26 15:28:53 File "/usr/lib/python2.7/site-packages/eventlet/hubs/hub.py", line 317, in squelch_exception >> > sys.stderr.write("Removing descriptor: %r\n" % 
(fileno,)): Removing descriptor: 65 >> > """ >> > >> > And in "trace" we have: >> > >> > """ >> > Jan 26 15:28:53 Traceback (most recent call last): >> > File "/opt/confluent/lib/python/confluent/httpapi.py", line 612, in resourcehandler >> > for rsp in resourcehandler_backend(env, start_response): >> > File "/opt/confluent/lib/python/confluent/httpapi.py", line 636, in resourcehandler_backend >> > for res in selfservice.handle_request(env, start_response): >> > File "/opt/confluent/lib/python/confluent/selfservice.py", line 502, in handle_request >> > status, output = syncfiles.get_syncresult(nodename) >> > File "/opt/confluent/lib/python/confluent/syncfiles.py", line 321, in get_syncresult >> > result = syncrunners[nodename].wait() >> > File "/usr/lib/python2.7/site-packages/eventlet/greenthread.py", line 181, in wait >> > return self._exit_event.wait() >> > File "/usr/lib/python2.7/site-packages/eventlet/event.py", line 132, in wait >> > current.throw(*self._exc) >> > File "/usr/lib/python2.7/site-packages/eventlet/greenthread.py", line 221, in main >> > result = function(*args, **kwargs) >> > File "/opt/confluent/lib/python/confluent/syncfiles.py", line 197, in sync_list_to_node >> > ['rsync', '-rvLD', targdir + '/', 'root@[{}]:/'.format(targip)])[0] >> > File "/opt/confluent/lib/python/confluent/util.py", line 45, in run >> > raise subprocess.CalledProcessError(retcode, process.args, output=stdout) >> > CalledProcessError: Command '['rsync', '-rvLD', '/tmp/tmpVbi9YY.synctodm-boot1/', 'root@[172.17.15.222]:/']' returned non-zero exit status 255 >> > >> > """ >> > >> > > On Jan 26, 2024, at 15:01, Jarrod Johnson <jjo...@le...> wrote: >> > > >> > > Ok, another track (trying to compensate for not being able to use selfcheck). >> > > >> > > Can you try sticking some file in the profile's syncfiles, then do: >> > > nodeapply -F <node> >> > > >> > > And see if any errors happen, either in output or in the /var/log/confluent area. 
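For what it's worth, rsync's own documented exit codes stop at 35; an exit status of 255 is normally the exit status of the underlying ssh transport passed through (connectivity, host key, or authentication failure) rather than a file-copy error, which points back at the key question. A small sketch of classifying the status when shelling out the way the syncfiles code does (the "explain_sync" helper and its code table are illustrative, not Confluent code):

```python
import subprocess

# Rough meanings of the exit codes relevant here (see rsync(1), EXIT VALUES);
# 255 is what ssh itself returns when the connection or auth fails, and
# rsync propagates it unchanged.
EXIT_MEANINGS = {
    0: "success",
    23: "partial transfer due to error",
    24: "partial transfer due to vanished source files",
    255: "ssh transport failure (connectivity or authentication)",
}

def explain_sync(cmd):
    """Run a sync command and return (exit_status, human-readable meaning)."""
    proc = subprocess.run(cmd, capture_output=True)
    return proc.returncode, EXIT_MEANINGS.get(proc.returncode, "see rsync(1)")

# Simulate the 255 from the traceback without touching the network.
status, meaning = explain_sync(["sh", "-c", "exit 255"])
print(status, meaning)
```

Re-running the failing rsync by hand with "-e 'ssh -v'" would show which ssh-level step (connection, host key, or key authentication) is actually failing.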
>> > > >> > >> From: David Magda <dma...@ee...> >> > >> Sent: Friday, January 26, 2024 2:01 PM >> > >> To: xCAT Users Mailing list <xca...@li...> >> > >> Subject: Re: [xcat-user] [External] Ansible and Confluent >> > >> >> > >> We have Confluent installed on a RH/CentOS 7 system that originally had/has xCat installed for deployment of our Lenovo hardware/HPC solution. I just installed it there as it was/is our 'install server'. (We don't want to touch it too much, as it was a previous team of folks that set things up, and there's been a lot of team churn.) >> > >> >> > >> I've attached the "hangtraces" to this message; hopefully the mailing list software will pass it along. I noticed “ipmi” in some of the paths, and for the record this is a VM running under Proxmox, and does not have any LOM configured: >> > >> >> > >> """ >> > >> # nodeattrib dm-boot1 >> > >> dm-boot1: crypted.selfapikey: ******** >> > >> dm-boot1: deployment.apiarmed: >> > >> dm-boot1: deployment.pendingprofile: ubuntu-22.04.3-x86_64-test1 >> > >> dm-boot1: deployment.profile: >> > >> dm-boot1: deployment.sealedapikey: >> > >> dm-boot1: deployment.stagedprofile: >> > >> dm-boot1: deployment.state: >> > >> dm-boot1: deployment.state_detail: >> > >> dm-boot1: deployment.useinsecureprotocols: always >> > >> dm-boot1: dns.servers: 172.17.15.252,172.17.15.247,172.17.15.254 >> > >> dm-boot1: groups: everything >> > >> dm-boot1: net.hwaddr: 4e:78:df:d3:8d:59 >> > >> dm-boot1: net.ipv4_address: 172.17.15.222/21 >> > >> dm-boot1: net.ipv4_gateway: 172.17.8.254 >> > >> """ >> > >> >> > >> Running an strace(1) on the 'apiclient' that runs as part of the "post.sh" process, we have a continuous poll/read/write stream: >> > >> >> > >> """ >> > >> […] >> > >> write(3, "\27\3\3\0\371Sm2\233\337\222n\221\377vZs\21\22S\10\351\232\321I7Y$R\370]\312"..., 254) = 254 >> > >> read(3, 0x560b6949e8f3, 5) = -1 EAGAIN (Resource temporarily unavailable) >> > >> poll([{fd=3, events=POLLIN}], 1, 15000) = 1 ([{fd=3, 
revents=POLLIN}]) >> > >> read(3, "\27\3\3\0\226", 5) = 5 >> > >> read(3, "\0055\271\274&\2464\237\242h\341\30\231\274\327g\224\344g\306\313\206\326\355x\307\341\331C\366H\331"..., 150) = 150 >> > >> poll([{fd=3, events=POLLOUT}], 1, 15000) = 1 ([{fd=3, revents=POLLOUT}]) >> > >> write(3, "\27\3\3\0\371Sm2\233\337\222n\222\334e\336f\353u\343p\22\215:\264e\30a\3172\245\361"..., 254) = 254 >> > >> read(3, 0x560b6949e8f3, 5) = -1 EAGAIN (Resource temporarily unavailable) >> > >> poll([{fd=3, events=POLLIN}], 1, 15000) = 1 ([{fd=3, revents=POLLIN}]) >> > >> read(3, "\27\3\3\0\226", 5) = 5 >> > >> read(3, "\0055\271\274&\2464\240\326\202\347(\213\311\260|\333\230\372A\235\341\273U\201\223\2209ah\325J"..., 150) = 150 >> > >> poll([{fd=3, events=POLLOUT}], 1, 15000) = 1 ([{fd=3, revents=POLLOUT}]) >> > >> write(3, "\27\3\3\0\371Sm2\233\337\222n\223\240\341<\3602\323\177Y\311\317/\371\336P/s\301t8"..., 254) = 254 >> > >> read(3, 0x560b6949e8f3, 5) = -1 EAGAIN (Resource temporarily unavailable) >> > >> poll([{fd=3, events=POLLIN}], 1, 15000^Cstrace: Process 27477 detached >> > >> <detached ...> >> > >> """ >> > >> >> > >> Per lsof(1), FD 3 is: >> > >> >> > >> """ >> > >> python3 27477 root 3u IPv6 158157 0t0 TCP [fe80::[EUI-64_client]]:44800->[fe80::[EUI-64_server]]:https (ESTABLISHED) >> > >> """ >> > >> >> > >> >> > >> >> > >> On Thu, January 25, 2024 16:34, Jarrod Johnson wrote: >> > >> > What is the OS of the deployment server? >> > >> > >> > >> > kill -USR1 $(cat /var/run/confluent/pid) >> > >> > >> > >> > This should produce a /var/log/confluent/hangtraces >> > >> > >> > >> > Would be interesting to see if there are ansible-related stacks in >> > >> > hangtraces that seem stuck... 
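That strace loop (a read returning EAGAIN, a poll() with a 15000 ms timeout, then a small TLS record exchange) is the signature of a client parked in an HTTP long poll on fd 3, not one spinning or deadlocked locally. The same EAGAIN-then-poll pattern can be shown in miniature on a plain socket pair (illustrative only; the real fd is a TLS connection):

```python
import select
import socket

def read_when_ready(sock, nbytes=5, timeout_ms=15000):
    """Non-blocking read; on EAGAIN, poll() for readability, as in the strace."""
    sock.setblocking(False)
    try:
        return sock.recv(nbytes)          # read(3, ...) = -1 EAGAIN when empty
    except BlockingIOError:
        poller = select.poll()
        poller.register(sock.fileno(), select.POLLIN)
        if poller.poll(timeout_ms):       # poll([{fd=3, POLLIN}], 1, 15000)
            return sock.recv(nbytes)
        return b""                        # timed out: nothing arrived

a, b = socket.socketpair()
b.sendall(b"\x17\x03\x03\x00\x96")        # pretend a 5-byte TLS record header arrives
print(read_when_ready(a))
```

In other words, the client's fate is in the server's hands here: the loop only ends when the server finally sends the response the client is waiting for.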
|
From: Jarrod J. <jjo...@le...> - 2024-01-26 21:00:17
|
Hmm, how odd.... # cat /var/lib/confluent/public/site/ssh/mp01.example.com.automationpubkey /etc/confluent/ssh/automation.pub Do they match? ________________________________ From: David Magda <dma...@ee...> Sent: Friday, January 26, 2024 3:50 PM To: xCAT Users Mailing list <xca...@li...> Subject: Re: [xcat-user] [External] Ansible and Confluent """ # ls -lrth /var/lib/confluent/public/site/ssh/*pubkey -rw-r--r-- 1 confluent root 410 Oct 31 12:05 /var/lib/confluent/public/site/ssh/mp01.example.com.rootpubkey -rw-r--r-- 1 confluent root 129 Oct 31 12:05 /var/lib/confluent/public/site/ssh/mp01.example.com.automationpubkey # ssh dm-boot1 'hostname -f; uptime' dm-boot1 15:47:19 up 21 min, 0 users, load average: 0.00, 0.00, 0.00 """ > On Jan 26, 2024, at 15:43, Jarrod Johnson <jjo...@le...> wrote: > > # ls /var/lib/confluent/public/site/ssh/*pubkey > > > From: David Magda <dma...@ee...> > Sent: Friday, January 26, 2024 3:40 PM > To: xCAT Users Mailing list <xca...@li...> > Subject: Re: [xcat-user] [External] Ansible and Confluent > > There’s no “syncfiles” in the default Ubuntu profile, nor anything in the web docs on its format, but I found a template in "/opt/confluent/lib/osdeploy/el9/profiles/default/syncfiles”. 
> > Created a file with the line: > > /etc/hosts -> /etc/hosts_test > > With the results: > > """ > # nodeapply -F dm-boot1 > dm-boot1: > dm-boot1: --------------------------------------------------------------------------- > dm-boot1: Running python script 'syncfileclient' from https://[fe80::749f:43ff:fe72:55e4]/confluent-public/os/ubuntu-22.04.3-x86_64-test1/scripts/ > dm-boot1: Executing in /tmp/confluentscripts.HUGo3sMtt > dm-boot1: Traceback (most recent call last): > dm-boot1: File "/tmp/confluentscripts.HUGo3sMtt/syncfileclient", line 286, in <module> > dm-boot1: synchronize() > dm-boot1: File "/tmp/confluentscripts.HUGo3sMtt/syncfileclient", line 233, in synchronize > dm-boot1: status, rsp = ac.grab_url_with_status('/confluent-api/self/remotesyncfiles') > dm-boot1: File "/opt/confluent/bin/apiclient", line 413, in grab_url_with_status > dm-boot1: raise Exception(rsp.read()) > dm-boot1: Exception: b"500 - Command '['rsync', '-rvLD', '/tmp/tmpSUbmoD.synctodm-boot1/', 'root@[172.17.15.222]:/']' returned non-zero exit status 255" > dm-boot1: 'syncfileclient' exited with code 1 > """ > > In "/var/log/confluent/stderr" we have: > > """ > Jan 26 15:28:53 File "/usr/lib64/python2.7/traceback.py", line 13, in _print > file.write(str+terminator): Traceback (most recent call last): > Jan 26 15:28:53 File "/usr/lib64/python2.7/traceback.py", line 13, in _print > file.write(str+terminator): File "/usr/lib/python2.7/site-packages/eventlet/hubs/poll.py", line 111, in wait > Jan 26 15:28:53 File 
"/usr/lib64/python2.7/traceback.py", line 13, in _print > file.write(str+terminator): listener.cb(fileno) > Jan 26 15:28:53 File "/usr/lib64/python2.7/traceback.py", line 13, in _print > file.write(str+terminator): File "/usr/lib/python2.7/site-packages/eventlet/green/select.py", line 53, in on_read > Jan 26 15:28:53 File "/usr/lib64/python2.7/traceback.py", line 13, in _print > file.write(str+terminator): current.switch(([original], [], [])) > Jan 26 15:28:53 File "/usr/lib64/python2.7/traceback.py", line 13, in _print > file.write(str+terminator): File "/usr/lib/python2.7/site-packages/eventlet/greenthread.py", line 221, in main > Jan 26 15:28:53 File "/usr/lib64/python2.7/traceback.py", line 13, in _print > file.write(str+terminator): result = function(*args, **kwargs) > Jan 26 15:28:53 File "/usr/lib64/python2.7/traceback.py", line 13, in _print > file.write(str+terminator): File "/opt/confluent/lib/python/confluent/syncfiles.py", line 197, in sync_list_to_node > Jan 26 15:28:53 File "/usr/lib64/python2.7/traceback.py", line 13, in _print > file.write(str+terminator): ['rsync', '-rvLD', targdir + '/', 'root@[{}]:/'.format(targip)])[0] > Jan 26 15:28:53 File "/usr/lib64/python2.7/traceback.py", line 13, in _print > file.write(str+terminator): File "/opt/confluent/lib/python/confluent/util.py", line 45, in run > Jan 26 15:28:53 File "/usr/lib64/python2.7/traceback.py", line 13, in _print > file.write(str+terminator): raise subprocess.CalledProcessError(retcode, process.args, output=stdout) > Jan 26 15:28:53 File "/usr/lib64/python2.7/traceback.py", line 13, in _print > file.write(str+terminator): CalledProcessError: Command '['rsync', '-rvLD', '/tmp/tmpVbi9YY.synctodm-boot1/', 'root@[172.17.15.222]:/']' returned non-zero exit status 255 > Jan 26 15:28:53 File "/usr/lib/python2.7/site-packages/eventlet/hubs/hub.py", line 317, in squelch_exception > sys.stderr.write("Removing descriptor: %r\n" % (fileno,)): Removing descriptor: 65 > """ > > And in “trace” we have: 
> > """ > Jan 26 15:28:53 Traceback (most recent call last): > File "/opt/confluent/lib/python/confluent/httpapi.py", line 612, in resourcehandler > for rsp in resourcehandler_backend(env, start_response): > File "/opt/confluent/lib/python/confluent/httpapi.py", line 636, in resourcehandler_backend > for res in selfservice.handle_request(env, start_response): > File "/opt/confluent/lib/python/confluent/selfservice.py", line 502, in handle_request > status, output = syncfiles.get_syncresult(nodename) > File "/opt/confluent/lib/python/confluent/syncfiles.py", line 321, in get_syncresult > result = syncrunners[nodename].wait() > File "/usr/lib/python2.7/site-packages/eventlet/greenthread.py", line 181, in wait > return self._exit_event.wait() > File "/usr/lib/python2.7/site-packages/eventlet/event.py", line 132, in wait > current.throw(*self._exc) > File "/usr/lib/python2.7/site-packages/eventlet/greenthread.py", line 221, in main > result = function(*args, **kwargs) > File "/opt/confluent/lib/python/confluent/syncfiles.py", line 197, in sync_list_to_node > ['rsync', '-rvLD', targdir + '/', 'root@[{}]:/'.format(targip)])[0] > File "/opt/confluent/lib/python/confluent/util.py", line 45, in run > raise subprocess.CalledProcessError(retcode, process.args, output=stdout) > CalledProcessError: Command '['rsync', '-rvLD', '/tmp/tmpVbi9YY.synctodm-boot1/', 'root@[172.17.15.222]:/']' returned non-zero exit status 255 > > """ > > > On Jan 26, 2024, at 15:01, Jarrod Johnson <jjo...@le...> wrote: > > > > Ok, another track (trying to compensate for not being able to use selfcheck). > > > > Can you try sticking some file in the profile's syncfiles, then do: > > nodeapply -F <node> > > > > And see if any errors happen, either in output or in the /var/log/confluent area. 
> > > >> From: David Magda <dma...@ee...> > >> Sent: Friday, January 26, 2024 2:01 PM > >> To: xCAT Users Mailing list <xca...@li...> > >> Subject: Re: [xcat-user] [External] Ansible and Confluent > >> > >> We have Confluent installed on a RH/CentOS 7 system that originally had/has xCat installed for deployment of our Lenovo hardware/HPC solution. I just installed it there as it was/is our 'install server'. (We don't want to touch it too much, as it was a previous team of folks that set things up, and there's been a lot of team churn.) > >> > >> I've attached the "hangtraces" to this message; hopefully the mailing list software will pass it along. I noticed “ipmi” in some of the paths, and for the record this is a VM running under Proxmox, and does not have any LOM configured: > >> > >> """ > >> # nodeattrib dm-boot1 > >> dm-boot1: crypted.selfapikey: ******** > >> dm-boot1: deployment.apiarmed: > >> dm-boot1: deployment.pendingprofile: ubuntu-22.04.3-x86_64-test1 > >> dm-boot1: deployment.profile: > >> dm-boot1: deployment.sealedapikey: > >> dm-boot1: deployment.stagedprofile: > >> dm-boot1: deployment.state: > >> dm-boot1: deployment.state_detail: > >> dm-boot1: deployment.useinsecureprotocols: always > >> dm-boot1: dns.servers: 172.17.15.252,172.17.15.247,172.17.15.254 > >> dm-boot1: groups: everything > >> dm-boot1: net.hwaddr: 4e:78:df:d3:8d:59 > >> dm-boot1: net.ipv4_address: 172.17.15.222/21 > >> dm-boot1: net.ipv4_gateway: 172.17.8.254 > >> """ > >> > >> Running an strace(1) on the 'apiclient' that runs as part of the "post.sh" process, we have a continuous poll/read/write stream: > >> > >> """ > >> […] > >> write(3, "\27\3\3\0\371Sm2\233\337\222n\221\377vZs\21\22S\10\351\232\321I7Y$R\370]\312"..., 254) = 254 > >> read(3, 0x560b6949e8f3, 5) = -1 EAGAIN (Resource temporarily unavailable) > >> poll([{fd=3, events=POLLIN}], 1, 15000) = 1 ([{fd=3, revents=POLLIN}]) > >> read(3, "\27\3\3\0\226", 5) = 5 > >> read(3, 
"\0055\271\274&\2464\237\242h\341\30\231\274\327g\224\344g\306\313\206\326\355x\307\341\331C\366H\331"..., 150) = 150 > >> poll([{fd=3, events=POLLOUT}], 1, 15000) = 1 ([{fd=3, revents=POLLOUT}]) > >> write(3, "\27\3\3\0\371Sm2\233\337\222n\222\334e\336f\353u\343p\22\215:\264e\30a\3172\245\361"..., 254) = 254 > >> read(3, 0x560b6949e8f3, 5) = -1 EAGAIN (Resource temporarily unavailable) > >> poll([{fd=3, events=POLLIN}], 1, 15000) = 1 ([{fd=3, revents=POLLIN}]) > >> read(3, "\27\3\3\0\226", 5) = 5 > >> read(3, "\0055\271\274&\2464\240\326\202\347(\213\311\260|\333\230\372A\235\341\273U\201\223\2209ah\325J"..., 150) = 150 > >> poll([{fd=3, events=POLLOUT}], 1, 15000) = 1 ([{fd=3, revents=POLLOUT}]) > >> write(3, "\27\3\3\0\371Sm2\233\337\222n\223\240\341<\3602\323\177Y\311\317/\371\336P/s\301t8"..., 254) = 254 > >> read(3, 0x560b6949e8f3, 5) = -1 EAGAIN (Resource temporarily unavailable) > >> poll([{fd=3, events=POLLIN}], 1, 15000^Cstrace: Process 27477 detached > >> <detached ...> > >> """ > >> > >> Per lsof(1), FD 3 is: > >> > >> """ > >> python3 27477 root 3u IPv6 158157 0t0 TCP [fe80::[EUI-64_client]]:44800->[fe80::[EUI-64_server]]:https (ESTABLISHED) > >> """ > >> > >> > >> > >> On Thu, January 25, 2024 16:34, Jarrod Johnson wrote: > >> > What is the OS of the deployment server? > >> > > >> > kill -USR1 $(cat /var/run/confluent/pid) > >> > > >> > This should produce a /var/log/confluent/hangtraces > >> > > >> > Would be interesting to see if there's ansible related stacks in > >> > hangtraces that seem stuck... 
> >> > > >> > > >> > ________________________________ > >> > From: David Magda <dma...@ee...> > >> > Sent: Thursday, January 25, 2024 4:25 PM > >> > To: xCAT Users Mailing list <xca...@li...> > >> > Subject: Re: [xcat-user] [External] Ansible and Confluent > >> > > >> > First suggested command: > >> > > >> > """ > >> > # confluent_selfcheck > >> > OS Deployment: Initialized > >> > Confluent UUID: Consistent > >> > Web Server: Running > >> > Web Certificate: Traceback (most recent call last): > >> > File "/opt/confluent/bin/confluent_selfcheck", line 178, in <module> > >> > cert = certificates_missing_ips(conn) > >> > File "/opt/confluent/bin/confluent_selfcheck", line 57, in > >> > certificates_missing_ips > >> > ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT) > >> > AttributeError: 'module' object has no attribute 'PROTOCOL_TLS_CLIENT' > >> > """ > >> > > >> > On the being-installed system, ignoring the typical Linux stuff, the > >> > output of 'ps -elfH' has: > >> > > >> > """ > >> > > >> > 4 S root 1247 1 0 80 0 - 7499 do_pol 17:53 ? > >> > 00:00:00 /usr/bin/python3 /usr/bin/networkd-dispatcher > >> > --run-startup-triggers > >> > 4 S root 1248 1 0 80 0 - 58623 do_pol 17:53 ? > >> > 00:00:00 /usr/libexec/polkitd --no-debug > >> > 4 S syslog 1250 1 0 80 0 - 55600 do_sel 17:53 ? > >> > 00:00:00 /usr/sbin/rsyslogd -n -iNONE > >> > 4 S root 1252 1 0 80 0 - 385081 futex_ 17:53 ? > >> > 00:00:03 /usr/lib/snapd/snapd > >> > 4 S root 1253 1 0 80 0 - 3831 ep_pol 17:53 ? > >> > 00:00:00 /lib/systemd/systemd-logind > >> > 4 S root 1255 1 0 80 0 - 98198 do_pol 17:53 ? > >> > 00:00:02 /usr/libexec/udisks2/udisksd > >> > 4 S root 1283 1 0 80 0 - 26778 do_pol 17:53 ? > >> > 00:00:00 /usr/bin/python3 > >> > /usr/share/unattended-upgrades/unattended-upgrade-shutdown > >> > --wait-for-signal > >> > 4 S root 1291 1 0 80 0 - 61055 do_pol 17:53 ? > >> > 00:00:00 /usr/sbin/ModemManager > >> > 4 S root 2042 1 0 80 0 - 722 do_wai 17:53 ? 
> >> > 00:00:00 /bin/sh /snap/subiquity/5004/usr/bin/subiquity-server > >> > 4 S root 2086 2042 0 80 0 - 149574 ep_pol 17:53 ? > >> > 00:00:07 /snap/subiquity/5004/usr/bin/python3.10 -m > >> > subiquity.cmd.server > >> > 4 S root 27499 2086 0 80 0 - 722 do_wai 18:09 ? > >> > 00:00:00 sh -c /custom-installation/post.sh > >> > 4 S root 27501 27499 0 80 0 - 1150 do_wai 18:09 ? > >> > 00:00:00 /bin/bash /custom-installation/post.sh > >> > 4 S root 27588 27501 4 80 0 - 7403 do_pol 18:09 ? > >> > 00:03:16 /usr/bin/python3 /opt/confluent/bin/apiclient > >> > /confluent-api/self/remoteconfig/status -w 204 > >> > 4 S root 2049 1 0 80 0 - 24167 ep_pol 17:53 tty1 > >> > 00:00:05 /snap/subiquity/5004/usr/bin/python3.10 > >> > /snap/subiquity/5004/usr/bin/subiquity > >> > 4 S root 2137 1 0 80 0 - 3855 do_pol 17:53 ? > >> > 00:00:00 sshd: /usr/sbin/sshd -D [listener] 0 of 10-100 startups > >> > 4 S root 37842 2137 0 80 0 - 4310 - 19:15 ? > >> > 00:00:00 sshd: root@pts/0 > >> > 4 S root 37952 37842 0 80 0 - 1543 do_wai 19:15 ? > >> > 00:00:00 -bash > >> > 4 R root 38032 37952 0 80 0 - 1911 - 19:16 ? > >> > 00:00:00 ps -elfH > >> > 4 S root 2206 1 0 80 0 - 3266 ep_pol 17:53 ? > >> > 00:00:00 /lib/netplan/netplan-dbus > >> > 4 S root 2570 1 0 80 0 - 73244 do_pol 17:53 ? > >> > 00:00:00 /usr/libexec/packagekitd > >> > 4 S root 37848 1 1 80 0 - 4301 ep_pol 19:15 ? > >> > 00:00:00 /lib/systemd/systemd --user > >> > 5 S root 37850 37848 0 80 0 - 26271 do_sig 19:15 ? > >> > 00:00:00 (sd-pam) > >> > """ > >> > > >> > While 'ps axf' produces (trimmed): > >> > > >> > """ > >> > 2042 ? Ss 0:00 /bin/sh > >> > /snap/subiquity/5004/usr/bin/subiquity-server > >> > 2086 ? Sl 0:07 \_ /snap/subiquity/5004/usr/bin/python3.10 -m > >> > subiquity.cmd.server > >> > 27499 ? S 0:00 \_ sh -c /custom-installation/post.sh > >> > 27501 ? S 0:00 \_ /bin/bash > >> > /custom-installation/post.sh > >> > 27588 ? 
S 3:21 \_ /usr/bin/python3 > >> > /opt/confluent/bin/apiclient /confluent-api/self/remoteconfig/status -w > >> > 204 > >> > 2049 tty1 Ss+ 0:05 /snap/subiquity/5004/usr/bin/python3.10 > >> > /snap/subiquity/5004/usr/bin/subiquity > >> > """ > >> > > >> > Doing a "kill -9 27588" (on apiclient) causes the installation to > >> > 'finish'. After the reboot, and after "firstboot.sh" does its thing, we > >> > have the following from 'ps axf': > >> > > >> > """ > >> > 1372 ? Ss 0:00 /usr/bin/python3 /usr/bin/cloud-init modules > >> > --mode=final > >> > 1376 ? S 0:00 \_ /bin/sh -c tee -a > >> > /var/log/cloud-init-output.log > >> > 1377 ? S 0:00 | \_ tee -a /var/log/cloud-init-output.log > >> > 1378 ? S 0:00 \_ /bin/sh > >> > /var/lib/cloud/instance/scripts/runcmd > >> > 1379 ? S 0:00 \_ /bin/bash /etc/confluent/firstboot.sh > >> > 1429 ? S 0:01 \_ /usr/bin/python3 > >> > /opt/confluent/bin/apiclient /confluent-api/self/remoteconfig/status -w > >> > 204 > >> > """ > >> > > >> > This causes the "/var/log/httpd/ssl_access_log" to start filling up. A > >> > subsequent reboot, where "firstboot.sh" is not run, has the system > >> > coming up without "apiclient" running, and so there's no longer 'spam' in > >> > "ssl_access_log". > >> > > >> > Running "apiclient" manually from the CLI with the exact options causes a > >> > bunch of stuff in "ssl_access_log": > >> > > >> > """ > >> > fe80::[EUI-64] - - [25/Jan/2024:14:52:15 -0500] "GET > >> > /confluent-api/self/remoteconfig/status HTTP/1.1" 200 - > >> > """ > >> > > >> > at the same time as the above is being generated, there is nothing in > >> > "/var/log/confluent/trace" or "stderr". > >> > > >> > > >> > On Thu, January 25, 2024 07:52, Jarrod Johnson wrote: > >> >> Anything in /var/log/confluent/stderr or /var/log/confluent/trace? Also > >> >> would be tempted to see if 'confluent_selfcheck' has any suggestions. 
> >> >> You > >> >> can also ssh into the node during that phase to confirm what it is doing > >> >> while it is seemingly hung, e.g. looking at ps axf > >> >> ________________________________ > >> >> From: David Magda <dma...@ee...> > >> >> Sent: Wednesday, January 24, 2024 9:37 PM > >> >> To: xCA...@li... <xCA...@li...> > >> >> Subject: [External] [xcat-user] Ansible and Confluent > >> >> > >> >> Hello, > >> >> > >> >> I'm trying to get Ansible working with Confluent 3.8.0. (Using an older > >> >> version due to legacy OS reasons.) > >> >> > >> >> In /var/lib/confluent/public/os/ I created a new profile called > >> >> ubuntu-22.04.3-x86_64-test1/, and this seems to work just fine: I took > >> >> the > >> >> provided "autoinstall/user-data" file, added some partition stanzas, > >> >> some > >> >> packages, etc. > >> >> > >> >> Once I sorted out a 'basic' automated Ubuntu install I tried creating an > >> >> "ansible/post.d/01-packages.yaml" file within the profile directory > >> >> with > >> >> the following contents: > >> >> > >> >> """ > >> >> - name: install chrony > >> >> apt: > >> >> pkg: > >> >> - chrony > >> >> """ > >> >> > >> >> The Ubuntu (subiquity) installer seems to 'hang' at: > >> >> > >> >> """ > >> >> start: subiquity/Late/run/command_1: /custom-installation/post.sh > >> >> """ > >> >> > >> >> which probably corresponds to this part of the "user-data" file: > >> >> > >> >> """ > >> >> late-commands: > >> >> - chroot /target apt-get -y -q purge snapd modemmanager > >> >> - /custom-installation/post.sh > >> >> """ > >> >> > >> >> When the 'hang' occurs the following starts filling up the > >> >> "/var/log/httpd/ssl_access_log" file of the Confluent/xcat server: > >> >> > >> >> """ > >> >> fe80::[EUI-64] - - [24/Jan/2024:11:15:08 -0500] "GET > >> >> /confluent-api/self/remoteconfig/status HTTP/1.1" 200 - > >> >> fe80::[EUI-64] - - [24/Jan/2024:11:15:08 -0500] "GET > >> >> /confluent-api/self/remoteconfig/status HTTP/1.1" 200 - > >> >> fe80::[EUI-64] - 
- [24/Jan/2024:11:15:08 -0500] "GET > >> >> /confluent-api/self/remoteconfig/status HTTP/1.1" 200 - > >> >> fe80::[EUI-64] - - [24/Jan/2024:11:15:08 -0500] "GET > >> >> /confluent-api/self/remoteconfig/status HTTP/1.1" 200 - > >> >> fe80::[EUI-64] - - [24/Jan/2024:11:15:08 -0500] "GET > >> >> /confluent-api/self/remoteconfig/status HTTP/1.1" 200 - > >> >> """ > >> >> > >> >> When I force a restart of the system/VM, it can boot off the disk, and > >> >> goes through the regular start-up process, including a bunch of > >> >> cloud-init > >> >> stuff. Though after it runs "/etc/confluent/firstboot.sh", the > >> >> "ssl_access_log" file once again starts filling with the > >> >> "remoteconfig/status" stuff per above. > >> >> > >> >> Renaming "ansible/" to "ansible_off/" seems to make the problem go away. > >> >> Similar behaviour with Ubuntu 20.04. > >> >> > >> >> I'm wondering what's going on with the 'hang' when "post.sh" is executed, and > >> >> the flooding after "firstboot.sh". > >> >> > >> >> Regards, > >> >> David > >> > > >> > > >> > >> _______________________________________________ > >> xCAT-user mailing list > >> xCA...@li... > >> https://lists.sourceforge.net/lists/listinfo/xcat-user _______________________________________________ xCAT-user mailing list xCA...@li... https://lists.sourceforge.net/lists/listinfo/xcat-user
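A note on the rsync failures quoted above: 255 is not one of rsync's own documented error codes; it is the exit status ssh itself returns when the connection or key authentication fails, and rsync passes that remote-shell failure through, which is why comparing the automation public keys is the natural next check. The following is a minimal sketch, not Confluent's actual code: the hypothetical `run` helper only mirrors the shape of the wrapper seen in the tracebacks (capture output, raise `CalledProcessError` on non-zero exit), and `sh -c "exit 255"` merely stands in for the failing rsync-over-ssh invocation.

```python
import subprocess

def run(cmd):
    # Hypothetical stand-in for the run() wrapper seen in the tracebacks:
    # capture stdout and raise CalledProcessError on any non-zero exit.
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    stdout, _ = proc.communicate()
    if proc.returncode:
        raise subprocess.CalledProcessError(proc.returncode, cmd, output=stdout)
    return stdout

# ssh exits with 255 on connection or authentication failure, and rsync
# surfaces that as its own exit status; simulate it with a plain shell exit.
try:
    run(['sh', '-c', 'exit 255'])
except subprocess.CalledProcessError as e:
    print('exit status', e.returncode)  # exit status 255
```

Practically, if a plain `ssh root@<node> true` from the deployment server using the automation key fails the same way, the problem is key trust rather than anything rsync-specific.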
From: David M. <dma...@ee...> - 2024-01-26 20:50:50
|
""" # ls -lrth /var/lib/confluent/public/site/ssh/*pubkey -rw-r--r-- 1 confluent root 410 Oct 31 12:05 /var/lib/confluent/public/site/ssh/mp01.example.com.rootpubkey -rw-r--r-- 1 confluent root 129 Oct 31 12:05 /var/lib/confluent/public/site/ssh/mp01.example.com.automationpubkey # ssh dm-boot1 'hostname -f; uptime' dm-boot1 15:47:19 up 21 min, 0 users, load average: 0.00, 0.00, 0.00 """ > On Jan 26, 2024, at 15:43, Jarrod Johnson <jjo...@le...> wrote: > > # ls /var/lib/confluent/public/site/ssh/*pubkey > > > From: David Magda <dma...@ee...> > Sent: Friday, January 26, 2024 3:40 PM > To: xCAT Users Mailing list <xca...@li...> > Subject: Re: [xcat-user] [External] Ansible and Confluent > > There’s no “syncfiles” in the default Ubuntu profile, nor anything in the web docs on its format, but I found a template in "/opt/confluent/lib/osdeploy/el9/profiles/default/syncfiles”. > > Created a file with the line: > > /etc/hosts -> /etc/hosts_test > > With the results: > > """ > # nodeapply -F dm-boot1 > dm-boot1: > dm-boot1: --------------------------------------------------------------------------- > dm-boot1: Running python script 'syncfileclient' from https://apc01.safelinks.protection.outlook.com/?url=https%3A%2F%2F%5Bfe80%3A%3A749f%3A43ff%3Afe72%3A55e4%5D%2Fconfluent-public%2Fos%2Fubuntu-22.04.3-x86_64-test1%2Fscripts%2F&data=05%7C02%7Cjjohnson2%40lenovo.com%7C4b50f514de4046c1cc2608dc1eaf3bb1%7C5c7d0b28bdf8410caa934df372b16203%7C0%7C0%7C638418985686897777%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C60000%7C%7C%7C&sdata=ORHvf14ef0F6UnhPvwAbIUq%2F0x1xvfWuP%2FDrXdfJlzk%3D&reserved=0 > dm-boot1: Executing in /tmp/confluentscripts.HUGo3sMtt > dm-boot1: Traceback (most recent call last): > dm-boot1: File "/tmp/confluentscripts.HUGo3sMtt/syncfileclient", line 286, in <module> > dm-boot1: synchronize() > dm-boot1: File "/tmp/confluentscripts.HUGo3sMtt/syncfileclient", line 233, in synchronize > dm-boot1: status, rsp = 
ac.grab_url_with_status('/confluent-api/self/remotesyncfiles') > dm-boot1: File "/opt/confluent/bin/apiclient", line 413, in grab_url_with_status > dm-boot1: raise Exception(rsp.read()) > dm-boot1: Exception: b"500 - Command '['rsync', '-rvLD', '/tmp/tmpSUbmoD.synctodm-boot1/', 'root@[172.17.15.222]:/']' returned non-zero exit status 255" > dm-boot1: 'syncfileclient' exited with code 1 > """ > > In "/var/log/confluent/stderr” we have: > > """ > Jan 26 15:28:53 File "/usr/lib64/python2.7/traceback.py", line 13, in _print > file.write(str+terminator): Traceback (most recent call last): > Jan 26 15:28:53 File "/usr/lib64/python2.7/traceback.py", line 13, in _print > file.write(str+terminator): File "/usr/lib/python2.7/site-packages/eventlet/hubs/poll.py", line 111, in wait > Jan 26 15:28:53 File "/usr/lib64/python2.7/traceback.py", line 13, in _print > file.write(str+terminator): listener.cb(fileno) > Jan 26 15:28:53 File "/usr/lib64/python2.7/traceback.py", line 13, in _print > file.write(str+terminator): File "/usr/lib/python2.7/site-packages/eventlet/green/select.py", line 53, in on_read > Jan 26 15:28:53 File "/usr/lib64/python2.7/traceback.py", line 13, in _print > file.write(str+terminator): current.switch(([original], [], [])) > Jan 26 15:28:53 File "/usr/lib64/python2.7/traceback.py", line 13, in _print > file.write(str+terminator): File "/usr/lib/python2.7/site-packages/eventlet/greenthread.py", line 221, in main > Jan 26 15:28:53 File "/usr/lib64/python2.7/traceback.py", line 13, in _print > file.write(str+terminator): result = function(*args, **kwargs) > Jan 26 15:28:53 File "/usr/lib64/python2.7/traceback.py", line 13, in _print > file.write(str+terminator): File "/opt/confluent/lib/python/confluent/syncfiles.py", line 197, in sync_list_to_node > Jan 26 15:28:53 File "/usr/lib64/python2.7/traceback.py", line 13, in _print > file.write(str+terminator): ['rsync', '-rvLD', targdir + '/', 'root@[{}]:/'.format(targip)])[0] > Jan 26 15:28:53 File 
"/usr/lib64/python2.7/traceback.py", line 13, in _print > file.write(str+terminator): File "/opt/confluent/lib/python/confluent/util.py", line 45, in run > Jan 26 15:28:53 File "/usr/lib64/python2.7/traceback.py", line 13, in _print > file.write(str+terminator): raise subprocess.CalledProcessError(retcode, process.args, output=stdout) > Jan 26 15:28:53 File "/usr/lib64/python2.7/traceback.py", line 13, in _print > file.write(str+terminator): CalledProcessError: Command '['rsync', '-rvLD', '/tmp/tmpVbi9YY.synctodm-boot1/', 'root@[172.17.15.222]:/']' returned non-zero exit status 255 > Jan 26 15:28:53 File "/usr/lib/python2.7/site-packages/eventlet/hubs/hub.py", line 317, in squelch_exception > sys.stderr.write("Removing descriptor: %r\n" % (fileno,)): Removing descriptor: 65 > """ > > And in “trace” we have: > > """ > Jan 26 15:28:53 Traceback (most recent call last): > File "/opt/confluent/lib/python/confluent/httpapi.py", line 612, in resourcehandler > for rsp in resourcehandler_backend(env, start_response): > File "/opt/confluent/lib/python/confluent/httpapi.py", line 636, in resourcehandler_backend > for res in selfservice.handle_request(env, start_response): > File "/opt/confluent/lib/python/confluent/selfservice.py", line 502, in handle_request > status, output = syncfiles.get_syncresult(nodename) > File "/opt/confluent/lib/python/confluent/syncfiles.py", line 321, in get_syncresult > result = syncrunners[nodename].wait() > File "/usr/lib/python2.7/site-packages/eventlet/greenthread.py", line 181, in wait > return self._exit_event.wait() > File "/usr/lib/python2.7/site-packages/eventlet/event.py", line 132, in wait > current.throw(*self._exc) > File "/usr/lib/python2.7/site-packages/eventlet/greenthread.py", line 221, in main > result = function(*args, **kwargs) > File "/opt/confluent/lib/python/confluent/syncfiles.py", line 197, in sync_list_to_node > ['rsync', '-rvLD', targdir + '/', 'root@[{}]:/'.format(targip)])[0] > File 
"/opt/confluent/lib/python/confluent/util.py", line 45, in run > raise subprocess.CalledProcessError(retcode, process.args, output=stdout) > CalledProcessError: Command '['rsync', '-rvLD', '/tmp/tmpVbi9YY.synctodm-boot1/', 'root@[172.17.15.222]:/']' returned non-zero exit status 255 > > """ > > > On Jan 26, 2024, at 15:01, Jarrod Johnson <jjo...@le...> wrote: > > > > Ok, another track (trying to compensate for not being able to use selfcheck). > > > > Can you try sticking some file in the profile's syncfiles, then do: > > nodeapply -F <node> > > > > And see if any errors happen, either in output or in the /var/log/confluet area. > > > >> From: David Magda <dma...@ee...> > >> Sent: Friday, January 26, 2024 2:01 PM > >> To: xCAT Users Mailing list <xca...@li...> > >> Subject: Re: [xcat-user] [External] Ansible and Confluent > >> > >> We have Confluent installed on a RH/CentOS 7 system that originally had/has xCat installed for deployment of our Lenovo hardware/HPC solution. I just installed it there as it was/is our 'install server'. (We don't want to touch it too much, as it was a previous team of folks that set things up, and there's been a lot of team churn.) > >> > >> I've attached the "hangtraces" to this message; hopefully the mailing list software will pass it along. 
I noticed “ipmi” in some of the paths, and for the record this is a VM running under Proxmox, and does not have any LOM configured: > >> > >> """ > >> # nodeattrib dm-boot1 > >> dm-boot1: crypted.selfapikey: ******** > >> dm-boot1: deployment.apiarmed: > >> dm-boot1: deployment.pendingprofile: ubuntu-22.04.3-x86_64-test1 > >> dm-boot1: deployment.profile: > >> dm-boot1: deployment.sealedapikey: > >> dm-boot1: deployment.stagedprofile: > >> dm-boot1: deployment.state: > >> dm-boot1: deployment.state_detail: > >> dm-boot1: deployment.useinsecureprotocols: always > >> dm-boot1: dns.servers: 172.17.15.252,172.17.15.247,172.17.15.254 > >> dm-boot1: groups: everything > >> dm-boot1: net.hwaddr: 4e:78:df:d3:8d:59 > >> dm-boot1: net.ipv4_address: 172.17.15.222/21 > >> dm-boot1: net.ipv4_gateway: 172.17.8.254 > >> """ > >> > >> Running an strace(1) on the 'apiclient' that runs as part of the "post.sh" process, we have a continuous poll/read/write stream: > >> > >> """ > >> […] > >> write(3, "\27\3\3\0\371Sm2\233\337\222n\221\377vZs\21\22S\10\351\232\321I7Y$R\370]\312"..., 254) = 254 > >> read(3, 0x560b6949e8f3, 5) = -1 EAGAIN (Resource temporarily unavailable) > >> poll([{fd=3, events=POLLIN}], 1, 15000) = 1 ([{fd=3, revents=POLLIN}]) > >> read(3, "\27\3\3\0\226", 5) = 5 > >> read(3, "\0055\271\274&\2464\237\242h\341\30\231\274\327g\224\344g\306\313\206\326\355x\307\341\331C\366H\331"..., 150) = 150 > >> poll([{fd=3, events=POLLOUT}], 1, 15000) = 1 ([{fd=3, revents=POLLOUT}]) > >> write(3, "\27\3\3\0\371Sm2\233\337\222n\222\334e\336f\353u\343p\22\215:\264e\30a\3172\245\361"..., 254) = 254 > >> read(3, 0x560b6949e8f3, 5) = -1 EAGAIN (Resource temporarily unavailable) > >> poll([{fd=3, events=POLLIN}], 1, 15000) = 1 ([{fd=3, revents=POLLIN}]) > >> read(3, "\27\3\3\0\226", 5) = 5 > >> read(3, "\0055\271\274&\2464\240\326\202\347(\213\311\260|\333\230\372A\235\341\273U\201\223\2209ah\325J"..., 150) = 150 > >> poll([{fd=3, events=POLLOUT}], 1, 15000) = 1 ([{fd=3, 
revents=POLLOUT}]) > >> write(3, "\27\3\3\0\371Sm2\233\337\222n\223\240\341<\3602\323\177Y\311\317/\371\336P/s\301t8"..., 254) = 254 > >> read(3, 0x560b6949e8f3, 5) = -1 EAGAIN (Resource temporarily unavailable) > >> poll([{fd=3, events=POLLIN}], 1, 15000^Cstrace: Process 27477 detached > >> <detached ...> > >> """ > >> > >> Per lsof(1), FD 3 is: > >> > >> """ > >> python3 27477 root 3u IPv6 158157 0t0 TCP [fe80::[EUI-64_client]]:44800->[fe80::[EUI-64_server]]:https (ESTABLISHED) > >> """ > >> > >> > >> > >> On Thu, January 25, 2024 16:34, Jarrod Johnson wrote: > >> > What is the OS of the deployment server? > >> > > >> > kill -USR1 $(cat /var/run/confluent/pid) > >> > > >> > This should produce a /var/log/confluennt/hangtraces > >> > > >> > Would be interesting to see if there's ansible related stacks in > >> > hangtraces that seem stuck... > >> > > >> > > >> > ________________________________ > >> > From: David Magda <dma...@ee...> > >> > Sent: Thursday, January 25, 2024 4:25 PM > >> > To: xCAT Users Mailing list <xca...@li...> > >> > Subject: Re: [xcat-user] [External] Ansible and Confluent > >> > > >> > First suggested command: > >> > > >> > """ > >> > # confluent_selfcheck > >> > OS Deployment: Initialized > >> > Confluent UUID: Consistent > >> > Web Server: Running > >> > Web Certificate: Traceback (most recent call last): > >> > File "/opt/confluent/bin/confluent_selfcheck", line 178, in <module> > >> > cert = certificates_missing_ips(conn) > >> > File "/opt/confluent/bin/confluent_selfcheck", line 57, in > >> > certificates_missing_ips > >> > ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT) > >> > AttributeError: 'module' object has no attribute 'PROTOCOL_TLS_CLIENT' > >> > """ > >> > > >> > On the being-installed system, ignoring the typical Linux stuff, the > >> > output of 'ps -elfH' has: > >> > > >> > """ > >> > > >> > 4 S root 1247 1 0 80 0 - 7499 do_pol 17:53 ? 
> >> > 00:00:00 /usr/bin/python3 /usr/bin/networkd-dispatcher > >> > --run-startup-triggers > >> > 4 S root 1248 1 0 80 0 - 58623 do_pol 17:53 ? > >> > 00:00:00 /usr/libexec/polkitd --no-debug > >> > 4 S syslog 1250 1 0 80 0 - 55600 do_sel 17:53 ? > >> > 00:00:00 /usr/sbin/rsyslogd -n -iNONE > >> > 4 S root 1252 1 0 80 0 - 385081 futex_ 17:53 ? > >> > 00:00:03 /usr/lib/snapd/snapd > >> > 4 S root 1253 1 0 80 0 - 3831 ep_pol 17:53 ? > >> > 00:00:00 /lib/systemd/systemd-logind > >> > 4 S root 1255 1 0 80 0 - 98198 do_pol 17:53 ? > >> > 00:00:02 /usr/libexec/udisks2/udisksd > >> > 4 S root 1283 1 0 80 0 - 26778 do_pol 17:53 ? > >> > 00:00:00 /usr/bin/python3 > >> > /usr/share/unattended-upgrades/unattended-upgrade-shutdown > >> > --wait-for-signal > >> > 4 S root 1291 1 0 80 0 - 61055 do_pol 17:53 ? > >> > 00:00:00 /usr/sbin/ModemManager > >> > 4 S root 2042 1 0 80 0 - 722 do_wai 17:53 ? > >> > 00:00:00 /bin/sh /snap/subiquity/5004/usr/bin/subiquity-server > >> > 4 S root 2086 2042 0 80 0 - 149574 ep_pol 17:53 ? > >> > 00:00:07 /snap/subiquity/5004/usr/bin/python3.10 -m > >> > subiquity.cmd.server > >> > 4 S root 27499 2086 0 80 0 - 722 do_wai 18:09 ? > >> > 00:00:00 sh -c /custom-installation/post.sh > >> > 4 S root 27501 27499 0 80 0 - 1150 do_wai 18:09 ? > >> > 00:00:00 /bin/bash /custom-installation/post.sh > >> > 4 S root 27588 27501 4 80 0 - 7403 do_pol 18:09 ? > >> > 00:03:16 /usr/bin/python3 /opt/confluent/bin/apiclient > >> > /confluent-api/self/remoteconfig/status -w 204 > >> > 4 S root 2049 1 0 80 0 - 24167 ep_pol 17:53 tty1 > >> > 00:00:05 /snap/subiquity/5004/usr/bin/python3.10 > >> > /snap/subiquity/5004/usr/bin/subiquity > >> > 4 S root 2137 1 0 80 0 - 3855 do_pol 17:53 ? > >> > 00:00:00 sshd: /usr/sbin/sshd -D [listener] 0 of 10-100 startups > >> > 4 S root 37842 2137 0 80 0 - 4310 - 19:15 ? > >> > 00:00:00 sshd: root@pts/0 > >> > 4 S root 37952 37842 0 80 0 - 1543 do_wai 19:15 ? > >> > 00:00:00 -bash > >> > 4 R root 38032 37952 0 80 0 - 1911 - 19:16 ? 
> >> > 00:00:00 ps -elfH > >> > 4 S root 2206 1 0 80 0 - 3266 ep_pol 17:53 ? > >> > 00:00:00 /lib/netplan/netplan-dbus > >> > 4 S root 2570 1 0 80 0 - 73244 do_pol 17:53 ? > >> > 00:00:00 /usr/libexec/packagekitd > >> > 4 S root 37848 1 1 80 0 - 4301 ep_pol 19:15 ? > >> > 00:00:00 /lib/systemd/systemd --user > >> > 5 S root 37850 37848 0 80 0 - 26271 do_sig 19:15 ? > >> > 00:00:00 (sd-pam) > >> > """ > >> > > >> > While 'ps axf' produces (trimmed): > >> > > >> > """ > >> > 2042 ? Ss 0:00 /bin/sh > >> > /snap/subiquity/5004/usr/bin/subiquity-server > >> > 2086 ? Sl 0:07 \_ /snap/subiquity/5004/usr/bin/python3.10 -m > >> > subiquity.cmd.server > >> > 27499 ? S 0:00 \_ sh -c /custom-installation/post.sh > >> > 27501 ? S 0:00 \_ /bin/bash > >> > /custom-installation/post.sh > >> > 27588 ? S 3:21 \_ /usr/bin/python3 > >> > /opt/confluent/bin/apiclient /confluent-api/self/remoteconfig/status -w > >> > 204 > >> > 2049 tty1 Ss+ 0:05 /snap/subiquity/5004/usr/bin/python3.10 > >> > /snap/subiquity/5004/usr/bin/subiquity > >> > """ > >> > > >> > Doing a "kill -9 27588" (on apiclient) causes the installation to > >> > 'finish'. After the reboot, and after "firstboot.sh" does its thing, we > >> > have the following from 'ps axf': > >> > > >> > """ > >> > 1372 ? Ss 0:00 /usr/bin/python3 /usr/bin/cloud-init modules > >> > --mode=final > >> > 1376 ? S 0:00 \_ /bin/sh -c tee -a > >> > /var/log/cloud-init-output.log > >> > 1377 ? S 0:00 | \_ tee -a /var/log/cloud-init-output.log > >> > 1378 ? S 0:00 \_ /bin/sh > >> > /var/lib/cloud/instance/scripts/runcmd > >> > 1379 ? S 0:00 \_ /bin/bash /etc/confluent/firstboot.sh > >> > 1429 ? S 0:01 \_ /usr/bin/python3 > >> > /opt/confluent/bin/apiclient /confluent-api/self/remoteconfig/status -w > >> > 204 > >> > """ > >> > > >> > This causes the "/var/log/httpd/ssl_access_log" to start filling up. 
A > >> > subsequent reboot, where "firstboot.sh" is not run, has the system > >> > coming up without "apiclient" running, and so there's no longer 'spam' in > >> > "ssl_access_log". > >> > > >> > Running "apiclient" manually from the CLI with the exact options causes a > >> > bunch of stuff in "ssl_access_log": > >> > > >> > """ > >> > fe80::[EUI-64] - - [25/Jan/2024:14:52:15 -0500] "GET > >> > /confluent-api/self/remoteconfig/status HTTP/1.1" 200 - > >> > """ > >> > > >> > at the same time as the above is being generated, there is nothing in > >> > "/var/log/confluent/trace" or "stderr". > >> > > >> > > >> > On Thu, January 25, 2024 07:52, Jarrod Johnson wrote: > >> >> Anything in /var/log/confluent/stderr or /var/log/confluent/trace? Also > >> >> would be tempted to see if 'confluent_selfcheck' has any suggestions. > >> >> You > >> >> can also ssh into the node during that phase to confirm what it is doing > >> >> while it is seemingly hung, e.g. looking at ps axf > >> >> ________________________________ > >> >> From: David Magda <dma...@ee...> > >> >> Sent: Wednesday, January 24, 2024 9:37 PM > >> >> To: xCA...@li... <xCA...@li...> > >> >> Subject: [External] [xcat-user] Ansible and Confluent > >> >> > >> >> Hello, > >> >> > >> >> I'm trying to get Ansible working with Confluent 3.8.0. (Using an older > >> >> version due to legacy OS reasons.) > >> >> > >> >> In /var/lib/confluent/public/os/ I created a new profile called > >> >> ubuntu-22.04.3-x86_64-test1/, and this seems to work just fine: I took > >> >> the > >> >> provided "autoinstall/user-data" file, added some partition stanzas, > >> >> some > >> >> packages, etc. 
> >> >> > >> >> Once I sorted out a 'basic' automated Ubuntu install I tried creating an > >> >> "ansible/post.d/01-packages.yaml" file within the profile directory > >> >> with > >> >> the following contents: > >> >> > >> >> """ > >> >> - name: install chrony > >> >> apt: > >> >> pkg: > >> >> - chrony > >> >> """ > >> >> > >> >> The Ubuntu (subiquity) installer seems to 'hang' at: > >> >> > >> >> """ > >> >> start: subiquity/Late/run/command_1: /custom-installation/post.sh > >> >> """ > >> >> > >> >> which probably corresponds to this part of the "user-data" file: > >> >> > >> >> """ > >> >> late-commands: > >> >> - chroot /target apt-get -y -q purge snapd modemmanager > >> >> - /custom-installation/post.sh > >> >> """ > >> >> > >> >> When the 'hang' occurs the following starts filling up the > >> >> "/var/log/httpd/ssl_access_log" file of the Confluent/xcat server: > >> >> > >> >> """ > >> >> fe80::[EUI-64] - - [24/Jan/2024:11:15:08 -0500] "GET > >> >> /confluent-api/self/remoteconfig/status HTTP/1.1" 200 - > >> >> fe80::[EUI-64] - - [24/Jan/2024:11:15:08 -0500] "GET > >> >> /confluent-api/self/remoteconfig/status HTTP/1.1" 200 - > >> >> fe80::[EUI-64] - - [24/Jan/2024:11:15:08 -0500] "GET > >> >> /confluent-api/self/remoteconfig/status HTTP/1.1" 200 - > >> >> fe80::[EUI-64] - - [24/Jan/2024:11:15:08 -0500] "GET > >> >> /confluent-api/self/remoteconfig/status HTTP/1.1" 200 - > >> >> fe80::[EUI-64] - - [24/Jan/2024:11:15:08 -0500] "GET > >> >> /confluent-api/self/remoteconfig/status HTTP/1.1" 200 - > >> >> fe80::[EUI-64] - - [24/Jan/2024:11:15:08 -0500] "GET > >> >> /confluent-api/self/remoteconfig/status HTTP/1.1" 200 - > >> >> """ > >> >> > >> >> When I force a restart of the system/VM, it can boot off the disk, and > >> >> goes through the regular start-up process, including a bunch of > >> >> cloud-init > >> >> stuff. 
Though after it runs "/etc/confluent/firstboot.sh", the > >> >> "ssl_access_log" file once again starts filling with the > >> >> "remoteconfig/status" stuff per above. > >> >> > >> >> Renaming "ansible/" to "ansible_off/" seems to make the problem go away. > >> >> Similar behaviour with Ubuntu 20.04. > >> >> > >> >> I'm wondering what's going on with the 'hang' when "post.sh" is executed, > >> >> and > >> >> the flooding after "firstboot.sh". > >> >> > >> >> Regards, > >> >> David > >> > > >> > > >> > >> _______________________________________________ > >> xCAT-user mailing list > >> xCA...@li... > >> https://lists.sourceforge.net/lists/listinfo/xcat-user > >> _______________________________________________ > >> xCAT-user mailing list > >> xCA...@li... > >> https://lists.sourceforge.net/lists/listinfo/xcat-user > > > > > _______________________________________________ > xCAT-user mailing list > xCA...@li... 
> https://lists.sourceforge.net/lists/listinfo/xcat-user > _______________________________________________ > xCAT-user mailing list > xCA...@li... > https://lists.sourceforge.net/lists/listinfo/xcat-user |
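The apiclient invocation in the thread above ("apiclient /confluent-api/self/remoteconfig/status -w 204") appears to re-poll the status URL until the server answers with HTTP 204; until then, each attempt produces one "GET .../remoteconfig/status" line in ssl_access_log, which is the flooding being described. A minimal sketch of that wait loop, with a stub standing in for the real TLS request (the status sequence below is invented for illustration and is not confluent's actual protocol):

```python
import itertools
import time

def wait_for_status(fetch, wanted=204, delay=0):
    """Call fetch() until it returns the wanted HTTP status code.

    Returns the number of attempts made; each attempt corresponds to
    one access-log line on the server side.
    """
    for attempt in itertools.count(1):
        if fetch() == wanted:
            return attempt
        if delay:
            time.sleep(delay)

# Stub in place of the real HTTPS request: the server says
# "not finished yet" (200) twice, then "done" (204).
codes = iter([200, 200, 204])
print(wait_for_status(lambda: next(codes)))  # 3
```

If the server never reaches the "done" state (here, because the ansible phase never completes), a loop like this spins forever, which matches the hang and the log spam observed.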
From: Jarrod J. <jjo...@le...> - 2024-01-26 20:44:03
|
# ls /var/lib/confluent/public/site/ssh/*pubkey ________________________________ From: David Magda <dma...@ee...> Sent: Friday, January 26, 2024 3:40 PM To: xCAT Users Mailing list <xca...@li...> Subject: Re: [xcat-user] [External] Ansible and Confluent There’s no “syncfiles” in the default Ubuntu profile, nor anything in the web docs on its format, but I found a template in "/opt/confluent/lib/osdeploy/el9/profiles/default/syncfiles”. Created a file with the line: /etc/hosts -> /etc/hosts_test With the results: """ # nodeapply -F dm-boot1 dm-boot1: dm-boot1: --------------------------------------------------------------------------- dm-boot1: Running python script 'syncfileclient' from https://[fe80::749f:43ff:fe72:55e4]/confluent-public/os/ubuntu-22.04.3-x86_64-test1/scripts/ dm-boot1: Executing in /tmp/confluentscripts.HUGo3sMtt dm-boot1: Traceback (most recent call last): dm-boot1: File "/tmp/confluentscripts.HUGo3sMtt/syncfileclient", line 286, in <module> dm-boot1: synchronize() dm-boot1: File "/tmp/confluentscripts.HUGo3sMtt/syncfileclient", line 233, in synchronize dm-boot1: status, rsp = ac.grab_url_with_status('/confluent-api/self/remotesyncfiles') dm-boot1: File "/opt/confluent/bin/apiclient", line 413, in grab_url_with_status dm-boot1: raise Exception(rsp.read()) dm-boot1: Exception: b"500 - Command '['rsync', '-rvLD', '/tmp/tmpSUbmoD.synctodm-boot1/', 'root@[172.17.15.222]:/']' returned non-zero exit status 255" dm-boot1: 'syncfileclient' exited with code 1 """ In 
"/var/log/confluent/stderr” we have: """ Jan 26 15:28:53 File "/usr/lib64/python2.7/traceback.py", line 13, in _print file.write(str+terminator): Traceback (most recent call last): Jan 26 15:28:53 File "/usr/lib64/python2.7/traceback.py", line 13, in _print file.write(str+terminator): File "/usr/lib/python2.7/site-packages/eventlet/hubs/poll.py", line 111, in wait Jan 26 15:28:53 File "/usr/lib64/python2.7/traceback.py", line 13, in _print file.write(str+terminator): listener.cb(fileno) Jan 26 15:28:53 File "/usr/lib64/python2.7/traceback.py", line 13, in _print file.write(str+terminator): File "/usr/lib/python2.7/site-packages/eventlet/green/select.py", line 53, in on_read Jan 26 15:28:53 File "/usr/lib64/python2.7/traceback.py", line 13, in _print file.write(str+terminator): current.switch(([original], [], [])) Jan 26 15:28:53 File "/usr/lib64/python2.7/traceback.py", line 13, in _print file.write(str+terminator): File "/usr/lib/python2.7/site-packages/eventlet/greenthread.py", line 221, in main Jan 26 15:28:53 File "/usr/lib64/python2.7/traceback.py", line 13, in _print file.write(str+terminator): result = function(*args, **kwargs) Jan 26 15:28:53 File "/usr/lib64/python2.7/traceback.py", line 13, in _print file.write(str+terminator): File "/opt/confluent/lib/python/confluent/syncfiles.py", line 197, in sync_list_to_node Jan 26 15:28:53 File "/usr/lib64/python2.7/traceback.py", line 13, in _print file.write(str+terminator): ['rsync', '-rvLD', targdir + '/', 'root@[{}]:/'.format(targip)])[0] Jan 26 15:28:53 File "/usr/lib64/python2.7/traceback.py", line 13, in _print file.write(str+terminator): File "/opt/confluent/lib/python/confluent/util.py", line 45, in run Jan 26 15:28:53 File "/usr/lib64/python2.7/traceback.py", line 13, in _print file.write(str+terminator): raise subprocess.CalledProcessError(retcode, process.args, output=stdout) Jan 26 15:28:53 File "/usr/lib64/python2.7/traceback.py", line 13, in _print file.write(str+terminator): CalledProcessError: 
Command '['rsync', '-rvLD', '/tmp/tmpVbi9YY.synctodm-boot1/', 'root@[172.17.15.222]:/']' returned non-zero exit status 255 Jan 26 15:28:53 File "/usr/lib/python2.7/site-packages/eventlet/hubs/hub.py", line 317, in squelch_exception sys.stderr.write("Removing descriptor: %r\n" % (fileno,)): Removing descriptor: 65 """ And in “trace” we have: """ Jan 26 15:28:53 Traceback (most recent call last): File "/opt/confluent/lib/python/confluent/httpapi.py", line 612, in resourcehandler for rsp in resourcehandler_backend(env, start_response): File "/opt/confluent/lib/python/confluent/httpapi.py", line 636, in resourcehandler_backend for res in selfservice.handle_request(env, start_response): File "/opt/confluent/lib/python/confluent/selfservice.py", line 502, in handle_request status, output = syncfiles.get_syncresult(nodename) File "/opt/confluent/lib/python/confluent/syncfiles.py", line 321, in get_syncresult result = syncrunners[nodename].wait() File "/usr/lib/python2.7/site-packages/eventlet/greenthread.py", line 181, in wait return self._exit_event.wait() File "/usr/lib/python2.7/site-packages/eventlet/event.py", line 132, in wait current.throw(*self._exc) File "/usr/lib/python2.7/site-packages/eventlet/greenthread.py", line 221, in main result = function(*args, **kwargs) File "/opt/confluent/lib/python/confluent/syncfiles.py", line 197, in sync_list_to_node ['rsync', '-rvLD', targdir + '/', 'root@[{}]:/'.format(targip)])[0] File "/opt/confluent/lib/python/confluent/util.py", line 45, in run raise subprocess.CalledProcessError(retcode, process.args, output=stdout) CalledProcessError: Command '['rsync', '-rvLD', '/tmp/tmpVbi9YY.synctodm-boot1/', 'root@[172.17.15.222]:/']' returned non-zero exit status 255 """ > On Jan 26, 2024, at 15:01, Jarrod Johnson <jjo...@le...> wrote: > > Ok, another track (trying to compensate for not being able to use selfcheck). 
> > Can you try sticking some file in the profile's syncfiles, then do: > nodeapply -F <node> > > And see if any errors happen, either in output or in the /var/log/confluet area. > >> From: David Magda <dma...@ee...> >> Sent: Friday, January 26, 2024 2:01 PM >> To: xCAT Users Mailing list <xca...@li...> >> Subject: Re: [xcat-user] [External] Ansible and Confluent >> >> We have Confluent installed on a RH/CentOS 7 system that originally had/has xCat installed for deployment of our Lenovo hardware/HPC solution. I just installed it there as it was/is our 'install server'. (We don't want to touch it too much, as it was a previous team of folks that set things up, and there's been a lot of team churn.) >> >> I've attached the "hangtraces" to this message; hopefully the mailing list software will pass it along. I noticed “ipmi” in some of the paths, and for the record this is a VM running under Proxmox, and does not have any LOM configured: >> >> """ >> # nodeattrib dm-boot1 >> dm-boot1: crypted.selfapikey: ******** >> dm-boot1: deployment.apiarmed: >> dm-boot1: deployment.pendingprofile: ubuntu-22.04.3-x86_64-test1 >> dm-boot1: deployment.profile: >> dm-boot1: deployment.sealedapikey: >> dm-boot1: deployment.stagedprofile: >> dm-boot1: deployment.state: >> dm-boot1: deployment.state_detail: >> dm-boot1: deployment.useinsecureprotocols: always >> dm-boot1: dns.servers: 172.17.15.252,172.17.15.247,172.17.15.254 >> dm-boot1: groups: everything >> dm-boot1: net.hwaddr: 4e:78:df:d3:8d:59 >> dm-boot1: net.ipv4_address: 172.17.15.222/21 >> dm-boot1: net.ipv4_gateway: 172.17.8.254 >> """ >> >> Running an strace(1) on the 'apiclient' that runs as part of the "post.sh" process, we have a continuous poll/read/write stream: >> >> """ >> […] >> write(3, "\27\3\3\0\371Sm2\233\337\222n\221\377vZs\21\22S\10\351\232\321I7Y$R\370]\312"..., 254) = 254 >> read(3, 0x560b6949e8f3, 5) = -1 EAGAIN (Resource temporarily unavailable) >> poll([{fd=3, events=POLLIN}], 1, 15000) = 1 ([{fd=3, 
revents=POLLIN}]) >> read(3, "\27\3\3\0\226", 5) = 5 >> read(3, "\0055\271\274&\2464\237\242h\341\30\231\274\327g\224\344g\306\313\206\326\355x\307\341\331C\366H\331"..., 150) = 150 >> poll([{fd=3, events=POLLOUT}], 1, 15000) = 1 ([{fd=3, revents=POLLOUT}]) >> write(3, "\27\3\3\0\371Sm2\233\337\222n\222\334e\336f\353u\343p\22\215:\264e\30a\3172\245\361"..., 254) = 254 >> read(3, 0x560b6949e8f3, 5) = -1 EAGAIN (Resource temporarily unavailable) >> poll([{fd=3, events=POLLIN}], 1, 15000) = 1 ([{fd=3, revents=POLLIN}]) >> read(3, "\27\3\3\0\226", 5) = 5 >> read(3, "\0055\271\274&\2464\240\326\202\347(\213\311\260|\333\230\372A\235\341\273U\201\223\2209ah\325J"..., 150) = 150 >> poll([{fd=3, events=POLLOUT}], 1, 15000) = 1 ([{fd=3, revents=POLLOUT}]) >> write(3, "\27\3\3\0\371Sm2\233\337\222n\223\240\341<\3602\323\177Y\311\317/\371\336P/s\301t8"..., 254) = 254 >> read(3, 0x560b6949e8f3, 5) = -1 EAGAIN (Resource temporarily unavailable) >> poll([{fd=3, events=POLLIN}], 1, 15000^Cstrace: Process 27477 detached >> <detached ...> >> """ >> >> Per lsof(1), FD 3 is: >> >> """ >> python3 27477 root 3u IPv6 158157 0t0 TCP [fe80::[EUI-64_client]]:44800->[fe80::[EUI-64_server]]:https (ESTABLISHED) >> """ >> >> >> >> On Thu, January 25, 2024 16:34, Jarrod Johnson wrote: >> > What is the OS of the deployment server? >> > >> > kill -USR1 $(cat /var/run/confluent/pid) >> > >> > This should produce a /var/log/confluent/hangtraces >> > >> > Would be interesting to see if there's ansible related stacks in >> > hangtraces that seem stuck... 
>> > >> > >> > ________________________________ >> > From: David Magda <dma...@ee...> >> > Sent: Thursday, January 25, 2024 4:25 PM >> > To: xCAT Users Mailing list <xca...@li...> >> > Subject: Re: [xcat-user] [External] Ansible and Confluent >> > >> > First suggested command: >> > >> > """ >> > # confluent_selfcheck >> > OS Deployment: Initialized >> > Confluent UUID: Consistent >> > Web Server: Running >> > Web Certificate: Traceback (most recent call last): >> > File "/opt/confluent/bin/confluent_selfcheck", line 178, in <module> >> > cert = certificates_missing_ips(conn) >> > File "/opt/confluent/bin/confluent_selfcheck", line 57, in >> > certificates_missing_ips >> > ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT) >> > AttributeError: 'module' object has no attribute 'PROTOCOL_TLS_CLIENT' >> > """ >> > >> > On the being-installed system, ignoring the typical Linux stuff, the >> > output of 'ps -elfH' has: >> > >> > """ >> > >> > 4 S root 1247 1 0 80 0 - 7499 do_pol 17:53 ? >> > 00:00:00 /usr/bin/python3 /usr/bin/networkd-dispatcher >> > --run-startup-triggers >> > 4 S root 1248 1 0 80 0 - 58623 do_pol 17:53 ? >> > 00:00:00 /usr/libexec/polkitd --no-debug >> > 4 S syslog 1250 1 0 80 0 - 55600 do_sel 17:53 ? >> > 00:00:00 /usr/sbin/rsyslogd -n -iNONE >> > 4 S root 1252 1 0 80 0 - 385081 futex_ 17:53 ? >> > 00:00:03 /usr/lib/snapd/snapd >> > 4 S root 1253 1 0 80 0 - 3831 ep_pol 17:53 ? >> > 00:00:00 /lib/systemd/systemd-logind >> > 4 S root 1255 1 0 80 0 - 98198 do_pol 17:53 ? >> > 00:00:02 /usr/libexec/udisks2/udisksd >> > 4 S root 1283 1 0 80 0 - 26778 do_pol 17:53 ? >> > 00:00:00 /usr/bin/python3 >> > /usr/share/unattended-upgrades/unattended-upgrade-shutdown >> > --wait-for-signal >> > 4 S root 1291 1 0 80 0 - 61055 do_pol 17:53 ? >> > 00:00:00 /usr/sbin/ModemManager >> > 4 S root 2042 1 0 80 0 - 722 do_wai 17:53 ? >> > 00:00:00 /bin/sh /snap/subiquity/5004/usr/bin/subiquity-server >> > 4 S root 2086 2042 0 80 0 - 149574 ep_pol 17:53 ? 
>> > 00:00:07 /snap/subiquity/5004/usr/bin/python3.10 -m >> > subiquity.cmd.server >> > 4 S root 27499 2086 0 80 0 - 722 do_wai 18:09 ? >> > 00:00:00 sh -c /custom-installation/post.sh >> > 4 S root 27501 27499 0 80 0 - 1150 do_wai 18:09 ? >> > 00:00:00 /bin/bash /custom-installation/post.sh >> > 4 S root 27588 27501 4 80 0 - 7403 do_pol 18:09 ? >> > 00:03:16 /usr/bin/python3 /opt/confluent/bin/apiclient >> > /confluent-api/self/remoteconfig/status -w 204 >> > 4 S root 2049 1 0 80 0 - 24167 ep_pol 17:53 tty1 >> > 00:00:05 /snap/subiquity/5004/usr/bin/python3.10 >> > /snap/subiquity/5004/usr/bin/subiquity >> > 4 S root 2137 1 0 80 0 - 3855 do_pol 17:53 ? >> > 00:00:00 sshd: /usr/sbin/sshd -D [listener] 0 of 10-100 startups >> > 4 S root 37842 2137 0 80 0 - 4310 - 19:15 ? >> > 00:00:00 sshd: root@pts/0 >> > 4 S root 37952 37842 0 80 0 - 1543 do_wai 19:15 ? >> > 00:00:00 -bash >> > 4 R root 38032 37952 0 80 0 - 1911 - 19:16 ? >> > 00:00:00 ps -elfH >> > 4 S root 2206 1 0 80 0 - 3266 ep_pol 17:53 ? >> > 00:00:00 /lib/netplan/netplan-dbus >> > 4 S root 2570 1 0 80 0 - 73244 do_pol 17:53 ? >> > 00:00:00 /usr/libexec/packagekitd >> > 4 S root 37848 1 1 80 0 - 4301 ep_pol 19:15 ? >> > 00:00:00 /lib/systemd/systemd --user >> > 5 S root 37850 37848 0 80 0 - 26271 do_sig 19:15 ? >> > 00:00:00 (sd-pam) >> > """ >> > >> > While 'ps axf' produces (trimmed): >> > >> > """ >> > 2042 ? Ss 0:00 /bin/sh >> > /snap/subiquity/5004/usr/bin/subiquity-server >> > 2086 ? Sl 0:07 \_ /snap/subiquity/5004/usr/bin/python3.10 -m >> > subiquity.cmd.server >> > 27499 ? S 0:00 \_ sh -c /custom-installation/post.sh >> > 27501 ? S 0:00 \_ /bin/bash >> > /custom-installation/post.sh >> > 27588 ? 
S 3:21 \_ /usr/bin/python3 >> > /opt/confluent/bin/apiclient /confluent-api/self/remoteconfig/status -w >> > 204 >> > 2049 tty1 Ss+ 0:05 /snap/subiquity/5004/usr/bin/python3.10 >> > /snap/subiquity/5004/usr/bin/subiquity >> > """ >> > >> > Doing a "kill -9 27588" (on apiclient) causes the installation to >> > 'finish'. After the reboot, and after "firstboot.sh" does its thing, we >> > have the following from 'ps axf': >> > >> > """ >> > 1372 ? Ss 0:00 /usr/bin/python3 /usr/bin/cloud-init modules >> > --mode=final >> > 1376 ? S 0:00 \_ /bin/sh -c tee -a >> > /var/log/cloud-init-output.log >> > 1377 ? S 0:00 | \_ tee -a /var/log/cloud-init-output.log >> > 1378 ? S 0:00 \_ /bin/sh >> > /var/lib/cloud/instance/scripts/runcmd >> > 1379 ? S 0:00 \_ /bin/bash /etc/confluent/firstboot.sh >> > 1429 ? S 0:01 \_ /usr/bin/python3 >> > /opt/confluent/bin/apiclient /confluent-api/self/remoteconfig/status -w >> > 204 >> > """ >> > >> > This causes the "/var/log/httpd/ssl_access_log" to start filling up. A >> > subsequent reboot, where "firstboot.sh" is not run, has the system >> > coming up without "apiclient" running, and so there's no longer 'spam' in >> > "ssl_access_log". >> > >> > Running "apiclient" manually from the CLI with the exact options causes a >> > bunch of stuff in "ssl_access_log": >> > >> > """ >> > fe80::[EUI-64] - - [25/Jan/2024:14:52:15 -0500] "GET >> > /confluent-api/self/remoteconfig/status HTTP/1.1" 200 - >> > """ >> > >> > at the same time as the above is being generated, there is nothing in >> > "/var/log/confluent/trace" or "stderr". >> > >> > >> > On Thu, January 25, 2024 07:52, Jarrod Johnson wrote: >> >> Anything in /var/log/confluent/stderr or /var/log/confluent/trace? Also >> >> would be tempted to see if 'confluent_selfcheck' has any suggestions. >> >> You >> >> can also ssh into the node during that phase to confirm what it is doing >> >> while it is seemingly hung, e.g. 
looking at ps axf >> >> ________________________________ >> >> From: David Magda <dma...@ee...> >> >> Sent: Wednesday, January 24, 2024 9:37 PM >> >> To: xCA...@li... <xCA...@li...> >> >> Subject: [External] [xcat-user] Ansible and Confluent >> >> >> >> Hello, >> >> >> >> I'm trying to get Ansible working with Confluent 3.8.0. (Using an older >> >> version due to legacy OS reasons.) >> >> >> >> In /var/lib/confluent/public/os/ I created a new profile called >> >> ubuntu-22.04.3-x86_64-test1/, and this seems to work just fine: I took >> >> the >> >> provided "autoinstall/user-data" file, added some partition stanzas, >> >> some >> >> packages, etc. >> >> >> >> Once I sorted out a 'basic' automated Ubuntu install I tried creating a >> >> "ansible/post.d/01-packages.yaml" file with-in the profile directory >> >> with >> >> the following contents: >> >> >> >> """ >> >> - name: install chrony >> >> apt: >> >> pkg: >> >> - chrony >> >> """ >> >> >> >> The Ubuntu (subiquity) installer seems to 'hang' at: >> >> >> >> """ >> >> start: subiquity/Late/run/command_1: /custom-installation/post.sh >> >> """ >> >> >> >> which probably corresponds to this part of the "user-data" file: >> >> >> >> """ >> >> late-commands: >> >> - chroot /target apt-get -y -q purge snapd modemmanager >> >> - /custom-installation/post.sh >> >> """ >> >> >> >> When the 'hang' occurs the following starts filling up the >> >> "/var/log/httpd/ssl_access_log" file of the Confluent/xcat server: >> >> >> >> """ >> >> fe80::[EUI-64] - - [24/Jan/2024:11:15:08 -0500] "GET >> >> /confluent-api/self/remoteconfig/status HTTP/1.1" 200 - >> >> fe80::[EUI-64] - - [24/Jan/2024:11:15:08 -0500] "GET >> >> /confluent-api/self/remoteconfig/status HTTP/1.1" 200 - >> >> fe80::[EUI-64] - - [24/Jan/2024:11:15:08 -0500] "GET >> >> /confluent-api/self/remoteconfig/status HTTP/1.1" 200 - >> >> fe80::[EUI-64] - - [24/Jan/2024:11:15:08 -0500] "GET >> >> /confluent-api/self/remoteconfig/status HTTP/1.1" 200 - >> >> fe80::[EUI-64] 
- - [24/Jan/2024:11:15:08 -0500] "GET >> >> /confluent-api/self/remoteconfig/status HTTP/1.1" 200 - >> >> fe80::[EUI-64] - - [24/Jan/2024:11:15:08 -0500] "GET >> >> /confluent-api/self/remoteconfig/status HTTP/1.1" 200 - >> >> """ >> >> >> >> When I force a restart of the system/VM, it can boot off the disk, and >> >> goes through the regular start-up process, including a bunch of >> >> cloud-init >> >> stuff. Though after it runs "/etc/confluent/firstboot.sh", the >> >> "ssl_access_log" file once again starts filling with the >> >> "remoteconfig/status" stuff per above. >> >> >> >> Renaming "ansible/" to "ansible_off/" seems to make the problem go away. >> >> Similar behaviour with Ubuntu 20.04. >> >> >> >> I'm wondering what's going on with the 'hang' when "post.sh" is executed, >> >> and >> >> the flooding after "firstboot.sh". >> >> >> >> Regards, >> >> David >> > >> > >> >> _______________________________________________ >> xCAT-user mailing list >> xCA...@li... >> https://lists.sourceforge.net/lists/listinfo/xcat-user >> _______________________________________________ >> xCAT-user mailing list >> xCA...@li... 
>> https://lists.sourceforge.net/lists/listinfo/xcat-user > _______________________________________________ xCAT-user mailing list xCA...@li... https://lists.sourceforge.net/lists/listinfo/xcat-user |
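The rsync failure quoted in this thread ("returned non-zero exit status 255") typically means the remote shell that rsync spawns, here ssh to the node, failed before any transfer started; ssh itself exits with 255 on connection or authentication errors and rsync passes that through. That is consistent with the pubkey check suggested above. A small triage helper; the code-to-meaning mapping paraphrases the EXIT VALUES section of rsync(1), and the ssh command in the comment reuses the node address from the error message:

```python
def explain_rsync_status(code):
    """Rough triage of rsync exit codes (paraphrased from rsync(1))."""
    meanings = {
        0: "success",
        23: "partial transfer due to error (permissions, vanished files)",
        255: "the remote shell (ssh) failed: check keys and host access",
    }
    return meanings.get(code, "see the EXIT VALUES section of rsync(1)")

print(explain_rsync_status(255))
# To test the ssh layer on its own before re-running nodeapply:
#   ssh -o BatchMode=yes root@172.17.15.222 true; echo "ssh exit: $?"
```

If the BatchMode ssh test fails, the problem is key distribution or host-based access on the node, not the syncfiles mechanism itself.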
From: David M. <dma...@ee...> - 2024-01-26 20:40:39
|
There’s no “syncfiles” in the default Ubuntu profile, nor anything in the web docs on its format, but I found a template in "/opt/confluent/lib/osdeploy/el9/profiles/default/syncfiles”. Created a file with the line: /etc/hosts -> /etc/hosts_test With the results: """ # nodeapply -F dm-boot1 dm-boot1: dm-boot1: --------------------------------------------------------------------------- dm-boot1: Running python script 'syncfileclient' from https://[fe80::749f:43ff:fe72:55e4%2]/confluent-public/os/ubuntu-22.04.3-x86_64-test1/scripts/ dm-boot1: Executing in /tmp/confluentscripts.HUGo3sMtt dm-boot1: Traceback (most recent call last): dm-boot1: File "/tmp/confluentscripts.HUGo3sMtt/syncfileclient", line 286, in <module> dm-boot1: synchronize() dm-boot1: File "/tmp/confluentscripts.HUGo3sMtt/syncfileclient", line 233, in synchronize dm-boot1: status, rsp = ac.grab_url_with_status('/confluent-api/self/remotesyncfiles') dm-boot1: File "/opt/confluent/bin/apiclient", line 413, in grab_url_with_status dm-boot1: raise Exception(rsp.read()) dm-boot1: Exception: b"500 - Command '['rsync', '-rvLD', '/tmp/tmpSUbmoD.synctodm-boot1/', 'root@[172.17.15.222]:/']' returned non-zero exit status 255" dm-boot1: 'syncfileclient' exited with code 1 """ In "/var/log/confluent/stderr” we have: """ Jan 26 15:28:53 File "/usr/lib64/python2.7/traceback.py", line 13, in _print file.write(str+terminator): Traceback (most recent call last): Jan 26 15:28:53 File "/usr/lib64/python2.7/traceback.py", line 13, in _print file.write(str+terminator): File "/usr/lib/python2.7/site-packages/eventlet/hubs/poll.py", line 111, in wait Jan 26 15:28:53 File "/usr/lib64/python2.7/traceback.py", line 13, in _print file.write(str+terminator): listener.cb(fileno) Jan 26 15:28:53 File "/usr/lib64/python2.7/traceback.py", line 13, in _print file.write(str+terminator): File "/usr/lib/python2.7/site-packages/eventlet/green/select.py", line 53, in on_read Jan 26 15:28:53 File "/usr/lib64/python2.7/traceback.py", line 13, 
in _print file.write(str+terminator): current.switch(([original], [], [])) Jan 26 15:28:53 File "/usr/lib64/python2.7/traceback.py", line 13, in _print file.write(str+terminator): File "/usr/lib/python2.7/site-packages/eventlet/greenthread.py", line 221, in main Jan 26 15:28:53 File "/usr/lib64/python2.7/traceback.py", line 13, in _print file.write(str+terminator): result = function(*args, **kwargs) Jan 26 15:28:53 File "/usr/lib64/python2.7/traceback.py", line 13, in _print file.write(str+terminator): File "/opt/confluent/lib/python/confluent/syncfiles.py", line 197, in sync_list_to_node Jan 26 15:28:53 File "/usr/lib64/python2.7/traceback.py", line 13, in _print file.write(str+terminator): ['rsync', '-rvLD', targdir + '/', 'root@[{}]:/'.format(targip)])[0] Jan 26 15:28:53 File "/usr/lib64/python2.7/traceback.py", line 13, in _print file.write(str+terminator): File "/opt/confluent/lib/python/confluent/util.py", line 45, in run Jan 26 15:28:53 File "/usr/lib64/python2.7/traceback.py", line 13, in _print file.write(str+terminator): raise subprocess.CalledProcessError(retcode, process.args, output=stdout) Jan 26 15:28:53 File "/usr/lib64/python2.7/traceback.py", line 13, in _print file.write(str+terminator): CalledProcessError: Command '['rsync', '-rvLD', '/tmp/tmpVbi9YY.synctodm-boot1/', 'root@[172.17.15.222]:/']' returned non-zero exit status 255 Jan 26 15:28:53 File "/usr/lib/python2.7/site-packages/eventlet/hubs/hub.py", line 317, in squelch_exception sys.stderr.write("Removing descriptor: %r\n" % (fileno,)): Removing descriptor: 65 """ And in “trace” we have: """ Jan 26 15:28:53 Traceback (most recent call last): File "/opt/confluent/lib/python/confluent/httpapi.py", line 612, in resourcehandler for rsp in resourcehandler_backend(env, start_response): File "/opt/confluent/lib/python/confluent/httpapi.py", line 636, in resourcehandler_backend for res in selfservice.handle_request(env, start_response): File "/opt/confluent/lib/python/confluent/selfservice.py", 
line 502, in handle_request status, output = syncfiles.get_syncresult(nodename) File "/opt/confluent/lib/python/confluent/syncfiles.py", line 321, in get_syncresult result = syncrunners[nodename].wait() File "/usr/lib/python2.7/site-packages/eventlet/greenthread.py", line 181, in wait return self._exit_event.wait() File "/usr/lib/python2.7/site-packages/eventlet/event.py", line 132, in wait current.throw(*self._exc) File "/usr/lib/python2.7/site-packages/eventlet/greenthread.py", line 221, in main result = function(*args, **kwargs) File "/opt/confluent/lib/python/confluent/syncfiles.py", line 197, in sync_list_to_node ['rsync', '-rvLD', targdir + '/', 'root@[{}]:/'.format(targip)])[0] File "/opt/confluent/lib/python/confluent/util.py", line 45, in run raise subprocess.CalledProcessError(retcode, process.args, output=stdout) CalledProcessError: Command '['rsync', '-rvLD', '/tmp/tmpVbi9YY.synctodm-boot1/', 'root@[172.17.15.222]:/']' returned non-zero exit status 255 """ > On Jan 26, 2024, at 15:01, Jarrod Johnson <jjo...@le...> wrote: > > Ok, another track (trying to compensate for not being able to use selfcheck). > > Can you try sticking some file in the profile's syncfiles, then do: > nodeapply -F <node> > > And see if any errors happen, either in output or in the /var/log/confluet area. > >> From: David Magda <dma...@ee...> >> Sent: Friday, January 26, 2024 2:01 PM >> To: xCAT Users Mailing list <xca...@li...> >> Subject: Re: [xcat-user] [External] Ansible and Confluent >> >> We have Confluent installed on a RH/CentOS 7 system that originally had/has xCat installed for deployment of our Lenovo hardware/HPC solution. I just installed it there as it was/is our 'install server'. (We don't want to touch it too much, as it was a previous team of folks that set things up, and there's been a lot of team churn.) >> >> I've attached the "hangtraces" to this message; hopefully the mailing list software will pass it along. 
I noticed “ipmi” in some of the paths, and for the record this is a VM running under Proxmox, and does not have any LOM configured: >> >> """ >> # nodeattrib dm-boot1 >> dm-boot1: crypted.selfapikey: ******** >> dm-boot1: deployment.apiarmed: >> dm-boot1: deployment.pendingprofile: ubuntu-22.04.3-x86_64-test1 >> dm-boot1: deployment.profile: >> dm-boot1: deployment.sealedapikey: >> dm-boot1: deployment.stagedprofile: >> dm-boot1: deployment.state: >> dm-boot1: deployment.state_detail: >> dm-boot1: deployment.useinsecureprotocols: always >> dm-boot1: dns.servers: 172.17.15.252,172.17.15.247,172.17.15.254 >> dm-boot1: groups: everything >> dm-boot1: net.hwaddr: 4e:78:df:d3:8d:59 >> dm-boot1: net.ipv4_address: 172.17.15.222/21 >> dm-boot1: net.ipv4_gateway: 172.17.8.254 >> """ >> >> Running an strace(1) on the 'apiclient' that runs as part of the "post.sh" process, we have a continuous poll/read/write stream: >> >> """ >> […] >> write(3, "\27\3\3\0\371Sm2\233\337\222n\221\377vZs\21\22S\10\351\232\321I7Y$R\370]\312"..., 254) = 254 >> read(3, 0x560b6949e8f3, 5) = -1 EAGAIN (Resource temporarily unavailable) >> poll([{fd=3, events=POLLIN}], 1, 15000) = 1 ([{fd=3, revents=POLLIN}]) >> read(3, "\27\3\3\0\226", 5) = 5 >> read(3, "\0055\271\274&\2464\237\242h\341\30\231\274\327g\224\344g\306\313\206\326\355x\307\341\331C\366H\331"..., 150) = 150 >> poll([{fd=3, events=POLLOUT}], 1, 15000) = 1 ([{fd=3, revents=POLLOUT}]) >> write(3, "\27\3\3\0\371Sm2\233\337\222n\222\334e\336f\353u\343p\22\215:\264e\30a\3172\245\361"..., 254) = 254 >> read(3, 0x560b6949e8f3, 5) = -1 EAGAIN (Resource temporarily unavailable) >> poll([{fd=3, events=POLLIN}], 1, 15000) = 1 ([{fd=3, revents=POLLIN}]) >> read(3, "\27\3\3\0\226", 5) = 5 >> read(3, "\0055\271\274&\2464\240\326\202\347(\213\311\260|\333\230\372A\235\341\273U\201\223\2209ah\325J"..., 150) = 150 >> poll([{fd=3, events=POLLOUT}], 1, 15000) = 1 ([{fd=3, revents=POLLOUT}]) >> write(3, 
"\27\3\3\0\371Sm2\233\337\222n\223\240\341<\3602\323\177Y\311\317/\371\336P/s\301t8"..., 254) = 254 >> read(3, 0x560b6949e8f3, 5) = -1 EAGAIN (Resource temporarily unavailable) >> poll([{fd=3, events=POLLIN}], 1, 15000^Cstrace: Process 27477 detached >> <detached ...> >> """ >> >> Per lsof(1), FD 3 is: >> >> """ >> python3 27477 root 3u IPv6 158157 0t0 TCP [fe80::[EUI-64_client]]:44800->[fe80::[EUI-64_server]]:https (ESTABLISHED) >> """ >> >> >> >> On Thu, January 25, 2024 16:34, Jarrod Johnson wrote: >> > What is the OS of the deployment server? >> > >> > kill -USR1 $(cat /var/run/confluent/pid) >> > >> > This should produce a /var/log/confluennt/hangtraces >> > >> > Would be interesting to see if there's ansible related stacks in >> > hangtraces that seem stuck... >> > >> > >> > ________________________________ >> > From: David Magda <dma...@ee...> >> > Sent: Thursday, January 25, 2024 4:25 PM >> > To: xCAT Users Mailing list <xca...@li...> >> > Subject: Re: [xcat-user] [External] Ansible and Confluent >> > >> > First suggested command: >> > >> > """ >> > # confluent_selfcheck >> > OS Deployment: Initialized >> > Confluent UUID: Consistent >> > Web Server: Running >> > Web Certificate: Traceback (most recent call last): >> > File "/opt/confluent/bin/confluent_selfcheck", line 178, in <module> >> > cert = certificates_missing_ips(conn) >> > File "/opt/confluent/bin/confluent_selfcheck", line 57, in >> > certificates_missing_ips >> > ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT) >> > AttributeError: 'module' object has no attribute 'PROTOCOL_TLS_CLIENT' >> > """ >> > >> > On the being-installed system, ignoring the typical Linux stuff, the >> > output of 'ps -elfH' has: >> > >> > """ >> > >> > 4 S root 1247 1 0 80 0 - 7499 do_pol 17:53 ? >> > 00:00:00 /usr/bin/python3 /usr/bin/networkd-dispatcher >> > --run-startup-triggers >> > 4 S root 1248 1 0 80 0 - 58623 do_pol 17:53 ? 
>> > 00:00:00 /usr/libexec/polkitd --no-debug >> > 4 S syslog 1250 1 0 80 0 - 55600 do_sel 17:53 ? >> > 00:00:00 /usr/sbin/rsyslogd -n -iNONE >> > 4 S root 1252 1 0 80 0 - 385081 futex_ 17:53 ? >> > 00:00:03 /usr/lib/snapd/snapd >> > 4 S root 1253 1 0 80 0 - 3831 ep_pol 17:53 ? >> > 00:00:00 /lib/systemd/systemd-logind >> > 4 S root 1255 1 0 80 0 - 98198 do_pol 17:53 ? >> > 00:00:02 /usr/libexec/udisks2/udisksd >> > 4 S root 1283 1 0 80 0 - 26778 do_pol 17:53 ? >> > 00:00:00 /usr/bin/python3 >> > /usr/share/unattended-upgrades/unattended-upgrade-shutdown >> > --wait-for-signal >> > 4 S root 1291 1 0 80 0 - 61055 do_pol 17:53 ? >> > 00:00:00 /usr/sbin/ModemManager >> > 4 S root 2042 1 0 80 0 - 722 do_wai 17:53 ? >> > 00:00:00 /bin/sh /snap/subiquity/5004/usr/bin/subiquity-server >> > 4 S root 2086 2042 0 80 0 - 149574 ep_pol 17:53 ? >> > 00:00:07 /snap/subiquity/5004/usr/bin/python3.10 -m >> > subiquity.cmd.server >> > 4 S root 27499 2086 0 80 0 - 722 do_wai 18:09 ? >> > 00:00:00 sh -c /custom-installation/post.sh >> > 4 S root 27501 27499 0 80 0 - 1150 do_wai 18:09 ? >> > 00:00:00 /bin/bash /custom-installation/post.sh >> > 4 S root 27588 27501 4 80 0 - 7403 do_pol 18:09 ? >> > 00:03:16 /usr/bin/python3 /opt/confluent/bin/apiclient >> > /confluent-api/self/remoteconfig/status -w 204 >> > 4 S root 2049 1 0 80 0 - 24167 ep_pol 17:53 tty1 >> > 00:00:05 /snap/subiquity/5004/usr/bin/python3.10 >> > /snap/subiquity/5004/usr/bin/subiquity >> > 4 S root 2137 1 0 80 0 - 3855 do_pol 17:53 ? >> > 00:00:00 sshd: /usr/sbin/sshd -D [listener] 0 of 10-100 startups >> > 4 S root 37842 2137 0 80 0 - 4310 - 19:15 ? >> > 00:00:00 sshd: root@pts/0 >> > 4 S root 37952 37842 0 80 0 - 1543 do_wai 19:15 ? >> > 00:00:00 -bash >> > 4 R root 38032 37952 0 80 0 - 1911 - 19:16 ? >> > 00:00:00 ps -elfH >> > 4 S root 2206 1 0 80 0 - 3266 ep_pol 17:53 ? >> > 00:00:00 /lib/netplan/netplan-dbus >> > 4 S root 2570 1 0 80 0 - 73244 do_pol 17:53 ? 
>> > 00:00:00 /usr/libexec/packagekitd >> > 4 S root 37848 1 1 80 0 - 4301 ep_pol 19:15 ? >> > 00:00:00 /lib/systemd/systemd --user >> > 5 S root 37850 37848 0 80 0 - 26271 do_sig 19:15 ? >> > 00:00:00 (sd-pam) >> > """ >> > >> > While 'ps axf' produces (trimmed): >> > >> > """ >> > 2042 ? Ss 0:00 /bin/sh >> > /snap/subiquity/5004/usr/bin/subiquity-server >> > 2086 ? Sl 0:07 \_ /snap/subiquity/5004/usr/bin/python3.10 -m >> > subiquity.cmd.server >> > 27499 ? S 0:00 \_ sh -c /custom-installation/post.sh >> > 27501 ? S 0:00 \_ /bin/bash >> > /custom-installation/post.sh >> > 27588 ? S 3:21 \_ /usr/bin/python3 >> > /opt/confluent/bin/apiclient /confluent-api/self/remoteconfig/status -w >> > 204 >> > 2049 tty1 Ss+ 0:05 /snap/subiquity/5004/usr/bin/python3.10 >> > /snap/subiquity/5004/usr/bin/subiquity >> > """ >> > >> > Doing a "kill -9 27588" (on apiclient) causes the installation to >> > 'finish'. After the reboot, and after "firstboot.sh" does its thing, we >> > have the following from 'ps axf': >> > >> > """ >> > 1372 ? Ss 0:00 /usr/bin/python3 /usr/bin/cloud-init modules >> > --mode=final >> > 1376 ? S 0:00 \_ /bin/sh -c tee -a >> > /var/log/cloud-init-output.log >> > 1377 ? S 0:00 | \_ tee -a /var/log/cloud-init-output.log >> > 1378 ? S 0:00 \_ /bin/sh >> > /var/lib/cloud/instance/scripts/runcmd >> > 1379 ? S 0:00 \_ /bin/bash /etc/confluent/firstboot.sh >> > 1429 ? S 0:01 \_ /usr/bin/python3 >> > /opt/confluent/bin/apiclient /confluent-api/self/remoteconfig/status -w >> > 204 >> > """ >> > >> > This causes the "/var/log/httpd/ssl_access_log" to start filling up. A >> > subsequent reboot, where "firstboot.sh" is not run, has the system >> > coming up without "apiclient" running, and so there's no longer 'spam' in >> > "ssl_access_log". 
>> > >> > Running "apiclient" manually from the CLI with the exact options causes a >> > bunch of stuff in "ssl_access_log": >> > >> > """ >> > fe80::[EUI-64] - - [25/Jan/2024:14:52:15 -0500] "GET >> > /confluent-api/self/remoteconfig/status HTTP/1.1" 200 - >> > """ >> > >> > at the same time as the above is being generated, there is nothing in >> > "/var/log/confluent/trace" or "stderr". >> > >> > >> > On Thu, January 25, 2024 07:52, Jarrod Johnson wrote: >> >> Anything in /var/log/confluent/stderr or /var/log/confluent/trace? Also >> >> would be tempted to see if 'confluent_selfcheck' has any suggestions. >> >> You >> >> can also ssh into the node during that phase to confirm what it is doing >> >> while it is seemingly hung, e.g. looking at ps axf >> >> ________________________________ >> >> From: David Magda <dma...@ee...> >> >> Sent: Wednesday, January 24, 2024 9:37 PM >> >> To: xCA...@li... <xCA...@li...> >> >> Subject: [External] [xcat-user] Ansible and Confluent >> >> >> >> Hello, >> >> >> >> I'm trying to get Ansible working with Confluent 3.8.0. (Using an older >> >> version due to legacy OS reasons.) >> >> >> >> In /var/lib/confluent/public/os/ I created a new profile called >> >> ubuntu-22.04.3-x86_64-test1/, and this seems to work just fine: I took >> >> the >> >> provided "autoinstall/user-data" file, added some partition stanzas, >> >> some >> >> packages, etc. 
>> >> >> >> Once I sorted out a 'basic' automated Ubuntu install I tried creating a >> >> "ansible/post.d/01-packages.yaml" file with-in the profile directory >> >> with >> >> the following contents: >> >> >> >> """ >> >> - name: install chrony >> >> apt: >> >> pkg: >> >> - chrony >> >> """ >> >> >> >> The Ubuntu (subiquity) installer seems to 'hang' at: >> >> >> >> """ >> >> start: subiquity/Late/run/command_1: /custom-installation/post.sh >> >> """ >> >> >> >> which probably corresponds to this part of the "user-data" file: >> >> >> >> """ >> >> late-commands: >> >> - chroot /target apt-get -y -q purge snapd modemmanager >> >> - /custom-installation/post.sh >> >> """ >> >> >> >> When the 'hang' occurs the following starts filling up the >> >> "/var/log/httpd/ssl_access_log" file of the Confluent/xcat server: >> >> >> >> """ >> >> fe80::[EUI-64] - - [24/Jan/2024:11:15:08 -0500] "GET >> >> /confluent-api/self/remoteconfig/status HTTP/1.1" 200 - >> >> fe80::[EUI-64] - - [24/Jan/2024:11:15:08 -0500] "GET >> >> /confluent-api/self/remoteconfig/status HTTP/1.1" 200 - >> >> fe80::[EUI-64] - - [24/Jan/2024:11:15:08 -0500] "GET >> >> /confluent-api/self/remoteconfig/status HTTP/1.1" 200 - >> >> fe80::[EUI-64] - - [24/Jan/2024:11:15:08 -0500] "GET >> >> /confluent-api/self/remoteconfig/status HTTP/1.1" 200 - >> >> fe80::[EUI-64] - - [24/Jan/2024:11:15:08 -0500] "GET >> >> /confluent-api/self/remoteconfig/status HTTP/1.1" 200 - >> >> fe80::[EUI-64] - - [24/Jan/2024:11:15:08 -0500] "GET >> >> /confluent-api/self/remoteconfig/status HTTP/1.1" 200 - >> >> """ >> >> >> >> When I force a restart of the system/VM, it can boot off the disk, and >> >> goes through the regular start-up process, including a bunch of >> >> cloud-init >> >> stuff. Though after it runs "/etc/confluent/firstboot.sh", the >> >> "ssl_access_log" file once again starts filling with the >> >> "remoteconfig/status" stuff per above. 
>> >> Renaming "ansible/" to "ansible_off/" seems to make the problem go away. >> Similar behaviour with Ubuntu 20.04. >> >> I'm wondering what's going on with the 'hang' when "post.sh" is executed, >> and >> the flooding after "firstboot.sh". >> >> Regards, >> David > > >> _______________________________________________ >> xCAT-user mailing list >> xCA...@li... >> https://lists.sourceforge.net/lists/listinfo/xcat-user > |
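The CalledProcessError in the "trace" output above comes from confluent's util.run() helper wrapping rsync. As a rough illustration only (the real util.run is not reproduced here, and the function name is taken from the traceback), the pattern is: capture the child's output and raise when it exits nonzero. An rsync exit status of 255 generally means the underlying ssh transport failed outright, rather than an rsync-level error.

```python
import subprocess

def run(cmd):
    # Sketch of the run() helper pattern visible in the traceback:
    # capture combined stdout/stderr and raise CalledProcessError
    # when the child (e.g. rsync) exits nonzero.
    process = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                               stderr=subprocess.STDOUT)
    stdout, _ = process.communicate()
    retcode = process.poll()
    if retcode:
        raise subprocess.CalledProcessError(retcode, cmd, output=stdout)
    return stdout, retcode
```

So the 255 in the log is rsync relaying "ssh to root@[172.17.15.222] never got a usable connection", which is worth testing by hand from the deployment server.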
From: Jarrod J. <jjo...@le...> - 2024-01-26 20:01:17
Ok, another track (trying to compensate for not being able to use selfcheck). Can you try sticking some file in the profile's syncfiles, then do: nodeapply -F <node> And see if any errors happen, either in output or in the /var/log/confluet area. ________________________________ From: David Magda <dma...@ee...> Sent: Friday, January 26, 2024 2:01 PM To: xCAT Users Mailing list <xca...@li...> Subject: Re: [xcat-user] [External] Ansible and Confluent We have Confluent installed on a RH/CentOS 7 system that originally had/has xCat installed for deployment of our Lenovo hardware/HPC solution. I just installed it there as it was/is our 'install server'. (We don't want to touch it too much, as it was a previous team of folks that set things up, and there's been a lot of team churn.) I've attached the "hangtraces" to this message; hopefully the mailing list software will pass it along. I noticed “ipmi” in some of the paths, and for the record this is a VM running under Proxmox, and does not have any LOM configured: """ # nodeattrib dm-boot1 dm-boot1: crypted.selfapikey: ******** dm-boot1: deployment.apiarmed: dm-boot1: deployment.pendingprofile: ubuntu-22.04.3-x86_64-test1 dm-boot1: deployment.profile: dm-boot1: deployment.sealedapikey: dm-boot1: deployment.stagedprofile: dm-boot1: deployment.state: dm-boot1: deployment.state_detail: dm-boot1: deployment.useinsecureprotocols: always dm-boot1: dns.servers: 172.17.15.252,172.17.15.247,172.17.15.254 dm-boot1: groups: everything dm-boot1: net.hwaddr: 4e:78:df:d3:8d:59 dm-boot1: net.ipv4_address: 172.17.15.222/21 dm-boot1: net.ipv4_gateway: 172.17.8.254 """ Running an strace(1) on the 'apiclient' that runs as part of the "post.sh" process, we have a continuous poll/read/write stream: """ […] write(3, "\27\3\3\0\371Sm2\233\337\222n\221\377vZs\21\22S\10\351\232\321I7Y$R\370]\312"..., 254) = 254 read(3, 0x560b6949e8f3, 5) = -1 EAGAIN (Resource temporarily unavailable) poll([{fd=3, events=POLLIN}], 1, 15000) = 1 ([{fd=3, 
revents=POLLIN}]) read(3, "\27\3\3\0\226", 5) = 5 read(3, "\0055\271\274&\2464\237\242h\341\30\231\274\327g\224\344g\306\313\206\326\355x\307\341\331C\366H\331"..., 150) = 150 poll([{fd=3, events=POLLOUT}], 1, 15000) = 1 ([{fd=3, revents=POLLOUT}]) write(3, "\27\3\3\0\371Sm2\233\337\222n\222\334e\336f\353u\343p\22\215:\264e\30a\3172\245\361"..., 254) = 254 read(3, 0x560b6949e8f3, 5) = -1 EAGAIN (Resource temporarily unavailable) poll([{fd=3, events=POLLIN}], 1, 15000) = 1 ([{fd=3, revents=POLLIN}]) read(3, "\27\3\3\0\226", 5) = 5 read(3, "\0055\271\274&\2464\240\326\202\347(\213\311\260|\333\230\372A\235\341\273U\201\223\2209ah\325J"..., 150) = 150 poll([{fd=3, events=POLLOUT}], 1, 15000) = 1 ([{fd=3, revents=POLLOUT}]) write(3, "\27\3\3\0\371Sm2\233\337\222n\223\240\341<\3602\323\177Y\311\317/\371\336P/s\301t8"..., 254) = 254 read(3, 0x560b6949e8f3, 5) = -1 EAGAIN (Resource temporarily unavailable) poll([{fd=3, events=POLLIN}], 1, 15000^Cstrace: Process 27477 detached <detached ...> """ Per lsof(1), FD 3 is: """ python3 27477 root 3u IPv6 158157 0t0 TCP [fe80::[EUI-64_client]]:44800->[fe80::[EUI-64_server]]:https (ESTABLISHED) """ On Thu, January 25, 2024 16:34, Jarrod Johnson wrote: > What is the OS of the deployment server? > > kill -USR1 $(cat /var/run/confluent/pid) > > This should produce a /var/log/confluennt/hangtraces > > Would be interesting to see if there's ansible related stacks in > hangtraces that seem stuck... 
> > > ________________________________ > From: David Magda <dma...@ee...> > Sent: Thursday, January 25, 2024 4:25 PM > To: xCAT Users Mailing list <xca...@li...> > Subject: Re: [xcat-user] [External] Ansible and Confluent > > First suggested command: > > """ > # confluent_selfcheck > OS Deployment: Initialized > Confluent UUID: Consistent > Web Server: Running > Web Certificate: Traceback (most recent call last): > File "/opt/confluent/bin/confluent_selfcheck", line 178, in <module> > cert = certificates_missing_ips(conn) > File "/opt/confluent/bin/confluent_selfcheck", line 57, in > certificates_missing_ips > ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT) > AttributeError: 'module' object has no attribute 'PROTOCOL_TLS_CLIENT' > """ > > On the being-installed system, ignoring the typical Linux stuff, the > output of 'ps -elfH' has: > > """ > > 4 S root 1247 1 0 80 0 - 7499 do_pol 17:53 ? > 00:00:00 /usr/bin/python3 /usr/bin/networkd-dispatcher > --run-startup-triggers > 4 S root 1248 1 0 80 0 - 58623 do_pol 17:53 ? > 00:00:00 /usr/libexec/polkitd --no-debug > 4 S syslog 1250 1 0 80 0 - 55600 do_sel 17:53 ? > 00:00:00 /usr/sbin/rsyslogd -n -iNONE > 4 S root 1252 1 0 80 0 - 385081 futex_ 17:53 ? > 00:00:03 /usr/lib/snapd/snapd > 4 S root 1253 1 0 80 0 - 3831 ep_pol 17:53 ? > 00:00:00 /lib/systemd/systemd-logind > 4 S root 1255 1 0 80 0 - 98198 do_pol 17:53 ? > 00:00:02 /usr/libexec/udisks2/udisksd > 4 S root 1283 1 0 80 0 - 26778 do_pol 17:53 ? > 00:00:00 /usr/bin/python3 > /usr/share/unattended-upgrades/unattended-upgrade-shutdown > --wait-for-signal > 4 S root 1291 1 0 80 0 - 61055 do_pol 17:53 ? > 00:00:00 /usr/sbin/ModemManager > 4 S root 2042 1 0 80 0 - 722 do_wai 17:53 ? > 00:00:00 /bin/sh /snap/subiquity/5004/usr/bin/subiquity-server > 4 S root 2086 2042 0 80 0 - 149574 ep_pol 17:53 ? > 00:00:07 /snap/subiquity/5004/usr/bin/python3.10 -m > subiquity.cmd.server > 4 S root 27499 2086 0 80 0 - 722 do_wai 18:09 ? 
> 00:00:00 sh -c /custom-installation/post.sh > 4 S root 27501 27499 0 80 0 - 1150 do_wai 18:09 ? > 00:00:00 /bin/bash /custom-installation/post.sh > 4 S root 27588 27501 4 80 0 - 7403 do_pol 18:09 ? > 00:03:16 /usr/bin/python3 /opt/confluent/bin/apiclient > /confluent-api/self/remoteconfig/status -w 204 > 4 S root 2049 1 0 80 0 - 24167 ep_pol 17:53 tty1 > 00:00:05 /snap/subiquity/5004/usr/bin/python3.10 > /snap/subiquity/5004/usr/bin/subiquity > 4 S root 2137 1 0 80 0 - 3855 do_pol 17:53 ? > 00:00:00 sshd: /usr/sbin/sshd -D [listener] 0 of 10-100 startups > 4 S root 37842 2137 0 80 0 - 4310 - 19:15 ? > 00:00:00 sshd: root@pts/0 > 4 S root 37952 37842 0 80 0 - 1543 do_wai 19:15 ? > 00:00:00 -bash > 4 R root 38032 37952 0 80 0 - 1911 - 19:16 ? > 00:00:00 ps -elfH > 4 S root 2206 1 0 80 0 - 3266 ep_pol 17:53 ? > 00:00:00 /lib/netplan/netplan-dbus > 4 S root 2570 1 0 80 0 - 73244 do_pol 17:53 ? > 00:00:00 /usr/libexec/packagekitd > 4 S root 37848 1 1 80 0 - 4301 ep_pol 19:15 ? > 00:00:00 /lib/systemd/systemd --user > 5 S root 37850 37848 0 80 0 - 26271 do_sig 19:15 ? > 00:00:00 (sd-pam) > """ > > While 'ps axf' produces (trimmed): > > """ > 2042 ? Ss 0:00 /bin/sh > /snap/subiquity/5004/usr/bin/subiquity-server > 2086 ? Sl 0:07 \_ /snap/subiquity/5004/usr/bin/python3.10 -m > subiquity.cmd.server > 27499 ? S 0:00 \_ sh -c /custom-installation/post.sh > 27501 ? S 0:00 \_ /bin/bash > /custom-installation/post.sh > 27588 ? S 3:21 \_ /usr/bin/python3 > /opt/confluent/bin/apiclient /confluent-api/self/remoteconfig/status -w > 204 > 2049 tty1 Ss+ 0:05 /snap/subiquity/5004/usr/bin/python3.10 > /snap/subiquity/5004/usr/bin/subiquity > """ > > Doing a "kill -9 27588" (on apiclient) causes the installation to > 'finish'. After the reboot, and after "firshboot.sh" does its thing, we > have the following from 'ps axf': > > """ > 1372 ? Ss 0:00 /usr/bin/python3 /usr/bin/cloud-init modules > --mode=final > 1376 ? S 0:00 \_ /bin/sh -c tee -a > /var/log/cloud-init-output.log > 1377 ? 
S 0:00 | \_ tee -a /var/log/cloud-init-output.log > 1378 ? S 0:00 \_ /bin/sh > /var/lib/cloud/instance/scripts/runcmd > 1379 ? S 0:00 \_ /bin/bash /etc/confluent/firstboot.sh > 1429 ? S 0:01 \_ /usr/bin/python3 > /opt/confluent/bin/apiclient /confluent-api/self/remoteconfig/status -w > 204 > """ > > This causes the "/var/log/httpd/ssl_access_log" to start filling up. A > subsequent reboot, where "firstboot.sh" is not run, has the system > coming up without "apiclient" running, and so there's no longer 'spam' in > "ssl_access_log". > > Running "apiclient" manually from the CLI with the exact options causes a > bunch of stuff in "ssl_access_log": > > """ > fe80::[EUI-64] - - [25/Jan/2024:14:52:15 -0500] "GET > /confluent-api/self/remoteconfig/status HTTP/1.1" 200 - > """ > > at the same time as the above is being generated, there is nothing in > "/var/log/confluent/trace" or "stderr". > > > On Thu, January 25, 2024 07:52, Jarrod Johnson wrote: >> Anything in /var/log/confluent/stderr or /var/log/confluent/trace? Also >> would be tempted to see if 'confluent_selfcheck' has any suggestions. >> You >> can also ssh into the node during that phase to confirm what it is doing >> while it is seemingly hung, e.g. looking at ps axf >> ________________________________ >> From: David Magda <dma...@ee...> >> Sent: Wednesday, January 24, 2024 9:37 PM >> To: xCA...@li... <xCA...@li...> >> Subject: [External] [xcat-user] Ansible and Confluent >> >> Hello, >> >> I'm trying to get Ansible working with Confluent 3.8.0. (Using an older >> version due to legacy OS reasons.) >> >> In /var/lib/confluent/public/os/ I created a new profile called >> ubuntu-22.04.3-x86_64-test1/, and this seems to work just fine: I took >> the >> provided "autoinstall/user-data" file, added some partition stanzas, >> some >> packages, etc. 
>> >> Once I sorted out a 'basic' automated Ubuntu install I tried creating a >> "ansible/post.d/01-packages.yaml" file with-in the profile directory >> with >> the following contents: >> >> """ >> - name: install chrony >> apt: >> pkg: >> - chrony >> """ >> >> The Ubuntu (subiquity) installer seems to 'hang' at: >> >> """ >> start: subiquity/Late/run/command_1: /custom-installation/post.sh >> """ >> >> which probably corresponds to this part of the "user-data" file: >> >> """ >> late-commands: >> - chroot /target apt-get -y -q purge snapd modemmanager >> - /custom-installation/post.sh >> """ >> >> When the 'hang' occurs the following starts filling up the >> "/var/log/httpd/ssl_access_log" file of the Confluent/xcat server: >> >> """ >> fe80::[EUI-64] - - [24/Jan/2024:11:15:08 -0500] "GET >> /confluent-api/self/remoteconfig/status HTTP/1.1" 200 - >> fe80::[EUI-64] - - [24/Jan/2024:11:15:08 -0500] "GET >> /confluent-api/self/remoteconfig/status HTTP/1.1" 200 - >> fe80::[EUI-64] - - [24/Jan/2024:11:15:08 -0500] "GET >> /confluent-api/self/remoteconfig/status HTTP/1.1" 200 - >> fe80::[EUI-64] - - [24/Jan/2024:11:15:08 -0500] "GET >> /confluent-api/self/remoteconfig/status HTTP/1.1" 200 - >> fe80::[EUI-64] - - [24/Jan/2024:11:15:08 -0500] "GET >> /confluent-api/self/remoteconfig/status HTTP/1.1" 200 - >> fe80::[EUI-64] - - [24/Jan/2024:11:15:08 -0500] "GET >> /confluent-api/self/remoteconfig/status HTTP/1.1" 200 - >> """ >> >> When I force a restart of the system/VM, it can boot off the disk, and >> goes through the regular start-up process, including a bunch of >> cloud-init >> stuff. Though after it runs "/etc/confluent/firstboot.sh", the >> "ssl_access_log" file once again starts filling with the >> "remoteconfig/status" stuff per above. >> >> Renaming "ansible/" to "ansible_off/" seems to make the problem go away. >> Similar behaviour with Ubuntu 20.04. 
>> >> I'm wondering what's going on with the 'hang' when "post.sh" is executed, >> and >> the flooding after "firstboot.sh". >> >> Regards, >> David > > _______________________________________________ xCAT-user mailing list xCA...@li... https://lists.sourceforge.net/lists/listinfo/xcat-user |
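The earlier `kill -USR1 $(cat /var/run/confluent/pid)` advice depends on a signal handler inside the daemon that writes every thread's current stack to the hangtraces file. As a minimal sketch of that general pattern (the file path, format, and handler name here are assumptions, not confluent's actual code):

```python
import signal
import sys
import traceback

# Assumed demo path; the real daemon writes /var/log/confluent/hangtraces.
TRACE_PATH = '/tmp/hangtraces-demo'

def dump_stacks(signum, frame):
    # Append the current stack of every thread, so a seemingly hung
    # process can be inspected with `kill -USR1 <pid>` without
    # terminating it.
    with open(TRACE_PATH, 'a') as out:
        for tid, stack in sys._current_frames().items():
            out.write('Thread %d:\n%s\n'
                      % (tid, ''.join(traceback.format_stack(stack))))

signal.signal(signal.SIGUSR1, dump_stacks)
```

That is why the SIGUSR1 is harmless to the running service: the handler only snapshots stacks and returns.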
From: David M. <dma...@ee...> - 2024-01-26 19:01:47
We have Confluent installed on a RH/CentOS 7 system that originally had/has xCat installed for deployment of our Lenovo hardware/HPC solution. I just installed it there as it was/is our 'install server'. (We don't want to touch it too much, as it was a previous team of folks that set things up, and there's been a lot of team churn.) I've attached the "hangtraces" to this message; hopefully the mailing list software will pass it along. I noticed “ipmi” in some of the paths, and for the record this is a VM running under Proxmox, and does not have any LOM configured: """ # nodeattrib dm-boot1 dm-boot1: crypted.selfapikey: ******** dm-boot1: deployment.apiarmed: dm-boot1: deployment.pendingprofile: ubuntu-22.04.3-x86_64-test1 dm-boot1: deployment.profile: dm-boot1: deployment.sealedapikey: dm-boot1: deployment.stagedprofile: dm-boot1: deployment.state: dm-boot1: deployment.state_detail: dm-boot1: deployment.useinsecureprotocols: always dm-boot1: dns.servers: 172.17.15.252,172.17.15.247,172.17.15.254 dm-boot1: groups: everything dm-boot1: net.hwaddr: 4e:78:df:d3:8d:59 dm-boot1: net.ipv4_address: 172.17.15.222/21 dm-boot1: net.ipv4_gateway: 172.17.8.254 """ Running an strace(1) on the 'apiclient' that runs as part of the "post.sh" process, we have a continuous poll/read/write stream: """ […] write(3, "\27\3\3\0\371Sm2\233\337\222n\221\377vZs\21\22S\10\351\232\321I7Y$R\370]\312"..., 254) = 254 read(3, 0x560b6949e8f3, 5) = -1 EAGAIN (Resource temporarily unavailable) poll([{fd=3, events=POLLIN}], 1, 15000) = 1 ([{fd=3, revents=POLLIN}]) read(3, "\27\3\3\0\226", 5) = 5 read(3, "\0055\271\274&\2464\237\242h\341\30\231\274\327g\224\344g\306\313\206\326\355x\307\341\331C\366H\331"..., 150) = 150 poll([{fd=3, events=POLLOUT}], 1, 15000) = 1 ([{fd=3, revents=POLLOUT}]) write(3, "\27\3\3\0\371Sm2\233\337\222n\222\334e\336f\353u\343p\22\215:\264e\30a\3172\245\361"..., 254) = 254 read(3, 0x560b6949e8f3, 5) = -1 EAGAIN (Resource temporarily unavailable) poll([{fd=3, events=POLLIN}], 
1, 15000) = 1 ([{fd=3, revents=POLLIN}]) read(3, "\27\3\3\0\226", 5) = 5 read(3, "\0055\271\274&\2464\240\326\202\347(\213\311\260|\333\230\372A\235\341\273U\201\223\2209ah\325J"..., 150) = 150 poll([{fd=3, events=POLLOUT}], 1, 15000) = 1 ([{fd=3, revents=POLLOUT}]) write(3, "\27\3\3\0\371Sm2\233\337\222n\223\240\341<\3602\323\177Y\311\317/\371\336P/s\301t8"..., 254) = 254 read(3, 0x560b6949e8f3, 5) = -1 EAGAIN (Resource temporarily unavailable) poll([{fd=3, events=POLLIN}], 1, 15000^Cstrace: Process 27477 detached <detached ...> """ Per lsof(1), FD 3 is: """ python3 27477 root 3u IPv6 158157 0t0 TCP [fe80::[EUI-64_client]]:44800->[fe80::[EUI-64_server]]:https (ESTABLISHED) """ On Thu, January 25, 2024 16:34, Jarrod Johnson wrote: > What is the OS of the deployment server? > > kill -USR1 $(cat /var/run/confluent/pid) > > This should produce a /var/log/confluennt/hangtraces > > Would be interesting to see if there's ansible related stacks in > hangtraces that seem stuck... > > > ________________________________ > From: David Magda <dma...@ee...> > Sent: Thursday, January 25, 2024 4:25 PM > To: xCAT Users Mailing list <xca...@li...> > Subject: Re: [xcat-user] [External] Ansible and Confluent > > First suggested command: > > """ > # confluent_selfcheck > OS Deployment: Initialized > Confluent UUID: Consistent > Web Server: Running > Web Certificate: Traceback (most recent call last): > File "/opt/confluent/bin/confluent_selfcheck", line 178, in <module> > cert = certificates_missing_ips(conn) > File "/opt/confluent/bin/confluent_selfcheck", line 57, in > certificates_missing_ips > ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT) > AttributeError: 'module' object has no attribute 'PROTOCOL_TLS_CLIENT' > """ > > On the being-installed system, ignoring the typical Linux stuff, the > output of 'ps -elfH' has: > > """ > > 4 S root 1247 1 0 80 0 - 7499 do_pol 17:53 ? 
> 00:00:00 /usr/bin/python3 /usr/bin/networkd-dispatcher > --run-startup-triggers > 4 S root 1248 1 0 80 0 - 58623 do_pol 17:53 ? > 00:00:00 /usr/libexec/polkitd --no-debug > 4 S syslog 1250 1 0 80 0 - 55600 do_sel 17:53 ? > 00:00:00 /usr/sbin/rsyslogd -n -iNONE > 4 S root 1252 1 0 80 0 - 385081 futex_ 17:53 ? > 00:00:03 /usr/lib/snapd/snapd > 4 S root 1253 1 0 80 0 - 3831 ep_pol 17:53 ? > 00:00:00 /lib/systemd/systemd-logind > 4 S root 1255 1 0 80 0 - 98198 do_pol 17:53 ? > 00:00:02 /usr/libexec/udisks2/udisksd > 4 S root 1283 1 0 80 0 - 26778 do_pol 17:53 ? > 00:00:00 /usr/bin/python3 > /usr/share/unattended-upgrades/unattended-upgrade-shutdown > --wait-for-signal > 4 S root 1291 1 0 80 0 - 61055 do_pol 17:53 ? > 00:00:00 /usr/sbin/ModemManager > 4 S root 2042 1 0 80 0 - 722 do_wai 17:53 ? > 00:00:00 /bin/sh /snap/subiquity/5004/usr/bin/subiquity-server > 4 S root 2086 2042 0 80 0 - 149574 ep_pol 17:53 ? > 00:00:07 /snap/subiquity/5004/usr/bin/python3.10 -m > subiquity.cmd.server > 4 S root 27499 2086 0 80 0 - 722 do_wai 18:09 ? > 00:00:00 sh -c /custom-installation/post.sh > 4 S root 27501 27499 0 80 0 - 1150 do_wai 18:09 ? > 00:00:00 /bin/bash /custom-installation/post.sh > 4 S root 27588 27501 4 80 0 - 7403 do_pol 18:09 ? > 00:03:16 /usr/bin/python3 /opt/confluent/bin/apiclient > /confluent-api/self/remoteconfig/status -w 204 > 4 S root 2049 1 0 80 0 - 24167 ep_pol 17:53 tty1 > 00:00:05 /snap/subiquity/5004/usr/bin/python3.10 > /snap/subiquity/5004/usr/bin/subiquity > 4 S root 2137 1 0 80 0 - 3855 do_pol 17:53 ? > 00:00:00 sshd: /usr/sbin/sshd -D [listener] 0 of 10-100 startups > 4 S root 37842 2137 0 80 0 - 4310 - 19:15 ? > 00:00:00 sshd: root@pts/0 > 4 S root 37952 37842 0 80 0 - 1543 do_wai 19:15 ? > 00:00:00 -bash > 4 R root 38032 37952 0 80 0 - 1911 - 19:16 ? > 00:00:00 ps -elfH > 4 S root 2206 1 0 80 0 - 3266 ep_pol 17:53 ? > 00:00:00 /lib/netplan/netplan-dbus > 4 S root 2570 1 0 80 0 - 73244 do_pol 17:53 ? 
> 00:00:00 /usr/libexec/packagekitd > 4 S root 37848 1 1 80 0 - 4301 ep_pol 19:15 ? > 00:00:00 /lib/systemd/systemd --user > 5 S root 37850 37848 0 80 0 - 26271 do_sig 19:15 ? > 00:00:00 (sd-pam) > """ > > While 'ps axf' produces (trimmed): > > """ > 2042 ? Ss 0:00 /bin/sh > /snap/subiquity/5004/usr/bin/subiquity-server > 2086 ? Sl 0:07 \_ /snap/subiquity/5004/usr/bin/python3.10 -m > subiquity.cmd.server > 27499 ? S 0:00 \_ sh -c /custom-installation/post.sh > 27501 ? S 0:00 \_ /bin/bash > /custom-installation/post.sh > 27588 ? S 3:21 \_ /usr/bin/python3 > /opt/confluent/bin/apiclient /confluent-api/self/remoteconfig/status -w > 204 > 2049 tty1 Ss+ 0:05 /snap/subiquity/5004/usr/bin/python3.10 > /snap/subiquity/5004/usr/bin/subiquity > """ > > Doing a "kill -9 27588" (on apiclient) causes the installation to > 'finish'. After the reboot, and after "firstboot.sh" does its thing, we > have the following from 'ps axf': > > """ > 1372 ? Ss 0:00 /usr/bin/python3 /usr/bin/cloud-init modules > --mode=final > 1376 ? S 0:00 \_ /bin/sh -c tee -a > /var/log/cloud-init-output.log > 1377 ? S 0:00 | \_ tee -a /var/log/cloud-init-output.log > 1378 ? S 0:00 \_ /bin/sh > /var/lib/cloud/instance/scripts/runcmd > 1379 ? S 0:00 \_ /bin/bash /etc/confluent/firstboot.sh > 1429 ? S 0:01 \_ /usr/bin/python3 > /opt/confluent/bin/apiclient /confluent-api/self/remoteconfig/status -w > 204 > """ > > This causes the "/var/log/httpd/ssl_access_log" to start filling up. A > subsequent reboot, where "firstboot.sh" is not run, has the system > coming up without "apiclient" running, and so there's no longer 'spam' in > "ssl_access_log". > > Running "apiclient" manually from the CLI with the exact options causes a > bunch of stuff in "ssl_access_log": > > """ > fe80::[EUI-64] - - [25/Jan/2024:14:52:15 -0500] "GET > /confluent-api/self/remoteconfig/status HTTP/1.1" 200 - > """ > > at the same time as the above is being generated, there is nothing in > "/var/log/confluent/trace" or "stderr". 
> > > On Thu, January 25, 2024 07:52, Jarrod Johnson wrote: >> Anything in /var/log/confluent/stderr or /var/log/confluent/trace? Also >> would be tempted to see if 'confluent_selfcheck' has any suggestions. >> You >> can also ssh into the node during that phase to confirm what it is doing >> while it is seemingly hung, e.g. looking at ps axf >> ________________________________ >> From: David Magda <dma...@ee...> >> Sent: Wednesday, January 24, 2024 9:37 PM >> To: xCA...@li... <xCA...@li...> >> Subject: [External] [xcat-user] Ansible and Confluent >> >> Hello, >> >> I'm trying to get Ansible working with Confluent 3.8.0. (Using an older >> version due to legacy OS reasons.) >> >> In /var/lib/confluent/public/os/ I created a new profile called >> ubuntu-22.04.3-x86_64-test1/, and this seems to work just fine: I took >> the >> provided "autoinstall/user-data" file, added some partition stanzas, >> some >> packages, etc. >> >> Once I sorted out a 'basic' automated Ubuntu install I tried creating a >> "ansible/post.d/01-packages.yaml" file with-in the profile directory >> with >> the following contents: >> >> """ >> - name: install chrony >> apt: >> pkg: >> - chrony >> """ >> >> The Ubuntu (subiquity) installer seems to 'hang' at: >> >> """ >> start: subiquity/Late/run/command_1: /custom-installation/post.sh >> """ >> >> which probably corresponds to this part of the "user-data" file: >> >> """ >> late-commands: >> - chroot /target apt-get -y -q purge snapd modemmanager >> - /custom-installation/post.sh >> """ >> >> When the 'hang' occurs the following starts filling up the >> "/var/log/httpd/ssl_access_log" file of the Confluent/xcat server: >> >> """ >> fe80::[EUI-64] - - [24/Jan/2024:11:15:08 -0500] "GET >> /confluent-api/self/remoteconfig/status HTTP/1.1" 200 - >> fe80::[EUI-64] - - [24/Jan/2024:11:15:08 -0500] "GET >> /confluent-api/self/remoteconfig/status HTTP/1.1" 200 - >> fe80::[EUI-64] - - [24/Jan/2024:11:15:08 -0500] "GET >> 
/confluent-api/self/remoteconfig/status HTTP/1.1" 200 - >> fe80::[EUI-64] - - [24/Jan/2024:11:15:08 -0500] "GET >> /confluent-api/self/remoteconfig/status HTTP/1.1" 200 - >> fe80::[EUI-64] - - [24/Jan/2024:11:15:08 -0500] "GET >> /confluent-api/self/remoteconfig/status HTTP/1.1" 200 - >> fe80::[EUI-64] - - [24/Jan/2024:11:15:08 -0500] "GET >> /confluent-api/self/remoteconfig/status HTTP/1.1" 200 - >> """ >> >> When I force a restart of the system/VM, it can boot off the disk, and >> goes through the regular start-up process, including a bunch of >> cloud-init >> stuff. Though after it runs "/etc/confluent/firstboot.sh", the >> "ssl_access_log" file once again starts filling with the >> "remoteconfig/status" stuff per above. >> >> Renaming "ansible/" to "ansible_off/" seems to make the problem go away. >> Similar behaviour with Ubuntu 20.04. >> >> I'm wondering what's going with the 'hang' when "post.sh" is executed, >> and >> the flooding after "firstboot.sh". >> >> Regards, >> David > > |
From: Ryan N. <nov...@ru...> - 2024-01-26 04:53:24
|
One thing to mention, since I’m in the exact same sinking boat (not doing deployments though): confluent_selfcheck doesn’t work that reliably on RHEL7. -- #BlackLivesMatter ____ || \\UTGERS, |---------------------------*O*--------------------------- ||_// the State | Ryan Novosielski - nov...@ru... || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus || \\ of NJ | Office of Advanced Research Computing - MSB A555B, Newark `' |
From: Jarrod J. <jjo...@le...> - 2024-01-25 21:34:28
|
What is the OS of the deployment server? kill -USR1 $(cat /var/run/confluent/pid) This should produce a /var/log/confluent/hangtraces. It would be interesting to see if there are ansible-related stacks in hangtraces that seem stuck... |
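For anyone curious how a dump like hangtraces can be produced: Confluent is written in Python, and a Python process can register a handler that writes every thread's stack when SIGUSR1 arrives. A generic illustration of the mechanism (not Confluent's actual code; the file name and layout here are made up):

```python
import faulthandler
import os
import signal
import tempfile
import threading
import time

# Open a file to receive the stack dumps (stand-in for hangtraces).
trace_path = os.path.join(tempfile.gettempdir(), "hangtraces-demo")
trace_file = open(trace_path, "w")

# Dump every thread's stack to the file whenever SIGUSR1 arrives.
faulthandler.register(signal.SIGUSR1, file=trace_file, all_threads=True)

# A worker stuck in a long sleep, standing in for a hung ansible task.
worker = threading.Thread(target=time.sleep, args=(30,), daemon=True)
worker.start()

os.kill(os.getpid(), signal.SIGUSR1)  # what 'kill -USR1 <pid>' does
time.sleep(0.2)                       # give the handler a moment

content = open(trace_path).read()
print("most recent call first" in content)
```

Each dump section starts with a "Thread 0x... (most recent call first):" header, so a hung ansible task would show up as a thread whose top frames sit in the ansible-related modules.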
From: David M. <dma...@ee...> - 2024-01-25 21:26:10
|
First suggested command: """ # confluent_selfcheck OS Deployment: Initialized Confluent UUID: Consistent Web Server: Running Web Certificate: Traceback (most recent call last): File "/opt/confluent/bin/confluent_selfcheck", line 178, in <module> cert = certificates_missing_ips(conn) File "/opt/confluent/bin/confluent_selfcheck", line 57, in certificates_missing_ips ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT) AttributeError: 'module' object has no attribute 'PROTOCOL_TLS_CLIENT' """ On the being-installed system, ignoring the typical Linux stuff, the output of 'ps -elfH' has: """ 4 S root 1247 1 0 80 0 - 7499 do_pol 17:53 ? 00:00:00 /usr/bin/python3 /usr/bin/networkd-dispatcher --run-startup-triggers 4 S root 1248 1 0 80 0 - 58623 do_pol 17:53 ? 00:00:00 /usr/libexec/polkitd --no-debug 4 S syslog 1250 1 0 80 0 - 55600 do_sel 17:53 ? 00:00:00 /usr/sbin/rsyslogd -n -iNONE 4 S root 1252 1 0 80 0 - 385081 futex_ 17:53 ? 00:00:03 /usr/lib/snapd/snapd 4 S root 1253 1 0 80 0 - 3831 ep_pol 17:53 ? 00:00:00 /lib/systemd/systemd-logind 4 S root 1255 1 0 80 0 - 98198 do_pol 17:53 ? 00:00:02 /usr/libexec/udisks2/udisksd 4 S root 1283 1 0 80 0 - 26778 do_pol 17:53 ? 00:00:00 /usr/bin/python3 /usr/share/unattended-upgrades/unattended-upgrade-shutdown --wait-for-signal 4 S root 1291 1 0 80 0 - 61055 do_pol 17:53 ? 00:00:00 /usr/sbin/ModemManager 4 S root 2042 1 0 80 0 - 722 do_wai 17:53 ? 00:00:00 /bin/sh /snap/subiquity/5004/usr/bin/subiquity-server 4 S root 2086 2042 0 80 0 - 149574 ep_pol 17:53 ? 00:00:07 /snap/subiquity/5004/usr/bin/python3.10 -m subiquity.cmd.server 4 S root 27499 2086 0 80 0 - 722 do_wai 18:09 ? 00:00:00 sh -c /custom-installation/post.sh 4 S root 27501 27499 0 80 0 - 1150 do_wai 18:09 ? 00:00:00 /bin/bash /custom-installation/post.sh 4 S root 27588 27501 4 80 0 - 7403 do_pol 18:09 ? 
00:03:16 /usr/bin/python3 /opt/confluent/bin/apiclient /confluent-api/self/remoteconfig/status -w 204 4 S root 2049 1 0 80 0 - 24167 ep_pol 17:53 tty1 00:00:05 /snap/subiquity/5004/usr/bin/python3.10 /snap/subiquity/5004/usr/bin/subiquity 4 S root 2137 1 0 80 0 - 3855 do_pol 17:53 ? 00:00:00 sshd: /usr/sbin/sshd -D [listener] 0 of 10-100 startups 4 S root 37842 2137 0 80 0 - 4310 - 19:15 ? 00:00:00 sshd: root@pts/0 4 S root 37952 37842 0 80 0 - 1543 do_wai 19:15 ? 00:00:00 -bash 4 R root 38032 37952 0 80 0 - 1911 - 19:16 ? 00:00:00 ps -elfH 4 S root 2206 1 0 80 0 - 3266 ep_pol 17:53 ? 00:00:00 /lib/netplan/netplan-dbus 4 S root 2570 1 0 80 0 - 73244 do_pol 17:53 ? 00:00:00 /usr/libexec/packagekitd 4 S root 37848 1 1 80 0 - 4301 ep_pol 19:15 ? 00:00:00 /lib/systemd/systemd --user 5 S root 37850 37848 0 80 0 - 26271 do_sig 19:15 ? 00:00:00 (sd-pam) """ While 'ps axf' produces (trimmed): """ 2042 ? Ss 0:00 /bin/sh /snap/subiquity/5004/usr/bin/subiquity-server 2086 ? Sl 0:07 \_ /snap/subiquity/5004/usr/bin/python3.10 -m subiquity.cmd.server 27499 ? S 0:00 \_ sh -c /custom-installation/post.sh 27501 ? S 0:00 \_ /bin/bash /custom-installation/post.sh 27588 ? S 3:21 \_ /usr/bin/python3 /opt/confluent/bin/apiclient /confluent-api/self/remoteconfig/status -w 204 2049 tty1 Ss+ 0:05 /snap/subiquity/5004/usr/bin/python3.10 /snap/subiquity/5004/usr/bin/subiquity """ Doing a "kill -9 27588" (on apiclient) causes the installation to 'finish'. After the reboot, and after "firstboot.sh" does its thing, we have the following from 'ps axf': """ 1372 ? Ss 0:00 /usr/bin/python3 /usr/bin/cloud-init modules --mode=final 1376 ? S 0:00 \_ /bin/sh -c tee -a /var/log/cloud-init-output.log 1377 ? S 0:00 | \_ tee -a /var/log/cloud-init-output.log 1378 ? S 0:00 \_ /bin/sh /var/lib/cloud/instance/scripts/runcmd 1379 ? S 0:00 \_ /bin/bash /etc/confluent/firstboot.sh 1429 ?
S 0:01 \_ /usr/bin/python3 /opt/confluent/bin/apiclient /confluent-api/self/remoteconfig/status -w 204 """ This causes the "/var/log/httpd/ssl_access_log" to start filling up. A subsequent reboot, where "firstboot.sh" is not run, has the system coming up without "apiclient" running, and so there's no longer 'spam' in "ssl_access_log". Running "apiclient" manually from the CLI with the exact options causes a bunch of stuff in "ssl_access_log": """ fe80::[EUI-64] - - [25/Jan/2024:14:52:15 -0500] "GET /confluent-api/self/remoteconfig/status HTTP/1.1" 200 - """ At the same time as the above is being generated, there is nothing in "/var/log/confluent/trace" or "stderr". |
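As an aside, the confluent_selfcheck traceback above ("'module' object has no attribute 'PROTOCOL_TLS_CLIENT'") is what an interpreter older than Python 3.6 produces, since ssl.PROTOCOL_TLS_CLIENT was only added in 3.6. A hedged sketch of a version-tolerant way to build the context (an illustration, not a patch to the actual tool):

```python
import ssl

# ssl.PROTOCOL_TLS_CLIENT (Python 3.6+) verifies certificates and
# hostnames by default; on older interpreters fall back to the
# generic protocol and turn verification on by hand.
try:
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
except AttributeError:
    ctx = ssl.SSLContext(ssl.PROTOCOL_SSLv23)
    ctx.check_hostname = True
    ctx.verify_mode = ssl.CERT_REQUIRED

ctx.load_default_certs()
print(ctx.verify_mode == ssl.CERT_REQUIRED)
```

Either branch ends with a context that requires a verified certificate and checks the hostname, which is the behaviour the tool presumably wants.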
From: Jarrod J. <jjo...@le...> - 2024-01-25 12:59:41
|
Unfortunately, the documentation didn't consider Ubuntu. So far we have a bit of a limitation in the Ubuntu flow where the change you indicated is required: since the Ubuntu scripted-installation initramfs can't make HTTPS API calls through scripting, we fed it the image name so it knows where to find install.iso. I have considered the potential of ultimately replacing a significant portion of the 'addons' material with a Rust executable, to avoid varying dependencies on whether an environment provides curl, wget, python, a particular variant of libcrypt, etc., which would fix this gap, but it hasn't been a strong enough case to make it a priority. So we should amend the documentation to note that Ubuntu currently requires profile.yaml to be tweaked, and then 'osdeploy updateboot' run. |
From: Jarrod J. <jjo...@le...> - 2024-01-25 12:52:51
|
Anything in /var/log/confluent/stderr or /var/log/confluent/trace? Also would be tempted to see if 'confluent_selfcheck' has any suggestions. You can also ssh into the node during that phase to confirm what it is doing while it is seemingly hung, e.g. looking at ps axf ________________________________ From: David Magda <dma...@ee...> Sent: Wednesday, January 24, 2024 9:37 PM To: xCA...@li... <xCA...@li...> Subject: [External] [xcat-user] Ansible and Confluent Hello, I'm trying to get Ansible working with Confluent 3.8.0. (Using an older version due to legacy OS reasons.) In /var/lib/confluent/public/os/ I created a new profile called ubuntu-22.04.3-x86_64-test1/, and this seems to work just fine: I took the provided "autoinstall/user-data" file, added some partition stanzas, some packages, etc. Once I sorted out a 'basic' automated Ubuntu install I tried creating a "ansible/post.d/01-packages.yaml" file with-in the profile directory with the following contents: """ - name: install chrony apt: pkg: - chrony """ The Ubuntu (subiquity) installer seems to 'hang' at: """ start: subiquity/Late/run/command_1: /custom-installation/post.sh """ which probably corresponds to this part of the "user-data" file: """ late-commands: - chroot /target apt-get -y -q purge snapd modemmanager - /custom-installation/post.sh """ When the 'hang' occurs the following starts filling up the "/var/log/httpd/ssl_access_log" file of the Confluent/xcat server: """ fe80::[EUI-64] - - [24/Jan/2024:11:15:08 -0500] "GET /confluent-api/self/remoteconfig/status HTTP/1.1" 200 - fe80::[EUI-64] - - [24/Jan/2024:11:15:08 -0500] "GET /confluent-api/self/remoteconfig/status HTTP/1.1" 200 - fe80::[EUI-64] - - [24/Jan/2024:11:15:08 -0500] "GET /confluent-api/self/remoteconfig/status HTTP/1.1" 200 - fe80::[EUI-64] - - [24/Jan/2024:11:15:08 -0500] "GET /confluent-api/self/remoteconfig/status HTTP/1.1" 200 - fe80::[EUI-64] - - [24/Jan/2024:11:15:08 -0500] "GET /confluent-api/self/remoteconfig/status HTTP/1.1" 
200 - fe80::[EUI-64] - - [24/Jan/2024:11:15:08 -0500] "GET /confluent-api/self/remoteconfig/status HTTP/1.1" 200 - """ When I force a restart of the system/VM, it can boot off the disk, and goes through the regular start-up process, including a bunch of cloud-init stuff. Though after it runs "/etc/confluent/firstboot.sh", the "ssl_access_log" file once again starts filling with the "remoteconfig/status" stuff per above. Renaming "ansible/" to "ansible_off/" seems to make the problem go away. Similar behaviour with Ubuntu 20.04. I'm wondering what's going with the 'hang' when "post.sh" is executed, and the flooding after "firstboot.sh". Regards, David _______________________________________________ xCAT-user mailing list xCA...@li... https://lists.sourceforge.net/lists/listinfo/xcat-user |
From: David M. <dma...@ee...> - 2024-01-25 02:57:25
|
Hello, The Confluent documentation for OS deployment: https://hpc.lenovo.com/users/documentation/confluentosdeploy.html lists only five commands that need to be run to create a new profile: # cd /var/lib/confluent/public/os/ # cp -a rhel-8.2-x86_64-default rhel-8.2-x86_64-custom # cd /var/lib/confluent/private/os/ # cp -a rhel-8.2-x86_64-default rhel-8.2-x86_64-custom # osdeploy updateboot rhel-8.2-x86_64-custom However, I found I had to edit a number of files to take the initial ubuntu-22.04.3-x86_64-default/ profile and create my own ubuntu-22.04.3-x86_64-test1/, per the following diffs: """ --- ./boot/efi/boot/grub.cfg_dist 2024-01-19 14:56:22.737237565 -0500 +++ ./boot/efi/boot/grub.cfg 2024-01-19 15:17:25.890594011 -0500 @@ -1,5 +1,5 @@ set timeout=5 -menuentry 'Ubuntu 22.04.3 x86_64 (Default Profile)' { - linuxefi /kernel quiet osprofile=ubuntu-22.04.3-x86_64-default +menuentry 'Ubuntu 22.04.3 x86_64 (Test1 Profile)' { + linuxefi /kernel quiet osprofile=ubuntu-22.04.3-x86_64-test1 """ """ --- ./boot.ipxe_dist 2023-10-31 14:37:15.907232034 -0400 +++ ./boot.ipxe 2024-01-19 15:17:25.891594034 -0500 @@ -1,5 +1,5 @@ #!ipxe -imgfetch boot/kernel quiet osprofile=ubuntu-22.04.3-x86_64-default initrd=addons.cpio initrd=site.cpio initrd=distribution +imgfetch boot/kernel quiet osprofile=ubuntu-22.04.3-x86_64-test1 initrd=addons.cpio initrd=site.cpio initrd=distribution """ """ --- ./profile.yaml_dist 2023-10-31 14:37:15.669226704 -0400 +++ ./profile.yaml 2024-01-19 14:57:05.614204808 -0500 @@ -1,3 +1,3 @@ -label: Ubuntu 22.04.3 x86_64 (Default Profile) -kernelargs: quiet osprofile=ubuntu-22.04.3-x86_64-default +label: Ubuntu 22.04.3 x86_64 (Test1 Profile) +kernelargs: quiet osprofile=ubuntu-22.04.3-x86_64-test1 """ IIRC, I ran the "osdeploy updateboot …" command (with my profile name). Did I miss something? Should the above files have been automatically changed in some fashion? Thanks for any info. (This is Confluent 3.8.0.) Regards, David |
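For what it's worth, the three edits shown in the diffs above can be scripted as part of the copy. A sketch, run here against a throwaway directory so it is safe to try; on a real server the tree would live under /var/lib/confluent/public/os/ and you would finish with 'osdeploy updateboot' (the profile name and file list below come from this thread; the file contents are stand-ins):

```python
import shutil
import tempfile
from pathlib import Path

OLD = "ubuntu-22.04.3-x86_64-default"
NEW = "ubuntu-22.04.3-x86_64-test1"

# Stand-in for /var/lib/confluent/public/os/ so the sketch is safe to run.
root = Path(tempfile.mkdtemp())

# Minimal fake "default" profile containing only the files the diffs touch.
files = ["boot.ipxe", "boot/efi/boot/grub.cfg", "profile.yaml"]
for f in files:
    p = root / OLD / f
    p.parent.mkdir(parents=True, exist_ok=True)
    p.write_text("quiet osprofile=%s (Default Profile)\n" % OLD)

# Clone the profile, then rewrite every reference to the old name and label.
shutil.copytree(root / OLD, root / NEW)
for f in files:
    p = root / NEW / f
    text = p.read_text().replace(OLD, NEW).replace("Default Profile", "Test1 Profile")
    p.write_text(text)

print((root / NEW / "profile.yaml").read_text().strip())
# On the real server, finish with: osdeploy updateboot <new profile name>
```

The same replace-in-three-files step is what the diffs above do by hand.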
From: David M. <dma...@ee...> - 2024-01-25 02:51:16
|
Hello,

I'm trying to get Ansible working with Confluent 3.8.0. (Using an older version due to legacy OS reasons.) In /var/lib/confluent/public/os/ I created a new profile called ubuntu-22.04.3-x86_64-test1/, and this seems to work just fine: I took the provided "autoinstall/user-data" file, added some partition stanzas, some packages, etc. Once I had sorted out a 'basic' automated Ubuntu install, I tried creating an "ansible/post.d/01-packages.yaml" file within the profile directory with the following contents:

"""
- name: install chrony
  apt:
    pkg:
      - chrony
"""

The Ubuntu (subiquity) installer seems to 'hang' at:

"""
start: subiquity/Late/run/command_1: /custom-installation/post.sh
"""

which probably corresponds to this part of the "user-data" file:

"""
late-commands:
  - chroot /target apt-get -y -q purge snapd modemmanager
  - /custom-installation/post.sh
"""

When the 'hang' occurs, the following starts filling up the "/var/log/httpd/ssl_access_log" file of the Confluent/xCAT server:

"""
fe80::[EUI-64] - - [24/Jan/2024:11:15:08 -0500] "GET /confluent-api/self/remoteconfig/status HTTP/1.1" 200 -
fe80::[EUI-64] - - [24/Jan/2024:11:15:08 -0500] "GET /confluent-api/self/remoteconfig/status HTTP/1.1" 200 -
fe80::[EUI-64] - - [24/Jan/2024:11:15:08 -0500] "GET /confluent-api/self/remoteconfig/status HTTP/1.1" 200 -
fe80::[EUI-64] - - [24/Jan/2024:11:15:08 -0500] "GET /confluent-api/self/remoteconfig/status HTTP/1.1" 200 -
fe80::[EUI-64] - - [24/Jan/2024:11:15:08 -0500] "GET /confluent-api/self/remoteconfig/status HTTP/1.1" 200 -
fe80::[EUI-64] - - [24/Jan/2024:11:15:08 -0500] "GET /confluent-api/self/remoteconfig/status HTTP/1.1" 200 -
"""

When I force a restart of the system/VM, it can boot off the disk and goes through the regular start-up process, including a bunch of cloud-init stuff. However, after it runs "/etc/confluent/firstboot.sh", the "ssl_access_log" file once again starts filling with the "remoteconfig/status" entries shown above.

Renaming "ansible/" to "ansible_off/" seems to make the problem go away. The behaviour is similar with Ubuntu 20.04. I'm wondering what's going on with the 'hang' when "post.sh" is executed, and the flooding after "firstboot.sh". Regards, David |
From: VICTOR HU <vh...@us...> - 2024-01-24 16:44:16
|
[celebrate] VICTOR HU reacted to your message:

________________________________
From: Nathan A Besaw via xCAT-user <xca...@li...>
Sent: Wednesday, January 24, 2024 2:00:16 PM
To: xCAT Users Mailing list <xca...@li...>
Cc: Nathan A Besaw <be...@us...>
Subject: [EXTERNAL] [xcat-user] Announcement: new addition to the project maintainer team

All, I would like to officially welcome Markus Hilger (github id: Obihörnchen) to the xCAT team of maintainers. Markus is a long time xCAT user and contributor with many years of experience using xCAT in HPC environments. Markus is also a member of the xCAT consortium(Redline Performance Solutions/MEGWARE/OCF) from MEGWARE. In his new role as a project maintainer, he is currently working on installing new test infrastructure and will be contributing to the planning for the next release. Please join me in welcoming Markus to the xCAT maintainer team! |
From: Nathan A B. <be...@us...> - 2024-01-24 14:37:05
|
All, I would like to officially welcome Markus Hilger (GitHub id: Obihörnchen) to the xCAT team of maintainers. Markus is a long-time xCAT user and contributor with many years of experience using xCAT in HPC environments. Markus is also a member of the xCAT consortium (Redline Performance Solutions / MEGWARE / OCF), from MEGWARE. In his new role as a project maintainer, he is currently working on installing new test infrastructure and will be contributing to the planning for the next release. Please join me in welcoming Markus to the xCAT maintainer team! |
From: Vinícius F. <fe...@ve...> - 2024-01-11 20:23:15
|
I think I will throw in the towel. My understanding was exactly what you've said, Jarrod: IPMI over LAN should work. The docs state that it has an IPMI 2.0 interface, but it does not work. I've managed to upgrade the BMC firmware to the latest one (1.44) using a DOS disk image that I uploaded to the RSA-II and controlled remotely from an old Windows XP VM with IE6 and Java 1.6. It was a blast. But it didn't work either.

I've found the spec sheet, and it confirms that the machine should have this and that it should work: https://www.salland.eu/pdf/Server/IBM_x3550.pdf (IBM_x3550, PDF, 750 KB)

I may be missing something that I don't know or don't understand. Not sure if a FoD (Feature on Demand) is also required or not. I can use ipmitool in-band, but not out-of-band; the IP address does not answer. I tried to change the address and configure both cards on the switch to see if at least the MAC address of the BMC shows up, but nothing. Nothing shows up.

I've come across some information about things like OSA SMBridge, but it didn't make sense, because I would have to run it on the running OS, which defeats the purpose of a BMC interface for out-of-band management; also, that software is only for RHEL 2/3/4/5. There is also something about an "IPMI driver and IBM mapping layer installation" that I could not figure out.

If there's anything still on the hard drive in your head, please let me know. I don't think the machine is broken or defective, because I have three of them and all of them have the same issues.
Some outputs from the frustration:

[root@x3550-1 ~]# dmesg | grep BMC
[    9.414441] ipmi_si ipmi_si.0: Found new BMC (man_id: 0x000002, prod_id: 0x0012, dev_id: 0x20)
[    9.680918] ipmi_si ipmi_si.0: Found BMC with sensor interface v3.10 2006-06-29 on interface 0

-x-x-x-

[root@x3550-1 ~]# ipmitool lan print
Set in Progress         : Set Complete
Auth Type Support       : NONE MD2 MD5 PASSWORD
Auth Type Enable        : Callback :
                        : User     : MD2 MD5 PASSWORD
                        : Operator : MD2 MD5 PASSWORD
                        : Admin    : MD2 MD5 PASSWORD
                        : OEM      :
IP Address Source       : BIOS Assigned Address
IP Address              : 172.25.0.99
Subnet Mask             : 255.255.255.0
MAC Address             : 00:21:5e:0c:01:7d
SNMP Community String   : public
IP Header               : TTL=0x40 Flags=0x40 Precedence=0x00 TOS=0x10
BMC ARP Control         : ARP Responses Enabled, Gratuitous ARP Disabled
Gratituous ARP Intrvl   : 2.0 seconds
Default Gateway IP      : 172.25.0.254
Default Gateway MAC     : 00:00:00:00:00:00
Backup Gateway IP       : 0.0.0.0
Backup Gateway MAC      : 00:00:00:00:00:00
802.1q VLAN ID          : 1
802.1q VLAN Priority    : 0
RMCP+ Cipher Suites     : 0,1,2,3
Cipher Suite Priv Max   : uaaaXXXXXXXXXXX
                        :     X=Cipher Suite Unused
                        :     c=CALLBACK
                        :     u=USER
                        :     o=OPERATOR
                        :     a=ADMIN
                        :     O=OEM
Bad Password Threshold  : Not Available

-x-x-x-

[root@x3550-1 ~]# /opt/ibm/toolscenter/asu/asu64 rebootbmc
IBM Advanced Settings Utility version 9.30.79N
Licensed Materials - Property of IBM
(C) Copyright IBM Corp. 2007-2012 All Rights Reserved
Error communicating with BMC. If the system contains a BMC then check your IPMI driver and IBM mapping layer installation. If the system does not contain a BMC then remove the BMC patch from ASU by issuing the ASU patchremove command with the correct patch #.
Error communicating with RSA. If the system contains an RSA then check your RSA Daemon installation. If the system does not contain a RSA then remove the RSA patch from ASU by issuing the ASU patchremove command with the correct patch #.
Could not find IPMI driver. Please check your IPMI driver and IBM mapping layer installation.
-x-x-x-

[root@x3550-1 ~]# dmesg | grep RSA
[    2.459038] usb 4-1: Product: IBM RSA2
[    2.471591] input: IBM IBM RSA2 as /devices/pci0000:00/0000:00:1d.2/usb4/4-1/4-1:1.0/input/input2
[    2.522290] hid-generic 0003:04B3:4001.0002: input,hidraw1: USB HID v1.10 Keyboard [IBM IBM RSA2] on usb-0000:00:1d.2-1/input0
[    2.529608] input: IBM IBM RSA2 as /devices/pci0000:00/0000:00:1d.2/usb4/4-1/4-1:1.1/input/input3
[    2.529779] hid-generic 0003:04B3:4001.0003: input,hidraw2: USB HID v1.10 Mouse [IBM IBM RSA2] on usb-0000:00:1d.2-1/input1

-x-x-x-

[root@cloyster ~]# ipmitool -I lanplus -H 172.25.0.99 -U USERID -P PASSW0RD lan print
Error: Unable to establish IPMI v2 / RMCP+ session
[root@cloyster ~]# ipmitool -I lan -H 172.25.0.99 -U USERID -P PASSW0RD lan print
Error: Unable to establish LAN session
Error: Unable to establish IPMI v1.5 / RMCP session

Thanks all.

On 10 Jan 2024, at 12:35, Jarrod Johnson <jjo...@le...> wrote:

So the mini-RSA card added remote video, ssh and web (and some things for IBM Director at the time). The original x3550 should have provided IPMI and SOL out of the box (although the vintage is such that I think you need IPMI 1.5, which I haven't tested in a long time). Very vague in my memory, but I was around for those days. Fun fact: that architecture is why to this day we have an oddity in our firmware, that IPMI connects to ttyS0 and SSH connects to ttyS1. It was for backwards compatibility to this time, when the mini-RSA brought its own serial UART and thus IPMI only worked on the built-in UART and ssh only worked on the mini-RSA's UART.

________________________________
From: Vinícius Ferrão via xCAT-user <xca...@li...>
Sent: Tuesday, January 9, 2024 6:41 PM
To: xca...@li... <xca...@li...>
Cc: Vinícius Ferrão <fe...@ve...>
Subject: [External] [xcat-user] Support for IBM Remote Supervisor Adapter II (RSA-II)

Hello, this thread may be off-topic on this list, but I don't have any other place to go where people may understand the question.

I bought this card thinking that it would provide IPMI so the machine could be controlled by Confluent (and maybe xCAT), but I think I misunderstood what the device provides. Does anyone know if this card is supported? Does it provide IPMI over LAN?

Long story: there's an old IBM System x3550 (the first one) that I use to test things, and I was trying to add it as a compute node of Confluent, but although it has an OOB Ethernet interface named "management", it didn't even link when a network cable was plugged in. After spending countless hours trying to figure it out, I discovered that I should have an additional IBM RSA-2 SlimLine card in the system for this management port to work. I think I incorrectly assumed that this card would provide a classic IPMI-over-LAN interface, since the server already has BMC configuration in the BIOS where I can even set LAN settings like the IP address.

So I sourced one card on the used market, and then spent 12 hours fighting with it due to wrong firmwares, mismatches between the system BIOS and the card, broken download links on the IBM website, and that frustrating Fix Central webpage. There is still a BMC update that I could not apply, because the update package simply does not find the BMC on the server; probably because the package is for EL5 and I'm running EL7.

After fighting with all this, I was finally able to connect to the web interface that the RSA-2 provides. I can shut down and power on the server, see some information, and that's it. However, I cannot control the system using ipmitool remotely, and when using ipmitool in-band the LAN settings are different from those on the RSA-II card. So I think all this BMC configuration in the BIOS and the "ipmitool lan" commands are bogus on this system. Basically the card is pretty much useless and I just wasted time and a little money on this journey. So is there any chance of making this work? Any workaround?

Anyone that feels the pain, or knows the hardware well enough to fill in the gaps of what I may be missing? Thanks all.

_______________________________________________
xCAT-user mailing list
xCA...@li...
https://lists.sourceforge.net/lists/listinfo/xcat-user |
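Before debugging auth types or cipher suites on a BMC like this, it can help to establish whether anything answers RMCP on UDP port 623 at all. The sketch below builds the 12-byte ASF Presence Ping that tools such as ipmitool send for exactly that check (packet layout per the ASF 2.0 spec; the host address in `bmc_answers` is a placeholder, and sending is optional):

```python
import socket
import struct

def asf_presence_ping(tag: int = 0) -> bytes:
    """RMCP header (version 6, reserved, no-ack sequence, ASF class) followed
    by an ASF message: IANA enterprise 4542, type 0x80 = Presence Ping,
    message tag, reserved byte, zero data length. 12 bytes total."""
    return struct.pack(">BBBBIBBBB",
                       0x06, 0x00, 0xFF, 0x06,   # RMCP header
                       0x11BE,                   # ASF IANA enterprise number
                       0x80, tag, 0x00, 0x00)    # Presence Ping, tag, rsvd, len

def bmc_answers(host: str, timeout: float = 2.0) -> bool:
    """Return True if anything replies (a Presence Pong) on UDP 623."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.settimeout(timeout)
        s.sendto(asf_presence_ping(), (host, 623))
        try:
            s.recvfrom(512)
            return True
        except socket.timeout:
            return False

print(asf_presence_ping().hex())
```

If the ping times out, as the "Unable to establish … session" errors above suggest it would here, the problem is below the IPMI session layer (no LAN channel active on that interface), and no amount of credential or cipher-suite fiddling will help.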
From: Ryan N. <nov...@ru...> - 2024-01-11 19:32:42
|
I just meant, more than anything, that support for a new OS is certainly not going to be added if it's not already there, now that the last release has been announced (unless it just works with a minor tweak).

On Jan 11, 2024, at 13:15, Gilad Berman <gb...@le...> wrote:

Rhel9 should be supported already though afaik

Gilad Berman
HPC Architect, Lenovo EMEA
gb...@le... +972-522554262

From: Ryan Novosielski via xCAT-user <xca...@li...>
Sent: Thursday, 11 January 2024 18:41
To: xCAT Users Mailing list <xca...@li...>
Cc: Ryan Novosielski <nov...@ru...>
Subject: [External] Re: [xcat-user] RHEL9 support in xcat

I don’t know what-all happened at SC or whether a group has come together to continue it, but just remember that there’s a thread on this mailing list about the fact that xCAT is not going to be maintained going forward.

--
#BlackLivesMatter
 ____
|| \\UTGERS,     |---------------------------*O*---------------------------
||_// the State  | Ryan Novosielski - nov...@ru...
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\    of NJ  | Office of Advanced Research Computing - MSB A555B, Newark
     `'

On Jan 10, 2024, at 11:54, Imam Toufique <tec...@gm...> wrote:

Hello, are there any plans to add RHEL9 support for xcat? If so, will it be available for community use?

Thanks
Regards,
Imam Toufique
213-700-5485

_______________________________________________
xCAT-user mailing list
xCA...@li...
https://lists.sourceforge.net/lists/listinfo/xcat-user |
From: Imam T. <tec...@gm...> - 2024-01-11 19:14:48
|
agreed! thanks!

On Thu, Jan 11, 2024 at 11:13 AM Noah, Stuart via xCAT-user <xca...@li...> wrote:
> Not only is there a thread on this list but there are numerous sites on
> the web stating that support and development has been sunsetted.
>
> So, occasionally posting information on the transition process (whatever
> that may be) would certainly help to boost user confidence in xCAT as a
> viable solution.

--
Regards,
Imam Toufique
213-700-5485 |
From: Noah, S. <Stu...@cs...> - 2024-01-11 19:12:49
|
Not only is there a thread on this list, but there are numerous sites on the web stating that support and development has been sunsetted. So, occasionally posting information on the transition process (whatever that may be) would certainly help to boost user confidence in xCAT as a viable solution.

From: Gilad Berman <gb...@le...>
Sent: Thursday, January 11, 2024 10:16 AM
To: xCAT Users Mailing list <xca...@li...>
Subject: Re: [xcat-user] [External] Re: RHEL9 support in xcat

Rhel9 should be supported already though afaik

Gilad Berman
HPC Architect, Lenovo EMEA
gb...@le... +972-522554262

From: Ryan Novosielski via xCAT-user <xca...@li...>
Sent: Thursday, 11 January 2024 18:41
To: xCAT Users Mailing list <xca...@li...>
Cc: Ryan Novosielski <nov...@ru...>
Subject: [External] Re: [xcat-user] RHEL9 support in xcat

I don’t know what-all happened at SC or whether a group has come together to continue it, but just remember that there’s a thread on this mailing list about the fact that xCAT is not going to be maintained going forward.

--
#BlackLivesMatter
 ____
|| \\UTGERS,     |---------------------------*O*---------------------------
||_// the State  | Ryan Novosielski - nov...@ru...
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\    of NJ  | Office of Advanced Research Computing - MSB A555B, Newark
     `'

On Jan 10, 2024, at 11:54, Imam Toufique <tec...@gm...> wrote:

Hello, are there any plans to add RHEL9 support for xcat? If so, will it be available for community use?

Thanks
Regards,
Imam Toufique
213-700-5485

_______________________________________________
xCAT-user mailing list
xCA...@li...
https://lists.sourceforge.net/lists/listinfo/xcat-user

IMPORTANT WARNING: This message is intended for the use of the person or entity to which it is addressed and may contain information that is privileged and confidential, the disclosure of which is governed by applicable law. If the reader of this message is not the intended recipient, or the employee or agent responsible for delivering it to the intended recipient, you are hereby notified that any dissemination, distribution or copying of this information is strictly prohibited. Thank you for your cooperation. |