Dear Jochen,
The Halef demo line has been down several times over the past weeks. AFAIK, your monitoring scripts are not installed on this instance. Please install them and also provide instructions on how to install the monitoring scripts on new instances. Furthermore, the scripts should be part of the SourceForge project. Please check them in to the respective repositories.
Thanks,
DSO
The files are already in the SourceForge project under scripts/monitor along with all necessary information in the readme.txt: e.g. https://sourceforge.net/p/halef/halef-cairo/ci/master/tree/cairo-VM/scripts/monitor/
The corresponding files exist on tjr70; however, they are neither running nor up to date with master. I will see to that.
Hi David,
I just updated the monitoring scripts on tjr70 and tjr71. Both seem to work correctly. If you received an email a few minutes ago, this was because one address was incorrect. The scripts should inform you if a server is no longer reachable or if a process is missing, as long as both servers do not go down at the same time. (I added myself to the recipient list, too.)
However, I wasn't able to implement the monitoring for Asterisk: I still don't have the correct password, which has to be specified in the properties file.
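For reference, the two checks described above (server reachable, process present) could look roughly like the sketch below. This is not the code from scripts/monitor; the function names are hypothetical, and it assumes a Linux host with pgrep and the iputils version of ping.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the names of expected processes that are
# not currently running (space-separated; empty output if all are found).
missing_processes() {
    local name missing=""
    for name in "$@"; do
        # pgrep -x matches the exact process name; exit status 0 means found.
        if ! pgrep -x "$name" > /dev/null 2>&1; then
            missing="$missing $name"
        fi
    done
    # Strip the leading space before printing.
    echo "${missing# }"
}

# Hypothetical helper: succeed if the host answers one ping within 5 seconds.
# (-W is the reply timeout in seconds on Linux iputils ping.)
host_reachable() {
    ping -c 1 -W 5 "$1" > /dev/null 2>&1
}
```

A monitoring loop would call host_reachable for the peer server and missing_processes with the list of process names it expects, and send mail only when one of them reports a problem.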
Regards,
Jochen
Dear Jochen,
The demo line 2000 is down, but, AFAIK, your monitoring scripts did not fire. Would you please take a look?
Yours,
DSO
Dear Jochen,
This arrived from tjr@it-tjr70.dhbw-stuttgart.de:
[Wed Nov 26 00:32:18 CET 2014]
it-tjr71.dhbw-stuttgart.de process ids:
ERROR: monitoring not unavailable
"monitoring not unavailable" means the same as "monitoring available", does it not? What does this tell me?
Yours,
DSO
On tjr70 and 71 no monitoring scripts were running.
For now I have restarted the scripts. (You probably received mail from tjr70, since I started the monitoring there first.) I will check whether they are killed again.
Regarding your last post:
It's a double negative. My fault. It should just say that there is no monitoring on one machine. This happened because I did not start the monitoring scripts simultaneously.
I will make a change to the files tomorrow.
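A possible wording for the status line, as a minimal sketch. The function name and the "yes"/"no" argument are hypothetical and not taken from the actual scripts.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build the status line for a peer host.
# $1 = host name, $2 = "yes" if the peer's monitoring script is running.
monitoring_status_line() {
    if [ "$2" = "yes" ]; then
        echo "OK: monitoring is running on $1"
    else
        # Single negative, so the message cannot be misread.
        echo "ERROR: monitoring is not running on $1"
    fi
}
```

For example, `monitoring_status_line it-tjr71.dhbw-stuttgart.de no` would print "ERROR: monitoring is not running on it-tjr71.dhbw-stuttgart.de".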
Last edit: Jochen Mohrmann 2014-11-25
Dear Jochen,
Any updates on this? The demo lines are still down, and I am not receiving any warning e-mails either.
Yours,
DSO
Dear David,
The reason why the scripts stopped was a network problem. For some reason the domain couldn't be resolved to an IP address, but this problem has not occurred again since then.
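One way to keep a temporary name-resolution failure from ending the script is to treat it as a logged, retryable condition. The following is only a sketch under assumptions: the function names and the RESOLVE_CMD variable are hypothetical, and `getent hosts` is assumed to be available as the default resolver command.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if the host name resolves to an address.
# RESOLVE_CMD can be overridden; by default it uses "getent hosts".
resolves() {
    # Intentionally unquoted so "getent hosts" splits into command + argument.
    ${RESOLVE_CMD:-getent hosts} "$1" > /dev/null 2>&1
}

# Hypothetical helper: run one check cycle over all given hosts.
# A resolution failure is reported but never aborts the cycle.
check_cycle() {
    local host
    for host in "$@"; do
        if resolves "$host"; then
            echo "$host: resolved"
        else
            # Log and carry on instead of letting the script terminate.
            echo "$host: name resolution failed, will retry next cycle"
        fi
    done
    return 0
}
```

The point of the design is that check_cycle always returns 0, so an outer `while true` loop keeps running even while DNS is unavailable.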
The scripts are running and the last check was at [Wed Dec 3 03:22:46 CET 2014].
All processes that are monitored are currently running, too (I just double checked it).
Do you know the actual reason why the demo line is down? Obviously the scripts are not able to detect this crash, but I would like to change that.
Regards,
Jochen
Dear Jochen,
The procedure to restart Halef is
cd ~/scripts
bash restart.bash
when it is done:
bash restart.bash (again)
when it is done:
sudo -s
cd ~/scripts
bash restart.bash
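The same sequence could be wrapped in a single function, sketched below. The function name and the SUDO_CMD variable are hypothetical. Two assumptions go beyond the steps above: the root run uses the same scripts directory as the user runs, and the sequence stops at the first failing run.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the restart procedure: two runs as the
# current user, then one run as root. Runs in a subshell so the caller's
# working directory is left unchanged.
# $1 = scripts directory (default: ~/scripts)
restart_halef() (
    scripts_dir="${1:-$HOME/scripts}"
    # SUDO_CMD may be set to an empty string to skip privilege escalation.
    sudo_cmd="${SUDO_CMD-sudo}"
    cd "$scripts_dir" || exit 1
    bash restart.bash || exit 1      # first run
    bash restart.bash || exit 1      # second run, after the first is done
    $sudo_cmd bash restart.bash      # third run, as root
)
```

Calling `restart_halef` with no arguments would use ~/scripts and sudo.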
Yours,
DSO
2000 is back up. Thanks. So, are the scripts functional now?
I fixed the problem with the script terminating unexpectedly.
The problem that occurred last time, with the extension not being available, is still undetectable. In order to detect it, I will probably need to simulate a call. However, with the current settings, doing so will make a parallel call impossible. Is it acceptable if the line is blocked for a few seconds every now and then?
Are you talking about Ext 2000 specifically? Yes, it is OK if you perform some experimentation with this extension as long as you make sure that after your work session, the demo line is back up and running.
Thanks,
DSO
Any updates?
I plan to use sipcmd (https://github.com/tmakkonen/sipcmd) to automatically simulate calls to specific extensions within the monitoring script. I hope that with this tool I can distinguish whether the extension is reachable or not, regardless of the specific problem.
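A rough sketch of how such a call probe could be wrapped in the monitoring script. The function name is hypothetical, and the probe command is passed in as arguments rather than hard-coded, so no particular sipcmd options are implied here. It also assumes that the probe command's exit status reflects whether the call succeeded, which would still need to be verified for sipcmd.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a call-probe command with a time limit and
# report whether the extension answered.
# $1 = extension, $2 = time limit in seconds, remaining args = probe command
probe_extension() {
    local ext="$1" limit="$2"
    shift 2
    # timeout (GNU coreutils) kills the probe if it exceeds the limit,
    # so the line is never blocked for longer than that.
    if timeout "$limit" "$@" > /dev/null 2>&1; then
        echo "extension $ext: reachable"
    else
        echo "extension $ext: NOT reachable"
        return 1
    fi
}
```

In the monitoring script this would be called with the actual sipcmd invocation after the time limit, for example `probe_extension 2000 10 <sipcmd command line>`. A short time limit keeps the period during which the line is blocked for parallel calls to a few seconds.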
Tjr54 is currently updating, since ubuntu 3.10.2 is necessary.
Closing this bug as Patrick has implemented an email monitoring service which renders this bug obsolete.
Closing bug after HALEF PoC assessment by Robert and colleagues.