<div dir="ltr">Hey Jelle,<div><br></div><div>Looks like you've got the same symptom with a different cause :-(. I can say for certain the symptom in my case was caused by the double livestatus loading - so we know you're running into a different thing.</div><div><br></div><div>Best,</div><div>Terence</div><div><br></div></div><div class="gmail_extra"><br><div class="gmail_quote">On Tue, May 16, 2017 at 8:12 AM, jesm <span dir="ltr"><<a href="mailto:crap8@smetj.net" target="_blank">crap8@smetj.net</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div><div style="font-family:arial,sans-serif;font-size:13px"> <p>Hi all,<br><br>We're experiencing exactly the same problem, except that we don't have the situation of loading Livestatus twice.<br>We're not using Docker and we're on the latest stable release. We could try the nightly build, but unfortunately we cannot reproduce the problem at will: it comes and goes and we have no idea why.<br><br>The symptoms we are seeing:</p> <ul> <li>All Naemon-related threads are consuming 100% CPU on every core.</li> <li>Thruk is not able to connect to the Unix domain socket, and therefore each incoming request starts an FCGI process, exhausting the pool in no time.</li> <li>Weirdly enough, during this state it's still possible to manually query livestatus using unixcat or socat.</li> <li>Restarting Naemon does not help.</li> <li>Rebooting the server does not help.</li> <li>Removing retention.dat solves the problem.</li> <li>Restoring the retention.dat that we removed during the outage does NOT trigger the problem again.</li> <li>Stracing the threads shows a continuous barrage of entries like this (I don't have a more detailed capture of this output):<ul><li><pre><code><... 
futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable)</code></pre></li></ul> </li> </ul> <br>After the retention.dat file was deleted and Naemon was restarted, we were not able to trigger the same problem again. <p><br>Any ideas?<br><br>Cheers,<br><br>Jelle</p><div><div class="h5"><br><br>May 12, 2017 4:27 AM, "Terence Kent" <<a href="mailto:terencekent@gmail.com" target="_blank">terencekent@gmail.com</a>> wrote:</div></div><p></p><div><div class="h5"> <blockquote><div><div><div dir="ltr">Hey <span style="font-size:12.8px">Sven,</span><div></div> <div> <span style="font-size:12.8px">Thanks for getting back to me so quickly; this one was particularly challenging to chase down. Using </span><span style="font-size:12.8px">strace and livestatus debugging didn't actually give me more information. I also confirmed I had the issue with the nightly build as well as 1.0.6.</span> </div> <div></div> <div>Anyway, I found the cause of the issue. It's configuration-related and pretty subtle. If you uncomment the following directive in the <span style="color:rgb(0,0,0);font-family:menlo;font-size:11px;font-variant-ligatures:no-common-ligatures">/etc/naemon/naemon.cfg</span> file...</div> <blockquote style="margin:0px 0px 0px 40px;border:none;padding:0px"><div><p><span>broker_module=/usr/lib/naemon/naemon-livestatus/livestatus.so /var/cache/naemon/live</span></p></div></blockquote> <div>...then the livestatus socket gets initialized twice during Naemon's startup, causing the issue I described earlier. The reason for the duplicate initialization is that <span style="font-variant-ligatures:no-common-ligatures;color:rgb(0,0,0);font-family:menlo;font-size:11px">/etc/naemon/module-conf.d/livestatus.cfg</span><span style="font-variant-ligatures:no-common-ligatures;color:rgb(0,0,0)"><font face="arial, helvetica, sans-serif"> also includes the same directive. 
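The double loading is easy to confirm from the shell: if both files carry the directive, grepping them shows it twice. A minimal sketch, using stand-in files in a `demo/` directory in place of the real /etc/naemon/naemon.cfg and /etc/naemon/module-conf.d/livestatus.cfg:

```shell
# Recreate the misconfiguration with stand-in files (the real paths are
# /etc/naemon/naemon.cfg and /etc/naemon/module-conf.d/livestatus.cfg).
mkdir -p demo/module-conf.d
printf '%s\n' 'broker_module=/usr/lib/naemon/naemon-livestatus/livestatus.so /var/cache/naemon/live' > demo/naemon.cfg
printf '%s\n' 'broker_module=/usr/lib/naemon/naemon-livestatus/livestatus.so /var/cache/naemon/live' > demo/module-conf.d/livestatus.cfg

# Count how often each broker_module directive is configured; a count above 1
# means the same module will be initialized more than once at startup.
grep -h '^broker_module=' demo/naemon.cfg demo/module-conf.d/*.cfg | sort | uniq -c
```

Here uniq -c reports a count of 2 for the livestatus line, which is exactly the double initialization described above.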
There's a hint of the duplicate initialization in the naemon log, which shows the livestatus initialization messages twice, but that's it.</font></span> </div> <div></div> <div> <span style="font-variant-ligatures:no-common-ligatures;color:rgb(0,0,0)"><font face="arial, helvetica, sans-serif">It seems the only real issue here is that the configuration is very confusing (</font></span><span style="color:rgb(0,0,0);font-family:menlo;font-size:11px;font-variant-ligatures:no-common-ligatures">/etc/naemon/naemon.cfg </span><span style="font-family:arial,helvetica,sans-serif;color:rgb(0,0,0);font-variant-ligatures:no-common-ligatures">gives you an example of how to use livestatus, making you think you should just be able to uncomment it) and that repeating a configuration directive doesn't produce an obvious error.</span> </div> <div></div> <div><font color="#000000" face="arial, helvetica, sans-serif"><span style="font-variant-ligatures:no-common-ligatures">Would you like me to file an issue for this? While it's easy to resolve, it's really hard to chase down.</span></font></div> <div></div> <div><font color="#000000" face="arial, helvetica, sans-serif"><span style="font-variant-ligatures:no-common-ligatures">Thanks!</span></font></div> <div><font color="#000000" face="arial, helvetica, sans-serif"><span style="font-variant-ligatures:no-common-ligatures">Terence</span></font></div> <div> <div>On Tue, May 9, 2017 at 12:12 AM, Sven Nierlein <span dir="ltr"><<a rel="external nofollow noopener noreferrer" href="mailto:Sven.Nierlein@consol.de" target="_blank">Sven.Nierlein@consol.de</a>></span> wrote:<br><br> <blockquote style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Hi Terence,<br><br>Could you try the latest nightly build, just to be sure you're not hunting already-fixed bugs? If that doesn't help, you could increase the<br>livestatus loglevel as well. 
Naemon has a debug log which can be enabled, and of course strace often gives a good idea of what's<br>happening as well.<br><br>Cheers,<br>Sven<br><br><br><span>On 09.05.2017 02:05, Terence Kent wrote:<br>> Hello!<br>><br>> We're trying to update our naemon docker image to 1.0.6 and we're running into a fairly difficult-to-debug issue. Here's what we're seeing:<br>><br>> 1. Naemon + Apache start as expected and will run indefinitely if Thruk is not accessed.<br>> 2. Upon signing in to Thruk, the Naemon process's CPU consumption jumps to 100% and stays there indefinitely.<br>><br>> We've been trying to get at some logging messages to see if we can diagnose the behavior, but that's been a bit more trouble than we expected. So far, we've just done the obvious thing of increasing the debugging levels found in /etc/naemon/naemon.cfg. However, this seems to produce no additional information when the issue is hit.<br>><br>> Anyway, here's some information about the container environment:<br>></span><br>> *Base image:* phusion 0.9.21 (which is Ubuntu 16.04)<br>> *Naemon primary log file entries: *These always look like this. Not much to go off of.<br><span>> ––––<br>><br>> [1494286706] Naemon 1.0.6-pkg starting... 
(PID=51)<br>><br>> [1494286706] Local time is Mon May 08 23:38:26 UTC 2017<br>><br>> [1494286706] LOG VERSION: 2.0<br>><br>> [1494286706] qh: Socket '/var/lib/naemon/naemon.qh' successfully initialized<br>><br>> [1494286706] nerd: Channel hostchecks registered successfully<br>><br>> [1494286706] nerd: Channel servicechecks registered successfully<br>><br>> [1494286706] nerd: Fully initialized and ready to rock!<br>><br>> [1494286706] wproc: Successfully registered manager as @wproc with query handler<br>><br>> [1494286706] wproc: Registry request: name=Core Worker 55;pid=55<br>><br>> [1494286706] wproc: Registry request: name=Core Worker 57;pid=57<br>><br>> [1494286706] wproc: Registry request: name=Core Worker 59;pid=59<br>><br>> [1494286706] wproc: Registry request: name=Core Worker 61;pid=61<br>><br>> [1494286706] wproc: Registry request: name=Core Worker 58;pid=58<br>><br>> [1494286706] wproc: Registry request: name=Core Worker 60;pid=60<br>><br>> ––––</span><br>> *Naemon livestatus log: *(Blank)<br>> *Thruk logs: *Nothing comes out here until I kill the naemon service; then it's just:<div><div>> ––––––––<br>><br>> [2017/05/08 19:34:00][nameon][ERROR][Thruk] No Backend available<br>><br>> [2017/05/08 19:34:00][nameon][ERROR][Thruk] on page: <a rel="external nofollow noopener noreferrer" href="http://10.13.30.200/thruk/cgi-bin/minemap.cgi?_=1494272037931" target="_blank">http://10.13.30.200/thruk/cgi-bin/minemap.cgi?_=1494272037931</a><br>><br>> [2017/05/08 19:34:00][nameon][ERROR][Thruk] Naemon: ERROR: failed to connect - Connection refused. (/var/cache/naemon/live)<br>><br>> –––––––––<br>><br>><br>><br>> From tracing around, we're pretty confident the issue occurs when Thruk attempts to connect to the naemon live socket. 
However, pinning down the cause has been tough; we know the filesystem permissions are correct, the log messages suggest the socket is working, and Thruk behaves as expected when we stop naemon (it shows its interface and reports that it cannot connect to naemon). We can keep at this, of course, but I was hoping we could get pointed in the right direction.<br>><br>><br>> Thanks!<br>><br>> Terence<br>><br>><br><br> </div></div> <span><font color="#888888">--<br>Sven Nierlein <a rel="external nofollow noopener noreferrer" href="mailto:Sven.Nierlein@consol.de" target="_blank">Sven.Nierlein@consol.de</a><br>ConSol* GmbH <a rel="external nofollow noopener noreferrer" href="http://www.consol.de" target="_blank">http://www.consol.de</a><br>Franziskanerstrasse 38 Tel.:089/45841-439<br>81669 Muenchen Fax.:089/45841-111</font></span> </blockquote> </div> </div> </div></div></div></blockquote> <br> <br> </div></div></div></div>
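Since Jelle noted the socket still answers unixcat or socat during the outage, it's worth showing what that manual probe looks like. Livestatus speaks a simple line protocol: a GET line naming a table, optional header lines such as Columns, and a blank line to terminate the request. The sketch below just builds and prints such a query; the socket path /var/cache/naemon/live is the one used throughout this thread, so adjust it to whatever your broker_module line configures.

```shell
# A minimal livestatus query: request line, a Columns header, and a
# blank-line terminator. program_version and program_start are columns
# of the built-in "status" table.
query='GET status
Columns: program_version program_start

'
printf '%s' "$query"

# Against a running naemon you would send it to the socket, e.g.:
#   printf '%s' "$query" | unixcat /var/cache/naemon/live
# or with socat:
#   printf '%s' "$query" | socat - UNIX-CONNECT:/var/cache/naemon/live
```

If the daemon is healthy this returns the version and start time immediately; a hang or "Connection refused" here narrows the problem to the socket side rather than Thruk.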
</blockquote></div><br></div>