Distributed Nagios eXecutor (DNX)
Posted: Tue Mar 20, 2007 8:56 am
Community Discussion of Church Technology
https://tech.churchofjesuschrist.org/forum/
https://tech.churchofjesuschrist.org/forum/viewtopic.php?t=361
rmrichesjr wrote: Tom, I have one suggestion: What kind of feedback are you wanting? Are you looking for a technical review of the work flow diagrams? Or are you looking for a review of the writing of the article? Are you looking for suggestions for improvement of the system? Are you looking for volunteers to help with this or other related projects?

Any and all feedback. Your "general feedback" is of help as well.
Generic feedback follows: I scanned the first part of the article, but my lack of in-depth familiarity with Nagios left me unable to follow the meat of the article well enough to be worth a detailed reading of the rest of it, so I just skimmed the rest of the article. It looks like some cool high-tech stuff.
gblack wrote: Minor typo in the README:

# data of plugins on the first run, which you may want to fix manually. For exmaple, check_icmp needs to be run as root with

exmaple=example.

Good catch. Fixed in the next release (tomorrow, hopefully).
gblack wrote: I haven't written any real C in over 12 years, and most of what I wrote then was in a DOS environment, so I'm beyond rusty at this point. Browsing the code, though, I didn't see any place that jumped out at me where you authenticate commands coming into the workers on the UDP port (did I miss it?). Just glancing over it, if the UDP port isn't authenticated, would a DoS attack on your workers (clients) be possible?

Good eye. Yes, there isn't any real authentication being done on either side yet (there is simple IP matching going on, though). We plan to add a layer of encryption, or possibly just a signed message digest, to prevent DoS attacks and hijacking of requests (making the clients execute arbitrary code as a check). This definitely isn't meant to be run on a public network. Heads and nodes should all be running on the same switch if at all possible. Network latency is the enemy in this application.
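A signed message digest of the sort mentioned above could look roughly like this (a minimal Python sketch with a made-up payload and pre-shared key, not DNX's actual implementation):

```python
import hashlib
import hmac

SHARED_KEY = b"not-a-real-key"  # hypothetical pre-shared secret between head and workers

def sign(payload):
    # Append an HMAC-SHA256 digest so a worker can verify the request
    # really came from the head node before executing anything.
    digest = hmac.new(SHARED_KEY, payload, hashlib.sha256).hexdigest().encode()
    return payload + b"|" + digest

def verify(message):
    # Split off the digest and recompute; constant-time compare
    # avoids leaking information through timing.
    payload, _, digest = message.rpartition(b"|")
    expected = hmac.new(SHARED_KEY, payload, hashlib.sha256).hexdigest().encode()
    return payload if hmac.compare_digest(digest, expected) else None
```

This authenticates the sender but does not encrypt the payload; a forged or replayed datagram without the key simply fails verification and is dropped.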
gblack wrote: With the client (worker) installations, is a Nagios install needed at all, or does DNX do all the work and simply sync (and handle) any needed plugins?

Correct. No Nagios install on the clients is necessary; only the plug-ins need to be installed. For that matter, we have a hook that allows you to run an external script to sync the plug-ins to the worker nodes on reload. We are using a simple rsync command as our "script".
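The sync hook described above might be as simple as one rsync per worker node. A small illustrative sketch (the node names and plug-in path are hypothetical; the real hook is whatever script you point DNX at):

```python
# Hypothetical worker nodes and the standard Nagios plug-in directory.
WORKER_NODES = ["worker1", "worker2"]
PLUGIN_DIR = "/usr/local/nagios/libexec/"

def build_sync_commands(nodes, plugin_dir):
    # One rsync invocation per worker node, mirroring the plug-in
    # directory so every node can execute the same checks.
    return [
        ["rsync", "-az", "--delete", plugin_dir, "%s:%s" % (node, plugin_dir)]
        for node in nodes
    ]

for cmd in build_sync_commands(WORKER_NODES, PLUGIN_DIR):
    print(" ".join(cmd))
    # subprocess.run(cmd, check=True) would actually perform the copy
```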
mkmurray wrote: I started [thread=363]a thread[/thread] regarding this topic in the Developer Help Wanted forum.

Thanks, mkmurray!
tomammon wrote: Would the nagios.cfg change when you are using the DNX workers? Which config files change, and how do they change?

The only change to the Nagios configs themselves is the line that loads the NEB module (NEB = Nagios Event Broker; it works very similarly to a kernel module) into Nagios itself. Everything else remains untouched. We wanted this to have minimal impact on the administrator of the system. If DNX fails in some way (including if all worker nodes die), execution continues as if it weren't loaded at all.
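For reference, NEB modules are loaded with the standard `broker_module` directive in nagios.cfg; a fragment might look like this (the module path and filename are illustrative, not necessarily where your build installs DNX):

```
# nagios.cfg -- the only DNX-related change on the scheduling server
event_broker_options=-1
broker_module=/usr/local/nagios/lib/dnxServer.so
```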
tomammon wrote: You say that "The sole purpose of DNX is to off-load the work of plug-in execution to servers other than the Nagios scheduling server." Does this mean that there is a DNX daemon running on each of the worker nodes, instead of a full install of Nagios?

Correct. That sentence might be a bit misleading, since DNX's sole purpose is to off-load the work of plug-in execution while still accounting for the requirements in the "Approach" section. The truth is, the article is from two different documents that I stuck together for posting here. Looks like I was a bit too sloppy; that whole paragraph is unnecessary.
tomammon wrote: What mechanism does the DNX worker use to communicate with the Nagios process on the scheduling server?

All the communication is done via XML over UDP datagrams (very XML-RPC-like). We had originally implemented it with TCP, but the muxing got to be messy. At some point I would like to offer the option of either TCP or UDP. It is coded in such a way that the transport can easily be replaced.
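The shape of such a message can be sketched like this. Note the element names and port are invented for illustration; this is not DNX's actual wire schema, just the XML-over-UDP pattern described above:

```python
import xml.etree.ElementTree as ET

def build_request(xid, worker_addr):
    # An XML-RPC-flavored job request (illustrative schema only).
    root = ET.Element("dnxMessage")
    ET.SubElement(root, "request").text = "GetJob"
    ET.SubElement(root, "xid").text = xid
    ET.SubElement(root, "address").text = worker_addr
    return ET.tostring(root)

def parse_request(datagram):
    # Each UDP datagram carries one complete XML document,
    # so parsing is a single fromstring() with no stream framing.
    root = ET.fromstring(datagram)
    return {child.tag: child.text for child in root}

# The whole message fits in one datagram and goes out with a single
# sendto() -- no connection setup or stream muxing, which is the
# appeal of UDP here:
#   sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
#   sock.sendto(build_request("12345", "10.0.0.7"), (head_node, 12480))
```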
tomammon wrote: Also, from the first diagram, there appears to be quite a bit of overhead involved in the communication between DNX workers and the scheduling server. If you are going to have to request each service check from the scheduling server before executing the check on the DNX worker, where exactly are your processing savings? My guess is that you lose the work done on the scheduling server since it no longer has to run the actual plugin to run the service check, but this is not clear in the article.

I assume you mean the "In-depth Workflow: Server Side" diagram. There is a very small overhead associated with this method, but it is more than made up for by moving the checks off of the main server. In our testing with a near-zero-weight plug-in, two worker nodes, and about 10,000 checks, we couldn't produce any real check-execution-time latency (even including the LAN delay), nor any visible CPU usage or load changes on either the client or the server. Granted, we haven't done the serious methodical testing yet, so YMMV.
tomammon wrote: Has this concept been implemented in any test lab environments? It would be interesting to set it up next to another installation using the traditional method (NSCA, passive checks, etc.) and see exactly how much you save. You could set up these parallel installs and have a Cacti server watch them both and then compare the results.

Yes, we have a test environment and we have finished most of our basic testing. We have discovered some corner cases we didn't account for, so that has held up the real performance testing a bit (as well as all the other things that tend to jump up when you are trying to get the fun work done), but what we have seen so far is excellent. Our next release should fix at least two of the three known bugs, so we can resume our load testing. We have 3,000 checks from one of our servers copied over to the staging cluster, and it seems to be humming along nicely. We are still in the process of copying over the 6,000+ checks from the most heavily loaded machine.
AdamA wrote: Granted, we haven't done the serious methodical testing yet, so YMMV.

YMMV = Your mileage may vary?