Distributed Nagios eXecutor (DNX)

Discussions around miscellaneous technologies and projects for the general membership.
WelchTC
Senior Member
Posts: 2088
Joined: Wed Sep 06, 2006 7:51 am
Location: Kaysville, UT, USA

Distributed Nagios eXecutor (DNX)

Postby WelchTC » Tue Mar 20, 2007 7:56 am

I recently posted an article here in our work section about a cool project that a team of engineers has been working on. We would like your feedback. Tell us what you think.

Tom

rmrichesjr
Community Moderators
Posts: 1038
Joined: Thu Jan 25, 2007 11:32 am
Location: Dundee, Oregon

What kind of feedback is wanted?

Postby rmrichesjr » Tue Mar 20, 2007 10:42 am

Tom, I have one suggestion: what kind of feedback do you want? Are you looking for a technical review of the workflow diagrams? Or a review of the writing of the article? Are you looking for suggestions for improving the system? Are you looking for volunteers to help with this or other related projects?

Generic feedback follows: I scanned the first part of the article, but my lack of in-depth familiarity with Nagios left me unable to follow the meat of the article well enough to be worth a detailed reading of the rest of it, so I just skimmed the rest of the article. It looks like some cool high-tech stuff.

WelchTC
Senior Member
Posts: 2088
Joined: Wed Sep 06, 2006 7:51 am
Location: Kaysville, UT, USA

Postby WelchTC » Tue Mar 20, 2007 12:39 pm

rmrichesjr wrote:Tom, I have one suggestion: what kind of feedback do you want? Are you looking for a technical review of the workflow diagrams? Or a review of the writing of the article? Are you looking for suggestions for improving the system? Are you looking for volunteers to help with this or other related projects?

Generic feedback follows: I scanned the first part of the article, but my lack of in-depth familiarity with Nagios left me unable to follow the meat of the article well enough to be worth a detailed reading of the rest of it, so I just skimmed the rest of the article. It looks like some cool high-tech stuff.

Any and all feedback. Your "general feedback" is of help as well.

Thanks,

Tom

AugustineAS
New Member
Posts: 3
Joined: Wed Mar 14, 2007 7:46 am

Postby AugustineAS » Tue Mar 20, 2007 5:28 pm

Exactly, we are interested in all of the above. Good technical eyes reviewing the logic and presentation are very much appreciated.

Volunteers to help directly with the project would be really great. The project is now on SourceForge (http://dnx.sourceforge.net), since it is of wider applicability than just within the Church. We are looking to get some eyes beyond our own to poke at the code to see if they can spot any weaknesses that we might have missed. Testers would be wonderful. Coders, of course, would be really great.

Even feedback on the code documentation would be appreciated. Is the code simple to understand? Are the comments sufficient to clarify what it is doing? Is it modular enough? If you had to pick up this codebase and run with it, could you easily? What would give you heartburn?

We have found a couple of minor problems since our last release that we just don't have resources to fix at the moment (we are able to work around them pretty easily). If someone were to take a look at the bug tracker and the code I suspect they would be easy to spot and maybe resolve (they are in the "client" and don't require an understanding of Nagios itself).

Jump on the mailing list (dnx-devel@lists.sourceforge.net) or speak up here. Any thoughts are appreciated, even ones of the "Why on earth are you doing it that way?" variety. :-)

BlackRG
Member
Posts: 75
Joined: Mon Feb 12, 2007 1:31 pm
Location: Orem, UT, USA

Postby BlackRG » Tue Mar 20, 2007 7:17 pm

Minor typo in the README:

# data of plugins on the first run, which you may want to fix manually. For exmaple, check_icmp needs to be run as root with

exmaple=example.

BlackRG
Member
Posts: 75
Joined: Mon Feb 12, 2007 1:31 pm
Location: Orem, UT, USA
Contact:

Postby BlackRG » Tue Mar 20, 2007 7:43 pm

I haven't written any real C in over 12 years, and most of what I wrote then was for a DOS environment, so I'm beyond rusty at this point. Browsing the code, though, I didn't see any place that jumped out at me where you authenticate commands coming into the workers on the UDP port (did I miss it?). Just glancing over it, if the UDP port isn't authenticated, would a DoS attack on your workers (clients) be possible?

Overall (I haven't had much time to look it over yet), the idea is definitely interesting. I've only run small Nagios installations so far (single server), though I can easily see where the load would scale to a point where you'd want to distribute it, and this looks like it may be a good way to do that.

With the client (worker) installations, is a Nagios install needed at all, or does DNX do all the work and simply sync (and handle) any needed plugins?

Thanks!

mkmurray
Senior Member
Posts: 3241
Joined: Tue Jan 23, 2007 9:56 pm
Location: Utah

Postby mkmurray » Tue Mar 20, 2007 10:37 pm

I started [thread=363]a thread[/thread] regarding this topic in the Developer Help Wanted forum.

tomammon-p40
New Member
Posts: 1
Joined: Thu Mar 22, 2007 11:34 am

Thoughts on the article

Postby tomammon-p40 » Thu Mar 22, 2007 11:48 am

I like the article. The workflow diagrams were helpful.


The article is a good conceptual "mission statement" for the project, but I think more configuration details would be helpful. Would the nagios.cfg change when you are using the DNX workers? Which config files change, and how do they change?

You say that "The sole purpose of DNX is to off-load the work of plug-in execution to servers other than the Nagios scheduling server." Does this mean that there is a DNX daemon running on each of the worker nodes instead of a full install of Nagios? What mechanism does the DNX worker use to communicate with the Nagios process on the scheduling server?

Also, from the first diagram, there appears to be quite a bit of overhead in the communication between DNX workers and the scheduling server. If you have to request each service check from the scheduling server before executing it on the DNX worker, where exactly are your processing savings? My guess is that the savings come from the scheduling server no longer having to run the actual plugin for each service check, but this is not clear in the article.

Has this concept been implemented in any test lab environments? It would be interesting to set it up next to another installation using the traditional method (NSCA, passive checks, etc.) and see exactly how much you save. You could set up these parallel installs, have a Cacti server watch them both, and then compare the results.


Tom

AugustineAS
New Member
Posts: 3
Joined: Wed Mar 14, 2007 7:46 am

Postby AugustineAS » Thu Mar 22, 2007 8:58 pm

gblack wrote:Minor typo in the README:

# data of plugins on the first run, which you may want to fix manually. For exmaple, check_icmp needs to be run as root with

exmaple=example.


Good catch. Fixed in the next release (tomorrow hopefully).

gblack wrote:I haven't written any real C in over 12 years, and most of what I wrote then was for a DOS environment, so I'm beyond rusty at this point. Browsing the code, though, I didn't see any place that jumped out at me where you authenticate commands coming into the workers on the UDP port (did I miss it?). Just glancing over it, if the UDP port isn't authenticated, would a DoS attack on your workers (clients) be possible?


Good eye. Yes, there isn't any real authentication being done on either side yet (there is some simple IP matching, though). We plan to add a layer of encryption, or possibly just a signed message digest, to prevent DoS attacks and hijacking of requests (i.e., making the clients execute arbitrary code as a check). This definitely isn't meant to be run on a public network. Heads and nodes should all be running on the same switch if at all possible; network latency is the enemy in this application.

gblack wrote:With the client (worker) installations, is a Nagios install needed at all, or does DNX do all the work and simply sync (and handle) any needed plugins?


Correct. No Nagios install on the clients is necessary, only the plug-ins need to be installed. For that matter, we have a hook that allows you to run an external script to sync the plug-ins to the worker nodes on reload. We are using a simple rsync command as our "script".

mkmurray wrote:I started [thread=363]a thread[/thread] regarding this topic in the Developer Help Wanted forum.


Thanks mkmurray!

tomammon wrote:Would the nagios.cfg change when you are using the DNX workers? Which config files change, and how do they change?


The only change needed to the Nagios configs themselves is the line that loads the NEB module (NEB = Nagios Event Broker; it works much like a kernel module) into Nagios. Everything else remains untouched. We wanted this to have minimal impact on the administrator of the system. If DNX fails in some way (including if all worker nodes die), execution continues as if it weren't loaded at all.
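For reference, loading a NEB module is a one-line nagios.cfg change along these lines (the module path and filename here are illustrative, not necessarily what DNX installs):

```
# nagios.cfg -- enable the event broker and load the DNX server module.
# Path and filename are examples only; use what your DNX build provides.
event_broker_options=-1
broker_module=/usr/local/nagios/lib/dnxServer.so
```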

tomammon wrote:You say that "The sole purpose of DNX is to off-load the work of plug-in execution to servers other than the Nagios scheduling server." Does this mean that there is a DNX daemon running on each of the worker nodes instead of a full install of Nagios?


Correct. That sentence might be a bit misleading, since DNX's sole purpose is to off-load the work of plug-in execution while still meeting the requirements in the "Approach" section. The truth is, the article comes from two different documents that I stuck together for posting here. Looks like I was a bit too sloppy; that whole paragraph is unnecessary.

tomammon wrote:What mechanism does the DNX worker use to communicate with the Nagios process on the scheduling server?


All the communication is done via XML over UDP datagrams (very XML-RPC-like). We had originally implemented it with TCP, but the muxing got messy. At some point I would like to offer the option of either TCP or UDP; it is coded in such a way that the transport can be easily replaced.

tomammon wrote:Also, from the first diagram, there appears to be quite a bit of overhead in the communication between DNX workers and the scheduling server. If you have to request each service check from the scheduling server before executing it on the DNX worker, where exactly are your processing savings? My guess is that the savings come from the scheduling server no longer having to run the actual plugin for each service check, but this is not clear in the article.


I assume you mean the "In-depth Workflow: Server Side" diagram. There is a very small overhead associated with this method, but it is more than made up for by moving the checks off of the main server. In our testing with a near-zero-weight plug-in, two worker nodes, and about 10,000 checks, we couldn't produce any real check-execution-time latency (including the LAN delay), and saw no visible CPU usage or load changes on either the client or the server. Granted, we haven't done the serious methodical testing yet, so YMMV.

You are correct that the main savings are in the scheduling server not having to actually execute the checks itself. The checks are far and away the main source of load for a Nagios box. I will see what I can do to make that clear in the article. It needs to be re-written.

tomammon wrote: Has this concept been implemented in any test lab environments? It would be interesting to set it up next to another installation using the traditional method (NSCA, passive checks, etc.) and see exactly how much you save. You could set up these parallel installs, have a Cacti server watch them both, and then compare the results.


Yes, we have a test environment and we have finished most of our basic testing. We discovered some corner cases we hadn't accounted for, which has held up the real performance testing a bit (along with all the other things that tend to pop up when you are trying to get the fun work done :-), but what we have seen so far is excellent. Our next release should fix at least two of the three known bugs, so we can resume load testing. We have 3,000 checks from one of our servers copied over to the staging cluster, and it seems to be humming along nicely. We are still in the process of copying over the 6,000+ checks from the most heavily loaded machine.

It would be very interesting to see DNX perform side by side with NSCA, particularly with the piped/batch approach described on the nagios-devel list last month. I do think we are more efficient overall, if for no other reason than we don't need Nagios running in two places. Any takers on side by side testing? I don't think we have the time in the immediate future, but I think it would be worthwhile.

But really, even if DNX were slightly worse performance-wise (which I am sure it isn't :-), its advantages lie in not having to maintain configurations in multiple places, in transparent failover in the event of "slave" failure (something you don't get with the traditional NSCA method), and in the ability to add and remove worker nodes on the fly (that is, without reloads or configuration changes; again, impossible with NSCA).

Did I answer all the questions, and are they clear enough? I am happy to go into more detail if people are interested. This is great discussion and feedback. Thanks, and keep it coming!

mkmurray
Senior Member
Posts: 3241
Joined: Tue Jan 23, 2007 9:56 pm
Location: Utah

Ymmv

Postby mkmurray » Thu Mar 22, 2007 9:07 pm

AdamA wrote:Granted, we haven't done the serious methodical testing yet, so YMMV.

YMMV = Your mileage may vary?

I had to Google it, I've never seen that acronym before.

