Network Interface Statistics Collection Agent

Welcome to N.I.S.C.A. v2.5 (15 October, 2003)

Yeah, yeah, I know, this is more of a FAQ than a Readme, but so?


About NISCA

The Network Interface Statistics Collection Agent (or N.I.S.C.A.) is a complete network statistics collector and graph generator aimed at helping network administrators do their job by providing functionality that MRTG doesn't offer.

You moved the Changelog???

Yep, sure did. It's now located in the CHANGELOG file instead.
And the To Do list has moved to the TODO file.

Why does NISCA exist?

NISCA was born to replace the popular MRTG package. Although MRTG is a fine application respected all over the world, I've always found it lacking some features that I really wanted in a network statistics analyzer: things like TrueType fonts, collection of data into a database, the "time zoom" feature, and the ability to generate graphs from any period in the past without losing any detail due to data compression.

NISCA 2.5 features

How does one use it?

First, install Apache, MySQL, and PHP4 (see the PHP_HINTS file).
Then, install NISCA (see the INSTALL file).
Then, configure it (see the INSTALL file).
Then, use it (see below).

The form on the index.phtml page is fairly self-explanatory, but here are some things to keep in mind while using it.

NOTE:

The list of hosts/interfaces to choose from on the index page is generated from the actual collected stats in the database, NOT the interfaces you have configured in the administration section! I'm hoping the reason for this is obvious. :)

Also note that communities are not displayed or used at all on either the index page or the actual report page for security reasons, so if you monitor more than one community on the same host, and the same interface name exists in both communities, you'll only see it on this page once... but a report generated for that interface will merge the stats for every community it exists under. Try to avoid monitoring the same interface via two different communities on the same host for this reason. :)

If using the "fancy" Javascript host/interface selection method, here's what you do...

  1. Pick the hostname out of the first select box. This will set the bottom-left select box to a list of the interfaces in every community on that host.

  2. Click on an interface you want a report on. This will add it to the bottom-right select box.

  3. Repeat as necessary, changing the hostname as needed to get the interfaces you want in the select box.

  4. The "Clear Deselected" button will clear any interfaces from the bottom-right select box that aren't selected, but it tends to crash Netscape 4.7 on X Windows, and Mozilla 0.8 has some major trouble with multiple select boxes, so don't rely on it.

  5. The "Clear All" button erases all interfaces in the bottom-right list; it shouldn't crash anything.

  6. Once you have the perfect list of interfaces to report on in the bottom-right box, set the other report options on the page as desired and submit it.

    If not using the fancy selection method, you'll get a list of every hostname/interface combination to pick from. This is more convenient if you only have a few interfaces monitored, but can be annoying if there are hundreds to choose from. You can turn off the fancy method in the Global Options config section. If you're using Mozilla, you'll then get to see the <OPTGROUP> tags in action; they separate each host to make the list easier to navigate. If you have any other browser in existence, they probably won't show up. (Even though OPTGROUP is in the HTML 4.0 standard, Mozilla, along with any other browser based on the Gecko rendering engine, is the only browser I could find that claims to support HTML4 and actually supports OPTGROUPs. I know all versions of Netscape and Internet Explorer (tm) don't support it even though they claim to be HTML4-compliant. Are you surprised?)

    If you have the "Select how much data to view here" drop-down box set to "A date/time range, set below", it will use the "from" and "to" dates and times at the bottom of that section to restrict which data to analyze; otherwise, it will ignore everything in the "From" and "To" boxes.

    If you have one of the options with an "X" in it selected, it will use whatever is in the "X = ___" field in place of the "X". So if you select "The past X hours" and put "3" in the "X = ___" field, you'll get a report covering the past three hours and nothing more. You might get something less though, since the odds are great that it won't return exactly three hours' worth of data; you're more likely to get two hours and 59 minutes' worth, depending on what you have the "$delay" set to. NISCA always uses the actual time stamps in the database rather than trying to force it to precisely match a particular time frame. You can also put decimals in it; for example, "1.5" in "X" and "Days" in the dropdown box will give you stats for the past day and a half.
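As a rough illustration of how an "X" value (decimals included) turns into a report window, here is a minimal Python sketch. It is not NISCA's own code (NISCA is written in PHP); the unit table and the function name report_window are made up for the example.

```python
from datetime import datetime, timedelta

# Hypothetical unit table; the option names on the real form may differ.
SECONDS_PER_UNIT = {
    "minutes": 60,
    "hours": 3600,
    "days": 86400,
    "weeks": 604800,
}

def report_window(x, unit, now):
    """Return (start, end) for 'the past X <unit>', allowing decimal X."""
    span = timedelta(seconds=float(x) * SECONDS_PER_UNIT[unit])
    return now - span, now

now = datetime(2003, 10, 15, 12, 0, 0)
start, end = report_window("1.5", "days", now)
print(start)  # 2003-10-14 00:00:00
```

The window is only the outer limit; as described above, the report itself uses whatever time stamps actually fall inside it.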

    If you select "The entire contents of the database", that's exactly what you'll get... so be careful if you've got years and years of data collected four times a minute in it. NISCA ain't that fast yet... :)

    Sometimes, like during fsck-laden reboots (and let me just take this opportunity to plug ReiserFS, which has saved my butt quite a few times now) or periods during which you didn't collect data, there will be gaps in the data. In this case, NISCA will point out the places where it filled in the intervening space as best it could by putting the From and To times in red. It detects a gap by adding twice the requested summary interval (last section of the form) to the previous timestamp; if the result is still less than the current timestamp, it assumes there was a gap and makes that line red just to call your attention to it. This doesn't catch all gaps, though, only the ones longer than twice the Summary Interval you specified on the report form. However, it always calculates averages using the actual time period of each line, so gaps are always averaged correctly whether the intervening time matches the requested summary interval or not.
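The gap test described above can be sketched in a few lines of Python. This is only an illustration of the comparison, not the actual report code; the function name find_gaps and the use of plain Unix timestamps are assumptions.

```python
def find_gaps(timestamps, summary_interval):
    """Return (previous, current) timestamp pairs that would be flagged as gaps.

    timestamps: sorted Unix times (seconds) of consecutive rows.
    summary_interval: requested summary interval in seconds.
    """
    gaps = []
    for prev, cur in zip(timestamps, timestamps[1:]):
        # Flag the pair when the previous stamp plus twice the interval
        # is still earlier than the current stamp.
        if prev + 2 * summary_interval < cur:
            gaps.append((prev, cur))
    return gaps

stamps = [0, 300, 600, 1500, 1800]
print(find_gaps(stamps, 300))  # [(600, 1500)]
```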

    Each graph contains a red circle around the largest Y-axis values found on it, so you can quickly find the peaks. Peak values and times are placed on the top of the graph.

    Each graph generated is given a unique filename using a rather large random number, so every time you run it it'll give you a different image filename. This is thanks to the (mis)behavior of the caching mechanisms of almost all browsers. Also, every time you run it, any graphs older than one minute are deleted, so there shouldn't be any build-up of them.
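For the curious, the naming and clean-up scheme looks roughly like this in Python. It is a sketch under assumptions: the "graph" prefix, the ".png" extension, and both function names are invented for the example, and NISCA's real PHP code may do it differently.

```python
import os
import random
import time

def new_graph_name(prefix="graph", ext=".png"):
    """Build a filename with a large random number so a browser cannot reuse a cached image."""
    return f"{prefix}{random.randrange(10**12)}{ext}"

def purge_old_graphs(directory, max_age=60, now=None):
    """Delete files in `directory` last modified more than max_age seconds ago.

    Returns the names of the files that were removed.
    """
    now = time.time() if now is None else now
    removed = []
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        if os.path.isfile(path) and now - os.path.getmtime(path) > max_age:
            os.remove(path)
            removed.append(name)
    return removed
```

Calling purge_old_graphs() on the graph directory before writing a new image keeps at most a minute's worth of old graphs on disk.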

    Reports can be saved under any name you wish. Once you've set the options on the form the way you want to save them, enter a name for the report (near the bottom of the page) and then hit the "Run It" button. The report options will be saved, then used to display the requested information. But if a report already exists with the name you choose, it won't be overwritten; you have to use the admin section to either delete it and then try to save it again, or save it under a different name (it will still show you the results of the options you chose, it just won't save them as that report name).

    To recall a saved report, just click on it in the drop-down list at the top of the index page. If you have Javascript disabled, you'll have to then click the "Run It" button to view it; if it's enabled, the report will be displayed as soon as you change the value of the drop-down list.

    Report administration is handled via the administration pages; that's the only place an existing report can be changed or deleted.

    One more thing; if you submit a report and then hit "escape" to stop loading it before it displays, the servers and scripts will continue to grind along working on it even though you'll never see the results. Try to avoid doing it... it can cause slowness. :)

    Oh. See the end of the INSTALL file for instructions on using the fancy administration section.

Fine Tuning NISCA

    This will be one of the hardest tasks for a NISCA user, but all the people involved in developing and contributing to this project are working hard in order to provide as much information as possible.

    The NISCA user has to take many parameters into account in order to set up the COLLECTION INTERVAL and the number of hosts/interfaces monitored with NISCA. The interval works rather differently from MRTG's, which is driven by crontab and thus can generate overlapping statistics if collection takes longer than the crontab interval (300 seconds, usually). The collectors in NISCA poll all your monitored hosts and THEN go to sleep for the delay interval you have configured; thus, if collection takes six minutes and your delay is 5 minutes, the effective delay time will be eleven minutes. Running the command "snmp_collect t" will help you determine how long each collection cycle takes, and you can adjust your interval time accordingly. (The "t" puts it in debug mode.)
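The poll-then-sleep behaviour and its effect on timing can be sketched as follows. This is illustrative Python, not the collector itself; run_collector, effective_interval, and delay_for_target are hypothetical names, and poll_all_hosts stands in for whatever actually does the SNMP polling.

```python
import time

def run_collector(poll_all_hosts, delay_seconds, cycles, sleep=time.sleep):
    """Poll every host, THEN sleep for the configured delay, `cycles` times."""
    for _ in range(cycles):
        poll_all_hosts()
        sleep(delay_seconds)

def effective_interval(collection_seconds, delay_seconds):
    """Seconds between the start of one polling cycle and the start of the next."""
    return collection_seconds + delay_seconds

def delay_for_target(target_seconds, collection_seconds):
    """Delay that gives roughly target_seconds between cycle starts (never negative)."""
    return max(0, target_seconds - collection_seconds)

print(effective_interval(360, 300))  # 660 seconds, i.e. eleven minutes
print(delay_for_target(300, 40))     # 260
```

So if the debug run shows a cycle taking about 40 seconds and you want datapoints roughly five minutes apart, a delay of about 260 seconds would be a reasonable starting point.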

    Another thing about the collection interval. The smaller it is, the "fuzzier" your graphs will be. Anything less than 15 seconds or so will be just about useless. A 5-minute delay will probably look best on fast interfaces (and take up much less database space). :)

    People need to evaluate many parameters in order to avoid overloading the whole system (NISCA, the network, the monitored hosts, etc.). Estimating all these parameters is very complex, especially because various systems react in different ways to SNMP requests and network conditions can change from moment to moment.

    Do not overestimate your setup's abilities!

    One thing about NISCA is that it uses memory, a lot of it, while it's generating reports for you (and only then). And the more datapoints being analyzed, the more memory it takes. This means you can quickly get several httpd processes taking up lots and lots of memory. To help fix this, I've changed the "MaxRequestsPerChild" setting in Apache's httpd.conf file from its default of "0" (unlimited) to "1". This will force every child server process to die as soon as it's done with its request, and thus it won't consume all your memory. Setting this to "2" or higher doesn't seem to do much good; the children don't die, and new children are spawned which will take up just as much space, so if you run four 60-meg reports one after the other you could bring your machine to a complete halt if it's set higher than "1". Your mileage, as always, may vary; tune it for your own setup. This seems to be much better behaved with later Apaches (1.3.28 is what I use now and it plays nice).

    I've also seen PHP die with an error similar to "Maximum allowed memory usage exceeded" when viewing large reports. If this happens to you a lot, you can edit your "php.ini" file and change the max memory allowed (it defaults to 8 meg, 8388608). This setting is called "memory_limit". Don't forget to HUP or restart your web server if you change this.

    In the future, more technical details will be provided, but for now the user should start with a minimum setup: a Delay value of 300 seconds at first, then slowly increase the number of interfaces and decrease the Delay time in order to not overload both the NISCA server and the network(s) over which the server is polling the monitored hosts. Trial and error is the best way to see what you can get away with.

Benchmark

    The report generator currently generates a graph from 100,000 datapoints (1 year's worth) in about 30 seconds running on an 850 MHz AMD Athlon with 768 MB of RAM. A multi-interface report which adds the transfer averages of 2 interfaces over a one-month period (some 15,000 entries) takes 55 seconds (it's a much more intense operation). Your Mileage May Vary. It's the graph generation and the database bottleneck that take so long. Yes, I'm working on ways of speeding it up... it ain't easy.

    Apparently the report generation time isn't entirely cumulative; getting reports one interface at a time takes more time than one report on many interfaces.

    As for disk space, the 1,780,000 entries in my database take up 360 meg of disk space in the form of MySQL tables/indices. Since I moved the hosts, communities, and interfaces to another table and now use medium integers to refer to them in the "stats" table, the disk space usage has been cut in half and response time of just about everything (except reports) has become instantaneous since it doesn't have to look through hundreds of thousands of rows to find every unique host/community/if now. Even graph generation speed has been doubled just from this one change. Just the opposite of the effect I thought it would have; live and learn, I always say.
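If you want a back-of-the-envelope guess at your own disk usage, the figures above can be plugged into a small Python sketch. Big assumptions: "meg" is taken as 2**20 bytes, the per-row cost (tables plus indices) is assumed to scale linearly, and collection time is ignored, so the real row count will be somewhat lower. The function names are made up for the example.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600

def bytes_per_row(total_bytes, rows):
    """Average on-disk cost of one stats row, indices included."""
    return total_bytes / rows

def rows_per_year(interfaces, delay_seconds):
    """Rows collected in a year if each cycle took no time at all."""
    return interfaces * SECONDS_PER_YEAR // delay_seconds

def megabytes_per_year(interfaces, delay_seconds, row_bytes):
    """Estimated yearly growth in megabytes (2**20 bytes)."""
    return rows_per_year(interfaces, delay_seconds) * row_bytes / 2**20

# Figures quoted above: 1,780,000 entries in 360 meg.
row_bytes = bytes_per_row(360 * 2**20, 1_780_000)
print(round(row_bytes))       # 212
print(rows_per_year(1, 300))  # 105120
```

With those numbers, ten interfaces polled every 300 seconds would come to a bit over 200 meg a year, give or take.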

A Detailed Description of the Multiple-Interface Graphing Method

    I'm including this just to satisfy people's curiosity. I'm sure there are other geeks out there who'd love to know how it works. So here we go... warning; I can't avoid getting a bit technical.

    When I set out to write the multi-IF graphing code, I had no idea how complicated it was, or how simple the final solution would be. I had to rewrite it all from scratch four times to get it right, making all kinds of notes and diagrams and drawings and stuff. Here's what I finally came up with.

    First of all, I didn't want Nisca to do it the MRTG way: require you to be collecting pre-summed statistics from each interface desired before you can draw a graph of it. It just seemed silly to me, especially since it requires that you poll each interface TWICE... once for the regular single stats, and once again for the summed-interfaces stats. There had to be a way to take any existing set of statistics for any combination of interfaces on any number of hosts over any time period, whether all the interfaces involved had identical time periods or not, and add them together in the same time periods. My first attempt was horrible; I won't bore you with the grisly details of how a one-month report took half an hour and 500 meg of memory, and then delivered a graph that looked like something a drunk centipede had walked all over after wading through a few pools of paint. Let's just say, I wasn't satisfied.

    So after the second rewrite attempt, I'm sitting there staring into space trying to think of an answer, and I realized I was staring at a CD storage rack. And my mind whispered to me, "Pigeonholing!" Just make the entries fall into the right slots, and make the slots as wide as the collection interval. But even that delivered pretty shoddy results. And then I realized something else, something that probably would have occurred instantly to anyone who does statistical analysis for a living.

    Statistics are always measured in pairs. There's a starting point for both the counter itself (which is a running total) and the timestamp at which it was collected, and a corresponding ending point. You find the amount of traffic transferred by subtracting the earlier counter from the later counter. If the machine has rebooted in between them, this should result in a negative number, because the counter got zeroed out somewhere in between and there's no way to know exactly how much data was transferred. In that case, the entire value of the later counter is used as the amount transferred between the two entries (because it was at least that many bytes, but probably a lot more). Once you have the number of bytes transferred between the two points, you figure out the interval between them; divide the byte count by the interval, and you have the average. But as it turned out, that's useless.
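Here is the pair arithmetic as a small Python sketch. It only covers the reset case described above and does not treat counter wrap-around separately; the function names are invented for the example.

```python
def counter_delta(earlier, later):
    """Bytes transferred between two counter readings.

    A negative difference is taken to mean the counter was zeroed in
    between (a reboot, say), so the later reading alone is used as a
    lower bound on the traffic.
    """
    diff = later - earlier
    return later if diff < 0 else diff

def average_rate(earlier, later, t_start, t_end):
    """Average bytes per second between two readings (timestamps in seconds)."""
    return counter_delta(earlier, later) / (t_end - t_start)

print(counter_delta(1000, 4000))         # 3000
print(counter_delta(9000, 500))          # 500 (counter was reset)
print(average_rate(1000, 4000, 0, 300))  # 10.0
```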

    I had divided the report period up into "pigeonholes," or slots, that were as wide as the requested averaging interval (300 seconds, or 5 minutes, by default). Sometimes an entry's endpoints would both lie within one slot; sometimes its starting point was in one slot and its ending point was in the very next slot; and sometimes there were one or more slots without a datapoint in between them. So I re-re-rewrote it, again, using this approach like a baker uses a flour sifter, and it worked. Imagine my shock.

    It keeps track of the time stamps of the current stat and the previous one. When start and end are in the same slot, it just adds that whole counter change to the slot and keeps going. When the end goes past a slot boundary, it starts doing math. It finds the time between the start time and the slot boundary and divides it by the time between the start and end times. This gives it the percentage of the total time which lies in the earlier slot. It multiplies that by the whole counter change between them, which tells it how much of the data belongs to the earlier slot, and it adds it to the earlier slot's counter array (remember, it's still not an average at this point). Then there are two possibilities.

    It adds the averaging interval to the slot boundary. That will either put the boundary past the end time of the stat, or it won't... meaning there are intervening slots without a stat entry. In the latter case, it repeats the earlier percentage operation, except that this time it divides the averaging interval by the total time of the entry to get the percentage, multiplies that by the whole counter change, and adds the result to that slot. It does this until the slot boundary passes the ending timestamp.

    Once the slot boundary is past the ending timestamp, however it got there, it does the percentage thing again, this time using the time between the slot's beginning and the stat's end timestamp to calculate the percentage, which it multiplies by the total and adds to that slot's counter. And by the way, these percentages can be zero; that just means an entry's start or end lies exactly on a slot boundary, so 100% goes on one side and 0% goes on the other. Pretty neat how that worked out.
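The slot-filling procedure from the last few paragraphs can be sketched in Python like this. It is a reconstruction from the description, not the actual NISCA code: the function name distribute, the dict used for the slot counters, and the handling of an end time that lands exactly on a boundary (the whole share stays on the earlier side and no zero entry is created) are all assumptions.

```python
def distribute(slots, report_start, interval, t_start, t_end, delta):
    """Add `delta` to `slots`, split in proportion to the time spent in each slot.

    slots: dict mapping slot index to accumulated count (not yet an average).
    interval: slot width in seconds (the averaging interval).
    """
    total_time = t_end - t_start
    if total_time <= 0:
        return
    index = int((t_start - report_start) // interval)
    boundary = report_start + (index + 1) * interval  # end of the starting slot
    if t_end <= boundary:
        # Start and end lie in the same slot: the whole change goes there.
        slots[index] = slots.get(index, 0.0) + delta
        return
    # Share belonging to the slot that contains the start time.
    slots[index] = slots.get(index, 0.0) + delta * (boundary - t_start) / total_time
    index += 1
    boundary += interval
    # Intervening slots that the entry spans completely.
    while boundary < t_end:
        slots[index] = slots.get(index, 0.0) + delta * interval / total_time
        index += 1
        boundary += interval
    # Last slot: from the slot's beginning to the entry's end time.
    slot_begin = boundary - interval
    slots[index] = slots.get(index, 0.0) + delta * (t_end - slot_begin) / total_time

slots = {}
distribute(slots, 0, 300, 250, 1000, 7500)
print(slots)  # {0: 500.0, 1: 3000.0, 2: 3000.0, 3: 1000.0}
```

In the example, an entry running from second 250 to second 1000 spends 50, 300, 300, and 100 seconds in four consecutive slots, so its 7500 bytes are split in those proportions and the shares still add up to 7500.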

    Now, this has one unfortunate side-effect. The very last entry for an interface won't have an ending point with which to calculate a transferred count for the last slot. This means the last slot will almost always fall sharply downwards, since it will almost always have far less data transferred in it than all the previous slots. So when viewing these graphs, please don't panic; it doesn't mean all your interfaces went down sometime in the past five minutes or anything. :)

    Now, that was just for one type of data; incoming bytes, say. It has to do all that separately for the incoming and outgoing stats of every entry. And that's why it doesn't support making multiple-interface graphs of packets, or drops, or errors; not only is it kinda pointless, it would also mean another hundred lines of code for each added report type.

    Once it's done every entry, it passes the data to the makegraph() function, which converts them from counts to average per-second rates, just as it does for regular graphs. And that's a whole other story in itself. :)
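The final conversion from per-slot counts to per-second rates presumably amounts to dividing each slot's total by the slot width. A minimal Python sketch of that step follows; slots_to_rates is a hypothetical name and says nothing about what makegraph() really does internally.

```python
def slots_to_rates(slot_counts, interval):
    """Convert accumulated per-slot counts into average per-second rates."""
    return {index: count / interval for index, count in sorted(slot_counts.items())}

print(slots_to_rates({0: 600.0, 1: 3000.0}, 300))  # {0: 2.0, 1: 10.0}
```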

Who has actually contributed?

    New ideas, requests, job offers, and comments are always welcome from everyone. Contributing to NISCA will only improve its quality and your karma. What a bargain!

Oddities

    We now have the first confirmed Nisca-related tech support call, made from Milano to Rome, Italy, at or about 4:00PM (Italy time, +0200) on Monday, June the 25th, 2001. If anyone knows of an earlier call, let me know. :)

Mumbo-Jumbo

    This program is released under the GNU General Public License (see LICENSE). This means you have my permission to do anything you like with it except printing it out, rolling it up, and swatting your pet with it. I will not condone cruelty to animals. And if you make any money off of it, please think of my poor unemployed self and have pity on me as you count your millions.

    You know that really long boring bit about "as-is" and "merchantability" and "fitness for a particular purpose" and all that crap? Insert it here.

    Note that I am not affiliated with Team Nisca, which makes ID card printers; the National Interscholastic Swimming Coaches Association; the Northern Ireland Society for Computing in Anaesthesia; the Nuptial Illusions Service Center of Antarctica; the Non-Ischemic Sub-Cortical Aneurysms society; Naughty Isaac's Stuffed Chicken Arcades, Inc.; or the NISCA protocol, which is used to connect systems in an OpenVMS cluster. Anyone claiming otherwise will certainly be ridiculed into an embarrassing extinction, because to me NISCA means "the Network Interface Statistics Collection Agent" and nothing more. And with version 3.0, it won't even mean that.

    Note that I am affiliated with isthisthingon.org, a very, very non-profit non-organization of no one in particular.

    Any resemblance to actual programs, living or dead, is purely coincidental. I ask you all to reflect upon how often form follows function.

Contact Info-Mation

    Author's email: phee@isthisthingon.org or brett@fnord.org
    Official Site: http://nisca.sourceforge.net/
    Author's ICQ #: 13130273