Welcome to N.I.S.C.A. v2.5 (15 October, 2003)
Yeah, yeah, I know, this is more of a FAQ than a Readme, but so?
The Network Interface Statistics Collection Agent (or N.I.S.C.A.) is a complete network statistics collector and graph generator aimed at helping network administrators do their job by providing functionality that MRTG doesn't offer.
That big list of changes that used to be here? It's now located in the CHANGELOG file instead.
And the To Do list has moved to the TODO file.
NISCA was born to replace the popular MRTG package. Although MRTG is a fine application respected all over the world, I've always found it lacking some features that I really wanted in a network statistics analyzer; things like true-type fonts, collection of data into a database, the "time zoom" feature, and the ability to generate graphs from any period in the past without losing any detail due to data compression.
First, install Apache, MySQL, and PHP4 (see the PHP_HINTS file).
Then, install NISCA (see the INSTALL file).
Then, configure it (see the INSTALL file).
Then, use it (see below).
The form on the index.phtml page is fairly self-explanatory, but here are some things to keep in mind while using it.
The list of hosts/interfaces to choose from on the index page is generated from the actual collected stats in the database, NOT the interfaces you have configured in the administration section! I'm hoping the reason for this is obvious. :)
Also note that, for security reasons, communities are not displayed or used at all on either the index page or the actual report page. So if you monitor more than one community on the same host and the same interface name exists in both communities, you'll only see that interface on this page once... but a report generated for it will merge the stats from every community it exists under. Try to avoid monitoring the same interface via two different communities on the same host for this reason. :)
If using the "fancy" Javascript host/interface selection method, here's what you do...
If not using the fancy selection method, you'll get a list of every hostname/interface combination to pick from. This is more convenient if you only have a few interfaces monitored, but can be annoying if there are hundreds to choose from. You can turn off the fancy method in the Global Options config section. If you're using Mozilla, you'll then get to see the <OPTGROUP> tags in action; they group the entries under each host to make the list easier to navigate. In just about any other browser, they probably won't show up. (Even though OPTGROUP is in the HTML 4.0 standard, Mozilla and other browsers based on its Gecko rendering engine are the only ones I could find that claim to support HTML4 and actually support OPTGROUP. All the versions of Netscape and Internet Explorer (tm) I know of don't support it, even though they claim to be HTML4-compliant. Are you surprised?)
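For the curious, the grouped list boils down to markup along these lines (hand-written to show the idea, with made-up host and interface names; the real index.phtml output will differ):
    <select name="interface">
      <optgroup label="router1">
        <option>router1 : eth0</option>
        <option>router1 : eth1</option>
      </optgroup>
      <optgroup label="router2">
        <option>router2 : serial0</option>
      </optgroup>
    </select>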
If you have the "Select how much data to view here" drop-down box set to "A date/time range, set below", it will use the "from" and "to" dates and times at the bottom of that section to restrict which data to analyze; otherwise, it will ignore everything in the "From" and "To" boxes.
If you have one of the options with an "X" in it selected, it will use whatever is in the "X = ___" field in place of the "X". So if you select "The past X hours" and put "3" in the "X = ___" field, you'll get a report covering the past three hours and nothing more. You might get something less though, since the odds are great that it won't return exactly three hours' worth of data; you're more likely to get two hours and 59 minutes' worth, depending on what you have the "$delay" set to. NISCA always uses the actual time stamps in the database rather than trying to force it to precisely match a particular time frame. You can also put decimals in it; for example, "1.5" in "X" and "Days" in the dropdown box will give you stats for the past day and a half.
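In other words, the arithmetic behind those options amounts to something like this (a rough PHP sketch, not the actual NISCA code; the variable names are mine):
    $x    = 3;                     // whatever you typed in the "X = ___" field
    $unit = 3600;                  // seconds per hour, for "The past X hours"
    $to   = time();
    $from = $to - ($x * $unit);    // decimals like 1.5 work fine here too
    // ...and only database rows stamped between $from and $to get analyzed.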
If you select "The entire contents of the database", that's exactly what you'll get... so be careful if you've got years and years of data collected four times a minute in it. NISCA ain't that fast yet... :)
Sometimes, like during fsck-laden reboots (and let me just take this opportunity to plug ReiserFS, which has saved my butt quite a few times now) or periods during which you didn't collect data, there will be gaps in the data. In this case, NISCA fills in the intervening space as best it can and points out where it did so by putting the From and To times in red. It detects this condition by adding twice the requested summary interval (last section of the form) to the previous timestamp; if that's still less than the current timestamp, it assumes there was a gap and makes it red just to call your attention to it. This doesn't catch all gaps, though, only the ones larger than the Summary Interval you specified on the report form. However, it always calculates averages using the actual time period of each line, so gaps are always averaged correctly whether the intervening time matches the requested summary interval or not.
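The check itself is simple; roughly this (a sketch, not the actual code, and the variable names are mine):
    if ($previous_timestamp + (2 * $summary_interval) < $current_timestamp) {
        // looks like a gap: show this line's From and To times in red
        $mark_in_red = true;
    }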
Each graph contains a red circle around the largest Y-axis values found on it, so you can quickly find the peaks. Peak values and times are placed on the top of the graph.
Each graph generated is given a unique filename using a rather large random number, so every time you run it it'll give you a different image filename. This is thanks to the (mis)behavior of the caching mechanisms of almost all browsers. Also, every time you run it, any graphs older than one minute are deleted, so there shouldn't be any build-up of them.
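Roughly speaking, it behaves like this sketch (illustrative only; the real filenames and code differ):
    $graphfile = "graph_" . mt_rand() . ".png";   // a fresh name every run defeats browser caches
    foreach (glob("graph_*.png") as $old) {
        if ((time() - filemtime($old)) > 60) {    // older than one minute?
            unlink($old);                         // clean it up
        }
    }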
Reports can be saved under any name you wish. Once you've set the options on the form the way you want them, enter a name for the report (near the bottom of the page) and then hit the "Run It" button. The report options will be saved, then used to display the requested information. But if a report already exists with the name you chose, it won't be overwritten; you'll have to either delete the existing one in the admin section and save again, or save under a different name. (Either way it will still show you the results of the options you chose, it just won't save them under the existing report name.)
To recall a saved report, just click on it in the drop-down list at the top of the index page. If you have Javascript disabled, you'll have to then click the "Run It" button to view it; if it's enabled, the report will be displayed as soon as you change the value of the drop-down list.
Report administration is handled via the administration pages; that's the only place an existing report can be changed or deleted.
One more thing: if you submit a report and then hit "escape" to stop loading it before it displays, the servers and scripts will continue to grind along working on it even though you'll never see the results. Try to avoid doing that... it can cause slowness. :)
Oh, and see the end of the INSTALL file for instructions on using the fancy administration section.
Fine Tuning NISCA
This will be one of the hardest tasks for a NISCA user,
but all the people involved in developing and contributing
to this project are working hard in order to provide as
much information as possible.
The NISCA user has to take many parameters into account
when setting up the COLLECTION INTERVAL and the number
of hosts/interfaces monitored with NISCA. The interval works
rather differently from MRTG's, which is driven by crontab and
thus can generate overlapping statistics if collection takes
longer than the crontab interval (usually 300 seconds).
The way the collectors in NISCA work is, they will poll all
your monitored hosts and THEN go to sleep for the delay
interval you have configured; thus, if collection takes
six minutes, and your delay is 5 minutes, the effective
delay time will be eleven minutes. Running the command
"snmp_collect t" will help you determine how long each
collection cycle takes, and you can adjust your interval
time accordingly. (The "t" puts it in debug mode.)
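Put another way, the collector's loop looks roughly like this (a
sketch, not the actual snmp_collect code; poll_all_hosts() is a
made-up stand-in for the real SNMP walk):
    while (1) {
        poll_all_hosts();  // query every configured host/interface via SNMP
        sleep($delay);     // the configured delay starts AFTER polling finishes
    }
    // effective interval = time spent polling + $delay,
    // e.g. 6 minutes of polling + a 5-minute delay = fresh data every 11 minutes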
Another thing about the collection interval. The smaller it
is, the "fuzzier" your graphs will be. Anything less than
15 seconds or so will be just about useless. A 5-minute
delay will probably look best on fast interfaces (and take
up much less database space. :)
You need to weigh many parameters in order not to
overload the whole system (NISCA, the network, the
monitored hosts, etc.). Estimating all these parameters
is very complex, especially because various systems
react in different ways to SNMP requests and network
conditions can change from moment to moment.
Do not overestimate your setup's abilities!
One thing about NISCA is that it uses memory, a lot of it,
while it's generating reports for you (and only then).
And the more datapoints being analyzed, the more memory it
takes. This means you can quickly get several httpd processes
taking up lots and lots of memory. To help fix this, I've
changed the "MaxRequestsPerChild" setting in Apache's
httpd.conf file from its default of "0" (unlimited) to "1".
This will force every child server process to die as soon as
it's done with its request, and thus it won't consume all your
memory. Setting this to "2" or higher doesn't seem to do much
good; the children don't die, and new children are spawned
which will take up just as much space, so if you run four
60-meg reports one after the other you could bring your
machine to a complete halt if it's set higher than "1".
Your mileage, as always, may vary; tune it for your setup. This
seems to be much better behaved with later Apaches (1.3.28 is
what I use now and it plays nice).
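For reference, the one line I'm talking about in httpd.conf is
simply:
    MaxRequestsPerChild 1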
I've also seen PHP die with an error similar to "Maximum
allowed memory usage exceeded" when viewing large reports.
If this happens to you a lot, you can edit your "php.ini"
file and change the max memory allowed (it defaults to 8
meg, 8388608). This setting is called "memory_limit". Don't
forget to HUP or restart your web server if you change this.
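For example, doubling the default would look like this in
php.ini:
    memory_limit = 16M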
In the future, more technical details will be provided, but
for now you should start with a minimal setup: a Delay value
of 300 seconds at first, then slowly increase the number of
interfaces and decrease the Delay, taking care not to overload
either the NISCA server or the network(s) over which the server
is polling the monitored hosts. Trial and error is the best way
to see what you can get away with.
Benchmark
The report generator currently generates a graph from 100,000
datapoints (1 year's worth) in about 30 seconds running on an
850MHz AMD Athlon with 768M of RAM. A multi-interface report
which adds the transfer averages of 2 interfaces over a one-month
period (some 15,000 entries) takes 55 seconds (it's a much more
intense operation). Your Mileage May Vary. It's the graph generation
and the database bottleneck that take so long. Yes, I'm working on
ways of speeding it up... it ain't easy.
Apparently the report generation time isn't entirely cumulative;
getting reports one interface at a time takes more time than one
report on many interfaces.
As for disk space, the 1,780,000 entries in my database take up 360
meg of disk space in the form of MySQL tables/indices. Since I moved
the hosts, communities, and interfaces to another table and now use
medium integers to refer to them in the "stats" table, disk space
usage has been cut in half, and the response time of just about
everything (except reports) has become instantaneous, since it no
longer has to dig through hundreds of thousands of rows to find every
unique host/community/interface. Even graph generation speed has
doubled just from this one change. Just the opposite of the effect
I thought it would have; live and learn, I always say.
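The idea, sketched out (the table and column names here are made
up for illustration and aren't necessarily the real schema): the
"stats" rows carry a small integer ID instead of repeating the
host/community/interface strings on every row, and a query joins
back to the little lookup table whenever names are needed.
    // hypothetical names throughout -- just to show the shape of the thing
    $sql = "SELECT i.host, i.ifname, s.stamp, s.bytes_in, s.bytes_out
            FROM stats s, interfaces i
            WHERE s.if_id = i.id AND i.host = 'router1'";
    $result = mysql_query($sql);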
A Detailed Description of the Multiple-Interface Graphing Method
I'm including this just to satisfy people's curiosity. I'm sure
there are other geeks out there who'd love to know how it works.
So here we go... warning: I can't avoid getting a bit technical.
When I set out to write the multi-IF graphing code, I had no idea
how complicated it was, or how simple the final solution would be.
I had to rewrite it all from scratch four times to get it right,
making all kinds of notes and diagrams and drawings and stuff.
Here's what I finally came up with.
First of all, I didn't want Nisca to do it the MRTG way: require
you to be collecting pre-summed statistics from each interface
desired before you can draw a graph of it. It just seemed silly to
me, especially since it requires that you poll each interface
TWICE... once for the regular single stats, and once again for the
summed-interfaces stats. There had to be a way to take any
existing set of statistics for any combination of interfaces on
any number of hosts over any time period, whether all the
interfaces involved had identical time periods or not, and add
them together in the same time periods. My first attempt was
horrible; I won't bore you with the grisly details of how a one-month
report took half an hour and 500 meg of memory, and then
delivered a graph that looked like something a drunk centipede had
walked all over after wading through a few pools of paint. Let's
just say, I wasn't satisfied.
So after the second rewrite attempt, I'm sitting there staring into
space trying to think of an answer, and I realized I was staring at
a CD storage rack. And my mind whispered to me, "Pigeonholing!"
Just make the entries fall into the right slots, and make the slots
as wide as the collection interval. But even that delivered pretty
shoddy results. And then I realized something else, something that
probably would have occurred instantly to anyone who does statistical
analysis for a living.
Statistics are always measured in pairs. There's a starting point
for both the counter itself (which is a running total) and the
timestamp at which it was collected, and a corresponding ending point.
You find the amount of traffic transferred by subtracting the
earlier counter from the later one. If the machine has rebooted
in between them, this should come out negative; in that case
there's no way to know exactly how much data was transferred,
because the counter got zeroed out somewhere between the two
readings, so the entire value of the later counter is used as
the amount transferred between the two entries (it was at least
that many bytes, but probably a lot more). Once you have the
number of bytes transferred between the two points, you figure
out the interval between them; divide the bytes by the seconds,
and you have the average. But as it turned out, that per-entry
average by itself is useless.
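For the record, that per-entry math looks like this in code (my
own paraphrase, not the actual NISCA source; the names are mine):
    $bytes = $later_counter - $earlier_counter;
    if ($bytes < 0) {
        // the counter got zeroed (reboot) somewhere in between, so
        // all we really know is "at least this much"
        $bytes = $later_counter;
    }
    $seconds = $later_timestamp - $earlier_timestamp;
    $average = $bytes / $seconds;   // average bytes per second for this entry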
I had divided the report period up into "pigeonholes," or slots,
that were as wide as the requested averaging interval (300 seconds,
or 5 minutes, by default). Sometimes an entry's endpoints would both
lie within one slot; sometimes its starting point was in one slot and
its ending point was in the very next slot; and sometimes there
were one or more empty slots in between the two. So I
re-re-rewrote it, again, using this approach like a baker uses a
flour sifter, and it worked. Imagine my shock.
It keeps track of the time stamps of the current stat and the
previous one. When start and end are in the same slot, it just adds
that whole counter change to the slot and keeps going. When the end
goes past a slot boundary, it starts doing math. It finds the time
between the start time and the slot boundary and divides it by
the time between the start and end times. This gives it the percentage
of the total time which lies in the earlier slot. It multiplies
that by the whole counter change between them, which tells it how
much of the data belongs to the earlier slot, and it adds it to
the earlier slot's counter array (remember, it's still not an
average at this point). Then there are two possibilities.
It adds the averaging interval to the slot boundary. That will
either put the boundary past the end time of the stat, or it
won't... meaning there are intervening slots without a stat
entry. If so, it repeats the earlier percentage operation, except
this time it divides the averaging interval by the total time of
the entry, multiplies the counter change by that percentage, and
adds the result to that slot. It does this until the slot
boundary passes the ending timestamp.
Once the slot boundary is past the ending timestamp, however it
got there, it does the percentage thing again, this time using
the time between the slot's beginning and the stat's end timestamp
to calculate the percentage, which it multiplies by the total
and adds to that slot's counter. And by the way, these percentages
can be zero; that just means an entry's start or end lies
exactly on a slot boundary, so 100% goes on one side and 0%
goes on the other. Pretty neat how that worked out.
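For the terminally curious, the whole slot-filling dance
condenses to something like this (my own paraphrase in code, not
the real thing; it assumes $slots is keyed by each slot's starting
timestamp and pre-filled with zeroes for the report period):
    function add_entry(&$slots, $interval, $start, $end, $bytes)
    {
        $total_time = $end - $start;
        if ($total_time <= 0) return;
        // the first slot boundary after the entry's start time
        $boundary = (floor($start / $interval) + 1) * $interval;
        if ($end <= $boundary) {
            // start and end are in the same slot: the whole count goes there
            $slots[$boundary - $interval] += $bytes;
            return;
        }
        // the earlier slot gets its percentage of the count
        $slots[$boundary - $interval] += $bytes * (($boundary - $start) / $total_time);
        // any whole intervening slots each get a full interval's share
        while (($boundary + $interval) <= $end) {
            $slots[$boundary] += $bytes * ($interval / $total_time);
            $boundary += $interval;
        }
        // and the last slot gets whatever percentage is left
        $slots[$boundary] += $bytes * (($end - $boundary) / $total_time);
    }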
Now, this has one unfortunate side-effect. The very last entry
for an interface won't have an ending point with which to
calculate a transferred count for the last slot. This means
the last slot will almost always fall sharply downwards, since
it will almost always have far less data transferred in it than
all the previous slots. So when viewing these graphs, please
don't panic; it doesn't mean all your interfaces went down
sometime in the past five minutes or anything. :)
Now, that was just for one type of data; incoming bytes, say.
It has to do all that separately for the incoming and outgoing
stats of every entry. And that's why it doesn't support making
multiple-interface graphs of packets, or drops, or errors; not
only is it kinda pointless, it would also mean another hundred
lines of code for each added report type.
Once it's done with every entry, it passes the data to the
makegraph() function, which converts the counts to average
per-second rates, just as it does for regular graphs.
And that's a whole other story in itself. :)
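That last step is the easy part; each slot holds a byte total for
one averaging interval, so (roughly, and with made-up names):
    foreach ($slots as $slot_start => $bytes) {
        $rates[$slot_start] = $bytes / $averaging_interval;  // average bytes/sec to plot
    }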
Who has actually contributed?
(It wouldn't have been possible without you two... :)
New ideas, requests, job offers, and comments are always welcome
from everyone. Contributing to NISCA will only improve its quality
and your karma. What a bargain!
Oddities
We now have the first confirmed Nisca-related tech support call,
made from Milano to Rome, Italy, at or about 4:00PM (Italy time,
+0200) on Monday, June the 25th, 2001. If anyone knows of an earlier
call, let me know. :)
Mumbo-Jumbo
This program is released under the GNU General Public License
(see LICENSE). This means you have my permission to do
anything you like with it except printing it out, rolling it up,
and swatting your pet with it.
I will not condone cruelty to animals.
And if you make any money off of it, please think of my
poor unemployed self and have pity on me as you count
your millions.
You know that really long boring bit about "as-is"
and "merchantability" and "fitness for a particular purpose"
and all that crap? Insert it here.
Note that I am not affiliated with Team Nisca, which makes ID card
printers; the National Interscholastic Swimming Coaches Association;
the Northern Ireland Society for Computing in Anaesthesia; the
Nuptial Illusions Service Center of Antarctica; the Non-Ischemic
Sub-Cortical Aneurisms society; Naughty Isaac's Stuffed Chicken
Arcades, Inc.; or the NISCA protocol, which is used to connect
systems in an OpenVMS cluster. Anyone claiming otherwise will
certainly be ridiculed into an embarrassing extinction, because to
me NISCA means "the Network Interface Statistics Collection Agent"
and nothing more. And with version 3.0, it won't even mean that.
Note that I am affiliated with isthisthingon.org,
a very, very non-profit non-organization of no one in particular.
Any resemblance to actual programs, living or dead, is purely
coincidental. I ask you all to reflect upon how often form
follows function.
Contact Info-Mation
Author's email:   phee@isthisthingon.org or brett@fnord.org
Official Site:    http://nisca.sourceforge.net/
Author's ICQ #:   13130273