Notes on RRDTOOL implementation of Aberrant Behavior Detection

RRD (http://ee-staff.ethz.ch/~oetiker/webtools/rrdtool) is the acronym for Round Robin Database. RRD is a system to store and display time-series data. It stores the data in a very compact way that will not expand over time, and it presents useful graphs by processing the data at different temporal resolutions. The RRDtool software is available as a standalone application or as a PERL library. RRDtool is a nimble program that minimizes the I/O operations required by real-time updates to time series. While not as powerful as full-blown database software, it does provide the functionality required by a real-time environment:

conversion of updates at irregular intervals into regular intervals via linear interpolation
conversion of counter data to rates
aggregation over user-defined time intervals via multiple aggregation functions

This document describes the implementation of aberrant behavior detection for RRDTOOL. Some familiarity with the internals of the current release (1.0.25) of RRDtool is assumed, as this document makes reference to the pre-existing architecture.

There are several reasons why support for aberrant behavior detection was integrated within RRDtool, as opposed to implemented as a separate program. These include:

(1) Facilitates efficient real-time application of aberrant behavior detection. Integration offer greater efficiency relative to an external program with respect to I/O access. An external program would have fetch data from RRD at the same frequency of update, while code within RRD operates on this data already in-memory. Efficiency becomes a top priority for the service network at the IAP/ISP level, where RRDtool can be essential part of the monitoring system of hundreds of network interfaces and application services.

(2) Leverages ability of RRDtool to perform the temporal interpolation (data updates at irregular intervals) and conversion of counters to rates.

(3) Leverages the graphing capabilities of RRDtool. Graphs can be generated to display the results of aberrant behavior detection using the existing capabilities of RRDtool.

(4) Leverages client software designed to run with RRDtool (i.e. Cricket).

Summary of Aberrant Behavior Detection Algorithm

Aberrant behavior detection is decomposed into three pieces, each building on its predecessor:

(1) An algorithm for predicting the values of a time series one time step into the future.

(2) A measure of deviation between the predicted values and the observed values.

(3) A mechanism to decide if and when an observed value or sequence of observed values is ‘too deviant’ from the predicted value(s).

The Holt-Winters Time Series Forecasting Algorithm is an online, or incremental, algorithm that adaptively predicts future observations in a time series. It’s forecast is the sum of three components: a baseline (or intercept), a linear trend over time (or slope), and a seasonal coefficient (a periodic effect, such as a daily cycle). There is one seasonal coefficient for each time point in the period (cycle). After a value is observed, each of these components is updated via exponential smoothing. The algorithm learns from past values and uses them to predict the future. The rate of adaptation is governed by 3 parameters: alpha (intercept), beta (slope), and gamma (seasonal). The prediction can also be viewed as a ‘smoothed’ value for the time series.

The measure of deviation is a seasonal weighted absolute deviation. The term ‘seasonal’ means deviation is measured separately for each time point in the seasonal cycle. As with Holt-Winters Forecasting, deviation is predicted using the measure computed from past values (but only at that point in the seasonal cycle). After the value is observed, the algorithm learns from the observed value via exponential smoothing. Scaling the sequence of predicted deviation values for the observed time series generates confidence bands (we usually think of the sequences as continuous lines rather than as a set of discrete points).

Aberrant behavior (a potential ‘failure’) is reported whenever the number of times the observed value violates the confidence bands meets or exceeds a specified threshold within a specified temporal window (i.e. 5 violations during the past 45 minutes with a value observed every 5 mintues).

Architecture

This section describes the architecture of the implementation of aberrant behavior detection in the RRDtool.

On disk, the round robin database (RRD) is organized into sequential sections, round robin archives (RRA). Within each RRA is a section for each of the data sources (input) stored in this RRD. Each RRA is defined by a consolidation function which maps primary data points (PDP) to consolidated data points (CDP). At another level, an RRA is just an array of data values that is updated in sequence according to some function at some fixed time interval.

On its face, the aberrant behavior algorithm needs at least two arrays, one to store the forecast values corresponding to each primary data point, and a second to store the predicted deviation corresponding to each primary data point.

As implemented, the seasonal coefficients and deviations that are used to calculate the forecast and predicted deviations are stored in a second pair of RRAs. These arrays have length equal to the seasonal period and are updated once for each primary data point.

Failures are tracked by a fifth RRA, which determines violations and failures on each call to RRD update.

The intercept and slope coefficients required for the forecast are updated for every primary data point and are unique for each data source (DS). They could be stored in a second pair of RRAs. However, only the most recent value of each is required for a given forecast. As implemented these parameters are stored in a temporary buffer in the header allocated for each RRA-data source combination in the RRD (the CDP prep area). This buffer is flushed back to disk on every call to RRD update.

The adaptation parameters are fixed for the life of the RRA and are the same for all data sources within that RRA. They are stored in the RRA parameter buffer (in RRA def), which is read only during RRD update.

New RRAs

Therefore, implementation of the aberrant behavior algorithm adds five new “consolidation functions” to RRDtool:

(1) HWPREDICT: an array of forecasts computed by the Holt-Winters algorithm, one for each primary data point. The HWPREDICT array depends on an array of type SEASONAL for lookup of seasonal coefficients. The index of this SEASONAL array is stored as a parameter of the HWPREDICT RRA.

(2) SEASONAL: an array of seasonal coefficients with length equal to the seasonal period. For each primary data point, there is one seasonal coefficient that matches the index in the seasonal cycle. In order to update the seasonal coefficients via exponential smoothing, the SEASONAL RRA depends on the HWPREDICT RRA. The index of the matching HWPREDICT RRA is stored as a parameter.

(3) DEVPREDICT: an array of deviation predictions. Essentially, DEVPREDICT copies values from the DEVSEASONAL array; it does no processing of its own. The index of the DEVSEASONAL array is stored as a parameter.

(4) DEVSEASONAL: an array of seasonal deviations. For each primary data point, there is one seasonal deviation that matches the index of the primary data point in the seasonal cycle. In order to compute deviations, DEVSEASONAL RRA must be able to compute the predicted value, and so depends on the HWPREDICT RRA. The index of the matching HWPREDICT RRA is stored as a parameter.

(5) FAILURES: an array of boolean indicators, a 1 indicating a failure. The RRA-DS buffer stores each value within the window. Each update removes the oldest value from this buffer and inserts the new observation. On each update, the number of violations is recomputed. The maximum window length enforced by this buffer is 9 time points. For comparison with deviations, the FAILURES RRA depends on the DEVSEASONAL RRA. The index of the matching DEVSEASONAL RRA is stored as a parameter.

Usage

This section illustrates how to use the aberrant behavior detection algorithm in RRDtool through an example.

Our monitoring target will be a router interface on a link between two data centers in the WebTV production service network. The variable will be the outgoing bandwidth rate (in Mbps). Bandwidth usage follows a daily cycle and SNMP is polled at fives minute intervals.

Creating a RRD file

The first step is to create a RRD for this target with aberrant behavior detection enabled. In order to simplify the creation for the novice user, in addition to supporting explicit creation the HWPREDICT, SEASONAL, DEVPREDICT, DEVSEASONAL, and FAILURES RRAs, the enhanced RRDtool create command supports implicit creation of the other four when HWPREDICT is specified alone. To take advantage of this, use the following syntax:

RRA:HWPREDICT:<array length>:<alpha>:<beta>:<period>

Where:

<array length> is the number of predictions to store before wrap-around; this number must be longer than the seasonal period. This value will also be the RRA row count for DEVPREDICT RRA.

<alpha> is the intercept adaptation parameter, which must fall between 0 and 1. A larger value means the intercept adapts faster. The same value will be also be used for gamma.

<beta> is the slope adaptation parameter, again between 0 and 1.

<period> is the number of primary data points in the seasonal period. This value will be the RRA row counts for the SEASONAL and DEVSEASONAL RRAs.

Using this option creates the FAILURES RRA with the default window length of 9 and a default threshold value of 7. The default length of the FAILURES RRA is one period.

For the example, the create command is:

rrdtool create monitor.rrd –s 300 \

DS:ifOutOctets:COUNTER:1800:0:4294967295 \

RRA:AVERAGE:0.5:1:2016 \

RRA:HWPREDICT:1440:0.1:0.0035:288

While explicitly creating the additional RRAs is one means of finer control, some parameters can be changed using the tune command. Enhanced RRDtool supports several new tune flags:

--window-length --failure-threshold --alpha --beta –gamma –deltapos --deltaneg

Each of these flags takes a single argument. The purpose of the argument is self-evident. The --gamma flag will reset the adaptation parameter for both the SEASONAL and DEVSEASONAL RRAs (setting both to the same value).

--deltapos and –deltaneg set the scale parameter for the upper and lower confidence band respectively, the default value for both is 2 (a symmetric confidence band).

For example, suppose we are unhappy with the default window length and threshold for the FAILURES RRA implicitly created by the previous command. Issue the command:

rrdtool tune monitor.rrd --window-length 5 \

--failure-threshold 3

The remainder of the example uses the default window length of 9 and the default threshold of 7.

In some cases, explicit creation of the RRAs is required. Suppose the network technician is only interested in smoothing the observed time series, not confidence bands or aberrant behavior detection. The Holt-Winters Forecasting algorithm can be viewed a smoothing algorithm. In this case, he can explicitly create the HWPREDICT and SEASONAL RRAs without the others.

Use this syntax:

RRA:HWPREDICT:<length>:<alpha>:<beta>:<period>:<index of SEASONAL>

RRA:SEASONAL:<period>:<gamma>:<index of HWPREDICT>

The arguments of HWPREDICT are the same as before, with the addition of:

<index of SEASONAL> 1-based index of the SEASONAL array in the order RRAs are specified in the create command.

The arguments of the SEASONAL RRA are:

<period> is the number of primary data points in the seasonal period. It must match the value specified by the <period> argument of HWPREDICT. It must be an integer greater than 2.

<gamma> is the adaptation parameter for seasonal coefficients, which must value between 0 and 1.

<index of HWPREDICT> 1-based index of the HWPREDICT array in the order RRAs are specified in the create command

Confidence bands can be created independently of aberrant behavior detection. In this case, create the four RRAs HWPREDICT, SEASONAL, DEVSEASONAL, and DEVPREDICT but omit the FAILURES RRA.

Use this syntax:

RRA:HWPREDICT:<length>:<alpha>:<beta>:<period>:<index of SEASONAL>

RRA:SEASONAL:<period>:<gamma>:<index of HWPREDICT>

RRA:DEVSEASONAL:<period>:<gamma>:<index of HWPREDICT>

RRA:DEVPREDICT:<array length>:<index of DEVSEASONAL>

The arguments of HWPREDICT and SEASONAL are the same as before. The arguments of DEVSEASONAL and DEVPREDICT are:

<period> is the number of primary data points in the seasonal period. It must match the value specified by the <period> argument of the HWPREDICT and SEASONAL arrays (this restriction may be lifted in a future implementation). It must be an integer greater than 2.

<array length> is the number of deviations to store before wrap-around; this number must be longer than the seasonal period.

<gamma> is the adaptation parameter for seasonal deviations, which must value between 0 and 1. It need not match the adaptation parameter for the SEASONAL array.

<index of HWPREDICT> 1-based index of the HWPREDICT array in the order RRAs are specified in the create command

<index of DEVSEASONAL> 1-based index of the DEVSEASONAL array in the order RRAs are specified in the create command

Finally, the FAILURES RRA can be create explicitly with the syntax, but at a minimum the HWPREDICT, SEASONAL, and DEVSEASONAL arrays must be created as well. If confidence bands are also desired, create DEVPREDICT.

Use this syntax:

RRA:FAILURES:<length>:<threshold>:<window length>:<index of DEVSEASONAL>

Where:

<length> is the number of indicators (0,1 values) to store before wrap-around. A 1 indicates a failure: that is, the number of violations in the last window of observations meets or exceeds the threshold.

<threshold> is the minimum number of violations within a window (observed values outside the confidence bounds) that constitutes a failure.

<window length> is the number of time points in the window. Specify an integer greater than or equal to the threshold and less than or equal to 28 (the maximum value).

<index of DEVSEASONAL> 1-based index of the DEVSEASONAL array in the order RRAs are specified in the create command.

Detecting Aberrant Behavior

The aberrant behavior detection algorithm requires nothing unusual for the RRDtool update command; so the collection mechanism (i.e. Cricket invoking SNMP) will run normally. Now suppose some time has passed and the network technician is monitoring outgoing bandwidth at the router interface. He can view a graph of daily activity, including confidence bands and any failures, with the following command:

rrdtool graph example.gif \

DEF:obs=monitor.rrd:ifOutOctets:AVERAGE \

DEF:pred=monitor.rrd:ifOutOctets:HWPREDICT \

DEF:dev=monitor.rrd:ifOutOctets:DEVPREDICT \

DEF:fail=monitor.rrd:ifOutOctets:FAILURES \

TICK:fail#ffffa0:1.0:”Failures Average bits out” \

CDEF:scaledobs=obs,8,* \

CDEF:upper=pred,dev,2,*,+ \

CDEF:lower=pred,dev,2,*,- \

CDEF:scaledupper=upper,8,* \

CDEF:scaledlower=lower,8,* \

LINE2:scaledobs#0000ff:”Average bits out” \

LINE1:scaledupper#ff0000:”Upper Bound Average bits out” \

LINE1:scaledlower#ff0000:”Lower Bound Average bits out”

TICK is a new graphing option in RRD tool. For every non-zero value in the DEF or CDEF argument, it plots a tick mark. The length of the mark (line) is specified by the third argument (after the color code) as a decimal percentage of the y-axis. 1.0 is 100% of the length of the y-axis, so the tick mark becomes a vertical line on the graph.

Here is an example of this daily graph generated on Wed, May 31, 2000 for the router target described previously:

The thin red lines are the confidence bands and the yellow bars represent failures (actually multiple failures in sequence—once the observed value strays outside the confidence bands it remains outside the bands for roughly a two hour period in both cases). The TICK graph option generated the yellow bars from the FAILURES RRA.

The graph suggests that bandwidth on this outgoing link is increasing faster than expected by the model (past history). It is up to the network technician to decide if this represents aberrant behavior of interest. One approach the technician might take is to view the time series for this router interface over a longer time period.

With hindsight, it is easy to demonstrate something unusual is going on and the aberrant behavior detection model is catching it in real time. Here is the time series for the week and half period from May 24, 2000 to June 2, 2000:

It is clear that Wed, May 31, is unusual. Bandwidth increases in two steps: once in the early morning and again in the early afternoon. In this case, the dip to 0 in the early morning hour and the subsequent jump can be attributed to a scheduled downtime for the service network. Perhaps the remainder of bandwidth activity on Wed has the same cause, in which case aberrant behavior detected is a false positive in the eyes of the network technician.

Other Notes

Given the example, it is clear that there are many benefits of aberrant behavior detection, but it does complicate the software implementation and impact runtime performance. These details are discussed in this section.

Processing updates

For the five aberrant detection RRAs, every time a primary data point (PDP) is calculated by rrd_update, each of the RRAs is updated. This translates into five additional disk seek/write operations. Furthermore, the SEASONAL and DEVSEASONAL RRAs each require a seek/read operation. While the number of additional disk operations is independent of the number of data sources and these operations are performed sequentially whenever possible (i.e. seek avoids moving backwards within the file), these operations do add significant I/O cost relative to a RRD without the aberrant detection RRAs. In a conventional RRD file, there is usually only one RRA that is updated for every PDP, requiring only a single seek/write operation.

After the update, the cdp_prep areas of the HWPREDICT and SEASONAL RRAs contain sufficient information to make a forecast of future values. The next predicted value can be computed using the following formula:

Prediction = intercept + slope • (sequence length of unknown values + 1)

+ seasonal coefficient

The intercept is stored in the CDP_hw_intercept member of the cdp_prep area of the HWPREDICT RRA and the data source of interest. The slope is stored in the CDP_hw_slope member of the cdp_prep area of the HWPREDICT RRA. The expression, (sequence length of unknown values + 1), is stored as the CDP_null_count member of the cdp_prep area of the HWPREDICT RRA. Finally, the seasonal coefficient is stored in the CDP_hw_seasonal member of the cdp_prep area of the SEASONAL RRA and the data source of interest.

Aberrant behavior detection does not need to compute the prediction until the value to be predicted is already observed, but it is conceivable that another application may want the prediction before the value is observed. This could be facilitated in the future by adding a new call, rrd_predict, to read the appropriate buffers of HWPREDICT and SEASONAL RRAs and compute the prediction.

Isolated or sequences of unknown (U) values are generally not a problem for the Holt-Winters algorithm, either in making forecasts or in updating values (basically, for a unknown value, no update occurs). The algorithm can actually forecast multiple steps into the future, provided the initialization phase is complete.

Bulk Updates

A bulk update can occur when a time interval between observed values exceeds the step size. In this case, the observed value generates more than one primary data point. If the interval exceeds the heartbeat of a data source, all the intervening primary data points will be set to unknown (U), otherwise they will be set to the same (equal) value.

Bulk updates create a complication for the aberrant detection RRAs, which are updated for every primary data point. The primary issue is the cost of computing the prediction and predicted deviation. Unlike the case of the conventional consolidation function (i.e. average or maximum), the interim consolidated data points (CDPs) for the aberrant detection RRAs are not identical and must be computed separately.

This computation cost aside, there are implementation issues related to the interdependencies between the aberrant detection RRAs. Essentially, to compute the predictions and predicted deviations in sequence, each RRA requires its dependent RRAs to be synchronized (at the same time point). However, the architecture of the rrd_update function is to process each RRA independently of the others. This architecture is imposed by the optimization that writes to each RRA occur in sequence.

Given in these issues, updates in excess of 2 CDPs per observed value (that is, an observed value which translates into more than 2 PDPs) are not currently supported by the aberrant detection implementation. The intermediary CDPs are set to U. In the future, the implementation may support bulk updates, especially as algorithm can handle sequences of U values as input.

Cold-start Issues

The implementation is designed to use bootstrap initialization. The intercept coefficient is initialized to the first known value. The slope is initialized to 0; this initialized is predicated on the assumption the linear trend over time is close to 0. If this is not the case, the time required for the Holt-Winters algorithm to gravitate away from 0 will depend on the seasonal adaptation parameter, gamma. During the first seasonal cycle of known values, seasonal coefficients are initialized. During the second seasonal cycle of known values, seasonal deviations are initialized.

The sensitivity of the algorithm to the initial values (the first two cycles), declines over time as determined by the adaptation parameters. The weight of the initial value persists longer than subsequent values. For example, given a daily cycle, no periodic smoothing, and gamma = 0.1, the weight of the initial value in the tenth iteration (day) seasonal component is »0.4. In contrast, the weight of the second day observation at the same point in the cycle is only »0.04.

Unknown values during the first two seasonal cycles can complicate the cold-start. The implementation initializes any coefficients it can during the first two cycles. If some coefficients remain unknown due to unknown values, they are initialized during the first subsequent cycle with known value at that time point in the cycle. As the first value of a COUNTER data source is always unknown, the minimum cold-start period for COUNTER data sources is three seasonal cycles.

Periodic smoothing

Reflecting the belief that seasonal coefficients and deviations are smooth periodic functions of time, the implementation periodically runs a moving-average smoother with a window of 5% of the period on the DEVSEASONAL and SEASONAL RRAs. This operation requires reading all DEVSEASONAL and SEASONAL RRA values from disk, updating them, and flushing them back to disk.

The smoother is run once per period, at an offset determined by a FNV hashing algorithm on the file name of the RRD file. The intent of the hashing algorithm is to prevent temporal clumping of smoothing operations for multiple RRD files (as RRDTOOL usually runs in a environment with multiple RRD files).

Dump and Restore Issues

The rra_def parameters and cdp_prep areas provided by the RRD structure definition are sufficient to store all necessary parameters on disk in and in memory for the aberrant behavior detection algorithm, previously the dump and restore (to/from XML) functions did not fully support the import/export of these data structures. The implementation now flushes all RRA parameters to XML using the <param> tag on dump and the restore function reads the new tags. The contents of the cdp_prep areas are written as before, marked by the <ds> tag. To identify the new XML format, the implementation uses a version tag of 0002.

Modification by File

The following list describes files modified from the current release (1.0.25) of RRDtool to implement aberrant behavior detection:

rrd_create.c : support for creation of HWPREDICT, SEASONAL, DEVSEASONAL, and DEVPREDICT RRAs.

rrd_dump.c : support for exporting rra_def par and cdp_prep scratch variables.

rrd_restore.c : support for importing rra_def par and cdp_prep scratch variables.

rrd_hw.c : main routines for updating Holt-Winters RRAs; smoothing routines; FIFO queue implemetation.

rrd_update.c : call routines in rrd_hw.c as appropriate; rewrite for efficient bulk updates

rrd_graph.c : add data reduction support for HWPREDICT, SEASONAL, DEVSEASONAL, and DEVPREDICT RRAs; adding the TICK graph element.

rrd_format.h : add enumeration values for reference in RRA and CDP scratch arrays.

rrd_tool.h : function prototypes, FIFO queue method prototypes.

hash_32.c : implementation of FNV hash.

fnv32.h : header file for FNV routines.