Rule Engine

The rule engine automates repetitive tasks based on real-time statistical analysis of incoming data.

Such tasks include triggering an outgoing webhook, executing a Python script, sending an email/Slack alert, or generating derived statistics.

The engine evaluates incoming series, message, and property commands and executes response actions when appropriate:

IF condition THEN action-1 ... action-N

Example

IF percentile(75) > 300 THEN alert_slack_channel

A rule condition can operate on a single metric or correlate multiple metrics using value, database, and rule functions.

max() > 1.5 && value('temperature') > 50

Processing Pipeline

The incoming data is consumed by the rule engine independently of the persistence path.

The data is maintained in windows which are in-memory structures initialized for each unique combination of metric, entity, and grouping tags extracted from incoming commands.

The processing pipeline consists of the following stages:

  • Filtering
  • Grouping
  • Evaluating Condition
  • Executing Actions

Filtering

The incoming data samples are processed by a chain of filters prior to the grouping stage. Such filters include:

  • Input Filter. All samples are discarded if the Settings > Input Settings > Rule Engine option is disabled.

  • Status Filter. Samples are discarded for metrics and entities that are disabled.

  • Rule Filter accepts data that satisfies the metric, entity, and tag filters specified in the rule.

Filter tab

Grouping

Once the sample passes through the filter chain, the sample is allocated to matching windows grouped by metric, entity, and optional tags. Each window maintains its own array of data samples in working memory.

The commands can be associated with windows in a 1-to-1 fashion by enabling the All Tags setting or by enumerating all tags as the grouping tags.

If the Group by Entity option is cleared, the entity field is ignored for grouping purposes and the window is grouped only by metric and tags.

The rule engine supports two types of windows:

  • Count-based windows accumulate up to the specified number of samples. The samples are sorted by command timestamp, with the most recent command placed at the end of the array. When the window reaches the limit, the first sample with the earliest timestamp is removed from the window to free up space for an incoming sample.

  • Time-based windows store samples that are timestamped within the specified interval of time. The start time of time-based windows is continuously updated. Old records are automatically removed from the window once they are outside of the time range.

Condition Checking

Windows are continuously updated as new samples are added and old samples are removed.

When a window is updated, the rule engine checks the condition and triggers various response actions based on the condition result.

Condition example:

avg() > 80

Window Status

Windows are stateful. When the condition for a given window changes to true, the window is initialized in memory with the status OPEN.

On subsequent true evaluations, the status transitions to REPEAT.

When the condition becomes false, the window status is reverted to CANCEL.

The current window status is displayed on the Alerts > Rule Windows page.

Windows are updated when the commands enter or exit the windows. Scheduled rules that are checked at a regular interval, regardless of incoming data, can be constructed using the built-in timer metrics.

Actions

Actions are triggered on window status changes, for example upon window OPEN status or every N-th REPEAT status occurrence.

Supported response actions:

Triggers for the above actions can be configured independently, for example to send email every 6 hours yet to log events for all repeat occurrences.

Correlation

Each rule evaluates data received for only one specified metric. To create conditions that check values for multiple metrics, use value, database, and rule functions.

  • Value functions:
percentile(95) > 80 && values('metric2') != 0
  • Database functions:
percentile(95) > 80 && db_statistic('max', '1 hour', 'metric2') < 10*1024
  • Rule functions:
percentile(95) > 80 && rule_open('inside_temperature_check')

Developing Rules

Rules can be considered software programs in their own right and as such involve initial development, testing, documentation and maintenance efforts.

To minimize the number of rules with manual thresholds, the rule engine in ATSD provides the following capabilities:

  • Condition overrides.
  • Comparison of windows with different lengths.
  • Automated thresholds.

Manual Thresholds

Thresholds can be set manually which requires some trial and error to determine a level that strikes a balance between false positives and missed alerts.

value > 90

Since a single baseline cannot handle all edge cases, the Overrides can be used to enumerate exceptions.

To reduce unnecessary alerts, apply averaging functions and increase window durations.

avg() > 90

To reduce distortions caused by a small number of outliers, use percentiles instead of averages.

percentile(75) > 90

Alternatively, use the minimum or a low percentile function with the reversed comparator to check that all samples in the window exceed the threshold. This is equivalent to checking that the last N consecutive samples are above the threshold.

// all samples are above 90
min() > 90
// only 10% of the smallest samples are below 90
percentile(10) >= 90

Deviation Thresholds

Short-term anomalies can be spotted by comparing statistical functions for different overlapping intervals.

The condition below activates an alert if the 5-minute average exceeds the 1-hour average by more than 20 and by more than 10%.

avg('5 minute') - avg() > 20 && avg('5 minute') / avg() > 1.1

Forecast Thresholds

The forecast function returns an estimated value for the current series based on the Holt-Winters or ARIMA forecasting algorithms. The condition fires if the window average deviates from the expected value by more than 25% in any direction.

abs(avg() - forecast()) > 25

Similarly, the forecast_deviation function can be utilized to compare actual and expected values as a ratio of standard deviation.

abs(forecast_deviation(avg())) > 2

Correlation Thresholds

In cases where the analyzed metric is related to another metric, use the database functions to identify abnormal behavior in both metrics.

The primary metric is expected to be below 50 as long as the second metric remains below 100. Otherwise, an alert is raised.

avg() > 50 && db_statistic('avg', '1 hour', 'page_views_per_minute') < 100

The same condition can be generalized with a ratio as well.

avg() / db_statistic('avg', '1 hour', 'page_views_per_minute') > 2

As an alternative, use the value() function to access the last value for metrics submitted within the same series command or parsed from the same row in CSV files.

value > 75 && value('page_views_per_minute') < 1000

Overrides

The default baseline can be adjusted for particular series using the Overrides table.

Alert Severity

Severity is a measure of criticality assigned to alerts generated by the rule. The severity level ranges between NORMAL and FATAL and is specified on the Logging tab in the rule editor.

If an alert is raised by a condition defined in the Overrides table, its severity supersedes the default severity.

In rules operating on message commands, the alert severity can be inherited from the severity field of the underlying message. To enable this behavior, set Severity on the Logging tab to unknown.

Complex Rules

In cases that involve processing of large volumes of historical data, use Scheduled SQL queries to analyze the data.

To trigger an email notification from an SQL query, use WHERE and HAVING filters to develop a query that returns no rows if the situation is normal.

SELECT entity, tags, percentile(90, value) FROM page_views
  WHERE datetime >= current_day
  GROUP BY entity, tags, period(1 DAY)
HAVING percentile(90, value) > 1000
-- HAVING condition acts as a rule filter
  • Set Send Empty Report parameter to No.
  • Specify triggers such as an email notification or a file export.

As a result, the query triggers actions only when it returns at least one row.

Monitoring

Open Alerts

Open alerts are displayed on the Alerts > Open Alerts page and can be retrieved with /alert/query Data API query and incorporated into portals using the Alert Console widget.

Rule Windows

Rule windows are initialized in memory and are displayed on the Alerts > Rule Windows page. If no windows are present for the given rule, check that the rule is enabled and that data is not discarded by one of the filters.

Rule Errors

Rule Errors can occur in case of invalid or malformed expressions. The Alerts > Rule Errors page contains the list of most recent errors as well as the relevant context and the command details. The errors are also logged as messages by entity atsd with type rule-error and source rule-engine.

Action Logging

Webhook, Email and Script actions log their status as ATSD messages. To view action logs, select the option in the left menu.

  • Email Notification Log

  • Webhook Notification Log

  • Script Action Log