The rule engine enables automation of repetitive tasks based on real-time statistical analysis of incoming data.
The engine evaluates incoming series, message, and property commands and executes response actions when appropriate:
IF condition = true THEN action-1, ... action-N
IF percentile(75) > 300 THEN alert_slack_channel
The incoming data is consumed by the rule engine independently of the persistence path.
The data is maintained in windows, which are in-memory structures initialized for each unique combination of metric, entity, and grouping tags extracted from incoming commands.
The rule engine processing pipeline consists of the following stages:
The incoming data samples are processed by a chain of filters prior to the grouping stage. Such filters include:
Input Filter. All samples are discarded if the Settings > Input Settings > Rule Engine option is disabled.
Status Filter. Samples are discarded for metrics and entities that are disabled.
Rule Filter. Samples are discarded unless they satisfy the metric, entity, and tag filters specified in the rule.
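The filter chain above can be pictured as a sequence of early exits. The following is a minimal Python sketch, not the engine's implementation: the `Sample` and `Rule` classes, the `passes_filters` function, and all of their fields are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Sample:
    metric: str
    entity: str
    value: float
    tags: dict = field(default_factory=dict)


@dataclass
class Rule:
    metric: str
    entities: Optional[set] = None          # None means "any entity"
    required_tags: dict = field(default_factory=dict)


def passes_filters(sample, rule, engine_enabled, disabled_metrics, disabled_entities):
    """Return True if the sample survives all three filters (hypothetical model)."""
    # Input Filter: everything is dropped when the rule engine input is disabled.
    if not engine_enabled:
        return False
    # Status Filter: drop samples for disabled metrics or entities.
    if sample.metric in disabled_metrics or sample.entity in disabled_entities:
        return False
    # Rule Filter: metric, entity, and tag filters defined in the rule.
    if sample.metric != rule.metric:
        return False
    if rule.entities is not None and sample.entity not in rule.entities:
        return False
    return all(sample.tags.get(k) == v for k, v in rule.required_tags.items())
```

Only samples for which `passes_filters` returns `True` would proceed to the grouping stage.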
Once the sample passes through the filter chain, the sample is allocated to matching windows grouped by metric, entity, and optional tags. Each window maintains its own array of data samples in working memory.
The commands can be associated with windows in a 1-to-1 fashion by enumerating all series tags as the grouping tags.
If the 'Group by Entity' option is unchecked, the entity field is ignored for grouping purposes and the window is grouped only by metric and tags.
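One way to picture the grouping stage is a dictionary of windows keyed by metric, entity, and grouping tag values. This is a sketch under assumptions: `window_key`, `allocate`, and the `windows` dictionary are hypothetical names, and the actual in-memory layout is not described here.

```python
from collections import defaultdict


def window_key(metric, entity, tags, grouping_tags, group_by_entity=True):
    """Build a hashable key identifying the window a sample belongs to."""
    # Only the tags listed as grouping tags take part in the key.
    tag_part = tuple((name, tags.get(name)) for name in sorted(grouping_tags))
    # With 'Group by Entity' unchecked, the entity is ignored.
    entity_part = entity if group_by_entity else None
    return (metric, entity_part, tag_part)


# Each window keeps its own list of samples in memory.
windows = defaultdict(list)


def allocate(metric, entity, tags, value, grouping_tags, group_by_entity=True):
    """Append the value to the matching window and return the window key."""
    key = window_key(metric, entity, tags, grouping_tags, group_by_entity)
    windows[key].append(value)
    return key
```

With `group_by_entity=False`, samples from different entities that share the same metric and grouping tag values land in the same window.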
The rule engine supports two types of windows:
Count-based windows accumulate up to the specified number of samples. The samples are sorted in order of arrival, with the most recently received sample being placed at the end of the array. When the window becomes full based on user specifications, the first sample (oldest arrival time) is removed from the window to free up space at the end of the array for an incoming sample to be added there.
Time-based windows store samples that were recorded within the specified interval of time, ending with the current time. The time-based window does not limit how many samples may be held by the window and its time range is continuously updated. Old records are automatically removed from the window once they are outside of the time range.
Windows are continuously updated as new samples are added and old samples are removed to maintain the size of the given window at a constant interval length or sample count.
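The two window types differ only in their eviction policy. The sketch below models both with `collections.deque`; the class names are hypothetical, samples are assumed to arrive in timestamp order, and timestamps are plain numbers of seconds.

```python
from collections import deque


class CountWindow:
    """Keeps at most `size` samples; the oldest arrival is dropped first."""

    def __init__(self, size):
        # A deque with maxlen discards items from the left when it is full.
        self.samples = deque(maxlen=size)

    def add(self, timestamp, value):
        self.samples.append((timestamp, value))


class TimeWindow:
    """Keeps samples recorded within the last `interval_seconds`."""

    def __init__(self, interval_seconds):
        self.interval = interval_seconds
        self.samples = deque()  # no limit on the number of samples

    def add(self, timestamp, value, now=None):
        self.samples.append((timestamp, value))
        self.evict(timestamp if now is None else now)

    def evict(self, now):
        # Remove samples that fell outside the [now - interval, now] range.
        while self.samples and self.samples[0][0] < now - self.interval:
            self.samples.popleft()
```

In this model a time-based window also needs `evict` to be called as time passes, since samples can leave the window without a new sample arriving.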
When a window is updated, the rule engine checks the condition and triggers various response actions based on the condition result.
avg() > 80
Windows are stateful. When the condition for a given window becomes true, the window is initialized in memory with the status OPEN. On subsequent true evaluations, the window status changes to REPEAT. When the condition becomes false, the window status is reverted.
Window status can be accessed on the Alerts > Rule Windows page.
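The status lifecycle can be modeled as a small state machine. In this sketch the `OPEN` and `REPEAT` names come from the text, while the name `CANCEL` for the reverted status, the use of `None` for a window without a status, and the `next_status` function itself are assumptions made for illustration.

```python
def next_status(previous, condition):
    """Return the window status after one condition evaluation.

    previous: None, "OPEN", "REPEAT", or "CANCEL" (the last name is assumed).
    condition: the boolean result of the rule condition.
    """
    if condition:
        # First true evaluation opens the window; later ones repeat.
        return "OPEN" if previous in (None, "CANCEL") else "REPEAT"
    # A false evaluation reverts an active window; otherwise nothing changes.
    return "CANCEL" if previous in ("OPEN", "REPEAT") else None


# Replaying a sequence of condition results:
status = None
history = []
for result in [False, True, True, True, False, False, True]:
    status = next_status(status, result)
    history.append(status)
```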
Windows are updated when the command enters or exits the window. Scheduled rules can be emulated using the built-in
Actions are triggered on window status changes, for example upon window OPEN status or every N-th REPEAT status occurrence.
Supported response actions:
Triggers for all actions may be configured separately. For example, you can configure a rule such that logging events are generated on all occurrences whereas email messages are sent every 6 hours.
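Per-action triggers can be thought of as a predicate evaluated on each status change. The sketch below covers only the count-based case (fire on `OPEN` and on every N-th `REPEAT`); the `should_fire` function and its parameters are hypothetical, and interval-based triggers such as "every 6 hours" are not modeled.

```python
def should_fire(status, repeat_count, on_open=True, every_nth_repeat=None):
    """Decide whether an action fires for the given window status.

    repeat_count: number of REPEAT occurrences seen so far, starting at 1.
    every_nth_repeat: fire on every N-th REPEAT, or never if None.
    """
    if status == "OPEN":
        return on_open
    if status == "REPEAT" and every_nth_repeat:
        return repeat_count % every_nth_repeat == 0
    return False


# Logging on every occurrence, a second action only on every 3rd repeat.
log_on_repeat_2 = should_fire("REPEAT", 2, every_nth_repeat=1)
notify_on_repeat_2 = should_fire("REPEAT", 2, every_nth_repeat=3)
notify_on_repeat_3 = should_fire("REPEAT", 3, every_nth_repeat=3)
```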
- Value functions:
percentile(95) > 80 && values('metric2') != 0
- Database functions:
percentile(95) > 80 && db_statistic('max', '1 hour', 'metric2') < 10*1024
- Rule functions:
percentile(95) > 80 && rule_open('inside_temperature_check')
Rules can be considered software programs in their own right and as such involve initial development, testing, documentation and maintenance efforts.
To minimize the number of rules with manual thresholds, the rule engine in ATSD provides the following capabilities:
- Condition overrides.
- Comparison of windows with different lengths.
- Automated thresholds.
Thresholds can be set manually, which requires some trial and error to determine a level that strikes a balance between false positives and missed alerts.
value > 90
Since a single baseline cannot handle all edge cases, Overrides can be used to enumerate exceptions.
To reduce false positives, apply an averaging function to longer windows.
avg() > 90
To reduce distortions caused by a small number of outliers, use percentiles instead of averages.
percentile(75) > 90
Alternatively, use the minimum or a below-median percentile function to check that all, or nearly all, samples in the window exceed the threshold. With the minimum function, this is equivalent to checking that the last N consecutive samples are above the threshold.
// all samples are above 90
min() > 90
// at most 10% of the samples (the smallest ones) are below 90
percentile(10) >= 90
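The effect of choosing an average, a percentile, or a minimum can be checked on small arrays. This sketch uses a nearest-rank percentile, which is an assumption: the engine may compute percentiles with a different interpolation method, and the sample values are made up.

```python
import math
import statistics


def percentile(values, p):
    """Nearest-rank percentile: the smallest value with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


# One high outlier inflates the average but not the 75th percentile.
spiky = [50, 52, 51, 49, 50, 51, 48, 500]
avg_alert = statistics.mean(spiky) > 90      # fires because of the outlier
p75_alert = percentile(spiky, 75) > 90       # does not fire

# One low dip fails the strict min() check but passes percentile(10).
mostly_high = [20, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100]
min_alert = min(mostly_high) > 90
p10_alert = percentile(mostly_high, 10) >= 90
```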
Short-term anomalies can be spotted by comparing statistical functions for different overlapping intervals.
The condition below activates an alert if the 5-minute average exceeds the 1-hour average by more than 20 and by more than 10%.
avg('5 minute') - avg() > 20 && avg('5 minute') / avg() > 1.1
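The same comparison can be reproduced outside the engine to see why both the absolute and the relative check are useful. This is an illustrative sketch only: `short_term_spike` and its parameters are hypothetical, samples are `(timestamp_seconds, value)` pairs, and the 1-hour window is assumed to be the rule's own window.

```python
from statistics import mean


def short_term_spike(samples, now, short=300, long=3600, abs_delta=20, ratio=1.1):
    """Mirror of: avg('5 minute') - avg() > 20 && avg('5 minute') / avg() > 1.1"""
    long_vals = [v for t, v in samples if now - long < t <= now]
    short_vals = [v for t, v in samples if now - short < t <= now]
    if not long_vals or not short_vals:
        return False
    long_avg = mean(long_vals)
    short_avg = mean(short_vals)
    if long_avg == 0:
        return False  # the ratio is undefined for a zero baseline
    return short_avg - long_avg > abs_delta and short_avg / long_avg > ratio


# One sample per minute for an hour; the last five minutes jump from 100 to 200.
now = 3600
spike = [(60 * i, 200.0 if i > 55 else 100.0) for i in range(1, 61)]
flat = [(60 * i, 100.0) for i in range(1, 61)]
# A 50% relative jump that is too small in absolute terms to matter.
small = [(60 * i, 15.0 if i > 55 else 10.0) for i in range(1, 61)]
```

The absolute check suppresses alerts for low-valued series where a large ratio is insignificant, and the relative check does the same for high-valued series where a difference of 20 is noise.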
The forecast function returns an estimated value for the current series based on the Holt-Winters or ARIMA forecasting algorithms.
The condition fires if the window average deviates from the expected value by more than 25 in either direction.
abs(avg() - forecast()) > 25
The forecast_deviation function can be used to compare actual and expected values as a multiple of the standard deviation.
abs(forecast_deviation(avg())) > 2
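The two forecast-based checks can be compared side by side. The sketch assumes that the deviation is computed as (actual - forecast) / standard deviation; this formula, the standalone `forecast_deviation` signature with explicit arguments, and the numbers are assumptions for illustration and are not taken from the engine.

```python
def forecast_deviation(actual, forecast, stddev):
    """Difference between actual and forecast values in units of standard deviation (assumed formula)."""
    return (actual - forecast) / stddev


actual_avg = 130.0
expected = 100.0
spread = 10.0

# Fixed-distance check, as in: abs(avg() - forecast()) > 25
fixed_alert = abs(actual_avg - expected) > 25

# Normalized check, as in: abs(forecast_deviation(avg())) > 2
normalized_alert = abs(forecast_deviation(actual_avg, expected, spread)) > 2
```

The normalized form adapts to the volatility of each series, so the same threshold of 2 can serve series with very different value ranges.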
In cases where the analyzed metric is dependent on another measure, use the database functions to identify abnormal behavior in one of the metrics.
The primary metric is expected to be below 50 as long as the second metric remains below 100. Otherwise, an alert is raised.
avg() > 50 && db_statistic('avg', '1 hour', 'page_views_per_minute') < 100
The same condition can be generalized with a ratio as well.
avg() / db_statistic('avg', '1 hour', 'page_views_per_minute') > 2
As an alternative, use the value(metric) function to access the last value for metrics submitted within the same series command or parsed from the same row in CSV files.
value > 75 && value('page_views_per_minute') < 1000
The default baseline can be adjusted for particular series using the Overrides table.
Severity is a measure of criticality assigned to alerts generated by the rule. The severity level ranges up to FATAL and is specified on the 'Logging' tab in the rule editor.
If an alert is raised by a condition defined in the Overrides table, its severity supersedes the default severity.
In rules operating on message commands, the alert severity can be inherited from the 'severity' field of the underlying message.
To enable this behavior, set Severity on the 'Logging' tab to
In cases that require analysis of long-term data or flexible joining and grouping, it may be more efficient to analyze and react to data using Scheduled SQL queries.
To trigger a notification by an SQL query:
- Develop a query such that it returns an empty result if the situation is normal.
SELECT entity, tags, percentile(90, value)
  FROM page_views
WHERE datetime >= current_day
  GROUP BY entity, tags, period(1 DAY)
HAVING percentile(90, value) > 1000 -- HAVING condition acts as a rule filter
- Create a scheduled SQL query.
- Set Send Empty Report parameter to
- Specify triggers such as an email notification or a file export.
As a result, the query triggers actions only when it returns at least one row.
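The "empty result means normal" convention reduces to a guard in front of the configured actions. The sketch below is a generic model of that behavior, not the scheduler's code: `dispatch`, the action callables, and the row tuples are all hypothetical.

```python
def dispatch(rows, actions):
    """Run every action when the query returned at least one row.

    Returns the number of actions executed.
    """
    if not rows:
        return 0  # empty result: the situation is normal, nothing to do
    for action in actions:
        action(rows)
    return len(actions)


sent = []


def record_notification(rows):
    # Stand-in for an email notification or a file export.
    sent.append(len(rows))


quiet = dispatch([], [record_notification])
alerted = dispatch([("host-1", "{}", 1250.0)], [record_notification])
```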
Rule windows are initialized in memory and are displayed on the Alerts > Rule Windows page. If no windows are present for the given rule, check that the rule is enabled and that data is not discarded by one of the filters.
Rule errors can occur in case of invalid or malformed expressions. The Alerts > Rule Errors page contains a list of the most recent errors as well as the relevant context and command details.