The rule engine automates repetitive tasks based on real-time statistical analysis of incoming data.
The engine evaluates incoming series, message, and property commands and executes response actions when appropriate:
IF condition THEN action-1 ... action-N
IF percentile(75) > 300 THEN alert_slack_channel
max() > 1.5 && value('temperature') > 50
The incoming data is consumed by the rule engine independently of the persistence path.
The data is maintained in windows, which are in-memory structures initialized for each unique combination of metric, entity, and grouping tags extracted from incoming commands.
The processing pipeline consists of the following stages:
- Filtering
- Grouping
- Evaluating Condition
- Executing Actions
The incoming data samples are processed by a chain of filters prior to the grouping stage. Such filters include:
Input Filter. All samples are discarded if the Settings > Input Settings > Rule Engine option is disabled.
Status Filter. Samples are discarded for metrics and entities that are disabled.
Rule Filter. Samples are discarded unless they satisfy the metric, entity, and tag filters specified in the rule.
Once the sample passes through the filter chain, the sample is allocated to matching windows grouped by metric, entity, and optional tags. Each window maintains its own array of data samples in working memory.
The commands can be associated with windows in a 1-to-1 fashion by enabling the All Tags setting or by enumerating all tags as the grouping tags.
If the Group by Entity option is cleared, the entity field is ignored for grouping purposes and the window is grouped only by metric and tags.
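The grouping rule can be illustrated with a minimal Python sketch. This is not ATSD code: `window_key`, `grouping_tags`, and `group_by_entity` are hypothetical names, and the sample data is invented for illustration.

```python
from collections import defaultdict

def window_key(metric, entity, tags, grouping_tags, group_by_entity=True):
    """Build the key that selects the window for one incoming sample (hypothetical helper)."""
    # Only tags listed as grouping tags take part in the key.
    tag_part = tuple(sorted((k, v) for k, v in tags.items() if k in grouping_tags))
    # With entity grouping cleared, the entity is ignored.
    return (metric, entity if group_by_entity else None, tag_part)

samples = [
    ("cpu_busy", "host-1", {"core": "0", "dc": "east"}, 71.0),
    ("cpu_busy", "host-1", {"core": "1", "dc": "east"}, 64.0),
    ("cpu_busy", "host-2", {"core": "0", "dc": "west"}, 93.0),
]

# Grouped by metric, entity, and the 'core' tag: one window per sample here.
windows = defaultdict(list)
for metric, entity, tags, value in samples:
    windows[window_key(metric, entity, tags, {"core"})].append(value)

# Entity ignored: samples from both hosts with core=0 share one window.
merged = defaultdict(list)
for metric, entity, tags, value in samples:
    key = window_key(metric, entity, tags, {"core"}, group_by_entity=False)
    merged[key].append(value)
```

In this sketch the `dc` tag never affects the key because it is not listed as a grouping tag.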
The rule engine supports two types of windows:
Count-based windows accumulate up to the specified number of samples. The samples are sorted by command timestamp, with the most recent command placed at the end of the array. When the window reaches the limit, the first sample with the earliest timestamp is removed from the window to free up space for an incoming sample.
Time-based windows store samples that are timestamped within the specified interval of time. The start time of time-based windows is continuously updated. Old records are automatically removed from the window once they are outside of the time range.
Windows are continuously updated as new samples are added and old samples are removed.
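The two window types can be sketched in Python as follows. This is an illustrative model under stated assumptions, not the engine's implementation: the class names are hypothetical, timestamps are plain numbers, and the time-based window assumes samples arrive in timestamp order and treats a sample exactly one interval old as expired.

```python
from collections import deque

class CountWindow:
    """Keeps at most `limit` samples, sorted by timestamp (hypothetical model)."""
    def __init__(self, limit):
        self.limit = limit
        self.samples = []  # list of (timestamp, value)

    def add(self, timestamp, value):
        self.samples.append((timestamp, value))
        self.samples.sort(key=lambda s: s[0])
        if len(self.samples) > self.limit:
            self.samples.pop(0)  # remove the sample with the earliest timestamp

class TimeWindow:
    """Keeps samples newer than `now - interval` (hypothetical model)."""
    def __init__(self, interval):
        self.interval = interval
        self.samples = deque()  # assumes in-order arrival

    def add(self, timestamp, value):
        self.samples.append((timestamp, value))
        self.expire(now=timestamp)

    def expire(self, now):
        # The window start moves forward with the clock.
        while self.samples and self.samples[0][0] <= now - self.interval:
            self.samples.popleft()

count_window = CountWindow(limit=3)
for t, v in [(1, 10), (2, 20), (3, 30), (4, 40)]:
    count_window.add(t, v)

time_window = TimeWindow(interval=60)
for t, v in [(0, 1.0), (30, 2.0), (70, 3.0)]:
    time_window.add(t, v)
```

After these calls, the count-based window has dropped the sample with timestamp 1, and the time-based window has dropped the sample with timestamp 0.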
When a window is updated, the rule engine checks the condition and triggers various response actions based on the condition result.
avg() > 80
Windows are stateful. When the condition for a given window changes to true, the window is initialized in memory with the status OPEN. On subsequent true evaluations, the status transitions to REPEAT. When the condition becomes false, the window status is reverted to CANCEL.
The current window status is displayed on the Alerts > Rule Windows page.
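The status lifecycle can be modeled as a small state machine. The sketch below assumes the status names OPEN, REPEAT, and CANCEL used elsewhere in this document and assumes that a window that has not raised an alert is in the CANCEL status; `next_status` is a hypothetical helper, not an ATSD function.

```python
def next_status(previous, condition):
    """Return the window status after one condition evaluation (hypothetical model)."""
    if not condition:
        return "CANCEL"
    # First true evaluation opens the window; further true evaluations repeat.
    return "OPEN" if previous == "CANCEL" else "REPEAT"

status = "CANCEL"
history = []
for result in [False, True, True, True, False, True]:
    status = next_status(status, result)
    history.append(status)
```

Here `history` records the status after each evaluation, so consecutive true results produce one OPEN followed by REPEAT entries until the condition turns false.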
Windows are updated when the commands enter or exit the windows. Scheduled rules that are checked at a regular interval, regardless of incoming data, can be constructed using the built-in timer metrics.
Actions are triggered on window status changes, for example upon window OPEN status or every N-th REPEAT status occurrence.
Supported response actions:
Triggers for the above actions can be configured independently, for example to send email every 6 hours yet to log events for all repeat occurrences.
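As an illustration of independent triggers, the Python sketch below decides per action whether to fire on OPEN and on every N-th REPEAT. The function `should_fire`, its parameters, and the way repeats are counted are assumptions made for this example and may not match how the engine counts occurrences.

```python
def should_fire(status, repeat_count, on_open=True, every_nth_repeat=0):
    """Decide whether one action fires for this status change (hypothetical model)."""
    if status == "OPEN":
        return on_open
    if status == "REPEAT" and every_nth_repeat > 0:
        return repeat_count % every_nth_repeat == 0
    return False

statuses = ["OPEN"] + ["REPEAT"] * 6

email_fired = []   # fires on OPEN and on every 3rd REPEAT
log_fired = []     # fires on OPEN and on every REPEAT
repeat_count = 0
for status in statuses:
    if status == "REPEAT":
        repeat_count += 1
    email_fired.append(should_fire(status, repeat_count, every_nth_repeat=3))
    log_fired.append(should_fire(status, repeat_count, every_nth_repeat=1))
```

Each action keeps its own trigger settings, so the logging action fires on every status in this run while the email action fires on only three of them.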
- Value functions:
percentile(95) > 80 && values('metric2') != 0
- Database functions:
percentile(95) > 80 && db_statistic('max', '1 hour', 'metric2') < 10*1024
- Rule functions:
percentile(95) > 80 && rule_open('inside_temperature_check')
Rules can be considered software programs in their own right and as such involve initial development, testing, documentation and maintenance efforts.
To minimize the number of rules with manual thresholds, the rule engine in ATSD provides the following capabilities:
- Condition overrides.
- Comparison of windows with different lengths.
- Automated thresholds.
Thresholds can be set manually, which requires some trial and error to determine a level that strikes a balance between false positives and missed alerts.
value > 90
Since a single baseline cannot handle all edge cases, the Overrides table can be used to enumerate exceptions.
To reduce unnecessary alerts, apply averaging functions and increase window durations.
avg() > 90
To reduce distortions caused by a small number of outliers, use percentiles instead of averages.
percentile(75) > 90
Alternatively, use the minimum or a low percentile function with the reversed comparator to check that all samples in the window exceed the threshold. This is equivalent to checking that the last N consecutive samples are above the threshold.
// all samples are above 90
min() > 90
// at most 10% of the smallest samples are below 90
percentile(10) >= 90
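The equivalence between the minimum check and the "all samples" check, and the extra tolerance of a low percentile, can be verified with a short Python sketch. The percentile here uses the nearest-rank definition, which is an assumption: the engine's percentile function may interpolate and return slightly different values. The window contents are invented.

```python
import math

def percentile_nearest_rank(samples, p):
    """Nearest-rank percentile (assumed definition, other methods interpolate)."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# 20 samples: a single dip to 88, the rest between 91 and 109.
window = [88] + list(range(91, 110))

all_above = all(v > 90 for v in window)
min_check = min(window) > 90
p10_check = percentile_nearest_rank(window, 10) >= 90
```

In this window the single low sample makes the minimum check fail, while the 10th percentile check still passes because only one sample in twenty is below the threshold.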
Short-term anomalies can be spotted by comparing statistical functions for different overlapping intervals.
The condition below activates an alert if the 5-minute average exceeds the 1-hour average by more than 20 and by more than 10%.
avg('5 minute') - avg() > 20 && avg('5 minute') / avg() > 1.1
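The following Python sketch mirrors the two-part comparison on synthetic data: one sample per minute over one hour, with timestamps in seconds. The function `short_term_spike` and its parameters are hypothetical, and the sketch assumes a non-zero long-term average.

```python
from statistics import mean

def short_term_spike(samples, now, short=300, abs_delta=20, ratio=1.1):
    """samples: list of (timestamp_seconds, value) covering the last hour (hypothetical helper)."""
    long_avg = mean(v for _, v in samples)
    recent = [v for t, v in samples if t > now - short]
    if not recent:
        return False
    short_avg = mean(recent)
    # Both the absolute and the relative difference must exceed their limits.
    return short_avg - long_avg > abs_delta and short_avg / long_avg > ratio

now = 3600
timestamps = [60 * i for i in range(1, 61)]

spike = [(t, 150 if t > 3300 else 100) for t in timestamps]
flat = [(t, 100) for t in timestamps]
# Large relative jump on a small baseline: the absolute check suppresses the alert.
small_base = [(t, 25 if t > 3300 else 10) for t in timestamps]

results = [short_term_spike(s, now) for s in (spike, flat, small_base)]
```

Requiring both checks avoids alerts for series with a small baseline, where a modest absolute change already produces a large ratio.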
The forecast function retrieves a precalculated forecast for the current series. The forecast object contains fields that can be compared with the current statistics, for example, to raise an alert if the moving average deviates from the expected value by more than the specified threshold.
# forecast() returns an object with fields and methods
abs(avg() - forecast().interpolated) > 25
For convenience the actual value can be compared with the forecast range.
The forecast_deviation function can be called to compare the actual and expected values as a ratio of the standard deviation.
abs(forecast_deviation(avg())) > 3.0
In cases where the analyzed metric is related to another metric, use the database functions to identify abnormal behavior in both metrics.
The primary metric is expected to be below 50 as long as the second metric remains below 100. Otherwise, an alert is raised.
avg() > 50 && db_statistic('avg', '1 hour', 'page_views_per_minute') < 100
The same condition can be generalized with a ratio as well.
avg() / db_statistic('avg', '1 hour', 'page_views_per_minute') > 2
As an alternative, use the value() function to access the last value for metrics submitted within the same series command or parsed from the same row in CSV files.
value > 75 && value('page_views_per_minute') < 1000
The default baseline can be adjusted for particular series using the Overrides table.
To check conditions on a fixed schedule, use the built-in timer_ metrics such as timer_1h, which are generated by the database internally.
// Runs on Saturday between 15:00 and 16:00
now.hourOfDay = 15 && now.dayOfWeek = 6 && /* remaining checks */
To prevent repeat notifications when the compared value oscillates around the threshold, make the threshold conditional upon the window status. Once the window becomes open, the threshold is adjusted to cancel the alert only after a substantial change in the compared value.
/* Window opens when the value reaches 80. Thereafter the value needs to drop below 70 for the window to cancel. */
value >= (before_status == 'CANCEL' ? 80 : 70)
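The effect of a status-dependent threshold can be traced with a short Python simulation. The thresholds 80 and 70 come from the expression above; the function names, the status transitions, and the sample values are assumptions made for illustration.

```python
def condition(value, before_status, open_at=80, cancel_below=70):
    """Status-dependent threshold: stricter to open, looser to stay open."""
    threshold = open_at if before_status == "CANCEL" else cancel_below
    return value >= threshold

def run(values, hysteresis=True):
    status = "CANCEL"
    history = []
    for v in values:
        # Without hysteresis the threshold is always 80.
        result = condition(v, status) if hysteresis else v >= 80
        if result:
            status = "OPEN" if status == "CANCEL" else "REPEAT"
        else:
            status = "CANCEL"
        history.append(status)
    return history

values = [75, 82, 78, 81, 72, 69, 75]
with_hysteresis = run(values)
without_hysteresis = run(values, hysteresis=False)
```

With the conditional threshold the window opens once and cancels once, whereas a fixed threshold of 80 opens the window twice for the same oscillating values.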
Severity is a measure of criticality assigned to alerts generated by the rule. The severity level ranges up to FATAL and is specified on the Logging tab in the rule editor.
If an alert is raised by a condition defined in the Overrides table, its severity supersedes the default severity.
In rules operating on message commands, the alert severity can be inherited from the severity field of the underlying message.
To enable this behavior, set Severity on the Logging tab to
In cases that involve processing of large volumes of historical data, use Scheduled SQL queries to analyze the data.
To trigger an email notification from an SQL query, use HAVING filters to develop a query that returns no rows if the situation is normal.
SELECT entity, tags, percentile(90, value)
  FROM page_views
WHERE datetime >= current_day
  GROUP BY entity, tags, period(1 DAY)
HAVING percentile(90, value) > 1000
-- HAVING condition acts as a rule filter
- Set Send Empty Report parameter to
- Specify triggers such as an email notification or a file export.
As a result, the query triggers actions only when it returns at least one row.
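The "no rows means normal" pattern can be tried outside the database with a small Python sketch. It uses an in-memory SQLite table as a stand-in for ATSD and MAX in place of percentile, since SQLite has no built-in percentile aggregate. The table contents, thresholds, and helper names are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (entity TEXT, value REAL)")
conn.executemany(
    "INSERT INTO page_views VALUES (?, ?)",
    [("web-1", 400.0), ("web-1", 650.0), ("web-2", 300.0)],
)

# HAVING keeps only the groups that violate the threshold.
QUERY = """
SELECT entity, MAX(value) AS max_value
  FROM page_views
GROUP BY entity
HAVING MAX(value) > ?
"""

def rows_to_report(threshold):
    return conn.execute(QUERY, (threshold,)).fetchall()

def should_notify(rows):
    # Mirrors a report that is not sent when the result set is empty.
    return len(rows) > 0

normal = rows_to_report(1000)
abnormal = rows_to_report(500)
```

In this sketch a threshold of 1000 yields an empty result and no notification, while a threshold of 500 returns one offending entity.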
Rule windows are initialized in memory and are displayed on the Alerts > Rule Windows page. If no windows are present for the given rule, check that the rule is enabled and that data is not discarded by one of the filters.
Rule Errors can occur in case of invalid or malformed expressions. The Alerts > Rule Errors page contains the list of most recent errors as well as the relevant context and the command details. The errors are also logged as messages by entity
atsd with type
rule-error and source
Webhook, Email and Script actions log their status as ATSD messages. To view action logs, select the option in the left menu.
- Email Notification Log
- Webhook Notification Log
- Script Action Log