The Axibase Time Series Database enables users to instrument their information assets, systems, and sensors, and it provides visibility, control, and operational intelligence. The primary use cases for ATSD are system performance monitoring, sensor data storage and analytics, as well as IoT (Internet-of-Things) data storage and analytics.
ATSD excels when it comes to working with IoT and sensor-focused use cases.
A typical use case covers the collection and storage of sensor data, execution of data analytics at scale, generation of forecasts, creation of visualization portals, and automatic raising of alerts in the case of abnormal deviations or threshold breaches. ATSD was purposefully designed to handle complex projects like these from start to finish.
This article will focus on an implemented use case: monitoring and analyzing air quality sensor data using the Axibase Time Series Database.
Steps taken to execute the use case:
– Collect historical data from AirNow into ATSD
– Stream current data from AirNow into ATSD
– Use the R Language to generate forecasts for all collected entities and metrics
– Create Holt-Winters forecasts in ATSD for all collected entities and metrics
– Build visualization portal in ATSD
– Set up alert and notification rules in ATSD Rule Engine
We used the data from more than 2,000 monitoring sensor stations located in over 300 cities across the United States. These stations generate hourly readings of several key air quality metrics. We retrieved both the historical and the streaming data and stored it in ATSD.
PM2.5 is particulate matter consisting of particles less than 2.5 micrometers in diameter, often called “fine” particles. These particles are so small they can be detected only with an electron microscope. Sources of fine particles include all types of combustion, such as combustion in motor vehicles, power plants, residential wood burning, forest fires, agricultural burning, and industrial processes.
O3 (ozone) occurs naturally in the Earth’s upper atmosphere, 6 to 30 miles above the Earth’s surface. There, ozone forms a protective layer that shields us from the Sun’s harmful ultraviolet rays. Man-made chemicals are known to destroy this beneficial ozone layer.
Other collected metrics are: PM10 (particulate matter up to 10 micrometers in size), CO (carbon monoxide), NO2 (nitrogen dioxide), and SO2 (sulfur dioxide).
Collecting/Streaming the Data
For this use case, we collected, stored, and analyzed a total of 5 years of historical data and used it to generate the forecasts. For the forecasts to be as accurate as possible and to account for trends and seasonal cycles, we recommend accruing at least 3 to 5 years of detailed historical data.
We immediately identified an issue with the accuracy of the data. The data was becoming available with a fluctuating time delay of 1 to 3 hours. We conducted an analysis by collecting all values for each metric and entity. This resulted in several data points being recorded for the same metric, entity, and time. This fact led us to believe that there was both a time delay and a stabilization period. The results are presented below:
Once available, the data took another 3 to 12 hours to stabilize. This means that the values were fluctuating during that time frame for most data points.
As a result of this analysis, we decided to collect all data with a 12-hour delay in order to increase the accuracy of the data and the forecasts.
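The delay and stabilization analysis described above can be sketched in R. This is a minimal illustration, assuming a hypothetical log `obs` in which every retrieved version of a reading is kept together with the moment it was collected. The column names and the values are made up for the example and are not AirNow data.

```r
# Hypothetical snapshot log: every retrieved version of each hourly reading.
# Column names and values are invented for illustration.
obs <- data.frame(
  time      = as.POSIXct(c("2015-05-01 00:00", "2015-05-01 00:00", "2015-05-01 00:00",
                           "2015-05-01 01:00", "2015-05-01 01:00"), tz = "UTC"),
  collected = as.POSIXct(c("2015-05-01 02:00", "2015-05-01 05:00", "2015-05-01 09:00",
                           "2015-05-01 02:00", "2015-05-01 04:00"), tz = "UTC"),
  value     = c(12.1, 13.4, 13.4, 9.8, 10.2)
)

# For one reading: how late it first appeared and how long it kept changing
stabilization <- function(d) {
  d <- d[order(d$collected), ]
  # Collections at which the value still differed from its final state
  changed <- which(d$value != d$value[nrow(d)])
  stable_at <- if (length(changed) == 0) d$collected[1] else d$collected[max(changed) + 1]
  data.frame(
    time            = d$time[1],
    delay_hours     = as.numeric(difftime(d$collected[1], d$time[1], units = "hours")),
    stabilize_hours = as.numeric(difftime(stable_at, d$collected[1], units = "hours"))
  )
}

result <- do.call(rbind, lapply(split(obs, obs$time), stabilization))
print(result)
```

Taking the maximum of `delay_hours + stabilize_hours` over all readings gives a rough lower bound for a safe collection delay.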
Axibase Collector was used to collect the data from the monitoring sensor stations and to stream the data into the Axibase Time Series Database. The Collector is an effective tool for batch downloading of historical data and streaming fresh data as it becomes available.
In Axibase Collector, we set up a job to collect the data from the air monitoring sensor stations in Fresno, California. For this particular example, Fresno was selected because it is considered to be one of the most polluted cities in the United States, with air quality warnings often issued to the public.
The File Job sets up a cron task that runs at a specified interval to collect the data and batch upload it into ATSD.
The File Forwarding Configuration is a parser configuration for the data that is incoming from an external source. The path to the external data source is specified and a default entity is assigned to the Fresno monitoring sensor station. Start time and end time determine the time frame for retrieving new data (end time syntax is used).
Once these two configurations are saved, the Collector starts streaming fresh data into ATSD.
You can view the entities and metrics streamed by the Collector into ATSD from the UI.
The whole data set currently has over 87,000,000 records for each metric, all stored in ATSD.
Generating the Forecasts in R
Our next steps were to analyze the data and generate accurate forecasts. First, we used the built-in Holt-Winters and ARIMA algorithms in ATSD, and then used custom R language forecasting algorithms for comparison.
To analyze the data in R, the R language API client was used to retrieve the data and then to save the custom forecasts back into ATSD.
Forecasts were built for all metrics for the period from May 11th to June 1st.
The steps taken to forecast the PM2.5 metric are outlined below.
The Rssa package was used to generate the forecast. This package implements the Singular Spectrum Analysis (SSA) method.
We used recommendations from the following sources to choose parameters for SSA forecasting:
- Basic Singular Spectrum Analysis and Forecasting with R, Computational Statistics and Data Analysis, Volume 71, March 2014, Pages 934-954
- Singular Spectrum Analysis for Time Series, SpringerBriefs in Statistics, 2013
The following steps were executed when building the forecasts:
- PM2.5 series were retrieved from ATSD using the query() function. 72 days of data were loaded from ATSD.
- SSA decomposition was built with a window of 24 days (24 * 24 hourly values) and 100 eigentriples:
dec <- ssa(values, L = 24 * 24, neig = 100)
- Eigenvalues, eigenvectors, pairs of sequential eigenvectors, and the w-correlation matrix of the decomposition were graphed:
plot(dec, type = "values")
plot(dec, type = "vectors", idx = 1:20)
plot(dec, type = "paired", idx = 1:20)
plot(wcor(dec, groups = 1:100))
A group of eigentriples was then selected to use for forecasting. The plots suggest several options.
Three different options – 1, 1:23, and 1:35 – were tested because groups 1, 2:23, and 24:35 are separated from the other eigentriples, as judged from the w-correlation matrix.
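The rforecast() function was used to build the forecast. The selected eigentriples are passed as a single group, so that one combined forecast is returned:
rforecast(x = dec, groups = list(1:35), len = 21 * 24, base = "original")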
We ran the tests with bforecast() using different parameters, but rforecast() was determined to be the best option in this case.
Graph of the original series and three resulting forecasts:
We selected the forecast with eigen triples 1:35 as the most accurate and saved it into ATSD.
- To save forecasts into ATSD, the save_series() function was used.
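To make the SSA steps above more concrete, here is a minimal base R sketch of basic SSA with a recurrent forecast. It is a teaching illustration and not the Rssa implementation: the function name `ssa_rforecast` and the synthetic series are assumptions made for this example, and the forecast continues the reconstructed series rather than the original one.

```r
# Basic SSA with a recurrent forecast, written in base R for illustration.
# ssa_rforecast() is a hypothetical helper, not part of Rssa.
ssa_rforecast <- function(x, L, idx, len) {
  N <- length(x)
  K <- N - L + 1
  # Embedding: L x K trajectory matrix whose columns are lagged windows
  X <- sapply(seq_len(K), function(j) x[j:(j + L - 1)])
  s <- svd(X)
  U <- s$u[, idx, drop = FALSE]
  # Grouping: low-rank matrix built from the selected eigentriples
  Xg <- U %*% (s$d[idx] * t(s$v[, idx, drop = FALSE]))
  # Diagonal averaging: average entries with equal i + j to recover a series
  rec <- as.numeric(tapply(Xg, row(Xg) + col(Xg) - 1, mean))
  # Linear recurrence coefficients derived from the selected eigenvectors
  p <- U[L, ]
  R <- (U[-L, , drop = FALSE] %*% p) / (1 - sum(p^2))
  # Recurrent forecast: each new point combines the previous L - 1 points
  y <- rec
  for (i in seq_len(len)) {
    y <- c(y, sum(R * y[(length(y) - L + 2):length(y)]))
  }
  list(reconstructed = rec, forecast = y[(N + 1):(N + len)])
}

# Synthetic hourly series: a constant level plus a daily cycle (rank 3)
t <- 1:240
x <- 10 + 5 * sin(2 * pi * t / 24)
fit <- ssa_rforecast(x, L = 48, idx = 1:3, len = 24)

# Compare the 24-hour forecast with the known continuation of the series
truth <- 10 + 5 * sin(2 * pi * (241:264) / 24)
print(max(abs(fit$forecast - truth)))
```

On real, noisy data the choice of `idx` matters much more than it does here, which is why the eigenvalue and w-correlation plots are inspected before grouping.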
Generating the Forecasts in ATSD
The next step was to create a competing forecast in ATSD using the built-in forecasting features. A majority of the settings were left in automatic mode so that the system itself could determine the best parameters when generating the forecast.
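For readers who want to experiment with the same family of models outside ATSD, the sketch below fits a Holt-Winters model with the `HoltWinters()` function from base R's stats package. It only illustrates the technique on synthetic data and does not reproduce ATSD's internal implementation or settings. Leaving the smoothing parameters unset is loosely analogous to the automatic mode mentioned above.

```r
set.seed(1)
hours <- 1:(24 * 28)                      # four weeks of synthetic hourly data
pm25 <- 12 + 0.002 * hours +              # slow upward trend
  4 * sin(2 * pi * hours / 24) +          # daily cycle
  rnorm(length(hours), sd = 0.5)          # measurement noise
series <- ts(pm25, frequency = 24)        # 24 observations per seasonal cycle

# With alpha, beta and gamma unset, HoltWinters() estimates them from the data
fit <- HoltWinters(series)

# Forecast three days ahead with a 95% prediction interval
fc <- predict(fit, n.ahead = 72, prediction.interval = TRUE, level = 0.95)
print(head(fc))
```

The returned matrix has the columns `fit`, `upr`, and `lwr`, which correspond to the point forecast and the bounds of the prediction interval.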
Visualizing the Results
To visualize the data and the forecasts, a portal was created using the built-in visualization features.
Thresholds were set for each metric in order to alert the user when either the forecast or the actual data was reaching unhealthy levels of air pollution.
When comparing the R forecasts and the ATSD forecasts to the actual data, the ATSD forecasts turned out to be significantly more accurate in most cases, learning and recognizing patterns and trends with more certainty. As of today, as the actual data is coming in, it is following the ATSD forecast very closely. Any deviations are minimal and fall within the confidence interval.
In this use case, ATSD’s built-in forecasting produced more accurate results in most cases than one of the most advanced R-language forecasting algorithms. These results suggest that ATSD can be relied on to forecast air pollution a few days or weeks into the future.
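One simple way to put a number on such a comparison is to score each forecast against the actual values with error metrics such as RMSE and MAE. The sketch below uses made-up values purely to show the calculation. They are not results from this use case, and the article does not state which metric was used.

```r
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
mae  <- function(actual, predicted) mean(abs(actual - predicted))

# Invented values for illustration only
actual     <- c(10, 12, 15, 11, 9)
forecast_a <- c(11, 12, 14, 10, 9)
forecast_b <- c(13, 9, 18, 14, 6)

# Lower scores mean the forecast tracked the actual data more closely
scores <- data.frame(
  forecast = c("a", "b"),
  rmse     = c(rmse(actual, forecast_a), rmse(actual, forecast_b)),
  mae      = c(mae(actual, forecast_a),  mae(actual, forecast_b))
)
print(scores)
```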
Alerts and Notifications
A smart alert notification was set up in the Rule Engine to notify the user by email if the pollution levels breached the set threshold or deviated from the ATSD forecast.
Analytical rules set in the Rule Engine for the PM2.5 metric. An alert is raised if the streaming data satisfies either of the rules below:
| Rule | Description |
|---|---|
| value > 30 | Raise an alert if the last metric value exceeds the threshold. |
| forecast_deviation(avg()) > 2 | Raise an alert if the actual value exceeds the forecast by more than 2 standard deviations (see image below). Smart rules capture extreme spikes in air pollution. |
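The logic of the two rules can be mimicked in a few lines of R. This is only a sketch under an assumed reading of the expressions: `forecast_deviation(avg())` is taken to mean the difference between the window average and the forecast, divided by the forecast standard deviation. The function `check_rules` and its arguments are hypothetical, and the actual evaluation happens inside the ATSD Rule Engine.

```r
# Assumed interpretation of the two rules; not the Rule Engine implementation
check_rules <- function(window, forecast, forecast_sd, threshold = 30, max_dev = 2) {
  last_value <- window[length(window)]
  # Deviation of the window average from the forecast, in standard deviations
  deviation <- (mean(window) - forecast) / forecast_sd
  c(threshold_breach   = last_value > threshold,
    forecast_deviation = deviation > max_dev)
}

# A spike: the last value is above 30 and the average is 2.5 deviations high
spike <- check_rules(window = c(18, 22, 35), forecast = 15, forecast_sd = 4)
# A quiet period: neither rule fires
quiet <- check_rules(window = c(14, 16, 15), forecast = 15, forecast_sd = 4)
print(spike)
print(quiet)
```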
At this point the use case is fully implemented and will function autonomously. ATSD automatically streams the sensor data, generates a new forecast every 24 hours for 3 weeks into the future, and raises alerts if the pollution levels rise above the threshold or if a negative trend is discovered.
Results and Conclusions
The results of this use case can be useful for travelers who need an accurate forecast of the environmental and pollution-related issues they may face during their visit. This information can also be helpful to expats moving to a new city or country. Studies have shown that long-term exposure to high levels of PM2.5 can lead to serious health issues.
This research and environmental forecasting is especially valuable in regions like China, where air pollution is seriously affecting the local population and visitors. In cities like Shanghai, Beijing, and Guangzhou, PM2.5 levels are constantly fluctuating from unhealthy to critical levels, yet accurate forecasting is limited. PM2.5 forecasting is essential for travelers and tourists who need to plan their trips during periods of lower pollution levels due to potential health risks of pollution exposure.
Government agencies can also take advantage of pollution monitoring to plan and issue early warnings. This way, precautions can be taken to prevent exposure to unhealthy levels of PM2.5 pollution. Detecting a trend and raising an alert before PM2.5 levels breach the unhealthy threshold is critical for public safety and health. Reliable air quality data and data analytics allow people to adapt and make informed decisions.
In conclusion, Big Data Analytics is an empowering tool that can put valuable information in the hands of corporations, governments, and individuals. That knowledge can motivate people to effect change. Air pollution is currently affecting the lives of over a billion people across the globe and with current trends the situation is only going to get worse. Often the exact source of air pollution, the way the pollutants are interacting in the air, and the dispersion of pollution cannot be determined. The lack of such information makes air pollution a difficult problem to tackle. However, the advances in modern technologies and the arrival of new Big Data solutions enable us to combine sensor data with meteorological satellite data to perform extensive analytics and forecasting. Big Data analytics will make it possible to pinpoint the pollution source and dispersion trends in advance.
At Axibase, we sincerely believe that Big Data has a large role to play in tackling air pollution. We predict that in the coming years advanced data analytics will be a key factor influencing governments’ decisions and changes in pollution regulations.