Environmental Monitoring Using Big Data

The Axibase Time Series Database (ATSD) enables users to instrument their information assets, systems, and sensors, providing visibility, control, and operational intelligence. The primary use cases for ATSD are system performance monitoring, sensor data storage and analytics, and IoT (Internet of Things) data storage and analytics.

ATSD excels when it comes to working with IoT and sensor-focused use cases.

A typical use case covers the collection and storage of sensor data, execution of data analytics at scale, generation of forecasts, creation of visualization portals, and automatic raising of alerts in the case of abnormal deviations or threshold breaches. ATSD was purposefully designed to handle complex projects like these from start to finish.

This article will focus on an implemented use case: monitoring and analyzing air quality sensor data using the Axibase Time Series Database.

Steps taken to execute the use case:
– Collect historical data from AirNow into ATSD
– Stream current data from AirNow into ATSD
– Use the R Language to generate forecasts for all collected entities and metrics
– Create Holt-Winters forecasts in ATSD for all collected entities and metrics
– Build visualization portal in ATSD
– Set up alert and notification rules in ATSD Rule Engine

The Data

We used the data from more than 2,000 monitoring sensor stations located in over 300 cities across the United States. These stations generate hourly readings of several key air quality metrics. We retrieved both the historical and the streaming data and stored it in ATSD.

The data was sourced from AirNow, which is a U.S. government EPA program that protects public health by providing forecast and real-time air quality information.

The two main collected metrics were PM2.5 and ozone (O3).

PM2.5 is particulate matter consisting of particles less than 2.5 micrometers in diameter, often called “fine” particles. These particles are so small they can be detected only with an electron microscope. Sources of fine particles include all types of combustion, such as combustion in motor vehicles, power plants, residential wood burning, forest fires, agricultural burning, and industrial processes.

O3 (ozone) occurs naturally in the Earth’s upper atmosphere, 6 to 30 miles above the surface, where it forms a protective layer that shields us from the Sun’s harmful ultraviolet rays. Man-made chemicals are known to destroy this beneficial ozone layer.

Other collected metrics are: PM10 (particulate matter up to 10 micrometers in size), CO (carbon monoxide), NO2 (nitrogen dioxide), and SO2 (sulfur dioxide).

Collecting/Streaming the Data

For this use case, we collected, stored, analyzed, and forecast a total of 5 years of historical data. For forecasts to achieve maximum accuracy and to account for trends and seasonal cycles, we recommend accruing at least 3 to 5 years of detailed historical data.

We immediately identified an issue with the timeliness and accuracy of the data: readings became available with a fluctuating delay of 1 to 3 hours. We analyzed the issue by collecting all values reported for each metric and entity, which revealed several data points recorded for the same metric, entity, and time. This led us to conclude that there was both a time delay and a stabilization period. The results are presented below:

insertion total delay

Once available, the data took another 3 to 12 hours to stabilize, meaning that the values for most data points continued to fluctuate during that window.


As a result of this analysis, we decided to collect all data with a 12-hour delay in order to increase the accuracy of the data and the forecasts.

total delay
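The delayed-collection policy described above can be sketched as a small scheduling helper. This is an illustrative sketch only (the function name and structure are ours, not an Axibase Collector API); the 12-hour constant comes from the delay analysis above:

```python
from datetime import datetime, timedelta

STABILIZATION_DELAY_HOURS = 12  # from the delay analysis above

def collection_window(now, delay_hours=STABILIZATION_DELAY_HOURS):
    """Return the (start, end) of the one-hour window that is safe to
    collect now: the hour that ended `delay_hours` ago, by which time
    the reported values have had time to stabilize."""
    end = now.replace(minute=0, second=0, microsecond=0) - timedelta(hours=delay_hours)
    start = end - timedelta(hours=1)
    return start, end
```

For example, at 14:30 the safe window is 01:00 to 02:00 of the same day: the 02:00 readings are 12 hours old and no longer fluctuating.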

Axibase Collector was used to collect the data from the monitoring sensor stations and to stream the data into the Axibase Time Series Database. The Collector is an effective tool for batch downloading of historical data and streaming fresh data as it becomes available.

In Axibase Collector, we set up a job to collect the data from the air monitoring sensor stations in Fresno, California. Fresno was selected for this example because it is considered one of the most polluted cities in the United States, with air quality warnings often issued to the public.

The File Job sets up a cron task that runs at a specified interval to collect the data and batch upload it into ATSD.

collector file job

The File Forwarding Configuration is a parser configuration for the data that is incoming from an external source. The path to the external data source is specified and a default entity is assigned to the Fresno monitoring sensor station. Start time and end time determine the time frame for retrieving new data (end time syntax is used).

collector parser

Once these two configurations are saved, the Collector starts streaming fresh data into ATSD.
You can view the entities and metrics streamed by the Collector into ATSD from the UI.

atsd metrics

The whole data set currently has over 87,000,000 records for each metric, all stored in ATSD.

Generating the Forecasts in R

Our next steps were to analyze the data and generate accurate forecasts. First, we used the built-in Holt-Winters and ARIMA algorithms in ATSD, and then we used custom R language forecasting algorithms for comparison.
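For readers unfamiliar with Holt-Winters, the idea behind triple exponential smoothing can be illustrated with a minimal additive implementation. This is a from-scratch teaching sketch, not ATSD's or R's actual code, and the smoothing parameters are arbitrary:

```python
def holt_winters_additive(series, period, alpha=0.3, beta=0.05, gamma=0.2, horizon=24):
    """Additive Holt-Winters smoothing: maintains a level, a trend, and one
    seasonal component per position in the cycle. Requires at least two
    full periods of data. Returns `horizon` forecast values."""
    # Initialize level from the first season, trend from the change
    # between the first two seasons, and seasonals relative to the level.
    level = sum(series[:period]) / period
    trend = sum(series[period + i] - series[i] for i in range(period)) / period ** 2
    seasonal = [series[i] - level for i in range(period)]

    for t in range(period, len(series)):
        last_level = level
        s = seasonal[t % period]
        level = alpha * (series[t] - s) + (1 - alpha) * (level + trend)
        trend = beta * (level - last_level) + (1 - beta) * trend
        seasonal[t % period] = gamma * (series[t] - level) + (1 - gamma) * s

    n = len(series)
    return [level + (h + 1) * trend + seasonal[(n + h) % period] for h in range(horizon)]
```

On a perfectly periodic series the forecast simply continues the pattern; on real sensor data the smoothing parameters must be tuned, which ATSD's automatic mode does on its own.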

To analyze the data in R, the R language API client was used to retrieve the data and then to save the custom forecasts back into ATSD.

Forecasts were built for all metrics for the period from May 11th to June 1st.

The steps taken to forecast the PM2.5 metric are outlined below.

The Rssa package was used to generate the forecast. This package implements the Singular Spectrum Analysis (SSA) method.

Parameters for the SSA forecast were chosen following published recommendations.

The following steps were executed when building the forecasts:

  • The PM2.5 series was retrieved from ATSD using the query() function; 72 days of data were loaded.

  • The SSA decomposition was built with a window of 24 days (L = 24 × 24 hourly values) and 100 eigentriples:

dec <- ssa(values, L = 24 * 24, neig = 100)
  • Eigenvalues, eigenvectors, pairs of sequential eigenvectors, and the w-correlation matrix of the decomposition were graphed:
plot(dec, type = "values")

singular values r

plot(dec, type = "vectors", idx = 1:20)


plot(dec, type = "paired", idx = 1:20)

pairs eigenvectors r

plot(wcor(dec), idx = 1:100)

w correlation matrix r

A group of eigentriples was then selected to use for forecasting. The plots suggest several options.

Three options (1, 1:23, and 1:35) were tested, because groups 1, 2:23, and 24:35 are separated from the other eigenvectors, as judged from the w-correlation matrix.

The rforecast() function was used to build the forecast:

rforecast(x = dec, groups = 1:35, len = 21 * 24, base = "original")

We ran the tests with vforecast() and bforecast() using different parameters, but rforecast() was determined to be the best option in this case.

Graph of the original series and the three resulting forecasts:

3 forecasts r

We selected the forecast built from eigentriples 1:35 as the most accurate and saved it into ATSD.

To save the forecasts into ATSD, the save_series() function was used.
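For intuition, the core of SSA (embedding the series into a trajectory matrix, taking an SVD, and diagonally averaging each rank-one term back into a series) can be sketched in a few lines. This is an illustrative NumPy sketch of the textbook algorithm, not the Rssa implementation:

```python
import numpy as np

def ssa_decompose(series, L):
    """Basic SSA: returns the elementary components whose sum
    reconstructs the original series exactly."""
    x = np.asarray(series, dtype=float)
    N = len(x)
    K = N - L + 1
    # Trajectory (Hankel) matrix: column j holds x[j : j+L].
    X = np.column_stack([x[j:j + L] for j in range(K)])
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    components = []
    for i in range(len(s)):
        Xi = s[i] * np.outer(U[:, i], Vt[i])  # rank-one elementary matrix
        # Diagonal averaging (Hankelization): average each anti-diagonal
        # of Xi to map the matrix back to a length-N series.
        comp = np.array([Xi[::-1].diagonal(k).mean() for k in range(-L + 1, K)])
        components.append(comp)
    return components
```

Summing all components reconstructs the original series exactly; forecasting, as rforecast() does, additionally fits a linear recurrence to the selected eigentriples and extrapolates it forward.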

Generating the Forecasts in ATSD

The next step was to create a competing forecast in ATSD using the built-in forecasting features. A majority of the settings were left in automatic mode so that the system itself could determine the best parameters when generating the forecast.

atsd forecast settings

Visualizing the Results

To visualize the data and the forecasts, a portal was created using the built-in visualization features.

portal air quality

Thresholds were set for each metric in order to alert the user when either the forecast or the actual data was reaching unhealthy levels of air pollution.

When comparing the R forecasts and the ATSD forecasts to the actual data, the ATSD forecasts turned out to be significantly more accurate in most cases, learning and recognizing patterns and trends with more certainty. As of today, as the actual data is coming in, it is following the ATSD forecast very closely. Any deviations are minimal and fall within the confidence interval.

ATSD forecast vs actual data

Evidently, ATSD’s built-in forecasting produces, in most cases, more accurate results than one of the most advanced R-language forecasting algorithms, which was used as part of this use case. It is therefore possible to rely on ATSD to forecast air pollution a few days or weeks into the future.

Keep track of how these forecasts perform in comparison to the actual data in Axibase Chart Lab.


Alerts and Notifications

A smart alert notification was set up in the Rule Engine to notify the user by email if the pollution levels breached the set threshold or deviated from the ATSD forecast.

Analytical rules were set in the Rule Engine for the PM2.5 metric; alerts are raised if the streaming data satisfies one of the rules below:

– value > 30: Raise an alert if the last metric value exceeds the threshold.
– forecast_deviation(avg()) > 2: Raise an alert if the actual value exceeds the forecast by more than 2 standard deviations (see image below). Smart rules like this capture extreme spikes in air pollution.
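The logic of these two rules can be illustrated with a small sketch (the function and alert names here are hypothetical; in practice the ATSD Rule Engine evaluates the rules server-side):

```python
def check_alerts(value, forecast, stddev, threshold=30.0, max_sigma=2.0):
    """Evaluate the two analytical rules for a PM2.5 reading.
    Returns the names of the alerts that fired."""
    alerts = []
    # Rule 1: last value exceeds the fixed threshold.
    if value > threshold:
        alerts.append("threshold_breach")
    # Rule 2: actual value exceeds the forecast by more than
    # `max_sigma` standard deviations.
    if stddev > 0 and (value - forecast) / stddev > max_sigma:
        alerts.append("forecast_deviation")
    return alerts
```

A reading of 35 µg/m³ against a forecast of 10 with a standard deviation of 5 would trigger both rules; a reading of 12 would trigger neither.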

Standard Deviation

At this point the use case is fully implemented and will function autonomously. ATSD automatically streams the sensor data, generates a new forecast every 24 hours for 3 weeks into the future, and raises alerts if the pollution levels rise above the threshold or if a negative trend is discovered.

Results and Conclusions

The results of this case can be useful for travelers who need an accurate forecast of the environmental and pollution-related issues they may face during a visit. This information can also be helpful to expats moving to a new city or country. Studies have proven that long-term exposure to high levels of PM2.5 can lead to serious health issues.

This research and environmental forecasting is especially valuable in regions like China, where air pollution is seriously affecting the local population and visitors. In cities like Shanghai, Beijing, and Guangzhou, PM2.5 levels are constantly fluctuating from unhealthy to critical levels, yet accurate forecasting is limited. PM2.5 forecasting is essential for travelers and tourists who need to plan their trips during periods of lower pollution levels due to potential health risks of pollution exposure.

Beijing Air Quality

Government agencies can also take advantage of pollution monitoring to plan and issue early warnings. This way, precautions can be taken to prevent exposure to unhealthy levels of PM2.5 pollution. Detecting a trend and raising an alert before PM2.5 levels breach the unhealthy threshold is critical for public safety and health. Reliable air quality data and data analytics allow people to adapt and make informed decisions.

In conclusion, Big Data Analytics is an empowering tool that can put valuable information in the hands of corporations, governments, and individuals. That knowledge can motivate people to effect change. Air pollution is currently affecting the lives of over a billion people across the globe and with current trends the situation is only going to get worse. Often the exact source of air pollution, the way the pollutants are interacting in the air, and the dispersion of pollution cannot be determined. The lack of such information makes air pollution a difficult problem to tackle. However, the advances in modern technologies and the arrival of new Big Data solutions enable us to combine sensor data with meteorological satellite data to perform extensive analytics and forecasting. Big Data analytics will make it possible to pinpoint the pollution source and dispersion trends in advance.

At Axibase, we sincerely believe that Big Data has a large role to play in tackling air pollution. We predict that in the coming years advanced data analytics will be a key factor influencing governments’ decisions and changes in pollution regulations.

Other IoT and sensor-focused use cases will soon be published on the Axibase blog. Visit us frequently or follow us on Twitter to stay updated.

Learn more about the Axibase Time Series Database.


AWS EC2 T2 instances: 700 seconds of fame

There is no shortage of innovation going on in cloud pricing models, ranging from Google’s preemptible virtual machines to pay-by-the-tick code execution services such as AWS Lambda. The overall trend is that pricing gets ever more granular and complex, and this applies both to emerging pay-as-you-go services and to mainstream virtual machines. As IaaS providers get more creative, so do their customers. We’re not far from a time when capacity owners will place and buy back IT capacity just as they manage corporate cash positions today. It’s a world where enterprises will close their daily books not just in dollars but also in petabytes and gigahertz.

A small step in this direction is a new utility pricing model implemented by AWS for their T2 virtual machines (aka EC2 instances). These virtual machines have a fixed maximum capacity measured by the number of CPU cores and a variable burst capacity measured in CPU credit units. CPU credits measure the amount of time that the virtual machine is allowed to run at its maximum capacity. One CPU credit is equal to 1 virtual CPU running at 100% utilization for 1 minute. If the machine has 60 credits, it can utilize 1 vCPU for 1 hour. If the machine’s CPU credit balance is zero, it runs at what Amazon calls “base performance”.

The AWS pricing page provides the following parameters for T2 instances as of May 17, 2015:

Instance type   Initial CPU credit*   CPU credits earned per hour   Base performance (CPU utilization)
t2.micro        30                    6                             10%
t2.small        30                    12                            20%
t2.medium       60                    24                            40%**
* T2 instances run on High Frequency Intel Xeon Processors operating at 2.5 GHz with Turbo up to 3.3 GHz.

The above table means that a t2.small instance can run at full capacity for 12 minutes each hour. If it runs out of CPU credits and still has useful work to do, it will run on a CPU share equivalent to a 500 MHz CPU (2.5 GHz × 20%). Expect delays, because that’s not much. To put things in context, the Samsung Galaxy S6 phone has 4 cores running at 2.1 GHz each. The mobile phone CPU architecture is no match for a Xeon, but I couldn’t resist the chance to reference the specifications.
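The arithmetic above generalizes to any T2 size. A small sketch (the helper names are ours, not an AWS API):

```python
def burst_minutes_per_hour(credits_earned_per_hour, vcpus=1):
    """Minutes per hour the instance can run all vCPUs at 100% without
    depleting its balance: one credit = one vCPU-minute at 100%."""
    return credits_earned_per_hour / vcpus

def base_clock_mhz(clock_ghz, base_utilization):
    """Effective clock speed once CPU credits are exhausted."""
    return clock_ghz * 1000 * base_utilization

# t2.small: 12 credits earned per hour, 1 vCPU, 20% base performance at 2.5 GHz
print(burst_minutes_per_hour(12))   # 12 minutes of full burst per hour
print(base_clock_mhz(2.5, 0.20))    # 500 MHz base performance
```

The same math shows a t2.micro sustains only 6 minutes of burst per hour before falling back to a 250 MHz-equivalent share.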

The charts below illustrate the relationship between CPU credit balance, CPU credit usage, and CPU usage.


CPU usage results in a CPU credit drawdown, which reduces the CPU credit balance. Idle time frees the CPU and earns CPU credits.

The CPU usage in this particular case is generated with the stress package, triggered by cron at the beginning of each hour, and it runs just long enough to maintain the CPU credit balance at a constant level throughout the day. No applications were running on the instance other than the services pre-installed on the Ubuntu 14.04 server distribution.

The CPU load command looks as follows:

3 * * * * stress --cpu 1 --timeout 700

Those of you with a trained eye for detail will have noticed that the stress command loads the CPU for 700 seconds. That is 20 seconds less than the advertised 12 minutes, but we’ll assume this is how much CPU time the machine consumes when it is not being stressed.

So what’s the conclusion on T2 instances after all? One way to approach this is to follow Amazon’s advice that “T2 instances are a good choice for workloads that don’t use the full CPU often or consistently, but occasionally need to burst”. The question is: how do you know whether T2 is a good choice for your particular application? The answer is monitoring and alerting. Unlike other instances, T2 machines expose CPU credit balance and CPU credit usage metrics, which you can view in the Amazon CloudWatch console or query with the AWS CloudWatch API.
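For example, with boto3 the request for the CPUCreditBalance metric can be built roughly as follows. This is a sketch: the instance ID and time range are placeholders, and the resulting dict is what you would pass to `cloudwatch.get_metric_statistics(**params)`:

```python
from datetime import datetime, timedelta

def credit_balance_request(instance_id, hours=6, period_seconds=300):
    """Build GetMetricStatistics parameters for the CPUCreditBalance metric."""
    now = datetime(2015, 5, 17, 12, 0)  # fixed here for illustration; use datetime.utcnow()
    return {
        "Namespace": "AWS/EC2",
        "MetricName": "CPUCreditBalance",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "StartTime": now - timedelta(hours=hours),
        "EndTime": now,
        "Period": period_seconds,          # 5-minute data points
        "Statistics": ["Average"],
    }
```

The companion CPUCreditUsage metric is queried the same way, with only the MetricName changed.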

You can also create alerts on the CPU credit balance that fire during prime hours if the system runs out of credits.

Last but not least, if you’re running more than a few EC2 instances, you can have the Axibase Time Series Database collect CloudWatch metrics for you, so you can make smart capacity planning and forecasting decisions at scale.