## Measuring Cloud Oktas from Outer Space

### This article describes the approach undertaken by data scientists at Axibase to calculate cloud cover using satellite imagery from the Japanese Himawari 8 satellite.

Today, cloud cover is measured by automated weather stations, specifically with ceilometers and sky imagers. A ceilometer is an upward-pointed laser that measures the time required for a laser pulse to reflect off overhead clouds and return to the ground, which determines the height of the cloud base. A sky imager divides the sky into regions and estimates the percentage of cloud in each region. Only clouds located directly overhead are detected, so given how sparsely weather stations are placed today, the amount of cloud cover data is limited. Most modern weather stations can detect and measure clouds up to a ceiling of 7,600 meters. You can learn more about modern automated weather stations and ceilometers on Wikipedia.

Ceilometer:

Sky Imager:

You can learn more about automated weather stations in Australia on the official website of the Bureau of Meteorology Australia.

Cloud cover measurements have many applications and benefits in weather forecasting and solar energy generation. For example, seasonal cloud cover statistics allow tourists to plan their holidays for sunnier weeks and months of the year. This information is also useful to mountain climbers planning their ascent, since they need to choose seasons with less cloud cover, guaranteeing the best possible conditions for their summit attempts. Photovoltaic energy generation hinges heavily on quality cloud cover data. Solar panels are most efficient when there are no clouds, so when building a solar power station it is important to analyze cloud oktas data. Because automated weather stations that measure this metric are distributed sparsely, the data is often not available. Below is a visualization comparing cloud cover with solar power generation for a particular station in Australia. It is readily apparent that the two metrics are interdependent.

This research project is aimed at calculating cloudiness over Australia from satellite images. Our goal is to use a simple method that can effectively determine the cloud cover without employing complex algorithms and machine learning. We compared our results with actual data from ground weather stations. Using data available from the JMA (Japan Meteorological Agency) and the Australian BOM (Bureau of Meteorology) made this research especially interesting and feasible.

### Cloudiness Data

Australian meteorological stations are used as the source of cloudiness data. The list of all meteorological stations is available on the website of the Australian Bureau of Meteorology.

Here is a summary of the available stations:

• 20112 – total number of meteorological stations.
• 7568 – total number of currently functional stations.
• 867 – total number of stations that have available data.
• 778 – total number of stations whose data is loaded into ATSD.
• 394 – total number of stations that measure cloud_oktas.
• 45 – total number of stations that measure cloud oktas at least 4 times per day.

Cloud cover measurements are available from the Australian Bureau of Meteorology Latest Weather Observations portal. Each station’s cloud cover data for the past few days can be retrieved in JSON format using the REST API.
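
To illustrate, a station's JSON payload can be parsed with a few lines of code. The sample below is hypothetical; the field names, such as `cloud_oktas` and `local_date_time_full`, are assumptions modeled on the portal's output, not a verbatim BOM response.

```python
import json

# Hypothetical sample resembling a "Latest Weather Observations" response;
# field names are assumptions, not the official BOM schema.
sample = '''
{
  "observations": {
    "data": [
      {"local_date_time_full": "20150517103000", "cloud_oktas": 7},
      {"local_date_time_full": "20150517100000", "cloud_oktas": null},
      {"local_date_time_full": "20150517093000", "cloud_oktas": 3}
    ]
  }
}
'''

def extract_cloud_oktas(payload):
    """Return (timestamp, oktas) pairs, skipping observations without a reading."""
    rows = json.loads(payload)["observations"]["data"]
    return [(r["local_date_time_full"], r["cloud_oktas"])
            for r in rows if r["cloud_oktas"] is not None]

print(extract_cloud_oktas(sample))
# → [('20150517103000', 7), ('20150517093000', 3)]
```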

As stated on Wikipedia, cloud cover is the fraction of the sky obscured by clouds when observed from a particular location. It is measured in “oktas” (meaning “eighths”): 0, 1/8, 2/8, …, 1. Several methods are used to measure cloud cover and it is not exactly clear which of them is used by the Australian weather stations.

### Satellite Data

We analyzed images from the Japanese geostationary weather satellite Himawari 8. The satellite was launched on October 7, 2014, and became operational on July 7, 2015. It provides high-quality images of the Earth in 16 frequency bands every 10 minutes. You can learn more about Himawari satellites and imaging on the Meteorological Satellite Center of JMA website.

The Japan Meteorological Agency (JMA) processes the satellite images and determines several parameters of the clouds. The results can be viewed on the Meteorological Satellite Center of JMA website.

To determine cloud cover from Himawari images as simply as possible, we analyzed only one of the bands. For this research project, we used images of Australia available from the MSC JMA Real Time Image portal. The server keeps images for the past 24 hours. We decided to use images in infrared band 13, with a wavelength of 10400 nm (10.4 µm).

### Data Flow

We used the Axibase Time Series Database to collect data from the Australian Bureau of Meteorology in JSON format. ATSD comes with the Axibase Collector, which collects data from any remote source and stores it in ATSD. Another benefit of ATSD was the built-in visualization that allowed us to graph our results, giving us a good understanding of our progress.

The images from JMA are loaded in PNG format into R, where they are analyzed. To analyze the images in R, we used the EBImage, oce, and geosphere R packages. The results of the analysis are stored in ATSD.

Once both the actual cloud oktas and the calculated cloud oktas were loaded into ATSD, visualization portals were built to compare the collected and computed metrics.

### Cloudiness Detection From Images

Using the geographical coordinates of each station, we determined each station’s location on the satellite images. The images are somewhat distorted near the border of Australia and on the lines of the coordinate grid. This distortion comes in the form of overlays added on top of the images, the green Australian border and white grid. Therefore, we only studied stations that are far from the distorted areas.

We used a simple method to detect clouds. Since clouds are cooler than the earth’s surface, they are rendered white on the infrared satellite images, while the earth’s surface is rendered black. Therefore, the brightness of the pixels in the images reflects cloudiness. Hence, we calculated cloud cover for a given meteorological station as the average pixel brightness over a 3 × 3 square of pixels centered on the station.

Since one pixel on the image, depending on the location, covers an area between 5.5 × 3.9 and 5.5 × 5.6 square kilometers, the 3 × 3 square means that we analyzed an area of about 230 square kilometers to determine cloudiness at each station.

Here is the key line from the R script used to calculate the cloudiness_himawari_b13 metric from the satellite images:
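
The original R line (the script used the EBImage package) is not reproduced here; as an illustrative sketch only, the same calculation, average brightness over the 3 × 3 pixel square, could look like this in Python, assuming the image has been loaded as a 2-D array of brightness values in [0, 1] with the station at `(row, col)`:

```python
def cloudiness(image, row, col):
    """Average brightness over the 3 x 3 pixel square centered on the station."""
    pixels = [image[r][c]
              for r in range(row - 1, row + 2)
              for c in range(col - 1, col + 2)]
    return sum(pixels) / len(pixels)

# toy 3x3 image: one bright (cloudy) corner, clear elsewhere
img = [[1.0, 1.0, 0.0],
       [1.0, 0.0, 0.0],
       [0.0, 0.0, 0.0]]
print(cloudiness(img, 1, 1))  # average of 3 bright pixels out of 9 (= 1/3)
```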

To store the results in ATSD, the following code is used:
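
The original snippet is likewise not reproduced. For illustration, ATSD accepts plain-text series commands over its network API; a minimal Python sketch of composing such a command might look as follows (the entity name, hostname, and port are placeholder assumptions):

```python
def series_command(entity, metric, value, epoch_ms):
    """Format an ATSD network API 'series' command (one data point per line)."""
    return "series e:{} m:{}={} ms:{}".format(entity, metric, value, epoch_ms)

cmd = series_command("station_oakey", "cloudiness_himawari_b13", 0.42, 1431856800000)
print(cmd)

# The command would then be written to the ATSD TCP port, for example:
#   import socket
#   with socket.create_connection(("atsd.example.org", 8081)) as s:
#       s.sendall((cmd + "\n").encode())
```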

Once we stored the calculation results in ATSD, we were able to view graphs of actual and calculated cloud cover in Axibase Chart Lab. For example, here is a Chart Lab portal for the weather station near the town of Oakey:

In this Chart Lab portal the cloudiness_himawari_b13 metric is the calculated cloud cover, and the cloud_oktas metric is cloud cover measured by the station.

We selected stations that measure cloud cover at an average frequency of at least once every 4 hours. The table below displays the Pearson correlation coefficient between actual and calculated cloud cover for the selected stations:

NOTE: the Count column indicates the average number of cloud cover measurements per day.
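
For reference, the Pearson correlation coefficient used in the table can be computed directly. A minimal sketch in Python (the sample values below are made up and perfectly linear, so the coefficient comes out as exactly 1):

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

actual     = [0, 1, 4, 7, 8, 6, 2, 1]                    # oktas from the station
calculated = [0.1, 0.2, 0.5, 0.8, 0.9, 0.7, 0.3, 0.2]    # brightness-based estimate
print(round(pearson(actual, calculated), 4))  # prints 1.0 (toy data is perfectly linear)
```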

In conclusion, the selected approach does not appear to work particularly well: there is little correlation between the computed and actual values.

Interestingly, the cloudiness_himawari_b13 series has a daily cycle: its values are lower during the day than at night. We can clearly see this trend if we compare the series with the height of the sun above the horizon (sun altitude). Sun altitude is calculated using a script created by Vladimir Agafonkin.

The results clearly show that the correlation during daytime hours is somewhat higher.

We tried to remove the diurnal cycle by subtracting the average of values of the last n days. Here are the results with the diurnal cycle removed:
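
A minimal sketch of this de-trending step in Python (the original computation was done in R; the sampling layout and the choice of n below are illustrative assumptions):

```python
def remove_diurnal_cycle(series, per_day, n):
    """Subtract from each value the mean of the values observed at the same
    time of day over the previous n days; returns None until enough history."""
    out = []
    for i, v in enumerate(series):
        history = [series[i - k * per_day]
                   for k in range(1, n + 1) if i - k * per_day >= 0]
        if len(history) < n:
            out.append(None)                 # not enough history yet
        else:
            out.append(v - sum(history) / n)
    return out

# toy series: 4 samples/day, a constant daily pattern plus a spike on day 3
series = [1, 2, 3, 2] * 2 + [1, 2, 9, 2]
print(remove_diurnal_cycle(series, per_day=4, n=2))
# the day-3 spike (9) stands out as +6.0 once the cycle is removed
```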

### Improving the Correlation

We attempted to improve the correlation by adjusting the method for determining the cloudiness from an image.

1. We averaged out the brightness over different areas of the images.
2. We took a weighted average of brightness over large areas of the image, giving less weight to pixels more distant from the center of the area. To calculate the weights, we used geometrical principles, as described below. We also took into account the height of the cloud base directly over a station. Meteorological stations measure cloud height, and this data is available from the Australian Bureau of Meteorology. Intuitively, the lower the cloud base, the greater its contribution to cloud cover.

To be more concrete, here are example computations for 8 meteorological stations. We chose them because they have enough cloud cover measurements and are far from the overlaid white lines on the images. For each of the stations we selected 6 disks (they are not perfectly round) with radii of 0, 3, 5, 10, 20, and 30 pixels:

The averaged brightness over each disk is an estimate of the cloud cover. So we have 6 estimates, and their correlations with the actual values of cloud cover are saved in columns avg0, …, avg30 in the following tables:

Averaged Brightness:

Weighted Average Brightness:

Columns wavg0, …, wavg30 display correlations between cloud cover and the weighted average brightness for the given disks. To compute the weight (w) of each pixel, we use the following equation:

$w = \frac{h}{(d^2 + h^2)^{3/2}} = \frac{h}{l^3}.$

In this equation, d is the distance between the center of the pixel and the given station (in meters) and h is the cloud height (in meters). So distant and high clouds have lower weights.

Here is a code snippet from the R script that we used to calculate the weighted average of brightness for given disks:
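
The original R snippet is not shown here. As a sketch only, the same weighting scheme could be expressed in Python as follows (the brightness/distance pairs and the cloud height are illustrative values, not the article's data):

```python
def weighted_average_brightness(pixels, h):
    """pixels: list of (brightness, distance_to_station_m) pairs inside the disk;
    h: cloud-base height above the station in meters (assumed uniform)."""
    # w = h / (d^2 + h^2)^(3/2), as in the formula above
    weights = [h / (d * d + h * h) ** 1.5 for _, d in pixels]
    total = sum(weights)
    return sum(w * b for w, (b, _) in zip(weights, pixels)) / total

# toy disk: a bright cloud overhead, darker pixels 5 km away, cloud base at 2000 m
pixels = [(0.9, 0.0), (0.2, 5000.0), (0.2, 5000.0)]
print(round(weighted_average_brightness(pixels, 2000.0), 4))
```

Note how the overhead pixel dominates: the result is far above the simple average of the three brightness values, because distant pixels receive much smaller weights.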

Unfortunately, we know only the height of the clouds directly above each station, so we used this height when calculating the weighted average for all pixels, assuming that the cloud height is the same everywhere around the station.

To explain the reasoning behind this formula, let’s assume that clouds are flat (this assumption is incorrect, but we will use it anyway). A flat cloud with an area of A square meters floats h meters above the ground, d meters away from the station. As a result, the “solid angle” of the cloud, as seen from the station, is approximately equal to wA:

Legend:

• A – the cloud area.
• h – the height of the cloud.
• d – distance from the station to the center of the cloud’s projection on the ground.
• l – distance between the station and the center of the cloud.
• s – the area of the cloud’s image on the unit hemisphere. This tells us how big the cloud appears to the station.

$s = \int_A w\ d\sigma \approx w\cdot A$, where $w = \frac{h}{(d^2 + h^2)^{3/2}} = \frac{h}{l^3}.$

The measure of a “solid angle” is equal to the area of the intersection of this angle and the unit sphere centered at the angle’s vertex. So the contribution of a pixel to cloud cover is proportional to the w coefficient, because all pixels in an image cover nearly the same ground area A. The exact statement is that the integral of w over the area A equals the “solid angle.”

The results show that none of the methods used to improve the correlation led to a significant increase.

### Changing the Original Logic – Improved Correlation

To improve the correlation, we decided to change the original logic of how cloud oktas are determined from satellite images. Rather than using black as the earth’s color, we determined, for each station, the shade of grey that the cloud-free earth reaches during the day and used it as the baseline of our calculations. This logic was used because the temperature of the earth’s surface changes during the day, meaning that its brightness in the infrared satellite images changes as well. The brightest cloud-free point occurs at night, when the earth is at its coolest. All darker shades are considered cloudless, and lighter shades are considered cloudy.

This approach led to an improvement in correlations. It also decreased the diurnal cycle.

Here are the calculations and results for the town of Oakey:

The lower brightness threshold = 0.2916667; all values below the lower threshold are zeroed out.
The upper brightness threshold = 0.4583333; all values are scaled (divided by the upper brightness threshold), and values greater than 1 are set equal to 1.
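
These steps can be sketched as a small function (the threshold values are the ones quoted above for Oakey):

```python
LOWER = 0.2916667   # brightness of the warmest (cloud-free) surface at this station
UPPER = 0.4583333   # brightness treated as full cloud cover

def rescale(brightness):
    """Map raw pixel brightness to [0, 1] cloudiness using the Oakey thresholds."""
    if brightness < LOWER:
        return 0.0                       # darker than cloud-free ground: no cloud
    return min(brightness / UPPER, 1.0)  # scale, then clip at full cover

print([rescale(b) for b in (0.1, 0.3, 0.46)])
```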

### Conclusion

When comparing the calculated cloud oktas with solar power generation for a particular station, it is clear that this algorithm can be used as a basis of forecasting and power generation planning.

The above Chart Lab portal compares the power generation of a solar power station near one of the automated weather stations in the city of Griffith, for which we calculated cloud cover. The solar power station is 3 kilometers away from the automated weather station. From these results it is clear that increases in calculated cloud cover lead to decreases in solar power generation and vice-versa. This means that there is a correlation between the calculated cloud oktas and solar power generation.

If we compare the improved correlation results with solar power generation for the same station, their interdependency is even more apparent. There is a strong correlation between the improved calculated cloud cover and solar power generation.

The results of this research project lead us to believe that our algorithm can be used as a way to calculate cloud oktas with relative accuracy. The calculated cloud cover accuracy is high enough that it can be used to forecast and plan solar energy production. This conclusion is especially true for areas that are not covered by BOM meteorological weather stations, where there is no other real source of cloud cover data.

## Tracking the popularity of programming languages on Stack Exchange

#### Axibase has been tracking the popularity of programming languages on Stack Overflow.

The Stack Exchange API V2.2 allows tracking of the number of questions asked and answered for a particular topic using tags.

The method used to track programming language questions is: /tags/{tags}/info

This method returns the total number of questions asked about a programming language (tag). From the data, we can determine the number of new questions asked, the popularity of each language, and its popularity growth rate.

The retrieved data can also be found in the Stack Overflow UI when searching for each language (tag):

Request example:

https://api.stackexchange.com/2.2/tags/java/info?site=stackoverflow

Response:

{
  "items": [
    {
      "has_synonyms": true,
      "is_moderator_only": false,
      "is_required": false,
      "count": 880312,
      "name": "java"
    }
  ],
  "has_more": false,
  "quota_max": 300,
  "quota_remaining": 299
}
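
The response can be parsed with a few lines of code; for example, in Python:

```python
import json

# the response body returned by the /tags/java/info request
response = '''{ "items": [ { "has_synonyms": true, "is_moderator_only": false,
  "is_required": false, "count": 880312, "name": "java" } ],
  "has_more": false, "quota_max": 300, "quota_remaining": 299 }'''

data = json.loads(response)
tag = data["items"][0]
print(tag["name"], tag["count"])  # prints: java 880312
```

Storing the `count` value on every poll makes it trivial to derive the number of new questions per day as the difference between consecutive samples.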

We are using Stack Exchange API to collect hourly data for the following programming languages, using their respective tags:

• Java
• Go
• Scala
• Python
• Ruby
• Javascript
• PHP
• Ruby-on-rails
• SQL
• R
• Matlab

Axibase Collector was used to collect the data. A JSON job was created in Axibase Collector for each of the tags (programming languages), to collect the data hourly, and store it in the Axibase Time Series Database for analytics and visualization.

Using the collected data, we constructed a visualization portal in ATSD.

When viewing the portal, we immediately noticed that at some point each day there is a negative change in question count. After some investigation, we concluded that this negative change is due to the cleanup of invalid or closed questions.

The total number of questions asked shows that there are three programming languages far ahead of others in terms of popularity: Java, Javascript, and PHP.

This data is useful for many companies, programmers, and students. For example, at Axibase, we use the gathered information to help determine which client libraries to develop first.

The results also show which programming languages are up-and-coming, which can be useful to students and programmers deciding what language specialization to pursue in their careers or studies.

IDE vendors, like JetBrains or Eclipse, can find such data useful to see what new languages are gaining popularity and consider them for future products.

Authors covering programming languages may find this information helpful when deciding which new languages to feature in their works.

Visit Axibase to learn about the Axibase Time Series Database, Data Visualization, Data Forecasting, and the Internet of Things.

## Environmental Monitoring Using Big Data

The Axibase Time Series Database enables users to instrument their information assets, systems, and sensors, and provides visibility, control, and operational intelligence. The primary use cases for ATSD are system performance monitoring, sensor data storage and analytics, and IoT (Internet-of-Things) data storage and analytics.

ATSD excels when it comes to working with IoT and sensor-focused use cases.

A typical use case covers the collection and storage of sensor data, execution of data analytics at scale, generation of forecasts, creation of visualization portals, and automatic raising of alerts in the case of abnormal deviations or threshold breaches. ATSD was purposefully designed to handle complex projects like these from start to finish.

This article will focus on an implemented use case: monitoring and analyzing air quality sensor data using the Axibase Time Series Database.

Steps taken to execute the use case:
– Collect historical data from AirNow into ATSD
– Stream current data from AirNow into ATSD
– Use the R Language to generate forecasts for all collected entities and metrics
– Create Holt-Winters forecasts in ATSD for all collected entities and metrics
– Build visualization portal in ATSD

### The Data

We used the data from more than 2,000 monitoring sensor stations located in over 300 cities across the United States. These stations generate hourly readings of several key air quality metrics. We retrieved both the historical and the streaming data and stored it in ATSD.

The data was sourced from AirNow, which is a U.S. government EPA program that protects public health by providing forecast and real-time air quality information.

The two main collected metrics were PM2.5 and Ozone (o3).

PM2.5 is particulate matter consisting of particles less than 2.5 micrometers in diameter, often called “fine” particles. These particles are so small that they can be detected only with an electron microscope. Sources of fine particles include all types of combustion, such as motor vehicles, power plants, residential wood burning, forest fires, agricultural burning, and industrial processes.

o3 (Ozone) occurs naturally in the Earth’s upper atmosphere, 6 to 30 miles above the Earth’s surface. There ozone forms a protective layer that shields us from the Sun’s harmful ultraviolet rays. Man-made chemicals are known to destroy this beneficial ozone layer.

Other collected metrics are: PM10 (particulate matter up to 10 micrometers in size), CO (carbon monoxide), NO2 (nitrogen dioxide), and SO2 (sulfur dioxide).

### Collecting/Streaming the Data

For this use case, we collected, stored, analyzed, and accurately forecast a total of 5 years of historical data. In order for the forecasts to have maximum accuracy and to account for trends and seasonal cycles, we recommend accruing at least 3 to 5 years of detailed historical data.

We immediately identified an issue with the accuracy of the data: it was becoming available with a fluctuating time delay of 1 to 3 hours. We conducted an analysis by collecting all values for each metric and entity, which resulted in several data points being recorded for the same metric, entity, and time. This led us to believe that there was both a time delay and a stabilization period. The results are presented below:

Once available, the data took another 3 to 12 hours to stabilize. This means that the values were fluctuating during that time frame for most data points.

As a result of this analysis, we decided to collect all data with a 12-hour delay in order to increase the accuracy of the data and the forecasts.

Axibase Collector was used to collect the data from the monitoring sensor stations and to stream the data into the Axibase Time Series Database. The Collector is an effective tool for batch downloading of historical data and streaming fresh data as it becomes available.

In Axibase Collector, we set up a job to collect the data from the air monitoring sensor stations in Fresno, California. For this particular example, Fresno was selected because it is considered to be one of the most polluted cities in the United States, with air quality warnings being often issued to the public.

The File Job sets up a cron task that runs at a specified interval to collect the data and batch upload it into ATSD.

The File Forwarding Configuration is a parser configuration for the data that is incoming from an external source. The path to the external data source is specified and a default entity is assigned to the Fresno monitoring sensor station. Start time and end time determine the time frame for retrieving new data (end time syntax is used).

Once these two configurations are saved, the Collector starts streaming fresh data into ATSD.
You can view the entities and metrics streamed by the Collector into ATSD from the UI.

The whole data set currently has over 87,000,000 records for each metric, all stored in ATSD.

### Generating the Forecasts in R

Our next steps were analyzing the data and generating accurate forecasts. First, we used the built-in Holt-Winters and ARIMA algorithms in ATSD, and then used custom R language forecasting algorithms for comparison.

To analyze the data in R, the R language API client was used to retrieve the data and then to save the custom forecasts back into ATSD.

Forecasts were built for all metrics for the period from May 11th to June 1st.

The steps taken to forecast the PM2.5 metric are outlined below.

The Rssa package was used to generate the forecast. This package implements the Singular Spectrum Analysis (SSA) method.

We used recommendations from the following sources to choose parameters for SSA forecasting:

The following steps were executed when building the forecasts:

• PM2.5 series were retrieved from ATSD using the query() function. 72 days of data were loaded from ATSD.

• SSA decomposition was built with a window of 24 days and 100 eigen triples:

dec <- ssa(values, L = 24 * 24, neig = 100)

• Eigen values, eigen vectors, pairs of sequential eigen vectors, and the w-correlation matrix of the decomposition were graphed:

plot(dec, type = "values")

plot(dec, type = "vectors", idx = 1:20)

plot(dec, type = "paired", idx = 1:20)

plot(wcor(dec), idx = 1:100)

A group of eigen triples was then selected to use for forecasting. The plots suggest several options.

Three different options – 1, 1:23, and 1:35 – were tested because groups 1, 2:23, and 24:35 are separated from other eigen vectors, as judged from the w-correlation matrix.

The rforecast() function was used to build the forecast:

rforecast(x = dec, groups = 1:35, len = 21 * 24, base = "original")

We ran the tests with vforecast() and bforecast() using different parameters, but rforecast() was determined to be the best option in this case.

Graph of the original series and three resulting forecasts:

We selected the forecast with eigen triples 1:35 as the most accurate and saved it into ATSD.

• To save forecasts into ATSD, the save_series() function was used.

### Generating the Forecasts in ATSD

The next step was to create a competing forecast in ATSD using the built-in forecasting features. A majority of the settings were left in automatic mode so that the system itself could determine the best parameters when generating the forecast.

### Visualizing the Results

To visualize the data and the forecasts, a portal was created using the built-in visualization features.

Thresholds were set for each metric in order to alert the user when either the forecast or the actual data was reaching unhealthy levels of air pollution.

When comparing the R forecasts and the ATSD forecasts to the actual data, the ATSD forecasts turned out to be significantly more accurate in most cases, learning and recognizing patterns and trends with more certainty. As of today, as the actual data is coming in, it is following the ATSD forecast very closely. Any deviations are minimal and fall within the confidence interval.

Evidently, ATSD’s built-in forecasting in most cases produces more accurate results than one of the most advanced R-language forecasting algorithms used as part of this use case. It is possible to rely on ATSD to forecast air pollution for a few days or weeks into the future.

A smart alert notification was set up in the Rule Engine to notify the user by email if the pollution levels breached the set threshold or deviated from the ATSD forecast.

Analytical rules set in the Rule Engine for the PM2.5 metric: alerts will be raised if the streaming data satisfies one of the rules below.

| Rule | Description |
| --- | --- |
| value > 30 | Raise an alert if the last metric value exceeds the threshold. |
| forecast_deviation(avg()) > 2 | Raise an alert if the actual value exceeds the forecast by more than 2 standard deviations (see image below). Smart rules capture extreme spikes in air pollution. |

At this point the use case is fully implemented and will function autonomously. ATSD automatically streams the sensor data, generates a new forecast every 24 hours for 3 weeks into the future, and raises alerts if the pollution levels rise above the threshold or if a negative trend is discovered.

### Results and Conclusions

The results of this case can be useful for travelers who need an accurate forecast of environmental and pollution-related issues that they may face during their visit. This information can also be helpful to expats moving to a new city or country. Studies have proven that long-term exposure to high levels of PM2.5 can lead to serious health issues.

This research and environmental forecasting is especially valuable in regions like China, where air pollution is seriously affecting the local population and visitors. In cities like Shanghai, Beijing, and Guangzhou, PM2.5 levels are constantly fluctuating from unhealthy to critical levels, yet accurate forecasting is limited. PM2.5 forecasting is essential for travelers and tourists who need to plan their trips during periods of lower pollution levels due to potential health risks of pollution exposure.

Government agencies can also take advantage of pollution monitoring to plan and issue early warnings. This way, precautions can be taken to prevent exposure to unhealthy levels of PM2.5 pollution. Detecting a trend and raising an alert before PM2.5 levels breach the unhealthy threshold is critical for public safety and health. Reliable air quality data and data analytics allow people to adapt and make informed decisions.

In conclusion, Big Data Analytics is an empowering tool that can put valuable information in the hands of corporations, governments, and individuals. That knowledge can motivate people to effect change. Air pollution is currently affecting the lives of over a billion people across the globe and with current trends the situation is only going to get worse. Often the exact source of air pollution, the way the pollutants are interacting in the air, and the dispersion of pollution cannot be determined. The lack of such information makes air pollution a difficult problem to tackle. However, the advances in modern technologies and the arrival of new Big Data solutions enable us to combine sensor data with meteorological satellite data to perform extensive analytics and forecasting. Big Data analytics will make it possible to pinpoint the pollution source and dispersion trends in advance.

At Axibase, we sincerely believe that Big Data has a large role to play in tackling air pollution. We predict that in the coming years advanced data analytics will be a key factor influencing governments’ decisions and changes in pollution regulations.

Other IoT and sensor-focused use cases will soon be published on the Axibase blog. Visit us frequently or follow us on Twitter to stay updated.

## AWS EC2 T2 instances: 700 seconds of fame

There is no shortage of innovation in cloud pricing models, ranging from Google’s preemptible virtual machines to pay-by-the-tick code execution services such as AWS Lambda. The overall trend is that pricing gets ever more granular and complex, and this applies both to emerging pay-as-you-go services and to mainstream virtual machines. As IaaS providers get more creative, so do their customers. We’re not far from a time when capacity owners will place and buy back IT capacity just like they manage corporate cash positions today: a world where enterprises close daily books not just in dollars but also in petabytes and GHz.

A small step in this direction is the new utility pricing model implemented by AWS for their T2 virtual machines (aka EC2 instances). These virtual machines have a fixed maximum capacity measured by the number of CPU cores and a variable burst capacity measured in CPU credit units. CPU credits measure the amount of time that the virtual machine is allowed to run at its maximum capacity. One CPU credit is equal to 1 virtual CPU running at 100% utilization for 1 minute. If the machine has 60 credits, it can utilize 1 vCPU for 1 hour. If the machine’s CPU credit balance is zero, it runs at what Amazon calls “base performance”.

The AWS pricing page provides the following parameters for T2 instances as of May 17, 2015:

| Instance type | Initial CPU credit* | CPU credits earned per hour | Base performance (CPU utilization) |
| --- | --- | --- | --- |
| t2.micro | 30 | 6 | 10% |
| t2.small | 30 | 12 | 20% |
| t2.medium | 60 | 24 | 40%** |
• T2 instances run on High Frequency Intel Xeon Processors operating at 2.5GHz with Turbo up to 3.3GHz.

The above table means that a t2.small instance can run for 12 minutes each hour at full capacity. If it happens to run out of CPU credits and still has some useful work to do, it will run on a CPU share equivalent to a 500 MHz CPU (2.5 GHz * 20%). Expect delays, because that’s not much. To put things in context, the Samsung Galaxy S6 phone has 4 cores with a 2.1 GHz clock speed each. The mobile phone CPU architecture is no match for a Xeon, but I couldn’t resist the chance to reference the specifications.

The charts below illustrate the relationship between CPU credit balance, CPU credit usage, and CPU usage.

CPU usage results in a CPU credit drawdown, which reduces the CPU credit balance. Idle time frees the CPU and earns CPU credits.
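
This credit accounting is easy to sketch. The following Python fragment simulates the hourly balance of a t2.small under a fixed load; the balance cap of 288 credits (24 hours of earnings) is an assumption here, and AWS documents the exact cap per instance type:

```python
def simulate_balance(initial, earn_per_hour, usage_minutes_per_hour, hours, cap):
    """Track the CPU credit balance hour by hour: 1 credit = 1 vCPU-minute at 100%."""
    balance = initial
    history = []
    for _ in range(hours):
        balance = min(balance + earn_per_hour - usage_minutes_per_hour, cap)
        balance = max(balance, 0.0)   # the balance cannot go negative
        history.append(balance)
    return history

# t2.small earns 12 credits/hour; burning ~11.7 vCPU-minutes each hour keeps
# the balance roughly flat, matching the chart above
print(simulate_balance(initial=30, earn_per_hour=12,
                       usage_minutes_per_hour=11.7, hours=5, cap=288))
```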

The CPU usage in this particular case is generated with the stress package, triggered by cron at the beginning of each hour, and it runs just long enough to maintain the CPU credit balance at a constant level throughout the day. There were no applications running on the instance other than the services pre-installed with the Ubuntu 14.04 server distribution.

The CPU load command looks as follows:

3 * * * * stress --cpu 1 --timeout 700

Those of you with a trained eye for detail will have noticed that the stress command loads the CPU for 700 seconds. That’s 20 seconds less than the advertised 12 minutes, but we’ll assume that this is how much CPU time the machine consumes when it’s not CPU stressed.

So what’s the conclusion on T2 instances after all? One way to approach this is to follow Amazon’s advice that “T2 instances are a good choice for workloads that don’t use the full CPU often or consistently, but occasionally need to burst”. The question is how you know whether T2 is a good choice for your particular application. The solution is monitoring and alerting. Unlike other instances, T2 machines expose CPU credit balance and CPU credit usage metrics, which you can view in the Amazon CloudWatch console or query with the AWS CloudWatch API.

You can also create alerts on the CPU credit balance that fire during prime hours if the system runs out of credits.

Last but not least, if you're running more than a few EC2 instances, you can have Axibase Time-Series Database collect CloudWatch metrics for you, so you can make smart capacity planning and forecasting decisions at scale.

Enjoy!

## Combining periodic and sliding windows

The range of techniques to analyze trends and minimize noise in time-series is quite extensive.

### Fixed Period Aggregation

One of the most common approaches is to regularize the time-series by applying a grouping function to observations made within each fixed-duration period. This transformation is called aggregation, and we can apply it to raw series to calculate, for example, hourly averages from irregular samples. Each period in this example would start at exactly 0 minutes and 0 seconds of each hour and last 60 minutes, including all values that occurred between HH:00:00.000 and HH:59:59.999. The most commonly used grouping functions include:

• Sum
• Minimum
• Maximum
• Median
• Average / Mean
• Percentile (0 to 100%)
• Standard Deviation
• Variance
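Fixed-period aggregation can be sketched in a few lines of Python: truncate each timestamp to the start of its period, group the values, and apply the grouping function. The sample data here is hypothetical.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical irregular samples: (timestamp, value)
samples = [
    (datetime(2016, 6, 1, 9, 4, 12), 10.0),
    (datetime(2016, 6, 1, 9, 41, 3), 20.0),
    (datetime(2016, 6, 1, 10, 2, 55), 30.0),
]

# Truncate each timestamp to the start of its hour and group values
buckets = defaultdict(list)
for ts, value in samples:
    period_start = ts.replace(minute=0, second=0, microsecond=0)
    buckets[period_start].append(value)

# Apply the grouping function (here, the average) to each fixed period
hourly_avg = {period: sum(v) / len(v) for period, v in buckets.items()}
```

Swapping `sum(v) / len(v)` for `min(v)`, `max(v)`, or a percentile yields the other aggregators from the list above.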

In relational databases, aggregation for standard periods such as 1 minute, 1 hour, 1 day, 1 month, or 1 year can be easily computed with a GROUP BY clause by formatting the timestamp with a truncated datetime pattern. Any other period is more difficult to implement, and the query often involves database-specific syntax and nested queries.

SELECT server, AVG(cpu_busy), TO_CHAR(sample_time, 'YYYY-MM-DD HH24')
  FROM metrics_os_intraday
 GROUP BY server, TO_CHAR(sample_time, 'YYYY-MM-DD HH24')

Non-relational time-series databases, on the other hand, are built with support for custom aggregation periods and allow the user to easily specify any period. In ATSD the period is specified in the interval = [count] [unit] format, for example: interval = 15 minute. The aggregation period can also be customized interactively using the aggregation controls in the time-series chart.

List of aggregators supported in Axibase Time-Series Database:

• COUNT
• MIN
• MAX
• AVG
• SUM
• PERCENTILE
• STANDARD_DEVIATION
• FIRST
• LAST
• DELTA
• WAVG
• WTAVG
• THRESHOLD_COUNT
• THRESHOLD_DURATION
• THRESHOLD_PERCENT

### Sliding Window Aggregation

Sliding window aggregation is closely related to the moving average, another widely used method to smooth the effects of individual observations and reveal the trends behind raw data. Such an average is computed over the last N samples or over samples recorded during the last N minutes.
In both cases, the calculation relies on the concept of a count-based or time-based sliding window whose boundaries are continuously adjusted as we progress along the timeline.

| Window Type | Example | Description |
|-------------|---------|-------------|
| count | average(100) | Average value of the last 100 samples. |
| time | average('15 minute') | Average value of all samples collected during the last 15 minutes. |
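The two window types can be illustrated with a minimal Python sketch (not ATSD's implementation): a count-based window keeps the last N samples, while a time-based window keeps all samples within a fixed duration of the current sample.

```python
from collections import deque
from datetime import datetime, timedelta

def count_window_avg(values, n):
    """Moving average over the last n samples (count-based window)."""
    window, out = deque(maxlen=n), []
    for v in values:
        window.append(v)          # deque drops the oldest sample at size n
        out.append(sum(window) / len(window))
    return out

def time_window_avg(samples, width):
    """Moving average over (timestamp, value) pairs within `width`
    of each sample (time-based window)."""
    window, out = deque(), []
    for ts, v in samples:
        window.append((ts, v))
        # Evict samples that fell out of the trailing time window
        while window[0][0] < ts - width:
            window.popleft()
        out.append(sum(x for _, x in window) / len(window))
    return out
```

Note that the output keeps one value per input sample, a point that matters in the Combined Aggregation discussion below.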

There are other types of moving averages that offer finer control over smoothing through linear, geometric, or exponentially decreasing weights (see Wikipedia), but we won't go into their details in this article. Likewise, it is possible to use any grouping function, such as percentile(95%), instead of the average.

In terms of visualization, moving averages are often displayed alongside raw values and the chart may include multiple moving averages for different time intervals. This is especially common in technical analysis used in finance and econometrics.

The product of sliding window aggregation is not the same as periodic aggregation, however. A moving average series contains the same timestamps as the underlying raw series, which means the resulting series can be irregular and may contain an arbitrarily large number of samples.

### Combined Aggregation

What we're interested in are scenarios where it's beneficial to combine periodic and sliding aggregates in one representation.

Consider the case of CPU utilization, where we need to display hourly CPU averages over the last 24 hours as well as values for the current hour, since the data is streaming in continuously. We noticed that if we simply compute periodic aggregations, the average for the last (current) hour becomes quite volatile at the beginning of each hour, because the grouping function is computed over only the first few samples. As a result, end-users would be falsely alarmed by sudden changes in monitored metrics at the start of the hour.

Consider the above example. Notice how the average for the last hour spikes from 5% to 23% and back within a matter of minutes. This could raise false-positive alerts, particularly if the underlying metric is collected at high frequency and exhibits significant variance.

The solution we came up with was to implement a 'moving-average' setting which controls how the aggregate for the most recent period is calculated. When moving-average is enabled, the last period is computed as a sliding window: the end of the window equals the last sample time, and the length of the window equals the aggregation period. This smooths the aggregate values displayed at the beginning of each period.
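The idea can be sketched in Python; this is a simplified illustration of the behavior described above, not ATSD's actual implementation, and the function name is hypothetical:

```python
from collections import defaultdict
from datetime import datetime, timedelta

def combined_aggregate(samples, period=timedelta(hours=1)):
    """Periodic averages for completed periods, plus a sliding-window
    average for the current period, ending at the most recent sample."""
    samples = sorted(samples)
    last_ts = samples[-1][0]
    current_start = last_ts.replace(minute=0, second=0, microsecond=0)

    # Ordinary periodic aggregation for all completed periods
    buckets = defaultdict(list)
    for ts, v in samples:
        buckets[ts.replace(minute=0, second=0, microsecond=0)].append(v)
    result = {p: sum(v) / len(v) for p, v in buckets.items()
              if p < current_start}

    # Current period: a sliding window of length `period`
    # ending at the last sample time, so early-hour averages
    # still cover a full period's worth of data
    window = [v for ts, v in samples if ts > last_ts - period]
    result[current_start] = sum(window) / len(window)
    return result
```

With a raw series of 10, 20, 40 during hour 9 and a single sample of 100 just after 10:00, plain periodic aggregation would report 100 for the current hour, while the sliding window blends the new sample with the trailing hour's data and avoids the spike.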

This setting has proven to be quite useful and it’s now enabled by default.

See a detailed example in our Chart Lab.