EC2 Monitoring: The Case of the Stolen CPU

What do you do when the top command shows that the CPU is only 40% busy, but CloudWatch says the server is maxed out at 100%? The answer is simple: CloudWatch is correct and top is not. This raises the question of how to measure the performance of virtual machines if you can no longer take operating system statistics at face value. How do you define thresholds, raise alerts, and create management reports if the underlying data appears to be misleading?

CPU Usage Displayed by Top

CPU Usage Reported by CloudWatch

CPU Usage Reported by Tivoli OS Agent

If you are an IBM customer with a pSeries frame, these questions are not entirely new to you. When IBM introduced shared pools and micro-partitioning back in 2004, it radically changed how CPU usage is monitored in the AIX part of the world. Since CPU capacity is allocated to a logical partition dynamically, the traditional CPU breakdown into system, user, and I/O wait has become irrelevant for capacity planning. What matters is the CPU consumption in processor units, as well as the ratio of CPU units consumed to CPU units allocated. On AIX this ratio can exceed 100%, a scalability-on-demand feature that Amazon customers cannot enjoy as of this writing.

The Xen hypervisor powering the Amazon EC2 infrastructure has made great progress in adding flexibility to resource allocation, but it is still years behind the IBM POWER hypervisor in terms of granularity. Nevertheless, the initiated observer and the aspiring cloud guru still have some options for correlating OS and hypervisor metrics. For example, you may notice that the top output contains an additional metric called stolen CPU (st for short).

Stolen CPU Displayed by Top

The metric is exposed by the Xen hypervisor and, in the above example, it is equal to 56.9%. Stolen CPU is the share of CPU time that was reclaimed by the hypervisor because the virtual machine reached its maximum allocation of the underlying processor core. In the example above, the m1.small EC2 instance was allocated 0.4 processor units, so the 40% busy figure shown by top is measured against the whole underlying core. However, since 40% is the maximum CPU share that can be allocated to this VM, the effective CPU usage is 40%/40% = 100%, which is the number displayed by CloudWatch.
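
The arithmetic above can be written down as a small helper. This is only a sketch: the function name `effective_cpu_usage` and its parameters are hypothetical, and it assumes you already know the share of the physical core your instance is entitled to (40% in the example).

```python
def effective_cpu_usage(busy_pct: float, entitled_pct: float) -> float:
    """Rescale the busy CPU seen by the OS to the VM's entitled share.

    busy_pct     -- busy CPU as reported inside the guest (e.g. by top)
    entitled_pct -- assumed maximum share of the physical core for this VM
    """
    if entitled_pct <= 0:
        raise ValueError("entitled_pct must be positive")
    return 100.0 * busy_pct / entitled_pct


if __name__ == "__main__":
    # Values from the m1.small example: 40% busy with a 40% entitlement.
    print(effective_cpu_usage(40.0, 40.0))  # 100.0
```

With half the load (20% busy against the same 40% entitlement), the helper returns 50.0, which is the kind of figure you would expect a hypervisor-level view to report.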

Another option, which can be used to retrofit existing agent-based or SNMP-based monitoring tools that do not integrate with CloudWatch, is to use the idle CPU metric. All you need to do is rewrite the rules to measure idle CPU instead of busy CPU. For example, if you have a >75% threshold defined for busy CPU, create a <25% rule for idle CPU. If the idle CPU is 0, then your server is CPU bound.
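
A minimal sketch of such an idle-based rule, assuming a Linux guest where the aggregate `cpu` line of /proc/stat lists cumulative user, nice, system, idle, iowait, irq, softirq, and steal counters in that order. The function names and the sample values in the usage note are hypothetical and serve only as illustration.

```python
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line: str) -> dict:
    """Parse the aggregate 'cpu' line of /proc/stat into named counters."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return dict(zip(FIELDS, (int(v) for v in parts[1:9])))


def cpu_percentages(before: dict, after: dict) -> dict:
    """Convert two cumulative samples into percentages for the interval."""
    deltas = {name: after[name] - before[name] for name in FIELDS}
    total = sum(deltas.values())
    if total <= 0:
        raise ValueError("samples must be taken at different times")
    return {name: 100.0 * value / total for name, value in deltas.items()}


def idle_alert(idle_pct: float, busy_threshold_pct: float = 75.0) -> bool:
    """Rewrite a '>75% busy' rule as a '<25% idle' rule."""
    return idle_pct < 100.0 - busy_threshold_pct
```

For instance, given two made-up samples `"cpu 100 0 50 800 10 0 0 40 0 0"` and `"cpu 130 0 60 803 10 0 0 97 0 0"`, the interval works out to 3% idle and 57% steal, so `idle_alert` fires even though user and system time together are only 40%. Because busy, idle, and steal add up to 100%, the idle rule accounts for stolen time without the tool having to know about it.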

Idle CPU Displayed by Top

If you are wondering where the 40% comes from, the math is pretty simple. The m1.small Linux system is entitled to 1 EC2 compute unit, which provides the equivalent CPU capacity of a 1.0-1.2 GHz 2007 Opteron or 2007 Xeon processor. Since the VM runs on a machine with a 2.6 GHz clock speed, it is entitled to 1.0/2.6 = 38.5% to 1.2/2.6 = 46.2% of the processor on this particular Xen node. You can run the cat /proc/cpuinfo command to find out the CPU architecture behind your EC2 instances.
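
The same calculation as a short sketch, in case you want to repeat it for a host with a different clock speed. The function name `entitled_share_pct` is hypothetical, and the approach assumes, as the paragraph above does, that the entitlement scales linearly with clock speed.

```python
def entitled_share_pct(ecu_ghz_low: float, ecu_ghz_high: float,
                       host_ghz: float) -> tuple:
    """Return the (low, high) share of one host core, in percent,
    that corresponds to a compute-unit range given in GHz."""
    if host_ghz <= 0:
        raise ValueError("host_ghz must be positive")
    return (100.0 * ecu_ghz_low / host_ghz,
            100.0 * ecu_ghz_high / host_ghz)


if __name__ == "__main__":
    # 1 EC2 compute unit (1.0-1.2 GHz) on a 2.6 GHz host
    low, high = entitled_share_pct(1.0, 1.2, 2.6)
    print(f"{low:.1f}% to {high:.1f}%")  # 38.5% to 46.2%
```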

Find Out the CPU Clock Speed on Linux EC2 Instance

There is an ongoing industry discussion about the terms 'stolen CPU' and 'steal time'. The word itself carries a connotation that might make some AWS customers wonder if their fully paid CPU time was somehow stolen by rogue EC2 instances running on the same physical node. Rest assured, the rules of the game are fair. The best way to describe stolen CPU time to your peers is as shared CPU time that belongs to other AWS customers.

A great tool for monitoring CPU steal time is nmon. Nmon is a system performance monitoring tool designed by Nigel Griffiths at IBM, for both AIX and Linux systems. The nmon tool provides in-depth performance statistics in console or batch modes, covering all major system components. Nmon is a single binary with no dependencies that has a minimal overhead and is commonly used to monitor production infrastructures on a continuous basis.

Axibase recently forked the nmon tool to include CPU steal time metrics. Continue reading about Amazon EC2 cloud monitoring, CPU credits and CPU steal time in our new article.

Axibase Time-Series Database also has the capability to collect Amazon CloudWatch metrics for advanced analytics and long-term retention. Learn about the Amazon Web Services integration.