EC2 Monitoring: The Case of the Stolen CPU

What do you do when the top command shows that only 40% of the CPU is busy, but CloudWatch says the server is maxed out at 100%? The answer is simple: CloudWatch is correct and top is not. This raises the question of how to measure the performance of virtual machines if you can no longer take operating system statistics at face value. How do you define thresholds, raise alerts, and create management reports if the underlying data appears to be misleading?

CPU Usage Displayed by Top

CPU Usage Reported by CloudWatch

CPU Usage Reported by Tivoli OS Agent

If you are an IBM customer with a pSeries frame, these questions are not entirely new to you. When IBM introduced shared pools and micro-partitioning back in 2004, it radically changed how CPU usage is monitored in the AIX part of the world. In fact, since CPU capacity is allocated to a logical partition dynamically, the traditional CPU breakdown by system/user/I/O wait has become irrelevant for capacity planning. What matters is the CPU consumption in processor units as well as the ratio of CPU units consumed to CPU units allocated. On POWER, this ratio can be greater than 100%, a scalability-on-demand feature that Amazon customers cannot enjoy as of this writing.

The XEN hypervisor powering the Amazon EC2 infrastructure has made great progress in adding flexibility to resource allocation, but it is still years behind the IBM POWER hypervisor in terms of granularity. Nevertheless, the initiated observer and the aspiring cloud guru still have some options for correlating OS and hypervisor metrics. For example, you may notice that the top output contains an additional metric called stolen CPU (st for short).

Stolen CPU Displayed by Top

The metric is exposed by the XEN hypervisor and, in the above example, it is equal to 56.9%. Stolen CPU is the share of cycles reclaimed by the hypervisor because the virtual machine reached its maximum allocation of the underlying processor core. In the example above, the m1.small EC2 instance was allocated 0.4 processor units, so the 40% busy figure shown by top is measured against the whole underlying core. However, as 40% is also the maximum CPU share that can be allocated to this VM, the effective CPU usage is 40%/40% = 100%, which is the number displayed by CloudWatch.
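The relationship between busy, idle, and stolen CPU can be sketched in a few lines of Python. This is a minimal illustration, not the formula CloudWatch or top uses: it reads the aggregate `cpu` line of `/proc/stat` and, as an assumption, treats I/O wait as idle time and normalizes busy time against the cycles the hypervisor did not steal. The function names are hypothetical.

```python
# Field order of the aggregate "cpu" line in /proc/stat on Linux.
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line):
    """Return cumulative tick counters from the aggregate 'cpu' line."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    values = [int(v) for v in parts[1:1 + len(FIELDS)]]
    # Older kernels omit trailing fields; treat the missing ones as zero.
    values += [0] * (len(FIELDS) - len(values))
    return dict(zip(FIELDS, values))


def read_cpu_sample(path="/proc/stat"):
    """Read one sample; the aggregate 'cpu' line is the first line of the file."""
    with open(path) as f:
        return parse_cpu_line(f.readline())


def cpu_percentages(before, after):
    """Compute busy, idle, steal, and effective CPU % between two samples."""
    delta = {name: after[name] - before[name] for name in FIELDS}
    total = sum(delta.values())
    if total <= 0:
        raise ValueError("samples must be taken at different times")
    steal = 100.0 * delta["steal"] / total
    # Assumption: I/O wait counts as idle time in this sketch.
    idle = 100.0 * (delta["idle"] + delta["iowait"]) / total
    busy = 100.0 - idle - steal
    # Assumption: normalize busy time against the cycles that were not stolen.
    available = 100.0 - steal
    effective = 100.0 * busy / available if available > 0 else 100.0
    return {"busy": busy, "idle": idle, "steal": steal, "effective": effective}
```

On an instance, you would call `read_cpu_sample()` twice a few seconds apart and pass both samples to `cpu_percentages()`. With 40% busy and 56.9% steal, the effective figure comes out near 93% rather than exactly 100%, because the few remaining idle cycles are counted as available capacity.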

Another option, which can be used to retrofit existing agent-based or SNMP-based monitoring tools that do not integrate with CloudWatch, is to use the idle CPU metric. All you need to do is rewrite the rules to measure idle CPU instead of busy CPU. For example, if you have a >75% threshold defined for busy CPU, create a <25% rule for idle CPU. If the idle CPU is 0, then your server is CPU bound.

Idle CPU Displayed by Top
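The rule rewrite described above amounts to a one-line conversion. The sketch below is only an illustration; the function names and the severity labels ("ok", "warning", "critical") are hypothetical and not taken from any particular monitoring product.

```python
def idle_threshold(busy_threshold_pct):
    """Convert a 'busy above X%' threshold into the matching 'idle below Y%' one."""
    if not 0 <= busy_threshold_pct <= 100:
        raise ValueError("threshold must be between 0 and 100")
    return 100.0 - busy_threshold_pct


def check_idle(idle_pct, busy_threshold_pct=75.0):
    """Classify an idle CPU reading using the inverted busy threshold."""
    if idle_pct <= 0:
        return "critical"  # no idle cycles left: the server is CPU bound
    if idle_pct < idle_threshold(busy_threshold_pct):
        return "warning"
    return "ok"
```

Because stolen cycles are counted as neither busy nor idle, an idle-based rule fires on a capped instance even when the busy figure looks moderate.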

If you are wondering where 40% comes from, the math is pretty simple. The m1.small Linux system is entitled to 1 EC2 compute unit, which provides the equivalent CPU capacity of a 1.0-1.2 GHz 2007 Opteron or 2007 Xeon processor. Since the VM runs on a machine with a 2.6 GHz clock speed, it is entitled to a 38.5% – 46.2% processor share (1.0/2.6 to 1.2/2.6) on this particular XEN node. You can run the cat /proc/cpuinfo command to find out the CPU architecture behind your EC2 instances.

Find Out the CPU Clock Speed on Linux EC2 Instance
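The same arithmetic can be reproduced for any host clock speed. The sketch below assumes the `cpu MHz` line found in `/proc/cpuinfo` on x86 Linux and the 1.0-1.2 GHz range for one EC2 compute unit quoted above; the function names are hypothetical, and the reported clock speed may vary on hosts that use frequency scaling.

```python
import re


def clock_speeds_mhz(cpuinfo_text):
    """Extract every 'cpu MHz' value from the text of /proc/cpuinfo."""
    matches = re.findall(r"^cpu MHz\s*:\s*([\d.]+)", cpuinfo_text, flags=re.MULTILINE)
    return [float(m) for m in matches]


def share_range(host_ghz, ecu_low_ghz=1.0, ecu_high_ghz=1.2):
    """Return the (low, high) processor share in percent for one compute unit."""
    if host_ghz <= 0:
        raise ValueError("host clock speed must be positive")
    return (100.0 * ecu_low_ghz / host_ghz, 100.0 * ecu_high_ghz / host_ghz)
```

For example, passing the contents of `/proc/cpuinfo` to `clock_speeds_mhz()` and then calling `share_range(2.6)` for a 2.6 GHz host gives roughly 38.5% and 46.2%.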

There is an ongoing industry discussion about the terms ‘stolen CPU’ and ‘steal time’. Obviously, the word itself carries a connotation that might make some AWS customers wonder if their fully-paid CPU time was somehow stolen by rogue EC2 instances running on the same physical node. Rest assured, the rules of the game are fair. The best way to describe stolen CPU time to your peers is as shared CPU time belonging to other AWS customers.

The introduction of t2 instances with burstable CPU has made the calculation of steal time a bit more complex, while new CloudWatch metrics allow you to monitor exactly when a t2 EC2 instance will become subject to the CPU limit. Read more about t2 instances and CPU credits in our new article.

CPU Credits on a t2.small Instance

For ‘steal time’ monitoring, consider installing the latest version of nmon (15.d+), a Linux/AIX system performance monitoring tool designed by Nigel Griffiths at IBM. The nmon tool provides statistics through a console or a CSV file, covering all major system components, including CPU, network, disk, memory, and file systems. Just like collectd, nmon is a single binary with no dependencies and minimal overhead. More details are presented here.

Steal Time Displayed in the nmon Console