EC2 monitoring: the case of stolen CPU

When the top command displays 40% CPU busy but CloudWatch says the server is maxed out at 100% – which side do you take? The answer is simple (CloudWatch is correct, top is not) but it raises a question about how to measure performance of virtual machines if you can no longer take operating system statistics at face value. How do you define thresholds, raise alerts, and create management reports if the underlying data appears to be misleading?

CPU Usage displayed by top

CPU Usage reported by CloudWatch

CPU Usage reported by Tivoli OS agent

If you’re an IBM customer with a pSeries frame, these questions aren’t entirely new to you. When IBM introduced shared pools and micro-partitioning back in 2004, it radically changed how CPU usage is monitored in the AIX part of the world. Since CPU capacity is allocated to a logical partition dynamically, the traditional CPU breakdown by system/user/I/O wait has become irrelevant for capacity planning. What matters is CPU consumption in processor units, as well as the ratio of CPU units consumed to CPU units allocated. On POWER that ratio can exceed 100%, a scalability-on-demand feature that Amazon customers cannot enjoy as of this writing.

The XEN hypervisor powering the Amazon EC2 infrastructure has made great progress in adding flexibility to resource allocation, but it’s still years behind the IBM POWER hypervisor in terms of granularity. Nevertheless, the initiated observer and the aspiring cloud guru still have some options for correlating OS and hypervisor metrics. For example, you may notice that the top output contains an additional metric called stolen CPU (st for short).

CPU Stolen displayed by top

The metric is exposed by the XEN hypervisor, and in the above example it’s equal to 56.9%. Stolen CPU is the share of cycles reclaimed by the hypervisor because the virtual machine has reached its maximum allocation of processor units on the underlying processor core. In the example above, the m1.small EC2 instance was allocated 0.4 processor units, so the 40% CPU busy figure is usage expressed as a percentage of the underlying core. However, because 40% is the maximum CPU share that can be allocated to this VM, the effective CPU usage is 40%/40% = 100%, which is the number displayed by CloudWatch.
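To make the relationship between busy, stolen and idle time concrete, here is a minimal Python sketch that derives the same percentages from two samples of the aggregate cpu line in /proc/stat. The function names are hypothetical, and the "effective" figure is an assumption of this sketch: it divides busy time by the time the hypervisor actually made available (busy plus idle). It is not how CloudWatch computes its number; it only illustrates why a modest busy percentage can mean a nearly saturated VM.

```python
# Field order of the aggregate "cpu" line in /proc/stat (see proc(5)).
FIELDS = ["user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal"]


def parse_cpu_line(line):
    """Turn the aggregate 'cpu' line of /proc/stat into a dict of counters."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    values = [int(v) for v in parts[1:1 + len(FIELDS)]]
    # Very old kernels do not report steal; treat missing counters as zero.
    values += [0] * (len(FIELDS) - len(values))
    return dict(zip(FIELDS, values))


def read_cpu_counters(path="/proc/stat"):
    """Read the current counters from a Linux host."""
    with open(path) as stat_file:
        return parse_cpu_line(stat_file.readline())


def cpu_breakdown(before, after):
    """Return busy, steal, idle and effective usage (all in percent)."""
    delta = {name: after[name] - before[name] for name in FIELDS}
    total = sum(delta.values())
    if total <= 0:
        raise ValueError("samples must be taken at different times")
    steal = 100.0 * delta["steal"] / total
    # Assumption: I/O wait is counted as "not busy", together with idle.
    idle = 100.0 * (delta["idle"] + delta["iowait"]) / total
    busy = 100.0 - steal - idle
    # Assumption: capacity available to the VM is everything except steal.
    available = busy + idle
    effective = 100.0 * busy / available if available > 0 else 100.0
    return {"busy": busy, "steal": steal, "idle": idle, "effective": effective}
```

On a live instance you would call read_cpu_counters() twice, a few seconds apart, and pass both results to cpu_breakdown(). With counters matching the example above (40% busy, 56.9% stolen), the effective figure comes out at roughly 93%, far closer to what CloudWatch reports than to the raw busy percentage.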

Another option, which can be used to retrofit existing agent- or SNMP-based monitoring tools that don’t integrate with CloudWatch, is the CPU idle metric. All you need to do is rewrite your rules to measure CPU idle instead of CPU busy. For example, if you have a >75% threshold defined for CPU busy, create a <25% rule for CPU idle. If CPU idle is 0, then your server is CPU bound.

CPU Idle displayed by top
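The rule rewrite described above is just a complement, which the following small Python sketch spells out. The function names are hypothetical and do not come from any monitoring product; the point is only that an idle-based rule fires on a stolen-CPU-bound instance where a busy-based rule stays silent.

```python
def idle_threshold_from_busy(busy_threshold_pct):
    """Convert a 'CPU busy > X%' threshold into the matching 'CPU idle < Y%' one."""
    if not 0 <= busy_threshold_pct <= 100:
        raise ValueError("threshold must be a percentage between 0 and 100")
    return 100.0 - busy_threshold_pct


def busy_rule_fires(busy_pct, busy_threshold_pct):
    """Traditional rule: alert when CPU busy exceeds the threshold."""
    return busy_pct > busy_threshold_pct


def idle_rule_fires(idle_pct, busy_threshold_pct):
    """Retrofitted rule: alert when CPU idle drops below the complement."""
    return idle_pct < idle_threshold_from_busy(busy_threshold_pct)
```

For an instance showing 40% busy, 56.9% stolen and about 3% idle, busy_rule_fires(40.0, 75) returns False while idle_rule_fires(3.1, 75) returns True, which is the behaviour you want from the alert.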

If you’re wondering where the 40% comes from, the math is pretty simple. The m1.small Linux system is entitled to 1 EC2 compute unit, which provides the equivalent CPU capacity of a 1.0-1.2 GHz 2007 Opteron or 2007 Xeon processor. Since the VM runs on a machine with a 2.6 GHz clock speed, it’s entitled to a processor share of roughly 38.5% to 46.2% (1.0/2.6 to 1.2/2.6) on this particular XEN node. You can run the cat /proc/cpuinfo command to find out the CPU architecture behind your EC2 instances.

Finding out CPU clock speed on Linux EC2 instance
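The entitlement arithmetic can be reproduced with a few lines of Python. This is a sketch under the same simplifying assumption the text makes, namely that the share is the plain ratio of compute-unit clock speed to host clock speed. The function names are hypothetical, and the parser assumes the "cpu MHz" field that x86 Linux systems print in /proc/cpuinfo.

```python
def host_clock_ghz(cpuinfo_text):
    """Extract the first 'cpu MHz' value from /proc/cpuinfo text, in GHz."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "cpu MHz":
            return float(value) / 1000.0
    raise ValueError("no 'cpu MHz' field found")


def entitlement_range(host_ghz, compute_units=1, ecu_low_ghz=1.0, ecu_high_ghz=1.2):
    """Return the (low, high) share of one physical core, in percent.

    Assumption: one EC2 compute unit equals 1.0-1.2 GHz of 2007-era
    Opteron/Xeon capacity, and the share scales linearly with clock speed.
    """
    if host_ghz <= 0:
        raise ValueError("host clock speed must be positive")
    low = 100.0 * compute_units * ecu_low_ghz / host_ghz
    high = 100.0 * compute_units * ecu_high_ghz / host_ghz
    return low, high
```

On an instance you could feed it open("/proc/cpuinfo").read(); for a 2.6 GHz host, entitlement_range(2.6) yields a range of about 38.5% to 46.2%, which brackets the 40% observed in top.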

By the way, there is an ongoing industry discussion about the term ‘stolen CPU’ or ‘steal time’. The word itself carries a connotation that might make some AWS customers wonder whether their fully paid CPU time was somehow stolen by rogue EC2 instances running on the same physical node. Rest assured, the rules of the game are fair. The best way to explain stolen CPU time to your peers is to describe it as shared CPU time belonging to other AWS customers.