This analysis of steal time is not entirely correct.
Steal time exists to fix a problem. Without it, when a hypervisor pre-empts a running guest and later resumes it, the guest has no way of knowing that any time passed: as far as the guest can tell, the process that was running when the whole guest was pre-empted had been running the entire time.
This means that if a guest is pre-empted, CPU usage reporting inside the guest becomes horribly wrong, with some processes showing much higher usage than they actually got. That breaks scheduler fairness and anything else that depends on accurate accounting.
Steal time is simply a way to tell a guest that it was pre-empted. The guest OS can then use that information to correct its usage information and preserve fairness.
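For the curious, on a Linux guest that corrective accounting surfaces as the steal field on the "cpu" line of /proc/stat (cumulative ticks since boot). A quick sketch to read it out, assuming a reasonably modern kernel where steal is the eighth value after the "cpu" label:

```shell
# Read the cumulative steal ticks (USER_HZ units) the guest kernel has
# accounted since boot. Field layout on the aggregate line is:
#   cpu user nice system idle iowait irq softirq steal ...
steal_ticks=$(awk '/^cpu /{print $9}' /proc/stat)
echo "steal ticks since boot: ${steal_ticks:-0}"
```

Tools like top, vmstat, and mpstat derive their %st column from deltas of this counter.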
However, it is not a general indication of overcommit. When a guest idles a VCPU, the hypervisor de-schedules it. When an event arrives that would normally wake that VCPU, it goes back on the scheduler queue, and if the system is overcommitted it may take much longer than usual for the VCPU to actually run again.
Most clouds are also designed to put multiple VCPUs on each physical CPU, and there is certainly capping in place. You can still see steal time even though you are getting your full share.
Let me give an example:
1) You are capped at 50%. You run for your full 50%, go idle, the hypervisor realizes you've exhausted your slice, and doesn't schedule you until the next slice. No steal time is reported.
2) You are capped at 50%. You have a neighbor attempting to use his full time slice. Instead of you running for the first half of the slice and the neighbor running for the second half, the hypervisor carves the slice into 10 slots and schedules you both in alternating slots. Both guests see 50% steal time.
You will get the same performance in both scenarios even though the steal time is reported differently.
My rule of thumb is pretty simple: if I see steal but still have an abundance of idle, I don't have a problem. If I see steal and low or no idle, I have a problem with an overcommitted hypervisor.
It's derived from stress testing and production across a variety of virtualisation platforms, and it's generally proven pretty accurate.
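That rule of thumb is easy to sanity-check from a shell. Here's a rough sketch; the one-second window is arbitrary, and the field positions assume a modern Linux /proc/stat (idle is field 5, steal is field 9):

```shell
# Sample idle and steal ticks from /proc/stat one second apart, then
# apply the rule of thumb: steal alongside plenty of idle is fine,
# steal with little idle suggests an overcommitted hypervisor.
sample() { awk '/^cpu /{print $5, $9}' /proc/stat; }
set -- $(sample); idle0=$1; steal0=$2
sleep 1
set -- $(sample); idle1=$1; steal1=$2
idle=$((idle1 - idle0))
steal=$((steal1 - steal0))
if [ "$steal" -gt 0 ] && [ "$idle" -le "$steal" ]; then
    echo "steal with little idle: likely overcommitted"
else
    echo "idle headroom (or no steal): probably fine"
fi
```

The "little idle" threshold here (idle <= steal over the interval) is an illustration, not a tuned value.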
Closely related to CPU steal time is memory ballooning. If one instance starts to require a lot of memory while the other instances do not, hypervisors (particularly VMware) will steal memory from the other VMs on the same machine and give it to the misbehaving VM.
This can result in swapping on the unfortunate target VMs.
You can detect it by noticing a VMware process using a lot of CPU (but, ironically, no memory), and by watching your free memory percentage decrease while your own programs are not actually consuming more memory.
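A rough way to look for those signs from inside a Linux guest; the module and process names (vmw_balloon, vmtoolsd) are the usual VMware ones, but treat them as assumptions for your particular platform:

```shell
# Look for the usual VMware ballooning suspects. These checks only
# report presence; they don't prove the balloon is being inflated.
lsmod 2>/dev/null | grep -q '^vmw_balloon' && echo "balloon driver loaded"
ps -eo comm= 2>/dev/null | grep -qx vmtoolsd && echo "vmtoolsd running"
# Then watch MemFree shrink while your own processes' RSS stays flat:
grep '^MemFree' /proc/meminfo
```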
My understanding is that memory sharing is generally not done on multi-tenant Xen systems, and that this is one of the reasons Xen is so dominant in that space.
OpenVZ, generally speaking, shares memory, but does not use ballooning; ballooning is specific to systems where each user has their own kernel.
Generally speaking, I think it's much less bad to oversubscribe CPU than memory. Among other things, if you take CPU away from a heavy user to give it to a user who hasn't used their fair share, all the new user has to do is refill the CPU cache from main memory, which is slow, but blindingly fast compared with paging main memory back in from disk, which is what the light user has to do when you take memory away from the heavy user.
Of course, the equation is quite different when the system is all owned by the same entity.
Using micro instances I've seen it go up to 99% during CPU-intensive work (e.g. an app build). The hang-ups while waiting for it to continue made me decide to switch the build server to an m1.small instance instead. It's idle the vast majority of the time, but the extra $$ is totally worth it when you're running a build.
The steal % is usually zero on the m1.small instance. I just tried maxing out the CPU while watching "top"; this is as high as it got:
To max out the CPU I ran the following in a separate SSH terminal while watching "top". An m1.small has only a single vCPU, so only a single running copy should be necessary.
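The commenter's actual command isn't shown; any single-threaded busy loop will do the job. A hypothetical stand-in:

```shell
# Spin one shell busy-loop flat out in the background; with a single
# vCPU this pegs the instance. Watch %us (and, on a throttled micro
# instance, %st) climb in top while it runs.
( while :; do :; done ) &
burner=$!
sleep 2                     # let it burn for a couple of seconds
kill "$burner" 2>/dev/null  # stop the loop
echo "burner pid was $burner"
```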
Micro instances only provide "burst" CPU usage - if you keep above a certain threshold for long enough, it will throttle you by stealing CPU time (hence the 99%).
Yes, that's exactly what happened. I originally thought our build times were short enough that it wouldn't go over the threshold, but it did. It's surprisingly easy to trigger CPU throttling on a micro instance.
Would be nice if they could/would average out CPU usage over a longer rolling window. That'd be perfect for a use case like this (a build server), where you're idle the majority of the time but want to max out the CPU during the build itself. Seems like the perfect use case for a shared server.
I've been having issues with steal time recently, but what I'm seeing isn't adequately explained by any of the articles and documentation I could find. Here is an example from one EC2 node:
For me, when the machine was under load, %steal was almost always very close to %usr. It wasn't always the same, sometimes more and sometimes less. Can anyone explain how these numbers are related to each other?
They aren't related. Steal is the share of CPU cycles that were promised to your system but that your system didn't get. User is how much CPU (of the promised maximum) your programs are using directly, i.e. not time spent in system calls (I/O being the big one), which falls under the system bucket.
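One way to see that they're independent shares of the same tick total is to compute both from /proc/stat deltas yourself. A sketch, assuming the modern field layout on the aggregate "cpu" line (user is field 2, steal is field 9) and an arbitrary one-second window:

```shell
# Express the user and steal deltas as percentages of all ticks that
# elapsed across every accounting bucket in the interval.
snap() { awk '/^cpu /{print $2, $9, $2+$3+$4+$5+$6+$7+$8+$9}' /proc/stat; }
set -- $(snap); u0=$1; s0=$2; t0=$3
sleep 1
set -- $(snap); u1=$1; s1=$2; t1=$3
total=$((t1 - t0))
[ "$total" -gt 0 ] || total=1   # guard against divide-by-zero
echo "user%:  $(( 100 * (u1 - u0) / total ))"
echo "steal%: $(( 100 * (s1 - s0) / total ))"
```

Nothing forces the two percentages to track each other; any correlation comes from the workload and the hypervisor's scheduling, not from the accounting.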
Then why are they the same in this case? Coincidence? (I won't believe that... but unfortunately I don't have more samples right now to demonstrate the correlation).
This is pretty old, so sorry if you don't get this, but I have been seeing something similar today:
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
0 0 33960 44844 147332 466588 0 0 0 63 130 198 20 2 59 1 18
0 0 33960 44844 147340 466652 0 0 1 66 123 187 19 2 59 1 18
Etc.
I believe what's happening here is that CPU cycles are being requested and subsequently stolen, and then, since the system is still idle, they're re-requested and granted, all within the same interval polled by the monitoring tool.
It's just a theory, but it makes sense (at least to me ;).
I use Munin on my VPSs, and it shows the steal time which is nice. I don't typically see it showing up other than a 1 pixel line on RamNode. Hopefully that doesn't change under higher loads in the future.