This analysis of steal time is not entirely correct.
Steal time exists to fix a problem. Without it, when a hypervisor pre-empts a running guest and later resumes it, the guest has no way of knowing that any time passed: as far as the guest can tell, the process that was running when the whole guest was pre-empted had been running the entire time.
This means that if a guest is pre-empted, CPU usage reporting inside the guest becomes horribly wrong, with some processes showing much higher usage than they actually got. That breaks scheduler fairness and anything else that depends on accurate accounting.
Steal time is simply a way to tell a guest that it was pre-empted. The guest OS can then use that information to correct its usage information and preserve fairness.
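For the curious, on a Linux guest that corrective accounting surfaces as the steal field on the "cpu" line of /proc/stat (cumulative ticks since boot). A quick sketch to read it out, assuming a reasonably modern kernel where steal is the eighth value after the "cpu" label:

```shell
# Read the cumulative steal ticks (USER_HZ units) the guest kernel has
# accounted since boot. Field layout on the aggregate line is:
#   cpu user nice system idle iowait irq softirq steal ...
steal_ticks=$(awk '/^cpu /{print $9}' /proc/stat)
echo "steal ticks since boot: ${steal_ticks:-0}"
```

Tools like top, vmstat, and mpstat derive their %st column from deltas of this counter.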
However, it is not a general indication of overcommit. When a guest idles a VCPU, the hypervisor de-schedules it. When an event arrives that would normally wake that VCPU, it goes back on the scheduler queue, and if the system is overcommitted it may take much longer than usual for the VCPU to actually run again.
Most clouds are also designed to put multiple VCPUs on each physical CPU, and there is certainly capping in place. You can still see steal time even though you are getting your full share.
Let me give an example:
1) You are capped at 50%. You run for your full 50%, go idle, the hypervisor realizes you've exhausted your slice, and doesn't schedule you until the next slice. No steal time is reported.
2) You are capped at 50%. You have a neighbor attempting to use his full time slice. Instead of you running for the first half of the slice and the neighbor running for the second half, the hypervisor carves the slice into 10 slots and schedules you both in alternating slots. Both guests see 50% steal time.
You will get the same performance in both scenarios even though the steal time is reported differently.
My rule of thumb is pretty simple: if I see steal but still have an abundance of idle, I don't have a problem. If I see steal and low or no idle, I have a problem with an overcommitted hypervisor.
It's derived from stress testing and production across a variety of virtualisation platforms, and it's generally proven pretty accurate.
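That rule of thumb is easy to sanity-check from a shell. Here's a rough sketch; the one-second window is arbitrary, and the field positions assume a modern Linux /proc/stat (idle is field 5, steal is field 9):

```shell
# Sample idle and steal ticks from /proc/stat one second apart, then
# apply the rule of thumb: steal alongside plenty of idle is fine,
# steal with little idle suggests an overcommitted hypervisor.
sample() { awk '/^cpu /{print $5, $9}' /proc/stat; }
set -- $(sample); idle0=$1; steal0=$2
sleep 1
set -- $(sample); idle1=$1; steal1=$2
idle=$((idle1 - idle0))
steal=$((steal1 - steal0))
if [ "$steal" -gt 0 ] && [ "$idle" -le "$steal" ]; then
    echo "steal with little idle: likely overcommitted"
else
    echo "idle headroom (or no steal): probably fine"
fi
```

The "little idle" threshold here (idle <= steal over the interval) is an illustration, not a tuned value.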
Closely related to CPU steal time is memory ballooning. If one instance starts to require a lot of memory while the other instances do not, hypervisors (particularly VMware) will steal memory from the other VMs on the same machine and give it to the misbehaving VM.
This can result in swapping on the unfortunate target VMs.
You can detect it by noticing a VMware process using a lot of CPU (but, ironically, no memory), and by watching your free memory percentage decrease while your own programs are not actually consuming more memory.
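A rough way to look for those signs from inside a Linux guest; the module and process names (vmw_balloon, vmtoolsd) are the usual VMware ones, but treat them as assumptions for your particular platform:

```shell
# Look for the usual VMware ballooning suspects. These checks only
# report presence; they don't prove the balloon is being inflated.
lsmod 2>/dev/null | grep -q '^vmw_balloon' && echo "balloon driver loaded"
ps -eo comm= 2>/dev/null | grep -qx vmtoolsd && echo "vmtoolsd running"
# Then watch MemFree shrink while your own processes' RSS stays flat:
grep '^MemFree' /proc/meminfo
```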
My understanding is that memory sharing is generally not done on multi-tenant Xen systems, and that this is one of the reasons Xen is so dominant in that space.
OpenVZ, generally speaking, shares memory, but does not use ballooning; ballooning is specific to systems where each user has their own kernel.
Generally speaking, I think it's much less bad to oversubscribe CPU than memory. Among other things, if you take CPU away from a heavy user to give it to a user who hasn't used their fair share, all the new user has to do is refill the CPU cache from main memory, which is slow, but blindingly fast compared with paging main memory back in from disk, which is what the light user has to do when you take memory away from the heavy user.
Of course, the equation is quite different when the system is all owned by the same entity.
Using micro instances I've seen it go up to 99% during CPU-intensive work (e.g. an app build). The hang-ups while waiting for it to continue made me decide to switch the build server to an m1.small instance instead. It's idle the vast majority of the time, but the extra $$ is totally worth it when you're running a build.
The steal % is usually zero on the m1.small instance. I just tried maxing out the CPU while watching "top"; this is as high as it got:
To max out the CPU I ran the following in a separate SSH terminal while watching "top". An m1.small has only a single vCPU, so only a single running copy should be necessary.
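The commenter's actual command isn't shown; any single-threaded busy loop will do the job. A hypothetical stand-in:

```shell
# Spin one shell busy-loop flat out in the background; with a single
# vCPU this pegs the instance. Watch %us (and, on a throttled micro
# instance, %st) climb in top while it runs.
( while :; do :; done ) &
burner=$!
sleep 2                     # let it burn for a couple of seconds
kill "$burner" 2>/dev/null  # stop the loop
echo "burner pid was $burner"
```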
Micro instances only provide "burst" CPU usage - if you keep above a certain threshold for long enough, it will throttle you by stealing CPU time (hence the 99%).
Yes, that's exactly what happened. I originally thought our build times were short enough that it wouldn't go over the threshold, but it did. It's surprisingly easy to trigger CPU throttling on a micro instance.
Would be nice if they could/would average out CPU usage over a longer rolling window. That'd be perfect for a use case like this (a build server), where you're idle the majority of the time but want to max out the CPU during the build itself. Seems like the perfect use case for a shared server.
I've been having issues with steal time recently, but what I'm seeing isn't adequately explained by any of the articles and documentation I could find. Here is an example from one EC2 node:
For me, when the machine was under load, %steal was almost always very close to %usr. It wasn't always the same, sometimes more and sometimes less. Can anyone explain how these numbers are related to each other?
They aren't related. Steal is the share of CPU cycles that were promised to your system but that your system didn't get. User is how much CPU (of the promised maximum) your programs are using directly, i.e. not time spent in system calls (I/O being the big one), which falls under the system bucket.
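One way to see that they're independent shares of the same tick total is to compute both from /proc/stat deltas yourself. A sketch, assuming the modern field layout on the aggregate "cpu" line (user is field 2, steal is field 9) and an arbitrary one-second window:

```shell
# Express the user and steal deltas as percentages of all ticks that
# elapsed across every accounting bucket in the interval.
snap() { awk '/^cpu /{print $2, $9, $2+$3+$4+$5+$6+$7+$8+$9}' /proc/stat; }
set -- $(snap); u0=$1; s0=$2; t0=$3
sleep 1
set -- $(snap); u1=$1; s1=$2; t1=$3
total=$((t1 - t0))
[ "$total" -gt 0 ] || total=1   # guard against divide-by-zero
echo "user%:  $(( 100 * (u1 - u0) / total ))"
echo "steal%: $(( 100 * (s1 - s0) / total ))"
```

Nothing forces the two percentages to track each other; any correlation comes from the workload and the hypervisor's scheduling, not from the accounting.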
Then why are they the same in this case? Coincidence? (I won't believe that... but unfortunately I don't have more samples right now to demonstrate the correlation).
This is pretty old, so sorry if you don't get this, but I have been seeing something similar today:
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
0 0 33960 44844 147332 466588 0 0 0 63 130 198 20 2 59 1 18
0 0 33960 44844 147340 466652 0 0 1 66 123 187 19 2 59 1 18
Etc.
I believe what's happening here is that CPU cycles are being requested and subsequently stolen, and then, since the system is still idle, they're re-requested and granted, all within the same interval polled by the monitoring tool.
It's just a theory, but it makes sense (at least to me ;).
I use Munin on my VPSs, and it shows the steal time which is nice. I don't typically see it showing up other than a 1 pixel line on RamNode. Hopefully that doesn't change under higher loads in the future.