The Fastest VM Bytecode Interpreter (2010) (byteworm.com)
70 points by wtbob on April 6, 2015 | hide | past | favorite | 15 comments


  for (pr = pn, ip = i;; ip++) {
      op = &optable[ip->op];
...

That looks like a switch-loop interpreter, which is considerably slower than a direct threaded interpreter.

GCC supports computed goto (`goto *expr`, with labels-as-values via `&&label`), a special mechanism that allows the inner loop to go even faster.

You can see a very similar comparison of the Mono JIT vs. an interpreter from the early 2000s here - http://rhysweatherley.sys-con.com/node/38831


It's a direct threaded interpreter. That code is just looking up the label pointer and stuffing it in the instruction's struct ahead of time.


I wrote a direct threaded interpreter where the code block for each opcode included fetch, increment, jump at the end. One possible problem is that execution never passes through a central point any more, which is of course why it's so fast.

Then a friend pointed out that the fastest way to read an opcode, get the branch target, and jump is to replace all the opcodes (create a second table) with the addresses of their implementations. This obviously saves an extra lookup, but the real eye-opener was that you can do PC++, fetch, jump all with one instruction. Use the host stack pointer as the PC in the interpreter and this becomes ret - return grabs an address, increments the pointer, and branches all in one. Of course you get program corruption on interrupts then. It needs to be either a system you have complete control over, or one with a second stack pointer.


http://flint.cs.yale.edu/jvmsem/doc/threaded.ps

You should also try to play nice with the branch prediction machinery. Doing unbalanced rets isn't good. Doing all the jumps or calls from a single jump or call site isn't good, either. Spread them around so you only have one destination per call site. The method in the above paper is pretty much the ultimate you can easily do without rolling out the really heavy machinery.



Right. I forget that some people use Windows and that Windows machines usually don't have PostScript viewers installed.


It's funny that it's such a nice, simple and old format, yet it's rarely supported natively. I wonder if browser PDF plugins can manage PS.


Thanks for the link, that was a great read!

I agree that the branch (target) prediction will fail horribly on a naive interpreter that calls different branch targets from the same static instruction over and over again. That alone explains the low performance.


Why are "unbalanced" rets not good?

Also, am I wrong that, at least in most Unixish systems, interrupts use a separate stack register from that of user space code, and thus, with care, one can avoid corruption of the direct thread addresses?


Because modern CPUs have a small internal return-address stack used to predict the targets of return instructions. If your calls and returns are not balanced, you get mispredictions (or the CPU gives up and stops predicting altogether).

http://en.wikipedia.org/wiki/Branch_predictor#Prediction_of_...

http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc....

This type of branch predictor went mainstream in the late 90's with CPUs such as the AMD K6-2.

It is not a correctness issue (as long as there's nothing else that thinks it can put stuff on the stack), just a performance issue.

You should not expect that the code snippets you "return" to will leave the stack alone -- spill/fill and callee-save instructions etc will kill the part of the "stack" that has already been "returned" from which is fine as long as you don't want to reuse it. But I think you will.


I had a similar implementation in a toy VM I wrote, primarily to cope with variable-length instructions. In retrospect that was my biggest mistake.


Historical link to mono's interpreter source (first link in the article tries to find it in master, where it's been removed):

https://github.com/mono/mono/blob/e85609a84907dc3a919bac289a...


Argh, I really dislike this article. It spreads a LOT of misinformation:

  1) His friend did NOT write a bytecode interpreter, he wrote a bytecode compiler: there is a HUGE difference between those two.

  2) No, it wasn't faster than assembly code. And the arrogance and audacity to make such a claim is obnoxious. Unless you gained access to Intel microcode, this is just plain wrong.

I wish I could downvote this: it's just bad for aspiring compiler/interpreter developers. If you want some quality discussion, see Mike Pall: http://article.gmane.org/gmane.comp.lang.lua.general/75426


2010 (the link to the Mono source is broken now)


Thanks, added.



