The JVM's GC is most likely significantly better. On the other hand, golang's GC needs to collect fewer objects, in some cases orders of magnitude fewer.
If you compare a slice of structs with 1000 elements, it'll be one object (and one allocation) in golang. The equivalent array in the JVM requires the array itself plus 1000 objects: 1001 allocations. In this case, golang has a much smaller object graph to collect.
Of course, a slice of 1000 interfaces or pointers faces the same 1001-allocation issue in golang as well.
You could emulate the same GC cost in the JVM, at the cost of runtime performance, by storing the objects in a byte array and [de]serializing as needed, but that's neither an idiomatic nor an acceptable solution most of the time.
Could you explain what essential things the JVM does better than Go at this point? Does it stop the world less often? Or does it do more things in parallel? Thanks
Although Go's GC is tunable to some extent, the open source HotSpot JVM already has multiple GC implementations that you can choose between based on your use case, and then tune further. There is also work being done in the OpenJDK project on a GC that can collect > 100GB heaps in < 10ms [1]. There are also alternative proprietary implementations available today that have no stop-the-world collections at all [2].
If it is so carefully tuned, why does it need such a big GC tuning guide and hundreds of JVM flags? Any Java product of consequence ships with custom GC settings, meaning its developers did not find the defaults suitable.
Because the JVM developers had customers who asked for the ability to tune the GC for their particular application.
Go will receive those feature requests too. The Go developers may not be as willing to provide so many knobs (which is a position I'm entirely sympathetic to, don't get me wrong). But the settings always exist, regardless of whether Google hammers in values for them or leaves them adjustable. GC is full of tradeoffs; they're fundamental to the problem.
The G1 collector has a single knob that is supposed to be a master knob: you pick your pause time goal. Lower means shorter pauses but overall more CPU time spent on collection. Higher means longer pauses but less time spent on collection and thus more CPU time spent on your app. Batch job? Give it a high goal. Latency sensitive game or server? Give it a low goal.
There are many other flags too, and you can tune them if you want to squeeze more performance out of your system, but you don't have to use them if you don't want to.
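For concreteness, that master knob is a single HotSpot flag. A sketch of a launch line (the `-XX` flags are real G1 options; `server.jar` and the heap size are placeholders):

```shell
# Latency-sensitive service: ask G1 for ~50ms pauses.
# Batch job? Raise MaxGCPauseMillis instead and get more throughput.
java -XX:+UseG1GC -XX:MaxGCPauseMillis=50 -Xms8g -Xmx8g -jar server.jar
```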
Depends what you compare it to. As I have written above, you can get low pause times with huge heaps today. In practice very few apps need such low pause times with such giant heaps and as such most users prefer to tolerate higher pauses to get more throughput. There are cases where that's not true, the high frequency trading world seems to be one, but that's why companies like Azul make money. You can get JVMs that never pause. Just not for free.
With respect to garbage collection only and ignoring things like reliable debugging support, the primary thing it does is compaction.
If your memory manager does not compact the heap (i.e. never moves anything), then this implies a couple of things:
1. You can run out of memory whilst still technically having enough bytes available for a requested allocation, if those bytes are not contiguous. Most allocators bucket allocations by size to try and avoid the worst of this, but ultimately if you don't move things around it can always bite you.
2. The allocator has to go find a space for something when you request space. As the heap gets more and more fragmented this can slow down. If your collector is able to move objects then you can do things generationally which means allocation is effectively free (just bump a pointer).
In the JVM world there are two state-of-the-art collectors: the open source G1 and Azul's commercial C4 (C4 == Continuously Concurrent Compacting Collector). Both can compact the heap concurrently. This is considered an important feature for reliability, because otherwise big programs can't stay up indefinitely: eventually the heap gets so fragmented that they have to restart. Note that not all programs suffer from this. It depends a lot on how a program uses memory, the types of allocations it does, their predictability, etc. But if your program does start to suffer from heap fragmentation then oh boy, is it ever painful to fix.
The Go team have made a collector that does not move things. This means it can superficially look very good compared to other runtimes, but it's comparing apples to oranges: the collectors aren't doing the same amount of work.
The two JVM collectors have a few other tricks up their sleeves. G1 can deduplicate strings on the heap. If you have a string like "GET" or "index.html" 1000 times in your heap, G1 can rewrite the pointers so there's only a single copy instead. C4's stand-out feature is that your app doesn't pause for GC ever, all collection is done whilst the app is running, and Azul's custom JVM is tuned to keep all other pause times absolutely minimal as well. However IIRC it needs some kernel patches in order to do this, due to the unique stresses it places on the Linux VMM subsystem.
While I agree that compaction is desirable in theory, empirically it's not really necessary. For example, there are no C/C++ malloc/free implementations that compact, because compaction would change the address of pointers, breaking the C language. Long-lived C and C++ applications seem to get by just fine without the ability to move objects in memory.
Java code also tends to make more allocations than Go code, simply because Java does not (yet) have value types, and Go does. This isn't really anything to do with the GC, but it does mean that Java _needs_ a more powerful GC just to handle the sometimes much greater volume of allocations. It also makes Java programmers sometimes have to resort to hacks like arrays of primitive types (I've done this before).
People like to talk about how important generational GC is, and how big a problem it is that Go doesn't have it. But I have also seen that if there is too high a volume of data in the young-gen in Java, short-lived objects get tenured anyway. In practice, the generational assumption isn't always true. If you use libraries like Protobuffers that create a ton of garbage, you can pretty easily exceed the GC's ability to keep up with short-lived garbage.
I'm really curious to see how Go's GC works out for big heaps in practice. I can say that my experience with Java heaps above 100 GB has not been good. (To be fair, most of my Java experience has been under CMS, not the new G1 collector.)
Experience with C/C++ is exactly why people tend to value compaction. I've absolutely encountered servers and other long-lived apps written in C++ that suffer from heap fragmentation, and required serious attention from skilled developers to try and fix things (sometimes by adding or removing fields from structures). It can be a huge time sink because the code isn't actually buggy and the problem is often not easily localised to one section of code. It's not common that you encounter big firefighting efforts though, because often for a server it's easier to just restart it in this sort of situation.
As an example, Windows has a special malloc called the "low fragmentation heap" specifically to help fight this kind of problem - if fragmentation was never an issue in practice, such a feature would not exist.
CMS was never designed for 100GB+ heaps so I am not surprised your experience was poor. G1 can handle such heaps although the Intel/HBase presentation suggested aiming for more like 100msec pause times is reasonable there.
The main thing I'd guess you have to watch out for with huge Go heaps is how long it takes to complete a collection. If it's really scanning the entire heap in each collection then I'd guess you can outrun the GC quite easily if your allocation rate is high.
It's true heaps can be a pain with C/C++. 64-bit is pretty ok, it's rare to have any issues.
32-bit is painful and messy. If possible, one thing that may help is to allocate large (virtual-memory-wise) objects once at the beginning of a new process and keep separate heaps for different threads/purposes. Not only can heap fragmentation be an issue, but so can virtual memory fragmentation; the latter is usually what turns out to be fatal. One way to mitigate issues with multiple large allocations is to change the memory mappings as needed... Yeah, it can get messy.
64-bit systems are way easier. Large allocations can be handled by requesting page-sized blocks of memory from the OS (VirtualAlloc / mmap). The OS can move and compact physical memory just fine. At most you'll end up with holes in the virtual memory mappings, but that's not a real issue on 64-bit systems.
Small allocations can go through an allocator that is smart enough to group them into 2^n size classes (or use other, smarter tricks to practically eliminate fragmentation).
Other ways are to use arenas or multiple heaps. For example per thread or per object.
There are also compactible heaps. You just need to lock the memory object before use to get a pointer to it and unlock when you're done. The heap manager is free to move the memory block as it pleases, because no one is allowed to have a pointer to the block. Harder to use, yes, but hey, no fragmentation!
Yeah, Java is better in some ways for always being able to compact memory. That said, I've also cursed it to hell for ending up in a practically infinite GC loop when used memory nears the maximum heap size.
Well, I can only say that your experience is different than mine. I worked with C++ for 10 years, on mostly server side software, and never encountered a problem that we traced back to heap fragmentation. I'm not sure exactly why this was the case... perhaps the use of object pools prevented it, or perhaps it just isn't that big of a problem on modern 64 bit servers.
At Cloudera, we still mostly use CMS because the version of G1 shipped in JDK6 wasn't considered mature, and we only recently upgraded to JDK7. We are currently looking into defaulting to G1, but it will take time to feel confident about that. G1 is not a silver bullet anyway. You can still get multi-minute pauses with heaps bigger than 100GB. A stop-the-world GC is still lurking in wait if certain conditions are met, and some workloads always trigger it... like starting the HDFS NameNode.
Ouch, not even upgrading to JDK8? That's not really a new JVM anymore now.
G1 has improved a lot over time. What I've been writing was based on the assumption of using the latest version of it.
Yes, full stop-the-world GCs are painful, but they'll be painful in any GC. If Go runs out of memory entirely then I assume they have to do the same thing.
"Great" is pushing it a bit. The JVM will not do inter-procedural escape analysis unless the called method is inlined into the caller, so that the compiler can treat the pair as a single method for optimisation purposes. So forget about stack-allocating an object high up the call stack even if it's only used lower down and could theoretically have been.
That said, JVMs do not actually stack allocate anything. They do a smarter optimisation called scalar replacement. The object is effectively decomposed into local variables that are then subject to further optimisation, for instance, completely deleting a field that isn't used.
Value types will be added to the JVM eventually in the Valhalla project. Go fans may note here that Go has value types, but this is a dodge - the bulk of the work being done so far in Valhalla is a major upgrade of the support for generics, because the Java (and .NET) teams believe that value types without generic specialisation is a fairly useless feature. If they didn't do that you could have MyValueType[] as an array, but not a List<MyValueType> or Map<String, MyValueType> which would make it fairly useless. Go gets around this problem by simply not letting users define their own generic data structures and baking a few simple ones into the language itself. This is hardly a solution.
It is straightforward because Go has first-class value types, which are most likely to live on the stack, whereas in Java everything except primitives is a reference type, most likely living on the heap. Java data structures are also really bloated.
Back when Java was introduced I was disappointed that they decided to ignore value types, specifically given that Cedar, Modula-3, Eiffel and Oberon variants all had them.
Also that they went VM instead of AOT like those languages main implementations.
Oh well, at least they are now on the roadmap for Java 10, 30 years later.
The golang object pool is a bit of a problem (compared to the JVM alternatives) due to lack of generics. You tend to need to do object pooling when you have tight performance requirements which is at odds with the type manipulation you have to do with the sync.Pool.
So the golang pool is good for the case where you have GC heavy but non-latency sensitive operations, but not the more general performance sensitive problems.
At the x86 level, myPool.Get(Index) is going to be at least as expensive as cmp/jae/mov (3 cycles), and myPool.Get().(myStruct) is going to be at least as expensive as cmp/jae/cmp/jne/mov (5 cycles). So unless you have some way of hiding the latency, the type check is 67% slower by cycle count.
The experience of every JIT developer is that dynamic type checks do matter a lot in hot paths.
Not disagreeing, but I think that was a bit inaccurate.
If that branch is mispredicted, we're talking about 12-20 cycles. Ok, I assume it's a range check and thus (nearly) always not taken. So if it's in hot path, it'll always be correctly predicted. Modern CPUs will most likely fuse cmp+jae into one micro-op, so predicted-not-taken + mov will take 2 cycles (+latency).
"cmp/jae/cmp/jne/mov" will of course be fused into 3 micro-ops. But don't you mean "cmp/jae/cmp/je/mov"? I'm assuming second compare is a NULL check (or at least that instructions are ordered that way second branch is practically never taken). I think that also takes 2 cycles (both branches execute on same clock cycle + mov), but not sure how fused predicted-not-takens behave.
L3 miss for that mov, well... might well be 200 cycles.
Ah yeah, I wasn't sure if fusion was going to happen. You're probably right in macro-op terms; sorry about that.
The first compare is a bounds check against the array backing the pool, and the second compare is against the type field on the interface, not a null check. Golang interfaces are "fat pointers" with two words: a data pointer and a vtable pointer. So the first cmp is against a register, while the second cmp is against memory, data dependent on the register index. The address of the cmp has to be at least checked to determine if it faults, so I would think at least some part of it would have to be serialized after the first branch, making it slower than the version without the type guard.
So your hand-rolled one will not have the typing overhead we are discussing, but it will have two much worse issues:
1. sync.Pools have thread-local storage, something your own pool will not have.
2. sync.Pools are GC-aware, meaning that if the allocator is having trouble it can drain "free" pool objects to reclaim memory. Your custom pool will not have this integration.
I have a feeling that the performance you gain by not type-checking you will lose by not having #1.
I think you are missing my point. So I'll restate it. sync.Pool does not help with GC issues compared to the JVM because the JVM also has object pools, further those object pools are actually better for the low latency case because the language does not force them to make a choice between dynamic type checks and specific use abstractions.
[edit] As pcwalton points out. My whole argument is actually null and void due to type erasure...doh.
Yeah, and they can collect the unreachable objects in the array when no references exist to the array itself.
From the QCon talk linked to in the slides it sounds like the Go GC is benefiting from the reduced number of objects being allocated and the fact that those objects can never move. Makes things a lot simpler if you can get away with it, but I can imagine a reference into an array, and the inability to move objects could combine in bad ways if you're unlucky.
In terms of what? Latency? I think Go is down to something like 10ms, and from what I understand, pressure on the GC can be relieved by using things such as `sync.Pool`.