What Apple calls a GPU core seems to be roughly the same as what Nvidia calls a “stream multiprocessor”.
For example a 1080 GTX GPU has 20 stream multiprocessors (SM), each containing 128 cores, each of which supports 16 threads.
Meanwhile Apple describes the M1 GPU as having 8 cores, where “each core is split into 16 Execution Units, which each contain eight Arithmetic Logic Units (ALUs). In total, the M1 GPU contains up to 128 Execution units or 1024 ALUs, which Apple says can execute up to 24,576 threads simultaneously and which have a maximum floating point (FP32) performance of 2.6 TFLOPs.”
So one option to get a single number for a rough comparison is to count threads. The 1080 GTX supports 40,960 threads while the M1 supports 24,576 threads.
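Back-of-the-envelope in Python, just multiplying out the figures quoted above (nothing here beyond that arithmetic):

    # Thread counts, using only the figures quoted above.
    gtx1080_threads = 20 * 128 * 16        # 20 SMs x 128 cores x 16 threads = 40,960
    m1_alus = 8 * 16 * 8                   # 8 cores x 16 EUs x 8 ALUs = 1,024 ALUs
    m1_threads = 24_576                    # Apple's quoted simultaneous-thread figure
    print(gtx1080_threads, m1_alus, m1_threads / m1_alus)   # 40960 1024 24.0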
There’s obviously a lot more to a GPU: clock speeds vary, ALUs can have different capabilities, memory bandwidth differs, etc. But at least counting threads gives a better idea of the processing bandwidth than talking about cores.
Just for clarification: the 1080 has 20 SMs with 128 FPUs each. Each FPU can perform 2 FLOPs per cycle (fused multiply-adds). Combined with the frequency of 1607 MHz, we land on the advertised 8.2 TFLOP/s.
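Spelled out in Python (same numbers as above, nothing else assumed):

    # Theoretical FP32 throughput of the GTX 1080 from the numbers above.
    sms = 20
    fpus_per_sm = 128
    flops_per_fpu_per_cycle = 2            # one fused multiply-add = 2 FLOPs
    clock_hz = 1607e6                      # 1607 MHz
    peak_tflops = sms * fpus_per_sm * flops_per_fpu_per_cycle * clock_hz / 1e12
    print(peak_tflops)                     # ~8.2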
The fact that each SM can support 1024 threads (that's the maximum block size of CUDA on that card) doesn't do much for the theoretical FLOPs. Only a fraction of those threads can be active at a time; the others are idling or waiting on their memory requests. This hides a lot of the I/O latency.
For sure. Just counting threads doesn't give anything like a complete picture of performance.
It's still somewhat interesting because threads are a low-level programming primitive. If you can come up with work for 40k simultaneous threads, you can use the GPU effectively. For some tasks this parallelization is obvious (an HD video frame has 2 million pixels and shading them independently is trivial), and of course often it's anything but.
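As a toy illustration of the "obvious" case, a NumPy sketch (not GPU code; the gamma "shader" is just a made-up example to show that each pixel is an independent work item):

    import numpy as np

    # A 1920x1080 frame: ~2 million pixels, each shaded independently.
    frame = np.random.rand(1080, 1920, 3).astype(np.float32)

    # A trivial "shader": gamma-correct every pixel. No pixel depends on any
    # other, so a GPU could assign one thread per pixel with no coordination.
    shaded = frame ** (1.0 / 2.2)

    print(frame.shape[0] * frame.shape[1])   # 2,073,600 independent work items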
If the architecture is vastly different, these comparisons become sort of meaningless, though. The ultimate performance is determined by all the tiny little bottlenecks, like how quickly the architecture can move data between different types of cores, memory, cache, etc.
Apple has always been really good at parallelism, which is why they get so much performance from less power consumption.
Yeah, it’s completely meaningless to compare but HN loves specs & seemingly hasn’t learned—after about 15 years of it being true—that direct spec comparisons are meaningless.
I see it with every major product announcement; the worst are usually Apple threads, but it's not confined to those.
Question from a GPU novice. I presume a thread is the individual calculation performed on one element of the vector? Can the 128 cores, or 20 SMs, perform different operations at the same time, or do all 24,576 threads perform the same operation at the same time, on a vector of data of length 24,576?
"128 * the-number-of-cores" of threads can make progress truly in parallel (at the same time).
24,576 threads (or however many; I didn't validate the number, and it depends on the occupancy, which depends on thread resource usage like registers, i.e. on the shader program code) is how many threads can be executed concurrently (as opposed to in parallel), as in, how many of them can simultaneously reside on the GPU. A subset of those at any time are actually executed in parallel; the rest are idle.
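If it helps, here is a rough sketch of that occupancy dependence in Python; every limit below is a made-up round number for illustration, not the spec of any particular GPU:

    # Illustrative occupancy arithmetic (all limits are made-up round numbers).
    max_resident_threads_per_core = 2048   # cap on concurrently resident threads
    register_file_per_core = 65536         # registers available per core
    regs_per_thread = 64                   # set by the shader program's register usage

    # The register file caps how many thread states can be resident at once.
    resident_threads = min(max_resident_threads_per_core,
                           register_file_per_core // regs_per_thread)
    occupancy = resident_threads / max_resident_threads_per_core
    print(resident_threads, occupancy)     # 1024 0.5  -> half occupancy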
You can think of this situation as follows using an analogy with a CPU and an OS:
1. 128 * the-number-of-cores is the number of CPU cores(*1)
2. 24,576 threads is the number of threads in the system that the OS is switching between
Major differences with the GPU:
3. On a CPU, a context switch (getting a thread off the core, waking up a different thread, restoring the context, and proceeding) takes about 2,000 cycles. On the GPU _from the analogy_, that kind of thread switching takes ~1-10 cycles, depending on the exact GPU design and various other details.
4. In the CPU/OS world, the context switching and scheduling on the OS side is done mostly in software, as the OS is indeed software. In the GPU's case, the scheduler and all the switching are implemented as fixed-function hardware finely permeating the GPU design.
5. In the CPU/OS world, those 2,000 cycles per context switch are so much larger than a round trip to DRAM for a load instruction that missed in all caches (about 400-800 cycles or so, depending on the design) that the OS never switches threads to hide the latencies of loads; it's pointless. As far as performance is concerned (as opposed to maintaining the illusion of parallel execution of all programs on the computer), thread switching is used to hide the latency of I/O: non-volatile storage access, network access, user input, etc., which take millions of cycles or more, so it makes sense.
In the GPU world, the switching is so fast that the hardware scheduler absolutely does switch from thread to thread to hide the latencies of loads (even the ones hitting in half of the caches, if that happens); in fact, hiding these latencies and thus keeping the ALUs fed is the whole point of this basic design of pretty much all programmable GPUs that there ever were. (A rough sketch of the arithmetic follows after the list.)
6. In the real-world CPU/OS case, the threads that aren't running at the time reside (their local variables, etc.) in the memory hierarchy; technically some of that ends up in caches, but ultimately the bulk of it on a loaded system is in system DRAM. On a GPU, or I suppose by now we have to say, on a traditional GPU, these resident threads (their local variables, etc.) reside in on-chip SRAM that is part of the GPU cores (not in one chunk off to the side, but close to the execution units, in many small chunks, one per core). While the amount of DRAM (CPU/OS) is a) huge, gigabytes, and b) easily configurable, the amount of thread state the GPU scheduler is shuffling around is typically measured in hundreds of KBs per GPU core (so on the order of "a few MBs" per GPU), and the equally sized SRAM storing this state is completely hardwired into the silicon design of the GPU and not configurable at all.
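Here's the rough sketch promised in point 5 (Python; every number below is an illustrative assumption, not a measurement of any real GPU):

    # How warp switching hides a DRAM miss (all numbers are illustrative).
    miss_latency_cycles = 400        # assumed round trip for a load missing all caches
    resident_warps_per_core = 48     # assumed warps the hardware scheduler can pick from
    alu_instrs_between_loads = 10    # assumed arithmetic intensity of the shader

    # While one warp waits on its miss, the other resident warps can supply
    # roughly this many cycles of independent ALU work to keep the core busy:
    available_work = (resident_warps_per_core - 1) * alu_instrs_between_loads
    print(available_work, available_work >= miss_latency_cycles)   # 470 True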
Hope that helps!
Footnotes:
(*1) A better analogy would be not "number of CPU cores" but "number-of-CPU-cores * SMT(HT) * number-of-lanes-in-AVX-registers", where number-of-lanes-in-AVX-registers is basically "AVX-register-width / 32" for FP32 processing, which yields about ~8, give or take 2x depending on the processor model. Whether to include the SMT(HT) multiplier (2) in this analogy is also murky; there is an argument to be made for yes and an argument to be made for no, and it depends on the exact GPU design in question.
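Spelled out for a hypothetical 8-core AVX2 CPU (illustrative numbers, and also where the "128" in the next paragraph comes from):

    # Footnote (*1) as arithmetic, for a hypothetical 8-core AVX2 CPU.
    cpu_cores = 8
    smt_ways = 2                     # hyper-threading; debatable whether to count it
    fp32_lanes = 256 // 32           # AVX register width / 32 = 8 FP32 lanes
    print(cpu_cores * smt_ways * fp32_lanes)   # 128 "GPU-style" lanes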
Also, your "128 CUDA cores" of the Skylake variety run at higher frequencies and work off of much bigger caches, so they are faster (in a serial manner)...
...until they are slower, because GPU's latency hiding mechanism (with occupancy) hides load latencies very well, while CPU just stalls the pipeline on every cache miss for ungodly amounts of time...
...until they are faster again when the shader program uses a lot of registers and GPU occupancy drops to the floor and latency hiding stops hiding that well.
> ...until they are slower, because GPU's latency hiding mechanism (with occupancy) hides load latencies very well, while CPU just stalls the pipeline on every cache miss for ungodly amounts of time...
Is the GPU latency hiding mechanism equivalent to SMT/Hyperthreading, but with more threads per physical core? Or is there more machinery?
Also, how akin are GPU "stream multiprocessors"/cores to CPU ones at the microarchitectural level? Are they out-of-order? Do they do register renaming?
As you state, GPU latency hiding is basically equivalent to hyper threading, just with more threads per core. For example, for a 'generic' modern GPU, you might have:
A "Core" (Apple's term) / "Compute Unit" (AMD) / "Streaming Multiprocessor" (Nvidia) / "Core" (CPU world). This is the basic unit that gets replicated to build smaller/larger GPUs/CPUs
* Each "Core/CU/SM" supports 32-64 waves/simdgroups/warps (amd/apple/nvidia termology), or typically 2 threads (cpu terminology for hyperthreading). ie, this is the unit that has a program counter, and is used to find other work to do when one thread is unavailable. (this blurred on later Nvidia parts with Independent Thread Scheduling.)
* The instruction set typically has a 'vector width'. 4 for SSE/NEON, 8 for AVX, or typically 32 or 64 for GPUs (but can range from 4 to 128)
* Each Core/CU/SM can execute N vector instructions per cycle (2-4 is common in both CPUs and GPUs). For example, both Apple and Nvidia GPUs have 32-wide vectors and can execute 4 vectors of FP32 FMA per cycle. So 128 FPUs total, or 128 FMAs (256 FLOPs) per cycle. Each of these FPUs is what Nvidia calls a "Core", which is why their core counts are so high.
In short, the terminology exchange rate is 1 "Apple GPU Core" == 128 "Nvidia GPU Cores", on equivalent GPUs.
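The same exchange rate as arithmetic (using the "generic" numbers above):

    # Per-core throughput of the "generic" GPU described above.
    vector_width = 32               # threads per wave/simdgroup/warp
    vector_fmas_per_cycle = 4       # vector FMA instructions issued per cycle
    fpus_per_core = vector_width * vector_fmas_per_cycle   # 128 FPUs
    flops_per_cycle = fpus_per_core * 2                    # FMA = 2 FLOPs
    print(fpus_per_core, flops_per_cycle)   # 128 Nvidia "cores", 256 FLOPs/cycle per Apple core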
I'll leave your first question to the other comment here from frogblast, as I really battled with how to answer it well, given my limited knowledge and being elbow deep in an analogy, after all. I got writer's block, and frogblast actually answered something :D
> how akin GPUs "stream multiprocessors"/cores are to CPUs ones at the microarchitectural level?
I'd say, if you want to get a feel for it in a manner directly relevant to recent designs, then reading through [1], [2], subsequent conversation between the two, and documents they reference should scratch that curiosity itch well enough, from the looks of it.
If you want a much more rigorous conversation, I could recommend the GPU portion of one of the lectures from CMU: [3], it's quite great IMO. It may lack a little bit in focus on contemporary design decisions that get actually shipped by tens of millions+ in products today and stray to alternatives a bit. It's the trade-off.
> Are they out-of-order?
Short answer: no.
GPUs may strive to achieve "out of order" by picking a different warp entirely and making progress there, completely circumventing any register data dependencies and thus any need to track them, achieving a similar end objective in a drastically more area- and power-efficient manner than Tomasulo's algorithm would.
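A toy model of that idea in Python (pure illustration: the latency and warp counts are made up, and real schedulers are fixed-function hardware, not loops like this):

    # Toy scheduler: each cycle, issue from any warp not waiting on memory.
    MISS_LATENCY = 8                         # made-up stall length, in cycles
    warps = [{"id": i, "ready_at": 0, "issued": 0} for i in range(4)]

    trace = []
    for cycle in range(16):
        ready = [w for w in warps if w["ready_at"] <= cycle]
        if not ready:
            continue                         # nothing runnable: a bubble
        w = ready[0]
        w["issued"] += 1
        trace.append((cycle, w["id"]))
        if w["issued"] % 3 == 0:             # pretend every 3rd instruction misses
            w["ready_at"] = cycle + MISS_LATENCY

    print(trace)   # issue slots hop between warps; no within-warp reordering needed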