The DGX A100 was $200k at launch, and I've seen DGX H100s in the mid-$300k range. Those are 8-GPU systems, so you'd need 32 of them, and each one will definitely cost more, plus networking. A super low estimate would be $500k each, for $16M total. But considering it's moving from 96GB to 480GB of RAM per GPU, it might be more like $1.5M per 8-GPU system; round that up to say $50M.
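To make the arithmetic above explicit, here's a quick sketch using the same guessed price points (all dollar figures are the comment's rough estimates, not real quotes):

```python
# Rough cost sketch for a 256-GPU cluster built from 8-GPU systems.
gpus_needed = 256
systems = gpus_needed // 8        # 32 eight-GPU systems

low_per_system = 500_000          # "super low" per-system estimate
high_per_system = 1_500_000       # guess accounting for the 5x memory jump

low_total = systems * low_per_system    # $16M
high_total = systems * high_per_system  # $48M, rounded up to ~$50M
print(systems, low_total, high_total)
```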
And at 1/8th the power per GB: 700 W / 96 GB / 8 × 480 GB comes to around 450 W per GPU, and roughly 115 kW for all 256.
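Redoing that scaling step by step (starting from 700 W for 96 GB, assuming the claimed 1/8th power per GB, and scaling up to 480 GB per GPU):

```python
# Back-of-envelope per-GPU power under the claimed efficiency gain.
watts_per_gb_old = 700 / 96              # current density, ~7.3 W/GB
watts_per_gb_new = watts_per_gb_old / 8  # assumed 1/8th the power per GB
watts_per_gpu = watts_per_gb_new * 480   # 437.5 W, "around 450 W per"
cluster_kw = watts_per_gpu * 256 / 1000  # ~112 kW, ballpark of the 115 kW quoted
print(watts_per_gpu, cluster_kw)
```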
What does this mean for the AI race? For example, what if a newish company (newer than Google/Facebook/Microsoft/etc.) like Anthropic, Scale, Perplexity, or Stability is able to scrape together $5B USD in funding and spends its hardware budget on these things? Say they can buy $1B worth and spend the rest on hackers and operating expenses (I don't know if that's realistic), so maybe they could purchase and operate around 20 of them. Say they spend six months doing experimental things and then the next six months training their Tsar Model. If they follow the Chinchilla scaling laws and normal architectures, how good will these models be?
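A very rough Chinchilla-style sizing for that scenario can be sketched out. Every input here is an assumption: 20 clusters of 256 GPUs, a hypothetical ~1e15 FLOP/s of usable dense compute per GPU, 40% utilization, and six months of training; the only thing taken from Chinchilla is the rule of thumb C ≈ 6·N·D with D ≈ 20·N tokens.

```python
import math

# Assumed hardware budget (all numbers are guesses, not specs).
gpus = 20 * 256            # 20 clusters of 256 GPUs
flops_per_gpu = 1e15       # hypothetical per-GPU throughput (FLOP/s)
mfu = 0.40                 # assumed utilization
seconds = 180 * 86_400     # ~six months of training

budget = gpus * flops_per_gpu * mfu * seconds  # total training FLOPs

# Chinchilla rule of thumb: C = 6*N*D and D = 20*N, so N = sqrt(C / 120).
params = math.sqrt(budget / 120)
tokens = 20 * params
print(f"~{budget:.1e} FLOPs -> ~{params / 1e9:.0f}B params, "
      f"~{tokens / 1e12:.0f}T tokens")
```

Under those (generous) assumptions you land somewhere around a half-trillion-parameter model trained on roughly ten trillion tokens, but the answer swings linearly with utilization and per-GPU throughput.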
I have no expertise in GPU systems used for AI training, but would it be possible to buy a bunch of consumer cards and get the same performance?
Or is this not possible because consumer cards only go up to 40-ish GB of RAM, so models would not fit, or they'd be "swapping" like crazy and be slow?
Consumer cards only have PCIe 4.0 and at most 24GB of VRAM, and the only recent model with NVLink, the RTX 3090, can only be paired with exactly one other card; it doesn't scale beyond that. So you are limited to PCIe 4.0 x16 speeds.
The NVLink interconnect across all the GPUs is a huge part of it, and you cannot come even remotely close to that bandwidth with consumer hardware. Then the density of RAM relative to compute and power is huge: a single 4090 is 450 watts for 24GB, where this is 20x the memory for the same watts. Matching the memory with 4090s would take thousands of cards, 2.3 MW or so. At $0.14/kWh, that's something like $325/hour in power costs to run, not counting the additional cooling you are definitely going to need. And I'm sure there's inefficiency this doesn't cover, but 240V at 10,000+ amps for that?
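Here's a sketch of where numbers like these come from: matching the 256-GPU cluster's memory with 450 W / 24 GB consumer cards, then the hourly power bill and the current draw at 240 V.

```python
# Consumer-card equivalent of the 256 x 480 GB cluster, memory-wise.
total_memory_gb = 256 * 480             # cluster memory
cards = total_memory_gb / 24            # 5,120 RTX-4090-class cards
megawatts = cards * 450 / 1_000_000     # ~2.3 MW of GPU power alone

hourly_cost = megawatts * 1000 * 0.14   # ~$322/hr at $0.14/kWh
amps_at_240v = megawatts * 1_000_000 / 240  # ~9,600 A
print(cards, megawatts, hourly_cost, amps_at_240v)
```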