This screams "Not Invented Here" syndrome. Massive yikes at the diagram showing TCP in software in the OSI model. There have been hardware-accelerated TCP stacks for decades. They're called TCP Offload Engines, they work great, and have done for ages. Why are you building one and giving it a new name? It seems like an enormous amount of work, and you would've gotten 90+% of the gains by just implementing a standard TOE.
I guess the only good reason I can think of to do this yourself is that they left it so late that all the companies that were good at this got bought (Solarflare, Mellanox, etc.).
This is a bad take. Here's my take:
1. Standard IP and Ethernet are physically acceptable for their use case.
2. TCP/IP is optimized in a number of areas for unreliable networks.
3. Their clusters are not unreliable.
4. Their servers already offer programmable hardware acceleration, so they can remove aspects of TCP/IP that increase latency or might negatively affect throughput.
5. They get to continue to purchase the cheaper IP switches and retain their existing hardware without retooling everything.
As an afterthought, they publish this for marketing/engineering pull for people who like to optimize (do engineering) for specific situations, while supporting the ROI and keeping cost down.
> This is a bad take. Here's my take: (...) 5. They get to continue to purchase the cheaper IP switches and retain their existing hardware without retooling everything.
OP's TCP offload engines do not require any retooling at all, if you consider stuff like buying IP switches not retooling. All you need to do is basically buy a network card.
Also, if you read the article you'll eventually stumble upon the bar chart where they claim that the one-way write latency of TTPoE is only 0.7E-6 seconds faster than existing tech like infiniband, and it's also the only somewhat hard data they show. Does that justify the investment of developing their whole software+hardware stack?
I'm sure the project was fun and should look great on a CV, but overall it doesn't look like it passes the smell test.
They don't need the generality of full TCP for their cluster, so they're using a tweaked, incompatible subset. One that's been optimized for better performance on cheaper hardware than you can get with TCP h/w offload. In the offload case you're still paying the latency, wire protocol overhead, and efficiency costs of full TCP.
(Disclaimer: I work at Tesla, not related to this group, opinions on public info only)
And you're still paying the other performance and efficiency costs of TCP. ROCE also isn't a magic bullet.
Every system has a cost and tradeoffs. Just because someone took an unusual path doesn't mean that they were wrong. And the larger and more specialized their use case is, the less likely that a generic solution is the best match.
> And you're still paying the other performance and efficiency costs of TCP. ROCE also isn't a magic bullet.
Tesla's own charts also show RoCE achieving one-way write latencies in the single-digit microsecond range. If that doesn't qualify as a magic bullet, what does that say about TTPoE?
"Instead of using typical supercomputer networking solutions like Infiniband, Tesla chose to adapt Ethernet to their needs with a modified transport layer."
So, they need to compare it to Infiniband, not TCP, and definitely not software TCP. And they need to explain how/if it works with standard huge capacity switches (which is at least a reason to prefer TCP over Infiniband).
There could be reasons to build this, AWS have something, but for Tesla to build their own stinks of NIH bad.
they are purpose-building hardware for their specific application. debugging corner cases and making this robust is going to take them a decade. given that nobody else is interested in this non-standard solution, they don't have the benefit of the community debugging it and improving on it in open source.
appears to me to be a vanity effort as is the whole dojo project.
I have built dozens of different FPGA-based cameras in the past. There is the GigE Vision protocol: https://en.m.wikipedia.org/wiki/GigE_Vision TCP is used for “normal” connections and UDP for low-latency video data. Such a system could be used for other low-latency applications as well.
I do industrial controls, so I'm very familiar with this. IP is a lot of overhead that doesn't really do anything for the user in a tightly defined automation network local to a machine. EtherCAT goes a step lower and drops IP in favor of just sending Ethernet frames of type 0x88A4. It uses a unique ring topology and does not use traditional switching or repeaters; instead, the IO devices contain a special controller called the ESC, the EtherCAT Subordinate Controller. The master only needs a standard Ethernet controller. You can get cycle times in the tens of microseconds, allowing for up to 50 kHz update rates on IO devices. This lets you do high-performance servo motor control, where you close the current loop in the master CPU over 100 Mbit Ethernet and easily reach 10+ kHz update rates.
With FPGAs using commodity SFP or Ethernet PHY's you can certainly build stuff that runs circles around traditional Ethernet and associated overhead from protocols like IP.
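To make the "just Ethernet frames" point concrete, here's a minimal sketch of a raw Ethernet II frame carrying the 0x88A4 EtherType. (This is only the outer frame; real EtherCAT puts its own datagram headers inside the payload, which I'm not reproducing here.)

```python
import struct

ETHERTYPE_ETHERCAT = 0x88A4  # registered EtherType for EtherCAT

def build_raw_frame(dst_mac: bytes, src_mac: bytes, payload: bytes) -> bytes:
    """Build a bare Ethernet II frame: no IP/UDP/TCP headers at all."""
    assert len(dst_mac) == 6 and len(src_mac) == 6
    header = dst_mac + src_mac + struct.pack("!H", ETHERTYPE_ETHERCAT)
    # Minimum Ethernet payload is 46 bytes; pad like a NIC would.
    if len(payload) < 46:
        payload = payload + b"\x00" * (46 - len(payload))
    return header + payload

frame = build_raw_frame(b"\xff" * 6, b"\x02\x00\x00\x00\x00\x01", b"hello-io")
print(len(frame))                       # 60: 14-byte header + 46-byte padded payload
print(hex(frame[12] << 8 | frame[13]))  # 0x88a4
```

Actually putting this on the wire on Linux takes a raw `AF_PACKET` socket and root privileges; the point is there's no IP/UDP machinery at all between you and the PHY.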
This kind of computing must be a different kind of world than the one I work in. 80 microseconds of latency seems high to me when infiniband can do single digit latency with unreliable datagrams, which turn out to be mostly reliable due to the credit system.
Well, whether it matters depends on the workload: IB is basically remote DMA so if you need to pick and poke remote data I guess it'll work as another NUMA tier.
But for AI training, where you're simply shuffling around large stacks of matrices, my guess is latency constraints weaken.
Not really familiar with this space but I think the entire Dojo/DIY strategy was kicked off because Elon wanted to not get cornered on supply or cost by nvidia. And infiniband is an nvidia technology, so they wouldn’t use that simply from strategic POV.
Are there other technologies they could have used?
Also, the 80us is supposed to be the worst case, where typical is supposed to be <10us. Again not knowing anything about infiniband, what’s the typical perf? I tried to google but the people who are talking about it are in the know in ways I’m not.
OP is referring to "Credit based flow control", which is a way to ensure a sender does not overwhelm a receiver with more data than it can handle.
Usually, this is line-rate, but if the other side is slow for whatever reason (say the consumer is not draining data), you wouldn't want the sender to continue sending data.
If you also have N hosts sending data to 1 host, you need some way of distributing the bandwidth among the N hosts. That's another scenario where the credit system comes in. Think of it as admission control for packets, guaranteeing that no packets are lost. Congestion control is a looser form of admission control that tolerates lossy networks by retransmitting packets should they be lost.
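A minimal sketch of the idea, with invented names and capacities: the receiver only hands out as many credits as it has free buffer space, so a sender holding a credit can never cause a drop.

```python
from collections import deque

class CreditReceiver:
    """Grants credits up to its buffer capacity; a consumer drains the buffer."""
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.buffer = deque()
        self.outstanding = 0   # credits handed out but not yet used

    def grant(self, want: int) -> int:
        free = self.capacity - len(self.buffer) - self.outstanding
        granted = min(want, max(free, 0))
        self.outstanding += granted
        return granted

    def receive(self, pkt):
        # A sender may only call this with a credit in hand, so this never overflows.
        self.outstanding -= 1
        self.buffer.append(pkt)

rx = CreditReceiver(capacity=4)
# Two senders each want to push 3 packets; credits cap the total in flight at 4.
c1, c2 = rx.grant(3), rx.grant(3)
for i in range(c1):
    rx.receive(("s1", i))
for i in range(c2):
    rx.receive(("s2", i))
print(c1 + c2, len(rx.buffer))   # 4 4 -- never more than capacity
```

When the consumer drains the buffer, the receiver can grant fresh credits, which is exactly the "line rate unless the consumer stalls" behavior described above.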
They'd respond to your kind words, but there are two faulty cables in their token ring network, and as such, no redundant paths for the beacon frame to get through.
It's not, really. It's been a while since I've checked pricing so my data might be old, but an IB switch is in the same ballpark as an ethernet switch with the same port speed. Same for HCA's.
There's no analogue in the Infiniband world to dirt cheap 1GbE RJ-45 switches though.
> an IB switch is in the same ballpark as an ethernet switch with the same port speed
And both price tags will make Elon's "someone's scamming me with a 'you're an enterprise customer' surcharge" sense tingle. The price tags for anything enterprise networking related are seriously inflated, and I would not be surprised if just making your own NICs and switches is cheaper once you hit a certain deployment size.
Networking has never been so cheap at the highest end. Look at the road we've been through in the last 8 years, rapidly going from 40 to 100 to 400 (200 was somewhat of a dud, 400 came too early) to 800 to 1600 Gbps. It's amazing.
I'm having trouble feeding things at 400GB/s (not a typo, it's gigabyte/s) per H100 box.
On the other hand, silicon has never been so cheap.
The hardest things would've been the DDR4 and PCIe interfaces. But they're using standard, last-generation interfaces, so I'm sure they got a good discount on all that IP, and it hardly cost them any man-hours to integrate. Tesla might even have already had the licenses and IP set up, as they make other ASICs.
I didn't do a budget or anything, but even at 10Ks of units, I could see how this could save money, or at least not lose money. That's assuming a comparable IB network card is ~$1000, which I also didn't price.
And there could be other potential cost offsetting features, like power savings.
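Back-of-the-envelope only, with every number invented for illustration: the build-vs-buy break-even is just the one-time development cost divided by the per-unit saving.

```python
# All figures are hypothetical placeholders, not Tesla's actual costs.
nre_cost = 50e6          # assumed one-time ASIC/firmware development spend
commodity_nic = 1000.0   # assumed price of a comparable IB/RoCE NIC
custom_nic = 300.0       # assumed marginal cost of the in-house NIC

def break_even_units(nre: float, buy: float, build: float) -> float:
    """Units at which cumulative per-unit savings repay the NRE."""
    return nre / (buy - build)

print(int(break_even_units(nre_cost, commodity_nic, custom_nic)))  # 71428
```

Under these made-up numbers you'd need ~70K units before the custom NIC pays for itself, which is why deployment size is the whole argument.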
I've replied in a thread, and what I replied to above was about the enterprise tax and how it's surely cheaper to do your own; not about whether the original article is about enterprise.
> > I would not be surprised if just making your own NICs and switches is cheaper once you hit a certain deployment size.
> Ideas you’ll never hear at Google or meta
You'd be surprised. Google has a very strong tradition of "not-invented here" which extends to some of our production networking gear as well.
To be fair, at the time, some of this was justified because the available devices on the market couldn't support our use cases back then.
Per section 3.2 of the 2013 B4 paper [0]:
> Even so, the main reason we chose to build our own hardware was that no existing platform could support an SDN deployment, i.e., one that could export low-level control over switch forwarding behavior. Any extra costs from using custom switch hardware are more than repaid by the efficiency gains available from supporting novel services such as centralized TE.
You couldn't have picked a better/worse duo in tech to be wrong on such an assertion.
Adding to sibling comment about Google, Meta[1] built 2 large-scale production training clusters for science: one with Infiniband, the other one with a custom RDMA over RoCE fabric.
> Custom designing much of our own hardware, software, and network fabrics allows us to optimize the end-to-end experience for our AI researchers while ensuring our data centers operate efficiently.
> With this in mind, we built one cluster with a remote direct memory access (RDMA) over converged Ethernet (RoCE) network fabric solution based on the Arista 7800 with Wedge400 and Minipack2 OCP rack switches.
Google, Meta and Netflix are among the most obsessive on optimizing their infrastructure - it's bold to assume they haven't looked at their COTS network gear and thought "hmmm..."
I don't believe the pricing is in the same ballpark.
There's also other differences, such as port counts.
AFAICT Spectrum switches at 400Gbps have up to 128 ports whereas equivalent Infiniband NDR Quantum only have 64 [0].
When building clusters of 32K+ GPUs the network cost, power, transceivers etc start to add up.
At least when I last looked into it, it was less about price and more about availability. NICs are easy to get, but the switches had a 50+ week lead time. I'm sure it has come down now, but at least for my business, we are just focused on Ethernet. We can do 128 NICs @ 400G into a single Dell Z9864F using Ethernet. Few will need proprietary solutions for this, as there is lower-hanging fruit.
This is all technically impressive but was it all technically necessary? Was infiniband really just not good enough? All this R&D for a custom protocol and custom NICs seems to just be a massive flex of Tesla's engineering muscle.
Infiniband suppliers charge crazy prices due to having little competition. It might actually be cheaper for them to design their own than to pay the Infiniband tax.
And supply chain independence. I've heard that some GPU clouds are delayed because their Infiniband hardware was delayed due to the Israel–Hamas war. Optimally you probably want to avoid critical hardware that's being manufactured in a high risk of disruption zone.
It's not only about being "good enough" but also about reliability and maintenance, new protocols and hardware may take time to mature while other solutions are already there. Ah, wait, Tesla doesn't care about those kind of things too much...
It's also a bit odd that they do not implement congestion control. Congestion control is fundamental unless you only have point-to-point data transfers, which is rarely the case. An all-reduce operation during training requires N-to-1 data transfers. In these scenarios the sender needs to control its transfer rate so as to not overwhelm not just the receiver but also the network; if this is not done, it will cause congestion collapse (https://en.wikipedia.org/wiki/Network_congestion#:~:text=ser...).
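A toy fluid model of the N-to-1 (incast) case: when every sender blasts at line rate, most of the offered load is dropped at the receiver's link, and those dropped bytes have already wasted upstream capacity. (Made-up numbers; units are Gbps.)

```python
def incast(senders: int, per_sender_rate: float, link_capacity: float):
    """Offered load vs. delivered load vs. drops on the single receiver link."""
    offered = senders * per_sender_rate
    delivered = min(offered, link_capacity)
    dropped = offered - delivered
    return offered, delivered, dropped

# 8 workers each blast at a full 400G toward one receiver on a 400G link.
offered, delivered, dropped = incast(8, 400.0, 400.0)
print(offered, delivered, dropped)   # 3200.0 400.0 2800.0

# With admission control, each sender is held to capacity/N: no drops at all.
offered2, delivered2, dropped2 = incast(8, 400.0 / 8, 400.0)
print(offered2, delivered2, dropped2)   # 400.0 400.0 0.0
```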
Current public SOTA seems to be “no congestion control”
> We proceeded without DCQCN for our 400G deployments. At this time, we have had over a year of experience with just PFC for flow control, without any other transport-level congestion control. We have observed stable performance and lack of persistent congestion for training collectives.
I probably shouldn't be commenting because I don't have any experience at this level, but given it's a closed system where they control supply and demand it seems they could manage away most congestion issues with scheduling/orchestration. They still have a primitive flow control in the protocol and it seems like you could create something akin to a virtual sliding window just by instrumenting the retransmits.
But now I am curious what the distribution of observed window sizes is in the wild.
Edit: I'd bet the simpler protocol is more vulnerable to various spoofing attacks though.
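The "virtual sliding window from instrumenting retransmits" idea might look something like this AIMD-flavored sketch (entirely hypothetical, not anything TTPoE actually does): grow the window on clean ACKs, halve it whenever a retransmit fires.

```python
class RetransmitWindow:
    """Hypothetical sliding window driven only by retransmit observations:
    additive increase on a clean ACK, multiplicative decrease on a retransmit."""
    def __init__(self, initial: int = 8, maximum: int = 64):
        self.window = initial
        self.maximum = maximum

    def on_ack(self):
        self.window = min(self.window + 1, self.maximum)   # additive increase

    def on_retransmit(self):
        self.window = max(self.window // 2, 1)             # multiplicative decrease

w = RetransmitWindow()
for _ in range(10):
    w.on_ack()
print(w.window)       # 18
w.on_retransmit()
print(w.window)       # 9
```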
In principle, with perfect knowledge of flows at any given instant, you can assign credits/rate-of-transmission for each flow to prevent congestion. But, in practice this is somewhat nuanced to build, and there are various tradeoffs to consider: what happens if the flows are so short that coordinating with a centralised scheduler incurs a latency overhead that is comparable to the flow duration? There's been research to demonstrate that one can strike a sweet spot, but I don't think it's practical nor has it been really deployed in the wild. And of course, this scheduler has to be made reliable as it's a single point of failure.
Such ideas are, however, worth revisiting when the workload is unique enough (in this case, it is) and the performance gains are big enough...
Maybe the protocol could have arbitration built in? If one was clever you could actually have the front of the packet set a priority header, and build the collision detection/avoidance right into the header.
Multiple parties communicate at the same time? The lower priority number could electrically pull the voltage low, dominating the transmission.
That way, priority messages always get through with no overhead or central communication required.
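This is essentially how CAN bus arbitration works: a dominant 0 electrically overrides a recessive 1, so contenders that transmit a recessive bit while the bus reads dominant simply drop out. A simulation of that wired-AND contest (a sketch of the idea, not part of TTPoE):

```python
def arbitrate(ids, bits: int = 8) -> int:
    """CAN-style bitwise arbitration on a wired-AND bus: 0 is dominant.
    A contender drops out the first time it sends a recessive 1 while the
    bus reads a dominant 0, so the lowest ID always wins -- with no
    central arbiter and no wasted bus time."""
    contenders = set(ids)
    for bit in range(bits - 1, -1, -1):        # MSB first
        sent = {i: (i >> bit) & 1 for i in contenders}
        bus = min(sent.values())               # wired-AND: any 0 pulls the bus low
        contenders = {i for i in contenders if sent[i] == bus}
    assert len(contenders) == 1
    return contenders.pop()

print(hex(arbitrate([0x45, 0x23, 0x7F])))   # 0x23 -- lowest priority number wins
```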
Yep, such ideas have been around. But congestion is a fundamental problem. Admission control is the only way to ensure there is no congestion collapse.
The technical issue is that you would need global arbitration to ensure that the _goodput_ (useful bytes delivered) is optimal. With training across 32k GPUs and more these days, global arbitration to ensure the correct packets are prioritised is going to be very difficult. If you are sending more traffic than the receiver's link capacity, packets _will_ get dropped, and it's suboptimal to transmit those dropped packets into the network as they waste link capacity elsewhere (upstream) within the network.
You are right, and I realized that a few minutes after posting. So tell me, why is Tesla not doing straight PCIe then, if that's the actually good/cheap way? It makes sense to me, but that's more of a feeling.
All this to...train a model? Having worked with automakers that agonize over pennies on a component how does Tesla amortize this much expense, if they even achieve full self-driving at all?
> Having worked with automakers that agonize over pennies on a component how does Tesla amortize this much expense, if they even achieve full self-driving at all?
Or Elon is using the resources of one company to do work for another company (?):
If they are so concerned with low latency, how come they are wasting an entire roundtrip (the "OPEN + OPEN/ACK") before sending any data?
I mean, in TCP it's not allowed (even though, super-theoretically, it's not completely forbidden) to carry a payload in the initial TCP SYN. If you're so latency-obsessed that you create your own protocol, that's the first thing I'd address.
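For scale, here's a pure propagation model of what the handshake costs before the first data byte lands (ignoring serialization and processing time). Mainline TCP addresses the same problem with TCP Fast Open (RFC 7413), which lets data ride the SYN on repeat connections.

```python
def first_byte_latency(one_way_us: float, handshake_round_trips: int) -> float:
    """Microseconds until the first data byte arrives: each handshake round
    trip costs 2 one-way delays before the data packet's own one-way trip."""
    return (2 * handshake_round_trips + 1) * one_way_us

rtt_handshake = first_byte_latency(5.0, 1)   # OPEN + OPEN/ACK, then data
no_handshake  = first_byte_latency(5.0, 0)   # data rides the first packet
print(rtt_handshake, no_handshake)   # 15.0 5.0
```

With a 5 us one-way delay (an assumed figure in the range the article discusses), the handshake triples time-to-first-byte for a short transfer, though long-lived connections amortize it to nothing.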
Interesting! No mention of UDP, or the application being run, or the GPU/TPUs on the nodes, so it'll have to be a mystery as to how much bang for their buck they're getting with this particular bit of work.
What's disappointing is that it's impossible to deploy a new protocol on the Internet because of all the middleboxes that drop packets that aren't ICMP, TCP, or UDP.
UDP means that upper layers have to figure out if the data is shit or not. To do that in hardware limits your data types, flows, and development. It's far simpler to have a "reliable" data transport system than it is to deal with a lossy protocol in hardware.
It doesn't say they can't aggregate links. I wouldn't say the world has moved way past that yet, but Tesla probably doesn't want to be dependent, like everyone else but Apple, on Nvidia (InfiniBand).
> like everyone else but Apple, on Nvidia (InfiniBand)
Google also does not depend on NVIDIA, thx to TPUs. Rents NVIDIA GPUs to external customers - sure, it's a nice side business, but internally TPUs are king and there's no dependency on NVIDIA for that.
From what I know, Meta's AI chips are used in production today, but they are made for recommendation tasks, which is a very different kind of AI than the GPTs and LLMs for which they still rely on GPUs.
It's very hard to respond to a comment like that, since there's no specifics, just a plain disagreement.
On my side, I would like to point out that today's HN thread ([1]) discusses a paper, GameNGen ([2]), whose Doom-playing diffusion model was trained on TPUs.
I don't see a dependency on NVIDIA there.
If there's a more specific rebuttal to my original statement, please, don't hesitate to state it.