Tesla’s TTPoE at Hot Chips 2024: Replacing TCP for Low Latency Applications (chipsandcheese.com)
172 points by ksec on Aug 28, 2024 | hide | past | favorite | 91 comments


This screams "Not Invented Here" syndrome. Massive yikes at the diagram showing TCP in software in the OSI model. There have been hardware-accelerated TCP stacks for decades. They're called TCP Offload Engines, they work great, and have for ages. Why build one and give it a new name? It seems like an enormous amount of work, and you would've gotten 90+% of the gains by just implementing a standard TOE. The only good reason I can think of to do this yourself is that they left it so late that all the companies that were good at this got bought (Solarflare, Mellanox, etc.).


This is a bad take. Here's my take: 1. Standard IP and Ethernet are physically acceptable for their use case. 2. TCP/IP is optimized in a number of areas for unreliable networks. 3. Their clusters are not unreliable. 4. Their servers already offer programmable hardware acceleration, so remove the aspects of TCP/IP that increase latency or might negatively affect throughput. 5. They get to continue to purchase the cheaper IP switches and retain their existing hardware without retooling everything.

As an afterthought, they publish this for marketing/engineering pull for people who like to optimize (do engineering) for specific situations, while supporting the ROI and keeping cost down.


> This is a bad take. Here's my take: (...) 5. They get to continue to purchase the cheaper IP switches and retain their existing hardware without retooling everything.

OP's TCP offload engines don't require any retooling at all, if you don't count things like buying IP switches as retooling. All you need to do is basically buy a network card.

Also, if you read the article you'll eventually stumble upon the bar chart where they claim that the one-way write latency of TTPoE is only 0.7 µs faster than existing tech like InfiniBand, and it's also the only somewhat hard data they show. Does that justify the investment of developing their whole software+hardware stack?

I'm sure the project was fun and should look great on a CV, but overall it doesn't look like it passes the smell test.


They don't need the generality of full TCP for their cluster, so they're using a tweaked, incompatible subset. One that's been optimized for better performance on cheaper hardware than you can get with TCP h/w offload. In the offload case you're still paying the latency, wire protocol overhead, and efficiency costs of full TCP.

(Disclaimer: I work at Tesla, not related to this group, opinions on public info only)


> One that's been optimized for better performance on cheaper hardware than you can get with TCP h/w offload.

How many ≥10 Gbps chipsets that you'd find in a typical server do not have offload nowadays?

Further, once you're in the ≥50 Gbps card range you can often get ROCE, which helps with things like latency.


And you're still paying the other performance and efficiency costs of TCP. ROCE also isn't a magic bullet.

Every system has a cost and tradeoffs. Just because someone took an unusual path doesn't mean that they were wrong. And the larger and more specialized their use case is, the less likely that a generic solution is the best match.


> And you're still paying the other performance and efficiency costs of TCP. ROCE also isn't a magic bullet.

Tesla's own charts show ROCE also achieving one-way write latencies in the single-digit-microsecond range. If that doesn't qualify as a magic bullet, what does that say about TTPoE?


"Instead of using typical supercomputer networking solutions like Infiniband, Tesla chose to adapt Ethernet to their needs with a modified transport layer."

So, they need to compare it to Infiniband, not TCP, and definitely not software TCP. And they need to explain how/if it works with standard huge capacity switches (which is at least a reason to prefer TCP over Infiniband).

There could be reasons to build this, AWS have something, but for Tesla to build their own stinks of NIH bad.


agree 100% with this take.

they are purpose-building hardware for their specific application. debugging corner cases and making this robust is going to take them a decade. given that nobody else is interested in this non-standard solution, they don't have the benefit of the community debugging it and improving on it in open source.

appears to me to be a vanity effort as is the whole dojo project.


I assume they ignore other technologies and research because new shiny things give them visibility and therefore promotions.


I have built dozens of different FPGA-based cameras in the past. There is the GigE Vision protocol: https://en.m.wikipedia.org/wiki/GigE_Vision TCP is used for the "normal" connection and UDP for low-latency video data. Such a system could be used for other low-latency applications as well.


I do industrial controls, so I'm very familiar. IP is a lot of overhead that doesn't really do anything for the user in a tightly defined automation network local to a machine. EtherCAT goes a step lower and drops IP in favor of just sending Ethernet frames of EtherType 0x88A4. It uses a unique ring topology. It does not use traditional switching or repeaters; the IO devices contain a special controller called the ESC, the EtherCAT Subordinate Controller. The master only needs a standard Ethernet controller. You can get cycle times in the tens of microseconds, allowing for up to 50 kHz update rates on IO devices. This lets you do high-performance servo motor control, where you close the current loop in the master CPU over 100 Mbit Ethernet and easily reach 10+ kHz update rates.
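Quick sanity check on those cycle times (my arithmetic, using standard Ethernet framing overhead, not figures from any EtherCAT spec):

```python
# Back-of-envelope check of raw Ethernet frame timing at 100 Mbit/s.
# On the wire, each frame carries a 7-byte preamble + 1-byte SFD,
# the frame itself (64-byte minimum), and a 12-byte inter-frame gap.
PREAMBLE_SFD = 8
IFG = 12
LINE_RATE_BPS = 100e6

def frame_time_us(frame_bytes: int) -> float:
    """Time on the wire for one frame, in microseconds."""
    wire_bytes = PREAMBLE_SFD + max(frame_bytes, 64) + IFG
    return wire_bytes * 8 / LINE_RATE_BPS * 1e6

print(frame_time_us(64))   # minimal frame: ~6.7 us
print(frame_time_us(200))  # a small process-data frame: ~17.6 us
```

So even with a couple hundred bytes of process data per cycle, cycle times in the tens of microseconds are consistent with the wire math.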

With FPGAs using commodity SFP or Ethernet PHYs, you can certainly build stuff that runs circles around traditional Ethernet and the overhead of protocols like IP.


This kind of computing must be a different kind of world than the one I work in. 80 microseconds of latency seems high to me when infiniband can do single digit latency with unreliable datagrams, which turn out to be mostly reliable due to the credit system.


AFAICS the protocol can tolerate up to about 80 microseconds of latency.

The graph at the end shows they measured (one way) latency at 1.3 microseconds (compared with 2.0 for IB).


Also, PCIe is worth mentioning for its credit-based full reliability (in the absence of hardware failures, which are still signaled).


Well, whether it matters depends on the workload: IB is basically remote DMA so if you need to pick and poke remote data I guess it'll work as another NUMA tier.

But for AI training, where you're simply shuffling around large stacks of matrices, my guess is latency constraints weaken.


Not really familiar with this space but I think the entire Dojo/DIY strategy was kicked off because Elon wanted to not get cornered on supply or cost by nvidia. And infiniband is an nvidia technology, so they wouldn’t use that simply from strategic POV.

Are there other technologies they could have used?

Also, the 80us is supposed to be the worst case, where typical is supposed to be <10us. Again not knowing anything about infiniband, what’s the typical perf? I tried to google but the people who are talking about it are in the know in ways I’m not.

Thanks!


Indeed, it seems that 80usec is just given as an upper bound based on the 1MB buffer at 100Gbps.
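For reference, that upper bound is just the time to drain a full buffer at line rate:

```python
# Worst-case added latency if a full 1 MB buffer must drain at 100 Gbps.
buffer_bytes = 1_000_000
line_rate_bps = 100e9  # 100 Gbps

drain_time_us = buffer_bytes * 8 / line_rate_bps * 1e6
print(drain_time_us)  # ~80 us
```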

It is definitely possible to go much lower than 80usec on Ethernet. But obviously it depends on the scale, utilisation etc.

At the sizes of GPU clusters we're talking about these days - 32K and up - things get tricky.

The main alternative to Infiniband used in the industry is RoCE - Meta has written a lot about it [0].

There are several reasons to avoid Infiniband: cost, availability, vendor lock-in, lack of experience, etc.

Those are some of the reasons why many players are trying hard to make Ethernet work, see Ultra Ethernet [1].

[0] https://engineering.fb.com/2024/08/05/data-center-engineerin...

[1] https://ultraethernet.org/


It’s not even rare for Ethernet to be 1.5usec or less latency per switch. 80usec would be impossible to sell in any compute cluster.


> It’s not even rare for Ethernet to be 1.5usec or less latency per switch.

IIRC, Arista started off focusing on the financial market with low latency.

They're fairly well regarded in a general sense nowadays (at least /r/networking often has folks recommending them as a vendor).

"Measuring the latency of a 4ns switch":

* https://www.arista.com/assets/data/pdf/Latency-4ns-Switch-So...


RoCE is close enough; I think that's how Meta justifies it.


Is RoCE no good?


The problem is that it kind of relies on a lossless layer 2 (flow control) which has its own set of problems in large scale networks. This is what things like this try to solve: https://cloud.google.com/blog/topics/systems/introducing-fal...


whats the credit system?


OP is referring to "Credit based flow control", which is a way to ensure a sender does not overwhelm a receiver with more data than it can handle.

Usually, this is line-rate, but if the other side is slow for whatever reason (say the consumer is not draining data), you wouldn't want the sender to continue sending data.

If you also have N hosts sending data to 1 host, you need some way of distributing the bandwidth among the N hosts. That's another scenario where the credit system comes in. Think of it as admission control for packets, guaranteeing that no packets are lost. Congestion control is a looser form of admission control that tolerates lossy networks by retransmitting packets should they be lost.
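A toy sketch of the idea (class and method names are mine for illustration, not from any real NIC or InfiniBand implementation):

```python
class CreditLink:
    """Toy model of credit-based flow control: the receiver grants
    credits (one per free buffer slot), and the sender may only
    transmit while it holds credit, so the receiver can't be overrun."""

    def __init__(self, receiver_slots: int):
        self.credits = receiver_slots   # initial grant = free buffer slots
        self.receiver_queue = []

    def send(self, packet) -> bool:
        if self.credits == 0:
            return False                # sender must stall, not drop
        self.credits -= 1
        self.receiver_queue.append(packet)
        return True

    def receiver_drain(self):
        """Consuming a packet frees a slot; the credit returns to the sender."""
        packet = self.receiver_queue.pop(0)
        self.credits += 1
        return packet

link = CreditLink(receiver_slots=2)
print(link.send("a"), link.send("b"), link.send("c"))  # True True False
link.receiver_drain()
print(link.send("c"))  # True: a credit came back
```

The key property is that loss from buffer overflow is impossible by construction; the cost is that the sender stalls whenever the receiver is slow.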


Those token ring folks were on to something.


They'd respond to your kind words, but there are two faulty cables in their token ring network, and as such, no redundant paths for the beacon frame to get through.


I assume infiniband is much more expensive, but then again you have to offset all the development cost first.


It's not, really. It's been a while since I've checked pricing so my data might be old, but an IB switch is in the same ballpark as an ethernet switch with the same port speed. Same for HCA's.

There's no analogue in the Infiniband world to dirt cheap 1GbE RJ-45 switches though.


> an IB switch is in the same ballpark as an ethernet switch with the same port speed

And both price tags will make Elon's "someone's scamming me with a 'you're an enterprise customer' surcharge" sense tingle. The price tags for anything enterprise networking related are seriously inflated, and I would not be surprised if just making your own NICs and switches is cheaper once you hit a certain deployment size.


Networking has never been so cheap at the highest end. Look at the road we've been through in the last 8 years, rapidly going from 40 to 100 to 400 (200 was somewhat of a dud, 400 came too early) to 800 to 1600 Gbps. It's amazing.

I'm having trouble feeding things at 400GB/s (not a typo, it's gigabyte/s) per H100 box.

For 10 boxes ideally you want 4TB/s...


On the other hand, silicon has never been so cheap.

The hardest things would've been the DDR4 and PCIe interfaces. But they're using standard, last-generation interfaces, so I'm sure they got a good discount on all that IP, and it hardly cost them any man-hours to integrate. Tesla might've even already had the licenses and IP set up, as they make other ASICs.

I didn't do a budget or anything, but at even 10Ks of units, I could see how this could save money. Or at least not lose money. Assuming a comparable IB network card is ~$1000, which I also didn't price.

And there could be other potential cost offsetting features, like power savings.


No, but this isn't high end (in 2024), or "enterprise". It's their own designed 100Gb dumb NIC.


I've replied in a thread, and what I replied to above was about the enterprise tax and the claim that it's surely cheaper to do your own, not about whether the original article is about enterprise gear or not.


I read the thread. What you replied to was this,

> > an IB switch is in the same ballpark as an ethernet switch with the same port speed

> And both price tags will make Elon's "someone's scamming me with a 'you're an enterprise customer' surcharge" sense tingle.

In the context of Tesla doing their own protocol and not-high-end NICs.


If you were starting a team from scratch this is surely true. But, if you can leverage existing teams and infrastructure, it’s very possible.


> The price tags for anything enterprise networking related are seriously inflated

I mean yeah, but that's why you have negotiators. List price is what suckers pay.

As soon as you start to buy in job lots, or the total price comes to >$500k then stuff becomes a lot cheaper all of a sudden (within reason)

Having said that Infiniband is an arse to deploy, but not as much as custom networking protocol on custom silicon.


Ideas you’ll never hear at Google or meta


> > I would not be surprised if just making your own NICs and switches is cheaper once you hit a certain deployment size.

> Ideas you’ll never hear at Google or meta

You'd be surprised. Google has a very strong tradition of "not-invented here" which extends to some of our production networking gear as well.

To be fair, at the time, some of this was justified because the available devices on the market couldn't support our use cases back then.

Per section 3.2 of the 2013 B4 paper [0]:

  Even so, the main reason we chose to build our own hardware
  was that no existing platform could support an SDN deployment,
  i.e., one that could export low-level control over switch forwarding
  behavior. Any extra costs from using custom switch hardware are
  more than repaid by the efficiency gains available from supporting
  novel services such as centralized TE.
https://cseweb.ucsd.edu/~vahdat/papers/b4-sigcomm13.pdf


You couldn't have picked a better/worse duo in tech to be wrong on such an assertion.

Adding to the sibling comment about Google, Meta[1] built 2 large-scale production training clusters: one with InfiniBand, the other with a custom RoCE fabric.

> Custom designing much of our own hardware, software, and network fabrics allows us to optimize the end-to-end experience for our AI researchers while ensuring our data centers operate efficiently.

> With this in mind, we built one cluster with a remote direct memory access (RDMA) over converged Ethernet (RoCE) network fabric solution based on the Arista 7800 with Wedge400 and Minipack2 OCP rack switches.

Google, Meta and Netflix are among the most obsessive on optimizing their infrastructure - it's bold to assume they haven't looked at their COTS network gear and thought "hmmm..."

1. https://engineering.fb.com/2024/03/12/data-center-engineerin...


I mean the exact opposite, those were common what ifs


I don't believe the pricing is in the same ballpark.

There's also other differences, such as port counts. AFAICT Spectrum switches at 400Gbps have up to 128 ports whereas equivalent Infiniband NDR Quantum only have 64 [0].

When building clusters of 32K+ GPUs the network cost, power, transceivers etc start to add up.

[0] https://www.semianalysis.com/p/100000-h100-clusters-power-ne...


With IB (Quantum X800) you have 72x physical OSFP ports that can do 144 ports. Each electrical lane runs at 200G-PAM4, so it's 144x(4x200G-PAM4).

With the SN5600 for Ethernet (Spectrum-X), which has 64 physical ports, you're running each port at 8x100G-PAM4.


Is the Quantum X800 generally available now? Looks like it was announced in March. The SN5600, on the other hand, was released last year.


your data is likely old. IB demand has gone pretty wild since the AI boom.


At least when I looked into it last, it was less about price and more about availability. NICs are easy to get, but the switches had a 50+ week lead time. I'm sure it has come down now, though. At least for my business, we are just focused on Ethernet now. We can do 128 NICs @ 400G into a single Dell Z9864F using Ethernet. Few will need proprietary solutions for this, as there's lower-hanging fruit.


> infiniband

Or I guess even RoCE


This is all technically impressive but was it all technically necessary? Was infiniband really just not good enough? All this R&D for a custom protocol and custom NICs seems to just be a massive flex of Tesla's engineering muscle.


> Was infiniband really just not good enough

Infiniband suppliers charge crazy prices due to having little competition. It might actually be cheaper for them to design their own than to pay the Infiniband tax.


It wasn't technically necessary, but neither was RISC-V. It's a matter of licensing independence.


> It's a matter of licensing independence.

And supply chain independence. I've heard that some GPU clouds are delayed because their Infiniband hardware was delayed due to the Israel–Hamas war. Optimally you probably want to avoid critical hardware that's being manufactured in a high risk of disruption zone.


It's not only about being "good enough" but also about reliability and maintenance; new protocols and hardware take time to mature, while other solutions are already there. Ah, wait, Tesla doesn't care about those kinds of things too much...


It's also a bit odd that they don't implement congestion control. Congestion control is fundamental unless you only have point-to-point data transfers, which is rarely the case. An all-reduce operation during training requires N-to-1 data transfers. In these scenarios the sender needs to control its transfer rate so as not to overwhelm not just the receiver but also the network; if this is not done, it will cause congestion collapse (https://en.wikipedia.org/wiki/Network_congestion).
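To put rough numbers on the incast problem (illustrative figures of my own choosing, not from the article):

```python
# Incast arithmetic: N senders at full line rate into one receiver link.
# Without rate control, the receiver's link is oversubscribed N:1 and
# the excess must be buffered or dropped.
senders = 8
line_rate_gbps = 100
offered_gbps = senders * line_rate_gbps        # 800 Gbps offered
capacity_gbps = line_rate_gbps                 # 100 Gbps deliverable
excess_gbps = offered_gbps - capacity_gbps     # 700 Gbps to buffer or drop

# A hypothetical 1 MB switch buffer absorbs this excess for only:
buffer_bytes = 1_000_000
absorb_time_us = buffer_bytes * 8 / (excess_gbps * 1e9) * 1e6
print(absorb_time_us)  # ~11.4 us before drops must begin
```

Past that point, either the senders throttle (flow control / congestion control) or packets are lost and retransmitted, wasting upstream capacity.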


Current public SOTA seems to be “no congestion control”

> We proceeded without DCQCN for our 400G deployments. At this time, we have had over a year of experience with just PFC for flow control, without any other transport-level congestion control. We have observed stable performance and lack of persistent congestion for training collectives.

https://engineering.fb.com/2024/08/05/data-center-engineerin...


PFC is congestion control.


I probably shouldn't be commenting because I don't have any experience at this level, but given it's a closed system where they control supply and demand it seems they could manage away most congestion issues with scheduling/orchestration. They still have a primitive flow control in the protocol and it seems like you could create something akin to a virtual sliding window just by instrumenting the retransmits.

But now I am curious what the distribution of observed window sizes is in the wild.

Edit: I'd bet the simpler protocol is more vulnerable to various spoofing attacks though.

Edit2: Lol I hope the frame IDs are for illustrative purposes only - https://chipsandcheese.com/2024/08/27/teslas-ttpoe-at-hot-ch...


In principle, with perfect knowledge of flows at any given instant, you can assign credits/rate-of-transmission for each flow to prevent congestion. But, in practice this is somewhat nuanced to build, and there are various tradeoffs to consider: what happens if the flows are so short that coordinating with a centralised scheduler incurs a latency overhead that is comparable to the flow duration? There's been research to demonstrate that one can strike a sweet spot, but I don't think it's practical nor has it been really deployed in the wild. And of course, this scheduler has to be made reliable as it's a single point of failure.

Such ideas are, however, worth revisiting when the workload is unique enough (in this case, it is) and the performance gains are big enough...


Maybe the protocol could have arbitration built in? If one was clever you could actually have the front of the packet set a priority header, and build the collision detection/avoidance right into the header.

Multiple parties communicate at the same time? The lower priority number could electrically pull the voltage low, dominating the transmission.

That way, priority messages always get through with no overhead or central communication required.


Yep, such ideas have been around. But congestion is a fundamental problem. Admission control is the only way to ensure there is no congestion collapse.

The technical issue is that you would need global arbitration to ensure that the _goodput_ (useful bytes delivered) is optimal. With training across 32k GPUs and more these days, global arbitration to ensure the correct packets are prioritised is going to be very difficult. If you are sending more traffic than the receiver's link capacity, packets _will_ get dropped, and it's suboptimal to transmit those dropped packets into the network as they waste link capacity elsewhere (upstream) within the network.


> I'd bet the simpler protocol is more vulnerable to various spoofing attacks though.

This is a protocol between compute nodes in a data center, it's layer 2 so there is no way to reach this over the internet.


That's how it always starts :)

But, point taken.


Tuned RoCE with UDP is really low latency, and there's no need to implement an extra layer of silicon. Maybe there's more motivation than described in the article.


Isn't this what Dolphin Interconnect have been doing for a couple of decades? https://www.dolphinics.com/


No, they did it the actually good/cheap way: straight PCIe.


You are right, and I realized that a few min after posting. So tell me, why is Tesla not doing straight PCIe then, if it's the actually good/cheap way? It makes sense to me, but that's more a feeling.


Seems like Tesla could really benefit from this about-to-be-released optimizer that reduces inter-GPU communication [0].

[0] https://github.com/NousResearch/DisTrO/blob/main/A_Prelimina...


Congrats. You just reinvented the wheel: http://doc.cat-v.org/plan_9/4th_edition/papers/il/


There are already high speed, low latency video interfaces that have been around for ages. MIPI and HDMI.

There are ICs you can buy off the shelf for electronic routing and switching of these interfaces.


Meanwhile high-frequency traders work 1-2 orders of magnitude faster, in the tens to hundreds of nanoseconds.


The "Tesla" in the article appears to refer to the car manufacturer, for those as confused as I.


All this to...train a model? Having worked with automakers that agonize over pennies on a component how does Tesla amortize this much expense, if they even achieve full self-driving at all?


> Having worked with automakers that agonize over pennies on a component how does Tesla amortize this much expense, if they even achieve full self-driving at all?

Or Elon is using the resources of one company to do work for another company (?):

* https://en.wikipedia.org/wiki/XAI_(company)

* https://electrek.co/2024/04/03/elon-musk-xai-poaches-enginee...


The $1 billion spent on Dojo amortized over 2 million cars sold per year is $500 per car. I'm betting it's viable for more than a year.


Maybe the point is to just continue to be able to point to FSD as some figure thing that you are working on. That should keep investors happy!

Once they’ve built this cluster, maybe they can make money renting out compute or something.


It's a plant not a product.


If they are so concerned with low latency, how come they are wasting an entire roundtrip (the "OPEN + OPEN/ACK") before sending any data?

I mean, in TCP it's not allowed (even though, super-theoretically, it's not completely forbidden) to carry a payload in the initial TCP SYN. If you're latency-obsessed enough to create your own protocol, that's the first thing I'd address.
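Back-of-envelope on what the handshake costs, using the ~1.3 µs one-way latency figure from the article's chart (assuming a fresh connection per transfer, which may not match how they actually use it):

```python
# Cost of an OPEN + OPEN/ACK handshake before the first payload byte.
one_way_us = 1.3  # one-way latency from the article's measurement

handshake_us = 2 * one_way_us            # one full round trip
first_byte_us = handshake_us + one_way_us  # then data can start flowing
print(first_byte_us)  # ~3.9 us to first byte, vs ~1.3 us with data-in-OPEN
```

For long-lived training flows the handshake amortizes to nearly nothing, which may be why it wasn't optimized away.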


Do you need low latency right away, or just after the car has started or entered a sport/race mode?

If you don’t need it all the time, why bother?


Interesting! No mention of UDP, or the application being run, or the GPU/TPUs on the nodes, so it'll have to be a mystery as to how much bang for their buck they're getting with this particular bit of work.

What's disappointing is that it's impossible to do a new protocol on the Internet because of all the middleware boxes that drop packets that aren't ICMP, TCP, or UDP.


UDP means that upper layers have to figure out if the data is shit or not. Doing that in hardware limits your data types, flows, and development. It's far simpler to have a "reliable" data transport system than to deal with a lossy protocol in hardware.


Yeah I never FIN my connections eithRST


100Gbps Ethernet cards? The world has moved way past that for training. Their accelerator stack must be really slow if this is good enough for them.


It doesn't say they can't aggregate links. I wouldn't say the world has moved way past that yet, but Tesla probably doesn't want to be dependent, like everyone else but Apple, on Nvidia (InfiniBand).


> like everyone else but Apple, on Nvidia (InfiniBand)

Google also does not depend on NVIDIA, thx to TPUs. Rents NVIDIA GPUs to external customers - sure, it's a nice side business, but internally TPUs are king and there's no dependency on NVIDIA for that.


> Google also does not depend on NVIDIA,

Deepmind says otherwise. Training is most likely all on NVIDIA still. Same for Apple.

The difference is, nobody knows for sure with Apple, because they are a secrecy cult.


Likewise Microsoft and Meta have developed in house AI chips, but I don't know what fraction of their AI workloads run on them.


Not a meaningful amount :). Their “AI chips” are, for now, marketing.


From what I know, Meta's AI chips are used in production today, but are made for their recommendation tasks, which is a very different AI than the GPTs and LLMs for which they still rely on GPUs.


This is not true at all.


It's very hard to respond to a comment like that, since there's no specifics, just a plain disagreement.

On my side, I would like to point out that the GameNGen paper ([2][3]) discussed in today's HN thread ([1]), which runs Doom with a diffusion model, was trained on TPUs.

I don't see a dependency on NVIDIA there.

If there's a more specific rebuttal to my original statement, please, don't hesitate to state it.

1. https://news.ycombinator.com/item?id=41375548

2. https://gamengen.github.io/

3. https://arxiv.org/abs/2408.14837


TPUs are essentially garbage compared to NVIDIA hardware. TPUs are king of nothing, but a primary ingredient in Kool-Aid


You can get 400gbit ethernet from half a dozen vendors.



