At Hot Chips 2024, Tesla showcased its groundbreaking DOJO supercomputer, revealing not just an advanced AI accelerator but also a custom-designed network protocol. This presentation, coupled with exclusive images of the “Mojo Dojo Compute Hall” (MDCH) in New York, demonstrates Tesla’s commitment to pushing the boundaries of AI computing infrastructure.
Tesla’s innovation extends beyond hardware to the network layer with the Tesla Transport Protocol over Ethernet (TTPoE). This custom protocol addresses the limitations of traditional TCP/IP for high-performance AI workloads: it is a peer-to-peer transport protocol executed entirely in hardware, rides on plain Layer 2 Ethernet, and requires no specialized switches.
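Tesla did not walk through the exact frame layout in the material covered here, but the idea of a transport protocol carried directly inside Ethernet frames can be sketched in C. The field names, sizes, and the EtherType below are illustrative assumptions, not a confirmed specification:

```c
/* Hypothetical sketch of a transport header riding directly on Ethernet
 * (Layer 2). Field names and sizes are illustrative assumptions, not
 * Tesla's published TTPoE layout. */
#include <stdint.h>

#define ETH_ALEN 6
#define TTPOE_ETHERTYPE 0x88B5  /* IEEE "local experimental" EtherType; assumed here */

struct eth_hdr {
    uint8_t  dst[ETH_ALEN];    /* destination MAC */
    uint8_t  src[ETH_ALEN];    /* source MAC */
    uint16_t ethertype;        /* selects the TTPoE payload type */
} __attribute__((packed));

struct ttpoe_hdr {             /* hypothetical transport header */
    uint8_t  opcode;           /* e.g. OPEN, PAYLOAD, ACK, NACK, CLOSE */
    uint8_t  vc;               /* virtual channel / QoS class (idle if QoS is off) */
    uint16_t length;           /* payload bytes in this frame */
    uint32_t tx_seq;           /* sender sequence number, enables retransmission */
    uint32_t rx_ack;           /* cumulative acknowledgment */
} __attribute__((packed));
```

Because everything above sits in an ordinary Ethernet payload, any commodity switch can forward it, which is the point of staying at Layer 2.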
Unlike lossless RDMA fabrics such as RoCE, TTPoE assumes packets can be dropped and handles retransmission in hardware, much as TCP does in software. Congestion is managed per link at the endpoints, rather than at the network or switch level. The protocol supports Quality of Service (QoS), but Tesla has disabled the feature in the current implementation.
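To make the loss-tolerant design concrete, here is a toy go-back-N sender in C. It models the general technique of retransmitting from the last cumulative acknowledgment, with congestion handled simply by pausing the local sender when its window fills; this is an illustrative model, not Tesla’s actual hardware state machine:

```c
/* Toy go-back-N retransmission logic: tolerate loss on the link and
 * resend from the last acknowledged sequence number. Illustrative
 * model only, not Tesla's hardware implementation. */
#include <stdint.h>
#include <stdbool.h>

#define WINDOW 64              /* max unacknowledged frames in flight */

struct tx_state {
    uint32_t next_seq;         /* next sequence number to send */
    uint32_t acked;            /* highest cumulative ACK received */
};

static bool can_send(const struct tx_state *s) {
    /* Congestion control at the local link: when the window is full,
     * the sender simply stalls -- no switch-level flow control needed. */
    return s->next_seq - s->acked < WINDOW;
}

static void on_ack(struct tx_state *s, uint32_t ack) {
    if (ack > s->acked)
        s->acked = ack;        /* slide the window forward */
}

static void on_timeout(struct tx_state *s) {
    s->next_seq = s->acked;    /* go-back-N: resend everything unacked */
}
```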
Tesla has integrated the TTPoE IP block into both FPGA and silicon designs, optimizing it for high-speed packet transmission. Microarchitecturally, TTPoE resembles an L3 cache, with a 1MB transmit (TX) buffer in the current generation. Since that buffer bounds how much data can be in flight awaiting acknowledgment, enlarging it is a straightforward scaling lever for future iterations.
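A quick bandwidth-delay calculation shows why the TX buffer size matters: at 100Gbps, 1MB of buffered data corresponds to roughly 84 microseconds on the wire, which bounds the round-trip delay over which a sender can keep the link saturated. The snippet below just works out that arithmetic:

```c
/* Back-of-the-envelope check: how much link time a 1 MB TX buffer
 * covers at 100 Gbps. A larger buffer tolerates a longer round trip
 * before the sender stalls, which is why buffer size gates scaling. */
#include <stdio.h>

int main(void) {
    const double buf_bits  = 1.0 * 1024 * 1024 * 8;  /* 1 MiB TX buffer, in bits */
    const double line_rate = 100e9;                  /* 100 Gbps link */
    printf("buffer drains in %.1f microseconds\n",
           buf_bits / line_rate * 1e6);              /* ~83.9 us of in-flight data */
    return 0;
}
```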
The DOJO system’s networking is anchored by Mojo, a 100Gbps NIC that draws under 20W and carries 8GB of DDR4 memory plus a dedicated Dojo DMA engine. This high-performance NIC is crucial to the system’s overall efficiency in handling massive AI workloads.
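Tesla did not detail Mojo’s DMA programming model, but a descriptor ring is the conventional way such an engine is fed frames from host memory. The sketch below is a generic illustration with hypothetical structure and field names, not the actual Mojo interface:

```c
/* Generic descriptor-ring sketch of how a NIC-side DMA engine might be
 * handed frames from host memory. Structure and field names are
 * illustrative assumptions; the real Mojo interface is unpublished. */
#include <stdint.h>

#define RING_SIZE     256
#define DESC_HW_OWNED 0x1      /* ownership bit: hardware may fetch this slot */

struct dma_desc {
    uint64_t addr;             /* physical address of the frame in host DDR4 */
    uint16_t len;              /* frame length in bytes */
    uint16_t flags;            /* bit 0 = owned-by-hardware */
};

struct dma_ring {
    struct dma_desc desc[RING_SIZE];
    uint32_t head;             /* next slot software will fill */
};

/* Hand one frame to the (hypothetical) engine: fill a slot, then flip
 * ownership so the hardware knows it may fetch and transmit the buffer. */
static void dma_post(struct dma_ring *r, uint64_t addr, uint16_t len) {
    struct dma_desc *d = &r->desc[r->head];
    d->addr  = addr;
    d->len   = len;
    d->flags = DESC_HW_OWNED;
    r->head  = (r->head + 1) % RING_SIZE;
}
```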
With DOJO and TTPoE, Tesla isn’t just computing AI tasks – it’s rewiring the very fabric of AI infrastructure.