A RoCE network for distributed AI training at scale
AI networks play an important role in interconnecting tens of thousands of GPUs, forming the foundational infrastructure for training large models with hundreds of billions of parameters, such as LLAMA 3.1 405B. This week at ACM SIGCOMM 2024 in Sydney, Australia, we are sharing details on the network we have built at Meta [...]
The post A RoCE network for distributed AI training at scale appeared first on Engineering at Meta.
http://dlvr.it/TBXCRm