I'm the primary author so happy to answer any questions you might have!
zxexz 7 hours ago [-]
This is awesome, can’t wait to try out these techniques. At least a week a year of my time for the past few years has gone towards recovering from a fault crashing a training run. Sometimes it's environment-related, sometimes shared storage, sometimes just because of a slightly faulty IB cable.
bjt12345 11 hours ago [-]
This is severely underrated work. Why aren't there more mid-sized companies helping with this? Ultra Ethernet just got released.
foobiekr 7 hours ago [-]
Ultra Ethernet will do almost nothing. It’s a rubber-stamped version of Broadcom’s design, and Marvell/Cisco/etc. will just add it to their ASICs. It remains to be seen whether Spectrum-X or ConnectX will. If not, none of it matters.
These chips are $30m-$100m projects a pop. After the embarrassingly brutal failure of Barefoot nobody is going to do ASICs.
anonymousDan 3 hours ago [-]
What kind of failures are you typically concerned with here?
timzaman 12 hours ago [-]
300 L40s? What's this, 1998?
d4l3k 11 hours ago [-]
Hey Tim, how's it going?
Interested in lending PyTorch some compute? :)
torchft can handle much larger scales, but for a public multi-day demonstration run this is what we had available. The point of this blog was to demonstrate the correctness of the quorum algorithm and recovery with a stock PyTorch stack, not so much peak FLOPS.
Stay tuned though -- planning on doing some much larger demos on B200s!
kcorbitt 12 hours ago [-]
I was curious about this so I had o3 do a bit of research. Turns out 300 L40s have more compute than any supercomputer before 2013 (and arguably before 2016, depending on how you count reduced-precision FLOPs).
https://chatgpt.com/share/685dea79-26ec-8002-bd62-7ed83aedf4...