d4l3k 18 hours ago

Hey, nice to see this here!

I'm the primary author so happy to answer any questions you might have!

  • bwfan123 4 hours ago

    Why isn't there more investment in semi-synchronous training? Is it that the convergence is iffy? Also, it would be great to refactor this code into a typed language, so it is easier to reason about and maintain.

    • d4l3k 2 hours ago

      Recently there's been a lot of interest and improvements in semi-synchronous training. The Streaming DiLoCo paper came out this year and is a big step forward for datacenter semi-sync.

      Historically it's been limited to areas like federated learning for low-power/low-network training, but with the massive increase in the number of GPUs it's becoming relevant even for training in datacenters.

      It is another variable ML researchers have to tune, so it does add some complexity, and I expect most folks just aren't familiar with it yet.
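
      In case it helps, here's a rough sketch of the semi-sync inner/outer loop in plain PyTorch -- purely illustrative, loosely following the DiLoCo paper rather than torchft's actual API, with made-up helper names (model, inner_opt, outer_opt, global_params, data_iter, loss_fn):

        import torch
        import torch.distributed as dist

        H = 64  # local steps between synchronizations -- the extra knob mentioned above

        def semi_sync_round(model, inner_opt, outer_opt, global_params, data_iter, loss_fn):
            # Inner loop: each replica trains independently for H steps, no communication.
            for _ in range(H):
                x, y = next(data_iter)
                inner_opt.zero_grad()
                loss_fn(model(x), y).backward()
                inner_opt.step()

            # Outer step: average the "pseudo-gradient" (global - local) across replicas
            # and apply it to the shared parameters with the outer optimizer.
            outer_opt.zero_grad()
            world_size = dist.get_world_size()
            for p_global, p_local in zip(global_params, model.parameters()):
                delta = p_global.data - p_local.data
                dist.all_reduce(delta, op=dist.ReduceOp.SUM)
                p_global.grad = delta / world_size
            outer_opt.step()

            # Every replica starts the next round from the updated global parameters.
            with torch.no_grad():
                for p_global, p_local in zip(global_params, model.parameters()):
                    p_local.copy_(p_global)

      The appeal is that communication happens once every H steps instead of every step, at the cost of tuning H and the outer optimizer.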

      On "typed language": all of torchft is typed! The coordination/quorum layers are written in Rust with gRPC, and the front-end is typed Python checked with Pyre since it has to interact with PyTorch and model code.

zxexz 12 hours ago

This is awesome, can't wait to try out these techniques. At least a week a year of my time for the past few years has gone toward recovering from a fault crashing a training run. Sometimes it's environment related, sometimes shared storage, sometimes just a slightly faulty IB cable.

  • d4l3k 2 hours ago

    Let me know how it goes! If you're interested in chatting or run into any problems, feel free to reach out via the links in my profile.

bjt12345 17 hours ago

This is severely underrated work. Why aren't there more mid-sized companies helping with this? Ultra Ethernet just got released.

  • foobiekr 13 hours ago

    Ultra Ethernet will do almost nothing. It's a rubber-stamped version of Broadcom's design, and Marvell/Cisco/etc. will just add it to their ASICs. It remains to be seen whether Spectrum-X or ConnectX will. If not, none of it matters.

    These chips are $30m-$100m projects a pop. After the embarrassingly brutal failure of Barefoot, nobody is going to do ASICs.

anonymousDan 9 hours ago

What kind of failures are you typically concerned with here?

  • d4l3k 4 minutes ago

    We want to be tolerant of application bugs and host/GPU failures that can be solved by replacing or restarting the machine. We don't have much control over external services and network failures, so we aren't aiming to solve those.

    For specific types of failures check out the section on "Reliability and Operational Challenges" from the Llama 3 paper https://ai.meta.com/research/publications/the-llama-3-herd-o...

timzaman 18 hours ago

300 L40s? What's this, 1998?

  • d4l3k 17 hours ago

    Hey Tim, how's it going?

    Interested in lending PyTorch some compute? :)

    torchft can handle much larger scales, but for a public multi-day demonstration run this is what we had available. The point of this blog was to demonstrate the correctness of the quorum algorithm and recovery with a stock PyTorch stack, not so much peak FLOPs.
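
    If it's useful, the per-step fault tolerance looks roughly like this -- hypothetical names, not the real torchft API, just to illustrate the quorum + recovery idea:

      # Conceptual sketch: replicas agree on a quorum each step, collectives only run
      # across live replicas, and a restarted replica pulls weights from a healthy peer.
      def train_step(step, model, optimizer, coordinator, batch, loss_fn):
          quorum = coordinator.join_quorum(step)
          if quorum.needs_recovery:
              # A replica that just came back fetches current weights before rejoining.
              model.load_state_dict(quorum.fetch_state_from_peer())

          loss = loss_fn(model(batch.x), batch.y)
          loss.backward()

          # Gradients are averaged only across the replicas in this step's quorum,
          # so a dead machine can't hang the collective.
          for p in model.parameters():
              quorum.all_reduce(p.grad)

          # Commit the step only if the quorum stayed healthy; otherwise drop it and
          # retry so every replica's weights stay consistent.
          if quorum.should_commit():
              optimizer.step()
          optimizer.zero_grad()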

    Stay tuned though -- planning on doing some much larger demos on B200s!