[FR][doc] Update README with reference to Flight Recorder (pytorch#599)
Summary:
Update readme with reference to the flight recorder tutorial to help
users diagnose stuck jobs.

Test Plan:
none.
c-p-i-o authored Oct 7, 2024
1 parent ce5a73e commit 40a1026
Showing 1 changed file with 8 additions and 0 deletions.
8 changes: 8 additions & 0 deletions README.md
@@ -129,6 +129,14 @@ If your gpu count per node is not 8, adjust:

in the SBATCH command section.


## Debugging
### Troubleshooting Jobs that Time Out
If a job times out, you'll need to debug it to identify the root cause. To help with this process, we've enabled Flight Recorder, a tool that continuously collects diagnostic information about your jobs.
When a job times out, Flight Recorder automatically writes a dump file on every rank containing debugging data. You can find these dump files in the `job.dump_folder` directory.
To learn how to analyze these dumps and diagnose issues, follow the step-by-step [Flight Recorder tutorial](https://pytorch.org/tutorials/prototype/flight_recorder_tutorial.html).
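
Before analyzing the dumps, it can help to confirm that every rank actually wrote one. The following is a minimal sketch, not part of this repository: it assumes each dump file is named with a hypothetical `rank_` prefix followed by the rank number, and the helper names `find_rank_dumps` and `missing_ranks` are illustrative. Adjust the prefix to match the files you see in your `job.dump_folder`.

```python
import re
from pathlib import Path


def find_rank_dumps(dump_folder, prefix="rank_"):
    """Map rank number -> path for files named <prefix><rank>.

    The "rank_" prefix is an assumption for illustration; check the
    actual file names in your dump folder.
    """
    pattern = re.compile(re.escape(prefix) + r"(\d+)$")
    dumps = {}
    for path in Path(dump_folder).iterdir():
        match = pattern.match(path.name)
        if path.is_file() and match:
            dumps[int(match.group(1))] = path
    return dumps


def missing_ranks(dumps, world_size):
    """Return the sorted ranks in [0, world_size) that have no dump file."""
    return [rank for rank in range(world_size) if rank not in dumps]
```

For example, `missing_ranks(find_rank_dumps("path/to/dump_folder"), world_size=8)` lists the ranks without a dump, which may point to ranks that crashed or hung before the dump was written.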


## License

This code is made available under the [BSD 3 license](./LICENSE). However, you may have other legal obligations that govern your use of other content, such as the terms of service for third-party models, data, etc.
