
Commit 8a8331e

c-p-i-o and svekars authored

[c10d][Doc] Add a flight recorder tutorial (#3040)

* [c10d][Doc] Add a flight recorder tutorial

  Summary: Recreated pull request #3024 because of bad merges. Re-add flight recorder tutorial with prior comments addressed.

  Test Plan: Ran: rst2html5 flight_recorder_tutorial.rst flight_recorder_tutorial.html

  Co-authored-by: Svetlana Karslioglu <[email protected]>

1 parent 129e318 commit 8a8331e

File tree

3 files changed: +201 −7 lines changed

prototype_source/README.txt

+6 −3

@@ -7,7 +7,7 @@ Prototype Tutorials
 2. graph_mode_static_quantization_tutorial.py
    Graph Mode Post Training Static Quantization in PyTorch
    https://pytorch.org/tutorials/prototype/graph_mode_static_quantization_tutorial.html
-
+
 3. graph_mode_dynamic_bert_tutorial.rst
    Graph Mode Dynamic Quantization on BERT
    https://github.com/pytorch/tutorials/blob/main/prototype_source/graph_mode_dynamic_bert_tutorial.rst
@@ -30,9 +30,12 @@ Prototype Tutorials

 8. fx_graph_mode_ptq_dynamic.py
    FX Graph Mode Post Training Dynamic Quantization
-   https://pytorch.org/tutorials/prototype/fx_graph_mode_ptq_dynamic.html
+   https://pytorch.org/tutorials/prototype/fx_graph_mode_ptq_dynamic.html

 9. fx_graph_mode_quant_guide.py
    FX Graph Mode Quantization User Guide
-   https://pytorch.org/tutorials/prototype/fx_graph_mode_quant_guide.html
+   https://pytorch.org/tutorials/prototype/fx_graph_mode_quant_guide.html

+10 flight_recorder_tutorial.rst
+   Flight Recorder User Guide
+   https://pytorch.org/tutorials/prototype/flight_recorder_tutorial.html
prototype_source/flight_recorder_tutorial.rst

+182 −0

@@ -0,0 +1,182 @@

(prototype) Flight Recorder for Debugging Stuck Jobs
====================================================
**Author**: `Chirag Pandya <https://github.com/c-p-i-o>`_, `Junjie Wang <https://github.com/fduwjj>`_

What you will learn
-------------------
* Learn about a new tool for debugging stuck jobs during distributed training.
* Learn how you can enable the tool and use the collected data for analyzing stuck jobs.

Prerequisites
-------------
- PyTorch version 2.5 or later.

Overview
--------
An AI distributed training job refers to the process of training a machine learning model using multiple devices, such
as GPUs or CPUs, connected in a network. This approach allows for faster and more efficient training of large models
that require significant computational resources.
An engineer's goal is to complete an AI training job as quickly as possible and make continuous improvements so that
subsequent training can be done faster. A trained, usable model is the final desired outcome.
One of the biggest impediments to completing training is a *stuck job*.

A distributed AI training job is considered `stuck` when it stops making meaningful progress for an extended period of
time.

A job can get stuck for various reasons:

- **Data Starvation:** This occurs when the training job is not receiving data at the expected rate, possibly due to issues with the data pipeline or the data source.

- **Resource Constraints:** If the system running the job does not have enough computational resources (such as CPU, GPU, or memory), the job might not be able to proceed.

- **Network Issues:** In a distributed training setup, different parts of the model or data may be processed on different devices. If there are network issues, communication between these devices may be disrupted, causing the job to get stuck.

- **Software Bugs or Errors:** Errors in the training code or the underlying libraries and frameworks can also cause a job to get stuck.

- **Synchronization Issues:** In distributed training, different parts of the computation are often run in parallel and need to be synchronized at certain points. If this synchronization fails, the job can get stuck. For example, a deadlock can occur if one or more ranks fail to join a collective while the remaining ranks have joined. This results in an indefinite wait for the job to progress.

Flight Recorder, as the name suggests, captures diagnostic information as collectives run. The captured diagnostic
information is used to help identify the root causes of issues when jobs become stuck.
Flight Recorder consists of two core parts:

- The collection portion: when enabled, information about collectives is recorded in an in-memory circular buffer. Upon job timeout, or on demand, the in-memory buffer can be retrieved or dumped to a file.

- An analyzer script is available in the `tools/flight_recorder <https://github.com/pytorch/pytorch/tree/main/tools/flight_recorder>`__ directory (details below).
  The analyzer script runs known heuristics on the collected data and attempts to automatically identify the underlying issue that caused the job to stall.

Enabling Flight Recorder
------------------------
There are two required environment variables to get the initial version of Flight Recorder working; a short example of setting them follows the list below.

- ``TORCH_NCCL_TRACE_BUFFER_SIZE = (0, N)``: Setting ``N`` to a positive number enables collection.
  ``N`` represents the number of entries that will be kept internally in a circular buffer.
  We recommend setting this value to *2000*.
- ``TORCH_NCCL_DUMP_ON_TIMEOUT = (true, false)``: Setting this to ``true`` will write out diagnostic files to disk on job timeout.
  If enabled, there will be one file per rank output in the job's running directory.
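
These variables are typically exported in the launch environment of the job. As a minimal, illustrative sketch (assuming a standard ``torch.distributed`` NCCL setup; the buffer size shown is simply the recommended example value), they can also be set from Python before the process group is created:

.. code:: python

    import os

    import torch.distributed as dist

    # Flight Recorder picks these settings up when the NCCL process group is
    # created, so they must be in the environment before init_process_group.
    os.environ["TORCH_NCCL_TRACE_BUFFER_SIZE"] = "2000"  # keep the last 2000 entries
    os.environ["TORCH_NCCL_DUMP_ON_TIMEOUT"] = "true"    # dump one file per rank on timeout

    dist.init_process_group(backend="nccl")
    # ... run the training loop and its collectives as usual ...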

**Optional settings:**

- ``TORCH_NCCL_TRACE_CPP_STACK = (true, false)``: Setting this to ``true`` enables C++ stack traces to be captured in Flight Recorder.
  C++ stack traces can be useful in providing the exact code path from a PyTorch Python call down to the primitive
  C++ implementation. Also see ``TORCH_SYMBOLIZE_MODE`` in additional settings.
- ``TORCH_NCCL_ENABLE_TIMING = (true, false)``: Setting this to ``true`` will enable additional CUDA events at the start of each collective and
  record the *duration* of each collective. This may incur some CPU overhead. In the collected data, the
  *duration* field indicates how long each collective took to execute.

Additional Settings
-------------------

- ``TORCH_SYMBOLIZE_MODE = (dladdr, addr2line, fast)``: This setting determines the program used to retrieve C++ traces from a running program.
  The default setting is ``addr2line``.

  ``fast`` is a new experimental mode that has been shown to be much faster than the traditional ``addr2line``.
  Use this setting in conjunction with ``TORCH_NCCL_TRACE_CPP_STACK`` to collect C++ traces in the Flight Recorder data.
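
For example, a minimal sketch (values are illustrative) of enabling C++ stack trace collection together with the faster symbolizer, set alongside the required variables before the process group is created:

.. code:: python

    import os

    os.environ["TORCH_NCCL_TRACE_CPP_STACK"] = "true"  # capture C++ stack traces in Flight Recorder
    os.environ["TORCH_SYMBOLIZE_MODE"] = "fast"        # experimental, faster than the default addr2line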
Retrieving Flight Recorder Data via an API
------------------------------------------

You can also retrieve Flight Recorder data with an API call.
The API with the default arguments is shown below:

.. code:: python

    torch._C._distributed_c10d._dump_nccl_trace(includeCollectives=True, includeStackTraces=True, onlyActive=False)

To view the data, you can ``unpickle`` it as shown below:

.. code:: python

    import pickle

    t = pickle.loads(torch._C._distributed_c10d._dump_nccl_trace())
    print(t)
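
For example, the following minimal sketch (the output directory and per-rank file naming are hypothetical choices, not a fixed convention) writes each rank's dump to disk so it can be analyzed offline later:

.. code:: python

    import os

    import torch
    import torch.distributed as dist

    def dump_flight_recorder(out_dir: str) -> None:
        """Write this rank's Flight Recorder buffer to a per-rank file."""
        rank = dist.get_rank()
        # The API returns pickled bytes describing the recorded collectives.
        data = torch._C._distributed_c10d._dump_nccl_trace()
        os.makedirs(out_dir, exist_ok=True)
        with open(os.path.join(out_dir, f"nccl_trace_rank_{rank}"), "wb") as f:
            f.write(data)

    # For example, call dump_flight_recorder("/tmp/fr_dumps") on each rank
    # when a hang is suspected.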
Flight Recorder File Formats
----------------------------

Flight Recorder files are dumped in ``pickle`` format. Files are written to local disks or mounted shared NFS
folders.

The contents of a Flight Recorder ``unpickled`` file are shown below:

.. code-block:: json

    {
      "version": "2.5",
      "pg_config": {
        "0": {
          "name": "0",
          "desc": "default_pg",
          "ranks": "[0, 1]"
        }
      },
      "pg_status": {
        "0": {
          "last_enqueued_collective": 2,
          "last_started_collective": -1,
          "last_completed_collective": 2
        }
      },
      "entries": [
        {
          "frames": [
            {
              "name": "test_short_pickle",
              "filename": "pytorch/test/distributed/test_c10d_nccl.py",
              "line": 3647
            },
            {
              "name": "spawn_main",
              "filename": ".conda/envs/pytorch-3.10/lib/python3.10/multiprocessing/spawn.py",
              "line": 116
            },
            {
              "name": "<module>",
              "filename": "<string>",
              "line": 1
            }
          ],
          "record_id": 0,
          "pg_id": 0,
          "process_group": ("0", "default_pg"),
          "collective_seq_id": 1,
          "p2p_seq_id": 0,
          "op_id": 1,
          "profiling_name": "nccl:all_reduce",
          "time_created_ns": 1724779239936775119,
          "input_sizes": [[3, 4]],
          "input_dtypes": ["Float"],
          "output_sizes": [[3, 4]],
          "output_dtypes": ["Float"],
          "state": "completed",
          "time_discovered_started_ns": null,
          "time_discovered_completed_ns": 1724779239975811724,
          "retired": true,
          "timeout_ms": 600000,
          "is_p2p": false
        },
        ...
      ]
    }
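
As a quick way to inspect a dump that has already been written to disk, the following sketch (the file path is hypothetical; use wherever your job wrote its dumps) loads a single rank's file and prints a few of the fields shown above:

.. code:: python

    import pickle

    with open("/tmp/nccl_trace_rank_0", "rb") as f:
        trace = pickle.load(f)

    print(trace["version"], trace["pg_config"])
    for entry in trace["entries"]:
        print(entry["profiling_name"], entry["state"], entry["input_sizes"])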
Analyzing Flight Recorder Dumps
-------------------------------

We provide convenient scripts in the ``tools/flight_recorder`` directory of the PyTorch repository for analyzing captured
data.

To run the convenience script, follow these steps:

1. Copy the dump files from the ranks into a single directory.

2. Run the script with this command:

.. code:: shell

    python fr_trace.py -d <dump dir containing trace files> [-o <output file>]

Conclusion
----------
In this tutorial, we have learned about a new PyTorch diagnostic tool called Flight Recorder.
We have discussed how to enable Flight Recorder to collect diagnostic data from a machine.
Additionally, we explored how to analyze the data captured by Flight Recorder using a
convenience script located in the `tools/flight_recorder <https://github.com/pytorch/pytorch/tree/main/tools/flight_recorder>`__
directory of the PyTorch repository.

prototype_source/prototype_index.rst

+13 −4
@@ -80,8 +80,8 @@ Prototype features are not available as part of binary distributions like PyPI o
    :card_description: Learn how to use Post Training Quantization in PyTorch 2 Export.
    :image: ../_static/img/thumbnails/cropped/generic-pytorch-logo.png
    :link: ../prototype/pt2e_quant_ptq.html
-   :tags: Quantization
-
+   :tags: Quantization
+
 .. customcarditem::
    :header: PyTorch 2 Export Quantization-Aware Training
    :card_description: Learn how to use Quantization-Aware-Training in PyTorch 2 Export.

@@ -203,11 +203,11 @@ Prototype features are not available as part of binary distributions like PyPI o

 .. customcarditem::
    :header: MaskedTensor: Simplifying Adagrad Sparse Semantics
-   :card_description: See a showcase on how masked tensors can enable sparse semantics and provide for a cleaner dev experience
+   :card_description: See a showcase on how masked tensors can enable sparse semantics and provide for a cleaner dev experience
    :image: ../_static/img/thumbnails/cropped/generic-pytorch-logo.png
    :link: ../prototype/maskedtensor_adagrad.html
    :tags: MaskedTensor
-
+
 .. Model-Optimization

 .. customcarditem::

@@ -217,6 +217,14 @@ Prototype features are not available as part of binary distributions like PyPI o
    :link: ../prototype/inductor_cpp_wrapper_tutorial.html
    :tags: Model-Optimization

+.. Distributed
+.. customcarditem::
+   :header: Flight Recorder Tutorial
+   :card_description: Debug stuck jobs easily with Flight Recorder
+   :image: ../_static/img/thumbnails/cropped/generic-pytorch-logo.png
+   :link: ../prototype/flight_recorder_tutorial.html
+   :tags: Distributed, Debugging, FlightRecorder
+
 .. End of tutorial card section

 .. raw:: html

@@ -238,6 +246,7 @@ Prototype features are not available as part of binary distributions like PyPI o
    prototype/fx_graph_mode_quant_guide.html
    prototype/fx_graph_mode_ptq_dynamic.html
    prototype/fx_graph_mode_ptq_static.html
+   prototype/flight_recorder_tutorial.html
    prototype/graph_mode_dynamic_bert_tutorial.html
    prototype/inductor_cpp_wrapper_tutorial.html
    prototype/pt2e_quantizer.html
