Commit 0787d86

authored Aug 18, 2021
STORM-3791 update metric documentation (apache#3409)
* STORM-3791 update metric documentation
1 parent 16ca56e commit 0787d86

4 files changed: +14 -15 lines changed


Diff for: docs/ClusterMetrics.md (+2)

@@ -180,11 +180,13 @@ Metrics associated with the supervisor, which launches the workers for a topolog
 | supervisor:blob-localization-duration | timer | Approximately how long it takes to get the blob we want after it is requested. |
 | supervisor:current-reserved-memory-mb | gauge | total amount of memory reserved for workers on the supervisor (MB) |
 | supervisor:current-used-memory-mb | gauge | memory currently used as measured by the supervisor (this typically requires cgroups) (MB) |
+| supervisor:health-check-timeouts | meter | tracks timeouts executing health check scripts |
 | supervisor:local-resource-file-not-found-when-releasing-slot | meter | number of times file-not-found exception happens when reading local blobs upon releasing slots |
 | supervisor:num-blob-update-version-changed | meter | number of times a version of a blob changes. |
 | supervisor:num-cleanup-exceptions | meter | exceptions thrown during container cleanup. |
 | supervisor:num-force-kill-exceptions | meter | exceptions thrown during force kill. |
 | supervisor:num-kill-exceptions | meter | exceptions thrown during kill. |
+| supervisor:num-kill-worker-errors | meter | errors killing workers. |
 | supervisor:num-launched | meter | number of times the supervisor is launched. |
 | supervisor:num-shell-exceptions | meter | number of exceptions calling shell commands. |
 | supervisor:num-slots-used-gauge | gauge | number of slots used on the supervisor. |

Diff for: docs/LocalityAwareness.md (+1 -1)

@@ -48,7 +48,7 @@ If the downstream executor located on the same worker as the executor `E`, the l
 The capacity of a bolt executor on Storm UI is calculated as:
 * (number executed * average execute latency) / measurement time

-It basically means how busy this executor is. If this is around 1.0, the corresponding Bolt is running as fast as it can.
+It basically means how busy this executor is. If this is around 1.0, the corresponding Bolt is running as fast as it can. A `__capacity` metric exists to track this value for each executor.

 The `Capacity` is not related to the `Load`:
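
As a worked illustration of the capacity formula in the hunk above, the following Python sketch computes the value from its three inputs. The function name `bolt_capacity`, its parameter names, and the sample numbers are hypothetical and are not taken from Storm's code. The sketch also assumes that latency and measurement time are both expressed in milliseconds.

```python
def bolt_capacity(executed: int, avg_execute_latency_ms: float, window_ms: float) -> float:
    """Return (number executed * average execute latency) / measurement time.

    A result near 1.0 means the executor spent almost the whole window
    executing tuples.
    """
    if window_ms <= 0:
        raise ValueError("window_ms must be positive")
    return (executed * avg_execute_latency_ms) / window_ms


# Hypothetical sample: 50,000 tuples at 10 ms each over a 10-minute window.
capacity = bolt_capacity(executed=50_000, avg_execute_latency_ms=10.0, window_ms=600_000)
print(f"capacity = {capacity:.3f}")  # prints "capacity = 0.833"
```

With these made-up inputs the executor is busy for roughly 83% of the window, which by the description above would indicate a bolt that is approaching its limit.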

Diff for: docs/Metrics.md (+10 -14)

@@ -213,37 +213,33 @@ This metric records how many errors were reported by a spout/bolt. It is the tot

 #### Queue Metrics

-Each bolt or spout instance in a topology has a receive queue and a send queue. Each worker also has a queue for sending messages to other workers. All of these have metrics that are reported.
+Each bolt or spout instance in a topology has a receive queue. Each worker also has a worker transfer queue for sending messages to other workers. All of these have metrics that are reported.

-The receive queue metrics are reported under the `__receive` name and send queue metrics are reported under the `__sendqueue` for the given bolt/spout they are a part of. The metrics for the queue that sends messages to other workers is under the `__transfer` metric name for the system bolt (`__system`).
+The receive queue metrics are reported under the `receive_queue` name. The metrics for the queue that sends messages to other workers are under the `worker-transfer-queue` metric name for the system bolt (`__system`).

-They all have the form.
+These queues report the following metrics:

 ```
 {
 "arrival_rate_secs": 1229.1195171893523,
 "overflow": 0,
-"read_pos": 103445,
-"write_pos": 103448,
 "sojourn_time_ms": 2.440771591407277,
 "capacity": 1024,
-"population": 19
-"tuple_population": 200
+"population": 19,
+"pct_full": "0.018",
+"insert_failures": "0",
+"dropped_messages": "0"
 }
 ```
-In storm we sometimes batch multiple tuples into a single entry in the disruptor queue. This batching is an optimization that has been in storm in some form since the beginning, but the metrics did not always reflect this so be careful with how you interpret the metrics and pay attention to which metrics are for tuples and which metrics are for entries in the disruptor queue. The `__receive` and `__transfer` queues can have batching but the `__sendqueue` should not.

 `arrival_rate_secs` is an estimation of the number of tuples that are inserted into the queue in one second, although it is actually the dequeue rate.
 The `sojourn_time_ms` is calculated from the arrival rate and is an estimate of how many milliseconds each tuple sits in the queue before it is processed.
-Prior to STORM-2621 (v1.1.1, v1.2.0, and v2.0.0) these were the rate of entries, not of tuples.

-A disruptor queue has a set maximum number of entries. If the regular queue fills up an overflow queue takes over. The number of tuple batches stored in this overflow section are represented by the `overflow` metric. Storm also does some micro batching of tuples for performance/efficiency reasons so you may see the overflow with a very small number in it even if the queue is not full.
+The queue has a set maximum number of entries. If the regular queue fills up, an overflow queue takes over. The number of tuples stored in this overflow section is represented by the `overflow` metric. Note that an overflow queue is only used for executors to receive tuples from remote workers. It doesn't apply to intra-worker tuple transfer.

-`read_pos` and `write_pos` are internal disruptor accounting numbers. You can think of them almost as the total number of entries written (`write_pos`) or read (`read_pos`) since the queue was created. They allow for integer overflow so if you use them please take that into account.
+`capacity` is the maximum number of entries in the queue. `population` is the number of entries currently filled in the queue. `pct_full` tracks the percentage of capacity in use.

-`capacity` is the maximum number of entries in the disruptor queue. `population` is the number of entries currently filled in the queue.
-
-`tuple_population` is the number of tuples currently in the queue as opposed to the number of entries. This was added at the same time as STORM-2621 (v1.1.1, v1.2.0, and v2.0.0)
+`insert_failures` tracks the number of failures inserting into the queue. `dropped_messages` tracks messages dropped due to the overflow queue being full.

 #### System Bolt (Worker) Metrics
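
To make the queue metrics in the hunk above more concrete, here is a minimal Python sketch that parses a sample like the one shown (written out as valid JSON) and derives a few health indicators from it. The helper `summarize_queue` and its output keys are hypothetical. The wait estimate uses Little's law (population divided by arrival rate), which is an assumption made for illustration and need not match the `sojourn_time_ms` value that Storm reports.

```python
import json

# Sample values copied from the documentation hunk, as valid JSON.
SAMPLE = """
{
  "arrival_rate_secs": 1229.1195171893523,
  "overflow": 0,
  "sojourn_time_ms": 2.440771591407277,
  "capacity": 1024,
  "population": 19,
  "pct_full": "0.018",
  "insert_failures": "0",
  "dropped_messages": "0"
}
"""


def summarize_queue(metrics: dict) -> dict:
    """Derive simple indicators from one queue-metrics sample.

    Some values arrive as strings in the sample, so everything is
    converted with float() before use.
    """
    capacity = float(metrics["capacity"])
    population = float(metrics["population"])
    arrival_rate = float(metrics["arrival_rate_secs"])
    return {
        # Share of the queue currently occupied (population / capacity).
        "fill_fraction": population / capacity if capacity > 0 else 0.0,
        # Little's law estimate of waiting time (assumption, see lead-in).
        "littles_law_wait_ms": (
            population / arrival_rate * 1000.0 if arrival_rate > 0 else None
        ),
        "overflowing": float(metrics["overflow"]) > 0,
        "has_insert_failures": float(metrics["insert_failures"]) > 0,
        "has_dropped_messages": float(metrics["dropped_messages"]) > 0,
    }


summary = summarize_queue(json.loads(SAMPLE))
print(summary)
```

A monitoring script built along these lines might alert when `overflowing` or `has_dropped_messages` becomes true, since per the text above those indicate that the regular queue, or the overflow queue itself, has filled up.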

Diff for: docs/metrics_v2.md (+1)

@@ -145,6 +145,7 @@ to using the long metric name, but can report the short name by configuring repo
 ## Backwards Compatibility Notes

 1. V2 metrics can also be reported to the Metrics Consumers registered with `topology.metrics.consumer.register` by enabling the `topology.enable.v2.metrics.tick` configuration.
+The rate at which they are reported to Metrics Consumers is controlled by `topology.v2.metrics.tick.interval.seconds`, defaulting to every 60 seconds.

 2. Starting from storm 2.3, the config `storm.metrics.reporters` is deprecated in favor of `topology.metrics.reporters`.
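
The two settings named in the first note could be combined as in the Python sketch below, which builds the overrides as a dictionary and renders them as YAML-style `key: value` lines. The helper `render_yaml` is hypothetical, the 30-second interval is an arbitrary example value, and where the rendered lines belong (for example, a topology's configuration) is an assumption rather than something the diff states.

```python
def render_yaml(conf: dict) -> str:
    """Render a flat config dict as YAML-style 'key: value' lines."""
    lines = []
    for key, value in conf.items():
        if isinstance(value, bool):
            rendered = "true" if value else "false"  # YAML booleans are lowercase
        else:
            rendered = str(value)
        lines.append(f"{key}: {rendered}")
    return "\n".join(lines)


# Config keys come from the documentation; the interval value is illustrative.
v2_tick_conf = {
    "topology.enable.v2.metrics.tick": True,
    "topology.v2.metrics.tick.interval.seconds": 30,
}
print(render_yaml(v2_tick_conf))
```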
