docs/ClusterMetrics.md (+2)

@@ -180,11 +180,13 @@ Metrics associated with the supervisor, which launches the workers for a topology
| supervisor:blob-localization-duration| timer | Approximately how long it takes to get the blob we want after it is requested. |
| supervisor:current-reserved-memory-mb| gauge | total amount of memory reserved for workers on the supervisor (MB) |
| supervisor:current-used-memory-mb| gauge | memory currently used as measured by the supervisor (this typically requires cgroups) (MB) |
+| supervisor:health-check-timeouts| meter | tracks timeouts executing health check scripts |
| supervisor:local-resource-file-not-found-when-releasing-slot| meter | number of times file-not-found exception happens when reading local blobs upon releasing slots |
| supervisor:num-blob-update-version-changed| meter | number of times a version of a blob changes. |
| supervisor:num-cleanup-exceptions| meter | exceptions thrown during container cleanup. |
| supervisor:num-force-kill-exceptions| meter | exceptions thrown during force kill. |
| supervisor:num-kill-exceptions| meter | exceptions thrown during kill. |
+| supervisor:num-kill-worker-errors| meter | errors killing workers. |
| supervisor:num-launched| meter | number of times the supervisor is launched. |
| supervisor:num-shell-exceptions| meter | number of exceptions calling shell commands. |
| supervisor:num-slots-used-gauge| gauge | number of slots used on the supervisor. |
docs/LocalityAwareness.md (+1, -1)

@@ -48,7 +48,7 @@ If the downstream executor located on the same worker as the executor `E`, the l
The capacity of a bolt executor on Storm UI is calculated as:
* (number executed * average execute latency) / measurement time

-It basically means how busy this executor is. If this is around 1.0, the corresponding Bolt is running as fast as it can.
+It basically means how busy this executor is. If this is around 1.0, the corresponding Bolt is running as fast as it can. A `__capacity` metric exists to track this value for each executor.
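
The capacity formula is plain arithmetic over stats the UI already tracks. A minimal Java sketch of the calculation, using hypothetical sample values for the measurement window, is:

```java
// Sketch of the Storm UI capacity formula, using hypothetical sample values.
public class CapacityExample {
    public static void main(String[] args) {
        long numberExecuted = 240_000;               // tuples executed in the window (hypothetical)
        double avgExecuteLatencyMs = 2.5;            // average execute latency in ms (hypothetical)
        double measurementTimeMs = 10 * 60 * 1000.0; // a 10-minute measurement window, in ms

        double capacity = (numberExecuted * avgExecuteLatencyMs) / measurementTimeMs;
        System.out.printf("capacity = %.3f%n", capacity); // 1.000 here, i.e. the bolt is saturated
    }
}
```

With these numbers the executor spends essentially the whole window executing tuples, which is exactly what a capacity near 1.0 indicates.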
docs/Metrics.md (+10, -14)

@@ -213,37 +213,33 @@ This metric records how many errors were reported by a spout/bolt. It is the tot

#### Queue Metrics

-Each bolt or spout instance in a topology has a receive queue and a send queue. Each worker also has a queue for sending messages to other workers. All of these have metrics that are reported.
+Each bolt or spout instance in a topology has a receive queue. Each worker also has a worker transfer queue for sending messages to other workers. All of these have metrics that are reported.

-The receive queue metrics are reported under the `__receive` name and send queue metrics are reported under the `__sendqueue` for the given bolt/spout they are a part of. The metrics for the queue that sends messages to other workers is under the `__transfer` metric name for the system bolt (`__system`).
+The receive queue metrics are reported under the `receive_queue` name. The metrics for the queue that sends messages to other workers is under the `worker-transfer-queue` metric name for the system bolt (`__system`).

-They all have the form.
+These queues report the following metrics:

```
{
    "arrival_rate_secs": 1229.1195171893523,
    "overflow": 0,
-    "read_pos": 103445,
-    "write_pos": 103448,
    "sojourn_time_ms": 2.440771591407277,
    "capacity": 1024,
-    "population": 19
-    "tuple_population": 200
+    "population": 19,
+    "pct_full": "0.018",
+    "insert_failures": "0",
+    "dropped_messages": "0"
}
```
-In storm we sometimes batch multiple tuples into a single entry in the disruptor queue. This batching is an optimization that has been in storm in some form since the beginning, but the metrics did not always reflect this so be careful with how you interpret the metrics and pay attention to which metrics are for tuples and which metrics are for entries in the disruptor queue. The `__receive` and `__transfer` queues can have batching but the `__sendqueue` should not.

`arrival_rate_secs` is an estimation of the number of tuples that are inserted into the queue in one second, although it is actually the dequeue rate.
The `sojourn_time_ms` is calculated from the arrival rate and is an estimate of how many milliseconds each tuple sits in the queue before it is processed.
-Prior to STORM-2621 (v1.1.1, v1.2.0, and v2.0.0) these were the rate of entries, not of tuples.

-A disruptor queue has a set maximum number of entries. If the regular queue fills up an overflow queue takes over. The number of tuple batches stored in this overflow section are represented by the `overflow` metric. Storm also does some micro batching of tuples for performance/efficiency reasons so you may see the overflow with a very small number in it even if the queue is not full.
+The queue has a set maximum number of entries. If the regular queue fills up an overflow queue takes over. The number of tuples stored in this overflow section are represented by the `overflow` metric. Note that an overflow queue is only used for executors to receive tuples from remote workers. It doesn't apply to intra-worker tuple transfer.

-`read_pos` and `write_pos` are internal disruptor accounting numbers. You can think of them almost as the total number of entries written (`write_pos`) or read (`read_pos`) since the queue was created. They allow for integer overflow so if you use them please take that into account.
+`capacity` is the maximum number of entries in the queue. `population` is the number of entries currently filled in the queue. `pct_full` tracks the percentage of capacity in use.

-`capacity` is the maximum number of entries in the disruptor queue. `population` is the number of entries currently filled in the queue.
-
-`tuple_population` is the number of tuples currently in the queue as opposed to the number of entries. This was added at the same time as STORM-2621 (v1.1.1, v1.2.0, and v2.0.0)
+`insert_failures` tracks the number of failures inserting into the queue. `dropped_messages` tracks messages dropped due to the overflow queue being full.
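
One way to watch these queue metrics for a running topology is through a metrics consumer. The sketch below is illustrative only, not Storm tooling: it implements the standard `IMetricsConsumer` interface and logs any data point whose name mentions one of the queues above. It assumes V2 metrics are being forwarded to consumers (the `topology.enable.v2.metrics.tick` setting discussed in metrics_v2.md below), and the exact data point names can vary by reporter, so the class name, package, and substring filter are hypothetical.

```java
package com.example.metrics; // hypothetical package

import java.util.Collection;
import java.util.Map;

import org.apache.storm.metric.api.IMetricsConsumer;
import org.apache.storm.task.IErrorReporter;
import org.apache.storm.task.TopologyContext;

// Illustrative consumer that logs the queue metrics described above.
public class QueueMetricsLogger implements IMetricsConsumer {

    @Override
    public void prepare(Map<String, Object> topoConf, Object registrationArgument,
                        TopologyContext context, IErrorReporter errorReporter) {
        // No setup needed for this sketch.
    }

    @Override
    public void handleDataPoints(TaskInfo taskInfo, Collection<DataPoint> dataPoints) {
        for (DataPoint dp : dataPoints) {
            // Keep only receive queue and worker transfer queue data points (name filter is illustrative).
            if (dp.name.contains("receive_queue") || dp.name.contains("worker-transfer-queue")) {
                System.out.printf("%s[%d]: %s = %s%n",
                        taskInfo.srcComponentId, taskInfo.srcTaskId, dp.name, dp.value);
            }
        }
    }

    @Override
    public void cleanup() {
        // Nothing to clean up.
    }
}
```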
docs/metrics_v2.md (+1)

@@ -145,6 +145,7 @@ to using the long metric name, but can report the short name by configuring repo
## Backwards Compatibility Notes

1. V2 metrics can also be reported to the Metrics Consumers registered with `topology.metrics.consumer.register` by enabling the `topology.enable.v2.metrics.tick` configuration.
+The rate at which they will be reported to Metric Consumers is controlled by `topology.v2.metrics.tick.interval.seconds`, defaulting to every 60 seconds.

2. Starting from storm 2.3, the config `storm.metrics.reporters` is deprecated in favor of `topology.metrics.reporters`.
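
For the first note, a minimal Java sketch of how a topology might wire this up is shown below; the consumer class and interval are just examples, and the string keys are the configuration names quoted above.

```java
import org.apache.storm.Config;
import org.apache.storm.metric.LoggingMetricsConsumer;

// Sketch: forward V2 metrics to a registered metrics consumer.
public class V2MetricsTickExample {
    public static Config buildConf() {
        Config conf = new Config();
        // Equivalent to topology.metrics.consumer.register in the topology config.
        conf.registerMetricsConsumer(LoggingMetricsConsumer.class, 1);
        // Enable the V2 metrics tick so V2 metrics reach the consumer.
        conf.put("topology.enable.v2.metrics.tick", true);
        // Report to consumers every 60 seconds (the default interval).
        conf.put("topology.v2.metrics.tick.interval.seconds", 60);
        return conf;
    }
}
```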