STORM-3791 update metric documentation (apache#3409)

agresch · web-flow · commit 0787d86010d8 · 2021-08-18T09:49:38.000-05:00
* STORM-3791 update metric documentation
diff --git a/docs/ClusterMetrics.md b/docs/ClusterMetrics.md
@@ -180,11 +180,13 @@ Metrics associated with the supervisor, which launches the workers for a topolog
 | supervisor:blob-localization-duration | timer | Approximately how long it takes to get the blob we want after it is requested. |
 | supervisor:current-reserved-memory-mb | gauge | total amount of memory reserved for workers on the supervisor (MB) |
 | supervisor:current-used-memory-mb | gauge | memory currently used as measured by the supervisor (this typically requires cgroups) (MB) |
+| supervisor:health-check-timeouts | meter | tracks timeouts executing health check scripts |
 | supervisor:local-resource-file-not-found-when-releasing-slot | meter | number of times file-not-found exception happens when reading local blobs upon releasing slots |
 | supervisor:num-blob-update-version-changed | meter | number of times a version of a blob changes. |
 | supervisor:num-cleanup-exceptions | meter | exceptions thrown during container cleanup. |
 | supervisor:num-force-kill-exceptions | meter | exceptions thrown during force kill. |
 | supervisor:num-kill-exceptions | meter | exceptions thrown during kill. |
+| supervisor:num-kill-worker-errors | meter | errors killing workers. |
 | supervisor:num-launched | meter | number of times the supervisor is launched. |
 | supervisor:num-shell-exceptions | meter | number of exceptions calling shell commands. |
 | supervisor:num-slots-used-gauge | gauge | number of slots used on the supervisor. |
diff --git a/docs/LocalityAwareness.md b/docs/LocalityAwareness.md
@@ -48,7 +48,7 @@ If the downstream executor located on the same worker as the executor `E`, the l
 The capacity of a bolt executor on Storm UI is calculated as:
   * (number executed * average execute latency) / measurement time
 
-It basically means how busy this executor is. If this is around 1.0, the corresponding Bolt is running as fast as it can. 
+It basically means how busy this executor is. If this is around 1.0, the corresponding Bolt is running as fast as it can. A `__capacity` metric exists to track this value for each executor.
 
 The `Capacity` is not related to the `Load`:
 
diff --git a/docs/Metrics.md b/docs/Metrics.md
@@ -213,37 +213,33 @@ This metric records how many errors were reported by a spout/bolt. It is the tot
 
 #### Queue Metrics
 
-Each bolt or spout instance in a topology has a receive queue and a send queue.  Each worker also has a queue for sending messages to other workers.  All of these have metrics that are reported.
+Each bolt or spout instance in a topology has a receive queue.  Each worker also has a worker transfer queue for sending messages to other workers.  All of these have metrics that are reported.
 
-The receive queue metrics are reported under the `__receive` name and send queue metrics are reported under the `__sendqueue` for the given bolt/spout they are a part of.  The metrics for the queue that sends messages to other workers is under the `__transfer` metric name for the system bolt (`__system`).
+The receive queue metrics are reported under the `receive_queue` name.  The metrics for the queue that sends messages to other workers are reported under the `worker-transfer-queue` metric name for the system bolt (`__system`).
 
-They all have the form.
+These queues report the following metrics:
 
 ```
 {
     "arrival_rate_secs": 1229.1195171893523,
     "overflow": 0,
-    "read_pos": 103445,
-    "write_pos": 103448,
     "sojourn_time_ms": 2.440771591407277,
     "capacity": 1024,
-    "population": 19
-    "tuple_population": 200
+    "population": 19,
+    "pct_full": "0.018",
+    "insert_failures": "0",
+    "dropped_messages": "0"
 }
 ```
-In storm we sometimes batch multiple tuples into a single entry in the disruptor queue. This batching is an optimization that has been in storm in some form since the beginning, but the metrics did not always reflect this so be careful with how you interpret the metrics and pay attention to which metrics are for tuples and which metrics are for entries in the disruptor queue. The `__receive` and `__transfer` queues can have batching but the `__sendqueue` should not.
 
 `arrival_rate_secs` is an estimation of the number of tuples that are inserted into the queue in one second, although it is actually the dequeue rate.
 The `sojourn_time_ms` is calculated from the arrival rate and is an estimate of how many milliseconds each tuple sits in the queue before it is processed.
-Prior to STORM-2621 (v1.1.1, v1.2.0, and v2.0.0) these were the rate of entries, not of tuples.
 
-A disruptor queue has a set maximum number of entries.  If the regular queue fills up an overflow queue takes over.  The number of tuple batches stored in this overflow section are represented by the `overflow` metric.  Storm also does some micro batching of tuples for performance/efficiency reasons so you may see the overflow with a very small number in it even if the queue is not full.
+The queue has a set maximum number of entries.  If the regular queue fills up, an overflow queue takes over.  The number of tuples stored in this overflow section is represented by the `overflow` metric.  Note that an overflow queue is only used for executors to receive tuples from remote workers. It doesn't apply to intra-worker tuple transfer.
 
-`read_pos` and `write_pos` are internal disruptor accounting numbers.  You can think of them almost as the total number of entries written (`write_pos`) or read (`read_pos`) since the queue was created.  They allow for integer overflow so if you use them please take that into account.
+`capacity` is the maximum number of entries in the queue. `population` is the number of entries currently filled in the queue. `pct_full` tracks the percentage of capacity in use.
 
-`capacity` is the maximum number of entries in the disruptor queue. `population` is the number of entries currently filled in the queue.
-
-`tuple_population` is the number of tuples currently in the queue as opposed to the number of entries.  This was added at the same time as STORM-2621 (v1.1.1, v1.2.0, and v2.0.0)
+`insert_failures` tracks the number of failures inserting into the queue. `dropped_messages` tracks messages dropped due to the overflow queue being full.
 
 #### System Bolt (Worker) Metrics
 
diff --git a/docs/metrics_v2.md b/docs/metrics_v2.md
@@ -145,6 +145,7 @@ to using the long metric name, but can report the short name by configuring repo
 ## Backwards Compatibility Notes
 
 1. V2 metrics can also be reported to the Metrics Consumers registered with `topology.metrics.consumer.register` by enabling the `topology.enable.v2.metrics.tick` configuration.
+The rate at which they are reported to Metrics Consumers is controlled by `topology.v2.metrics.tick.interval.seconds`, which defaults to every 60 seconds.
 
 2. Starting from storm 2.3, the config `storm.metrics.reporters` is deprecated in favor of `topology.metrics.reporters`.
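
As a rough illustration of two values documented in this change, the sketch below computes the bolt capacity formula from `docs/LocalityAwareness.md` and a queue fill fraction from the `population` and `capacity` fields shown in `docs/Metrics.md`. The function names are hypothetical helpers, not Storm APIs. It assumes that latency and measurement time share the same unit (milliseconds here), and that `pct_full` is simply `population / capacity`, which the diff does not state explicitly.

```python
def bolt_capacity(num_executed: int, avg_execute_latency_ms: float, window_ms: float) -> float:
    """(number executed * average execute latency) / measurement time.

    Hypothetical helper; assumes latency and window are both in milliseconds.
    A result near 1.0 means the executor was busy for the whole window.
    """
    if window_ms <= 0:
        raise ValueError("window_ms must be positive")
    return (num_executed * avg_execute_latency_ms) / window_ms


def queue_fill_fraction(population: int, capacity: int) -> float:
    """Fraction of queue entries in use.

    Assumption: this mirrors what `pct_full` reports; not confirmed by the diff.
    """
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    return population / capacity


if __name__ == "__main__":
    # Illustrative inputs only; population and capacity are taken from the sample JSON.
    print(bolt_capacity(num_executed=1200, avg_execute_latency_ms=0.5, window_ms=600_000))
    print(queue_fill_fraction(population=19, capacity=1024))
```

A capacity well below 1.0 combined with a low fill fraction suggests the executor has headroom; a value approaching 1.0 with a growing `population` suggests it is falling behind.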