Commit 747e231

merge-squash origin/master into public-attestation
1 parent c452fb1 commit 747e231

47 files changed, with 13806 additions and 994 deletions

Diff for: .maintain/monitoring/README.md

@@ -0,0 +1,13 @@
## Substrate Dashboard

We are using a very slightly modified version of the Robonomics dashboard (https://grafana.com/grafana/dashboards/13015), which has Substrate Prometheus metrics as well as Node Exporter metrics.

You can find our version in `./grafana-dashboard.json`.

## Prometheus and Alert Manager config

Two files are used here: `prometheus.yaml` holds the Prometheus configuration and `alerting-rules.yaml` holds the alerting rules that feed Alert Manager. This simple configuration lets us scrape Substrate and Node Exporter metrics and sends alerts through Alert Manager if there is node downtime. Please refer to the setup guide for more information.

## Setup guide

The good people at Robonomics have created a nice guide to get you started: https://github.com/hubobubo/robonomics/wiki/Robonomics-(XRT)-metrics-using-Prometheus-and-Grafana. You can follow it and import either the dashboard JSON from here or their panel from Grafana.
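
The `prometheus.yaml` file itself is not shown in this part of the diff. As a rough illustration of what the README's "Prometheus and Alert Manager config" section describes, a minimal Prometheus configuration might look like the sketch below. The job names, the intervals, and all three targets are assumptions: they use the usual default ports for a Substrate node (9615), Node Exporter (9100), and Alertmanager (9093), not values taken from this commit.

```yaml
# Hypothetical sketch, not the prometheus.yaml from this commit.
global:
  scrape_interval: 15s       # assumed; how often targets are scraped
  evaluation_interval: 15s   # assumed; how often alerting rules are evaluated

# Load the alerting rules that ship next to this file.
rule_files:
  - alerting-rules.yaml

# Where Prometheus sends firing alerts (assumed local Alertmanager).
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["localhost:9093"]

scrape_configs:
  # Substrate node metrics (assumed default Prometheus port of the node).
  - job_name: substrate_node
    static_configs:
      - targets: ["localhost:9615"]

  # Host metrics from Node Exporter (assumed default port).
  - job_name: node_exporter
    static_configs:
      - targets: ["localhost:9100"]
```

With a layout like this, a scrape target that stops responding has its `up` metric set to 0, which is the condition the downtime alert relies on.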

Diff for: .maintain/monitoring/alerting-rules.yaml

@@ -0,0 +1,29 @@
groups:
  - name: alert_rules
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Instance [{{ $labels.instance }}] down"
          description: "[{{ $labels.instance }}] of job [{{ $labels.job }}] has been down for more than 1 minute."

      - alert: HostOutOfMemory
        expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: Host out of memory (instance {{ $labels.instance }})
          description: "Node memory is filling up (< 10% left)"

      - alert: HostHighCpuLoad
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100) > 80
        for: 0m
        labels:
          severity: warning
        annotations:
          summary: Host high CPU load (instance {{ $labels.instance }})
          description: "CPU load is > 80%"
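
One way to check rules like these before deploying them is a promtool unit test. The sketch below is not part of the commit: the test file name (`alerting-rules.test.yaml`), the job name, and the instance address are hypothetical. It feeds a series where `up` stays at 0 and expects `InstanceDown` to be firing after the 1 minute `for` window has passed.

```yaml
# Hypothetical file: alerting-rules.test.yaml (not part of this commit).
rule_files:
  - alerting-rules.yaml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # Assumed job and instance labels; the target is down for every sample.
      - series: 'up{job="substrate_node", instance="localhost:9615"}'
        values: '0 0 0 0 0'
    alert_rule_test:
      # At 3m the rule has been true for longer than its 1m "for" window.
      - eval_time: 3m
        alertname: InstanceDown
        exp_alerts:
          - exp_labels:
              severity: critical
              job: substrate_node
              instance: localhost:9615
            exp_annotations:
              summary: "Instance [localhost:9615] down"
              description: "[localhost:9615] of job [substrate_node] has been down for more than 1 minute."
```

Assuming promtool is installed, a file like this would be run with `promtool test rules alerting-rules.test.yaml`. The expected annotations must match the rendered templates exactly, so they need updating whenever the `summary` or `description` text changes.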
