Alerting Gateway Architecture
- Alerting terminology
- Architecture
- Initialization Phase
- Components
- Datasource Implementations
The gateway-side Alerting server, henceforth called the Alerting plugin, is responsible for managing & applying the following user configurations (sketched below):
- endpoints : specifications used by the Alerting Cluster to dispatch alerts
- conditions : specifications applied to datasources to evaluate observability data
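As a rough illustration only, the two configuration kinds might be modeled like the Go sketch below; the type and field names are hypothetical, not Opni's actual API types.

```go
package config

// EndpointSpec describes where the Alerting Cluster dispatches alerts.
// Hypothetical shape; the real Opni endpoint specs differ.
type EndpointSpec struct {
	Name string
	// Exactly one receiver configuration is expected to be set.
	Slack *SlackReceiver
	Email *EmailReceiver
}

type SlackReceiver struct{ WebhookURL, Channel string }
type EmailReceiver struct{ To string }

// ConditionSpec describes what a datasource should evaluate and which
// endpoints to notify when the condition fires.
type ConditionSpec struct {
	Name       string
	Datasource string   // "internal" or "metrics"
	Endpoints  []string // names of EndpointSpecs to dispatch to
}
```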
The alerting plugin is also responsible for connecting to datasources:
- aggregating the available information from each datasource into templates for the client
- managing dependencies on the datasources that make the observations required to evaluate conditions
The alerting plugin also exposes an API to dynamically install and scale the Alerting Cluster; it delegates the necessary updates to the controller through a cluster driver, as sketched below.
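A minimal sketch of that delegation, assuming hypothetical `ClusterDriver` and `OpsServer` shapes (the real Opni interfaces differ):

```go
package alertops

import "context"

// ClusterDriver is a hypothetical adapter to the Alerting Controller.
type ClusterDriver interface {
	// InstallCluster asks the controller to deploy the Alerting Cluster.
	InstallCluster(ctx context.Context) error
	// ScaleCluster adjusts the number of Alerting Cluster replicas.
	ScaleCluster(ctx context.Context, replicas int32) error
}

// OpsServer wraps the driver in an API surface; install/scale requests
// are delegated to the controller through the driver.
type OpsServer struct {
	driver ClusterDriver
}

func (s *OpsServer) ConfigureCluster(ctx context.Context, replicas int32) error {
	if err := s.driver.InstallCluster(ctx); err != nil {
		return err
	}
	return s.driver.ScaleCluster(ctx, replicas)
}
```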
In short, the alerting gateway plugin process can be broken down into:
- Initialization Phase : setting up the dependencies the core alerting gateway plugin requires to run
- Backend Components : servers that handle the logic and requests behind Opni Alerting features
Initialization Phase
The initialization phase of the alerting plugin is responsible for setting up the correct adapters to:
- persistent storage
- the Alerting Cluster, managed by the Alerting Controller
- external datasources : Opni backends & internal gateway state
In more detail, when the alerting plugin initializes it must (see the sketch after this list):
1. Set up the cluster driver, an adapter to the Alerting Controller
2. Wrap the cluster driver in an API (OpsServer)
3. Acquire the storage API clients
4. Set up datasources:
- 4.1. Acquire the gateway's internal streams, and set up persistent streams to watch the streamed data
- 4.2. Acquire the metrics ops backend client, scrape its status API, and send the results to a persistent stream
- 4.3. Acquire the metrics admin client, to CRUD Cortex rule objects
5. Reindex (re-apply) user configurations if the external datasource dependencies aren't loaded. For details on how external datasource dependencies are loaded, see Datasource Implementations.
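Put together, the startup order could look roughly like the Go sketch below; every name here is an illustrative stub, not the actual Opni plugin code.

```go
package alerting

import "context"

// Illustrative stubs standing in for the real setup code.
type ClusterDriver interface{ Install(ctx context.Context) error }
type OpsServer struct{ driver ClusterDriver }
type StorageClients struct{}

type Plugin struct {
	opsServer *OpsServer
	storage   *StorageClients
}

func NewOpsServer(d ClusterDriver) *OpsServer { return &OpsServer{driver: d} }

func NewClusterDriver(ctx context.Context) (ClusterDriver, error)        { return nil, nil }
func AcquireStorageClients(ctx context.Context) (*StorageClients, error) { return &StorageClients{}, nil }
func (p *Plugin) setupInternalStreams(ctx context.Context) error         { return nil }
func (p *Plugin) setupMetricsClients(ctx context.Context) error          { return nil }
func (p *Plugin) reindexConditions(ctx context.Context) error            { return nil }

// Initialize sketches the startup order described above.
func (p *Plugin) Initialize(ctx context.Context) error {
	driver, err := NewClusterDriver(ctx) // 1. adapter to the Alerting Controller
	if err != nil {
		return err
	}
	p.opsServer = NewOpsServer(driver) // 2. wrap the driver in the ops API

	if p.storage, err = AcquireStorageClients(ctx); err != nil { // 3. storage API clients
		return err
	}
	if err := p.setupInternalStreams(ctx); err != nil { // 4.1 internal gateway streams
		return err
	}
	if err := p.setupMetricsClients(ctx); err != nil { // 4.2/4.3 metrics ops & admin clients
		return err
	}
	return p.reindexConditions(ctx) // 5. re-apply user configurations
}
```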
Datasource Implementations
There are currently 2 datasources for Opni Alerting:
- internal : system-critical information exposed by the gateway
- metrics : information exposed in metrics format by the Opni Metrics backend
The currently supported Opni conditions map to these datasources:
- Agent disconnect -> internal datasource
- Capability unhealthy -> internal datasource
- Monitoring backend -> internal datasource
- Prometheus Query -> metrics datasource
- Kube State -> metrics datasource
Alerting conditions backed by the internal datasource are evaluated using custom internal evaluator objects. These conditions are not evaluated using metrics because we must be able to observe the Opni system with as few assumptions as possible.
Each internal datasource sets up a persistent stream that scrapes information from a streaming or unary API exposed by the gateway.
The information on these persistent streams is backed by a durable consumer with a small buffer to replay information, as in the sketch below.
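For example, assuming a NATS JetStream-style durable consumer (the stream and subject names here are illustrative, not Opni's actual configuration):

```go
package internaldata

import (
	"time"

	"github.com/nats-io/nats.go"
)

// SubscribeAgentStatus sketches a persistent stream backed by a durable
// consumer with a small replay buffer.
func SubscribeAgentStatus(js nats.JetStreamContext) (*nats.Subscription, error) {
	// A bounded stream retains only a short window of recent messages,
	// enough to replay on restart or reconnect.
	if _, err := js.AddStream(&nats.StreamConfig{
		Name:     "AGENT_STATUS",
		Subjects: []string{"agent.status.*"},
		MaxAge:   5 * time.Minute, // small replay buffer
	}); err != nil {
		return nil, err
	}
	// A durable pull consumer survives plugin restarts and resumes from
	// its last acknowledged message.
	return js.PullSubscribe("agent.status.*", "alerting-internal")
}
```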
- Internal data APIs are read and then grouped by a given key; typically this key represents a unique cluster ID. This grouping is what lets internal conditions scale across multiple downstream clusters.
- All changes to internal evaluators are propagated through an evaluation context (sketched after this list). The values in the evaluation context are responsible for:
  - telling the subscriber where to read its messages from AND how to convert those messages into a format the inner evaluator understands
  - telling the evaluator when to trigger/resolve alerts based on the messages received from the subscriber over time
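A minimal sketch of that trigger/resolve logic, with a hypothetical `EvaluationContext` shape (the real Opni evaluator types differ):

```go
package internaldata

import (
	"context"
	"time"
)

// EvaluationContext is an illustrative stand-in for the values an
// internal evaluator reads its configuration from.
type EvaluationContext struct {
	// Where the subscriber reads messages from.
	Subject string
	// How raw messages are converted into the evaluator's input.
	Decode func(raw []byte) (healthy bool, err error)
	// How long a condition must stay unhealthy before firing.
	FiringDuration time.Duration
}

// Evaluate sketches the trigger/resolve loop: the evaluator fires once
// messages report unhealthy for longer than FiringDuration, and
// resolves as soon as a healthy message arrives.
func Evaluate(ctx context.Context, ec EvaluationContext, msgs <-chan []byte, fire, resolve func()) {
	var unhealthySince time.Time
	firing := false
	for {
		select {
		case <-ctx.Done():
			return
		case raw := <-msgs:
			healthy, err := ec.Decode(raw)
			if err != nil {
				continue // skip messages the evaluator can't understand
			}
			switch {
			case healthy:
				unhealthySince = time.Time{}
				if firing {
					firing = false
					resolve()
				}
			case unhealthySince.IsZero():
				unhealthySince = time.Now()
			case !firing && time.Since(unhealthySince) >= ec.FiringDuration:
				firing = true
				fire()
			}
		}
	}
}
```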
Alerting conditions backed by the metrics datasource are evaluated using Prometheus rule objects.
More specifically, Opni Alerting manages these dependent objects via a CRUD rules API exposed by the metrics admin API.
Multiple rule objects are CRUDed (and organized into groups) depending on the template of the metrics-backed condition, along the lines of the sketch below.
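For a feel of what gets CRUDed, here is a minimal Go mirror of the Prometheus rule-group shape; the group name, alert rule, and labels are illustrative stand-ins for a Kube State-style condition, not Opni's real templates.

```go
package main

import (
	"fmt"

	"gopkg.in/yaml.v3"
)

// Minimal mirror of the Prometheus rule-group format that the metrics
// admin API ultimately manages in Cortex.
type RuleGroup struct {
	Name  string `yaml:"name"`
	Rules []Rule `yaml:"rules"`
}

type Rule struct {
	Alert  string            `yaml:"alert"`
	Expr   string            `yaml:"expr"`
	For    string            `yaml:"for,omitempty"`
	Labels map[string]string `yaml:"labels,omitempty"`
}

func main() {
	group := RuleGroup{
		Name: "opni-alerting-kube-state", // hypothetical group name
		Rules: []Rule{{
			Alert:  "PodNotRunning",
			Expr:   `kube_pod_status_phase{phase="Running"} == 0`,
			For:    "5m",
			Labels: map[string]string{"opni_condition_id": "example"},
		}},
	}
	out, err := yaml.Marshal(group)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```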