
Alerting Gateway Architecture


Table of Contents

  • Architecture
  • High Level Overview
  • Alerting Gateway

Alerting Gateway

The gateway-side Alerting server, henceforth called the Alerting plugin, is responsible for managing and applying the following user configurations:

  • endpoints, specifications used by the Alerting Cluster to dispatch alerts
  • conditions, specifications applied to datasources to evaluate observability data
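To make the two configuration types concrete, here is a minimal sketch of their shapes in Go. The struct and field names are illustrative assumptions, not the plugin's actual definitions.

```go
package alerting

// AlertEndpoint describes where the Alerting Cluster dispatches alerts,
// e.g. a Slack channel or an email address. (Hypothetical shape.)
type AlertEndpoint struct {
	Name        string
	Description string
	// Exactly one receiver type is expected to be set.
	Slack *SlackReceiver
	Email *EmailReceiver
}

type SlackReceiver struct {
	WebhookURL string
	Channel    string
}

type EmailReceiver struct {
	To string
}

// AlertCondition describes what to evaluate against a datasource and which
// endpoints to notify when the condition fires. (Hypothetical shape.)
type AlertCondition struct {
	Name        string
	Datasource  string   // "internal" or "metrics"
	EndpointIDs []string // endpoints to route triggered alerts to
}
```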

The alerting plugin is also responsible for connecting to datasources:

  • Aggregating available information from datasources into templates for the client
  • Managing dependencies on the datasources that make the observations required to evaluate conditions

The alerting plugin exposes an API to dynamically install and scale the Alerting Cluster, which delegates the necessary updates to the controller through a cluster driver.
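A minimal sketch of that delegation chain, assuming hypothetical interface and method names rather than the plugin's actual API:

```go
package alerting

import "context"

// ClusterDriver hides how the Alerting Cluster is actually reconciled,
// e.g. by updating a resource watched by the Alerting Controller.
// (Hypothetical interface; the real driver API differs.)
type ClusterDriver interface {
	InstallCluster(ctx context.Context) error
	ScaleCluster(ctx context.Context, replicas int32) error
	UninstallCluster(ctx context.Context) error
}

// OpsServer exposes the driver over the plugin's API surface, so clients can
// install and scale the Alerting Cluster dynamically.
type OpsServer struct {
	driver ClusterDriver
}

func (s *OpsServer) ScaleCluster(ctx context.Context, replicas int32) error {
	// Delegate the update to the controller through the cluster driver.
	return s.driver.ScaleCluster(ctx, replicas)
}
```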

In short, the alerting gateway plugin process can be broken down into:

  • Initialization Phase: setting up the dependencies the core alerting gateway plugin requires to run
  • Backend Components: servers that handle the logic and requests behind Opni Alerting features

Initialization Phase

The initialization phase of the alerting plugin is responsible for setting up the correct adapters to:

  • persistent storage
  • the Alerting Cluster, managed by the Alerting Controller
  • external datasources: Opni backends & internal gateway state

In more detail, when the alerting plugin initializes it must (see the sketch after this list):

  1. Set up the cluster driver, an adapter to the Alerting Controller
  2. Wrap the cluster driver in an API (OpsServer)
  3. Acquire the storage API clients
  4. Set up datasources:
  • 4.1. Acquire gateway-internal streams, and set up persistent streams to watch the data they carry
  • 4.2. Acquire the metrics ops backend client, scrape its status API, and send the results to a persistent stream
  • 4.3. Acquire the metrics admin client to CRUD Cortex rule objects
  5. Reindex (re-apply) user configurations if the external datasource dependencies aren't loaded. For details on how external datasource dependencies are loaded, see Datasource Implementations
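The sketch below condenses those five steps into one hypothetical initialization function; every constructor and field name is a stand-in for illustration, not the plugin's real code.

```go
// initialize wires up the dependencies described in steps 1-5 above.
// All helpers (newClusterDriver, newStorageClientSet, ...) are hypothetical.
func (p *AlertingPlugin) initialize(ctx context.Context) error {
	// 1. Cluster driver: the adapter to the Alerting Controller.
	driver, err := newClusterDriver(ctx)
	if err != nil {
		return err
	}

	// 2. Wrap the cluster driver in the ops API.
	p.opsServer = &OpsServer{driver: driver}

	// 3. Storage API clients for persisted endpoints and conditions.
	p.storage = newStorageClientSet(ctx)

	// 4. Datasources.
	p.internalStreams = watchGatewayStreams(ctx) // 4.1: persistent streams over gateway-internal streams
	p.metricsOps = newMetricsOpsClient(ctx)      // 4.2: status API scraped into a persistent stream
	p.metricsAdmin = newMetricsAdminClient(ctx)  // 4.3: CRUD of Cortex rule objects

	// 5. Re-apply user configurations whose datasource dependencies
	//    were not loaded yet.
	return p.reindexConditions(ctx)
}
```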

Components

Datasource Implementations

There are currently two datasources for Opni Alerting:

  • Internal: system-critical information exposed by the gateway
  • Metrics: information exposed in metrics format by the Opni Metrics backend

The currently supported Opni conditions map to datasources as follows (a sketch of this mapping follows the list):

  • Agent disconnect -> internal datasource
  • Capability unhealthy -> internal datasource
  • Monitoring backend -> internal datasource
  • Prometheus Query -> metrics datasource
  • Kube State -> metrics datasource
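Expressed as data, the mapping looks roughly like the following; the string keys are the informal condition names used above, not the exact identifiers in the codebase.

```go
package alerting

// conditionDatasources maps each supported condition type to the datasource
// that evaluates it. (Illustrative only.)
var conditionDatasources = map[string]string{
	"Agent disconnect":     "internal",
	"Capability unhealthy": "internal",
	"Monitoring backend":   "internal",
	"Prometheus Query":     "metrics",
	"Kube State":           "metrics",
}
```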

Internal

Alerting conditions backed by the internal datasource are evaluated using custom internal evaluator objects. These conditions are not evaluated using metrics because we need a way to observe the Opni system with as few assumptions as possible.

Each internal datasource sets up a persistent stream that scrapes information from a stream/unary API exposed by the gateway.

The information on these persistent streams is backed by a durable consumer with a small buffer to replay information.
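As one way to picture this, the sketch below sets up a stream with a small message buffer and a durable consumer using NATS JetStream; the stream name, subject, and consumer name are made up, and the plugin's actual wiring may differ.

```go
package main

import (
	"fmt"
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// Small buffer: keep only the most recent messages so a restarting
	// evaluator can replay a short window of health information.
	if _, err := js.AddStream(&nats.StreamConfig{
		Name:     "AGENT_HEALTH",
		Subjects: []string{"alerting.internal.health.>"},
		MaxMsgs:  64,
	}); err != nil {
		log.Fatal(err)
	}

	// Durable consumer: the subscription survives restarts and resumes
	// where it left off.
	if _, err := js.Subscribe("alerting.internal.health.>", func(m *nats.Msg) {
		fmt.Printf("health update on %s: %s\n", m.Subject, string(m.Data))
	}, nats.Durable("alerting-internal-evaluator"), nats.DeliverAll()); err != nil {
		log.Fatal(err)
	}

	select {} // block and keep consuming
}
```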

Internal evaluator

  • Internal data APIs are read and then grouped by a given key; typically this key is a unique cluster ID. Grouping lets internal conditions scale across multiple downstream clusters.
  • All changes to internal evaluators are propagated through an evaluation context (see the sketch after this list). The values in the evaluation context are responsible for:
    • Telling the subscriber where to read its messages from AND how to convert those messages to a format the inner evaluator understands.
    • Telling the evaluator when to trigger/resolve alerts based on messages received from the subscriber over time.
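A simplified sketch of that pattern follows; all type and field names are hypothetical. The evaluation context carries the read location, the decoder, and the timing rule, while the evaluator loop decides when to fire or resolve.

```go
package alerting

import (
	"context"
	"time"
)

// healthMsg is the decoded form the inner evaluator understands. (Hypothetical.)
type healthMsg struct {
	clusterID string
	healthy   bool
}

// evaluationContext tells the subscriber where to read messages from and how
// to decode them, and gives the evaluator its trigger/resolve timing rule.
type evaluationContext struct {
	subject string                         // where the subscriber reads from
	decode  func([]byte) (healthMsg, bool) // convert raw messages for the inner evaluator
	timeout time.Duration                  // no message for this long => trigger
}

// runEvaluator consumes decoded messages (already grouped by cluster ID
// upstream) and decides when to fire or resolve the alert.
func runEvaluator(ctx context.Context, ec evaluationContext, msgs <-chan healthMsg, fire, resolve func(clusterID string)) {
	timer := time.NewTimer(ec.timeout)
	defer timer.Stop()
	var last healthMsg
	for {
		select {
		case <-ctx.Done():
			return
		case m := <-msgs:
			last = m
			if m.healthy {
				resolve(m.clusterID)
			} else {
				fire(m.clusterID)
			}
			timer.Reset(ec.timeout)
		case <-timer.C:
			// No updates within the window: treat the cluster as disconnected.
			if last.clusterID != "" {
				fire(last.clusterID)
			}
			timer.Reset(ec.timeout)
		}
	}
}
```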

Metrics

Alerting conditions backed by the metrics datasource are evaluated using Prometheus rule objects.

More specifically, Opni Alerting manages these dependent objects via a CRUD rules API exposed by the metrics admin API.

Multiple rule objects are CRUDed (and organized into groups) depending on the template of the metrics-backed condition.
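The sketch below shows one way a metrics-backed condition template could be expanded into a Prometheus rule group for the metrics admin API to create or update; the struct shapes mirror Prometheus rule-file YAML but are simplified, and the names are assumptions.

```go
package alerting

import "time"

type ruleGroup struct {
	Name  string `yaml:"name"`
	Rules []rule `yaml:"rules"`
}

type rule struct {
	Alert  string            `yaml:"alert"`
	Expr   string            `yaml:"expr"`
	For    time.Duration     `yaml:"for"`
	Labels map[string]string `yaml:"labels"`
}

// ruleGroupForCondition expands one metrics-backed condition into a rule
// group, keyed by the condition ID so it can later be updated or deleted.
// (Hypothetical helper for illustration.)
func ruleGroupForCondition(conditionID, promQL string, holdFor time.Duration) ruleGroup {
	return ruleGroup{
		Name: "opni-alerting-" + conditionID,
		Rules: []rule{{
			Alert:  "OpniAlertingCondition",
			Expr:   promQL,
			For:    holdFor,
			Labels: map[string]string{"opni_condition_id": conditionID},
		}},
	}
}
```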
