feat: add setting to limit parallel runner pods #545

15 changes: 8 additions & 7 deletions api/v1alpha1/terraformrepository_types.go
@@ -30,13 +30,14 @@ type TerraformRepositorySpec struct {
// INSERT ADDITIONAL SPEC FIELDS - desired state of cluster
// Important: Run "make" to regenerate code after modifying this file

Repository TerraformRepositoryRepository `json:"repository,omitempty"`
TerraformConfig TerraformConfig `json:"terraform,omitempty"`
TerragruntConfig TerragruntConfig `json:"terragrunt,omitempty"`
OpenTofuConfig OpenTofuConfig `json:"opentofu,omitempty"`
RemediationStrategy RemediationStrategy `json:"remediationStrategy,omitempty"`
OverrideRunnerSpec OverrideRunnerSpec `json:"overrideRunnerSpec,omitempty"`
RunHistoryPolicy RunHistoryPolicy `json:"runHistoryPolicy,omitempty"`
Repository TerraformRepositoryRepository `json:"repository,omitempty"`
TerraformConfig TerraformConfig `json:"terraform,omitempty"`
TerragruntConfig TerragruntConfig `json:"terragrunt,omitempty"`
OpenTofuConfig OpenTofuConfig `json:"opentofu,omitempty"`
RemediationStrategy RemediationStrategy `json:"remediationStrategy,omitempty"`
OverrideRunnerSpec OverrideRunnerSpec `json:"overrideRunnerSpec,omitempty"`
RunHistoryPolicy RunHistoryPolicy `json:"runHistoryPolicy,omitempty"`
MaxConcurrentRunnerPods int `json:"maxConcurrentRuns,omitempty"`
}

type TerraformRepositoryRepository struct {
1 change: 1 addition & 0 deletions cmd/controllers/start.go
@@ -38,6 +38,7 @@ func buildControllersStartCmd(app *burrito.App) *cobra.Command {
cmd.Flags().DurationVar(&app.Config.Controller.Timers.FailureGracePeriod, "failure-grace-period", defaultFailureGracePeriod, "initial time before retry, goes exponential function of number failure. Must end with s, m or h.")
cmd.Flags().IntVar(&app.Config.Controller.TerraformMaxRetries, "terraform-max-retries", 5, "default number of retries for terraform actions (can be overriden in CRDs)")
cmd.Flags().IntVar(&app.Config.Controller.MaxConcurrentReconciles, "max-concurrent-reconciles", 1, "maximum number of concurrent reconciles")
cmd.Flags().IntVar(&app.Config.Controller.MaxConcurrentRunnerPods, "max-runner-pods", 0, "maximum number of concurrent runner pods")
cmd.Flags().BoolVar(&app.Config.Controller.LeaderElection.Enabled, "leader-election", true, "whether leader election is enabled or not, default to true")
cmd.Flags().StringVar(&app.Config.Controller.LeaderElection.ID, "leader-election-id", "6d185457.terraform.padok.cloud", "lease id used for leader election")
cmd.Flags().StringVar(&app.Config.Controller.HealthProbeBindAddress, "health-probe-bind-address", ":8081", "address to bind the metrics server embedded in the controllers")
4 changes: 3 additions & 1 deletion deploy/charts/burrito/values.yaml
@@ -23,11 +23,13 @@ config:
# -- Duration to wait before retrying on error
onError: 10s
# -- Duration to wait before retrying on locked layer
waitAction: 1m
waitAction: 10s
# -- Duration to wait before retrying on failure (increases exponentially with the amount of failed retries)
failureGracePeriod: 30
# -- Maximum number of concurrent reconciles for the controller, increase this value if you have a lot of resources to reconcile
maxConcurrentReconciles: 1
# -- Maximum number of concurrent runner pods. 0 means no limit
maxConcurrentRunnerPods: 0
# -- Maximum number of retries for Terraform operations (plan, apply...)
terraformMaxRetries: 3
# TODO: enable repository controller by default
5 changes: 5 additions & 0 deletions docs/operator-manual/advanced-configuration.md
@@ -1,5 +1,8 @@
# Advanced configuration

Here are some important configuration options that can be set to customize Burrito's behavior.
They can be set in the Helm chart [values](https://github.com/padok-team/burrito/blob/main/deploy/charts/burrito/values.yaml) or as environment variables.

## Controllers' configuration

| Environment variable | Description | Default |
@@ -16,6 +19,8 @@
| `BURRITO_CONTROLLER_HEALTHPROBEBINDADDRESS` | address to bind the health probe server embedded in the controllers | `:8081` |
| `BURRITO_CONTROLLER_METRICSBINDADDRESS` | address to bind the metrics server embedded in the controllers | `:8080` |
| `BURRITO_CONTROLLER_KUBERNETESWEBHOOKPORT` | port used by the validating webhook server embedded in the controllers | `9443` |
| `BURRITO_CONTROLLER_MAXCONCURRENTRECONCILES` | number of parallel resource reconciliations performed by the controller | `1` |
| `BURRITO_CONTROLLER_MAXCONCURRENTRUNNERPODS` | maximum number of runner pods that run in parallel to perform plan/apply (`0` = no limit) | `0` |

## Server's configuration

15 changes: 15 additions & 0 deletions docs/operator-manual/runner-scheduling.md
@@ -0,0 +1,15 @@
# Fine-tuning the scheduling of runner pods

Burrito creates runner pods to execute plans and apply changes to your infrastructure. You can fine-tune how these pods are scheduled, for example to avoid running too many pods at the same time or to reduce the cost of your underlying infrastructure.

## Limit the number of runner pods in parallel

By default, Burrito does not limit the number of runner pods that can run in parallel. This can lead to a high number of pods running at the same time, which can be costly or can overload your infrastructure.

It is possible to limit the number of runner pods that can run in parallel by setting the `BURRITO_CONTROLLER_MAXCONCURRENTRUNNERPODS` environment variable in the controller, or by setting the `config.burrito.controller.maxConcurrentRunnerPods` value in the [Helm chart values file](https://github.com/padok-team/burrito/blob/main/deploy/charts/burrito/values.yaml).

You can also set this value per repository with the `spec.maxConcurrentRuns` field of the TerraformRepository CRD.

If the value of this parameter is set to `0`, there is no limit to the number of runner pods that can run in parallel.

If the setting is defined both in the controller and in the TerraformRepository, the TerraformRepository value takes precedence, provided it is greater than `0`. A TerraformRepository value of `0` falls back to the controller setting.
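
The precedence rule described above can be sketched as two small functions. This is an illustrative sketch only: `effectiveLimit` and `canStartRunner` are hypothetical names, not functions from the Burrito code base, and the behavior is inferred from the reconciler change in this pull request.

```go
package main

import "fmt"

// effectiveLimit returns the limit that applies to a run (hypothetical helper).
// A repository value greater than 0 overrides the controller-wide value.
func effectiveLimit(controllerLimit, repositoryLimit int) int {
	if repositoryLimit > 0 {
		return repositoryLimit
	}
	return controllerLimit
}

// canStartRunner reports whether another runner pod may be created
// (hypothetical helper). A limit of 0 or less means "no limit".
func canStartRunner(runningPods, limit int) bool {
	return limit <= 0 || runningPods < limit
}

func main() {
	// Controller allows 10 pods, the repository narrows it to 3.
	limit := effectiveLimit(10, 3)
	fmt.Println(limit, canStartRunner(2, limit), canStartRunner(3, limit))

	// Neither level sets a limit: runs are never held back.
	fmt.Println(canStartRunner(100, effectiveLimit(0, 0)))
}
```

With a controller limit of 10 and a repository limit of 3, a new runner pod is allowed while fewer than 3 are counted and deferred once 3 are counted.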
2 changes: 2 additions & 0 deletions internal/burrito/config/config.go
@@ -76,6 +76,7 @@ type ControllerConfig struct {
GitlabConfig GitlabConfig `mapstructure:"gitlabConfig"`
RunParallelism int `mapstructure:"runParallelism"`
MaxConcurrentReconciles int `mapstructure:"maxConcurrentReconciles"`
MaxConcurrentRunnerPods int `mapstructure:"maxConcurrentRunnerPods"`
}

type GithubConfig struct {
@@ -229,6 +230,7 @@ func TestConfig() *Config {
Controller: ControllerConfig{
TerraformMaxRetries: 5,
MaxConcurrentReconciles: 1,
MaxConcurrentRunnerPods: 0,
Timers: ControllerTimers{
DriftDetection: 20 * time.Minute,
WaitAction: 5 * time.Minute,
3 changes: 3 additions & 0 deletions internal/burrito/config/config_test.go
@@ -66,6 +66,7 @@ func TestConfig_FromYamlFile(t *testing.T) {
},
TerraformMaxRetries: 5,
MaxConcurrentReconciles: 1,
MaxConcurrentRunnerPods: 0,
Types: []string{"layer", "repository", "run", "pullrequest"},
LeaderElection: config.LeaderElectionConfig{
Enabled: true,
@@ -146,6 +147,7 @@ func TestConfig_EnvVarOverrides(t *testing.T) {
setEnvVar(t, "BURRITO_CONTROLLER_TIMERS_WAITACTION", "30s", &envVarList)
setEnvVar(t, "BURRITO_CONTROLLER_TIMERS_FAILUREGRACEPERIOD", "1m", &envVarList)
setEnvVar(t, "BURRITO_CONTROLLER_MAXCONCURRENTRECONCILES", "3", &envVarList)
setEnvVar(t, "BURRITO_CONTROLLER_MAXCONCURRENTRUNNERPODS", "10", &envVarList)
setEnvVar(t, "BURRITO_CONTROLLER_TERRAFORMMAXRETRIES", "32", &envVarList)
setEnvVar(t, "BURRITO_CONTROLLER_LEADERELECTION_ID", "other-leader-id", &envVarList)
setEnvVar(t, "BURRITO_CONTROLLER_GITHUBCONFIG_APPID", "123456", &envVarList)
@@ -205,6 +207,7 @@ func TestConfig_EnvVarOverrides(t *testing.T) {
FailureGracePeriod: 1 * time.Minute,
},
MaxConcurrentReconciles: 3,
MaxConcurrentRunnerPods: 10,
TerraformMaxRetries: 32,
Types: []string{"layer", "repository"},
LeaderElection: config.LeaderElectionConfig{
1 change: 1 addition & 0 deletions internal/burrito/config/testdata/test-config-1.yaml
@@ -24,6 +24,7 @@ controller:
failureGracePeriod: 15s
terraformMaxRetries: 5
maxConcurrentReconciles: 1
maxConcurrentRunnerPods: 0
types: ["layer", "repository", "run", "pullrequest"]
leaderElection:
enabled: true
1 change: 1 addition & 0 deletions internal/controllers/terraformrun/pod.go
@@ -25,6 +25,7 @@ const (

func getDefaultLabels(run *configv1alpha1.TerraformRun) map[string]string {
return map[string]string{
"burrito/component": "runner",
"burrito/managed-by": run.Name,
"burrito/action": string(run.Spec.Action),
}
35 changes: 35 additions & 0 deletions internal/controllers/terraformrun/states.go
@@ -11,7 +11,10 @@ import (
log "github.com/sirupsen/logrus"
corev1 "k8s.io/api/core/v1"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
"k8s.io/apimachinery/pkg/labels"
"k8s.io/apimachinery/pkg/selection"
ctrl "sigs.k8s.io/controller-runtime"
"sigs.k8s.io/controller-runtime/pkg/client"
)

type RunInfo struct {
@@ -79,6 +82,38 @@ func (s *Initial) getHandler() Handler {
log.Errorf("could not set lock on run %s for layer %s, requeuing resource: %s", run.Name, layer.Name, err)
return ctrl.Result{RequeueAfter: r.Config.Controller.Timers.OnError}, RunInfo{}
}
// Start from the controller-wide limit; a repository-level limit greater than 0 overrides it
maxConcurrentRunnerPods := r.Config.Controller.MaxConcurrentRunnerPods
if repo.Spec.MaxConcurrentRunnerPods > 0 {
maxConcurrentRunnerPods = repo.Spec.MaxConcurrentRunnerPods
}
if maxConcurrentRunnerPods > 0 {
// count all running burrito pods to avoid exceeding the maximum number of concurrent runs
runningPods := &corev1.PodList{}
labelSelector := labels.NewSelector()
requirement, err := labels.NewRequirement("burrito/component", selection.Equals, []string{"runner"})
if err != nil {
r.Recorder.Event(run, corev1.EventTypeWarning, "Run", "Could not build label selector for runner pods")
log.Errorf("could not build label selector for runner pods: %s", err)
return ctrl.Result{RequeueAfter: r.Config.Controller.Timers.OnError}, RunInfo{}
}
labelSelector = labelSelector.Add(*requirement)
err = r.Client.List(
ctx,
runningPods,
client.MatchingLabelsSelector{Selector: labelSelector},
)
if err != nil {
r.Recorder.Event(run, corev1.EventTypeWarning, "Run", "Could not list running pods")
log.Errorf("could not list running pods: %s", err)
return ctrl.Result{RequeueAfter: r.Config.Controller.Timers.OnError}, RunInfo{}
}
if len(runningPods.Items) >= maxConcurrentRunnerPods {
r.Recorder.Event(run, corev1.EventTypeWarning, "Run", "Max concurrent runs reached. Requeuing resource...")
log.Infof("max concurrent runs reached, requeuing resource")
return ctrl.Result{RequeueAfter: r.Config.Controller.Timers.WaitAction}, RunInfo{}
}
}
pod := r.getPod(run, layer, repo)
err = r.Client.Create(ctx, &pod)
if err != nil {
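
Review note: the check above counts every pod labeled `burrito/component=runner`, whatever its phase. If finished runner pods stay in the cluster for a while (this diff does not show how they are cleaned up, so that is an assumption), they would count against the limit. A minimal sketch of counting only active pods follows; `pod`, `podPhase` and `countActive` are hypothetical stand-ins for `corev1.Pod`, `corev1.PodPhase` and a helper that does not exist in this pull request.

```go
package main

import "fmt"

// podPhase mirrors the Kubernetes pod phase strings (stand-in for corev1.PodPhase).
type podPhase string

const (
	phasePending   podPhase = "Pending"
	phaseRunning   podPhase = "Running"
	phaseSucceeded podPhase = "Succeeded"
	phaseFailed    podPhase = "Failed"
)

// pod is a reduced stand-in for corev1.Pod, keeping only what the sketch needs.
type pod struct {
	Name  string
	Phase podPhase
}

// countActive returns the number of pods that have not reached a terminal phase.
func countActive(pods []pod) int {
	active := 0
	for _, p := range pods {
		if p.Phase == phaseSucceeded || p.Phase == phaseFailed {
			continue // terminal pods no longer consume runner capacity
		}
		active++
	}
	return active
}

func main() {
	pods := []pod{
		{Name: "plan-a", Phase: phaseRunning},
		{Name: "plan-b", Phase: phasePending},
		{Name: "apply-c", Phase: phaseSucceeded},
		{Name: "apply-d", Phase: phaseFailed},
	}
	fmt.Println(countActive(pods))
}
```

In the reconciler, the same filter would be applied to `runningPods.Items` using `pod.Status.Phase` before comparing the count to the limit.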
@@ -53,6 +53,8 @@ spec:
spec:
description: TerraformRepositorySpec defines the desired state of TerraformRepository
properties:
maxConcurrentRuns:
type: integer
opentofu:
properties:
enabled:
2 changes: 2 additions & 0 deletions manifests/install.yaml
@@ -3515,6 +3515,8 @@ spec:
spec:
description: TerraformRepositorySpec defines the desired state of TerraformRepository
properties:
maxConcurrentRuns:
type: integer
opentofu:
properties:
enabled:
1 change: 1 addition & 0 deletions mkdocs.yml
@@ -34,6 +34,7 @@ nav:
- operator-manual/multi-tenant-architecture.md
- operator-manual/datastore.md
- operator-manual/provider-caching.md
- operator-manual/runner-scheduling.md
- User Guide:
- user-guide/index.md
- user-guide/override-runner.md