Commit 5beb9d7

feat: Add ASG lifecycle management Lambda function (#392)
This introduces an Auto Scaling Group instance termination lifecycle hook using Lambda and related resources. The Lambda function is a Python script that is triggered when the persistent runner instance in the ASG is terminated. The function receives the instance ID of the "parent" runner, queries for the instances that runner spawned, and terminates them. It also checks for other "orphaned" instances whose `gitlab-runner-parent-id` tag does not match an existing instance. This resolves the issue where spawned instances could be orphaned when their parent runner is terminated.

The user data script is updated to tag spawned instances with the parent instance ID under the tag name `gitlab-runner-parent-id`.

A new sub-module called "terminate-instances" is provided. The feature is optional: the input variable `asg_terminate_lifecycle_hook_create` toggles it `true` or `false`, and it defaults to `true`.
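The selection rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the Lambda code from this commit: the function names `parent_id_of` and `select_instances_to_terminate` are hypothetical, and the input is assumed to be a list of non-terminated instances shaped like the entries EC2 returns (an `InstanceId` plus a `Tags` list of `Key`/`Value` pairs). Only the tag name `gitlab-runner-parent-id` comes from the commit description.

```python
from typing import Dict, List, Optional, Set

# Tag name taken from the commit description.
PARENT_TAG = "gitlab-runner-parent-id"


def parent_id_of(instance: Dict) -> Optional[str]:
    """Return the value of the parent tag, or None if the instance has no such tag."""
    for tag in instance.get("Tags", []):
        if tag.get("Key") == PARENT_TAG:
            return tag.get("Value")
    return None


def select_instances_to_terminate(
    instances: List[Dict], terminating_parent_id: str
) -> List[str]:
    """Hypothetical helper: pick the terminating parent's children plus orphans.

    Assumes `instances` holds only instances that still exist (not terminated).
    """
    # The parent that is being terminated no longer counts as an existing instance.
    existing_ids: Set[str] = {i["InstanceId"] for i in instances}
    existing_ids.discard(terminating_parent_id)

    to_terminate: List[str] = []
    for instance in instances:
        parent = parent_id_of(instance)
        if parent is None:
            continue  # no parent tag, so not a spawned instance; leave it alone
        if parent == terminating_parent_id or parent not in existing_ids:
            to_terminate.append(instance["InstanceId"])
    return to_terminate
```

Keeping the selection separate from any AWS call makes the rule easy to check in isolation; the actual termination request would be issued on the returned IDs.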
1 parent c6b7014 commit 5beb9d7

12 files changed: 638 additions, 3 deletions

README.md: 32 additions
@@ -143,6 +143,19 @@ Once you have created the parameter, you must remove the variable `runners_token
 
 Finally, the runner still supports the manual runner creation. No changes are required. Please keep in mind that this setup will be removed in future releases.
 
+### Auto Scaling Group Instance Termination
+
+The Auto Scaling Group may be configured with a
+[lifecycle hook](https://docs.aws.amazon.com/autoscaling/ec2/userguide/lifecycle-hooks.html)
+that executes a provided Lambda function when the runner is terminated to
+terminate additional instances that were spawned.
+
+The use of the termination lifecycle can be toggled using the
+`asg_terminate_lifecycle_hook_create` variable.
+
+When using this feature, a `builds/` directory relative to the root module will
+persist that contains the packaged Lambda function.
+
 ### Access runner instance
 
 A few options are provided to access the runner instance:
@@ -259,25 +272,33 @@ terraform destroy
 | Name | Source | Version |
 |------|--------|---------|
 | <a name="module_cache"></a> [cache](#module\_cache) | ./modules/cache | n/a |
+| <a name="module_terminate_instances_lifecycle_function"></a> [terminate\_instances\_lifecycle\_function](#module\_terminate\_instances\_lifecycle\_function) | ./modules/terminate-instances | n/a |
 
 ## Resources
 
 | Name | Type |
 |------|------|
+| [archive_file.terminate_runner_instances_lambda](https://registry.terraform.io/providers/hashicorp/archive/latest/docs/data-sources/file) | data source |
 | [aws_autoscaling_group.gitlab_runner_instance](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/autoscaling_group) | resource |
+| [aws_autoscaling_lifecycle_hook.terminate_instances](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/autoscaling_lifecycle_hook) | resource |
 | [aws_autoscaling_schedule.scale_in](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/autoscaling_schedule) | resource |
 | [aws_autoscaling_schedule.scale_out](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/autoscaling_schedule) | resource |
+| [aws_cloudwatch_event_rule.terminate_instances](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/cloudwatch_event_rule) | resource |
+| [aws_cloudwatch_event_target.terminate_instances](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/cloudwatch_event_target) | resource |
 | [aws_cloudwatch_log_group.environment](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/cloudwatch_log_group) | resource |
+| [aws_cloudwatch_log_group.lambda](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/cloudwatch_log_group) | resource |
 | [aws_eip.gitlab_runner](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/eip) | resource |
 | [aws_iam_instance_profile.docker_machine](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/iam_instance_profile) | resource |
 | [aws_iam_instance_profile.instance](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/iam_instance_profile) | resource |
 | [aws_iam_policy.eip](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/iam_policy) | resource |
 | [aws_iam_policy.instance_docker_machine_policy](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/iam_policy) | resource |
 | [aws_iam_policy.instance_session_manager_policy](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/iam_policy) | resource |
+| [aws_iam_policy.lambda](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/iam_policy) | resource |
 | [aws_iam_policy.service_linked_role](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/iam_policy) | resource |
 | [aws_iam_policy.ssm](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/iam_policy) | resource |
 | [aws_iam_role.docker_machine](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/iam_role) | resource |
 | [aws_iam_role.instance](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/iam_role) | resource |
+| [aws_iam_role.lambda](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/iam_role) | resource |
 | [aws_iam_role_policy.instance](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/iam_role_policy) | resource |
 | [aws_iam_role_policy_attachment.docker_machine_cache_instance](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/iam_role_policy_attachment) | resource |
 | [aws_iam_role_policy_attachment.docker_machine_session_manager_aws_managed](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/iam_role_policy_attachment) | resource |
@@ -286,11 +307,17 @@ terraform destroy
 | [aws_iam_role_policy_attachment.instance_docker_machine_policy](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/iam_role_policy_attachment) | resource |
 | [aws_iam_role_policy_attachment.instance_session_manager_aws_managed](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/iam_role_policy_attachment) | resource |
 | [aws_iam_role_policy_attachment.instance_session_manager_policy](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/iam_role_policy_attachment) | resource |
+| [aws_iam_role_policy_attachment.lambda](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/iam_role_policy_attachment) | resource |
 | [aws_iam_role_policy_attachment.service_linked_role](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/iam_role_policy_attachment) | resource |
 | [aws_iam_role_policy_attachment.ssm](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/iam_role_policy_attachment) | resource |
 | [aws_iam_role_policy_attachment.user_defined_policies](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/iam_role_policy_attachment) | resource |
+| [aws_iam_policy_document.assume_role](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/data-sources/iam_policy_document) | data source |
+| [aws_iam_policy_document.lambda](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/data-sources/iam_policy_document) | data source |
 | [aws_kms_alias.default](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/kms_alias) | resource |
 | [aws_kms_key.default](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/kms_key) | resource |
+| [aws_lambda_function.terminate_runner_instances](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/lambda_function) | resource |
+| [aws_lambda_permission.current_version_triggers](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/lambda_permission) | resource |
+| [aws_lambda_permission.unqualified_alias_triggers](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/lambda_permission) | resource |
 | [aws_launch_template.gitlab_runner_instance](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/launch_template) | resource |
 | [aws_security_group.docker_machine](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/security_group) | resource |
 | [aws_security_group.runner](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/security_group) | resource |
@@ -322,6 +349,11 @@ terraform destroy
 | <a name="input_arn_format"></a> [arn\_format](#input\_arn\_format) | ARN format to be used. May be changed to support deployment in GovCloud/China regions. | `string` | `"arn:aws"` | no |
 | <a name="input_asg_delete_timeout"></a> [asg\_delete\_timeout](#input\_asg\_delete\_timeout) | Timeout when trying to delete the Runner ASG. | `string` | `"10m"` | no |
 | <a name="input_asg_max_instance_lifetime"></a> [asg\_max\_instance\_lifetime](#input\_asg\_max\_instance\_lifetime) | The seconds before an instance is refreshed in the ASG. | `number` | `null` | no |
+| <a name="input_asg_terminate_lifecycle_hook_create"></a> [asg\_terminate\_lifecycle\_hook\_create](#input\_asg\_terminate\_lifecycle\_hook\_create) | Boolean toggling the creation of the ASG instance terminate lifecycle hook. | `bool` | `true` | no |
+| <a name="input_asg_terminate_lifecycle_hook_heartbeat_timeout"></a> [asg\_terminate\_lifecycle\_hook\_heartbeat\_timeout](#input\_asg\_terminate\_lifecycle\_hook\_heartbeat\_timeout) | The amount of time, in seconds, for the instances to remain in wait state. | `number` | `90` | no |
+| <a name="input_asg_terminate_lifecycle_hook_name"></a> [asg\_terminate\_lifecycle\_hook\_name](#input\_asg\_terminate\_lifecycle\_hook\_name) | Specifies a custom name for the ASG terminate lifecycle hook and related resources. | `string` | `null` | no |
+| <a name="input_asg_terminate_lifecycle_lambda_memory_size"></a> [asg\_terminate\_lifecycle\_lambda\_memory\_size](#input\_asg\_terminate\_lifecycle\_lambda\_memory\_size) | The memory size in MB to allocate to the terminate-instances Lambda function. | `number` | `128` | no |
+| <a name="input_asg_terminate_lifecycle_lambda_timeout"></a> [asg\_terminate\_lifecycle\_lambda\_timeout](#input\_asg\_terminate\_lifecycle\_lambda\_timeout) | Amount of time the terminate-instances Lambda Function has to run in seconds. | `number` | `30` | no |
 | <a name="input_aws_region"></a> [aws\_region](#input\_aws\_region) | AWS region. | `string` | n/a | yes |
 | <a name="input_cache_bucket"></a> [cache\_bucket](#input\_cache\_bucket) | Configuration to control the creation of the cache bucket. By default the bucket will be created and used as shared cache. To use the same cache across multiple runners disable the creation of the cache and provide a policy and bucket name. See the public runner example for more details. | `map(any)` | <pre>{<br> "bucket": "",<br> "create": true,<br> "policy": ""<br>}</pre> | no |
 | <a name="input_cache_bucket_name_include_account_id"></a> [cache\_bucket\_name\_include\_account\_id](#input\_cache\_bucket\_name\_include\_account\_id) | Boolean to add current account ID to cache bucket name. | `bool` | `true` | no |

main.tf: 23 additions, 2 deletions
@@ -111,7 +111,7 @@ locals {
       runners_additional_volumes = local.runners_additional_volumes
       docker_machine_options = length(local.docker_machine_options_string) == 1 ? "" : local.docker_machine_options_string
       runners_name = var.runners_name
-      runners_tags = replace(var.overrides["name_docker_machine_runners"] == "" ? format(
+      runners_tags = replace(replace(var.overrides["name_docker_machine_runners"] == "" ? format(
         "Name,%s-docker-machine,%s,%s",
         var.environment,
         local.tags_string,
@@ -121,7 +121,7 @@ locals {
         local.tags_string,
         local.runner_tags_string,
         var.overrides["name_docker_machine_runners"],
-      ), ",,", ",")
+      ), ",,", ","), "/,$/", "")
       runners_token = var.runners_token
       runners_executor = var.runners_executor
       runners_limit = var.runners_limit
@@ -504,3 +504,24 @@ resource "aws_iam_role_policy_attachment" "eip" {
   role = aws_iam_role.instance.name
   policy_arn = aws_iam_policy.eip[0].arn
 }
+
+################################################################################
+### Lambda function for ASG instance termination lifecycle hook
+################################################################################
+module "terminate_instances_lifecycle_function" {
+  source = "./modules/terminate-instances"
+
+  count = var.asg_terminate_lifecycle_hook_create ? 1 : 0
+
+  name = var.asg_terminate_lifecycle_hook_name == null ? "terminate-instances" : var.asg_terminate_lifecycle_hook_name
+  environment = var.environment
+  asg_arn = aws_autoscaling_group.gitlab_runner_instance.arn
+  asg_name = aws_autoscaling_group.gitlab_runner_instance.name
+  cloudwatch_logging_retention_in_days = var.cloudwatch_logging_retention_in_days
+  lambda_memory_size = var.asg_terminate_lifecycle_lambda_memory_size
+  lifecycle_heartbeat_timeout = var.asg_terminate_lifecycle_hook_heartbeat_timeout
+  name_iam_objects = local.name_iam_objects
+  role_permissions_boundary = var.permissions_boundary == "" ? null : "${var.arn_format}:iam::${data.aws_caller_identity.current.account_id}:policy/${var.permissions_boundary}"
+  lambda_timeout = var.asg_terminate_lifecycle_lambda_timeout
+  tags = local.tags
+}
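
The `runners_tags` change in the first two `main.tf` hunks wraps the existing `replace(..., ",,", ",")` in a second `replace` whose pattern `"/,$/"` is a regular expression that strips a trailing comma. The effect of the two steps can be sketched in Python as below. This is only an illustration of the string transformation: the helper name `clean_tag_string` and the sample tag values are hypothetical, not taken from the module.

```python
import re


def clean_tag_string(raw: str) -> str:
    """Hypothetical helper mirroring the two nested replace() calls."""
    # Step 1, like replace(s, ",,", ","): literal substring replacement.
    collapsed = raw.replace(",,", ",")
    # Step 2, like replace(s, "/,$/", ""): drop a comma at the very end of the string.
    return re.sub(r",\Z", "", collapsed)


# Sample input with an empty field in the middle and a trailing comma.
print(clean_tag_string("Name,dev-docker-machine,,Team,ci,"))
# prints "Name,dev-docker-machine,Team,ci"
```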
