Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add ASG lifecycle management Lambda function #392

Merged
merged 1 commit into from
Feb 25, 2022
Merged

Add ASG lifecycle management Lambda function #392

merged 1 commit into from
Feb 25, 2022

Conversation

joshbeard
Copy link
Contributor

@joshbeard joshbeard commented Oct 15, 2021

Description

This introduces an Auto Scaling Group instance termination lifecycle
hook using Lambda and related resources. The Lambda function is a Python
script that is triggered when the persistent runner instance in the ASG
is terminated. The function receives the instance ID of the "parent"
runner and queries for spawned instances that it launched to terminate.
Additionally, it will check for other "orphaned" instances that have a
gitlab-runner-parent-id tag that doesn't match an existing instance. This
resolves the issue where spawned instances could be orphaned when their
parent runner is terminated.

This feature is disabled by default.

The user data script is updated to provide the 'parent' instance ID as a
tag named 'gitlab-runner-parent-id' on spawned instances.

A new sub-module is provided called "terminate-workers". It is optional
to use this feature, and the input variable
asg_terminate_lifecycle_hook_create can be toggled true or false
for this behavior.

This partially addresses concerns discussed in issue #214

Migrations required

NO

Verification

I've tested existing standard configuration with this addition disabled and enabled with expected results.

  • On a default setup without enabling this feature, there are 2 changes - the user data is updated to provide the gitlab-runner and gitlab-runner-parent tags to spawned worker instances.

    Plan: 0 to add, 2 to change, 0 to destroy.
    
  • When enabled with its default configuration, there are 10 resource additions and the two changes mentioned previously.

    Plan: 10 to add, 2 to change, 0 to destroy.
    

Documentation

We use pre-commit to update the Terraform inputs and outputs in the documentation via terraform-docs. Ensure you have installed those components.

Documentation was updated, tflint and pre-commit hooks ran.

@kayman-mk
Copy link
Collaborator

What a nice feature! I need this too.

I deployed a new runner into my development environment and noticed the following:

asg_terminate_lifecycle_hook_create was not set:

  • the name of my docker+machine instance was changed. It got "gitlab-runner" appended. No idea where this is from. The AWS Console showed "Gitlab Runner eu-central-1b TESTgitlab-runner". I am using name_docker_machine_runners = "Gitlab Runner ${each.value.availability_zone} TEST" to set the name.
  • my docker+machine instance got a new tag with key = true and value = gitlab-runner-parent. This looks quite strange as the key of the tag is not meaningful.

asg_terminate_lifecycle_hook_create = true:

  • a new dependency was introduced. Now I need python installed on my machine. Can we get rid of it?
Error: can't find external program "python.exe"
│
│   with module.gitlab_runner.module.gitlab_runner_test["subnet-0bec5588fdad57ecd"].module.terminate_workers_lifecycle_function[0].module.terminate_runner_workers_lambda.data.external.archive_prepare[0],
│   on .terraform\modules\gitlab_runner.gitlab_runner_test.terminate_workers_lifecycle_function.terminate_runner_workers_lambda\package.tf line 7, in data "external" "archive_prepare":
│    7: data "external" "archive_prepare" {
  • after installing Python I got some strange error messages. The script builds a zip file and then fails with
 zip: adding content of directory: modules/gitlab_runner/gitlab_runner_local/modules/terminate-workers/lambda/
│ zip: adding: lambda_function.py
│ zip: adding: requirements.txt
│ Traceback (most recent call last):
│   File "/mnt/sales-team/platform/gitlab-ci-runner/.terraform/modules/gitlab_runner.gitlab_runner_test.terminate_workers_lifecycle_function.terminate_runner_workers_lambda/package.py", line 1270, in <module>
│     main()
│   File "/mnt/sales-team/platform/gitlab-ci-runner/.terraform/modules/gitlab_runner.gitlab_runner_test.terminate_workers_lifecycle_function.terminate_runner_workers_lambda/package.py", line 1266, in main
│     exit(args.command(args))
│   File "/mnt/sales-team/platform/gitlab-ci-runner/.terraform/modules/gitlab_runner.gitlab_runner_test.terminate_workers_lifecycle_function.terminate_runner_workers_lambda/package.py", line 1158, in build_command
│     bpm.execute(build_plan, zs, query)
│   File "/mnt/sales-team/platform/gitlab-ci-runner/.terraform/modules/gitlab_runner.gitlab_runner_test.terminate_workers_lifecycle_function.terminate_runner_workers_lambda/package.py", line 318, in __exit__
│     self.close()
│   File "/mnt/sales-team/platform/gitlab-ci-runner/.terraform/modules/gitlab_runner.gitlab_runner_test.terminate_workers_lifecycle_function.terminate_runner_workers_lambda/package.py", line 308, in close
│     os.replace(self._tmp_filename, self.filename)
│ FileNotFoundError: [Errno 2] No such file or directory: 'builds/de2e9dd0e52807f975c7bdb95a410fb17ad8cb3648a76229913b8e154420a753.zip.tmp' -> 'builds/de2e9dd0e52807f975c7bdb95a410fb17ad8cb3648a76229913b8e154420a753.zip'

There is a "build" directory but it does not contain the "zip.tmp" but the "zip" file instead. Seems that it has been moved.

@kayman-mk
Copy link
Collaborator

Ok, got it running with asg_terminate_lifecycle_hook_create = true.

  • Found the same naming problems as decribed above: Name of instance and tags.
  • The lifecycle hook is present now. But killing the Agent does not kill my docker+machine runner. I waited for some minutes but it is still there. As my Agent is automatically restartet via the ASG, could it be an issue with the Heartbeat Timeout of the hook? I didn't touch it and expected to work out of the box. It shows 90 seconds.

@kayman-mk
Copy link
Collaborator

I expect one additional parameter only: terminate_orphaned_spot_instances = true. Everything else should be handled internally.

@joshbeard
Copy link
Contributor Author

Thanks for the review, @kayman-mk! I'll work on fixes and adjustments from your feedback. The Python requirement comes from the use of an external module (https://registry.terraform.io/modules/terraform-aws-modules/lambda/) for the Lambda function and is mentioned as a prerequisite when using this feature in the documentation updates in this pull request. However, I think we can get rid of this and handle this simple use case in this module. I'll do some refactoring for this.

On the issue of it not actually doing anything - I'm curious if you can check the CloudWatch event rule to see if it triggered an invocation and if it failed? There's also a CloudWatch log stream for the function itself. On the worker instance, do you see the gitlab-runner-parent tag with a value of the parent instance ID?

@joshbeard joshbeard marked this pull request as draft October 16, 2021 16:11
@kayman-mk
Copy link
Collaborator

Nope, I do not see the tag. I guess it's the missing comma in thePARENT_TAG

@joshbeard
Copy link
Contributor Author

I've made adjustments from @kayman-mk's review, including removing the use of an external module and thus eliminating the 'python/pip' dependencies.

@joshbeard joshbeard marked this pull request as ready for review October 19, 2021 02:34
@npalm
Copy link
Collaborator

npalm commented Oct 19, 2021

@joshbeard will check asap, please can you rebase?

@joshbeard
Copy link
Contributor Author

@npalm rebased

Copy link
Collaborator

@npalm npalm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@joshbeard thanks for your contribution. Made some small remakrs, please can youhave a look. I also need todo a bit more testing.

@joshbeard
Copy link
Contributor Author

@npalm Thank you so much for the thorough review and consideration! I will resolve the issues you pointed out, rebase, and push within the next couple of days.

@joshbeard
Copy link
Contributor Author

@npalm Thank you so much for the thorough review and consideration! I will resolve the issues you pointed out, rebase, and push within the next couple of days.

Couple of days = couple of weeks ;) I've been away for the past couple of weeks.

@joshbeard
Copy link
Contributor Author

joshbeard commented Dec 2, 2021

@npalm @kayman-mk I've finally gotten back around to this and have made updates based on your most recent feedback in commit a290efa.

I noticed the README is updated with some unrelated changes from the pre-commit hook. Not sure why. I left it as a separate commit for now to make it easier to adjust if you want me to remove the unrelated changes (which still seem to be accurate?) That's visible specifically in ab8a2b7

I'll rebase this after follow-up reviews and potential tweaks.

@npalm
Copy link
Collaborator

npalm commented Dec 9, 2021

Great work, added to my backlog.

@npalm
Copy link
Collaborator

npalm commented Dec 16, 2021

Will check in the weekend

@kayman-mk
Copy link
Collaborator

@joshbeard @npalm I did a quick check in my environment.

  • Deploying the module: Success
  • Killing the runner manually: The executor was killed by the Lambda function. Success
  • Didn't find any suspicious log entries in the Lambda log.
  • Executors have a new tag: gitlab-runner-parent-id

What I noticed:

  • Lambda timeout is 3s. This might be too short as the duration of the call was 2s.
  • I have to enable the new function via asg_terminate_lifecycle_hook_create. This doesn't feel handy. I think it is a base functionality which should always be enabled to be sure not to waste money.
  • Log messages are JSON formatted and have a structured field "Level". Really appreciated. Makes log analysis easy.

@npalm
Copy link
Collaborator

npalm commented Jan 13, 2022

@kayman-mk I also will do a check. Would suggest to make the lambda memory as well the time out configurable with a reasonable default

@joshbeard
Copy link
Contributor Author

@kayman-mk I also will do a check. Would suggest to make the lambda memory as well the time out configurable with a reasonable default

@npalm I've made the lambda memory size and the timeout configurable: https://github.com/npalm/terraform-aws-gitlab-runner/pull/392/files#diff-05b5a57c136b6ff596500bcbfdcff145ef6cddea2a0e86d184d9daa9a65a288e

@joshbeard
Copy link
Contributor Author

Rebased against "develop"

@GertVil
Copy link

GertVil commented Jan 30, 2022

Really looking forward to having this merged, because this was one of the "issues" I documented before we could look into using this module at work. Do you have any idea when this can be merged @npalm? (Thank you for the awesome work!)

@npalm
Copy link
Collaborator

npalm commented Feb 2, 2022

Really looking forward to having this merged, because this was one of the "issues" I documented before we could look into using this module at work. Do you have any idea when this can be merged @npalm? (Thank you for the awesome work!)

@GertVil I will try to check the PR tonight. Can you provide a short direction how to test the lambda?

@GertVil
Copy link

GertVil commented Feb 2, 2022

Really looking forward to having this merged, because this was one of the "issues" I documented before we could look into using this module at work. Do you have any idea when this can be merged @npalm? (Thank you for the awesome work!)

@GertVil I will try to check the PR tonight. Can you provide a short direction how to test the lambda?

I'm afraid that @joshbeard will probably be in a better position to provide you with that information since he created all this so he probably has a scenario that he has used before.

Copy link
Collaborator

@npalm npalm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@joshbeard great job! works like a charm! Only one small change request, see inline. Can you also run a rebase please.

@joshbeard
Copy link
Contributor Author

@joshbeard great job! works like a charm! Only one small change request, see inline. Can you also run a rebase please.

@npalm Adjustments made and rebased against 'develop'

@npalm
Copy link
Collaborator

npalm commented Feb 24, 2022

Sorry, missed the update. Will merge the PR tomorrow. If you have time for a rebase, please. Othewrise I will fix it locally!

This introduces an Auto Scaling Group instance termination lifecycle
hook using Lambda and related resources. The Lambda function is a Python
script that is triggered when the persistent runner instance in the ASG
is terminated. The function receives the instance ID of the "parent"
runner and queries for spawned instances that it launched to terminate.
Additionally, it will check for other "orphaned" instances that have a
`gitlab-runner-parent-id` tag that doesn't match an existing instance. This
resolves the issue where spawned instances could be orphaned when their
parent runner is terminated.

This feature is disabled by default.

The user data script is updated to provide the 'parent' instance ID as a
tag named 'gitlab-runner-parent-id' on spawned instances.

A new sub-module is provided called "terminate-workers". It is optional
to use this feature, and the input variable
`asg_terminate_lifecycle_hook_create` can be toggled `true` or `false`
for this behavior.
@joshbeard
Copy link
Contributor Author

Rebased again. The README changed a bit since my last rebase, as I'm sure you're aware. I added the resources to the "resources" section, but that may get overwritten with the current pre-commit config.

@npalm
Copy link
Collaborator

npalm commented Feb 25, 2022

@joshbeard thx, checking. Just a qucik tip for a next time. Fore reviewing it helps if you are not squashing, only rebase. Now I have to go over the full PR again. No worries we get this merged!

Copy link
Collaborator

@npalm npalm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@joshbeard All good, thanks!!!

@npalm npalm merged commit 5beb9d7 into cattle-ops:develop Feb 25, 2022
@joshbeard
Copy link
Contributor Author

@joshbeard thx, checking. Just a qucik tip for a next time. Fore reviewing it helps if you are not squashing, only rebase. Now I have to go over the full PR again. No worries we get this merged!

I'm sorry about that. Thanks for the review.

@joshbeard
Copy link
Contributor Author

@joshbeard All good, thanks!!!

Great! Thanks for your work on this, the review, feedback, and including this in the project!

semantic-releaser bot pushed a commit that referenced this pull request Feb 25, 2022
## [4.40.0](4.39.1...4.40.0) (2022-02-25)

### Features

* Add ASG lifecycle management Lambda function ([#392](#392)) ([5beb9d7](5beb9d7))
* Skip runner download and install if it's already done ([#446](#446)) ([54c10f3](54c10f3))
@semantic-releaser
Copy link
Contributor

🎉 This PR is included in version 4.40.0 🎉

The release is available on GitHub release

Your semantic-release bot 📦🚀

npalm pushed a commit that referenced this pull request Mar 7, 2022
This introduces an Auto Scaling Group instance termination lifecycle
hook using Lambda and related resources. The Lambda function is a Python
script that is triggered when the persistent runner instance in the ASG
is terminated. The function receives the instance ID of the "parent"
runner and queries for spawned instances that it launched to terminate.
Additionally, it will check for other "orphaned" instances that have a
`gitlab-runner-parent-id` tag that doesn't match an existing instance. This
resolves the issue where spawned instances could be orphaned when their
parent runner is terminated.

This feature is disabled by default.

The user data script is updated to provide the 'parent' instance ID as a
tag named 'gitlab-runner-parent-id' on spawned instances.

A new sub-module is provided called "terminate-workers". It is optional
to use this feature, and the input variable
`asg_terminate_lifecycle_hook_create` can be toggled `true` or `false`
for this behavior.
npalm added a commit that referenced this pull request May 17, 2022
This introduces an Auto Scaling Group instance termination lifecycle
hook using Lambda and related resources. The Lambda function is a Python
script that is triggered when the persistent runner instance in the ASG
is terminated. The function receives the instance ID of the "parent"
runner and queries for spawned instances that it launched to terminate.
Additionally, it will check for other "orphaned" instances that have a
`gitlab-runner-parent-id` tag that doesn't match an existing instance. This
resolves the issue where spawned instances could be orphaned when their
parent runner is terminated.

This feature is disabled by default.

The user data script is updated to provide the 'parent' instance ID as a
tag named 'gitlab-runner-parent-id' on spawned instances.

A new sub-module is provided called "terminate-workers". It is optional
to use this feature, and the input variable
`asg_terminate_lifecycle_hook_create` can be toggled `true` or `false`
for this behavior.
npalm added a commit that referenced this pull request May 19, 2022
This introduces an Auto Scaling Group instance termination lifecycle
hook using Lambda and related resources. The Lambda function is a Python
script that is triggered when the persistent runner instance in the ASG
is terminated. The function receives the instance ID of the "parent"
runner and queries for spawned instances that it launched to terminate.
Additionally, it will check for other "orphaned" instances that have a
`gitlab-runner-parent-id` tag that doesn't match an existing instance. This
resolves the issue where spawned instances could be orphaned when their
parent runner is terminated.

This feature is disabled by default.

The user data script is updated to provide the 'parent' instance ID as a
tag named 'gitlab-runner-parent-id' on spawned instances.

A new sub-module is provided called "terminate-workers". It is optional
to use this feature, and the input variable
`asg_terminate_lifecycle_hook_create` can be toggled `true` or `false`
for this behavior.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants