
Version 15.1.0 upgrade forces node group re-creation, causing production outages #1318

Closed
Ghazgkull opened this issue Apr 23, 2021 · 8 comments · Fixed by #1372

Comments


Ghazgkull commented Apr 23, 2021

Description

Version 15.1.0 of this module introduced a breaking change to the configuration of node groups with the move from a single instance_type property to a list of instance_types.

I understand that being able to optionally pass in a list of instance_types is a great feature for some users, but forcing the change on everyone is a real problem. It causes all existing node groups to be re-created, which means a service outage for everyone using this module.

The ability to pass in multiple instance_types should be implemented in a way that doesn't force the re-creation of node groups, for example by retaining the option to pass in a single instance_type.
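For illustration, the change looks roughly like this in a consumer's configuration (a sketch based on the property names above, not an excerpt from my actual configuration):

# before, on 13.x / 15.0.x
node_groups = {
  default = {
    instance_type = "t3.medium"
  }
}

# after, on 15.1.0
node_groups = {
  default = {
    instance_types = ["t3.medium"]
  }
}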

Versions

$ terraform -version
Terraform v0.15.0
on darwin_amd64

  • provider registry.terraform.io/gavinbunney/kubectl v1.10.0
  • provider registry.terraform.io/hashicorp/aws v3.37.0
  • provider registry.terraform.io/hashicorp/external v2.1.0
  • provider registry.terraform.io/hashicorp/kubernetes v2.1.0
  • provider registry.terraform.io/hashicorp/local v2.1.0
  • provider registry.terraform.io/hashicorp/null v3.1.0
  • provider registry.terraform.io/hashicorp/random v3.1.0
  • provider registry.terraform.io/hashicorp/template v2.2.0
  • provider tf.platforms.nike.com/platforms/cerberus v0.8.1
  • Module: 15.1.0

Reproduction

Steps to reproduce the behavior:

  1. Start with an existing EKS cluster deployed with this module on version 13.x
  2. Upgrade to the latest version of this module (roughly the version bump sketched below).
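
A minimal sketch of the upgrade step (the registry source shown here is an assumption, not copied from my configuration):

module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "15.1.0"   # previously pinned to a 13.x release

  # ... existing cluster and node_groups configuration left unchanged ...
}

Running terraform plan after the version bump is enough to see the node groups scheduled for replacement.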
@calvinbui

It may also be that the AMI version was updated. You have to pin that down.
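
For illustration, pinning at the underlying resource level looks roughly like this (the module exposes an equivalent node group setting; the variable names and release string below are placeholders):

resource "aws_eks_node_group" "example" {
  cluster_name    = var.cluster_name
  node_group_name = "default"
  node_role_arn   = var.node_role_arn
  subnet_ids      = var.subnet_ids

  # Pinning the EKS-optimized AMI release keeps node rotation out of routine plans;
  # the version string is a hypothetical example.
  release_version = "1.19.6-20210504"

  scaling_config {
    desired_size = 2
    max_size     = 3
    min_size     = 1
  }
}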


camhine commented May 4, 2021

I'm in the same situation 😞

Is there a documented workaround?

I'm considering:

  1. Spin up some temporary K8s nodes outside of this module.
  2. Move the workload across to those temporary nodes, draining the ones managed by this module.
  3. Upgrade to 15.1.0 and apply, which will recreate the original nodes without service disruption (since they were already drained).
  4. Drain/terminate the temporary nodes.

I think that approach would work... but it's not ideal.
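
Step 2 would be a plain cordon/drain against each module-managed node, something like (node name is a placeholder):

kubectl cordon ip-10-0-1-23.ec2.internal            # stop new pods landing on the node
kubectl drain ip-10-0-1-23.ec2.internal --ignore-daemonsets   # evict workloads onto the temporary nodes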


jlundy2 commented May 4, 2021

I'm hitting this too. I've tried upgrading from 12.20 to anything later, and it happens every time. It looks like it's tied to a userdata update.


pgaulon commented May 5, 2021

There is still the disaster recovery option of editing the state by hand. Of course, that is a last-resort thing to do.

In my case, switching to 15.2.0, the forced replacement came from the random_pet resource:

+/- resource "random_pet" "node_groups" {
      ~ id        = "modern-rattler" -> (known after apply)
      ~ keepers   = { # forces replacement
          + "ami_type"                  = "AL2_x86_64"
          + "disk_size"                 = "20"
          - "instance_type"             = "t3.medium" -> null
          + "instance_types"            = "t3.medium"
            # (5 unchanged elements hidden)
        }
        # (2 unchanged attributes hidden)
    }
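
The keepers map is what drives this: the module names each node group after the pet, so renaming the keeper key (instance_type -> instance_types) forces a new pet, which means a new node group name and therefore a replaced node group. Roughly, paraphrased rather than quoted from the module source:

resource "random_pet" "node_groups" {
  for_each = local.node_groups

  keepers = {
    instance_types = join("|", each.value.instance_types)  # key renamed in 15.x, so the keeper set changes
    # ... other settings whose change should roll the node group ...
  }
}

resource "aws_eks_node_group" "workers" {
  for_each = local.node_groups

  # A new pet id produces a new node_group_name, which AWS can only satisfy by
  # creating a new node group and destroying the old one.
  node_group_name = join("-", [var.cluster_name, each.key, random_pet.node_groups[each.key].id])
  # ...
}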

To make the switch without resource changes, I adapted the state to match:

terraform state pull > old-state
cp old-state new-state
# edit new-state
terraform state push new-state
terraform plan -out terraform.tfplan

The edit consisted of bumping the serial (so the pushed state is accepted as a newer, valid state) and changing the ami_type, disk_size, and instance_types keepers:

4c4
<   "serial": 18,
---
>   "serial": 14,
6045,6046d6044
<               "ami_type": "AL2_x86_64",
<               "disk_size": "20",
6048,6049c6046
<               "instance_type": null,
<               "instance_types": "t3.medium",
---
>               "instance_type": "t3.medium",

Again, last resort.

@baturayozcan

I have the same issue and it's blocking all of our work. When will this be fixed?


soloradish commented May 10, 2021

Me too

Terraform used the selected providers to generate the following execution plan. Resource actions are indicated with the following symbols:
  ~ update in-place

Terraform will perform the following actions:

  # module.eks.module.node_groups.random_pet.node_groups["default"] will be updated in-place
  ~ resource "random_pet" "node_groups" {
        id        = "content-cub"
        # (3 unchanged attributes hidden)
    }

After a fresh, successful apply, the second run still shows this `update in-place` change, and applying it fails:

Do you want to perform these actions?
  Terraform will perform the actions described above.
  Only 'yes' will be accepted to approve.

  Enter a value: yes

module.eks.module.node_groups.random_pet.node_groups["default"]: Modifying... [id=content-cub]
╷
│ Error: doesn't support update
│
│   with module.eks.module.node_groups.random_pet.node_groups["default"],
│   on .terraform/modules/eks/modules/node_groups/random.tf line 1, in resource "random_pet" "node_groups":
│    1: resource "random_pet" "node_groups" {
│
╵
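
For anyone else debugging this, the drifting keepers can be inspected directly with the address from the plan above:

terraform state show 'module.eks.module.node_groups.random_pet.node_groups["default"]'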

barryib (Member) commented May 19, 2021

Please see #1372
