
Version 15.1.0 upgrade forces node group re-creation, causing production outages #1318

Closed
Ghazgkull opened this issue Apr 23, 2021 · 8 comments · Fixed by #1372

Comments


Ghazgkull commented Apr 23, 2021

Description

Version 15.1.0 of this module introduced a breaking change to the configuration of node groups with the move from a single instance_type property to a list of instance_types.

I understand that being able to optionally pass in a list of instance_types is a great feature for some users, but forcing the change on everyone is a real problem. It causes all existing node groups to be re-created, which means a service outage for everyone using this module.

The ability to pass in multiple instance_types should be implemented in a way that doesn't force the re-creation of node groups, for example by retaining the option to pass in a single instance_type.
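For illustration, the change looks roughly like this in a consumer's configuration (a sketch based on the property names above, not an excerpt from my actual configuration):

# before, on 13.x / 15.0.x
node_groups = {
  default = {
    instance_type = "t3.medium"
  }
}

# after, on 15.1.0
node_groups = {
  default = {
    instance_types = ["t3.medium"]
  }
}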

Versions

$ terraform -version
Terraform v0.15.0
on darwin_amd64

  • provider registry.terraform.io/gavinbunney/kubectl v1.10.0
  • provider registry.terraform.io/hashicorp/aws v3.37.0
  • provider registry.terraform.io/hashicorp/external v2.1.0
  • provider registry.terraform.io/hashicorp/kubernetes v2.1.0
  • provider registry.terraform.io/hashicorp/local v2.1.0
  • provider registry.terraform.io/hashicorp/null v3.1.0
  • provider registry.terraform.io/hashicorp/random v3.1.0
  • provider registry.terraform.io/hashicorp/template v2.2.0
  • provider tf.platforms.nike.com/platforms/cerberus v0.8.1
  • Module: 15.1.0

Reproduction

Steps to reproduce the behavior:

  1. Start with an existing EKS cluster deployed with this module on version 13.x
  2. Upgrade to the latest version of this module (roughly the version bump sketched below).
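
A minimal sketch of the upgrade step (the registry source shown here is an assumption, not copied from my configuration):

module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "15.1.0"   # previously pinned to a 13.x release

  # ... existing cluster and node_groups configuration left unchanged ...
}

Running terraform plan after the version bump is enough to see the node groups scheduled for replacement.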
@calvinbui

It may also be that the AMI version was updated. You have to pin that down.
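
For illustration, pinning at the underlying resource level looks roughly like this (the module exposes an equivalent node group setting; the variable names and release string below are placeholders):

resource "aws_eks_node_group" "example" {
  cluster_name    = var.cluster_name
  node_group_name = "default"
  node_role_arn   = var.node_role_arn
  subnet_ids      = var.subnet_ids

  # Pinning the EKS-optimized AMI release keeps node rotation out of routine plans;
  # the version string is a hypothetical example.
  release_version = "1.19.6-20210504"

  scaling_config {
    desired_size = 2
    max_size     = 3
    min_size     = 1
  }
}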


camhine commented May 4, 2021

I'm in the same situation 😞

Is there a documented workaround?

I'm considering:

  1. Spin up some temporary K8s nodes outside of this module.
  2. Move the workload across to those temporary nodes, draining the ones managed by this module.
  3. Upgrade to 15.1.0 and apply, which will recreate the original nodes without service disruption (since they were already drained).
  4. Drain/terminate the temporary nodes.

I think that approach would work... but it's not ideal.
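
Step 2 would be a plain cordon/drain against each module-managed node, something like (node name is a placeholder):

kubectl cordon ip-10-0-1-23.ec2.internal            # stop new pods landing on the node
kubectl drain ip-10-0-1-23.ec2.internal --ignore-daemonsets   # evict workloads onto the temporary nodes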


jlundy2 commented May 4, 2021

I'm hitting this too. I've tried upgrading from 12.20 to anything later, and it happens every time. It looks like it's tied to a userdata update.


pgaulon commented May 5, 2021

There is still the disaster recovery option of editing the state by hand. Of course, that is a last-resort thing to do.

In my case, switching to 15.2.0, the forced replacement came from the random_pet resource:

+/- resource "random_pet" "node_groups" {
      ~ id        = "modern-rattler" -> (known after apply)
      ~ keepers   = { # forces replacement
          + "ami_type"                  = "AL2_x86_64"
          + "disk_size"                 = "20"
          - "instance_type"             = "t3.medium" -> null
          + "instance_types"            = "t3.medium"
            # (5 unchanged elements hidden)
        }
        # (2 unchanged attributes hidden)
    }
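
The keepers map is what drives this: the module names each node group after the pet, so renaming the keeper key (instance_type -> instance_types) forces a new pet, which means a new node group name and therefore a replaced node group. Roughly, paraphrased rather than quoted from the module source:

resource "random_pet" "node_groups" {
  for_each = local.node_groups

  keepers = {
    instance_types = join("|", each.value.instance_types)  # key renamed in 15.x, so the keeper set changes
    # ... other settings whose change should roll the node group ...
  }
}

resource "aws_eks_node_group" "workers" {
  for_each = local.node_groups

  # A new pet id produces a new node_group_name, which AWS can only satisfy by
  # creating a new node group and destroying the old one.
  node_group_name = join("-", [var.cluster_name, each.key, random_pet.node_groups[each.key].id])
  # ...
}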

To make the switch without resource changes, I adapted the state to match:

terraform state pull > old-state
cp old-state new-state
# edit new-state
terraform state push new-state
terraform plan -out terraform.tfplan

The edit consisted of bumping the serial (so the pushed state is accepted as a newer, valid state) and changing the ami_type, disk_size, and instance_types keepers:

4c4
<   "serial": 18,
---
>   "serial": 14,
6045,6046d6044
<               "ami_type": "AL2_x86_64",
<               "disk_size": "20",
6048,6049c6046
<               "instance_type": null,
<               "instance_types": "t3.medium",
---
>               "instance_type": "t3.medium",

Again, last resort.

@baturayozcan

I have the same issue and it's blocking all of our work. When will this be fixed?


soloradish commented May 10, 2021

Me too

Terraform used the selected providers to generate the following execution plan. Resource actions are indicated with the following symbols:
  ~ update in-place

Terraform will perform the following actions:

  # module.eks.module.node_groups.random_pet.node_groups["default"] will be updated in-place
  ~ resource "random_pet" "node_groups" {
        id        = "content-cub"
        # (3 unchanged attributes hidden)
    }

After a fresh, successful apply, the second run still shows this `update in-place` change, and applying it fails:

Do you want to perform these actions?
  Terraform will perform the actions described above.
  Only 'yes' will be accepted to approve.

  Enter a value: yes

module.eks.module.node_groups.random_pet.node_groups["default"]: Modifying... [id=content-cub]
╷
│ Error: doesn't support update
│
│   with module.eks.module.node_groups.random_pet.node_groups["default"],
│   on .terraform/modules/eks/modules/node_groups/random.tf line 1, in resource "random_pet" "node_groups":
│    1: resource "random_pet" "node_groups" {
│
╵
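
For anyone else debugging this, the drifting keepers can be inspected directly with the address from the plan above:

terraform state show 'module.eks.module.node_groups.random_pet.node_groups["default"]'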

barryib (Member) commented May 19, 2021

Please see #1372
