Setting a value for cluster_endpoint_public_access_cidrs makes the node group not join the cluster #1867

Closed
abhilashkar opened this issue Feb 11, 2022 · 27 comments · Fixed by #1981

Comments

@abhilashkar

abhilashkar commented Feb 11, 2022

Description

While creating an EKS cluster, if I give addresses in cluster_endpoint_public_access_cidrs, the node group is not able to join the cluster. Cluster creation reports the following error:

module.eks.module.eks_managed_node_group["default_node_group"].aws_eks_node_group.this[0]: Still creating... [26m41s elapsed]

│ Error: error waiting for EKS Node Group (abc- default_node_group-20220210235029685300000001) to create: unexpected state 'CREATE_FAILED', wanted target 'ACTIVE'. last error: 1 error occurred:
│ * i-053f38656f5925c75: NodeCreationFailure: Instances failed to join the kubernetes cluster

If I change cluster_endpoint_public_access_cidrs (from 0.0.0.0/0 to any other IP) after the cluster has been created, it works fine.

⚠️ Note

Before you submit an issue, please perform the following first:

  1. Remove the local .terraform directory (ONLY if state is stored remotely, which hopefully is the best practice you are following): rm -rf .terraform/
  2. Re-initialize the project root to pull down modules: terraform init
  3. Re-attempt your terraform plan or apply and check if the issue still persists

Versions

  • Terraform: 1.0.11
  • Provider(s):
  • Module: eks

Reproduction

Steps to reproduce the behavior:
Create a cluster using the managed nodegroup example with a value set for cluster_endpoint_public_access_cidrs

Code Snippet to Reproduce

module "eks" {
#source = "../.."
source = "terraform-aws-modules/eks/aws"

cluster_name = local.name
cluster_version = local.cluster_version
cluster_endpoint_private_access = true
cluster_endpoint_public_access = true
cluster_endpoint_public_access_cidrs = var.cluster_endpoint_public_access_cidrs

Expected behavior

Cluster creation should be successful and nodegroups should join the cluster

Actual behavior

I can see the IP addresses in the cluster's public access source allowlist, but the node groups never join, as Terraform errors out stating NodeCreationFailure: Instances failed to join the kubernetes cluster.

Terminal Output Screenshot(s)

Error: error waiting for EKS Node Group (eks:default_node_group-20220210235029685300000001) to create: unexpected state 'CREATE_FAILED', wanted target 'ACTIVE'. last error: 1 error occurred:
│ * i-053f38656f5925c75: NodeCreationFailure: Instances failed to join the kubernetes cluster

Additional context

My requirement is to have a Nodegroup created in a private subnet ( SDWAN connected) and have them talk to the EKS cluster which has private and public endpoint. In the public endpoint I want to restrict the IP addresses which can connect to it.

@thamjieying

I also face the same issue

@bryantbiggs
Member

Can you provide a full reproduction, please?

@abhilashkar
Author

@bryantbiggs I am trying to create an EKS cluster with a managed node group. All I do is supply a list of external IPs:
cluster_endpoint_public_access_cidrs = var.cluster_endpoint_public_access_cidrs
If this value is 0.0.0.0/0, Terraform succeeds. If I specify a list of IP addresses instead, Terraform fails with:

Error: error waiting for EKS Node Group (abc- default_node_group-20220210235029685300000001) to create: unexpected state 'CREATE_FAILED', wanted target 'ACTIVE'. last error: 1 error occurred:

@abhilashkar
Author

module "eks" {
#source = "../.."
source = "terraform-aws-modules/eks/aws"

cluster_name = local.name
cluster_version = local.cluster_version
cluster_endpoint_private_access = true
cluster_endpoint_public_access = true
cluster_endpoint_public_access_cidrs = var.cluster_endpoint_public_access_cidrs

@abhilashkar
Author

If I comment out cluster_endpoint_public_access_cidrs or give it a value of 0.0.0.0/0, the code succeeds.

@abhilashkar
Author

I'm using the example under examples/eks_managed_node_group.

@thedusansky

thedusansky commented Feb 18, 2022

I have the same issue here. Basically, when cluster_endpoint_public_access_cidrs is limited to some CIDRs, the node groups can't join the cluster and time out.

cluster_endpoint_private_access = true
cluster_endpoint_public_access = true

Terraform v0.13.7
aws provider 2.64.0

@Zorgji

Zorgji commented Feb 21, 2022

Mine is working with provider version 4.2.

@dejwsz

dejwsz commented Feb 21, 2022

Maybe this helps? #1889 (comment)

@tculp
Contributor

tculp commented Mar 9, 2022

@bryantbiggs I get this same issue; here is a pretty minimal example showing it:

terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 3.27"
    }
  }
  required_version = ">= 0.14.9"
}

provider "aws" {
  # Must match the profile name in your ~/.okta_aws_login_config file
  profile = "<profile>"
  region  = "us-east-1"

}

locals {
  name            = "test-1"
  cluster_version = "1.20"
  region          = "us-east-1"
}

data "aws_caller_identity" "current" {}

################################################################################
# EKS Module
################################################################################

module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "18.8.1"

  cluster_name                    = local.name
  cluster_version                 = local.cluster_version
  cluster_endpoint_private_access = true
  cluster_endpoint_public_access  = true

  cluster_endpoint_public_access_cidrs = [<cidrs redacted>]

  vpc_id     = module.vpc.vpc_id
  subnet_ids = concat(
    module.vpc.private_subnets,
    module.vpc.public_subnets,
  )


  eks_managed_node_group_defaults = {
    disk_size      = 50
    instance_types = ["m5.large"]
    # iam_role_attach_cni_policy = true
  }

  eks_managed_node_groups = {
    # Default node group - as provided by AWS EKS
    default_node_group = {

    }
  }

}

module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "3.12.0"

  name = local.name
  cidr = "10.0.0.0/16"

  azs             = ["us-east-1a", "us-east-1b"]
  private_subnets = ["10.0.1.0/24", "10.0.2.0/24"]
  public_subnets  = ["10.0.101.0/24", "10.0.102.0/24"]

  enable_nat_gateway     = true
  single_nat_gateway     = true
  one_nat_gateway_per_az = false
}

resource "tls_private_key" "this" {
  algorithm = "RSA"
}

resource "aws_key_pair" "this" {
  key_name_prefix = local.name
  public_key      = tls_private_key.this.public_key_openssh

}

@tculp
Contributor

tculp commented Mar 9, 2022

@bryantbiggs With regards to the example above, setting the public_access_cidrs to

cluster_endpoint_public_access_cidrs = concat(
  [<redacted cidrs>],
  ["${module.vpc.nat_public_ips[0]}/32"],
)

makes it work.

EDIT:

Alternatively, it seems that adding the following to the VPC config sometimes makes it work too, but how long the node group takes to join is inconsistent, varying between ~2m and ~9m:

  enable_dns_support   = true
  enable_dns_hostnames = true

@onlinebaba

@tculp Would you please help clarify the reference to NAT public IPs?

["${module.vpc.nat_public_ips[0]}/32"],

@tchristie-meazure

Adding the NAT addresses of the VPC the cluster is in does solve the issue.

I have enable_dns_support and enable_dns_hostnames enabled, but neither solved the problem. I also don't have a private endpoint exposed - only a public one.
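
For reference, a minimal sketch of that approach, assuming the VPC comes from the terraform-aws-modules/vpc/aws module as in the earlier example (var.allowed_cidrs is an illustrative variable name, not something from the module):

cluster_endpoint_public_access_cidrs = concat(
  # CIDRs you actually want to allow for operators/CI
  var.allowed_cidrs,
  # The VPC's NAT gateway EIPs, so nodes egressing through the NAT
  # gateway can still reach the public API endpoint
  [for ip in module.vpc.nat_public_ips : "${ip}/32"],
)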

@antonbabenko
Member

This issue has been resolved in version 18.19.0 🎉

@pen-pal
Contributor

pen-pal commented Apr 12, 2022

How did you solve the issue?
In my case the node joins the cluster, but the aws-node pod fails its readiness and liveness probes:

Warning  Unhealthy  5m10s (x577 over 168m)  kubelet  (combined from similar events): Readiness probe failed: {"level":"info","ts":"2022-04-12T13:28:52.198Z","caller":"/usr/local/go/src/runtime/proc.go:225","msg":"timeout: failed to connect service \":50051\" within 5s"}

This is my configuration

module "eks" {
  source = "./modules/eks"
  config = local.config
  vpc_id = local.vpc.id
  vpc_subnets                 = local.private_subnets_ids
  AWS_ACCOUNT_ARN_OF_DEPLOYER = local.aws.aws_account_arn_cicd

  cluster_name                       = local.name_prefix
  cluster_version                    = var.cluster_version
  cluster_endpoint_private_access    = true
  cluster_endpoint_public_access     = true
  cluster_security_group_name        = local.name_prefix
  cluster_security_group_description = var.cluster_security_group_description
  iam_role_name                      = local.name_prefix
  enable_irsa                        = true

  cluster_addons = {
    coredns = {
      resolve_conflicts = "OVERWRITE"
    }
    kube-proxy = {
      resolve_conflicts = "OVERWRITE"
    }
  }

  eks_managed_node_groups = {
    # Complete
    complete = {
      name            = "${local.config.environment}-${local.config.service}"
      use_name_prefix = true

      min_size     = 1
      max_size     = 3
      desired_size = 1
      ami_id       = data.aws_ami.eks_default.image_id

      enable_bootstrap_user_data = true
      iam_role_attach_cni_policy = true

      post_bootstrap_user_data = <<-EOT
      cd /tmp
      sudo yum install -y https://s3.amazonaws.com/ec2-downloads-windows/SSMAgent/latest/linux_amd64/amazon-ssm-agent.rpm
      sudo systemctl enable amazon-ssm-agent
      sudo systemctl start amazon-ssm-agent
      echo "you are free little kubelet!"
      EOT

      disk_size            = 50
      force_update_version = true
      instance_types       = ["m5.large"]
      labels = {
        GithubRepo = "terraform-aws-eks"
        GithubOrg  = "terraform-aws-modules"
      }

      update_config = {
        max_unavailable_percentage = 50 # or set `max_unavailable`
      }

      description = "EKS managed node group example launch template"

      ebs_optimized           = true
      disable_api_termination = false
      enable_monitoring       = true
      #vpc_security_group_ids  = [aws_security_group.additional.id]

      metadata_options = {
        http_endpoint               = "enabled"
        http_tokens                 = "required"
        http_put_response_hop_limit = 2
        instance_metadata_tags      = "disabled"
      }
    }
  }
}

@bryantbiggs
Member

@pen-pal https://github.com/terraform-aws-modules/terraform-aws-eks/blob/master/docs/network_connectivity.md#public-endpoint-w-restricted-cidrs

There is no fix required in the module; it's up to users to ensure VPC network connectivity is set up properly.
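
In short, with a restricted public endpoint the nodes need another path to the API server: either enable the private endpoint or include the VPC's NAT gateway IPs in the allowlist (as shown earlier in this thread). A minimal sketch of the private-endpoint option, with illustrative values; the VPC also needs DNS support and DNS hostnames enabled so the private endpoint resolves:

module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 18.0"

  cluster_name    = "example"
  cluster_version = "1.21"

  # Nodes in the private subnets resolve and use the private endpoint,
  # so the public endpoint can be locked down to operator CIDRs only.
  cluster_endpoint_private_access      = true
  cluster_endpoint_public_access       = true
  cluster_endpoint_public_access_cidrs = ["203.0.113.0/24"]

  vpc_id     = module.vpc.vpc_id
  subnet_ids = module.vpc.private_subnets
}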

@pen-pal
Contributor

pen-pal commented Apr 12, 2022

@pen-pal https://github.com/terraform-aws-modules/terraform-aws-eks/blob/master/docs/network_connectivity.md#public-endpoint-w-restricted-cidrs

There is no fix required in the module; it's up to users to ensure VPC network connectivity is set up properly.

In my case both private and public access are set to true, yet it still fails.
Is there something missing in this configuration?
If I do not add the vpc-cni addon, the nodes do not join the cluster at all.

To add more to this, the same configuration works and the nodes join the cluster if I spin up a completely new EKS cluster.

PS: I am upgrading from v17.x to v18.x and using the latest tag of the module

@bryantbiggs
Member

if you have a full reproduction I can take a look - but I would look at the examples as these all work as intended

@pen-pal
Contributor

pen-pal commented Apr 12, 2022

if you have a full reproduction I can take a look - but I would look at the examples as these all work as intended

what should I share with you to reproduce the issue?

@bryantbiggs
Member

if you have a full reproduction I can take a look - but I would look at the examples as these all work as intended

what should I share with you to reproduce the issue?

a deployable reproduction

@pen-pal
Contributor

pen-pal commented Apr 12, 2022

if you have a full reproduction I can take a look - but I would look at the examples as these all work as intended

what should I share with you to reproduce the issue?

a deployable reproduction

Here is the configuration, as asked:
  • The cluster config for v17.x
  • The cluster config for v18.x

@bryantbiggs
Member

You don't need to provide the AMI ID unless you are using a custom AMI. Also, if you are just figuring things out, I suggest starting with the minimal amount of configuration and only tweaking/modifying when it's necessary:

module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "~> 3.0"

  name = "cluster01"
  cidr = "10.0.0.0/16"

  azs             = ["us-east-1a", "us-east-1b", "us-east-1c"]
  private_subnets = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
  public_subnets  = ["10.0.4.0/24", "10.0.5.0/24", "10.0.6.0/24"]

  enable_nat_gateway   = true
  single_nat_gateway   = true
  enable_dns_hostnames = true

  public_subnet_tags = {
    "kubernetes.io/cluster/cluster01" = "shared"
    "kubernetes.io/role/elb"          = 1
  }

  private_subnet_tags = {
    "kubernetes.io/cluster/cluster01" = "shared"
    "kubernetes.io/role/internal-elb" = 1
  }
}

module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 18.0"

  cluster_name    = "cluster01"
  cluster_version = "1.21"

  cluster_endpoint_private_access    = true
  cluster_security_group_name        = "cluster01-security-group"
  cluster_security_group_description = "EKS cluster security group."

  iam_role_name = "cluster01-iam-role"
  enable_irsa   = true

  vpc_id     = module.vpc.vpc_id
  subnet_ids = module.vpc.private_subnets

  cluster_addons = {
    coredns = {
      resolve_conflicts = "OVERWRITE"
    }
    kube-proxy = {
      resolve_conflicts = "OVERWRITE"
    }
  }

  eks_managed_node_groups = {
    complete = {
      name = "nodegroup01"

      min_size     = 1
      max_size     = 3
      desired_size = 1

      force_update_version = true
      instance_types       = ["m5.large"]

      update_config = {
        max_unavailable_percentage = 50 # or set `max_unavailable`
      }

      metadata_options = {
        http_endpoint               = "enabled"
        http_tokens                 = "required"
        http_put_response_hop_limit = 2
        instance_metadata_tags      = "disabled"
      }
    }
  }
}

@pen-pal
Contributor

pen-pal commented Apr 12, 2022

You don't need to provide the AMI ID unless you are using a custom AMI. Also, if you are just figuring things out, I suggest starting with the minimal amount of configuration and only tweaking/modifying when it's necessary:

module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "~> 3.0"

  name = "cluster01"
  cidr = "10.0.0.0/16"

  azs             = ["us-east-1a", "us-east-1b", "us-east-1c"]
  private_subnets = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
  public_subnets  = ["10.0.4.0/24", "10.0.5.0/24", "10.0.6.0/24"]

  enable_nat_gateway   = true
  single_nat_gateway   = true
  enable_dns_hostnames = true

  public_subnet_tags = {
    "kubernetes.io/cluster/cluster01" = "shared"
    "kubernetes.io/role/elb"          = 1
  }

  private_subnet_tags = {
    "kubernetes.io/cluster/cluster01" = "shared"
    "kubernetes.io/role/internal-elb" = 1
  }
}

module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 18.0"

  cluster_name    = "cluster01"
  cluster_version = "1.21"

  cluster_endpoint_private_access    = true
  cluster_security_group_name        = "cluster01-security-group"
  cluster_security_group_description = "EKS cluster security group."

  iam_role_name = "cluster01-iam-role"
  enable_irsa   = true

  vpc_id     = module.vpc.vpc_id
  subnet_ids = module.vpc.private_subnets

  cluster_addons = {
    coredns = {
      resolve_conflicts = "OVERWRITE"
    }
    kube-proxy = {
      resolve_conflicts = "OVERWRITE"
    }
  }

  eks_managed_node_groups = {
    complete = {
      name = "nodegroup01"

      min_size     = 1
      max_size     = 3
      desired_size = 1

      force_update_version = true
      instance_types       = ["m5.large"]

      update_config = {
        max_unavailable_percentage = 50 # or set `max_unavailable`
      }

      metadata_options = {
        http_endpoint               = "enabled"
        http_tokens                 = "required"
        http_put_response_hop_limit = 2
        instance_metadata_tags      = "disabled"
      }
    }
  }
}

The reason for using ami_id was to make sure the post_bootstrap_user_data gets executed, as otherwise the user data was not being used.

@bryantbiggs
Member

Does it have to execute after the node joins the cluster? You can use the pre_bootstrap_user_data field.
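
For example, a sketch only, reusing the SSM agent commands from the earlier snippet; with the default EKS-optimized AMI (no ami_id), the snippet below is merged into the launch template user data and runs before the node registers:

  eks_managed_node_groups = {
    complete = {
      # No custom ami_id required; these commands run before the node
      # registers with the cluster.
      pre_bootstrap_user_data = <<-EOT
        cd /tmp
        sudo yum install -y https://s3.amazonaws.com/ec2-downloads-windows/SSMAgent/latest/linux_amd64/amazon-ssm-agent.rpm
        sudo systemctl enable amazon-ssm-agent
        sudo systemctl start amazon-ssm-agent
      EOT
    }
  }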

@bryantbiggs
Member

Also, your config does not restrict the cluster's public endpoint, so I don't see how this is relevant to the original issue above.

@pen-pal
Contributor

pen-pal commented Apr 12, 2022

Also, your config does not restrict the cluster's public endpoint, so I don't see how this is relevant to the original issue above.

@bryantbiggs Yes.
However, even though the node joins the cluster, it is unable to run aws-node because of the error I shared in this comment: #1867 (comment)
Also, if I remove the vpc-cni addon, the node does not join the cluster at all.
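
For reference, a minimal sketch of keeping the VPC CNI managed via cluster_addons alongside the entries already in the config above (the resolve_conflicts value simply mirrors the earlier snippets):

  cluster_addons = {
    coredns = {
      resolve_conflicts = "OVERWRITE"
    }
    kube-proxy = {
      resolve_conflicts = "OVERWRITE"
    }
    # aws-node (the VPC CNI) must be running for nodes to become Ready
    # and join the cluster.
    vpc-cni = {
      resolve_conflicts = "OVERWRITE"
    }
  }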

@github-actions

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 12, 2022