
eks_managed_node_groups with irsa enabled fails #1894

Closed
OHaimanov opened this issue Feb 22, 2022 · 6 comments · Fixed by #1915

Comments

@OHaimanov

Description

Hi team, I tried to create an EKS cluster based on the example with IRSA enabled, but ran into an issue where aws-node doesn't start.

Maybe it is a bug, or maybe I did something wrong or missed a step; could you please take a look?

Versions

  • Terraform:
    1.15
  • Provider(s):
    "hashicorp/aws" >= 4.1.0
    "gavinbunney/kubectl" >= 1.13.1
  • Module:
    terraform-aws-modules/eks/aws
    version = "18.7.2"

Reproduction

Steps to reproduce the behavior:

Code Snippet to Reproduce

locals {
  oidc_url = replace(module.eks_cluster.cluster_oidc_issuer_url, "https://", "")
}

module "eks_cluster" {
  source  = "terraform-aws-modules/eks/aws"
  version = "18.7.2"

  cluster_name    = var.rancher_cluster_name
  cluster_version = "1.21"

  vpc_id     = module.vpc.vpc_id
  subnet_ids = module.vpc.private_subnets

  cluster_addons = {
    coredns = {
      resolve_conflicts = "OVERWRITE"
    }
    kube-proxy = {}
    vpc-cni = {
      resolve_conflicts        = "OVERWRITE"
      service_account_role_arn = module.vpc_cni_irsa.iam_role_arn
    }
  }

  # Extend cluster security group rules
  cluster_security_group_additional_rules = {
    egress_nodes_ephemeral_ports_tcp = {
      description                = "To node 1025-65535"
      protocol                   = "tcp"
      from_port                  = 1025
      to_port                    = 65535
      type                       = "egress"
      source_node_security_group = true
    }
  }

  # Extend node-to-node security group rules
  node_security_group_additional_rules = {
    ingress_self_all = {
      description = "Node to node all ports/protocols"
      protocol    = "-1"
      from_port   = 0
      to_port     = 0
      type        = "ingress"
      self        = true
    }
    egress_all = {
      description      = "Node all egress"
      protocol         = "-1"
      from_port        = 0
      to_port          = 0
      type             = "egress"
      cidr_blocks      = ["0.0.0.0/0"]
    }
  }

  eks_managed_node_group_defaults = {
    ami_type       = "AL2_x86_64"
    disk_size      = 50
    instance_types = var.asg_instance_types

    # We are using the IRSA created below for permissions
    iam_role_attach_cni_policy = false
  }
  eks_managed_node_groups = {
    default_node_group = {
      create_launch_template = false
      launch_template_name   = ""
    }
  }
}
module "vpc_cni_irsa" {
  source  = "terraform-aws-modules/iam/aws//modules/iam-role-for-service-accounts-eks"
  version = "4.13.0"

  role_name             = join("-", [var.iac_environment_tag, var.rancher_cluster_name, "vpc-cni"])
  attach_vpc_cni_policy = true
  vpc_cni_enable_ipv4   = true

  oidc_providers = {
    main = {
      provider_arn               = module.eks_cluster.oidc_provider_arn
      namespace_service_accounts = ["kube-system:aws-node"]
    }
  }

  tags = {
    Name = "vpc-cni"
  }
}

Expected behavior

All cluster nodes up and running

Actual behavior

Cluster nodes stuck in NotReady state

Type | Reason | Age | From | Message
-- | -- | -- | -- | --
Warning | Unhealthy | 24 minutes | kubelet | Readiness probe failed: {"level":"info","ts":"2022-02-22T14:43:46.341Z","caller":"/usr/local/go/src/runtime/proc.go:225","msg":"timeout: failed to connect service \":50051\" within 5s"}
Warning | Unhealthy | 23 minutes | kubelet | Readiness probe failed: {"level":"info","ts":"2022-02-22T14:43:56.337Z","caller":"/usr/local/go/src/runtime/proc.go:225","msg":"timeout: failed to connect service \":50051\" within 5s"}
Warning | Unhealthy | 23 minutes | kubelet | Readiness probe failed: {"level":"info","ts":"2022-02-22T14:44:06.341Z","caller":"/usr/local/go/src/runtime/proc.go:225","msg":"timeout: failed to connect service \":50051\" within 5s"}
Warning | Unhealthy | 23 minutes | kubelet | Liveness probe failed: {"level":"info","ts":"2022-02-22T14:44:14.728Z","caller":"/usr/local/go/src/runtime/proc.go:225","msg":"timeout: failed to connect service \":50051\" within 5s"}
Warning | Unhealthy | 23 minutes | kubelet | Readiness probe failed: {"level":"info","ts":"2022-02-22T14:44:16.336Z","caller":"/usr/local/go/src/runtime/proc.go:225","msg":"timeout: failed to connect service \":50051\" within 5s"}
Warning | Unhealthy | 23 minutes | kubelet | (combined from similar events): Readiness probe failed: {"level":"info","ts":"2022-02-22T14:58:06.333Z","caller":"/usr/local/go/src/runtime/proc.go:225","msg":"timeout: failed to connect service \":50051\" within 5s"}
Normal | Killing | 23 minutes | kubelet | Container aws-node failed liveness probe, will be restarted
Normal | Pulled | 23 minutes | kubelet | Container image "602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:v1.10.1-eksbuild.1" already present on machine
Warning | BackOff | 13 minutes | kubelet | Back-off restarting failed container

axkng commented Feb 23, 2022

I can confirm this.
Having the same problem at the moment.
If you set iam_role_attach_cni_policy = true for the managed nodes it works.
I only did that for testing, as I want to stick to best practices.
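For reference, a minimal sketch of that temporary change against the eks_managed_node_group_defaults block in the reproduction config above (same variables as the original snippet; the flag controls whether the module attaches the AmazonEKS_CNI_Policy to the node IAM role):

  eks_managed_node_group_defaults = {
    ami_type       = "AL2_x86_64"
    disk_size      = 50
    instance_types = var.asg_instance_types

    # Temporary workaround: attach the CNI policy to the node IAM role so aws-node
    # can start even though the vpc-cni addon has not picked up the IRSA role yet.
    # Flip back to false once the addon is confirmed to use the IRSA role.
    iam_role_attach_cni_policy = true
  }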


OHaimanov commented Feb 23, 2022

> I can confirm this. Having the same problem at the moment. If you set iam_role_attach_cni_policy = true for the managed nodes it works. I only did that for testing, as I want to stick to best practices.

Yes, it looks like the addon's role ARN attachment isn't applied during the module run. If you initially create the cluster with iam_role_attach_cni_policy = true and then update the addon to use the separate IAM role and remove the policy afterwards, everything works fine, but not the way the example has it.
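A sketch of that two-step sequence as module-argument fragments, reusing the names from the reproduction config above (shown per apply, not as one valid block):

  # Apply 1: create the cluster with the CNI policy attached to the node role
  # so the nodes can join while the addon/IRSA wiring does not exist yet.
  eks_managed_node_group_defaults = {
    iam_role_attach_cni_policy = true
  }

  # Apply 2: once the cluster and its OIDC provider exist, point the vpc-cni
  # addon at the IRSA role and detach the policy from the node role again.
  cluster_addons = {
    vpc-cni = {
      resolve_conflicts        = "OVERWRITE"
      service_account_role_arn = module.vpc_cni_irsa.iam_role_arn
    }
  }

  eks_managed_node_group_defaults = {
    iam_role_attach_cni_policy = false
  }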


MadsRC commented Mar 2, 2022

I just spent a day debugging why my nodes wouldn't attach to new clusters. Turns out I was running into this exact issue.

Setting iam_role_attach_cni_policy = true for the initial creation did the trick, and iam_role_attach_cni_policy = false was then applied afterwards...

Not pretty, but it works

@bryantbiggs (Member)

Yes, this was unfortunate to discover as well. I have updated the eks-managed-node-group example and added some notes in another PR, #1915.

I've also added this scenario to the container roadmap proposal I submitted: aws/containers-roadmap#1666

@antonbabenko (Member)

This issue has been resolved in version 18.8.0 🎉
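Assuming the same module call as in the reproduction snippet, picking up the fix should only require bumping the module version:

  module "eks_cluster" {
    source  = "terraform-aws-modules/eks/aws"
    version = "18.8.0" # contains the fix referenced by #1915

    # ...rest of the arguments unchanged from the snippet above
  }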

@github-actions

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators Nov 13, 2022