Is the complexity of this module getting too high? #635
Comments
Yep. I've felt this way for quite a while now. And it's already too late. Too many new "features" have been added without letting things settle down and stabilize. And it's not fair to you as the maintainer, or to me as the user trying to run this thing in production. I'm cringing right now thinking about upgrading my prod cluster to v7.x. There's just no way I can reasonably safely do it. IMO: deprecate this module and make several smaller, focused ones. Think long and hard about each new addition. Solicit feedback from multiple sources before leaping into a new feature. Even if somebody shows up with a PR with it all done, if it ups the complexity of the module, that should be considered carefully. Another idea I have is to stop trying to support Windows. There are unfortunately instances where a local-exec needs to run, and lots of complexity has been added trying to support Windows alongside *nix. Instead, we can say "this is supported on Linux (and probably macOS). If you are on Windows, here are instructions on how to run this thing in a Docker container."
Examples of smaller, focused modules:
At a bare minimum, documentation should be added for each new feature, beyond just adding the new params to the table. But I don't think that goes far enough to solve the predicament this module is already in.
I agree with ditching the support for Windows, as …
First, thank you @max-rocket-internet for maintaining this module for everyone. As far as complexity goes, I would agree this module is certainly complex, but I would not expect much less when trying to maintain both halves of a managed Kubernetes cluster. It is also very reasonable to assume it will only get more complex based on the AWS roadmap. As for paths forward, I agree with @RothAndrew and @alaa in regards to documentation and dropping Windows support for now. As for the module itself, I would hate to see it deprecated after seeing it come this far. It would be better, in my opinion, to determine the direction it should go and work towards that as a community. It certainly makes sense to shard the module out in some way, but unfortunately, no matter how you look at it, the pieces will always be co-dependent on one another. I struggle to visualize a world where I would use an eks-fargate module without using eks-cluster alongside it. That being said, separating cluster configuration from worker configuration by means of a submodule may be a more manageable way to handle feature creep. To add to this, if there is any way I as an individual can lend a hand, I would certainly love to be more involved.
I'm going to voice the other side of the no-Windows argument:
Maybe splitting up into submodules would isolate some of the complexity, rather than running as a single flat layer? Similar to how the RDS module is designed. Upgrade paths will be painful to get to that state, of course. The submodules could either be used directly by the top level as required (RDS style), or as version-linked partner modules that users include as necessary, similar to how security-group works. The second approach would need a lot of documentation and examples, though, because I think trying to coordinate stand-alone intertwined modules will be a maintenance and user nightmare. But it's not something I've tried. Burning the module and starting again will present its own problems. There will still need to be some upgrade path for users who used this module. Could something be done to increase release velocity? But without swamping Max with administrative tasks?
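To make the RDS-style option concrete, here is a minimal sketch; the submodule names, paths, inputs and outputs are all hypothetical, not the module's actual interface:

```hcl
# Root module composes version-locked internal submodules, the way
# terraform-aws-rds wires together db_instance, db_subnet_group, etc.
# All names below are illustrative.
module "cluster" {
  source = "./modules/cluster"

  cluster_name    = var.cluster_name
  cluster_version = var.cluster_version
  vpc_id          = var.vpc_id
  subnets         = var.subnets
}

module "node_groups" {
  source = "./modules/node_groups"

  cluster_name = module.cluster.cluster_id
  node_groups  = var.node_groups
}
```

The security-group-style alternative would instead publish each of these as its own versioned module that users compose themselves.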
To be clear, I'm not thinking we should do either of these, I was just thinking perhaps we could restructure or split parts of it 🙂
Another suggestion: start rejecting PRs that introduce breaking changes without providing great documentation on how to upgrade and, preferably, automated migration.
Splitting it up into submodules like RDS does seems like a really good idea. However, unless Terraform supports some easy migration of resources that I don't know of, it seems like this would require a significant effort to migrate existing projects to the new version of the module. 😞 Even if it gets split, some PRs, such as #555 and #580, would require similar changes as they did now, since they affect multiple modules. At the moment, I'm really interested in IRSA and Fargate support, since we currently resort to specifying those resources outside of the module. Maybe we can start by adding those two as separate submodules? They don't depend on much apart from …

If I can help with getting these features merged (e.g. help with splitting, writing examples), please let me know.
While refactoring projects because the eks module changes (or goes away) will be a pain for those who depend on it, the module should not be held prisoner by it. A possible path: deprecate the current module, and fix only security and critical functionality problems. Create new module(s) which break up the parts (eks, node groups, etc). This will avoid breaking existing users, but will make a cleaner path forward. Existing users can pick when they spend the time to switch to the new hotness.
Deprecating this module would be a nightmare for existing users; I can't imagine having to delete current clusters and then recreate them using new modules, or import all resources into new modules. I'd feel less pain if we could continue to maintain this module and gradually deprecate and move features to smaller modules.
OK, here are my thoughts...
This was my initial thought also.
I like this idea but there's no one to do it apart from maintainers. People want their PRs merged but they rarely want to review other PRs (apart from @dpiddockcmp)
Cool. I think so too. But this only solves …
Yes. I was thinking perhaps modules for:
I think so too. I don't have enough time for this module let alone a bunch of others.
We need more maintainers. It's as simple as that. I have less time these days.
That's a great idea.
Sure but AFAIK IRSA is only a single resource? https://github.com/terraform-aws-modules/terraform-aws-eks/pull/632/files
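For reference, IRSA essentially boils down to one OIDC provider hanging off the cluster's issuer URL. A rough sketch, assuming the cluster resource is named `aws_eks_cluster.this` and with an illustrative thumbprint variable (not a copy of the PR):

```hcl
# Single resource behind IRSA support: register the cluster's OIDC issuer
# as an IAM identity provider so service accounts can assume IAM roles.
resource "aws_iam_openid_connect_provider" "oidc_provider" {
  client_id_list  = ["sts.amazonaws.com"]
  thumbprint_list = [var.oidc_root_ca_thumbprint] # illustrative variable
  url             = aws_eks_cluster.this.identity[0].oidc[0].issuer
}
```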
@antonbabenko please chime in. I think we should consider creating some new modules to split out some of the functionality of this module. Personally I think separate GitHub repos, but it's not really up to me.
Here are my 5 cents, though I have not used or followed the development of this module closely for some time. I feel that everyone who commented on this page is already on the same page (lol).

The complexity of this repository is large but still controllable because of the good patterns used here. Good job, everyone!

I don't think that splitting this module into submodules or separate repos for Fargate, Managed Node Groups, normal worker groups, etc. will give any benefits to users and maintainers, who will have to manage similar EKS-related things in multiple places.

I also don't think we need to add more maintainers, because most of the time is usually spent on answering issues and triaging bugs, and this can be done by anyone. Adding more people will not increase velocity, but releasing more often (after every PR is merged) will.

If a feature is not obvious, consider adding examples showcasing it. Same for bugs if they keep recurring.
To me this module is quite complex, but it's still maintainable and readable. I'm not against splitting this module into submodules, but I'm against any kind of code duplication and a really painful migration path. Furthermore, I'm not really confident this will help. Thinking out loud, I would say we lack automated tests which could easily tell us whether a new feature introduces bugs or regressions, or whether a change breaks something. Features will come quickly as Kubernetes and EKS are evolving really fast, and people will always want to get those features as quickly as possible. There is no way to do that without automation. I know it's a long way to go, and I don't really know how we can achieve that goal with free resources, but I think it's worth digging a little deeper to get confident in the changes we make to this module.
I don't spend much time on issues these days. I will also have less time in the future.
Very true. Hard to achieve without spending money though. We now have 4 ways of creating workers:
1. Worker groups using launch configurations
2. Worker groups using launch templates
3. Managed node groups
4. Fargate profiles
I think 1 and 2 can be merged now with TF 0.12. But 3 and 4 are very different; perhaps they could be put into submodules.
I think in a healthy ecosystem there can be multiple good ways of doing it. I think much of the success of the …

Fwiw, we took the alternative approach presented in this ticket:
I think this approach lends itself to those who want more control over the architecture of the Kubernetes cluster itself. We've also wired it up with …

For our use case, the approach has worked well because it allows for any combination of node pools in pretty much any possible configuration, without increasing complexity and the risk of regressions. I think soon we'll be adding support for SpotInst Ocean node pools as well, as we've had good success using that with …
A lot has already been mentioned.
I think a lot of the complexity boils down to the increasing number of possible ways to create worker nodes, which all share the same underlying code. Also, from a user perspective it becomes increasingly hard to determine what can be defined in …

Separating the 4 current ways to create workers into separate submodules would, I think, help maintainability without needing to alienate the current user base.
Splitting into some submodules seems like a popular idea in this issue, so how about this plan?

To do now:
- Move the managed node group resources into a submodule
- Move the Fargate resources into a submodule
This will move any complexity from MNGs and Fargate out of the core of this module and doesn't break any backwards compatibility.

To do soon:
…
Then perhaps later down the line we could move LC/LT worker groups into a separate module if it looks like a good idea.
@max-rocket-internet Sounds like a plan, at least something to start exploring. Start and see what challenges we run into, with actual code to evaluate, e.g. managing …
I've noticed a significant build-up of non-breaking changes and new features under …
I agree and think it has become too complex. Breaking the module apart and allowing for smaller modules focusing on different parts is ideal. For example, I have been reviewing how to add an IAM policy to workers. Thanks for the hard work.
I'd strongly urge that we not break node groups into a child module. Submodules are great for keeping your code DRY when you need to do the same thing multiple times; however, when only called once they have the opposite effect and actually increase code duplication. Turning node groups into a submodule is a great example of this: you still have all the code that would otherwise be in …

The complexity of this module is generally proportional to the complexity of AWS' Kubernetes implementation. The reason we've accommodated four ways of creating worker nodes is because AWS offers four ways to do so, with significant inconsistencies between them. IMO a huge part of this module's value is that it can reduce these inconsistencies by abstracting away complexity, such as how EKS node groups use the …

I'd like to propose sane defaults as an alternative solution for reducing complexity. It's important to think through whether a feature addresses an actual use case, for example: …
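A minimal sketch of the sane-defaults pattern described above, assuming a defaults map that user input is merged over (the variable name and keys are illustrative, not the module's exact interface):

```hcl
# Defaults live in one place; callers only override what differs from them.
variable "workers_group_defaults" {
  description = "Overrides for the built-in worker group defaults."
  type        = map(string)
  default     = {}
}

locals {
  workers_group_defaults = merge(
    {
      instance_type    = "m5.large"
      asg_max_size     = 3
      root_volume_size = 100
    },
    var.workers_group_defaults,
  )
}
```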
Sounds like there are 2 things here that can be done:
What is a breaking change?
Google has a good guide for their API that we could use. What's our version?

Backwards-compatible (non-breaking) changes
Backwards-incompatible (breaking) changes
Stop or limit the frequency of breaking changes
This module is the Terraform equivalent of an API, but it releases breaking changes as if it were still in beta. Compare it to the VPC module, which stayed on 1.X for almost 2 years and only went to 2.X when it added Terraform v0.12 support.

Suggestion: stop accepting breaking changes from public PRs. Instead, set up a regular major version release cadence of X months, and solicit feedback from the community on what breaking changes might be necessary. Then decide what makes it into the next major release, implement the changes (yourself or by soliciting PRs to a feature branch instead of master), create a solid migration guide, and create an automated migration tool if possible.

Release non-breaking changes more frequently
One of the principles of DevOps is improving flow by reducing batch size. Make releases as small as possible by releasing non-breaking changes (minor or patch releases) as frequently as possible. If a PR adds a new feature and it is complete, well documented, doesn't have breaking changes, and is generally "ready to go", there's really no reason not to do a release immediately. Doing very small, very frequent releases provides a number of advantages: …
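To make the breaking vs. non-breaking distinction above concrete, a hedged sketch (the variable name and resource addresses are illustrative, not taken from the module):

```hcl
# Backwards-compatible: a new optional input with a default produces no
# plan diff for existing callers who never set it.
variable "cluster_log_retention_in_days" {
  type    = number
  default = 90
}

# Backwards-incompatible: moving or renaming a resource changes its state
# address, so the upgrade guide has to spell out something like:
#   terraform state mv 'aws_eks_cluster.this' 'module.cluster.aws_eks_cluster.this'
# or every upgrade tries to destroy and recreate the cluster.
```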
Currently, yes, we need all 4 😞
You have to be able to dial in to the cluster in order to manage the aws-auth file. What if you want a fully private cluster? You need to somehow get your Terraform to run on a network that has access to that private endpoint. In released versions it also meant a call out to … Plus, there are long-running issues with the AWS API returning "READY" before the endpoint was actually ready to receive requests. Maybe someone running a lot of automated startup-and-shutdown tests gave up and split their creation process in two?
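For context, wiring up aws-auth from Terraform looks roughly like this, which is why the runner needs network access to the API endpoint. It assumes the cluster resource is named `aws_eks_cluster.this`; the `local.map_roles` value is illustrative:

```hcl
# The kubernetes provider must reach the cluster endpoint to write the
# aws-auth ConfigMap; on a fully private cluster that means running
# Terraform inside (or peered with) the VPC.
data "aws_eks_cluster_auth" "this" {
  name = aws_eks_cluster.this.name
}

provider "kubernetes" {
  host                   = aws_eks_cluster.this.endpoint
  cluster_ca_certificate = base64decode(aws_eks_cluster.this.certificate_authority[0].data)
  token                  = data.aws_eks_cluster_auth.this.token
}

resource "kubernetes_config_map" "aws_auth" {
  metadata {
    name      = "aws-auth"
    namespace = "kube-system"
  }

  data = {
    mapRoles = yamlencode(local.map_roles) # illustrative local
  }
}
```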
Anything to do with IAM is always a tricky issue. Some corporate environments run with very restrictive policies, and CI or devs do not have the ability to create IAM roles or policies. Creation of IAM resources is on a toggle because users have requested it. Creation of the OIDC provider needs to be optional for the same reason.
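The toggle pattern being described is roughly the following sketch, with an illustrative variable name; environments that cannot create IAM resources would disable it and supply a pre-created role ARN instead:

```hcl
variable "manage_cluster_iam_resources" {
  type    = bool
  default = true
}

data "aws_iam_policy_document" "cluster_assume_role_policy" {
  statement {
    actions = ["sts:AssumeRole"]

    principals {
      type        = "Service"
      identifiers = ["eks.amazonaws.com"]
    }
  }
}

# Only create the role when the toggle is on.
resource "aws_iam_role" "cluster" {
  count              = var.manage_cluster_iam_resources ? 1 : 0
  name_prefix        = "eks-cluster-"
  assume_role_policy = data.aws_iam_policy_document.cluster_assume_role_policy.json
}
```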
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
I haven't read all the comments yet but I agree with the core sentiment @max-rocket-internet. It's a part of why I've largely walked away from the project - there need to be hard bounds around keeping modules simple, to the point, and guided by a design north star. In reality, that means saying no to a lot of feature requests and issues, but that leaves people upset. It's a lose-lose proposition from a maintainer's point of view, and eventually the consumers of a bloated module suffer just the same. There's no silver bullet here - this module is wildly successful but also a monstrosity. Largely to the credit of folks here, the issue count and open PR count remain within reason. 👏 Maybe this is the best possible outcome given the tools we have to work with. I should note, the effort to split into several submodules is laudable and potentially helpful if done well, but I've also seen it get out of control over the past couple of years at GCP. I want everyone here to keep in mind Hashicorp's core advice here:
This is critical and I think it's somewhat unfortunate that it leaves so much room for interpretation. My interpretation is that submodules fall in this same camp. That is: don't wrap a single resource in a submodule. Every module/submodule of this sort equates to: …
Feel free to disagree, but I see such modules as a net loss on all counts. Similarly, modules that really just act as a giant switch between one of many singular resources amount to the same. The caller maintains the same amount of code they would have spent on the raw resource + they take on a dependency - with all sorts of versioning stumbling blocks - for no practical benefit. Echoing a thought by @barryib - I still think testing is largely underutilized in the IaC space and a way to give contributors better confidence around changes in an unwieldy, complex, publicly shared IaC codebase. Testing doesn't seem to have great uptake in these circles but I've seen it help elsewhere in this same IaC space. It deserves more thought here. Again, appreciating the efforts here. ❤️ Try to avoid falling into these pitfalls as the project moves forward. Major-releasing our way to something better just has to happen occasionally. I largely take responsibility for the flat nature of the module (if you can't tell 😆) and stand by that design 💥. I think the other side of the coin is having a sense of when to split off dedicated modules of unique flavors, duplicating some parts of the codebase, and having a strategy for managing the maintenance that brings. This project may have passed that junction a few times, but it's still a viable way to consider restructuring now (as submodules or distinct repos).
A few thoughts as a (very) minor contributor with an interest in the future of this module:
Yeah, nah... MNGs are terrible and limited.
Thanks guys for all your work. As a consumer of this module I can say that I'll work through whatever path you guys decide. My clusters are completely disposable (thanks largely to your efforts) and I'll just spin up new clusters, migrate workloads and terminate my old ones regardless of how you want to proceed. As a side note, I upgraded to Terraform 0.13, pulled master yesterday, and spun up a new sandbox cluster with no changes to anything but local code.
Agreed. I just ran into the very awkward issue that you can't attach existing security groups to worker nodes managed by the … The only way around this is to use the new …
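A hedged sketch of that workaround: carrying extra security groups into a managed node group via a custom launch template, since the node group API itself does not take additional SG IDs (resource and variable names are assumptions):

```hcl
resource "aws_launch_template" "workers" {
  name_prefix = "eks-workers-"

  # Keep the cluster security group plus whatever extra groups are needed.
  vpc_security_group_ids = concat(
    [aws_eks_cluster.this.vpc_config[0].cluster_security_group_id],
    var.extra_security_group_ids, # illustrative variable
  )
}

resource "aws_eks_node_group" "workers" {
  cluster_name    = aws_eks_cluster.this.name
  node_group_name = "workers"
  node_role_arn   = var.node_role_arn # illustrative
  subnet_ids      = var.subnet_ids    # illustrative

  scaling_config {
    desired_size = 2
    max_size     = 3
    min_size     = 1
  }

  # Pointing the node group at the launch template is what lets the extra
  # security groups reach the worker ENIs.
  launch_template {
    id      = aws_launch_template.workers.id
    version = aws_launch_template.workers.latest_version
  }
}
```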
@max-rocket-internet @antonbabenko @barryib @dpiddockcmp I just opened #1031, and now I'm wondering what the decision was on this issue? Perhaps we should have a virtual meet to discuss this? I don't mind splitting this into more than one module, and even happy to do the initial work since I have to do it for my work anyway. Are we going the route of splitting the modules like so?
@sc250024 Yep. We've already got node-groups and fargate submodules. There is an ongoing PR #858 for worker groups which will introduce a huge breaking change.

Also, there is a terraform-aws-modules office hours call on November 6th, 15:00-16:00 CET (9-10am EST). We can probably discuss it there if there is time left. Here is more info about the upcoming Office Hours:

Link to Zoom (if you want to discuss something from the agenda, see below) - https://us02web.zoom.us/j/82195026969?pwd=WWNJaW4wQjZuTTlqWWUwVWZzUmJMUT09
Link to YouTube live stream (if you want to just listen and write in chat) - https://youtu.be/RXwYBI-IWw4

This call will be public for everyone to attend via Zoom (all of us) and to watch live on YouTube. Agenda:
I feel like big modules like this one here would benefit from being implemented with something like Pulumi (once it fully matures) |
This issue has been automatically closed because it has not had recent activity since being marked as stale.
I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.
A general question for users and contributors of this module 🙂
My feeling is that the complexity is getting too high and quality is suffering somewhat. We are squeezing a lot of features into a single module. This is by far the most complex module in the Terraform AWS modules org. But perhaps I just feel the pressure because I'm a maintainer?
Some recent examples:
- create_eks variable: Add destroy time flag #580

And now:
The amount of new resources being added to this module is just getting higher. And I don't see this slowing down as AWS is doubling down on EKS, as is everyone on k8s in general.
Don't get me wrong, these are all awesome contributions 💙 but I feel there's a price here. On every new release there's always a bunch of new issues related to the recent changes.
Let me know your thoughts 😃