-
Notifications
You must be signed in to change notification settings - Fork 157
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for resource-specific resync periods and default drift remediation period #106
Add support for resource-specific resync periods and default drift remediation period #106
Conversation
/hold |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just some naming suggestions inline... otherwise I like it!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is great work! A few comments, mostly on the contents of the inline documentation.
633110f
to
ac6bc0f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please see my comments regarding the duration = 0
edge case. Otherwise nits
if err != nil { | ||
return "", 0, fmt.Errorf("invalid value in flag argument: %v", err) | ||
} | ||
if resyncSeconds < 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think resyncSeconds <= 0
should be the check. If there are 0 seconds of reconciliation, it'll be constantly reconciled with no exponential backoff. Maybe we can suggest at least 1 second of wait?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just checked here, if a used provides 0 seconds, the controller will use the package default value which is 10hours
ac6bc0f
to
4663973
Compare
…mediation period This commit introduces the ability to specify resource-specific resync periods in the drift remediation configuration, as well as a default drift remediation period in the controller configuration. The resync period for each reconciler is determined by trying to retrieve it from the following sources, in this order: 1. A resource-specific period specified in the drift remediation configuration. 2. A resource-specific requeue on success period specified by the resource manager factory. 3. The default drift remediation period specified in the controller configuration. 4. The default resync period defined in the ACK runtime package. This allows users to customize the drift remediation behavior for different resources as needed, while still providing a fallback option for resources that do not have a specific period specified. Signed-off-by: Amine Hilaly <[email protected]>
4663973
to
4cac789
Compare
/unhold |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: A-Hilaly, jljaco The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a great feature
// getResyncPeriod returns the period of the recurring reconciler process which ensures the desired | ||
// state of custom resources is maintained. | ||
// It attempts to retrieve the duration from the following sources, in this order: | ||
// 1. A resource-specific reconciliation resync period specified in the reconciliation resync |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why resync configuration map?
can we use annotation on the resource like other existing features e.g.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not a super fan of resource-level drift remediation control - but happy to hear what other folks think about it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why resync configuration map?
can we use annotation on the resource like other existing features e.g.
* https://aws-controllers-k8s.github.io/community/docs/user-docs/multi-region-resource-management/ * https://aws-controllers-k8s.github.io/community/docs/user-docs/deletion-policy/
@surajkota for both of those annotations, there is a corresponding controller CLI flag:
Lines 111 to 115 in b98d322
flag.StringVar( | |
&cfg.Region, flagAWSRegion, | |
envutil.WithDefault(envVarAWSRegion, ""), | |
"The AWS Region in which the service controller will create its resources", | |
) |
Lines 151 to 154 in b98d322
flag.Var( | |
&cfg.DeletionPolicy, flagDeletionPolicy, | |
"The default deletion policy for all resources managed by the controller", | |
) |
The CLI flags serve as defaults if the annotation is not present.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ack on the controller flag, my question is related to highest in order to precedence.
Annotation or configmap, both are giving control to the user to configure the resync period. Since we don't have a common configmap or a CRD to define all controller configurations, my preference is to keep the experience consistent and not introduce another place~~
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi folks, Nick helped me with clarification that resync configuration map != ConfigMap
, so we can resolve this comment with one suggestion, rename the comment to say resource resync CLI flag
or something along those lines
I do think even CLI flag is not a great option because changing it requires restarting the controller or reinstalling helm chart but thats a discussion for another time. This is good to get started
see my comment on No 3 here. We should drop it
pkg/config/config.go
Outdated
@@ -152,6 +159,19 @@ func (cfg *Config) BindFlags() { | |||
&cfg.DeletionPolicy, flagDeletionPolicy, | |||
"The default deletion policy for all resources managed by the controller", | |||
) | |||
flag.IntVar( | |||
&cfg.ReconcileDefaultResyncSeconds, flagReconcileDefaultResyncSeconds, | |||
60, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Woah woah wait we don't want every resource to try and reconcile itself every 60 seconds. That's way too often. I was thinking on the order of every 6 or even 10 hours.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree for following reasons:
- We risk causing throttling error as number of resources managed increase
- There can be side affect of No 1 on inter service communication, e.g. services which use Autoscaling, SageMaker hosting service makes call to autoscaling service and if throttling happens, the SageMaker service will get throttled and impact scale in/out of fleet serving customer traffic
- Not all resources need this, e.g. SageMaker training jobs are one time jobs, SageMaker models do not support updates so the only way to introduce drift is by deleting and recreating the entire resource outside of ACK but still not resolvable by reconciling.
My vote is to drop this flag completely and let the service teams use requeue_on_success
to configure or let the customer override based on their preference. I strongly suggest starting conservative here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can still support this default resync period and also have a setting that ignores this for one-time style resources like TrainingJob
. Call that a future feature request. Managing resync periods for every resource individually would require users to update their configuration every time we add a new resource - this is essentially just a shortcut for all of them.
I actually believe I don't want service teams to set requeue_on_success
at all. It's ultimately more important that a customer can set their expectations for requeuing resources, since they know their drift conditions better than any ACK contributor will. Sure some resources should be shorter than 10 hours, but whether it's every 5 hours or every 10 minutes, it'll depend on everyone's specific deployment context.
pkg/config/config.go
Outdated
@@ -152,6 +159,19 @@ func (cfg *Config) BindFlags() { | |||
&cfg.DeletionPolicy, flagDeletionPolicy, | |||
"The default deletion policy for all resources managed by the controller", | |||
) | |||
flag.IntVar( | |||
&cfg.ReconcileDefaultResyncSeconds, flagReconcileDefaultResyncSeconds, | |||
60, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree for following reasons:
- We risk causing throttling error as number of resources managed increase
- There can be side affect of No 1 on inter service communication, e.g. services which use Autoscaling, SageMaker hosting service makes call to autoscaling service and if throttling happens, the SageMaker service will get throttled and impact scale in/out of fleet serving customer traffic
- Not all resources need this, e.g. SageMaker training jobs are one time jobs, SageMaker models do not support updates so the only way to introduce drift is by deleting and recreating the entire resource outside of ACK but still not resolvable by reconciling.
My vote is to drop this flag completely and let the service teams use requeue_on_success
to configure or let the customer override based on their preference. I strongly suggest starting conservative here
// getResyncPeriod returns the period of the recurring reconciler process which ensures the desired | ||
// state of custom resources is maintained. | ||
// It attempts to retrieve the duration from the following sources, in this order: | ||
// 1. A resource-specific reconciliation resync period specified in the reconciliation resync |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi folks, Nick helped me with clarification that resync configuration map != ConfigMap
, so we can resolve this comment with one suggestion, rename the comment to say resource resync CLI flag
or something along those lines
I do think even CLI flag is not a great option because changing it requires restarting the controller or reinstalling helm chart but thats a discussion for another time. This is good to get started
see my comment on No 3 here. We should drop it
Signed-off-by: Amine Hilaly <[email protected]>
…on comment Signed-off-by: Amine Hilaly <[email protected]>
…ft remediation period Issue: aws-controllers-k8s/community#1367 Follow-up of aws-controllers-k8s/runtime#106 This patch completes the implementation of support for resource-specific resync periods and a default drift remediation period made in runtime repository. The resync period for each reconciler is determined by trying to retrieve it from the following sources, in this order: 1. A resource-specific period specified in the drift remediation configuration. 2. A resource-specific requeue on success period specified by the resource manager factory. 3. The default drift remediation period specified in the controller configuration. 4. The default resync period defined in the ACK runtime package. This allows users to customize the drift remediation behavior for different resources as needed, while still providing a fallback option for resources that do not have a specific period specified. Signed-off-by: Amine Hilaly <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One last change then I'm happy to ship this
pkg/config/config.go
Outdated
&cfg.ReconcileDefaultResyncSeconds, flagReconcileDefaultResyncSeconds, | ||
0, | ||
"The default duration, in seconds, to wait before resyncing desired state of custom resources. "+ | ||
"This value is used if no resource-specific override has been specified. Default is 60 seconds.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please update this description for the new default
pkg/config/config.go
Outdated
@@ -163,7 +163,7 @@ func (cfg *Config) BindFlags() { | |||
&cfg.ReconcileDefaultResyncSeconds, flagReconcileDefaultResyncSeconds, | |||
0, | |||
"The default duration, in seconds, to wait before resyncing desired state of custom resources. "+ | |||
"This value is used if no resource-specific override has been specified. Default is 60 seconds.", | |||
"This value is used if no resource-specific override has been specified. Default is 0 seconds.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well... the default is the 10 hour fallback. 0 seconds makes it sound like it's constantly reconciling without any pause
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lgtm with this comment addressed
Thanks for the changes Amine
Signed-off-by: Amine Hilaly <[email protected]>
21ae777
to
27b87db
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's get this shipped, please. :)
Thanks for the last minute changes @a-hilaly /lgtm |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: A-Hilaly, jaypipes, jljaco, RedbackThomson, surajkota The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
…ft remediation period (#386) Issue: aws-controllers-k8s/community#1367 Follow-up of aws-controllers-k8s/runtime#106 This patch completes the implementation of support for resource-specific resync periods and a default drift remediation period made in runtime repository. The resync period for each reconciler is determined by trying to retrieve it from the following sources, in this order: 1. A resource-specific period specified in the drift remediation configuration. 2. A resource-specific requeue on success period specified by the resource manager factory. 3. The default drift remediation period specified in the controller configuration. 4. The default resync period defined in the ACK runtime package. This allows users to customize the drift remediation behavior for different resources as needed, while still providing a fallback option for resources that do not have a specific period specified. By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
As discussed in aws-controllers-k8s#106 this patch brings resource name validation to the arguments passed to `--reconcile-resource-resync-seconds`. It also slightly changes the previously implemented `ParseReconcileResourceResyncSeconds` to avoid uncessary validation ops. Signed-off-by: Amine Hilaly <[email protected]>
As discussed in aws-controllers-k8s#106 this patch brings resource name validation to the arguments passed to `--reconcile-resource-resync-seconds`. It also slightly changes the previously implemented `ParseReconcileResourceResyncSeconds` to avoid uncessary validation ops. Signed-off-by: Amine Hilaly <[email protected]>
As discussed in aws-controllers-k8s#106 this patch brings resource name validation to the arguments passed to `--reconcile-resource-resync-seconds`. It also slightly changes the previously implemented `ParseReconcileResourceResyncSeconds` to avoid uncessary validation ops. Signed-off-by: Amine Hilaly <[email protected]>
As discussed in aws-controllers-k8s#106 this patch brings resource name validation to the arguments passed to `--reconcile-resource-resync-seconds`. It also slightly changes the previously implemented `ParseReconcileResourceResyncSeconds` to avoid uncessary validation ops. Signed-off-by: Amine Hilaly <[email protected]>
As discussed in aws-controllers-k8s#106 this patch brings resource name validation to the arguments passed to `--reconcile-resource-resync-seconds`. It also slightly changes the previously implemented `ParseReconcileResourceResyncSeconds` to avoid uncessary validation ops. Signed-off-by: Amine Hilaly <[email protected]>
As discussed in aws-controllers-k8s#106 this patch brings resource name validation to the arguments passed to `--reconcile-resource-resync-seconds`. It also slightly changes the previously implemented `ParseReconcileResourceResyncSeconds` to avoid uncessary validation ops. Signed-off-by: Amine Hilaly <[email protected]>
As discussed in aws-controllers-k8s#106 this patch brings resource name validation to the arguments passed to `--reconcile-resource-resync-seconds`. It also slightly changes the previously implemented `ParseReconcileResourceResyncSeconds` to avoid uncessary validation ops. Signed-off-by: Amine Hilaly <[email protected]>
As discussed in aws-controllers-k8s#106 this patch brings resource name validation to the arguments passed to `--reconcile-resource-resync-seconds`. It also slightly changes the previously implemented `ParseReconcileResourceResyncSeconds` to avoid uncessary validation ops. Signed-off-by: Amine Hilaly <[email protected]>
As discussed in aws-controllers-k8s#106 this patch brings resource name validation to the arguments passed to `--reconcile-resource-resync-seconds`. It also slightly changes the previously implemented `ParseReconcileResourceResyncSeconds` to avoid uncessary validation ops. Signed-off-by: Amine Hilaly <[email protected]>
[fixes aws-controllers-k8s/community#1647] As discussed in #106 this patch brings resource name validation to the arguments passed to `--reconcile-resource-resync-seconds`. It also slightly changes the previously implemented `ParseReconcileResourceResyncSeconds` to avoid uncessary validation ops. Signed-off-by: Amine Hilaly <[email protected]> By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
Part of: aws-controllers-k8s/community#1367
This patch introduces the ability to specify resource-specific resync periods in
the drift remediation configuration, as well as a default drift remediation period
in the controller configuration. The resync period for each reconciler is
determined by trying to retrieve it from the following sources, in this order:
factory.
This allows users to customize the drift remediation behavior for different
resources as needed, while still providing a fallback option for resources that do
not have a specific period specified.
Signed-off-by: Amine Hilaly [email protected]
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.