Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for resource-specific resync periods and default drift remediation period #106

Merged
merged 4 commits into from
Feb 6, 2023

Conversation

a-hilaly
Copy link
Member

Part of: aws-controllers-k8s/community#1367

This patch introduces the ability to specify resource-specific resync periods in
the drift remediation configuration, as well as a default drift remediation period
in the controller configuration. The resync period for each reconciler is
determined by trying to retrieve it from the following sources, in this order:

  1. A resource-specific period specified in the drift remediation configuration.
  2. A resource-specific requeue on success period specified by the resource manager
    factory.
  3. The default drift remediation period specified in the controller configuration.
  4. The default resync period defined in the ACK runtime package.

This allows users to customize the drift remediation behavior for different
resources as needed, while still providing a fallback option for resources that do
not have a specific period specified.

Signed-off-by: Amine Hilaly [email protected]

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@a-hilaly
Copy link
Member Author

/hold

@ack-bot ack-bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 22, 2022
Copy link
Collaborator

@jaypipes jaypipes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just some naming suggestions inline... otherwise I like it!

Copy link

@jljaco jljaco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great work! A few comments, mostly on the contents of the inline documentation.

Copy link
Contributor

@RedbackThomson RedbackThomson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please see my comments regarding the duration = 0 edge case. Otherwise nits

if err != nil {
return "", 0, fmt.Errorf("invalid value in flag argument: %v", err)
}
if resyncSeconds < 0 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think resyncSeconds <= 0 should be the check. If there are 0 seconds of reconciliation, it'll be constantly reconciled with no exponential backoff. Maybe we can suggest at least 1 second of wait?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just checked here, if a used provides 0 seconds, the controller will use the package default value which is 10hours

@a-hilaly a-hilaly force-pushed the drift-remediation-flag branch from ac6bc0f to 4663973 Compare January 25, 2023 18:20
…mediation period

This commit introduces the ability to specify resource-specific resync periods in
the drift remediation configuration, as well as a default drift remediation period
in the controller configuration. The resync period for each reconciler is
determined by trying to retrieve it from the following sources, in this order:

1. A resource-specific period specified in the drift remediation configuration.
2. A resource-specific requeue on success period specified by the resource manager
   factory.
3. The default drift remediation period specified in the controller configuration.
4. The default resync period defined in the ACK runtime package.

This allows users to customize the drift remediation behavior for different
resources as needed, while still providing a fallback option for resources that do
not have a specific period specified.

Signed-off-by: Amine Hilaly <[email protected]>
@a-hilaly
Copy link
Member Author

/unhold

@ack-bot
Copy link
Collaborator

ack-bot commented Jan 27, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: A-Hilaly, jljaco

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copy link
Member

@surajkota surajkota left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a great feature

// getResyncPeriod returns the period of the recurring reconciler process which ensures the desired
// state of custom resources is maintained.
// It attempts to retrieve the duration from the following sources, in this order:
// 1. A resource-specific reconciliation resync period specified in the reconciliation resync
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not a super fan of resource-level drift remediation control - but happy to hear what other folks think about it.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why resync configuration map?

can we use annotation on the resource like other existing features e.g.

* https://aws-controllers-k8s.github.io/community/docs/user-docs/multi-region-resource-management/

* https://aws-controllers-k8s.github.io/community/docs/user-docs/deletion-policy/

@surajkota for both of those annotations, there is a corresponding controller CLI flag:

flag.StringVar(
&cfg.Region, flagAWSRegion,
envutil.WithDefault(envVarAWSRegion, ""),
"The AWS Region in which the service controller will create its resources",
)

flag.Var(
&cfg.DeletionPolicy, flagDeletionPolicy,
"The default deletion policy for all resources managed by the controller",
)

The CLI flags serve as defaults if the annotation is not present.

Copy link
Member

@surajkota surajkota Feb 2, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack on the controller flag, my question is related to highest in order to precedence.

Annotation or configmap, both are giving control to the user to configure the resync period. Since we don't have a common configmap or a CRD to define all controller configurations, my preference is to keep the experience consistent and not introduce another place~~

Copy link
Member

@surajkota surajkota Feb 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi folks, Nick helped me with clarification that resync configuration map != ConfigMap, so we can resolve this comment with one suggestion, rename the comment to say resource resync CLI flag or something along those lines

I do think even CLI flag is not a great option because changing it requires restarting the controller or reinstalling helm chart but thats a discussion for another time. This is good to get started

see my comment on No 3 here. We should drop it

@@ -152,6 +159,19 @@ func (cfg *Config) BindFlags() {
&cfg.DeletionPolicy, flagDeletionPolicy,
"The default deletion policy for all resources managed by the controller",
)
flag.IntVar(
&cfg.ReconcileDefaultResyncSeconds, flagReconcileDefaultResyncSeconds,
60,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Woah woah wait we don't want every resource to try and reconcile itself every 60 seconds. That's way too often. I was thinking on the order of every 6 or even 10 hours.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree for following reasons:

  1. We risk causing throttling error as number of resources managed increase
  2. There can be side affect of No 1 on inter service communication, e.g. services which use Autoscaling, SageMaker hosting service makes call to autoscaling service and if throttling happens, the SageMaker service will get throttled and impact scale in/out of fleet serving customer traffic
  3. Not all resources need this, e.g. SageMaker training jobs are one time jobs, SageMaker models do not support updates so the only way to introduce drift is by deleting and recreating the entire resource outside of ACK but still not resolvable by reconciling.

My vote is to drop this flag completely and let the service teams use requeue_on_success to configure or let the customer override based on their preference. I strongly suggest starting conservative here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can still support this default resync period and also have a setting that ignores this for one-time style resources like TrainingJob. Call that a future feature request. Managing resync periods for every resource individually would require users to update their configuration every time we add a new resource - this is essentially just a shortcut for all of them.

I actually believe I don't want service teams to set requeue_on_success at all. It's ultimately more important that a customer can set their expectations for requeuing resources, since they know their drift conditions better than any ACK contributor will. Sure some resources should be shorter than 10 hours, but whether it's every 5 hours or every 10 minutes, it'll depend on everyone's specific deployment context.

@@ -152,6 +159,19 @@ func (cfg *Config) BindFlags() {
&cfg.DeletionPolicy, flagDeletionPolicy,
"The default deletion policy for all resources managed by the controller",
)
flag.IntVar(
&cfg.ReconcileDefaultResyncSeconds, flagReconcileDefaultResyncSeconds,
60,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree for following reasons:

  1. We risk causing throttling error as number of resources managed increase
  2. There can be side affect of No 1 on inter service communication, e.g. services which use Autoscaling, SageMaker hosting service makes call to autoscaling service and if throttling happens, the SageMaker service will get throttled and impact scale in/out of fleet serving customer traffic
  3. Not all resources need this, e.g. SageMaker training jobs are one time jobs, SageMaker models do not support updates so the only way to introduce drift is by deleting and recreating the entire resource outside of ACK but still not resolvable by reconciling.

My vote is to drop this flag completely and let the service teams use requeue_on_success to configure or let the customer override based on their preference. I strongly suggest starting conservative here

// getResyncPeriod returns the period of the recurring reconciler process which ensures the desired
// state of custom resources is maintained.
// It attempts to retrieve the duration from the following sources, in this order:
// 1. A resource-specific reconciliation resync period specified in the reconciliation resync
Copy link
Member

@surajkota surajkota Feb 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi folks, Nick helped me with clarification that resync configuration map != ConfigMap, so we can resolve this comment with one suggestion, rename the comment to say resource resync CLI flag or something along those lines

I do think even CLI flag is not a great option because changing it requires restarting the controller or reinstalling helm chart but thats a discussion for another time. This is good to get started

see my comment on No 3 here. We should drop it

a-hilaly added a commit to a-hilaly/ack-code-generator that referenced this pull request Feb 3, 2023
…ft remediation period

Issue: aws-controllers-k8s/community#1367
Follow-up of aws-controllers-k8s/runtime#106

This patch completes the implementation of support for resource-specific
resync periods and a default drift remediation period made in runtime
repository. The resync period for each reconciler is determined by trying
to retrieve it from the following sources, in this order:

1. A resource-specific period specified in the drift remediation configuration.
2. A resource-specific requeue on success period specified by the resource manager factory.
3. The default drift remediation period specified in the controller configuration.
4. The default resync period defined in the ACK runtime package.

This allows users to customize the drift remediation behavior for different
resources as needed, while still providing a fallback option for resources
that do not have a specific period specified.

Signed-off-by: Amine Hilaly <[email protected]>
Copy link
Contributor

@RedbackThomson RedbackThomson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One last change then I'm happy to ship this

&cfg.ReconcileDefaultResyncSeconds, flagReconcileDefaultResyncSeconds,
0,
"The default duration, in seconds, to wait before resyncing desired state of custom resources. "+
"This value is used if no resource-specific override has been specified. Default is 60 seconds.",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please update this description for the new default

@@ -163,7 +163,7 @@ func (cfg *Config) BindFlags() {
&cfg.ReconcileDefaultResyncSeconds, flagReconcileDefaultResyncSeconds,
0,
"The default duration, in seconds, to wait before resyncing desired state of custom resources. "+
"This value is used if no resource-specific override has been specified. Default is 60 seconds.",
"This value is used if no resource-specific override has been specified. Default is 0 seconds.",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well... the default is the 10 hour fallback. 0 seconds makes it sound like it's constantly reconciling without any pause

Copy link
Member

@surajkota surajkota left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lgtm with this comment addressed
Thanks for the changes Amine

#106 (comment)

@a-hilaly a-hilaly force-pushed the drift-remediation-flag branch from 21ae777 to 27b87db Compare February 6, 2023 15:58
Copy link
Collaborator

@jaypipes jaypipes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's get this shipped, please. :)

@RedbackThomson
Copy link
Contributor

Thanks for the last minute changes @a-hilaly

/lgtm

@ack-prow ack-prow bot added the lgtm Indicates that a PR is ready to be merged. label Feb 6, 2023
@ack-prow
Copy link

ack-prow bot commented Feb 6, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: A-Hilaly, jaypipes, jljaco, RedbackThomson, surajkota

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [A-Hilaly,RedbackThomson,jljaco]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ack-prow ack-prow bot merged commit 3b2bc34 into aws-controllers-k8s:main Feb 6, 2023
ack-prow bot pushed a commit to aws-controllers-k8s/code-generator that referenced this pull request Feb 7, 2023
…ft remediation period (#386)

Issue: aws-controllers-k8s/community#1367
Follow-up of aws-controllers-k8s/runtime#106

This patch completes the implementation of support for resource-specific
resync periods and a default drift remediation period made in runtime
repository. The resync period for each reconciler is determined by trying
to retrieve it from the following sources, in this order:

1. A resource-specific period specified in the drift remediation configuration.
2. A resource-specific requeue on success period specified by the resource manager factory.
3. The default drift remediation period specified in the controller configuration.
4. The default resync period defined in the ACK runtime package.

This allows users to customize the drift remediation behavior for different
resources as needed, while still providing a fallback option for resources
that do not have a specific period specified.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
a-hilaly added a commit to a-hilaly/ack-runtime that referenced this pull request Mar 3, 2023
As discussed in aws-controllers-k8s#106
this patch brings resource name validation to the arguments passed to
`--reconcile-resource-resync-seconds`. It also slightly changes the
previously implemented `ParseReconcileResourceResyncSeconds` to avoid
uncessary validation ops.

Signed-off-by: Amine Hilaly <[email protected]>
a-hilaly added a commit to a-hilaly/ack-runtime that referenced this pull request Mar 3, 2023
As discussed in aws-controllers-k8s#106
this patch brings resource name validation to the arguments passed to
`--reconcile-resource-resync-seconds`. It also slightly changes the
previously implemented `ParseReconcileResourceResyncSeconds` to avoid
uncessary validation ops.

Signed-off-by: Amine Hilaly <[email protected]>
a-hilaly added a commit to a-hilaly/ack-runtime that referenced this pull request Mar 6, 2023
As discussed in aws-controllers-k8s#106
this patch brings resource name validation to the arguments passed to
`--reconcile-resource-resync-seconds`. It also slightly changes the
previously implemented `ParseReconcileResourceResyncSeconds` to avoid
uncessary validation ops.

Signed-off-by: Amine Hilaly <[email protected]>
a-hilaly added a commit to a-hilaly/ack-runtime that referenced this pull request Mar 6, 2023
As discussed in aws-controllers-k8s#106
this patch brings resource name validation to the arguments passed to
`--reconcile-resource-resync-seconds`. It also slightly changes the
previously implemented `ParseReconcileResourceResyncSeconds` to avoid
uncessary validation ops.

Signed-off-by: Amine Hilaly <[email protected]>
a-hilaly added a commit to a-hilaly/ack-runtime that referenced this pull request Mar 6, 2023
As discussed in aws-controllers-k8s#106
this patch brings resource name validation to the arguments passed to
`--reconcile-resource-resync-seconds`. It also slightly changes the
previously implemented `ParseReconcileResourceResyncSeconds` to avoid
uncessary validation ops.

Signed-off-by: Amine Hilaly <[email protected]>
a-hilaly added a commit to a-hilaly/ack-runtime that referenced this pull request Mar 6, 2023
As discussed in aws-controllers-k8s#106
this patch brings resource name validation to the arguments passed to
`--reconcile-resource-resync-seconds`. It also slightly changes the
previously implemented `ParseReconcileResourceResyncSeconds` to avoid
uncessary validation ops.

Signed-off-by: Amine Hilaly <[email protected]>
a-hilaly added a commit to a-hilaly/ack-runtime that referenced this pull request Mar 6, 2023
As discussed in aws-controllers-k8s#106
this patch brings resource name validation to the arguments passed to
`--reconcile-resource-resync-seconds`. It also slightly changes the
previously implemented `ParseReconcileResourceResyncSeconds` to avoid
uncessary validation ops.

Signed-off-by: Amine Hilaly <[email protected]>
a-hilaly added a commit to a-hilaly/ack-runtime that referenced this pull request Mar 6, 2023
As discussed in aws-controllers-k8s#106
this patch brings resource name validation to the arguments passed to
`--reconcile-resource-resync-seconds`. It also slightly changes the
previously implemented `ParseReconcileResourceResyncSeconds` to avoid
uncessary validation ops.

Signed-off-by: Amine Hilaly <[email protected]>
a-hilaly added a commit to a-hilaly/ack-runtime that referenced this pull request Apr 7, 2023
As discussed in aws-controllers-k8s#106
this patch brings resource name validation to the arguments passed to
`--reconcile-resource-resync-seconds`. It also slightly changes the
previously implemented `ParseReconcileResourceResyncSeconds` to avoid
uncessary validation ops.

Signed-off-by: Amine Hilaly <[email protected]>
ack-prow bot pushed a commit that referenced this pull request Apr 13, 2023
[fixes aws-controllers-k8s/community#1647]

As discussed in #106

this patch brings resource name validation to the arguments passed to
`--reconcile-resource-resync-seconds`. It also slightly changes the
previously implemented `ParseReconcileResourceResyncSeconds` to avoid
uncessary validation ops.

Signed-off-by: Amine Hilaly <[email protected]>

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants