Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Properly handle CARM errors and requeues #140

Merged

Conversation

a-hilaly
Copy link
Member

Prior to this patch we introduced new mechanisms to handle Role ARN
retrievals from the CARM cache. Which improved ACK runtime scaling
capabilities and addressed possible race condition scenarios. However in
the process we missed two things:

  • Setting an ACK.ResourceSynced condition stating that the resource
    isn't synced, yet.
  • Returning the correct runtime error that will cause the reconciller to
    requeue every 15seconds.

Both of the problems stemmed from the fact that we're not "yet" in the
reconcile function (rm.Sync) that is wrapped by a proper error handler
(that triggers requeues, and resets/sets resource conditions). In this
special case, we need to manually inject the condition and return a
controller-runtime error that will trigger a requeue after 15seconds.

While this is a "fair" fix, we're planning on refactoring a lot of the
runtime logic to make easier to read, maintain and more importantly
expose reusable component that will help avoid falling into such traps.

Signed-off-by: Amine Hilaly [email protected]

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

Prior to this patch we introduced better mechanisms to handle Role ARN
retrievals from the CARM cache. Which improved ACK runtime scaling
capabilities and addressed possible race condition scenarios. However in
the process we missed two things:
- Setting an ACK.ResourceSynced condition stating that the resource
  isn't synced, yet.
- Returning the right runtime error that will cause the reconciller to
  requeue every 15seconds.

Both of the points stemmed from the fact that we're not "yet" in the
reconcile function that is wrapped by a proper error handler (that
triggers requeues, and resets/sets resource conditions). In this special
case, we need to manually inject the condition and return a
controller-runtime error that will trigger a requeue after 15seconds.

While this is a good fix, we're planning on refactoring a lot of the
runtime logic to make easier to read, maintain and more importantly
expose reusable component that will help avoid falling into such traps.

Signed-off-by: Amine Hilaly <[email protected]>
@ack-prow ack-prow bot requested review from jlbutler and jljaco February 29, 2024 04:22
@ack-prow ack-prow bot added the approved label Feb 29, 2024
@@ -124,6 +124,7 @@ func (r *adoptionReconciler) reconcile(ctx context.Context, req ctrlrt.Request)
// is not available.
roleARN, err = r.getRoleARN(acctID)
if err != nil {
ackrtlog.InfoAdoptedResource(r.log, res, fmt.Sprintf("Unable to start adoption reconcilliation %s: %v", acctID, err))
// r.getRoleARN errors are not terminal, we should requeue.
return requeue.NeededAfter(err, roleARNNotAvailableRequeueDelay)
Copy link
Member Author

@a-hilaly a-hilaly Feb 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This requeue works fine, because r.reconcile is wrapped by handleReconcileErrors

@a-hilaly
Copy link
Member Author

/test all

@a-hilaly
Copy link
Member Author

/retest

1 similar comment
@a-hilaly
Copy link
Member Author

/retest

@a-hilaly
Copy link
Member Author

/test all

@a-hilaly
Copy link
Member Author

a-hilaly commented Mar 1, 2024

/retest

2 similar comments
@a-hilaly
Copy link
Member Author

a-hilaly commented Mar 1, 2024

/retest

@a-hilaly
Copy link
Member Author

a-hilaly commented Mar 1, 2024

/retest

@yuxiang-zhang
Copy link
Member

/lgtm

@ack-prow ack-prow bot added the lgtm Indicates that a PR is ready to be merged. label Mar 1, 2024
Copy link

ack-prow bot commented Mar 1, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: a-hilaly, yuxiang-zhang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ack-prow ack-prow bot merged commit 861f7ed into aws-controllers-k8s:main Mar 1, 2024
5 checks passed
ack-prow bot pushed a commit to aws-controllers-k8s/code-generator that referenced this pull request Mar 5, 2024
Propagating two main patches to all the controllers:
- Properly handle CARM errors and requeues by @a-hilaly in aws-controllers-k8s/runtime#140
- Introduce flags for fine-tuning maximum concurrent reconciles per resource by @a-hilaly in aws-controllers-k8s/runtime#141

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
ndbhat pushed a commit to ndbhat/ack-runtime that referenced this pull request Apr 16, 2024
Prior to this patch we introduced new mechanisms to handle Role ARN
retrievals from the CARM cache. Which improved ACK runtime scaling
capabilities and addressed possible race condition scenarios. However in
the process we missed two things:
- Setting an `ACK.ResourceSynced` condition stating that the resource
  isn't synced, yet.
- Returning the **correct** runtime error that will cause the reconciller to
  requeue every 15seconds.

Both of the problems stemmed from the fact that we're not "yet" in the
reconcile function (`rm.Sync`) that is wrapped by a proper error handler
(that triggers requeues, and resets/sets resource conditions). In this
special case, we need to manually inject the condition and return a
controller-runtime error that will trigger a requeue after 15seconds.

While this is a "fair" fix, we're planning on refactoring a lot of the
runtime logic to make easier to read, maintain and more importantly
expose reusable component that will help avoid falling into such traps.

Signed-off-by: Amine Hilaly <[email protected]>

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants