HDDS-12615. Ozone Recon - Failure of any OM task during bootstrapping of Recon needs to be handled. #8098

Draft: wants to merge 1 commit into base: master


devmadhuu
Contributor

What changes were proposed in this pull request?

This PR handles failures of the Recon bootstrap and of its OM tasks.
If any OM task fails during Recon bootstrapping (full OM DB snapshot), the failed OM tasks must be picked up and reprocessed on the next bootstrap. If the OM DB tarball is received only partially or is corrupted, Recon should delete the tarball and fetch the OM DB tarball again from scratch.
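The tarball handling described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `TarballRefetch`, `TarballFetcher`, and `fetchFromScratch` are hypothetical names, and the sketch assumes the fetcher itself reports (by returning `false` or throwing) when the received tarball is incomplete or corrupted.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper for illustration only; not part of the Recon code base. */
final class TarballRefetch {

  /** Hypothetical abstraction over "download the OM DB tarball to this path". */
  interface TarballFetcher {
    /** Returns true only if a complete, valid tarball was written to target. */
    boolean fetch(Path target) throws IOException;
  }

  private TarballRefetch() { }

  /**
   * Tries to fetch the tarball up to maxAttempts times. Any leftover file from
   * a previous attempt is deleted before each attempt, so every fetch starts
   * from scratch. Returns true if some attempt succeeded.
   */
  static boolean fetchFromScratch(TarballFetcher fetcher, Path tarball, int maxAttempts) {
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        // Discard a partial or corrupted tarball left by an earlier attempt.
        Files.deleteIfExists(tarball);
        if (fetcher.fetch(tarball)) {
          return true;
        }
      } catch (IOException e) {
        // Treat an I/O error as a failed attempt and retry.
      }
    }
    try {
      // Do not leave a bad tarball behind after the final failed attempt.
      Files.deleteIfExists(tarball);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return false;
  }
}
```

Deleting before each attempt, rather than after a failure, also covers the case where Recon crashed mid-download and finds a stale file on restart.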

This change handles the following cases.

Case 1: The normal bootstrap flow takes care of this scenario.
full snapshot: DB not updated
  - OM snapshot number - 0
  - OM delta snapshot number - 0
  - All OM tasks' snapshot number - 0

Case 2: Recon reprocesses only those OM tasks whose last updated sequence number is zero.
full snapshot: DB updated, some tasks not reprocessed, Recon restarted or crashed
  - OM snapshot number - 100000
  - OM delta snapshot number - 0
  - A few OM tasks' snapshot number - 0; remaining OM tasks' snapshot number - 100000

Case 3: Recon reprocesses all OM tasks.
full snapshot: DB updated, tasks not reprocessed, Recon restarted or crashed
  - OM snapshot number - 100000
  - OM delta snapshot number - 0
  - All OM tasks' snapshot number - 0

Case 4: No OM task is forced to reprocess; on Recon restart, the normal bootstrap flow is sufficient.
full snapshot: DB updated, tasks reprocessed, but Recon restarted or crashed before the delta DB updates were applied
  - OM snapshot number - 100000
  - OM delta snapshot number - 0
  - All OM tasks' snapshot number - 100000

Case 5: Recon reprocesses all OM tasks.
full snapshot: DB updated, tasks reprocessed, delta DB updates also applied, but Recon restarted or crashed before any task processed the delta updates
  - OM snapshot number - 100000
  - OM delta snapshot number - 100010
  - All OM tasks' snapshot number - 100000

Case 6: Recon reprocesses only those OM tasks whose last updated sequence number is less than the OM delta snapshot number.
full snapshot: DB updated, tasks reprocessed, delta DB updates also applied, but Recon restarted or crashed before some tasks processed the delta updates
  - OM snapshot number - 100000
  - OM delta snapshot number - 100010
  - A few OM tasks' snapshot number - 100000; remaining OM tasks' snapshot number - 100010
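To make the six cases easier to compare, here is a small sketch of how the three kinds of sequence numbers could be turned into a reprocess decision. It is a simplification written for this description, not the logic in the patch: `ReprocessPlanner`, `tasksToReprocess`, and the task names are hypothetical, and the sketch assumes that a task is stale whenever its last updated sequence number is behind the sequence number the local OM DB has reached.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical helper for illustration only; not part of the Recon code base. */
final class ReprocessPlanner {

  private ReprocessPlanner() { }

  /**
   * Returns the names of the OM tasks that should be reprocessed after a restart.
   *
   * @param fullSnapshotSeq sequence number recorded for the full OM DB snapshot
   * @param deltaSeq sequence number recorded for delta updates (0 if none applied)
   * @param taskSeqs last updated sequence number of each OM task, keyed by task name
   */
  static Set<String> tasksToReprocess(long fullSnapshotSeq, long deltaSeq,
      Map<String, Long> taskSeqs) {
    Set<String> stale = new TreeSet<>();
    if (fullSnapshotSeq == 0) {
      // Case 1: no snapshot fetched yet, so the normal bootstrap flow applies.
      return stale;
    }
    // Assumption: the DB is at the delta number if deltas were applied,
    // otherwise at the full snapshot number.
    long dbSeq = Math.max(fullSnapshotSeq, deltaSeq);
    for (Map.Entry<String, Long> entry : taskSeqs.entrySet()) {
      if (entry.getValue() < dbSeq) {
        // Cases 2, 3, 5, 6: this task is behind the DB and must catch up.
        stale.add(entry.getKey());
      }
    }
    // Case 4 falls through with an empty set: every task matches the DB.
    return stale;
  }
}
```

With the Case 6 values, a call such as `tasksToReprocess(100000L, 100010L, taskSeqs)` returns only the tasks still at 100000; with the Case 4 values it returns an empty set.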

What is the link to the Apache JIRA?

https://issues.apache.org/jira/browse/HDDS-12615

How was this patch tested?

This patch was tested manually and on a local Docker cluster.

Contributor

@swamirishi swamirishi left a comment


Thanks for finding the issue @devmadhuu. I have some suggestions on the approach. I am not really sure of their feasibility. LMK if that is something possible.

// lastUpdatedSeqNumber number for any of the OM task, then just run reprocess for such tasks.

ReconTaskStatusUpdater fullSnapshotTaskStatusUpdater =
taskStatusUpdaterManager.getTaskStatusUpdater(OmSnapshotTaskName.OmSnapshotRequest.name());
Contributor


Why have a new fullSnapshotTaskStatusUpdater? We can update the same variable irrespective of whether it is a delta process or a reprocess.
We just have to bootstrap if there is a difference between the omSnapshot and OM leader sequence numbers, provided all TaskStatusUpdater tasks = OmSnapshot.sequenceNumber. At the end of every process or reprocess we should just update the sequenceNumber in taskStatusUpdater to the OmSnapshot's sequence number. If it doesn't match, we should just run reprocess if the value is 0, or rerun process from the seekPos. Isn't this possible?
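The per-task rule suggested in this comment could look roughly like the sketch below. It is only an interpretation of the suggestion, not code from the patch: `TaskCatchUp`, `Action`, and `decide` are hypothetical names, and "process from the seekPos" is represented by an enum value rather than an actual replay of updates.

```java
/** Hypothetical helper for illustration only; not part of the Recon code base. */
final class TaskCatchUp {

  /** What a single OM task should do on startup under the suggested rule. */
  enum Action { NONE, REPROCESS, PROCESS_FROM_SEEK_POS }

  private TaskCatchUp() { }

  /**
   * Compares a task's recorded sequence number with the OM snapshot's.
   *
   * @param taskSeq sequence number stored by the task after its last run
   * @param omSnapshotSeq sequence number of the local OM snapshot
   */
  static Action decide(long taskSeq, long omSnapshotSeq) {
    if (taskSeq == omSnapshotSeq) {
      // The task is already in sync with the snapshot.
      return Action.NONE;
    }
    if (taskSeq == 0) {
      // The task never completed a run, so it needs a full reprocess.
      return Action.REPROCESS;
    }
    // The task ran before but is out of sync: replay from its last position.
    return Action.PROCESS_FROM_SEEK_POS;
  }
}
```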

Contributor


This should be simpler to understand.

Contributor Author


@swamirishi thanks for the patch review. Currently we maintain the lastUpdatedSequence number and the last run status for every OM task in the form of Hadoop metrics, so this is the simplest way to handle all the cases mentioned in the PR description. Kindly have a look at all the possible cases.

Contributor

@sumitagrawl sumitagrawl left a comment


@devmadhuu I have left a few comments; I am trying to understand the logic.

* - Om Delta snapshot number - 100010
* - All Om Tasks snapshot number - 100000
*
* Case 6: This case will force Recon to run reprocess of only those OM tasks whose
Contributor


How are case 5 and case 6 different? Isn't it simply: if the fullSnapshot task number is not the same as the delta or task snapshot number, trigger those tasks?

Contributor Author


The difference is that in case 5 all OM tasks have a last updated sequence number behind the delta task's last updated sequence number, whereas in case 6 only a few tasks do. So in case 5 all OM tasks go for reprocess, and in case 6 only those OM tasks go for reprocess that could not finish processing their delta updates before Recon crashed or restarted.

On your question - "if fullSnapshot task not same delta or task snapshot number, trigger those task."
--- I think the condition you mentioned will not cover case 4 on its own. Please check: in case 4, with that condition, all OM tasks would go for reprocess, which we don't want.
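The Case 4 concern can be checked with a few lines. The sketch below encodes one literal reading of the quoted condition (trigger a task when the full snapshot number differs from the delta number or from the task's number); that reading, and the names `SimpleConditionCheck` and `simpleTrigger`, are assumptions made for illustration.

```java
/** Hypothetical check for illustration only; not part of the Recon code base. */
final class SimpleConditionCheck {

  private SimpleConditionCheck() { }

  /** One literal reading of the simplified condition from the review comment. */
  static boolean simpleTrigger(long fullSeq, long deltaSeq, long taskSeq) {
    return fullSeq != deltaSeq || fullSeq != taskSeq;
  }

  public static void main(String[] args) {
    // Case 4 values from the PR description: full = 100000, delta = 0, task = 100000.
    // The task is already in sync, yet full != delta makes the condition fire.
    System.out.println(simpleTrigger(100000L, 0L, 100000L)); // prints "true"
  }
}
```

Under this reading every task in Case 4 would be triggered, because the delta number is still 0 even though all tasks are in sync with the full snapshot.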

ReconTaskStatusUpdater taskStatusUpdater) {
return fullSnapshotTaskStatusUpdater.getLastUpdatedSeqNumber() > 0
&& deltaTaskStatusUpdater.getLastUpdatedSeqNumber() == 0
&& !isOmSnapshotTask(taskName)
Contributor


Why is the snapshot task itself not started?

Contributor Author


This logic re-runs the reprocess step of OM tasks (that is, processing the DB data); it does not fetch the DB updates again.

reconOmTaskMap.keySet()
.forEach(taskName -> {
LOG.info("{} -> {}", taskName,
taskStatusUpdaterManager.getTaskStatusUpdater(taskName).getLastUpdatedSeqNumber());
Contributor


How are fullSnapshotTask, the delta task, and taskStatus related?

Contributor Author


We don't want to fetch the DB updates again; we just want OM tasks to reprocess if they failed in their last run, whether that was during bootstrap or during delta updates, and Recon had to be restarted or crashed.
