HDDS-12615. Ozone Recon - Failure of any OM task during bootstrapping of Recon needs to be handled. #8098
base: master
Conversation
Thanks for finding the issue @devmadhuu. I had some suggestions on the approach, though I am not really sure of the feasibility. Let me know if it is something possible.
// lastUpdatedSeqNumber number for any of the OM task, then just run reprocess for such tasks.

ReconTaskStatusUpdater fullSnapshotTaskStatusUpdater =
    taskStatusUpdaterManager.getTaskStatusUpdater(OmSnapshotTaskName.OmSnapshotRequest.name());
Why have a new fullSnapshotTaskStatusUpdater? We can update the same variable irrespective of whether it is a delta process or a reprocess.
We only need to bootstrap if there is a difference between the omSnapshot and OM leader sequence numbers, provided all TaskStatusUpdater tasks equal OmSnapshot.sequenceNumber. At the end of every process or reprocess we should just update the sequenceNumber in taskStatusUpdater to the OmSnapshot's sequence number. If it doesn't match, we should run reprocess if the value is 0, or rerun process from the seekPos. Isn't this possible?
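A minimal sketch of the flow suggested above, assuming hypothetical names (TaskState, reconcile, omLeaderSeq) that are not the actual Recon API; it only illustrates the compare-and-decide logic:

```java
import java.util.HashMap;
import java.util.Map;

public class BootstrapDecisionSketch {

  /** Simplified stand-in for a per-task status record. */
  static final class TaskState {
    long lastUpdatedSeqNumber;

    TaskState(long seq) {
      this.lastUpdatedSeqNumber = seq;
    }
  }

  static void reconcile(long omSnapshotSeq, long omLeaderSeq,
      Map<String, TaskState> tasks) {
    // Bootstrap (fetch snapshot / delta updates) only when the local
    // OM snapshot lags the OM leader.
    if (omSnapshotSeq != omLeaderSeq) {
      System.out.println("Fetch OM DB snapshot / delta updates first");
    }

    for (Map.Entry<String, TaskState> e : tasks.entrySet()) {
      long taskSeq = e.getValue().lastUpdatedSeqNumber;
      if (taskSeq == omSnapshotSeq) {
        continue; // task already caught up with the snapshot
      }
      if (taskSeq == 0) {
        System.out.println("reprocess: " + e.getKey());
      } else {
        // Re-run incremental processing from the task's last position.
        System.out.println("process from seq " + taskSeq + ": " + e.getKey());
      }
      // On a successful run the task's sequence number is advanced to
      // the snapshot's sequence number.
      e.getValue().lastUpdatedSeqNumber = omSnapshotSeq;
    }
  }

  public static void main(String[] args) {
    Map<String, TaskState> tasks = new HashMap<>();
    tasks.put("ContainerKeyMapperTask", new TaskState(0));
    tasks.put("NSSummaryTask", new TaskState(100000));
    reconcile(100010, 100010, tasks);
  }
}
```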
This should be simpler to understand.
@swamirishi thanks for the patch review. Currently we maintain the lastUpdatedSeqNumber and last run status with every OM task in the form of Hadoop metrics, so this is the simplest way to handle all the cases mentioned in the PR description. Kindly have a look over all possible cases.
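For context, a minimal sketch of the per-task bookkeeping described above; the field names are illustrative assumptions and do not claim to match the actual ReconTaskStatusUpdater class:

```java
/** Illustrative per-OM-task status record (field names are assumptions). */
public class OmTaskStatusSketch {
  String taskName;              // e.g. "NSSummaryTask"
  long lastUpdatedSeqNumber;    // OM DB sequence number the task last reached
  long lastUpdatedTimestamp;    // when the task last finished a run
  int lastTaskRunStatus;        // e.g. 0 = success, non-zero = failure
}
```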
@devmadhuu I have given a few comments, trying to understand the logic.
 * - Om Delta snapshot number - 100010
 * - All Om Tasks snapshot number - 100000
 *
 * Case 6: This case will force Recon to run reprocess of only those OM tasks whose
How are case 5 and case 6 different? Isn't it simply that if the fullSnapshot task is not at the same sequence number as the delta task or the task snapshot number, we trigger those tasks?
The difference between case 5 and case 6 is that in case 5 all OM tasks have their last updated sequence number behind the delta task's last updated sequence number, while in case 6 only a few tasks are behind it. So in case 5 all OM tasks will go for reprocess, and in case 6 only those OM tasks will go for reprocess that could not complete processing their delta updates before Recon crashed or restarted.
On your question "if the fullSnapshot task is not at the same sequence number as the delta task or the task snapshot number, we trigger those tasks": I think that condition alone will not cover case 4, please check. With that condition, in case 4 all OM tasks would go for reprocess, which we don't want.
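To make the case 5 / case 6 distinction concrete, a hedged sketch (the task names and sequence numbers here are illustrative, not from the patch) of selecting only the tasks whose sequence number lags the delta task's:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class LaggingTaskSelectionSketch {

  /**
   * Returns the tasks whose last updated sequence number is behind the delta
   * task's sequence number. In case 5 every task lags, so every task is
   * selected for reprocess; in case 6 only the tasks that crashed before
   * finishing their delta batch are selected.
   */
  static List<String> tasksNeedingReprocess(long deltaTaskSeq,
      Map<String, Long> taskSeqNumbers) {
    return taskSeqNumbers.entrySet().stream()
        .filter(e -> e.getValue() < deltaTaskSeq)
        .map(Map.Entry::getKey)
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    // Case 6 style input: only one task is behind the delta sequence number.
    Map<String, Long> seqNumbers = new HashMap<>();
    seqNumbers.put("ContainerKeyMapperTask", 100010L);
    seqNumbers.put("NSSummaryTask", 100000L);
    System.out.println(tasksNeedingReprocess(100010L, seqNumbers));
  }
}
```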
    ReconTaskStatusUpdater taskStatusUpdater) {
  return fullSnapshotTaskStatusUpdater.getLastUpdatedSeqNumber() > 0
      && deltaTaskStatusUpdater.getLastUpdatedSeqNumber() == 0
      && !isOmSnapshotTask(taskName)
Why is the snapshot task itself not started?
This logic is for re-running the reprocess of OM tasks (which means processing the DB data), not for fetching the DB updates again.
reconOmTaskMap.keySet()
    .forEach(taskName -> {
      LOG.info("{} -> {}", taskName,
          taskStatusUpdaterManager.getTaskStatusUpdater(taskName).getLastUpdatedSeqNumber());
How are the fullSnapshotTask, the delta task and the taskStatus related?
We don't want to get the DB updates again; we just want OM tasks to reprocess if they failed in their last run, whether it was the bootstrap case or the delta updates case, and Recon crashed or had to be restarted.
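A rough sketch of that idea, rerunning only the failed tasks against the OM DB copy Recon already has, without fetching a new snapshot or delta updates; ReconOmTask here is a simplified stand-in, not the actual interface:

```java
import java.util.Map;

public class RerunFailedTasksSketch {

  /** Simplified stand-in for an OM task. */
  interface ReconOmTask {
    // Rebuild the task's data from the local OM DB copy.
    void reprocess(Object localOmMetadataManager);
  }

  static void rerunFailedTasks(Map<String, ReconOmTask> tasks,
      Map<String, Boolean> lastRunSucceeded,
      Object localOmMetadataManager) {
    tasks.forEach((name, task) -> {
      if (!lastRunSucceeded.getOrDefault(name, false)) {
        // Only the failed tasks are reprocessed; no new OM DB fetch happens.
        task.reprocess(localOmMetadataManager);
      }
    });
  }
}
```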
What changes were proposed in this pull request?
This PR change is to handle failure of Recon bootstrap and its OM tasks.
If any OM task fails during bootstrapping of Recon (full OM DB snapshot), the failed OM tasks need to be handled so that Recon bootstraps and reprocesses those tasks again. For a partial or corrupted download of the OM DB tar ball, Recon should clean up and delete the tar ball and start the fetch of the OM DB tar ball from scratch (sketched below).
The following cases can occur and are handled accordingly by this change.
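Illustrative-only sketch of the tar ball recovery behaviour described above; the paths and the fetch/validate helpers are hypothetical, not the actual Recon implementation:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class TarballRecoverySketch {

  static boolean looksComplete(Path tarball) throws IOException {
    // Placeholder check; the real validation would verify the archive itself.
    return Files.exists(tarball) && Files.size(tarball) > 0;
  }

  static Path fetchOmDbTarball(Path targetDir) {
    // Placeholder for the HTTP call that downloads the OM DB checkpoint.
    return targetDir.resolve("om.db.tar");
  }

  static Path fetchWithCleanup(Path targetDir) throws IOException {
    Path tarball = fetchOmDbTarball(targetDir);
    if (!looksComplete(tarball)) {
      // Partial or corrupted download: delete it and start the fetch again.
      Files.deleteIfExists(tarball);
      tarball = fetchOmDbTarball(targetDir);
    }
    return tarball;
  }
}
```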
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-12615
How was this patch tested?
This patch was tested manually and with a local Docker cluster.