HDDS-12615. Ozone Recon - Failure of any OM task during bootstrapping of Recon needs to be handled. #8098
base: master
Conversation
Thanks for finding the issue @devmadhuu. I had some suggestions on the approach, though I am not really sure of the feasibility. Let me know if it is something possible.
// lastUpdatedSeqNumber number for any of the OM task, then just run reprocess for such tasks.

ReconTaskStatusUpdater fullSnapshotTaskStatusUpdater =
    taskStatusUpdaterManager.getTaskStatusUpdater(OmSnapshotTaskName.OmSnapshotRequest.name());
Why have a new fullSnapshotTaskStatusUpdater? We can update the same variable irrespective of whether it is a delta process or a reprocess.
We only need to bootstrap if there is a difference between the omSnapshot and OM leader sequence numbers, provided all TaskStatusUpdater tasks equal OmSnapshot.sequenceNumber. At the end of every process or reprocess we should just update the sequenceNumber in taskStatusUpdater to the OmSnapshot's sequence number. If it doesn't match, we should run reprocess if the value is 0, or rerun process from the seekPos. Isn't this possible?
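A minimal sketch of the flow suggested above, assuming hypothetical names (TaskState, reconcile, omLeaderSeq) that are not the actual Recon API; it only illustrates the compare-and-decide logic:

```java
import java.util.HashMap;
import java.util.Map;

public class BootstrapDecisionSketch {

  /** Simplified stand-in for a per-task status record. */
  static final class TaskState {
    long lastUpdatedSeqNumber;

    TaskState(long seq) {
      this.lastUpdatedSeqNumber = seq;
    }
  }

  static void reconcile(long omSnapshotSeq, long omLeaderSeq,
      Map<String, TaskState> tasks) {
    // Bootstrap (fetch snapshot / delta updates) only when the local
    // OM snapshot lags the OM leader.
    if (omSnapshotSeq != omLeaderSeq) {
      System.out.println("Fetch OM DB snapshot / delta updates first");
    }

    for (Map.Entry<String, TaskState> e : tasks.entrySet()) {
      long taskSeq = e.getValue().lastUpdatedSeqNumber;
      if (taskSeq == omSnapshotSeq) {
        continue; // task already caught up with the snapshot
      }
      if (taskSeq == 0) {
        System.out.println("reprocess: " + e.getKey());
      } else {
        // Re-run incremental processing from the task's last position.
        System.out.println("process from seq " + taskSeq + ": " + e.getKey());
      }
      // On a successful run the task's sequence number is advanced to
      // the snapshot's sequence number.
      e.getValue().lastUpdatedSeqNumber = omSnapshotSeq;
    }
  }

  public static void main(String[] args) {
    Map<String, TaskState> tasks = new HashMap<>();
    tasks.put("ContainerKeyMapperTask", new TaskState(0));
    tasks.put("NSSummaryTask", new TaskState(100000));
    reconcile(100010, 100010, tasks);
  }
}
```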
This should be simpler to understand.
@swamirishi thanks for the patch review. Currently we maintain the lastUpdatedSeqNumber and last run status with every OM task in the form of Hadoop metrics, so this is the simplest way to handle all the cases mentioned in the PR description. Kindly have a look over all possible cases.
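For context, a minimal sketch of the per-task bookkeeping described above; the field names are illustrative assumptions and do not claim to match the actual ReconTaskStatusUpdater class:

```java
/** Illustrative per-OM-task status record (field names are assumptions). */
public class OmTaskStatusSketch {
  String taskName;              // e.g. "NSSummaryTask"
  long lastUpdatedSeqNumber;    // OM DB sequence number the task last reached
  long lastUpdatedTimestamp;    // when the task last finished a run
  int lastTaskRunStatus;        // e.g. 0 = success, non-zero = failure
}
```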
@devmadhuu I have given a few comments, trying to understand the logic.
 * - Om Delta snapshot number - 100010
 * - All Om Tasks snapshot number - 100000
 *
 * Case 6: This case will force Recon to run reprocess of only those OM tasks whose
How are case 5 and case 6 different? Isn't it simply that if the fullSnapshot task is not at the same sequence number as the delta task or the task snapshot number, we trigger those tasks?
The difference between case 5 and case 6 is that in case 5 all OM tasks have their last updated sequence number behind the delta task's last updated sequence number, while in case 6 only a few tasks are behind it. So in case 5 all OM tasks will go for reprocess, and in case 6 only those OM tasks will go for reprocess that could not complete processing their delta updates before Recon crashed or restarted.
On your question "if the fullSnapshot task is not at the same sequence number as the delta task or the task snapshot number, we trigger those tasks": I think that condition alone will not cover case 4, please check. With that condition, in case 4 all OM tasks would go for reprocess, which we don't want.
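To make the case 5 / case 6 distinction concrete, a hedged sketch (the task names and sequence numbers here are illustrative, not from the patch) of selecting only the tasks whose sequence number lags the delta task's:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class LaggingTaskSelectionSketch {

  /**
   * Returns the tasks whose last updated sequence number is behind the delta
   * task's sequence number. In case 5 every task lags, so every task is
   * selected for reprocess; in case 6 only the tasks that crashed before
   * finishing their delta batch are selected.
   */
  static List<String> tasksNeedingReprocess(long deltaTaskSeq,
      Map<String, Long> taskSeqNumbers) {
    return taskSeqNumbers.entrySet().stream()
        .filter(e -> e.getValue() < deltaTaskSeq)
        .map(Map.Entry::getKey)
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    // Case 6 style input: only one task is behind the delta sequence number.
    Map<String, Long> seqNumbers = new HashMap<>();
    seqNumbers.put("ContainerKeyMapperTask", 100010L);
    seqNumbers.put("NSSummaryTask", 100000L);
    System.out.println(tasksNeedingReprocess(100010L, seqNumbers));
  }
}
```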
    ReconTaskStatusUpdater taskStatusUpdater) {
  return fullSnapshotTaskStatusUpdater.getLastUpdatedSeqNumber() > 0
      && deltaTaskStatusUpdater.getLastUpdatedSeqNumber() == 0
      && !isOmSnapshotTask(taskName)
Why is the snapshot task itself not started?
This logic is for re-running the reprocess of OM tasks (which means processing the DB data), not for fetching the DB updates again.
reconOmTaskMap.keySet()
    .forEach(taskName -> {
      LOG.info("{} -> {}", taskName,
          taskStatusUpdaterManager.getTaskStatusUpdater(taskName).getLastUpdatedSeqNumber());
How are the fullSnapshotTask, the delta task and the taskStatus related?
We don't want to get the DB updates again; we just want OM tasks to reprocess if they failed in their last run, whether it was the bootstrap case or the delta updates case, and Recon crashed or had to be restarted.
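A rough sketch of that idea, rerunning only the failed tasks against the OM DB copy Recon already has, without fetching a new snapshot or delta updates; ReconOmTask here is a simplified stand-in, not the actual interface:

```java
import java.util.Map;

public class RerunFailedTasksSketch {

  /** Simplified stand-in for an OM task. */
  interface ReconOmTask {
    // Rebuild the task's data from the local OM DB copy.
    void reprocess(Object localOmMetadataManager);
  }

  static void rerunFailedTasks(Map<String, ReconOmTask> tasks,
      Map<String, Boolean> lastRunSucceeded,
      Object localOmMetadataManager) {
    tasks.forEach((name, task) -> {
      if (!lastRunSucceeded.getOrDefault(name, false)) {
        // Only the failed tasks are reprocessed; no new OM DB fetch happens.
        task.reprocess(localOmMetadataManager);
      }
    });
  }
}
```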
What changes were proposed in this pull request?
This PR change is to handle failure of Recon bootstrap and its OM tasks.
If any OM task fails during bootstrapping of Recon (full OM DB snapshot), the failed OM tasks need to be handled so that Recon bootstraps and reprocesses those tasks again. For a partial or corrupted download of the OM DB tar ball, Recon should clean up and delete the tar ball and start the fetch of the OM DB tar ball from scratch (sketched below).
The following cases can occur and are handled accordingly by this change.
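Illustrative-only sketch of the tar ball recovery behaviour described above; the paths and the fetch/validate helpers are hypothetical, not the actual Recon implementation:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class TarballRecoverySketch {

  static boolean looksComplete(Path tarball) throws IOException {
    // Placeholder check; the real validation would verify the archive itself.
    return Files.exists(tarball) && Files.size(tarball) > 0;
  }

  static Path fetchOmDbTarball(Path targetDir) {
    // Placeholder for the HTTP call that downloads the OM DB checkpoint.
    return targetDir.resolve("om.db.tar");
  }

  static Path fetchWithCleanup(Path targetDir) throws IOException {
    Path tarball = fetchOmDbTarball(targetDir);
    if (!looksComplete(tarball)) {
      // Partial or corrupted download: delete it and start the fetch again.
      Files.deleteIfExists(tarball);
      tarball = fetchOmDbTarball(targetDir);
    }
    return tarball;
  }
}
```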
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-12615
How was this patch tested?
This patch was tested manually and with a local Docker cluster.