Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HDDS-12057. Implement command ozone debug replicas verify checksums #7748

Merged
merged 13 commits into from
Mar 13, 2025

Conversation

ptlrs
Copy link
Contributor

@ptlrs ptlrs commented Jan 24, 2025

What changes were proposed in this pull request?

This PR:

  • adds a new ozone debug replicas verify command
  • renamed read-replicas to checksums command
  • updated checksums command with the ability to walk the file tree and calculate checksums of all files
  • made a URI a required parameter for all subcommands of replicas verify

Note: This is a WIP change. Further enhancements and refactoring will be done and acceptance tests will be updated.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-12057

How was this patch tested?

Manual testing in docker-compose environment
CI: https://github.com/ptlrs/ozone/actions/runs/12954171235

File dir = createDirectory(volumeName, bucketName, keyName);
OzoneKeyDetails keyInfoDetails = checksumClient.getKeyDetails(volumeName, bucketName, keyName);
Map<OmKeyLocationInfo, Map<DatanodeDetails, OzoneInputStream>> replicas =
checksumClient.getKeysEveryReplicas(volumeName, bucketName, keyName);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not totally related, but RpcClient.getKeysEveryReplicas() doesn't seems to create input stream that refreshes container cache upon failure. Could that lead problems in such a corner case?

https://github.com/apache/ozone/blob/master/hadoop-ozone/client/src/main/java/org/apache/hadoop/ozone/client/rpc/RpcClient.java#L1609

Copy link
Contributor

@sarvekshayr sarvekshayr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @ptlrs
The Replicas.java implementation already exists under the name ReplicasDebug.java as part of the replicas package. Could you please move these files to that package?

@adoroszlai adoroszlai changed the title HDDS-12094.Implement ozone debug replicas verify checksums command HDDS-12057. Implement command ozone debug replicas verify checksums Jan 29, 2025
Copy link
Contributor

@adoroszlai adoroszlai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @ptlrs for working on this.

@errose28 errose28 added the tools Tools that helps with debugging label Feb 6, 2025
@ptlrs ptlrs force-pushed the HDDS-12057-checksums branch from 6b649c9 to 53130e9 Compare February 17, 2025 08:21
@ptlrs
Copy link
Contributor Author

ptlrs commented Feb 17, 2025

Thanks for the reviews @jojochuang @adoroszlai @sarvekshayr.
I have made some changes based on the comments. Could you please take another look.

Copy link
Contributor

@sarvekshayr sarvekshayr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @ptlrs for updating the patch. Suggested two minor improvements below:

@ptlrs ptlrs force-pushed the HDDS-12057-checksums branch from da5bfee to 4c05627 Compare February 19, 2025 02:32
@ptlrs ptlrs requested a review from sarvekshayr February 19, 2025 02:38
Copy link
Contributor

@sarvekshayr sarvekshayr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@ptlrs ptlrs marked this pull request as ready for review February 20, 2025 14:07
Copy link
Contributor

@adoroszlai adoroszlai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @ptlrs for improving the patch.

Copy link
Contributor

@errose28 errose28 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just left some comments on the CLI construction. See HDDS-12206 for a description of how this should fit together.

@adoroszlai adoroszlai marked this pull request as draft February 25, 2025 20:24
@ptlrs ptlrs force-pushed the HDDS-12057-checksums branch from 4c05627 to d271183 Compare February 26, 2025 18:23
@adoroszlai adoroszlai requested a review from errose28 February 27, 2025 16:57
Copy link
Contributor

@errose28 errose28 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this @ptlrs. Overall looks good as a port of the existing checks. There is a lot of room for improvement within the existing code of those checks though so I'll file a follow-up Jira to redo the read replicas/checksum verification in a more efficient manner.

Usually we don't put log messages in CLI commands:

  • Users may not see them depending how the log4j config is set up.
  • It's clunky to control log levels at the command line, and often a full set of levels is not needed.
    • Usually just a two level system with regular messages, and additional messages controlled by a --verbose flag will suffice.

There's some older commands using the LOG object fromHandler that we should probably fix up later. I think we can replace IOUtils#close with closeQuietly in this context, since we've already completed verification at this point and failure to close the streams is not actionable by the user.

…/replicas/ReplicasVerify.java

Co-authored-by: Ethan Rose <[email protected]>
@ptlrs ptlrs requested a review from errose28 March 12, 2025 17:38
Copy link
Contributor

@errose28 errose28 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. We can improve the implementations of existing checks that were ported over in follow up tasks.

@ptlrs ptlrs requested a review from adoroszlai March 13, 2025 00:34
Comment on lines 57 to 66
@CommandLine.Option(names = "--checksums",
description = "Do client side data checksum validation of all replicas.",
// value will be true only if the "--checksums" option was specified on the CLI
defaultValue = "false")
private boolean doExecuteChecksums;

@CommandLine.Option(names = "--padding",
description = "Check for missing padding in erasure coded replicas.",
defaultValue = "false")
private boolean doExecutePadding;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The command does not do anything useful without one of these two options. They should be in an @ArgGroup(exclusive = false, multiplicity = "1").

https://picocli.info/#_argument_groups

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

@ptlrs ptlrs requested a review from adoroszlai March 13, 2025 19:08
Copy link
Contributor

@adoroszlai adoroszlai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @ptlrs for updating the patch, LGTM.

$ ozone debug replicas verify --output-dir /tmp /
Error: Missing required argument(s): ([--checksums] [--padding])

@adoroszlai adoroszlai marked this pull request as ready for review March 13, 2025 19:18
@errose28 errose28 merged commit 786da39 into apache:master Mar 13, 2025
55 checks passed
@ptlrs ptlrs deleted the HDDS-12057-checksums branch March 13, 2025 21:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
tools Tools that helps with debugging
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants