Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dsync: disable --delete on src walk error #599

Merged
merged 3 commits into from
Nov 13, 2024

Conversation

ofaaland
Copy link
Collaborator

In dsync, if there was an error during the walk of the source tree, the source flist generated is likely incomplete.

In the case where the source and destination have the same files and directories before dsync --delete is run, an incomplete source flist results in files being incorrectly deleted from the destination.

When an error was detected for the source walk, warn the user and disable the --delete option. Continue with the dsync, so that any new files identified are still copied and metadata is synced.

See rsync(1) for the --delete option, it takes the same approach.

Implement a test where dsync --delete is run, and the walk of the source tree is caused to fail by making a directory unreadable.

Before the patch, the test detects that files on the destination are deleted when they should not be, due to the error during the source walk. After the patch, the test shows the improper deletes do not occur.

Fixes #587

Copy link
Collaborator

@carbonneau1 carbonneau1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good.

@ofaaland
Copy link
Collaborator Author

Note that I used global variables to detect errors during the walk. I'm open to some other mechanism if that seems important. Other options that I can think of are to modify libcircle so that it has some return value like mechanism, or to use libcircle reductions.

Implement a test where dsync --delete is run, and the walk of the source
tree is caused to fail by making a directory unreadable.

dsync currently has a bug which causes it to fail that test,
hpc#587

After the source walk fails, dsync continues with the dest walk followed
by the copies and the deletes.

Because the source walk failed, files or directories that are already on
both source and destination, but which were not walked on the source,
are missing from the source list.  Those files or directories are then
deleted from the destination tree.

Signed-off-by: Olaf Faaland <[email protected]>
Record the ocurrence of errors in mfu_flist_walk_param_paths() and
friends.  Alter mfu_flist_walk_path() and friends to return an int
indicating success (0) or failure.

Add (void) to mfu_flist_walk_param_paths() calls where we do not use the
return value.

In dsync, if there was an error during the walk of the source tree, the
source flist generated is likely incomplete.

In the case where the source and destination have the same files and
directories before dsync --delete is run, an incomplete source flist
results in files being incorrectly deleted from the destination.

When an error was reported for the source walk, warn the user and
disable the --delete option.  Continue with the dsync, so that any
new files identified are still copied and metadata is synced.

See rsync(1) for the --delete option, it takes the same approach.

Fixes hpc#587

Signed-off-by: Olaf Faaland <[email protected]>
In places where we disregard the return value of mfu_flist_walk_param_paths(),
add (void) as a visual indicator.

In these cases we still need to analyze what we would do differently,
if anything, with the knowledge that the walk failed.

Signed-off-by: Olaf Faaland <[email protected]>
@ofaaland ofaaland force-pushed the b-src-walk-noaccess-3 branch from d1b0a66 to dcd6e51 Compare November 13, 2024 19:48
@ofaaland
Copy link
Collaborator Author

ofaaland commented Nov 13, 2024

Cleanup at end of the test had been commented out accidentally. I fixed that in update to dcd6e51

@ofaaland ofaaland merged commit 22a67ee into hpc:main Nov 13, 2024
@ofaaland ofaaland deleted the b-src-walk-noaccess-3 branch November 13, 2024 19:59
Comment on lines +3320 to +3322
int all_rc;
MPI_Allreduce(&walk_rc, &all_rc, 1, MPI_INT, MPI_MAX, MPI_COMM_WORLD);
if (all_rc > 0) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI there is a helper function for this

Suggested change
int all_rc;
MPI_Allreduce(&walk_rc, &all_rc, 1, MPI_INT, MPI_MAX, MPI_COMM_WORLD);
if (all_rc > 0) {
if (!mfu_alltrue(walk_rc == 0, MPI_COMM_WORLD) {

@daltonbohning
Copy link
Collaborator

Woops, reviewed too late :)

ofaaland added a commit to ofaaland/mpifileutils that referenced this pull request Nov 13, 2024
Fixes issues identified in the code review of hpc#599

Since WALK_RESULT is an int and can only hold one error code, but any
number of errors might be encountered during a walk, there is little
value in recording errno.

In addition, the value of errno might have changed unless it is captured
immediately after the library or system call that failed, which would
require deeper code changes.

Instead, set to -1 as the indicator of a failure.

Signed-off-by: Olaf Faaland <[email protected]>
ofaaland added a commit to ofaaland/mpifileutils that referenced this pull request Nov 14, 2024
Fixes issues identified in the code review of hpc#599

Since WALK_RESULT is an int and can only hold one error code, but any
number of errors might be encountered during a walk, there is little
value in recording errno.

In addition, the value of errno might have changed unless it is captured
immediately after the library or system call that failed, which would
require deeper code changes.

Instead, set to -1 as the indicator of a failure.

Signed-off-by: Olaf Faaland <[email protected]>
@ofaaland
Copy link
Collaborator Author

@daltonbohning said:

Woops, reviewed too late :)

Thanks for the feedback! I created a new PR, #600 for those fixes.

ofaaland added a commit that referenced this pull request Nov 14, 2024
Fixes issues identified in the code review of #599

Since WALK_RESULT is an int and can only hold one error code, but any
number of errors might be encountered during a walk, there is little
value in recording errno.

In addition, the value of errno might have changed unless it is captured
immediately after the library or system call that failed, which would
require deeper code changes.

Instead, set to -1 as the indicator of a failure.

Signed-off-by: Olaf Faaland <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

dsync deletes destination files after getting errors while walking the source
3 participants