Hardlinks support #633
Open
rezib wants to merge 9 commits into hpc:main from rezib:for-upstream/hardlinks-support
During tree walk with details, regular files with more than one link (nlink > 1) are temporarily placed in a hardlinks flist. This flist is then globally ordered by name and ranked to select one reference path per inode and to flag all other paths to this inode as hardlinks. The sorted hardlinks flist is finally merged into the global flist with all other items. The path name ordering ensures reproducibility between two similar trees, thus minimizing the differences eventually reported by dcmp and dsync.

This commit introduces a new structure inodes_hardlink_map_t used to temporarily associate paths with inodes in the reference/hardlinks resolution logic. The type elem_t receives 2 new members: nlink, the number of links on an inode, and ref, the reference path to this inode. The ref is NULL except on hardlinks.

This commit also introduces a new filetype MFU_TYPE_HARDLINK, which is used to distinguish hardlinks to inodes from reference paths, which have the MFU_TYPE_FILE type. The packed flist element now contains the filetype, even when details are enabled, as there is no way to determine whether an element is a regular file or a hardlink based on the stat result.

New functions mfu_[un]pack_sized_str() are introduced to manage packing and unpacking of optional strings with a maximum length.

Signed-off-by: Rémi Palancher <[email protected]>
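The reference selection described in this commit message can be sketched in a few lines of Python. This is only an illustration, not the C implementation: the `Elem` class, the `resolve_hardlinks()` helper and the type constants are hypothetical names, the sketch is serial rather than distributed, and it assumes that the first path in name order is the one kept as reference. A real implementation would also need to take the device into account, since inode numbers are only unique per filesystem.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional

TYPE_FILE = "FILE"          # stands in for MFU_TYPE_FILE
TYPE_HARDLINK = "HARDLINK"  # stands in for MFU_TYPE_HARDLINK


@dataclass
class Elem:
    """Hypothetical, simplified counterpart of elem_t."""
    path: str
    inode: int
    nlink: int
    type: str = TYPE_FILE
    ref: Optional[str] = None  # None except on hardlinks


def resolve_hardlinks(candidates):
    """Pick one reference path per inode and flag the other paths as hardlinks."""
    by_inode = defaultdict(list)
    # Sorting by path makes the choice of reference reproducible across runs.
    for elem in sorted(candidates, key=lambda e: e.path):
        by_inode[elem.inode].append(elem)
    resolved = []
    for group in by_inode.values():
        reference = group[0]  # assumption: lowest path name becomes the reference
        reference.type, reference.ref = TYPE_FILE, None
        for other in group[1:]:
            other.type, other.ref = TYPE_HARDLINK, reference.path
        resolved.extend(group)
    return sorted(resolved, key=lambda e: e.path)
```

Because the reference depends only on path names, two trees with the same layout end up with the same reference paths, which is what keeps later comparisons stable.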
Add support for hardlinks in dcp. This commit renames the existing functions mfu_create_hardlink[s]() to mfu_create_hardlink[s]_dest() to reflect their purpose related to the --link-dest option. Two new functions mfu_create_hardlink[s]() are introduced to create all hardlinks in the destination directory with the appropriate link destination. The summary at the end of the copy is modified to mention hardlink operations.

Signed-off-by: Rémi Palancher <[email protected]>
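As a rough illustration of the copy step, the sketch below creates hardlinks in a destination directory once the reference files exist there. The `create_hardlinks()` helper and its `(path, ref)` pair format are hypothetical and do not mirror the signature of mfu_create_hardlinks(); paths are assumed to be relative to the destination root.

```python
import os
import tempfile


def create_hardlinks(dest_root, hardlinks):
    """Create each hardlink under dest_root, pointing at its reference path.

    hardlinks is an iterable of (relative_path, relative_ref) pairs. The
    reference files are assumed to have been copied to dest_root already.
    """
    created = 0
    for rel_path, rel_ref in hardlinks:
        reference = os.path.join(dest_root, rel_ref)
        link = os.path.join(dest_root, rel_path)
        os.link(reference, link)  # new name for the reference inode
        created += 1
    return created
```

Returning the number of created links matches the idea of reporting hardlink operations in the end-of-copy summary.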
Add support for hardlinks in dcmp. The reference paths of hardlinks in source and destination are compared. If not equal, strmap is updated to flag them as different. The branches in the item comparison logic are now based on the filetype recorded in the flist rather than the file mode, as there is no way to distinguish reference paths and hardlinks with just the mode: both are regular files.

Signed-off-by: Rémi Palancher <[email protected]>
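The branching on filetype rather than mode can be pictured with the following sketch. It is a simplification under stated assumptions: items are plain dictionaries, `compare_items()` is a hypothetical name, reference paths are assumed to be relative to each tree root, and the content and metadata comparison of regular files is elided.

```python
TYPE_FILE = "FILE"          # stands in for MFU_TYPE_FILE
TYPE_HARDLINK = "HARDLINK"  # stands in for MFU_TYPE_HARDLINK


def compare_items(src, dst):
    """Return "common" or "differ" for two items sharing the same relative path."""
    # The file mode cannot tell a reference path from a hardlink,
    # so the decision is taken on the recorded filetype.
    if src["type"] != dst["type"]:
        return "differ"
    if src["type"] == TYPE_HARDLINK:
        # Hardlinks match only if they point to the same reference path.
        return "common" if src["ref"] == dst["ref"] else "differ"
    return "common"  # content and metadata checks elided in this sketch
```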
Add support for hardlinks in dsync. The reference paths of hardlinks in source and destination are compared. If not equal, strmap is updated to flag them as different. The branches in the item comparison logic are now based on the filetype recorded in the flist rather than the file mode, as there is no way to distinguish reference paths and hardlinks with just the mode: both are regular files.

Additional logic is added with the dsync_remove_hardlinks_with_removed_ref() function to detect hardlinks whose reference paths are marked for deletion in the destination. In this case, all the hardlinks pointing to this reference are also marked for replacement, to avoid residual links pointing to the wrong inodes.

Signed-off-by: Rémi Palancher <[email protected]>
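The idea behind dsync_remove_hardlinks_with_removed_ref() can be sketched as a simple filter. This is not the actual function: `hardlinks_to_replace()` is a hypothetical helper working on dictionaries, and it only shows the selection rule stated above.

```python
def hardlinks_to_replace(dst_items, removed_paths):
    """Return destination hardlinks whose reference path is marked for deletion."""
    removed = set(removed_paths)
    return sorted(
        item["path"]
        for item in dst_items
        if item["type"] == "HARDLINK" and item["ref"] in removed
    )
```

Without this step, a hardlink in the destination would keep the old inode alive after its reference path has been deleted and recreated.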
Add support for hardlinks in dtar, in all supported create and extract algorithms. A new structure entry_list_t is introduced; it is used in some extract algorithms to fill a temporary list of hardlink entries to create in a second pass, after all other files are created.

Signed-off-by: Rémi Palancher <[email protected]>
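The two-pass extraction can be illustrated with Python's standard tarfile module. This is a sketch of the general technique, not of dtar itself: `extract_two_pass()` is a hypothetical helper, the `deferred` list merely plays the role of entry_list_t, only regular files and hardlinks are handled, and member names are trusted without any sanitization.

```python
import io
import os
import tarfile
import tempfile


def extract_two_pass(tar, dest):
    """Extract regular files first, then create hardlinks from a deferred list."""
    deferred = []  # (link path, reference path) pairs to create in the second pass
    for member in tar:
        target = os.path.join(dest, member.name)
        if member.islnk():
            # Hardlink entry: its reference may not exist yet, so postpone it.
            deferred.append((target, os.path.join(dest, member.linkname)))
        elif member.isreg():
            os.makedirs(os.path.dirname(target), exist_ok=True)
            with tar.extractfile(member) as src, open(target, "wb") as out:
                out.write(src.read())
    for target, reference in deferred:
        os.link(reference, target)
    return len(deferred)
```

Deferring the links means the order of entries in the archive does not matter: a hardlink entry stored before its reference is still created correctly.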
Introduce cache format v5, which supports hardlink encoding with nlink and reference paths. The new read_cache_v5() is essentially the same as read_cache_v4(), except for the calls to list_elem_pack_size[_le4]() and list_insert_ptr[_le4]().

Signed-off-by: Rémi Palancher <[email protected]>
When dcp reads its input list from cache, place files with more than one link in a temporary list and resolve hardlinks, similarly to the logic implemented in walk with details.

Signed-off-by: Rémi Palancher <[email protected]>
This commit adds many functional tests of dcmp, dcp, dsync, dtar and dwalk, executed and automatically validated with the Python standard unittest library. The suite is designed to be easy to execute and to integrate in continuous integration systems.

Set two environment variables to define, respectively, the path to the mpifileutils binaries and the arguments provided to mpirun, e.g.:

  $ export MFU_BIN=~/dev/bin
  $ export MFU_MPIRUN_ARGS="--bind-to none --oversubscribe -N 4"

And run all the tests:

  $ python3 -m unittest discover -v test

Or:

  $ pytest # requires pytest

The suite has utilities to check similarity between two trees, with the possibility to specify paths and attributes (e.g. mtime). It is also possible to assert specific command outputs.

Most tests are run against a specific testing file tree to cover many cases. Other tests are run with a file tree generated by dfilemaker.

Signed-off-by: Rémi Palancher <[email protected]>
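To give an idea of what a tree similarity check for hardlinks might look like, here is a minimal unittest sketch. It is not taken from the suite in this commit: `hardlink_groups()` and the test case are hypothetical, and the two trees are built by hand where a real test would run dcp or dsync between them.

```python
import os
import tempfile
import unittest


def hardlink_groups(root):
    """Return the groups of relative paths under root that share an inode."""
    groups = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            full = os.path.join(dirpath, name)
            st = os.lstat(full)
            key = (st.st_dev, st.st_ino)
            groups.setdefault(key, set()).add(os.path.relpath(full, root))
    # Only inodes reachable through several paths are hardlink groups.
    return {frozenset(paths) for paths in groups.values() if len(paths) > 1}


class TestHardlinkTrees(unittest.TestCase):
    def test_same_hardlink_structure(self):
        with tempfile.TemporaryDirectory() as src, tempfile.TemporaryDirectory() as dst:
            for root in (src, dst):
                with open(os.path.join(root, "a"), "w") as handle:
                    handle.write("data")
                os.link(os.path.join(root, "a"), os.path.join(root, "b"))
            self.assertEqual(hardlink_groups(src), hardlink_groups(dst))
```

Comparing groups of paths rather than inode numbers keeps the check valid across filesystems, where inode numbers necessarily differ between source and destination.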
Add a continuous integration workflow to build and install lwgrp, libcircle, dtcmp and mpifileutils, and to execute the Python test suite in GitHub Actions for all pull requests and merges into the main branch.

Signed-off-by: Rémi Palancher <[email protected]>
Dear mpifileutils developers,
This is my proposal to add support for hardlinks in many mpifileutils commands: dwalk, dcp, dcmp, dsync and dtar.
During tree walk with details, regular files with more than one link (nlink > 1) are temporarily placed in a hardlinks flist. This flist is then globally ordered by name and ranked to select one reference path per inode and to flag all other paths to this inode as hardlinks. The sorted hardlinks flist is finally merged into the global flist with all other items. The path name ordering ensures reproducibility between two similar trees, thus minimizing the differences eventually reported by dcmp and dsync.
Note
You may find more implementation details in the respective commit messages.
The pull request introduces a cache format v5 to support encoding of the nlink of files and the reference paths of hardlinks.
This pull request also includes a functional test suite that relies on the Python standard unittest library. This suite is designed to be easy to execute:
$ python3 -m unittest discover -v test
$ pytest # requires pytest
It is also designed to be easy to integrate in continuous integration systems. The pull request even provides a GitHub Actions workflow to execute this test suite on every pull request and merge into the main branch (example run).
For the record, this test suite has already helped detect and fix the following bugs:
--preserve-{xattrs,acl,flags} #628 → dtar: fix creation with symlink and --preserve-* #629 (pending)
Please let me know what you think! I can also remove the tests and the GitHub Actions workflow if you don't like the technical approach.
Important
Note this feature does not work properly without this fix for a bug in DTCMP: LLNL/dtcmp#20
Important
There is one limitation with dcp/dsync --dereference when symlinks point to a path with more than one link. In this specific case, mpifileutils will consider the symlink as one additional path to the same inode and create one more hardlink to this inode in the destination directory. For reference, this case is covered by the test test_dsync_symlink_dereference_target_nlinks.
Note
I would like to emphasize that this work is sponsored by @cea-hpc.
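The --dereference limitation mentioned above comes from what stat reports when a symlink is followed. The short sketch below, which uses only the Python standard library and a hypothetical `stat_views()` helper, shows that a dereferenced symlink is indistinguishable from another path to the same inode.

```python
import os
import stat
import tempfile


def stat_views(root):
    """Create a file, a hardlink and a symlink under root, then stat them."""
    target = os.path.join(root, "file")
    with open(target, "w") as handle:
        handle.write("data")
    os.link(target, os.path.join(root, "hardlink"))
    symlink = os.path.join(root, "symlink")
    os.symlink("file", symlink)
    real = os.stat(target)
    followed = os.stat(symlink)  # follows the symlink, as --dereference would
    raw = os.lstat(symlink)      # the symlink itself
    return {
        "target": (real.st_ino, real.st_nlink),
        "followed": (followed.st_ino, followed.st_nlink),
        "raw_is_symlink": stat.S_ISLNK(raw.st_mode),
    }
```

Calling `stat_views()` on a temporary directory returns identical `target` and `followed` tuples, with a link count of 2, so a walk that follows symlinks sees three paths to an inode that only has two links.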
fix #417 #336