Replace reflink dependency with an implementation using Python stdlib primitives #483
Labels
my first issue
Good for newcomers
Comments
eskultety
added a commit
to eskultety/hermeto
that referenced
this issue
Jul 19, 2024
This defines the _fast_copy primitive to replace the external, unmaintained reflink library with Python's stdlib handling.

The reason for this effort is that the reflink library [1] was created as an attempt to make use of the reflink optimization before Python exposed the copy_file_range syscall as os.copy_file_range. The library was never really anything more than a band-aid, and now that the syscall can be used directly, the library even mentions on its GitHub page that Python implements the functionality natively.

The implementation was taken (with some 3.9+ tweaks applied) from an existing code proposal [2] (marked as "awaiting merge") that adds the same functionality to the 'shutil' library copying primitives and makes it completely transparent to end users. We'd have to wait a long time to be able to use it though.

The reflink library used a dirty trick of copying a small file first (in a somewhat error-prone way) to see whether the operation raised an exception. By comparison, the os.copy_file_range based solution succeeds in the vast majority of cases: if reflinks (which are nothing more than sharing of the underlying data blocks between files) are not supported by the underlying file system, the copy can still continue normally inside the kernel, without the userspace <-> kernel overhead. The 'shutil.copy2' fallback is therefore reserved for really obscure cases (cross-device copying - EXDEV, or old systems without the syscall - ENOSYS) or for cases where the copying failed for some other reason, which we may never even encounter.

[1] https://gitlab.com/rubdos/pyreflink
[2] https://github.com/python/cpython/pull/93152/files

Resolves: hermetoproject#483
Signed-off-by: Erik Skultety <[email protected]>
github-merge-queue bot
pushed a commit
that referenced
this issue
Aug 22, 2024
Resolved by: #578
With Yarn package manager support we introduced another dependency to the project to deal with faster copies of large artifacts: reflink (commit 2937416). However, the library we used was created merely as a hobby attempt to solve this in Python for the time being, since there was no native Python support for COW (copy-on-write) at the time. That project seems to have been abandoned since then, with zero activity but with a note that Python already implements the functionality natively.
That said, while it is true that Python has since added a means to achieve the same thing via a new `os` syscall mapping, `os.copy_file_range`, proper high-level primitives haven't been introduced to `shutil` yet. Rather than the `copy_file_range` syscall, the reflink library used an alternative low-level C implementation relying on `ioctl` combined with the `FICLONE` request, because back then the `copy_file_range` syscall wasn't considered stable or production ready. That has changed in the meantime, and we should be able to come up with a pretty straightforward implementation based on `copy_file_range` for what we need until high-level support lands in the `shutil` module, and ditch the dependency on an abandoned project.

Implementation-wise the above could be simplified to the following pseudocode snippet:
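A minimal sketch of what a `copy_file_range`-based copy with a `shutil.copy2` fallback might look like. This is an illustration only, not the snippet from the original issue and not the `_fast_copy` code that was eventually merged. The names `fast_copy_sketch` and `_copy_in_kernel` are hypothetical, and falling back on any `OSError` (rather than only `EXDEV` and `ENOSYS`) is an assumption made to keep the example short.

```python
import os
import shutil
from pathlib import Path


def _copy_in_kernel(src: Path, dst: Path) -> None:
    """Hypothetical helper: copy file data with os.copy_file_range, then metadata."""
    with open(src, "rb") as fsrc, open(dst, "wb") as fdst:
        remaining = os.fstat(fsrc.fileno()).st_size
        while remaining > 0:
            # The kernel may copy fewer bytes than requested, so keep looping.
            copied = os.copy_file_range(fsrc.fileno(), fdst.fileno(), remaining)
            if copied == 0:
                # Source ended early (e.g. it was truncated meanwhile); stop.
                break
            remaining -= copied
    # Mirror what shutil.copy2 does for permissions and timestamps.
    shutil.copystat(src, dst)


def fast_copy_sketch(src: Path, dst: Path) -> None:
    """Hypothetical sketch: prefer an in-kernel copy, else use shutil.copy2."""
    # os.copy_file_range is only available on Linux with Python 3.8+.
    if hasattr(os, "copy_file_range"):
        try:
            _copy_in_kernel(src, dst)
            return
        except OSError:
            # Typical causes are EXDEV (cross-device) and ENOSYS (no syscall).
            # Assumption: any other OS-level failure is handled the same way.
            pass
    shutil.copy2(src, dst)
```

Calling `fast_copy_sketch(Path("a.tgz"), Path("b.tgz"))` (hypothetical file names) leaves the decision about reflinking to the kernel and the file system, so the caller does not need to probe for reflink support up front. A real implementation would likely want to be stricter about which errors trigger the fallback.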
References:
`copy_file_range` in `shutil.copyfile` copy functions python/cpython#93152