Replace reflink dependency with an implementation using Python stdlib primitives #483

Closed
eskultety opened this issue Mar 7, 2024 · 1 comment
Labels
my first issue Good for newcomers

Comments

@eskultety
Member

With Yarn package manager support we introduced another dependency to the project to make copies of large artifacts faster: reflink (commit 2937416). However, that library was created merely as a hobby attempt to solve this in Python, since Python had no native support for COW (copy-on-write) at the time. The project seems to have been abandoned since, with zero activity but with a note that Python now implements the functionality natively.

That said, while it is true that Python has since added a way to achieve the same thing via a new syscall mapping, os.copy_file_range, proper high-level primitives haven't been introduced to shutil yet. Instead of the copy_file_range syscall, the reflink library used an alternative low-level C implementation relying on ioctl with the FICLONE request, because back then the copy_file_range syscall wasn't considered stable or production ready. That has changed in the meantime, so we should be able to come up with a pretty straightforward implementation based on copy_file_range for what we need until high-level support lands in the shutil module, and drop the dependency on an abandoned project.

Implementation-wise the above could be simplified to the following pseudocode snippet:

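# core/utils.py
import errno
import os

def reflink_copy(src, dst):
    # Assumes src and dst are paths; os.copy_file_range() works on file descriptors
    with open(src, "rb") as fsrc, open(dst, "wb") as fdst:
        count_bytes = os.fstat(fsrc.fileno()).st_size
        try:
            # NOTE: may copy fewer bytes than requested; real code should loop
            os.copy_file_range(fsrc.fileno(), fdst.fileno(), count_bytes)
        except OSError as e:
            if e.errno in (errno.EXDEV, errno.ENOSYS):
                # Cachi2Error: the project's own exception type
                raise Cachi2Error("reflinks not supported") from e
            raise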

References:

@eskultety eskultety added my first issue Good for newcomers python labels Mar 7, 2024
eskultety added a commit to eskultety/hermeto that referenced this issue Jul 19, 2024
This defines the _fast_copy primitive to replace the external
unmaintained reflink library and use Python's stdlib handling.

The reason for this effort is that the reflink library [1] was created
as an attempt to make use of the reflink optimization before Python
gained support for the os.copy_file_range syscall. The library was
never really anything more than a band-aid, and now that it's possible
to use the syscall, the library even mentions on its GitHub page that
Python now implements the functionality natively.

The implementation was taken (with some 3.9+ tweaks applied) from
an existing code proposal [2] (marked as "awaiting merge") to add the
same functionality to the 'shutil' library copying primitives and make
it completely transparent to end users. We'd have to wait a long time
to be able to use it though. Compared to the reflink library, which
used a dirty trick of copying a small file first (in a rather
error-prone way) to see if the operation raised an exception, the
os.copy_file_range based solution succeeds in the vast majority of
cases: if reflinks (which are nothing more than data block sharing
between files) are not supported by the underlying file system, an
in-kernel copy without the userspace <-> kernel overhead can still
continue normally. This reserves the 'shutil.copy2' fallback for
really obscure cases (like cross-device copying - EXDEV, or old
systems without the syscall - ENOSYS) or simply cases where the
copying failed for some other reason, which we may never even
encounter.

[1] https://gitlab.com/rubdos/pyreflink
[2] https://github.com/python/cpython/pull/93152/files

Resolves: hermetoproject#483

Signed-off-by: Erik Skultety <[email protected]>
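
The approach described in the commit message above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code that was merged: the name fast_copy_sketch, its signature, and its return value are hypothetical, and the broad OSError handling is one possible reading of "cases where the copying failed for some reason".

```python
import os
import shutil
from pathlib import Path


def fast_copy_sketch(src: Path, dst: Path) -> str:
    """Copy src to dst, preferring os.copy_file_range (hypothetical helper).

    Returns the name of the code path that produced the copy.
    """
    try:
        with open(src, "rb") as fsrc, open(dst, "wb") as fdst:
            remaining = os.fstat(fsrc.fileno()).st_size
            # copy_file_range may copy fewer bytes than requested, so loop
            while remaining > 0:
                copied = os.copy_file_range(
                    fsrc.fileno(), fdst.fileno(), remaining
                )
                if copied == 0:
                    # Source ended early (e.g. it was truncated meanwhile)
                    break
                remaining -= copied
        # Mirror what shutil.copy2 does for metadata
        shutil.copystat(src, dst)
        return "copy_file_range"
    except (OSError, AttributeError):
        # OSError covers EXDEV, ENOSYS and any other copy failure;
        # AttributeError covers platforms without os.copy_file_range
        shutil.copy2(src, dst)
        return "shutil.copy2"
```

Whether the kernel shares data blocks or performs a plain in-kernel copy is decided by the file system; the caller gets a complete copy either way, and the fallback only kicks in when the syscall itself fails or is unavailable.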
eskultety added a commit to eskultety/hermeto that referenced this issue Jul 31, 2024
eskultety added a commit to eskultety/hermeto that referenced this issue Aug 6, 2024
eskultety added a commit to eskultety/hermeto that referenced this issue Aug 7, 2024
eskultety added a commit to eskultety/hermeto that referenced this issue Aug 21, 2024
eskultety added a commit to eskultety/hermeto that referenced this issue Aug 22, 2024
github-merge-queue bot pushed a commit that referenced this issue Aug 22, 2024
@eskultety
Member Author

Resolved by: #578
