Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API: CoW and explicit copy keyword in DataFrame/Series methods #50535

Closed
19 tasks
jorisvandenbossche opened this issue Jan 3, 2023 · 3 comments · Fixed by #51464
Closed
19 tasks

API: CoW and explicit copy keyword in DataFrame/Series methods #50535

jorisvandenbossche opened this issue Jan 3, 2023 · 3 comments · Fixed by #51464

Comments

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Jan 3, 2023

In general almost all DataFrame and Series methods return new data and thus make a copy if needed (if there was no calculation / data didn't change). But some methods allow you to avoid making this copy with an explicit copy keyword, which defaults to copy=True, but which you can change to copy=False manually to avoid the copy.

Example:

>>> df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
# by default a method returns a copy
>>> df2 = df.rename(columns=str.upper)
>>> df2.iloc[0, 0] = 100
>>> df
   a  b
0  1  3
1  2  4

# explicitly ask not to make a copy
>>> df3 = df.rename(columns=str.upper, copy=False)
>>> df3.iloc[0, 0] = 100
>>> df
     a  b
0  100  3
1    2  4

Now, if Copy-on-Write is enabled, the above behaviour shouldn't happen (because we are updating one dataframe (df) through changing another dataframe (df3)).

In this specific case of rename, it actually already doesn't work anymore like that, and df is not updated:

>>> pd.options.mode.copy_on_write = True
>>> df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
>>> df3 = df.rename(columns=str.upper, copy=False)
>>> df3.iloc[0, 0] = 100
>>> df
   a  b
0  1  3
1  2  4

This is because of how it is implemented under the hood in rename, using result = self.copy(deep=copy), and so this always was already taking a shallow copy of the calling dataframe. With CoW enabled, a "shallow" copy doesn't exist anymore in the original meaning, but now essentially is a "lazy copy with CoW".
But for some other methods, this is actually not yet working

There are several issues/questions here:

  1. Are we OK with copy=False now actually meaning a "lazy" copy for all those methods?
    • I don't think there is any alternative with the current CoW semantics, but just to make this explicit and track this, because we 1) should document this (it's a breaking change) and potentially add future warnings for this at some point, and 2) ensure this behaviour is correctly happening for all methods that have a copy keyword.
  2. The case of manually passing copy=True should still give an actual hard / "eager" copy?
    • Probably yes (if we keep the keyword, see 3) below), but we should also ensure to test this when CoW is enabled.
  3. If (in the future with CoW enabled) the default will now be to not return a copy, is it still worth it to keep the copy keyword?
    • Currently the default is copy=True, and so people will typically mostly use it explicitly to set copy=False. But copy=False will become the default in the future, and so will not be needed anymore to specify explicitly.
    • People can still use copy=True in the future to ensure they get a "eager" copy (and not delay the copy / trigger a copy later on). But is that use case worth it to keep the keyword around? (they can always do .copy() instead)

DataFrame/Series methods that have a copy keyword (except for the constructors):

  • align
  • astype
  • infer_objects
  • merge
  • reindex
  • reindex_like
  • rename
  • rename_axis
  • set_axis (only added in 1.5)
  • set_flags (default False)
  • swapaxes
  • swaplevel
  • to_numy (default False)
  • to_timestamp
  • transpose (default False)
  • truncate
  • tz_convert
  • tz_localize
  • pd.concat

xref CoW overview issue: #48998

@jbrockmendel
Copy link
Member

If (in the future with CoW enabled) the default will now be to not return a copy, is it still worth it to keep the copy keyword?

If we keep the no-CoW mode, then we should keep the copy keyword, and add it where relevant xref #48141. Ideally we'll end up with just one mode.

Are we OK with copy=False now actually meaning a "lazy" copy for all those methods?

I'm ambivalent here. The other reasonable alternative would be to for copy=False to clear the refs (something like (result._mgr.refs = None) which I can see being nice for advanced users, but confusing for everyone else.

People can still use copy=True in the future to ensure they get a "eager" copy (and not delay the copy / trigger a copy later on). But is that use case worth it to keep the keyword around? (they can always do .copy() instead)

Better to tell them to do .copy() instead.

@phofl
Copy link
Member

phofl commented Jan 11, 2023

If we keep the no-CoW mode, then we should keep the copy keyword, and add it where relevant xref #48141. Ideally we'll end up with just one mode.

Yeah agreed, I'd prefer one mode as well. Should make things significantly easier to test. Also agree with getting rid of the keyword in this case.

If we decide to do that, I would also prefer copy=True meaning a lazy copy as well, makes the transition easier when we remove it altogether

@jorisvandenbossche
Copy link
Member Author

jorisvandenbossche commented Mar 6, 2023

Yes, I am also assuming that eventually we only have a single copy/view mode.

If we decide to do that, I would also prefer copy=True meaning a lazy copy as well, makes the transition easier when we remove it altogether

And so basically already start to ignore the copy keyword altogether in those methods when CoW is enabled? That seems fine to me, given that the plan now is to remove that keyword later on (and users can do an explicit .copy() if they really want).

Note: #51464 is doing exactly that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants