Fail to serialize dict_keys argument to read_csv #3893
Thank you for the excellent bug report and minimal example @brl0.
I'm glad to hear it. I didn't see this at first, so let me first share the steps that I ran.
```python
# in distributed/tests/test_collections.py
import dask.dataframe as dd

from distributed.utils import tmpfile
from distributed.utils_test import gen_cluster


@gen_cluster(client=True)
async def test_read_csv(c, s, a, b):
    with tmpfile(extension="csv") as fn:
        with open(fn, "w") as f:
            f.write("a,b,c\n1,2,3")
        columns = {"a": "object", "c": "object"}
        df = dd.read_csv(fn, usecols=list(columns.keys()))
        await df.persist()
```

Now my next step is probably to put a breakpoint wherever this is happening, probably somewhere in `distributed/protocol/serialize.py`, and see if I can isolate it further to see exactly what the culprit is. As you mention, it probably has to do with serializing the `dict_keys` object. We might try calling something like the following on that object:

```python
pickle.dumps(columns.keys())
```

and if that doesn't work

```python
cloudpickle.dumps(columns.keys())
```

If those fail then great, let's raise an issue upstream at cloudpickle and then do a short-term fix in `read_csv` to listify inputs to `usecols`. I'd be surprised if cloudpickle doesn't work though.

So the next thing to do would be to figure out exactly what combination of things isn't working with serialization. Ideally you would be able to isolate this further to a call like the following:

```python
>>> from distributed.protocol import serialize
>>> serialize(some_object)
Exception(...)
```

Then we can strip away all of the dask dataframe and distributed computing setup and focus just on what's causing issues with serialization.

I hope that that plan makes sense. It's also probably wrong. You seem to know what you're doing so please deviate as soon as it seems sensible.
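To make that plan concrete, a quick check along these lines could isolate it (a rough sketch, assuming cloudpickle is installed; the loop is just for illustration):

```python
import pickle

import cloudpickle
from distributed.protocol import serialize

columns = {"a": "object", "c": "object"}

# Try each serializer on the bare dict_keys view and report which ones fail.
for name, dumps in [
    ("pickle", pickle.dumps),
    ("cloudpickle", cloudpickle.dumps),
    ("distributed.protocol.serialize", serialize),
]:
    try:
        dumps(columns.keys())
        print(f"{name}: ok")
    except Exception as exc:
        print(f"{name}: failed with {exc!r}")
```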
Thanks @mrocklin for the fantastic guidance, you rock! I followed your advice and tested the pickling of the `dict_keys` object directly. After a bit of digging, I believe the issue is that `dict_keys` objects cannot be serialized by either pickle or cloudpickle. I think that the best solution here is probably to listify inputs to `usecols` as you suggested. Let me know if this doesn't make sense. I will keep digging into this.
OK, that's an interesting result.
It might be worth raising an issue on the cloudpickle tracker anyway. The maintainers there have thought more deeply about these kinds of problems, I think. Listifying makes sense short term. What is challenging, I think, is how to apply a fix like this broadly across kwargs on many different functions. You're running into this in `usecols`, so presumably one fix would be to pull that argument out and listify it:

```python
if "usecols" in kwargs:
    kwargs["usecols"] = list(kwargs["usecols"])
```

But this has a few problems: we wouldn't want to do this for every possible case, because that would be messy and hard to maintain. In my ideal world this is handled upstream in cloudpickle (where a change would affect everything globally). If that's not possible then maybe we make some general-purpose utility to normalize arguments like this before they reach the serializer.

The first step though, I think, is to check in with cloudpickle.
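For illustration, such a general-purpose utility might look roughly like the sketch below (the name `listify_views` and where it would hook into argument handling are assumptions, not existing dask or distributed code):

```python
def listify_views(obj):
    """Return obj with dict_keys/dict_values/dict_items views replaced by lists."""
    view_types = (type({}.keys()), type({}.values()), type({}.items()))
    if isinstance(obj, view_types):
        return list(obj)
    if isinstance(obj, dict):
        return {k: listify_views(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return type(obj)(listify_views(v) for v in obj)
    return obj


# Applied to keyword arguments before they are handed to the serializer:
columns = {"a": "object", "c": "object"}
kwargs = {"usecols": columns.keys()}
kwargs = {k: listify_views(v) for k, v in kwargs.items()}
assert kwargs["usecols"] == ["a", "c"]
```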
The PR to cloudpickle was accepted, and the example code above now works, closing.
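With a sufficiently recent cloudpickle, a quick round-trip check along these lines (a sketch, not the exact verification used here) passes:

```python
import cloudpickle

columns = {"a": "object", "c": "object"}

# dict_keys views can now be serialized and restored by cloudpickle.
restored = cloudpickle.loads(cloudpickle.dumps(columns.keys()))
assert list(restored) == ["a", "c"]
```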
Thanks @brl0 for the upstream changes in cloudpickle!
**What happened**:

Attempting to read a csv with `usecols=columns.keys()` while using the distributed scheduler throws an error and hangs.

**What you expected to happen**:

Expected `read_csv` to function as it does without distributed.
**Minimal Complete Verifiable Example**:
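A small reproducer along these lines shows the failure (a sketch rather than the exact original example; the file path and local client setup are assumed):

```python
import dask.dataframe as dd
from dask.distributed import Client

client = Client()  # distributed scheduler; without it read_csv works fine

columns = {"a": "object", "c": "object"}

# Passing the dict_keys view directly fails to serialize and hangs;
# usecols=list(columns.keys()) works.
df = dd.read_csv("data.csv", usecols=columns.keys())
df.compute()
```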
**Anything else we need to know?**:
The issue seems to be related to using a `dict_keys` object with `usecols`. Wrapping in a list works.

This has previously been noted in this comment: #2597 (comment)
I am looking for opportunities to contribute, so I am happy to assist with some guidance.
**Environment**: