Fix WriteCombinedReference FSSpec Credentials Yet Again #703
Conversation
target_options=storage_options,
remote_options=storage_options,
remote_protocol=remote_protocol,
target_options=self.remote_options,
This is incredibly confusing but I'm actually not sure, still, why remote_options and target_options can be the same here. They're passed as distinctly named arguments here: https://github.com/pangeo-forge/pangeo-forge-recipes/blob/main/pangeo_forge_recipes/transforms.py#L447-L448 and in https://fsspec.github.io/kerchunk/reference.html#kerchunk.combine.MultiZarrToZarr it says:
target_options – dict Storage options for opening path
which makes me think target_options (where the reference files are, say s3://gcorradini-...) should not be the same as remote_options, which doesn't have a docstring in MultiZarrToZarr but I have to assume are credentials for the bucket which contains the data files (e.g. s3://gesdisc-cumulus-protected)
But we saw that this worked so maybe I'm missing something you explained previously.
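The distinction the comment is drawing can be sketched in plain Python. This is a hypothetical stand-in, not the real kerchunk API: `open_with_options` and the credential dicts are illustrative, and the routing rule (reference JSONs vs. underlying data files) reflects the docstring quoted above plus the reviewer's assumption about `remote_options`.

```python
# Hypothetical sketch (NOT the real kerchunk API) of the option routing
# being discussed: one dict opens the reference files themselves, the
# other opens the data files those references point at.

def open_with_options(path: str, target_options: dict, remote_options: dict) -> dict:
    """Pick the options dict based on which layer a path belongs to.

    target_options -> storage options for opening the reference files
                      (e.g. s3://gcorradini-.../refs/*.json)
    remote_options -> credentials for the bucket holding the actual data
                      files (e.g. s3://gesdisc-cumulus-protected)
    """
    is_reference = path.endswith(".json")
    return target_options if is_reference else remote_options


ref_creds = {"profile": "refs-account"}       # assumed placeholder values
data_creds = {"profile": "gesdisc-account"}   # assumed placeholder values

assert open_with_options("s3://refs/combined.json", ref_creds, data_creds) is ref_creds
assert open_with_options("s3://gesdisc-cumulus-protected/granule.nc4", ref_creds, data_creds) is data_creds
```

If that mental model is right, the two dicts should only coincide when the references and the data happen to live behind the same credentials.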
I know 😆. At the fsspec level inside `MultiZarrToZarr` it is very confusing where `target_options` vs `remote_options` are used, which is why we're covering our butts here by passing all of them through (this is a theme in the recipes code if you poke around enough)
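The "pass everything through" approach described here amounts to handing the same credentials dict to both keyword arguments when it's unclear which one a downstream call will use. A minimal sketch, with illustrative variable names:

```python
# Sketch of the belt-and-suspenders approach: when it's unclear whether a
# downstream call reads target_options or remote_options, give it the same
# credentials under both names. storage_options here is a stand-in value.

storage_options = {"anon": False, "client_kwargs": {"region_name": "us-west-2"}}

mzz_kwargs = dict(
    target_options=storage_options,   # would open the reference files
    remote_options=storage_options,   # would open the underlying data files
    remote_protocol="s3",
)

# Both names point at the very same dict object.
assert mzz_kwargs["target_options"] is mzz_kwargs["remote_options"]
```

The cost is that it papers over which argument is actually consumed, which is exactly the confusion the thread is circling.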
But it does work indeed as this Flink run shows: https://github.com/NASA-IMPACT/veda-pforge-job-runner/actions/runs/8057334009
Hmm... Looks like this works well enough for now but I hate to see us just stacking arguments on here. Surely our fsspec target instances can carry around options internally (or something similar)?
This isn't about the … We already have a way to carry around options internally for …

@moradology and I talked and maybe we'll create an …
OK, this is helpful. Yeah, so the fsspec target classes already do this, and it simply appears that we lack a convention for using them as input filesystems in addition to output filesystems. I want to say: we should prefer a tiny number of high-level abstractions that can be used in a wide range of contexts wherever that's at all possible.
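The idea floated here, a target-style object that carries its own storage options so transforms never thread credentials through extra arguments, can be sketched as follows. `InputTarget` and `open_kwargs` are hypothetical names, not the pangeo-forge-recipes API:

```python
# Minimal sketch of a target-style wrapper that carries fsspec storage
# options internally. Names are hypothetical; the real project's target
# classes differ, but the shape of the idea is the same.

from dataclasses import dataclass, field


@dataclass
class InputTarget:
    root_path: str
    fsspec_kwargs: dict = field(default_factory=dict)

    def open_kwargs(self) -> dict:
        # A transform asks the target for its own options instead of
        # receiving remote_options / target_options as loose arguments.
        return dict(self.fsspec_kwargs)


refs = InputTarget("s3://refs-bucket/refs", {"anon": False})
assert refs.open_kwargs() == {"anon": False}
```

With one such abstraction for inputs and one for outputs, the credential-plumbing question collapses to "which target are you opening?".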
I think we finally have the correct incantation here 🪄 😓
Problem

`WriteCombinedReference` is composed of two other transforms: `CombineReferences` and `WriteReference`. `CombineReferences` deals with reading the reference inputs (which could be inside credentialed cloud storage) and combining them, while `WriteReference` deals with writing to the dependency-injected target root. Anyhoo, the point is that `WriteCombinedReference` needs to pass along fsspec credentials for the `CombineReferences` workflow, which it currently (and previously) wasn't handling.

Solution

Fix `WriteCombinedReference` to pass along fsspec credentials for the `CombineReferences` workflow.

Caveats

Pretty sure the `WriteReference` for parquets flow (since it's calling `MultiZarrToZarr.translate()`) will still try to do reads against reference inputs but let's ticket that separately 👍
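The fix described above can be sketched as a tiny composition, with the class names mirroring the PR's description but the bodies as illustrative stand-ins rather than the real Beam transforms:

```python
# Hedged sketch of the fix: the composite transform forwards its fsspec
# credentials into the combine stage instead of dropping them. These are
# plain dataclasses standing in for the real Beam PTransforms.

from dataclasses import dataclass, field


@dataclass
class CombineReferences:
    # Reads (possibly credentialed) reference inputs, so it needs options.
    target_options: dict
    remote_options: dict


@dataclass
class WriteReference:
    # Writes to the dependency-injected target root.
    target_root: str


@dataclass
class WriteCombinedReference:
    target_root: str
    target_options: dict = field(default_factory=dict)
    remote_options: dict = field(default_factory=dict)

    def expand(self):
        # The fix: forward the fsspec credentials to the combine stage
        # as well as configuring the write stage.
        combine = CombineReferences(
            target_options=self.target_options,
            remote_options=self.remote_options,
        )
        write = WriteReference(target_root=self.target_root)
        return combine, write


creds = {"key": "AKIA-placeholder", "secret": "placeholder"}  # assumed values
combine, write = WriteCombinedReference("s3://out", creds, creds).expand()
assert combine.remote_options == creds  # the credentials reach the read side
```

The caveat in the description still applies: a write-side call that internally performs reads (`MultiZarrToZarr.translate()` in the parquet flow) would need the same forwarding, which is why it's worth a separate ticket.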