Fix WriteCombinedReference FSSpec Credentials Yet Again #703

Merged on Feb 27, 2024 (2 commits)

ranchodeluxe
Contributor

I think we finally have the correct incantation here 🪄 😓

Problem

WriteCombinedReference is composed of two other transforms: CombineReferences and WriteReference. CombineReferences reads the reference inputs (which could be inside credentialed cloud storage) and combines them, while WriteReference writes to the dependency-injected target root. Anyhoo, the point is that WriteCombinedReference needs to pass fsspec credentials along to the CombineReferences workflow, which it currently (and previously) wasn't doing.

Solution

Fix WriteCombinedReference so that it passes fsspec credentials along to the CombineReferences workflow.
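
As a rough illustration of the fix described above (not the actual pangeo-forge-recipes code), the sketch below shows a composite step forwarding its read credentials to the inner combining step. The class names `CombineStep` and `WriteCombinedStep`, the `build_combine` method, and the option values are hypothetical stand-ins; only the keyword names `remote_options`, `target_options`, and `remote_protocol` come from this PR.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class CombineStep:
    """Hypothetical stand-in for the step that reads and combines references."""

    remote_options: Dict[str, Any] = field(default_factory=dict)
    target_options: Dict[str, Any] = field(default_factory=dict)
    remote_protocol: str = "file"


@dataclass
class WriteCombinedStep:
    """Hypothetical stand-in for the composite transform."""

    target_root: str
    remote_options: Dict[str, Any] = field(default_factory=dict)
    remote_protocol: str = "file"

    def build_combine(self) -> CombineStep:
        # Forward the read credentials instead of dropping them, so the
        # inner step can open inputs that live in credentialed storage.
        return CombineStep(
            remote_options=self.remote_options,
            target_options=self.remote_options,
            remote_protocol=self.remote_protocol,
        )


# Placeholder values, chosen only for illustration.
outer = WriteCombinedStep(
    target_root="/tmp/output",
    remote_options={"anon": False},
    remote_protocol="s3",
)
inner = outer.build_combine()
```

The output location (`target_root`) is untouched here; only the options needed to read the inputs are passed down.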

Caveats

Pretty sure the WriteReference flow for parquet output (since it calls MultiZarrToZarr.translate()) will still try to read the reference inputs, but let's ticket that separately 👍

target_options=storage_options,
remote_options=storage_options,
remote_protocol=remote_protocol,
target_options=self.remote_options,
Contributor

This is incredibly confusing, and I'm still not sure why remote_options and target_options can be the same here. They're passed as distinctly named arguments here: https://github.com/pangeo-forge/pangeo-forge-recipes/blob/main/pangeo_forge_recipes/transforms.py#L447-L448 and the MultiZarrToZarr docs at https://fsspec.github.io/kerchunk/reference.html#kerchunk.combine.MultiZarrToZarr say:

target_options – dict Storage options for opening path

which makes me think target_options (for where the reference files are, say s3://gcorradini-...) should not be the same as remote_options, which has no docstring in MultiZarrToZarr but which I have to assume holds credentials for the bucket containing the data files (e.g. s3://gesdisc-cumulus-protected).

But we saw that this worked so maybe I'm missing something you explained previously.

Contributor Author

I know 😆. At the fsspec level inside MultiZarrToZarr, it is very confusing where target_options vs. remote_options are used, which is why we're covering our butts here by passing all of them through (this is a theme in the recipes code if you poke around enough).

But it does indeed work, as this Flink run shows: https://github.com/NASA-IMPACT/veda-pforge-job-runner/actions/runs/8057334009
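
To make the two readings in this thread concrete, here is a small sketch that only builds keyword dictionaries; it does not call kerchunk. The helper `combine_kwargs` and the option values are assumptions for illustration. The first call follows the reviewer's reading (separate options for the reference files and for the data files); the second mirrors what this PR does, passing the same options for both.

```python
from typing import Any, Dict


def combine_kwargs(
    reference_options: Dict[str, Any],
    data_options: Dict[str, Any],
    data_protocol: str,
) -> Dict[str, Any]:
    """Hypothetical helper: map storage options onto the three keyword names."""
    return {
        # Options for opening the reference files themselves.
        "target_options": dict(reference_options),
        # Options for reading the data files the references point to
        # (an interpretation from the thread, not documented behavior).
        "remote_options": dict(data_options),
        "remote_protocol": data_protocol,
    }


# Reviewer's reading: two buckets, two sets of credentials (placeholder values).
separate = combine_kwargs({"profile": "refs"}, {"profile": "data"}, "s3")

# This PR's approach: one set of options passed through for both.
shared_options = {"profile": "data"}
shared = combine_kwargs(shared_options, shared_options, "s3")
```

Passing the same options twice works whenever one set of credentials can read both locations, which may be why the Flink run succeeded; it would presumably need revisiting if the two buckets ever required different credentials.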

Contributor

@moradology moradology left a comment

Hmm... Looks like this works well enough for now but I hate to see us just stacking arguments on here. Surely our fsspec target instances can carry around options internally (or something similar)?

Contributor Author

ranchodeluxe commented Feb 27, 2024

> Hmm... Looks like this works well enough for now but I hate to see us just stacking arguments on here. Surely our fsspec target instances can carry around options internally (or something similar)?

This isn't about the FSSpecTarget (which is the output). That still gets pumped into WriteCombinedReference as target_root. This is about passing credentials to read inputs from your source bucket.

We already have a way to carry around options internally for FSSpecTarget: https://github.com/pangeo-forge/pangeo-forge-recipes/blob/main/pangeo_forge_recipes/writers.py#L132-L135

Contributor Author

ranchodeluxe commented Feb 27, 2024

@moradology and I talked, and maybe we'll create an FSSpecInputTarget storage class and rename FSSpecTarget to FSSpecOutputTarget. It depends on how/when things with recipes/runner change in the near future.

@ranchodeluxe ranchodeluxe merged commit 0507455 into main Feb 27, 2024
6 checks passed
@ranchodeluxe ranchodeluxe deleted the gcorradini/credential_fix branch February 27, 2024 15:56
Contributor

moradology commented Feb 27, 2024

OK, this is helpful. Yeah, so the fsspec target classes already do this, and it simply appears that we lack a convention for using them as input filesystems in addition to output filesystems.
Note to future selves: these things are just filesystems, so maybe they shouldn't be spoken of as Target at all so far as the class is concerned? Like, jobs can differ a great deal in how many different filesystems they'll need to have specified (and which ones get read vs. written), and maybe we shouldn't be opinionated about that question at all? It doesn't seem like that's what we should be in the business of at this level of abstraction.

I want to say: we should prefer a tiny number of high-level abstractions that can be used in a wide range of contexts, wherever that's at all possible.
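
One way to picture that suggestion, purely as a sketch: a single role-neutral class that bundles a location with its options, leaving it to each job to decide which instances are read from and which are written to. `StorageLocation`, `Job`, and their fields are hypothetical names, not part of pangeo-forge-recipes, and the bucket names and options are placeholders.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass(frozen=True)
class StorageLocation:
    """Hypothetical role-neutral description of a filesystem location."""

    protocol: str
    root_path: str
    options: Dict[str, Any] = field(default_factory=dict)

    def url(self, relative_path: str) -> str:
        # Join root and relative path without doubling slashes.
        root = self.root_path.rstrip("/")
        return f"{self.protocol}://{root}/{relative_path.lstrip('/')}"


@dataclass
class Job:
    """Hypothetical job that names its own filesystems and how it uses them."""

    filesystems: Dict[str, StorageLocation]

    def options_for(self, name: str) -> Dict[str, Any]:
        # Options travel with the location, so callers need no extra arguments.
        return dict(self.filesystems[name].options)


job = Job(
    filesystems={
        "source": StorageLocation("s3", "source-bucket/data", {"anon": False}),
        "output": StorageLocation("s3", "output-bucket/refs/", {"anon": False}),
    }
)
```

Under this sketch, adding a third or fourth filesystem to a job means adding a dictionary entry instead of another pair of keyword arguments.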
