
Future of Pangeo-Forge? #799

Open
TomNicholas opened this issue Feb 26, 2025 · 10 comments

Comments

@TomNicholas

TomNicholas commented Feb 26, 2025

Over in nsidc/earthaccess#956 (reply in thread) I argued that Pangeo-Forge as a project is winding down, in favour of new tools and services.

Specifically:

  1. the Pangeo-Forge project is winding down since the original creator (@cisaacstern) is no longer funded to work on it,
  2. the catalog functionality in PGF has been gone/unmaintained for a while, and cataloging is (or hopefully will be) better served by services such as Earthmover and Source Cooperative (or maybe a community effort via FROST),
  3. the kerchunking recipes are now better done by VirtualiZarr, to which many of the same devs have now switched their focus (a rough sketch follows this list),
  4. PGF's pattern of writing out the Zarr metadata as a schema first and then filling in the chunks in parallel is much better served by Icechunk, which can do basically the same thing but with transactional version control,
  5. moving actual chunks at scale (instead of just references to chunks) is likely now better done by Xarray-Beam or Cubed, though these are much less mature,
  6. the VirtualiZarr/Cubed approaches both have the huge advantage of not requiring a whole separate API for ETL on top of the one used for analytics.
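
To make point 3 concrete, here is a rough sketch of the VirtualiZarr pattern that replaces a kerchunk-style recipe. The file paths are placeholders and the exact call signatures have been evolving across VirtualiZarr versions, so treat this as illustrative:

```python
import xarray as xr
from virtualizarr import open_virtual_dataset

# Open each file as a "virtual" dataset of byte-range references,
# without copying any chunk data.
vds = xr.concat(
    [open_virtual_dataset(path) for path in ["day1.nc", "day2.nc"]],
    dim="time",
    coords="minimal",
    compat="override",
)

# Persist the combined references; readers can then open the result
# as if it were a single Zarr store.
vds.virtualize.to_kerchunk("combined.json", format="json")
```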

So generally I would advise against anyone starting a big project now that is tied to Pangeo-Forge: although the stack is in a bit of a transitional period right now (with Icechunk/VirtualiZarr gaining more maturity daily), the future trajectory is already clear.

I think Pangeo-Forge is an awesome project but I would now like to see something more robust and sustainable built using the many lessons learned.

Do people think this is fair / reasonable? If not, where do you disagree with the vision I sketched above? Is there a part of the PGF project that isn't represented by that sketch? Should we consider at what point the project could be officially deprecated? Or could it instead transition to use new tools?

cc @keewis @jbusecke @abarciauskas-bgse @sharkinsspatial @alxmrs @rabernat @TomAugspurger

@abarciauskas-bgse
Contributor

I agree, thanks for this summary, @TomNicholas

@rabernat
Contributor

rabernat commented Feb 26, 2025

So many lessons were learned working on Pangeo Forge. It is an inspiring vision and a useful piece of technology. But a lot has changed in our ecosystem since we started working on it.

I'd add another issue to Tom's list: the Beam dependency. The decision to migrate to Beam was a mistake. It is too hard to run Beam in production anywhere but Google Cloud Dataflow. We overestimated the state of Beam support in other clouds, and that has created a huge barrier for many people to actually use Pangeo Forge. Big lesson learned.

We should have used Spark. (Yes, I know Beam theoretically runs on Spark, but it doesn't work well.)

@TomNicholas
Author

Good point.

Compare to Cubed (I never miss a chance to plug Cubed 😅): it can already run on Beam, Lithops (so AWS Lambda, GCF, etc.), Modal, Dask, Coiled Functions, and locally (serially, across threads, or across processes), and it has prototype executors for Spark and Ray. That's because a Cubed executor basically just implements the map primitive (see the toy sketch below). These executors just need to be exercised a bit more.
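
To illustrate what "just the map primitive" means, here is a toy sketch. This is not Cubed's actual executor interface, just the shape of the idea: each stage of the plan is an embarrassingly parallel batch of tasks, so any backend that can run map(func, tasks) can be wired up as an executor.

```python
from concurrent.futures import ProcessPoolExecutor

def run_stage(func, tasks):
    """Run one embarrassingly parallel stage of a plan.

    Hypothetical stand-in for an executor: swap the process pool for
    Lambda invocations, Modal functions, Dask tasks, etc. and you have
    a new backend.
    """
    with ProcessPoolExecutor() as pool:
        return list(pool.map(func, tasks))
```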

@cisaacstern
Member

Agree with all of the above 😃

@keewis
Contributor

keewis commented Feb 27, 2025

@norlandrhagen, @jbusecke, @alxmrs and I have been wondering about this for a while now during the meetings since October (@alxmrs actually wrote up his thoughts about the future of pangeo-forge last year, and they are pretty similar to what you're proposing).

What we have gathered so far is:

  • as @rabernat pointed out, beam is really painful to deploy and debug on anything but dataflow (but @alxmrs is continuing to look into the dask runner, so this may change... at some point in the future)
  • the "official" way of running recipes (pangeo-forge-runner) is a bit too magical for its own good (parameter injection, pipeline is constructed without the actual pipeline object, etc.)
  • I/O is really tricky to get to work properly with beam

However, with all that said, I personally think that the pipeline syntax is actually an advantage (this doesn't have to be Beam specifically; see the toy example after this list):

  • the recipe has a single (enforced) entrypoint, which really helps with understanding the code
  • the individual steps are chained together like a shell pipeline, which makes the steps transparent
  • the parallel execution is configured in one place, and this doesn't leak everywhere else
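
For reference, a toy Beam pipeline showing the style I mean (the step functions are made-up stand-ins, not real pangeo-forge-recipes transforms):

```python
import apache_beam as beam

# Hypothetical stand-ins for real recipe steps.
def download(url):
    return f"local copy of {url}"

def to_chunk(path):
    return f"chunk from {path}"

# One enforced entrypoint; steps chained like a shell pipeline; the
# runner (and hence the parallelism) is configured on the Pipeline.
with beam.Pipeline() as p:
    (
        p
        | "List inputs" >> beam.Create(["s3://bucket/a.nc", "s3://bucket/b.nc"])
        | "Fetch" >> beam.Map(download)
        | "Convert" >> beam.Map(to_chunk)
        | "Log" >> beam.Map(print)
    )
```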

Additionally, decomposing the full pipeline into individual steps allowed writing packages like pangeo-forge-ndpyramid and my own stac-recipes that build on a subset of pangeo-forge's transforms to do other cool things.

(As an aside, @norlandrhagen and I are trying to maintain pangeo-forge-recipes but don't appear to have the appropriate permissions. @rabernat or @cisaacstern, would you be able to help with that? Thanks for the invite, @rabernat)

@alxmrs
Contributor

alxmrs commented Feb 27, 2025

I totally agree with your take on the state of the data engineering ecosystem, Tom. I think you've covered more developments than when I tried to produce an overview, and in greater detail.

Is there a part of the PGF project that isn't represented by that sketch? Should we consider at what point the project could be officially deprecated? Or could it instead transition to use new tools?

I think this is the crux of the issue today. Maybe the simple answer to the second question is that we should deprecate PGF in favor of the litany of tools that you described. However, I do think PGF covers something not mentioned in your list: the community mediation, curation, and maintenance of datasets on top of the GitHub platform. This is the same sort of idea behind FROST. Given that GitHub has long been considered the social media of developers, and that Pangeo-Forge was modeled after the social structure defined by conda-forge, could Pangeo-Forge shift from being a project revolving around a data engineering compiler + compute engine to a system that facilitates community? Maybe this could even make space for the rapidly evolving set of LLM-based code generation tools?

One possible future direction for PGF is that it becomes more like a protocol for using (an evolving set of) data engineering tools within an open community. Skimming the FROST specs now: if you squint, how many features are within spitting distance of being built on top of GitHub and this project?

All that said, I can clearly see that the momentum is gone. I'm also very open to letting new tools and systems emerge in place of this one.

@TomNicholas
Author

I personally think that the pipeline syntax is actually an advantage

That's an interesting counterpoint...

the individual steps are chained together like a shell pipeline, which makes the steps transparent

I would like to see xarray become more like a pipeline - this is the "xarray as a new SQL-like query language for arrays" idea.
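
A toy illustration of that chained style (all of the data and names here are made up):

```python
import numpy as np
import pandas as pd
import xarray as xr

ds = xr.Dataset(
    {"air": (("time", "lat"), np.random.rand(365, 10))},
    coords={"time": pd.date_range("2013-01-01", periods=365), "lat": np.arange(10)},
)

# Method chaining already reads like a declarative query over arrays.
result = (
    ds.sel(time=slice("2013-01-01", "2013-06-30"))
    .mean(dim="time")
    .rename({"air": "mean_air"})
)
```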

the parallel execution is configured in one place, and this doesn't leak everywhere else

Do Cubed's executors not already allow pretty centralized configuration of parallelism?
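
For example, something like this (a rough sketch from memory; check the Cubed docs for the exact current API):

```python
import cubed
import cubed.array_api as xp

# Memory budget, working storage, and (optionally) the executor are
# declared once in a Spec; every array created against it inherits them.
spec = cubed.Spec(work_dir="tmp", allowed_mem="2GB")

a = xp.ones((2000, 2000), chunks=(500, 500), spec=spec)
b = a + 1
b.compute()  # runs on the configured executor (the local one by default)
```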

Additionally, decomposing the full pipeline into individual steps allowed writing packages

I might be missing something, but I don't see how that's different to writing another python package that consumes and produces xarray Datasets as part of an xarray pipeline.


I do think PGF covers something not mentioned in your list: the community mediation, curation, and maintenance of datasets on top of the Github platform.

Yes great point. My bullets above didn't propose an alternative to "PGF as a platform".

This is the same sort of idea behind FROST.

FROST doesn't exist other than as a twinkle in my eye. I don't know what the future of that platform part is, but it likely involves Earthmover/Source Coop plus ideas from FROST.

@keewis
Contributor

keewis commented Mar 3, 2025

I would like to see xarray become more like a pipeline - this is the "xarray as a new SQL-like query language for arrays" idea.

It might be better to compare with PRQL, since SQL is not very easy to read as a pipeline, but otherwise I agree: that would be great.

Do Cubed's executors not already allow pretty centralized configuration of parallelism?

I might have to look into cubed's executors to really answer that question, but maybe?

I might be missing something, but I don't see how that's different to writing another python package that consumes and produces xarray Datasets as part of an xarray pipeline.

I guess this doesn't have anything to do with the syntax but rather with the building blocks pfr provides. If we can get a different package to provide similar blocks, I think pfr can become a thin wrapper around that.

@mattjbr123

Thanks @TomNicholas for linking me to this.

So generally I would advise against anyone starting a big project now that is tied to Pangeo-Forge

I guess I fall into this category: for the past few months I've been working on a project developing digital infrastructure (tools, packages, UIs, APIs, etc.) to help scientists work with large gridded datasets via the cloud, and I have been using and advocating for pangeo-forge-recipes as the starting block...

I personally think that the pipeline syntax is actually an advantage

...because I agree with this. I saw plug-and-play, off-the-shelf blocks, from which you can build a recipe for converting/chunking/kerchunking/uploading/otherwise manipulating gridded datasets, as a good way of making these sorts of workflows easier: scientists who don't know or care about this sort of thing could easily get their data into ARCO format, given that is the way the world of big gridded data science is going. I envisaged building a nice UI on top of pangeo-forge-recipes that would allow users to bring their dataset and compose the necessary recipe without having to get to grips with (yet) another package.

Now that I can see the community is moving away from PFR, I'm considering switching to VirtualiZarr & Cubed. However, I will miss this pipeline element and will have to figure out what to replace it with.

I'd add another issue to Tom's list: the Beam dependency. The decision to migrate to Beam was a mistake. It is too hard to run Beam in production anywhere but Google Cloud Dataflow. We overestimated the state of Beam support in other clouds, and that has created a huge barrier for many people to actually use Pangeo Forge. Big lesson learned.

We should have used Spark. (Yes, I know Beam theoretically runs on Spark, but it doesn't work well.)

Having grappled with this Beam dependency for a while now, I can definitely agree: it's been a major blocker trying to get my PFR recipes to run with anything other than the Beam DirectRunner. So it will be nice to not have to worry about that!

@rabernat
Contributor

rabernat commented Mar 4, 2025

Let me add a bit more context here.

I think there is absolutely a need for an ETL tool like Pangeo Forge, and maybe it even is Pangeo Forge.

As the creator of the project, my reflection is that 2021 was actually too early for a Pangeo Forge, because we had not yet solved some more foundational issues, in particular:

  • How can we efficiently access archival files from cloud object storage? That has been solved by Kerchunk / VirtualiZarr.
  • How can we safely and correctly update / insert / mutate large Zarr datasets which are simultaneously being read by other users? That has been solved by Icechunk (a minimal sketch follows this list).
  • What's the best way to download millions of files from slow on-prem FTP servers? This turned out to be the crux of many Pangeo Forge recipes. Horizontal scaling does absolutely nothing to speed things up here, and in fact it makes things worse (it looks like a DDoS attack against the poor server). We still haven't solved this, and it may require a very different architecture from the one we have here.
  • What is a suitably generic compute infrastructure for executing large-scale dataflow-style transformations? We have still not solved this one either. Spark is really the industry standard here. Many people in our community use Dask. There is excitement about Cubed. But overall the experience with Beam has been a disappointment and a dead end. (I'll note that my new favorite entry in this space is Bytewax.)
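
On the second bullet, here is a minimal sketch of Icechunk's transactional model. It assumes the icechunk package and zarr v3, and the names follow the documented Python API, but treat it as illustrative rather than definitive:

```python
import icechunk
import zarr

storage = icechunk.local_filesystem_storage("/tmp/example-repo")
repo = icechunk.Repository.create(storage)

# Writers work in an isolated session; readers of `main` see nothing
# until the commit lands atomically.
session = repo.writable_session("main")
root = zarr.group(store=session.store)
root.create_array("temp", shape=(10, 10), chunks=(5, 5), dtype="f4")
print(session.commit("add temp array schema"))
```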

So if we're serious about evolving PF, this tells us what we have to do.

  1. Rethink the architecture to solve the "slow downloading" problem. I think this probably means a standalone tool focused exclusively on this. Skyplane might be a relevant tool to build on.
  2. Do a more serious and comprehensive evaluation of existing dataflow processing frameworks (Dask, Spark, Flink, Bytewax, etc.) and pick one that best suits the project.
  3. Refactor the library again to remove the Beam dependency but keep the general pipeline structure, perhaps adopting new primitives from whatever framework is chosen in step 2.

I estimate that it would take three engineers working full time for three months to achieve this transformation.
