Future of Pangeo-Forge? #799
Comments
I agree, thanks for this summary, @TomNicholas.
So many lessons were learned working on Pangeo Forge. It is an inspiring vision and a useful piece of technology. But a lot has changed in our ecosystem since we started working on it. I'd add another issue to Tom's list: the Beam dependency. The decision to migrate to Beam was a mistake. It is too hard to run Beam in production anywhere but Google Cloud Dataflow. We overestimated the state of Beam support in other clouds, and that has created a huge barrier for many people to actually use Pangeo Forge. Big lesson learned. We should have used Spark. (Yes, I know Beam theoretically runs on Spark, but it doesn't work well.)
Good point. Compare to Cubed (I never miss a chance to plug Cubed 😅): It can already run on Beam, Lithops (so AWS Lambda, GCF etc.), Modal, Dask, Coiled Functions, locally (serially, across threads or across processes), and has prototype executors for Spark and Ray. That's because a Cubed executor is basically just implementing the
Agree with all of the above 😃
@norlandrhagen, @jbusecke, @alxmrs and myself have been wondering about this for a while now during the meetings since October (@alxmrs actually wrote up his thoughts about the future of What we have gathered so far is:
However, with all that said, I personally think that the pipeline syntax is actually an advantage (this doesn't have to be
Additionally, decomposing the full pipeline into individual steps allowed writing packages like (As an aside, @norlandrhagen and I are trying to maintain
I totally agree with your summary of the state of the data engineering ecosystem, Tom. You've covered more developments since I tried to produce an overview, and in greater detail.
I think this is the crux of the issue today. Maybe the simple answer to the second question is that we should deprecate PGF in favor of the litany of tools that you described. However, I do think PGF covers something not mentioned in your list: the community mediation, curation, and maintenance of datasets on top of the GitHub platform. This is the same sort of idea behind FROST. Considering GitHub has long been considered the social media of developers -- and pangeo-forge was modeled after the social structure defined by conda-forge -- could Pangeo-Forge shift from being a project revolving around a data engineering compiler + compute engine to a system that facilitates community? Maybe this could even make space for the rapidly evolving set of LLM-based code generation tools? One possible future direction for PGF is that it becomes more like a protocol for using (an evolving set of) data engineering tools within an open community. Skimming the FROST specs now, if you squint, how many features are within spitting distance of being built on top of GitHub and this project? All that said, I can clearly see the momentum has gone. I'm also very open to letting new tools and systems emerge in place of this one.
That's an interesting counterpoint...
I would like to see xarray become more like a pipeline - this is the "xarray as a new SQL-like query language for arrays" idea.
Do Cubed's executors not already allow pretty centralized configuration of parallelism?
I might be missing something, but I don't see how that's different from writing another Python package that consumes and produces xarray Datasets as part of an xarray pipeline.
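To make that comparison concrete, here is a minimal sketch of what "steps that consume and produce a dataset" could look like. It is only an illustration: a plain dict stands in for an xarray Dataset so the snippet runs with the standard library alone, and `rename`, `scale`, and `run_pipeline` are hypothetical names, not part of xarray, Cubed, or pangeo-forge-recipes.

```python
from functools import reduce
from typing import Callable, Dict, List

# Assumption: a plain dict (variable name -> values) stands in for an xarray Dataset.
Dataset = Dict[str, List[float]]
Step = Callable[[Dataset], Dataset]


def rename(old: str, new: str) -> Step:
    """Build a step that renames one variable and leaves the rest untouched."""
    def step(ds: Dataset) -> Dataset:
        return {(new if name == old else name): values for name, values in ds.items()}
    return step


def scale(name: str, factor: float) -> Step:
    """Build a step that multiplies one variable by a constant."""
    def step(ds: Dataset) -> Dataset:
        return {
            key: ([v * factor for v in values] if key == name else values)
            for key, values in ds.items()
        }
    return step


def run_pipeline(ds: Dataset, steps: List[Step]) -> Dataset:
    """Apply each step in order; every step takes a Dataset and returns a Dataset."""
    return reduce(lambda acc, step: step(acc), steps, ds)


result = run_pipeline(
    {"temp_c": [1.0, 2.0]},
    [scale("temp_c", 2.0), rename("temp_c", "temp")],
)
```

With real xarray objects, the same chaining could presumably be written with `Dataset.pipe`, since each step is just a function from Dataset to Dataset.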
Yes great point. My bullets above didn't propose an alternative to "PGF as a platform".
FROST doesn't exist other than as a twinkle in my eye. I don't know what the future of that platform part of it is but it likely involves Earthmover/Source Coop + ideas from FROST.
It might be better to compare with PRQL because SQL is not very easy to read as a pipeline, but otherwise I agree, that would be great.
I might have to look into
I guess this doesn't have anything to do with the syntax but rather with the building blocks pfr provides. If we can get a different package to provide similar blocks, I think pfr can become a thin wrapper around that.
Thanks @TomNicholas for linking me to this.
I guess I fall into this category. For the past few months I've been working on a project developing digital infrastructure (tools, packages, UIs, APIs etc.) to help scientists work with large gridded datasets via the cloud, and I have been using and advocating for pangeo-forge-recipes as the starting block...
...because I agree with this. I saw plug & play off-the-shelf blocks, from which you can build a recipe for conversion/chunking/kerchunking/uploading/otherwise manipulating gridded datasets, as a good way of making these sorts of workflows easier, so that scientists who don't know/care about this sort of thing can easily get their data into ARCO format, given that is the way the world of big gridded data science is going. I envisaged building a nice UI on top of pangeo-forge-recipes that would allow users to bring their dataset and compose the necessary recipe without having to get to grips with (yet) another package. Now that I can see the community is moving away from PFR, I'm considering switching to VirtualiZarr & Cubed. However, I will miss this pipeline element and will need to figure out what to replace it with.
Having grappled with this Beam dependency for a while now, I can definitely agree with this. It's been a major blocker in trying to get my PFR recipes to run with anything other than the Beam DirectRunner. So it will be nice not to have to worry about this!
Let me add a bit more context here. I think there is absolutely a need for an ETL tool like Pangeo Forge, and maybe it even is Pangeo Forge. As the creator of the project, my reflection is that 2021 was actually too early for a Pangeo Forge, because we had not yet solved some more foundational issues, in particular:
So if we're serious about evolving PF, this tells us what we have to do.
I estimate that it would take three engineers working full time for three months to achieve this transformation.
Over in nsidc/earthaccess#956 (reply in thread) I argued that Pangeo-Forge as a project is winding down, in favour of new tools and services.
Specifically:
So generally I would advise against anyone starting a big project now that is tied to Pangeo-Forge: although the stack is in a bit of a transitional period right now (with Icechunk/VirtualiZarr gaining more maturity daily), the future trajectory is already clear.
I think Pangeo-Forge is an awesome project but I would now like to see something more robust and sustainable built using the many lessons learned.
Do people think this is fair / reasonable? If so do you disagree with the vision I sketched above? Is there a part of the PGF project that isn't represented by that sketch? Should we consider at what point the project could be officially deprecated? Or could it instead transition to use new tools?
cc @keewis @jbusecke @abarciauskas-bgse @sharkinsspatial @alxmrs @rabernat @TomAugspurger