Concurrency limiting #557
This appears to work in its current form. The climsim recipes I deployed with [image]

Indeed, following a quick spike to 20 at the start of the job, the number of workers has been locked at 10 for most of this time (with 2 vCPUs per worker, 10 workers means 20 vCPUs, so maybe this means two concurrency groups are running per worker?): [image]

At any rate, I'm very encouraged by this (slow and steady) caching progress, as compared to previous deployment attempts of these recipes, wherein:
In the day plus of running, the

import gcsfs
gcs = gcsfs.GCSFileSystem()
cache_mli = [f for f in gcs.ls("gs://leap-scratch/data-library/cache") if "mli" in f]
len(cache_mli)  # -> 100238

I'm working on an integration test which will hopefully be able to capture this behavior in a (much) smaller pipeline. In the meantime, this work is unit tested here, so I thought I'd open it up for review to see what others think. I've made the concurrency limiter here a standalone PTransform which can be wrapped by other, more specific transforms, as demonstrated for

My feeling is that merging some version of what's here is the easiest way to move forward with blocked recipes, and that if we need something more general down the line, we can address that later. That being said, if others see it another way, I'm all ears. Thanks in advance for taking a look!
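As a rough illustration of the general idea (plain Python, not the code in this PR, and not Beam): the PR description mentions `GroupByKey`, and one way to picture a limiter built on grouping is to assign every element a key in `range(max_concurrency)`, group by that key, and handle each group serially. The names `assign_groups`, `apply_limited`, and `max_concurrency` are hypothetical and chosen only for this sketch.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, TypeVar

T = TypeVar("T")
U = TypeVar("U")


def assign_groups(items: Iterable[T], max_concurrency: int) -> Dict[int, List[T]]:
    """Round-robin items into at most `max_concurrency` keyed groups.

    This stands in for "add a key, then group by key"; it is an assumption
    about the shape of the technique, not the PR's implementation.
    """
    if max_concurrency < 1:
        raise ValueError("max_concurrency must be >= 1")
    groups: Dict[int, List[T]] = defaultdict(list)
    for i, item in enumerate(items):
        groups[i % max_concurrency].append(item)
    return dict(groups)


def apply_limited(
    func: Callable[[T], U], items: Iterable[T], max_concurrency: int
) -> List[U]:
    """Apply `func` group by group; items within one group run serially.

    A distributed runner could process the groups in parallel, so at most
    `max_concurrency` calls to `func` would be in flight at once. Here the
    groups are simply iterated in key order to stay dependency-free.
    """
    results: List[U] = []
    for _key, group in sorted(assign_groups(items, max_concurrency).items()):
        for item in group:
            results.append(func(item))
    return results


# Hypothetical inputs: seven file names limited to three groups.
urls = [f"file_{i}.nc" for i in range(7)]
groups = assign_groups(urls, max_concurrency=3)
cached = apply_limited(str.upper, urls, max_concurrency=3)
```

Because the number of groups is capped, the number of simultaneous calls is capped as well, regardless of how many input elements there are. The output order follows the groups, not the original input order.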
A few additional notes/thoughts:
Overall this looks great!
A few nits on type hints and function signature.
pangeo_forge_recipes/transforms.py
@@ -56,7 +57,7 @@
 # TODO: replace with beam.MapTuple?
-def _add_keys(func):
+def _add_keys(func: Callable) -> Callable:
A Callable of what?
I feel like these annotations are correct! To keep things moving, I'm going to merge. @alxmrs, if you catch anything out of place, feel free to let me know and I'll fix it in a follow-on.
Thanks for encouraging more specificity here, I think this really helps with clarity. 😃
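For readers wondering what a more specific `Callable` annotation can look like on a key-preserving decorator, here is a minimal sketch. It is not the annotation that was merged; `add_keys` and the type variables `K`, `A`, and `R` are hypothetical names used only for illustration.

```python
from typing import Callable, Tuple, TypeVar

K = TypeVar("K")  # key type, passed through untouched
A = TypeVar("A")  # argument type of the wrapped function
R = TypeVar("R")  # return type of the wrapped function


def add_keys(func: Callable[[A], R]) -> Callable[[Tuple[K, A]], Tuple[K, R]]:
    """Wrap `func` so it accepts a (key, item) tuple and returns (key, result)."""

    def wrapper(arg: Tuple[K, A]) -> Tuple[K, R]:
        key, item = arg
        return key, func(item)

    return wrapper


# The wrapped function keeps the key and transforms only the item.
keyed_len = add_keys(len)
```

Spelling out the argument and return types this way tells a reader (and a type checker) that the key passes through unchanged while only the item is transformed.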
WIP, will close #45 and #389 when complete.
Prioritizing this because it is blocking leap-stc/data-management#36, and also because it's been a long-standing feature request to unblock many different recipes. Implementation using `GroupByKey` adapted from the example provided by @alxmrs in the linked issue, with this being the key couple of lines: