Request: timeout in ChunkRecordingExecutor (ProcessPoolExecutor) #813

Closed
miketrumpis opened this issue Jul 12, 2022 · 8 comments
Labels
enhancement New feature or request

Comments

@miketrumpis
Contributor

I've been having some stalls in the ProcessPoolExecutor when creating some WaveformExtractor objects. Unfortunately, I can't pin down any factors that would help debug this. However, I made a quick fork from v0.94.0 to include a timeout for the map call in the ProcessPoolExecutor, which at least raises an exception instead of hanging forever.

Very trivial changes. Happy to rebase this and open a PR:

https://github.com/miketrumpis/spikeinterface/blob/multiproc/spikeinterface/core/job_tools.py
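
The gist of the change is just the standard concurrent.futures timeout pattern, roughly like this (a simplified sketch with illustrative names and default, not the exact diff):

```python
import concurrent.futures
from concurrent.futures import ProcessPoolExecutor

def run_chunks(process_chunk, chunk_args, n_jobs, timeout=300.0):
    # Executor.map() accepts a timeout in seconds; if a result is not available
    # within that window (measured from the original map() call), iterating the
    # results raises concurrent.futures.TimeoutError instead of blocking forever.
    with ProcessPoolExecutor(max_workers=n_jobs) as executor:
        try:
            return list(executor.map(process_chunk, chunk_args, timeout=timeout))
        except concurrent.futures.TimeoutError:
            raise RuntimeError(f"worker pool produced no result within {timeout} s")
```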

@alejoe91
Member

alejoe91 commented Jul 13, 2022

Hi @miketrumpis

Thanks for the report. Can you point to the changes that you made?

Cheers
Alessio

@alejoe91
Member

I see now you added a timeout!! Do you have any idea why the parallel processing is failing?
What OS are you using, and what is the input recording (e.g., any fancy preprocessing)?

@miketrumpis
Contributor Author

Pretty sure I've only seen it on Ubuntu 20.04.4. Correct me if I'm wrong, but I believe the 0.94 version only uses spawning for multiprocessing.

The stalling scenario is either using a basic WaveformExtractor or a custom extension, in the extract_waveforms_to_buffers stage. The preprocessing stages are channel selection and CAR, nothing too intensive.

I've wondered whether it's my extensions that are abusing the process executor, but the logic is largely inherited, and I'm only setting the sparsity matrix to narrow the output.

The stalling can definitely happen when multiple independent processes are each spawning their own process executors. I believe it can also happen in a single process, but I'm less certain of that.

Not sure it's relevant, but I'm curious whether anyone else sees resource warnings, as reported in this Python bug: python/cpython#90549 (note that I see this on both macOS and Linux, since the previous SpikeInterface release prefers to spawn).

diffs
master...miketrumpis:spikeinterface:multiproc

@alejoe91
Member

> Pretty sure I've only seen it on Ubuntu 20.04.4. Correct me if I'm wrong, but I believe the 0.94 version only uses spawning for multiprocessing.

Actually it uses loky (the default on Ubuntu); on Windows the default is spawn. We have a new job argument called mp_context. Can you try running the parallel processing with the additional mp_context="spawn" argument?
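
For example, something along these lines (just an illustrative snippet; recording and sorting are your existing objects, and the other job kwargs are whatever you normally use):

```python
import spikeinterface.full as si

# recording and sorting are assumed to be your existing objects;
# the job kwargs are forwarded to the chunked parallel executor,
# and mp_context="spawn" forces the spawn start method for the workers
we = si.extract_waveforms(
    recording, sorting, folder="waveforms_spawn",
    n_jobs=4, chunk_duration="1s", progress_bar=True,
    mp_context="spawn",
)
```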

> The stalling scenario is either using a basic WaveformExtractor or a custom extension, in the extract_waveforms_to_buffers stage. The preprocessing stages are channel selection and CAR, nothing too intensive.

> I've wondered whether it's my extensions that are abusing the process executor, but the logic is largely inherited, and I'm only setting the sparsity matrix to narrow the output.

Can I ask which extensions? If you have something cool in mind, I suggest opening an issue or a draft PR and we can definitely provide support :)

> The stalling can definitely happen when multiple independent processes are each spawning their own process executors. I believe it can also happen in a single process, but I'm less certain of that.

> Not sure it's relevant, but I'm curious whether anyone else sees resource warnings, as reported in this Python bug: python/cpython#90549 (note that I see this on both macOS and Linux, since the previous SpikeInterface release prefers to spawn).

> diffs master...miketrumpis:spikeinterface:multiproc

Thanks for the diffs!

@miketrumpis
Contributor Author

I will try to run under git main soon and play with the multiprocessing context. I have a list of jobs that have failed, but I'm not 100% sure the failure mode is deterministic. Unfortunately, the extensions are in a private repo for my organization 😬

@samuelgarcia
Member

@alejoe91: we do not use loky. It was too buggy. The ProcessPoolExecutor is from the Python standard library.

@miketrumpis: I am not sure that this timeout trick will be sustainable; it is very hard to predict how long the computation will take.
For me, when it hangs forever, it is because internally a chunk makes a worker buggy for some strange reason, but the error is not propagated to the main process. The best approach in that case is to use n_jobs=1 and track down why a chunk triggers a bug.
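
In other words, run the same job serially so the failing chunk raises directly in the calling process, for example (illustrative snippet; the other job kwargs are whatever you normally pass):

```python
import spikeinterface.full as si

# serial job kwargs for debugging: an exception in a chunk then propagates
# directly with a full traceback instead of leaving a worker hanging
debug_job_kwargs = dict(n_jobs=1, progress_bar=True, chunk_duration="1s")
we = si.extract_waveforms(recording, sorting, folder="waveforms_debug", **debug_job_kwargs)
```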

@miketrumpis
Contributor Author

> @alejoe91: we do not use loky. It was too buggy. The ProcessPoolExecutor is from the Python standard library.

I was taking another look at this too, so I presume that uses fork on Linux and spawn on macOS. Another reason to think spawn might change the behavior.

> @miketrumpis: I am not sure that this timeout trick will be sustainable; it is very hard to predict how long the computation will take. For me, when it hangs forever, it is because internally a chunk makes a worker buggy for some strange reason, but the error is not propagated to the main process. The best approach in that case is to use n_jobs=1 and track down why a chunk triggers a bug.

That's a fair point: the way I wrote it does not allow for a None default (the current behavior). Still, it would be nice to have the option of a timeout when requested explicitly, e.g. in the parameters to WaveformExtractor.
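
Something like this would keep the current behavior unless a timeout is explicitly requested (sketch only, with illustrative names):

```python
from concurrent.futures import ProcessPoolExecutor

def run_chunks(process_chunk, chunk_args, n_jobs, timeout=None):
    # timeout=None reproduces the current behavior (wait indefinitely);
    # passing a float opts in to the timeout on the map() call
    with ProcessPoolExecutor(max_workers=n_jobs) as executor:
        return list(executor.map(process_chunk, chunk_args, timeout=timeout))
```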

I will try your n_jobs suggestion next time I see the problem.

alejoe91 added the enhancement (New feature or request) label on Aug 2, 2022
@zm711
Collaborator

zm711 commented Jun 24, 2024

This is 2 years old. Should we close this?
