Errors when loading parallel packages on same machine #16788
Comments
See also #13684. You should probably load everything and do recompilation in serial first, before trying to load that code on parallel workers. |
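A minimal sketch of that ordering, assuming a current Julia (1.x) where the parallel functions live in the `Distributed` standard library (in 0.4 they were in Base). `Statistics` is only a stand-in for the real package, and the worker count of 2 is arbitrary:

```julia
using Distributed

# Step 1: load (and, if needed, recompile) on the master process only,
# so that a single process writes the precompile cache.
using Statistics            # stand-in for the real package

# Step 2: only then start the workers and load the same code on them.
addprocs(2)
@everywhere using Statistics

# Each worker can now call into the already-cached package.
results = [remotecall_fetch(() -> mean([1, 2, 3]), w) for w in workers()]
```

The point of the ordering is that the workers find an up-to-date cache and only read it, rather than all trying to regenerate it at once.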
Thanks for the reference. Note that the second log in my original report was after I loaded and recompiled in serial; there were still problems. |
This seems to have been closed with the message
But the bug title and the very first line of my report say that these errors happened while running on a single machine. So I think this bug should be re-opened. |
I'd look into what's causing the module replacement and method overwrite warnings, resolve those and see if you still have an issue. |
How? The warnings only occur in parallel mode; when I ran on a single process immediately before there were no complaints (that is, once I did the package update). |
If the needs-more-info tag is directed at me, I don't understand what more info I should provide. |
Self-contained code that allows others to reproduce the problem. |
The errors appear to concern the state of my package cache; are you saying I should upload the cache? The sequence that led to these problems began with running parallel jobs when the cache needed to be updated, and reproducing that would require an update to the external packages. I suppose with some fiddling I could create a local "upstream" package and tweak it. Apart from the question of origins, I'm currently in this weird state in which Pkg.update() runs fine as a single process, but loading in parallel fails. |
Sounds like you need to clear out your cache, have it rebuild in serial, and if the problems go away there's not really an actionable bug right now, other than perhaps the feature request of implementing a package-cache lock file. |
How do I clear the cache? I don't see a command in http://docs.julialang.org/en/release-0.4/manual/packages/. Do I just delete the whole ~/.julia/v0.4, or maybe ~/.julia/v0.4/.cache and the directory it points to? BTW, I did not explicitly run Pkg.update() in parallel, but, as the first transcript in my original report shows, some kind of update operation was triggered when I loaded my module in parallel. I'm guessing it was basically a Pkg.update(). |
delete everything under |
~/.julia/lib/v0.4/.cache is a symlink to ~/.julia/.cache. Should I delete ~/.julia/.cache as well, or perhaps the material underneath it? I'd hate to delete the cache without actually deleting it:) |
|
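For readers who want to script the cache clearing, here is a rough sketch. `clear_dir_contents` is a hypothetical helper, not a Pkg command, and the directory to pass is an assumption: later in this thread the reporter says the directory that needed emptying on 0.4 was ~/.julia/lib/v0.4/. The example below runs against a throwaway temporary directory instead:

```julia
# Hypothetical helper: delete everything under `dir` but keep `dir` itself.
# Make sure no Julia process is using the cache while this runs.
function clear_dir_contents(dir::AbstractString)
    for entry in readdir(dir)
        # lstat-based removal: a symlink entry is unlinked, not followed
        rm(joinpath(dir, entry); recursive=true, force=true)
    end
    return dir
end

# Demonstration on a scratch directory rather than a real cache.
scratch = mktempdir()
write(joinpath(scratch, "Stale.ji"), "stale cache contents")
mkdir(joinpath(scratch, "sub"))
clear_dir_contents(scratch)
remaining = readdir(scratch)
```

After clearing, the cache would be rebuilt by loading the packages once in a single process before starting any workers.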
Short version: clearing the cache didn't help. The following code is sufficient for me to get errors in parallel on one machine, even though I get no errors when not in parallel:
Here are the errors:
History: First I deleted everything under ~/.julia/v0.4/ and issued Pkg.add to get the packages back. I did Pkg.update() and a load of my original trouble.jl in serial mode and everything was fine. Then I started a parallel version, did Then I realized I had deleted the wrong directory, and deleted everything under ~/.julia/lib/v0.4/. There were no julia processes running at the time. I repeated the single user load (except using the stuff in trouble2.jl shown above); no error messages. Then I did the parallel load reported above. |
Thanks. I only see warnings there? |
Correct. I noticed that and thought perhaps everything was OK. But when I load the real code, which gets some data from a database, I get
The extra code is
although I note the line number in the traceback refers to line 8, |
The using directives and the calls such as |
Regarding the warning messages displayed by my simple test code: does the fact that I don't see them in serial mode mean they don't occur, or could this just be a difference in whether warnings are displayed in parallel vs serial mode? |
JuliaLang/Distributed.jl#20 and related bugs suggested trying an additional
I'm delighted to be able to run parallel code. I remain baffled by the behavior. For example, why are the errors and warnings from the child processes affected by what I do in the main process? Outstanding items:
|
When you develop a concurrent application you should account for various race conditions in your code. It's a bit naive to expect a single-process application to work as a multiprocess one without any modification. Even though the Julia language provides a friendly framework for multiprocessor computations that eliminates thread-level races, process-level data races can occur when data is modified and accessed simultaneously by different processes. Always check that your code and packages work correctly in multiprocessor environments. |
julia regenerates the cache somewhat behind the back of user code; it seems to me desirable that it do so in a way that is parallel-safe. |
We probably should do some file locking here. The main issue is actually a cross-platform implementation and API for file locking: #7176. It still doesn't seem like we'll be able to rely on libuv for this, so someone would need to implement it in Julia. |
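The kind of lock discussed here could be sketched as follows. This is not how Julia or Pkg does it; `with_dir_lock` and its keyword arguments are hypothetical names. It assumes that `mkdir` fails when the directory already exists, which is atomic on local POSIX filesystems but may not hold on some network filesystems:

```julia
# Hypothetical helper: run `f` while holding a cross-process lock that is
# represented by the existence of the directory `lockdir`.
function with_dir_lock(f, lockdir::AbstractString; poll=0.05, timeout=30.0)
    deadline = time() + timeout
    while true
        try
            mkdir(lockdir)               # throws if another process holds the lock
            break
        catch
            isdir(lockdir) || rethrow()  # failed for some other reason
            time() > deadline && error("timed out waiting for lock $lockdir")
            sleep(poll)                  # someone else is rebuilding; wait and retry
        end
    end
    try
        return f()
    finally
        rm(lockdir; force=true)          # release the lock (removes the empty directory)
    end
end

# Usage: only one process at a time would run the rebuild step.
lockdir = joinpath(mktempdir(), "cache.lock")
outcome = with_dir_lock(lockdir) do
    "rebuilt"                            # placeholder for the real cache regeneration
end
```

A process that dies while holding the lock leaves the directory behind, so a real implementation would also need stale-lock detection, for example by recording the owner's PID.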
I think if we move from in-file storage to directory storage that contains manifests of installed packages, the parallel installation problem can be solved easily. |
Another scenario: multiple architectures with a shared home directory. |
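One conceivable way to keep such machines from overwriting each other's cache is to key the cache path on the architecture. The layout below is purely illustrative: `arch_cache_dir` is a hypothetical helper and this is not the layout Julia actually uses. It relies only on the `VERSION` and `Sys.MACHINE` constants:

```julia
# Hypothetical layout: one cache subdirectory per Julia minor version and
# per machine triple (Sys.MACHINE is a string such as "x86_64-linux-gnu").
function arch_cache_dir(base::AbstractString)
    version_dir = string("v", VERSION.major, ".", VERSION.minor)
    return joinpath(base, version_dir, Sys.MACHINE)
end

cache_dir = arch_cache_dir(joinpath(homedir(), ".julia", "lib"))
```

Two hosts with different architectures sharing one home directory would then resolve to different directories.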
I bumped into this too running parallel scripts on a single machine. An easy way to get to the repeated warnings is including a file containing, e.g.:
then do
I also got the error reported above, but I don't have a MWE. I also get the UndefVarError:
Interestingly & luckily, running the script again in the same julia session makes the error and warnings disappear. |
See also #16778. That problem involves multiple machines; I get very similar errors using, e.g., -p 4 on a single machine.
Originally I think the packages/cache were a bit out of date, and when I ran in parallel things went wrong and rebuilt everywhere. If they were all rebuilding in the same spot, as they seemed to, it's easy to imagine that would cause problems. I also had a single CPU copy of julia running at the same time (idle) throughout. I was running under ESS.
After that I started up a single cpu version, did a Pkg.update(), and ran my code. Everything seemed OK.
However, I then ran a parallel julia, and continued to get errors. The included code is all in a module and connects to a PostgreSQL DB to get some data.
Here are excerpts from the sessions, first my initial run
And after the cleanup with one CPU: