Oscillations in performance after different precompilation runs #51988
Comments
I tried to substitute |
That result sounds fairly typical. There are some tools you can use to reduce noise on your system, but these will likely only help you a little: https://juliaci.github.io/BenchmarkTools.jl/dev/linuxtips/#Additional-resources
Even with those, we usually see +-20% performance running exactly the same scalar code on exactly the same CPU frequency settings every day in our testing: https://github.com/JuliaCI/NanosoldierReports/blob/master/benchmark/by_date/2023-08/02/report.md
I cannot find the link now, but prior researchers have shown that you can have even larger--reproducible--differences in performance due to minute changes such as the length of the environment variable block or the current working directory the program is run from. |
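(A rough illustration of the environment-size effect mentioned above; this is a sketch, not from the thread, and RELATED_PAD is a made-up variable name used only as padding:)
# Sketch: rerun the identical benchmark while changing only the length of a dummy
# environment variable, to see whether memory-layout effects alone shift the timings.
for n in 0 64 256 1024 4096; do
  pad=$(head -c "$n" /dev/zero | tr '\0' 'x')
  echo "== env padding: $n bytes =="
  RELATED_PAD="$pad" julia -e "using Related; for _ in 1:10 main() end"
done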
To me this sounds a bit different, because I can wait an arbitrary amount of time after precompilation, or even restart the OS, and I still get the same times, e.g.
bob@bob-Victus-by-HP-Gaming-Laptop-15-fb0xxx:~$ julia -e "using Related; for _ in 1:10 main() end"
Processing time (w/o IO): 297 milliseconds
Processing time (w/o IO): 24 milliseconds
Processing time (w/o IO): 25 milliseconds
Processing time (w/o IO): 24 milliseconds
Processing time (w/o IO): 25 milliseconds
Processing time (w/o IO): 23 milliseconds
Processing time (w/o IO): 25 milliseconds
Processing time (w/o IO): 24 milliseconds
Processing time (w/o IO): 24 milliseconds
Processing time (w/o IO): 24 milliseconds
# I restart, wait ten minutes, or anything, and still same times
bob@bob-Victus-by-HP-Gaming-Laptop-15-fb0xxx:~$ julia -e "using Related; for _ in 1:10 main() end"
Processing time (w/o IO): 25 milliseconds
Processing time (w/o IO): 24 milliseconds
Processing time (w/o IO): 25 milliseconds
Processing time (w/o IO): 24 milliseconds
Processing time (w/o IO): 25 milliseconds
Processing time (w/o IO): 23 milliseconds
Processing time (w/o IO): 25 milliseconds
Processing time (w/o IO): 24 milliseconds
Processing time (w/o IO): 24 milliseconds
Processing time (w/o IO): 24 milliseconds
If I retrigger precompilation, then I can consistently get 18/19 ms after that, if I'm lucky. Is this also something expected? More specifically, should the point in time at which precompilation happens consistently affect results after that? Can we be "lucky" all the time? |
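(For reference, one way to retrigger precompilation when reproducing this is sketched below; the depot path and version directory are assumptions and may differ on your machine:)
# Sketch: drop the package's compile cache so the next load precompiles it again,
# then re-measure. Adjust ~/.julia/compiled/v1.9 to the Julia version in use.
rm -rf ~/.julia/compiled/v1.9/Related
julia -e "using Related; for _ in 1:10 main() end"   # using Related precompiles afresh before the timings print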
You could use |
Yes, that is expected. You can't tell from the benchmark report directly, but I also know that the measurement is likely highly consistent for that particular build arrangement, yet likely to change from build to build as various unrelated changes occur in other, unrelated code. It is unlikely you can get lucky all the time: Intel hasn't documented how, and there are too many different things that all need to be in one of the many possible fast arrangements. |
Oh, this is very interesting. I can only provide "empirical evidence", since I'm surely not an expert on this: I tried the Rust code in the fork I provided around 20 times, and all the builds resulted in nearly equal times. From what I saw on the VM of that repository, I'm almost sure this will translate to (maybe almost) all the other programming languages in that comparison apart from Julia. So this begs the question: what is special about Julia? Will run the |
The Rust compiler is likely fairly stable at compiling identical code from identical input, but it would also likely run into similar issues if you introduced unrelated changes into the code |
Like |
I added the results of
for both versions inside the fork https://github.com/Tortar/related_post_gen/blob/main/perf-jit-results.tar.xz. There are some differences, but I'm not able to say whether they are significant or not; someone more capable than me is needed on this front. Maybe it is not very informative, but I also ran this: Fast version:
Slow version:
|
1.021 sec vs. 1.086 sec seems fine (that's only about a 7% measured difference, versus the reported 33% slowdown) |
Yes, but there are a lot of other things inside the function (I/O); I should probably change the code so that it doesn't do these extra, unrelated computations. I will update the code in the fork. Edit: actually, I think this is really necessary in order to analyze the perf results, so I will update the code, rerun everything, and let you know |
Do something like
So that you can analyze the assembly with callgraphs and JIT code |
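(The concrete command was not captured above; a rough sketch of one way to do this on Linux with perf's JIT support is below. The output file names are placeholders, and it assumes a Julia build with perf JIT events enabled:)
# Sketch: record with perf while Julia emits JIT metadata, then inject the JIT
# symbols so the generated code shows up in the report.
ENABLE_JITPROFILING=1 perf record -o /tmp/perf.data --call-graph dwarf -k 1 \
    julia -e "using Related; for _ in 1:10 main() end"
perf inject --jit --input /tmp/perf.data --output /tmp/perf-jit.data
perf report -i /tmp/perf-jit.data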
OK, here are new results where I minimized the I/O but didn't remove all of it, because otherwise I wasn't able to reproduce the issue (since some I/O remains, the difference is a bit smaller than the real one):
I still didn't have the time to analyze this more carefully |
Used
I'm not sure if I'm identifying the right thing, but these are quite different: I used |
Could you upload the two files? |
You can also inspect the assembly by pressing |
Here are two fresh new runs (one slow, one fast): https://github.com/Tortar/related_post_gen/blob/main/perf-jit-results.tar.xz. I actually see a lot of variation in the collected instructions between runs. I will try to inspect the assembly |
I inspected the assembly. Even though the profiler doesn't sample the same instructions in both runs, the code seems equivalent; the only thing I notice is that some instructions are hit more in one case than in the other, e.g. these instructions are sampled a lot only in the slow version
but I have no idea what this means or implies; maybe if someone more expert looks into the profiler data, something more insightful can be understood |
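(One way to make that comparison more systematic, sketched here with placeholder file names, is to let perf itself diff the two recordings:)
# Sketch: perf diff reports per-symbol deltas in sample percentages between two
# recordings, which makes "hit more in one case than the other" easier to quantify.
perf diff /tmp/perf-jit-slow.data /tmp/perf-jit-fast.data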
I also saw that in the @code_llvm of
This happens also with much simpler code, e.g.
Also, I found this very interesting Rust issue, rust-lang/rust#69060, where they discuss non-deterministic performance in LLVM and manage to reduce the variance in their benchmarks; I'm not sure whether this affects Julia in the same way. Citing @eddyb:
(yet the performance non-determinism of LLVM in that benchmark appears to be much smaller than the one here) |
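(For anyone who wants to compare generated code across builds, a minimal sketch, assuming the hot entry point is Related.main with no arguments, is to dump its LLVM IR from each build and diff the dumps:)
# Sketch: dump the LLVM IR for the (assumed) hot entry point to a file. Note that
# this shows what the compiler emits in the current session, which need not match
# the cached pkgimage code byte for byte.
julia -e 'using Related, InteractiveUtils;
          open("llvm-buildA.ll", "w") do io
              code_llvm(io, Related.main, Tuple{})
          end'
# retrigger precompilation, repeat with llvm-buildB.ll, then:
diff llvm-buildA.ll llvm-buildB.ll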
Just to let you know, I tried exploring this problem again (nothing new to report about the cause, though), but I see much worse performance oscillations in 1.10.0 using the procedure described in the first comment: it is trimodal now :D (I wonder if this was already the case in 1.9)
Notice the 70% difference between worst and best "build" |
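(To characterize how many of these modes there are, a rough sketch like the following, with the compiled-cache path as an assumption, repeatedly wipes the cache, re-precompiles, and records the steady-state timings:)
# Sketch: repeat the wipe/precompile/measure cycle and collect the reported times,
# to see how many distinct "build" modes show up. Adjust the v1.10 directory to
# the Julia version in use.
for i in $(seq 1 20); do
  echo "---- build $i ----"
  rm -rf ~/.julia/compiled/v1.10/Related
  julia -e "using Related; for _ in 1:10 main() end" | tail -n 5
done | tee build-timings.txt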
I checked the cached binaries for the |
I tried with two different packages from the General registry; in both, again, the 6th, 9th, and last lines are the only ones that differ between precompilation runs. At the moment I have no idea whether this can have the effect of changing performance (I naively tried to patch the .ji with the one from the most performant build, but that retriggers precompilation). It seems strange that these same three lines differ in all tested packages; maybe this could be an issue in itself? |
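(A minimal sketch of one way to compare caches across precompilation runs, with the depot paths as assumptions:)
# Sketch: keep a copy of the cache produced by one precompilation run, force
# another run, and compare the resulting .ji files byte by byte.
# Assumes a single cache file; adjust if several slugs are present.
cp ~/.julia/compiled/v1.10/Related/*.ji /tmp/Related-buildA.ji
rm -rf ~/.julia/compiled/v1.10/Related
julia -e 'using Related'   # re-precompiles
cmp -l /tmp/Related-buildA.ji ~/.julia/compiled/v1.10/Related/*.ji | head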
I recommend not modifying the source between runs but renaming the cache files. That way you know the source hasn't changed at all (including its mtime). |
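(Concretely, something like the sketch below, with paths again being assumptions, moves the cache aside so that the next load precompiles afresh while the source files, including their mtimes, stay untouched:)
# Sketch: set the existing cache aside instead of modifying the package source.
mv ~/.julia/compiled/v1.10/Related ~/.julia/compiled/v1.10/Related.buildA
julia -e "using Related; for _ in 1:10 main() end"   # precompiles a fresh build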
When trying to optimize the Julia code for the benchmark at https://github.com/jinyus/related_post_gen we realized that the results for the Julia code varied much more than those for other languages; for more on this see https://discourse.julialang.org/t/funny-benchmark-with-julia-no-longer-at-the-bottom/104611/141. When trying on my machine I found the same. I retried on another one and reproduced it, so I came to the conclusion that something strange is going on.
The procedure to reproduce this behaviour is:
git clone https://github.com/Tortar/related_post_gen.git
julia -e 'using Pkg; Pkg.develop(path="./related_post_gen/julia/Related")'
julia -e "using Related; for _ in 1:10 main() end"
julia -e "using Related; for _ in 1:10 main() end"
This is what I see on my computer:
As you can see there is a variation of nearly 40% in performance. Notice that the algorithm of the Related source file is deterministic.
Version infos for the two computers where I reproduced this behaviour:
Apart from the (maybe) recoverable performance that solving this issue could bring, I want to stress that these oscillations make it difficult to optimize the code in a time-efficient way, since one can easily be misled by them.