implement zeros() by calling calloc #130
The thing is that we use Array() as the default constructor almost everywhere now, and zeros is not used nearly as much. Wouldn't it be better if we could handle the page faults ourselves, which would take care of all cases that call fill()? |
That could be possible, but testing with an mmap of /dev/zero might be an easy way to find out what there is to gain. |
It will definitely lead to better cache behaviour, and there will be a certain pattern of usage where this will be greatly beneficial for sure. -viral
|
It is quite possible that this does not give any observable gains for anything except very large matrices. Is it possible to quickly do an experiment to see if this gives any measurable benefits? Also, we do not use zeros() much in our codebase. |
Yeah, this is a cool idea, but given the way our memory allocation works via a memory pool, it's not very practical. Let's close for now. |
This is a real performance issue that I've seen in the wild recently and has come up on the mailing list: https://groups.google.com/forum/#!topic/julia-users/aW4rjUIFq6w

I'm fairly certain that NumPy is using the mmap /dev/zero trick and we should too. |
Why not just use calloc? |
It'd be worth bench-marking anonymous Mmaps too, since they're now fully supported and "easy" :)

```julia
julia> m = Mmap.mmap(Vector{Float64}, 10000)
10000-element Array{Float64,1}:
 0.0
 0.0
 0.0
```
|
Yes, excellent point, @stevengj. That's definitely the first thing to do. |
No reason not to do this. (Also, a really easy three-digit issue.) |
I used BenchmarkTools to compare I first compared creating a small vector of zeros. I found that
Then I compared the creation of a somewhat large array. In this case, the
Lastly, I created a somewhat large array of zeros, and then overwrote every element by filling with ones. And now the
Assuming I don't have an error somewhere... Which implementation should we go with? |
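For reference, a comparison of this shape might look roughly like the sketch below. The helper names and the exact variants are illustrative guesses on my part, not the code that produced the results described above:

```julia
using BenchmarkTools, Mmap

# zero-fill after an ordinary allocation (what a fill!-based zeros does)
zeros_fill(::Type{T}, n) where {T} = fill!(Vector{T}(undef, n), zero(T))

# ask calloc for already-zeroed memory and wrap it in an Array
function zeros_calloc(::Type{T}, n) where {T}
    ptr = ccall(:calloc, Ptr{Cvoid}, (Csize_t, Csize_t), n, sizeof(T))
    ptr == C_NULL && throw(OutOfMemoryError())
    return unsafe_wrap(Array, Ptr{T}(ptr), n; own = true)  # the GC releases it with free()
end

# anonymous mmap: the kernel zeroes pages lazily, on first touch
zeros_mmap(::Type{T}, n) where {T} = Mmap.mmap(Vector{T}, n)

for n in (16, 2^20)
    println("n = ", n)
    @btime zeros_fill(Float64, $n)
    @btime zeros_calloc(Float64, $n)
    @btime zeros_mmap(Float64, $n)
end
```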
I don't think the Mmap approach can work for arrays that later need to grow, since mmapped arrays can't be resized:

```julia
julia> t = Mmap.mmap(Vector{UInt8}, 10)
10-element Array{UInt8,1}:
 0x00
 0x00
 0x00
 0x00
 0x00
 0x00
 0x00
 0x00
 0x00
 0x00

julia> push!(t, 0x00)
ERROR: cannot resize array with shared data
 in push!(::Array{UInt8,1}, ::UInt8) at ./array.jl:480
 in push!(::Array{UInt8,1}, ::UInt8) at /Users/jacobquinn/julia/usr/lib/julia/sys.dylib:?
```
|
Also note that we can't trivially use calloc directly, since we need 16-byte-aligned memory. |
I think we should use manual zeroing for small arrays and calloc for large ones. |
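A minimal sketch of that small/large split, reusing the hypothetical zeros_fill and zeros_calloc helpers from the sketch above (the cutoff value is made up):

```julia
const CALLOC_CUTOFF_BYTES = 4096   # purely illustrative threshold

function myzeros(::Type{T}, n) where {T}
    if n * sizeof(T) < CALLOC_CUTOFF_BYTES
        return zeros_fill(T, n)      # manual zeroing is cheap when the array is small
    else
        return zeros_calloc(T, n)    # let the allocator/kernel hand back zeroed pages
    end
end
```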
stevengj@7c9ab1d shows how to get 16-byte alignment from calloc. |
16 bytes is fine. We do 64-byte-aligned allocation though. |
The same code can easily be modified to do 64-byte-aligned allocation. |
@stevengj Thanks for the link to your calloc_a16 code. I think I'll need to add information to the array to track whether its memory came from calloc_a16, so it can be freed correctly. |
You should use the same free. |
@yuyichao I don't understand how that will work. Steven's calloc_a16 requests 16 more bytes from calloc than the user requested. Then he returns a pointer that is aligned to a 16 byte boundary, but that pointer is never the one that calloc itself returned. Isn't it true that I have to call free on the pointer originally returned by calloc? Steven provides a free_a16 function to do this. That's what led me to believe that I need to use a bit to track whether the memory came from calloc_a16. |
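For illustration, that over-allocate-and-align bookkeeping might look roughly like this (written as ccalls from Julia only to keep the thread's language; the real change would live in C). Stashing the offset in the byte just before the aligned pointer is one common way to make the matching free work, though I don't know whether calloc_a16 does exactly that:

```julia
# align must fit in one byte for this particular trick (16 and 64 both do)
function calloc_aligned(nbytes::Integer, align::Integer = 16)
    raw = ccall(:calloc, Ptr{Cvoid}, (Csize_t, Csize_t), nbytes + align, 1)
    raw == C_NULL && throw(OutOfMemoryError())
    # round up to the next multiple of align, always skipping at least one byte
    p = Ptr{Cvoid}(div(UInt(raw) + align, align) * align)
    unsafe_store!(Ptr{UInt8}(p) - 1, UInt8(UInt(p) - UInt(raw)))  # remember the skip
    return p
end

function free_aligned(p::Ptr{Cvoid})
    offset = unsafe_load(Ptr{UInt8}(p) - 1)   # how far calloc_aligned skipped forward
    ccall(:free, Cvoid, (Ptr{Cvoid},), p - offset)
end
```

Memory obtained this way can only be released through free_aligned, which is exactly why the question of tracking how an array's buffer was allocated comes up.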
I mean you should just optimize (for size) and use the existing implementation in You do need to keep track of |
Thank you for the help! A couple of questions...
|
I guess I was only considering the case where an array fills its allocated space, and then has to increase its allocation. In that case, realloc would already only be copying the minimum amount of data. But I guess you are saying that it is common to resize the array's allocation while still only using a subset of the allocation? |
I think my changes are working properly, but when I make the final changes to the zeros definitions, the build hangs. |
Attach gdb and see why it hangs. |
The backtrace is tens of thousands of frames deep. The top of the stack is thousands of calls to |
It's normal if you messed something up like memory allocation. There are many ways I'd try to debug it, including running in |
My apologies for setting this issue aside for so long. I should have time again now to continue looking at it. I've been testing my changes again. It seems that they work fine everywhere except during pre-compilation. Is there some interaction between memory allocation, globals, and the pre-compilation steps that I am missing? If I can't figure this out soon, would it be OK to submit a WIP pull request? Maybe someone else would quickly spot my mistake. |
Submitting a WIP PR, or at least having a pointer to the WIP code, would be useful. |
I think I figured out my issue. I had replaced the zeros methods for all eltypes, but there are some where zeroing the memory is not appropriate. I've now restored the fill! implementation as the fallback, and will only opt in to using calloc when appropriate for the eltype. So I should be back on track now. I have the WIP PR up, too. |
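For concreteness, the per-eltype opt-in could be expressed as a check that zero(T) is represented by all-zero bytes. This predicate is only illustrative, not necessarily the one used in the WIP PR:

```julia
# calloc/mmap-style zeroing only builds a correct zeros(T, n) when the all-zero
# byte pattern really is zero(T)
calloc_compatible(::Type{T}) where {T} =
    isbitstype(T) && all(iszero, reinterpret(UInt8, [zero(T)]))

calloc_compatible(Float64)        # true:  0.0 is the all-zero bit pattern
calloc_compatible(Int32)          # true
calloc_compatible(Rational{Int})  # false: zero(Rational{Int}) == 0//1 stores a 1
calloc_compatible(BigInt)         # false: not a bits type at all
```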
Could someone comment on the status of this? There were a couple of followup PRs but it doesn't seem that this functionality is in. Was it decided it was not beneficial? Or is it but nobody got around to it? |
I think nobody got around to it. |
I think the latest on the topic was #22953 |
This came up recently on Discourse. Though the scenario in question is not optimal, I would at least expect zeros to be in the same ballpark as NumPy here:

```julia
julia> @benchmark sum(zeros(Float64, 1024, 1024))
BenchmarkTools.Trial: 3858 samples with 1 evaluation.
 Range (min … max):  941.038 μs …   3.298 ms  ┊ GC (min … max): 0.00% … 8.03%
 Time  (median):     992.895 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):     1.289 ms ± 546.474 μs  ┊ GC (mean ± σ):  6.00% ± 11.53%

  █
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▁▁ ▁
  980 μs          Histogram: frequency by time          2.38 ms <
```

```python
In [1]: import numpy as np

In [2]: %timeit np.zeros((1024, 1024)).sum()
489 µs ± 4.19 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```
|
Unless you specifically want to show something with the histogram and all those numbers, consider using @btime. |
I used |
If anyone would like to use
|
now that we have |
There's a clever trick that we could use to create large zero matrices really fast: mmap the file /dev/zero. This is, in fact, exactly what this "file" exists for. The benefits of doing this are: … Since a fair amount of the time, no one actually touches most of the memory in a zeros array, this might be a big win. On the other hand, the drawbacks are: … it only helps for zeros(), e.g. for ones() every element still has to be written.
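To make the potential win concrete, here is a small illustration; anonymous Mmap stands in for the /dev/zero mapping, since it gives the same lazily-zeroed, copy-on-write pages:

```julia
using Mmap

n = 10^8                                      # ~800 MB worth of Float64
@time eager = zeros(Float64, n)               # every page is allocated and written up front
@time lazy  = Mmap.mmap(Vector{Float64}, n)   # pages are only kernel bookkeeping so far

@time lazy[1] + lazy[end]                     # only the touched pages get faulted in and zeroed
```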