implement zeros() by calling calloc #130

Open
StefanKarpinski opened this issue Jul 17, 2011 · 61 comments
Labels: arrays · help wanted · performance

Comments

@StefanKarpinski (Member)

There's a clever trick that we could use to create large zero matrices really fast: mmap the file /dev/zero. This is, in fact, exactly what this "file" exists for (see the sketch after the lists below). The benefits of doing this are:

  1. It's nearly instantaneous since no memory actually needs to be allocated or filled with zeros until it's accessed.
  2. You can read and write the memory exactly as you normally would: the kernel only allocates memory pages for you when you do something with them.

Since, a fair amount of the time, no one actually touches most of the memory in a zeros array, this might be a big win. On the other hand, the drawbacks are:

  1. Trade obvious immediate allocation cost for unobvious delayed allocation cost.
  2. Can run out of memory on read/write instead of only on allocate.
  3. Doesn't work for anything but zeros(); e.g. it doesn't help ones().
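
For concreteness, a minimal sketch of the idea using today's Mmap stdlib (hypothetical zeros_devzero; untested, and it assumes a POSIX system where a shared mapping of /dev/zero yields demand-zeroed pages):

using Mmap

# Sketch: back a zeroed Float64 vector with demand-zeroed pages from /dev/zero.
# The kernel only materializes a page the first time it is touched.
function zeros_devzero(n::Integer)
    open("/dev/zero", "r+") do io
        Mmap.mmap(io, Vector{Float64}, n; grow = false)
    end
end

# The modern equivalent is an anonymous mapping, with no device file needed:
# Mmap.mmap(Vector{Float64}, n)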
@ViralBShah (Member)

The thing is that we use Array() as the default constructor almost everywhere now, and zeros is not used nearly as much.

Wouldn't it be better if we could handle the page faults ourselves, which would take care of all the cases that call fill()?

@StefanKarpinski (Member, Author)

That could be possible, but testing with an mmap of /dev/zero might be an easy way to find out what there is to gain.

@ViralBShah (Member)

It will definitely lead to better cache behaviour, and there is certainly a pattern of usage where this will be greatly beneficial.


@ViralBShah (Member)

It is quite possible that this does not give any observable gains for anything except very large matrices. Is it possible to quickly do an experiment to see if this gives any measurable benefits? Also, we do not use zeros() much in our codebase.

@StefanKarpinski (Member, Author)

Yeah, this is a cool idea, but given the way our memory allocation works via a memory pool, it's not very practical. Let's close this for now.

@StefanKarpinski (Member, Author)

This is a real performance issue that I've seen in the wild recently and that has come up on the mailing list:

https://groups.google.com/forum/#!topic/julia-users/aW4rjUIFq6w

I'm fairly certain that NumPy is using the mmap /dev/zero trick and we should too.

@stevengj (Member) commented Mar 9, 2016

Why not just use calloc? See also #9147.

@ViralBShah (Member)

calloc does seem like the easiest first thing to try.
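
For illustration, a minimal sketch of what such a calloc-backed zeros could look like at the Julia level (hypothetical zeros_calloc, not meant as the runtime implementation):

# Sketch: allocate zeroed memory with calloc and wrap it as a Julia array.
function zeros_calloc(::Type{T}, n::Integer) where {T}
    p = Ptr{T}(Libc.calloc(n, sizeof(T)))
    p == C_NULL && throw(OutOfMemoryError())
    # own = true: the array takes ownership and frees the buffer via
    # Libc.free, which matches the allocator used here.
    unsafe_wrap(Array, p, n; own = true)
end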

@quinnj (Member) commented Mar 9, 2016

It'd be worth benchmarking anonymous mmaps too, since they're now fully supported and "easy" :)

julia> m = Mmap.mmap(Vector{Float64}, 10000)
10000-element Array{Float64,1}:
 0.0
 0.0
 0.0
 ⋮

@StefanKarpinski (Member, Author)

> Why not just use calloc?

Yes, excellent point, @stevengj. That's definitely the first thing to do.

@StefanKarpinski changed the title from "fast zeros() using mmap?" to "implement zeros() by calling calloc" on Apr 27, 2016
@StefanKarpinski added the help wanted and good first issue labels and removed the speculative label on Apr 27, 2016
@StefanKarpinski (Member, Author)

No reason not to do this. (Also, a really easy three-digit issue.)

@danielmatz (Contributor)

I used BenchmarkTools to compare Base.zeros against calloc and Mmap.mmap implementations. I've posted the script with my zeros_calloc and zeros_mmap functions and the benchmarks as a gist.

I first compared creating a small vector of zeros. I found that Base.zeros does the best here.

| Function | Median Time |
| --- | --- |
| Base.zeros | 43 ns |
| zeros_calloc | 660 ns |
| zeros_mmap | 1.47 μs |

Then I compared the creation of a somewhat large array. In this case, the Mmap.mmap solution clearly wins.

| Function | Median Time |
| --- | --- |
| Base.zeros | 2.28 ms |
| zeros_calloc | 2.17 ms |
| zeros_mmap | 1.98 μs |

Lastly, I created a somewhat large array of zeros, and then overwrote every element by filling with ones. And now the calloc version wins.

| Function | Median Time |
| --- | --- |
| Base.zeros | 3.50 ms |
| zeros_calloc | 2.83 ms |
| zeros_mmap | 4.74 ms |

Assuming I don't have an error somewhere... Which implementation should we go with?

@quinnj (Member) commented Oct 18, 2016

I don't think the Mmap.mmap version is viable, since it leads to the "shared data" situation and the array can't be resized. I guess in a non-vector situation we could maybe consider it. Here's what I'm talking about:

julia> t = Mmap.mmap(Vector{UInt8}, 10)
10-element Array{UInt8,1}:
 0x00
 0x00
 0x00
 0x00
 0x00
 0x00
 0x00
 0x00
 0x00
 0x00

julia> push!(t, 0x00)
ERROR: cannot resize array with shared data
 in push!(::Array{UInt8,1}, ::UInt8) at ./array.jl:480
 in push!(::Array{UInt8,1}, ::UInt8) at /Users/jacobquinn/julia/usr/lib/julia/sys.dylib:?

@yuyichao (Contributor) commented Oct 18, 2016

Also note that we can't trivially use calloc for malloc'd arrays either, since it provides less alignment than what we need.

@StefanKarpinski (Member, Author)

I think we should use manual zeroing for small arrays and calloc (with appropriate alignment) for large arrays. I'm guessing the cutoff should be about a memory page but it would be good to test that experimentally – which you're already doing. Thanks for investigating and pushing this forward!
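
A rough illustration of that hybrid (hypothetical zeros_hybrid, reusing the zeros_calloc sketch from above; the 4096-byte cutoff is a placeholder that would need benchmarking):

const CALLOC_CUTOFF = 4096  # assumed ~one page; should be tuned experimentally

function zeros_hybrid(::Type{T}, n::Integer) where {T}
    if n * sizeof(T) < CALLOC_CUTOFF
        fill!(Array{T}(undef, n), zero(T))  # manual zeroing for small arrays
    else
        zeros_calloc(T, n)                  # demand-zeroed pages for large ones
    end
end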

@stevengj (Member)

stevengj@7c9ab1d shows how to get 16-byte alignment from calloc.
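
In the same spirit, a sketch of the general over-allocation trick (not the linked commit's code; calloc_aligned and free_aligned are hypothetical names):

const ALIGN = 16

# Over-allocate, round the pointer up to the alignment boundary, and stash the
# raw pointer just below the aligned one so the matching free can recover it.
function calloc_aligned(nbytes::Integer)
    raw = Libc.calloc(1, nbytes + ALIGN + sizeof(Ptr{Cvoid}))
    raw == C_NULL && throw(OutOfMemoryError())
    addr = UInt(raw) + sizeof(Ptr{Cvoid})
    aligned = Ptr{Cvoid}((addr + ALIGN - 1) & ~UInt(ALIGN - 1))
    unsafe_store!(Ptr{Ptr{Cvoid}}(aligned - sizeof(Ptr{Cvoid})), raw)
    return aligned
end

free_aligned(p::Ptr{Cvoid}) =
    Libc.free(unsafe_load(Ptr{Ptr{Cvoid}}(p - sizeof(Ptr{Cvoid}))))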

@yuyichao (Contributor)

16 bytes is fine. We do 64-byte aligned allocation, though.

@stevengj (Member) commented Oct 18, 2016

The same code can easily be modified to do 64-byte aligned calloc. (Though I thought there was no performance advantage for more than 32-byte alignment, even with AVX?)

@danielmatz (Contributor)

@stevengj Thanks for the link to your calloc_a16 code, that was very helpful.

I think I'll need to add information to jl_array_flags_t to track whether calloc_a16 was used, so that I can know to call free_a16. I see that that structure is very carefully packed... Can I take a bit away from ndims?

@yuyichao (Contributor)

You should use the same free.

@danielmatz (Contributor)

@yuyichao I don't understand how that will work. Steven's calloc_a16 requests 16 more bytes from calloc than the user requested. Then he returns a pointer that is aligned to a 16 byte boundary, but that pointer is never the one that calloc itself returned. Isn't it true that I have to call free on the pointer originally returned by calloc? Steven provides a free_a16 function to do this. That's what led me to believe that I need to use a bit to track whether the memory came from calloc_a16.

@yuyichao (Contributor) commented Oct 29, 2016

I mean you should just optimize (for size) and use the existing implementation in jl_{malloc,calloc,free,realloc} when allocating array buffers with malloc. You can use the low bits of the size storage for the alignment. This shouldn't waste memory, since libc needs to do the over-allocation too (for alignment). (Note that you only need to over-allocate 64 bytes, and you are guaranteed to have a sizeof(void*) available to store the size.)

You do need to keep track of jl_ptr_to_array (and similar functions) that transfer ownership of the memory, since those should be freed using free, but that can be done in jl_gc_track_malloced_array, and you have at least 4 bits there to mark it.
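
A hypothetical illustration of the low-bits idea (one simple encoding, not necessarily the runtime's actual scheme): shift the size left and keep the alignment offset in the low bits that this frees up.

const OFFSET_BITS = 6  # 6 bits cover offsets below a 64-byte alignment

pack_header(size::UInt, offset::UInt) = (size << OFFSET_BITS) | offset
header_size(h::UInt)   = h >> OFFSET_BITS
header_offset(h::UInt) = h & ((UInt(1) << OFFSET_BITS) - 1)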

@danielmatz (Contributor)

Thank you for the help! A couple of questions...

  1. Are you envisioning that I use jl_calloc to allocate the array data only, and then use jl_ptr_to_array to turn it into an array? Or are you thinking I should make a new version of jl_new_array that uses jl_calloc to allocate the jl_array_t object and the data inline at once?
  2. You mention that there are at least 4 bits available in jl_gc_track_malloced_array. I'm not sure what you mean. What structure are you talking about?

@danielmatz (Contributor)

> Because the array code knows the exact size it needs, which is in general smaller than the allocation size.

I guess I was only considering the case where an array fills its allocated space, and then has to increase its allocation. In that case, realloc would already only be copying the minimum amount of data. But I guess you are saying that it is common to resize the array's allocation while still only using a subset of the allocation?

@danielmatz (Contributor)

I think my changes are working properly, but when I make the final changes to the zeros function and then run make to build the sysimg, it gets hung up. Any tips on how to debug that?

@yuyichao (Contributor)

Attach gdb and see why it hangs.

@danielmatz (Contributor)

The backtrace is tens of thousands of frames deep. The top of the stack is thousands of calls to inst_tuple_w_. Below that are thousands of calls to jl_apply_generic. Is that normal?

@yuyichao (Contributor)

It's normal if you messed something up like memory allocation. There are many ways I'd try to debug it, including running in rr, dumping local variables, printing in the allocation code, etc. It's impossible to tell without seeing the code.

@danielmatz (Contributor)

My apologies for setting this issue aside for so long. I should have time again now to continue looking at it.

I've been testing my changes again. It seems that they work fine everywhere except for const GLOBAL_RNG = MersenneTwister() in random.jl. That is where things hang when I run make. If I comment out that line, the build completes (though the new Julia executable doesn't work, since this global variable is missing).

Is there some interaction between memory allocation, globals, and the pre-compilation steps that I am missing?

If I can't figure this out soon, would it be OK to submit a WIP pull request? Maybe someone else would quickly spot my mistake.

@yuyichao (Contributor)

Submitting a WIP PR, or at least posting a pointer to the WIP code, would be useful.

@danielmatz (Contributor) commented Jan 5, 2017

I think I figured out my issue. I had replaced the zeros methods for all eltypes, but there are some where zeroing the memory is not appropriate. I've now restored the fill! implementation as the fallback, and will only opt in to using calloc when appropriate for the eltype.

So I should be back on track now. I have the WIP PR up, too.
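
For illustration, one way such an eltype opt-in check could look (hypothetical calloc_safe, not the PR's code): calloc is only valid when the element type is a bitstype whose zero value is represented by all-zero bytes.

# A type qualifies for calloc only if zero(T) is the all-zero byte pattern.
calloc_safe(::Type{T}) where {T} =
    isbitstype(T) && all(iszero, reinterpret(UInt8, [zero(T)]))

# calloc_safe(Float64) == true, while a type whose zero is not all-zero bits
# (e.g. one defining zero(T) = T(1)) would correctly return false.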

@yuyichao removed the good first issue and help wanted labels on Jan 30, 2017
StefanKarpinski pushed a commit that referenced this issue Feb 8, 2018
KristofferC pushed a commit that referenced this issue Feb 14, 2018
@antoine-levitt (Contributor)

Could someone comment on the status of this? There were a couple of follow-up PRs, but it doesn't seem that this functionality made it in. Was it decided that it was not beneficial, or is it beneficial but nobody got around to it?

@StefanKarpinski (Member, Author)

I think nobody got around to it.

@StefanKarpinski added the help wanted and arrays labels on Nov 2, 2019
@ViralBShah (Member)

I think the latest on the topic was #22953

@ToucheSir

This came up recently on Discourse. Though the scenario in question is not optimal, I would at least expect sum(zeros(...)) to be roughly comparable to NumPy:

julia> @benchmark sum(zeros(Float64, 1024, 1024))
BenchmarkTools.Trial: 3858 samples with 1 evaluation.
 Range (min … max):  941.038 μs …   3.298 ms  ┊ GC (min … max): 0.00% …  8.03%
 Time  (median):     992.895 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):     1.289 ms ± 546.474 μs  ┊ GC (mean ± σ):  6.00% ± 11.53%

In [1]: import numpy as np

In [2]: %timeit np.zeros((1024, 1024)).sum()
489 µs ± 4.19 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

@KristofferC (Member)

Unless you specifically want to show something with the histogram and all those numbers, consider using @btime for a terser output.

@ToucheSir commented Sep 12, 2021

I used @benchmark because timeit.timeit reports mean execution time. The minimum used by @btime wouldn't have been terrible in this case, but it's not the best point of comparison either. The standard deviation is probably interesting in and of itself, given it's larger than both the mean and the standard deviation of the numpy version.

@mkitti (Contributor) commented May 11, 2022

If anyone would like to use calloc to create an array, I have implemented this in ArrayAllocators.jl.

julia> using ArrayAllocators

julia> @time zeros(UInt8, 2048, 2048);
  0.000514 seconds (2 allocations: 4.000 MiB)

julia> @time Array{UInt8}(undef, 2048, 2048);
  0.000017 seconds (2 allocations: 4.000 MiB)

julia> @time Array{UInt8}(calloc, 2048, 2048); # Allocates and sets zeros eventually, but is much faster than `Base.zeros`
  0.000015 seconds (2 allocations: 4.000 MiB)

@Moelf (Contributor) commented Oct 30, 2023

Now that we have Memory{}, is there some other approach people want to try?
