implement zeros() by calling calloc #130

Open
StefanKarpinski opened this issue Jul 17, 2011 · 61 comments
Labels: arrays · help wanted · performance

Comments

@StefanKarpinski (Member)

There's a clever trick that we could use to create large zero matrices really fast: mmap the file /dev/zero. This is, in fact, exactly what this "file" exists for (see the sketch after the lists below). The benefits of doing this are:

  1. It's nearly instantaneous since no memory actually needs to be allocated or filled with zeros until it's accessed.
  2. You can read and write the memory exactly as you normally would: the kernel only allocates memory pages for you when you do something with them.

Since, a fair amount of the time, no one actually touches most of the memory in a zeros array, this might be a big win. On the other hand, the drawbacks are:

  1. Trade obvious immediate allocation cost for unobvious delayed allocation cost.
  2. Can run out of memory on read/write instead of only on allocate.
  3. Doesn't work for anything but zeros(); e.g. it doesn't help ones().
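
For concreteness, a minimal sketch of the idea using today's Mmap stdlib (hypothetical zeros_devzero; untested, and it assumes a POSIX system where a shared mapping of /dev/zero yields demand-zeroed pages):

using Mmap

# Sketch: back a zeroed Float64 vector with demand-zeroed pages from /dev/zero.
# The kernel only materializes a page the first time it is touched.
function zeros_devzero(n::Integer)
    open("/dev/zero", "r+") do io
        Mmap.mmap(io, Vector{Float64}, n; grow = false)
    end
end

# The modern equivalent is an anonymous mapping, with no device file needed:
# Mmap.mmap(Vector{Float64}, n)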
@ViralBShah (Member)

The thing is that we use Array() as the default constructor almost everywhere now, and zeros is not used nearly as much.

Wouldn't it be better if we could handle the page faults ourselves, which would take care of all the cases that call fill()?

@StefanKarpinski (Member, Author)

That could be possible, but testing with an mmap of /dev/zero might be an easy way to find out what there is to gain.

@ViralBShah (Member)

It will definitely lead to better cache behaviour, and there is certainly a pattern of usage where this will be greatly beneficial.


@ViralBShah (Member)

It is quite possible that this does not give any observable gains for anything except very large matrices. Is it possible to quickly do an experiment to see if this gives any measurable benefits? Also, we do not use zeros() much in our codebase.

@StefanKarpinski (Member, Author)

Yeah, this is a cool idea, but given the way our memory allocation works via a memory pool, it's not very practical. Let's close this for now.

@StefanKarpinski (Member, Author)

This is a real performance issue that I've seen in the wild recently and that has come up on the mailing list:

https://groups.google.com/forum/#!topic/julia-users/aW4rjUIFq6w

I'm fairly certain that NumPy is using the mmap /dev/zero trick and we should too.

@stevengj (Member) commented Mar 9, 2016

Why not just use calloc? See also #9147.

@ViralBShah (Member)

calloc does seem like the easiest first thing to try.
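
For illustration, a minimal sketch of what such a calloc-backed zeros could look like at the Julia level (hypothetical zeros_calloc, not meant as the runtime implementation):

# Sketch: allocate zeroed memory with calloc and wrap it as a Julia array.
function zeros_calloc(::Type{T}, n::Integer) where {T}
    p = Ptr{T}(Libc.calloc(n, sizeof(T)))
    p == C_NULL && throw(OutOfMemoryError())
    # own = true: the array takes ownership and frees the buffer via
    # Libc.free, which matches the allocator used here.
    unsafe_wrap(Array, p, n; own = true)
end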

@quinnj (Member) commented Mar 9, 2016

It'd be worth benchmarking anonymous mmaps too, since they're now fully supported and "easy" :)

julia> m = Mmap.mmap(Vector{Float64}, 10000)
10000-element Array{Float64,1}:
 0.0
 0.0
 0.0
 ⋮

@StefanKarpinski (Member, Author)

> Why not just use calloc?

Yes, excellent point, @stevengj. That's definitely the first thing to do.

@StefanKarpinski changed the title from "fast zeros() using mmap?" to "implement zeros() by calling calloc" on Apr 27, 2016
@StefanKarpinski added the help wanted and good first issue labels and removed the speculative label on Apr 27, 2016
@StefanKarpinski (Member, Author)

No reason not to do this. (Also, a really easy three-digit issue.)

@danielmatz (Contributor)

I used BenchmarkTools to compare Base.zeros against calloc and Mmap.mmap implementations. I've posted the script with my zeros_calloc and zeros_mmap functions and the benchmarks as a gist.

I first compared creating a small vector of zeros. I found that Base.zeros does the best here.

| Function | Median Time |
| --- | --- |
| Base.zeros | 43 ns |
| zeros_calloc | 660 ns |
| zeros_mmap | 1.47 μs |

Then I compared the creation of a somewhat large array. In this case, the Mmap.mmap solution clearly wins.

| Function | Median Time |
| --- | --- |
| Base.zeros | 2.28 ms |
| zeros_calloc | 2.17 ms |
| zeros_mmap | 1.98 μs |

Lastly, I created a somewhat large array of zeros, and then overwrote every element by filling with ones. And now the calloc version wins.

| Function | Median Time |
| --- | --- |
| Base.zeros | 3.50 ms |
| zeros_calloc | 2.83 ms |
| zeros_mmap | 4.74 ms |

Assuming I don't have an error somewhere... Which implementation should we go with?

@quinnj (Member) commented Oct 18, 2016

I don't think the Mmap.mmap version is viable, since it leads to the "shared data" situation and the array can't be resized. I guess in a non-vector situation we could maybe consider it. Here's what I'm talking about:

julia> t = Mmap.mmap(Vector{UInt8}, 10)
10-element Array{UInt8,1}:
 0x00
 0x00
 0x00
 0x00
 0x00
 0x00
 0x00
 0x00
 0x00
 0x00

julia> push!(t, 0x00)
ERROR: cannot resize array with shared data
 in push!(::Array{UInt8,1}, ::UInt8) at ./array.jl:480
 in push!(::Array{UInt8,1}, ::UInt8) at /Users/jacobquinn/julia/usr/lib/julia/sys.dylib:?

@yuyichao (Contributor) commented Oct 18, 2016

Also note that we can't trivially use calloc for malloc'd arrays either, since it provides less alignment than what we need.

@StefanKarpinski (Member, Author)

I think we should use manual zeroing for small arrays and calloc (with appropriate alignment) for large arrays. I'm guessing the cutoff should be about a memory page but it would be good to test that experimentally – which you're already doing. Thanks for investigating and pushing this forward!
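
A rough illustration of that hybrid (hypothetical zeros_hybrid, reusing the zeros_calloc sketch from above; the 4096-byte cutoff is a placeholder that would need benchmarking):

const CALLOC_CUTOFF = 4096  # assumed ~one page; should be tuned experimentally

function zeros_hybrid(::Type{T}, n::Integer) where {T}
    if n * sizeof(T) < CALLOC_CUTOFF
        fill!(Array{T}(undef, n), zero(T))  # manual zeroing for small arrays
    else
        zeros_calloc(T, n)                  # demand-zeroed pages for large ones
    end
end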

@stevengj (Member)

stevengj@7c9ab1d shows how to get 16-byte alignment from calloc.
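
In the same spirit, a sketch of the general over-allocation trick (not the linked commit's code; calloc_aligned and free_aligned are hypothetical names):

const ALIGN = 16

# Over-allocate, round the pointer up to the alignment boundary, and stash the
# raw pointer just below the aligned one so the matching free can recover it.
function calloc_aligned(nbytes::Integer)
    raw = Libc.calloc(1, nbytes + ALIGN + sizeof(Ptr{Cvoid}))
    raw == C_NULL && throw(OutOfMemoryError())
    addr = UInt(raw) + sizeof(Ptr{Cvoid})
    aligned = Ptr{Cvoid}((addr + ALIGN - 1) & ~UInt(ALIGN - 1))
    unsafe_store!(Ptr{Ptr{Cvoid}}(aligned - sizeof(Ptr{Cvoid})), raw)
    return aligned
end

free_aligned(p::Ptr{Cvoid}) =
    Libc.free(unsafe_load(Ptr{Ptr{Cvoid}}(p - sizeof(Ptr{Cvoid}))))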

@yuyichao (Contributor)

16 bytes is fine. We do 64-byte aligned allocation, though.

@stevengj (Member) commented Oct 18, 2016

The same code can easily be modified to do 64-byte aligned calloc. (Though I thought there was no performance advantage for more than 32-byte alignment, even with AVX?)

@danielmatz (Contributor)

@stevengj Thanks for the link to your calloc_a16 code, that was very helpful.

I think I'll need to add information to jl_array_flags_t to track whether calloc_a16 was used, so that I can know to call free_a16. I see that that structure is very carefully packed... Can I take a bit away from ndims?

@yuyichao (Contributor)

You should use the same free.

@danielmatz (Contributor)

@yuyichao I don't understand how that will work. Steven's calloc_a16 requests 16 more bytes from calloc than the user requested. Then he returns a pointer that is aligned to a 16 byte boundary, but that pointer is never the one that calloc itself returned. Isn't it true that I have to call free on the pointer originally returned by calloc? Steven provides a free_a16 function to do this. That's what led me to believe that I need to use a bit to track whether the memory came from calloc_a16.

@yuyichao (Contributor) commented Oct 29, 2016

I mean you should just optimize (for size) and use the existing implementation in jl_{malloc,calloc,free,realloc} when allocating array buffers with malloc. You can use the low bits of the size storage for the alignment. This shouldn't waste memory, since libc needs to do the over-allocation too (for alignment). (Note that you only need to over-allocate 64 bytes, and you are guaranteed to have a sizeof(void*) available to store the size.)

You do need to keep track of jl_ptr_to_array (and similar functions) that transfer ownership of the memory, since those should be freed using free, but that can be done in jl_gc_track_malloced_array, and you have at least 4 bits there to mark it.
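
A hypothetical illustration of the low-bits idea (one simple encoding, not necessarily the runtime's actual scheme): shift the size left and keep the alignment offset in the low bits that this frees up.

const OFFSET_BITS = 6  # 6 bits cover offsets below a 64-byte alignment

pack_header(size::UInt, offset::UInt) = (size << OFFSET_BITS) | offset
header_size(h::UInt)   = h >> OFFSET_BITS
header_offset(h::UInt) = h & ((UInt(1) << OFFSET_BITS) - 1)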

@danielmatz (Contributor)

Thank you for the help! A couple of questions...

  1. Are you envisioning that I use jl_calloc to allocate the array data only, and then use jl_ptr_to_array to turn it into an array? Or are you thinking I should make a new version of jl_new_array that uses jl_calloc to allocate the jl_array_t object and the data inline at once?
  2. You mention that there are at least 4 bits available in jl_gc_track_malloced_array. I'm not sure what you mean. What structure are you talking about?

@danielmatz (Contributor)

> Because the array code knows the exact size it needs, which is in general smaller than the allocation size.

I guess I was only considering the case where an array fills its allocated space, and then has to increase its allocation. In that case, realloc would already only be copying the minimum amount of data. But I guess you are saying that it is common to resize the array's allocation while still only using a subset of the allocation?

@danielmatz (Contributor)

I think my changes are working properly, but when I make the final changes to the zeros function and then run make to build the sysimg, it gets hung up. Any tips on how to debug that?

@yuyichao (Contributor)

Attach gdb and see why it hangs.

@danielmatz (Contributor)

The backtrace is tens of thousands of frames deep. The top of the stack is thousands of calls to inst_tuple_w_. Below that are thousands of calls to jl_apply_generic. Is that normal?

@yuyichao (Contributor)

It's normal if you messed something up like memory allocation. There are many ways I'd try to debug it, including running in rr, dumping local variables, printing in the allocation code, etc. It's impossible to tell without seeing the code.

@danielmatz (Contributor)

My apologies for setting this issue aside for so long. I should have time again now to continue looking at it.

I've been testing my changes again. It seems that they work fine everywhere except for const GLOBAL_RNG = MersenneTwister() in random.jl. That is where things hang when I run make. If I comment out that line, the build completes (though the new Julia executable doesn't work, since this global variable is missing).

Is there some interaction between memory allocation, globals, and the pre-compilation steps that I am missing?

If I can't figure this out soon, would it be OK to submit a WIP pull request? Maybe someone else would quickly spot my mistake.

@yuyichao (Contributor)

Submitting a WIP PR, or at least posting a pointer to the WIP code, would be useful.

@danielmatz (Contributor) commented Jan 5, 2017

I think I figured out my issue. I had replaced the zeros methods for all eltypes, but there are some where zeroing the memory is not appropriate. I've now restored the fill! implementation as the fallback, and will only opt in to using calloc when appropriate for the eltype.

So I should be back on track now. I have the WIP PR up, too.
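
For illustration, one way such an eltype opt-in check could look (hypothetical calloc_safe, not the PR's code): calloc is only valid when the element type is a bitstype whose zero value is represented by all-zero bytes.

# A type qualifies for calloc only if zero(T) is the all-zero byte pattern.
calloc_safe(::Type{T}) where {T} =
    isbitstype(T) && all(iszero, reinterpret(UInt8, [zero(T)]))

# calloc_safe(Float64) == true, while a type whose zero is not all-zero bits
# (e.g. one defining zero(T) = T(1)) would correctly return false.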

@yuyichao removed the good first issue and help wanted labels on Jan 30, 2017
StefanKarpinski pushed a commit that referenced this issue Feb 8, 2018
KristofferC pushed a commit that referenced this issue Feb 14, 2018
@antoine-levitt (Contributor)

Could someone comment on the status of this? There were a couple of follow-up PRs, but it doesn't seem that this functionality made it in. Was it decided that it was not beneficial, or is it beneficial but nobody got around to it?

@StefanKarpinski (Member, Author)

I think nobody got around to it.

@StefanKarpinski added the help wanted and arrays labels on Nov 2, 2019
@ViralBShah (Member)

I think the latest on the topic was #22953

@ToucheSir

This came up recently on Discourse. Though the scenario in question is not optimal, I would at least expect sum(zeros(...)) to be roughly comparable to NumPy:

julia> @benchmark sum(zeros(Float64, 1024, 1024))
BenchmarkTools.Trial: 3858 samples with 1 evaluation.
 Range (min … max):  941.038 μs …   3.298 ms  ┊ GC (min … max): 0.00% …  8.03%
 Time  (median):     992.895 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):     1.289 ms ± 546.474 μs  ┊ GC (mean ± σ):  6.00% ± 11.53%

In [1]: import numpy as np

In [2]: %timeit np.zeros((1024, 1024)).sum()
489 µs ± 4.19 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

@KristofferC (Member)

Unless you specifically want to show something with the histogram and all those numbers, consider using @btime for a terser output.

@ToucheSir commented Sep 12, 2021

I used @benchmark because timeit.timeit reports mean execution time. The minimum used by @btime wouldn't have been terrible in this case, but it's not the best point of comparison either. The standard deviation is probably interesting in and of itself, given it's larger than both the mean and the standard deviation of the numpy version.

@mkitti (Contributor) commented May 11, 2022

If anyone would like to use calloc to create an array, I have implemented this in ArrayAllocators.jl.

julia> using ArrayAllocators

julia> @time zeros(UInt8, 2048, 2048);
  0.000514 seconds (2 allocations: 4.000 MiB)

julia> @time Array{UInt8}(undef, 2048, 2048);
  0.000017 seconds (2 allocations: 4.000 MiB)

julia> @time Array{UInt8}(calloc, 2048, 2048); # Allocates and sets zeros eventually, but is much faster than `Base.zeros`
  0.000015 seconds (2 allocations: 4.000 MiB)

@Moelf (Contributor) commented Oct 30, 2023

Now that we have Memory{}, is there some other approach people want to try?
