-
-
Notifications
You must be signed in to change notification settings - Fork 67
Improve block codecs #407
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
On another subtopic, after some quick tests I've also found that using thread_local adds a slight speed penalty, can it be? If so, is its use essential when decoding? |
I haven't looked into all of them yet, but a few points:
|
@amallia can you weigh in on these? |
@gustingonzalez I can think of two ways it could affect performance. (1) because we are using dynamic memory, it is possible that it is implemented by the compiler as a lazy initialization, which would add a check each time. (2) even if not dynamic, thread local array would be still somewhere farther from the top of the execution stack, which may affect caching. On both accounts, it's hard to say without running some experiments. I just submitted a PR: #492 As far as I can tell, there's nothing there that could slow anything down; however, I did change I wasn't able to run any benchmarks at the moment. If you could check performance, that would help a lot. There is nothing about the algorithm that makes thread local necessary. The idea is, as far as I can tell (I didn't write this code myself), is that we don't want to repeatedly allocate memory. If we change it to To be honest, I don't see how just removing FYI @JMMackenzie @amallia |
I performed some experiments on CW12B. I used CIFF query-only index, exported to PISA, then ran BMW with and without the changes in the related PR. I used Web Track queries from 12-14. For both baseline ( It seems that in fact the changes I made improve performance. I'm guessing it's because thread local on heap-allocated variable was doing some sort of lazy initialization which could have slowed it down + it was on the heap as opposed to on the stack as it is now. Here is a box-and-whiskers plot for different quantiles (and average): And zoomed in on the average: I did this on my desktop, but I have to say that given my reasoning and those results, I am fairly convinced that this change at least won't slow us down. I would suggest implementing static-length @amallia @JMMackenzie let me know what you think about all this. |
Hi everyone! I bring up for discussion some possible improvements that could be introduced in the
block codecs
.Interpolative
Starting here:
pisa/include/pisa/codec/block_codecs.hpp
Line 151 in 35dd03b
Why not just use
in
instead ofinbuf
?On the other hand, in the next line:
pisa/include/pisa/codec/block_codecs.hpp
Line 152 in 35dd03b
Could there be an improvement using
numeric_limits<uint32_t>::max()
instead ofuint32_t(-1)
?Simple16
Given that 28 is the maximum
batch size
, I think that thebuf
size could be reduced to4 * block_size + block_size % 28
instead of2 * 8 * block_size
:pisa/include/pisa/codec/simple16.hpp
Line 14 in 35dd03b
Although the buffer size is not a problem for the current memory sizes, if you want to use a
block_size
greater than 128, the memory saving is more substantial. On the other hand, theSimple8b
codec could be improved in a similar way.Again on
Simple16
, on the decoding could be used amemcpy
instruction instead araw loop
:pisa/include/pisa/codec/simple16.hpp
Line 30 in 35dd03b
Alternatively, it could be used the
std::copy
function, but it should be slower thanmemcpy
as with https://lemire.me/blog/2020/01/20/filling-large-arrays-with-zeroes-quickly-in-c/.MaskedVByte
Although
4 bytes
is the maximum size of a raw integer, in theMaskedVByte
codec is required a continuation bit for each byte. Then, the maximum size of an integer encoded with this one will be5 bytes
. Then,buf
could be redefined as5 * block_size
:pisa/include/pisa/codec/maskedvbyte.hpp
Line 21 in 35dd03b
In same way, the buffers of
VarintGB
andVarintG8IU
codecs could be resized as4 + 1 bytes
, assuming (for simplicity)1 byte descriptor
for each integer. Currently the buffer size for these codecs is2 * 4 * block_size
, too.The text was updated successfully, but these errors were encountered: