Improve block codecs #407

gustingonzalez · 2020-06-04T19:29:29Z

Hi everyone! I bring up for discussion some possible improvements that could be introduced in the block codecs.

Interpolative

Starting here:

pisa/include/pisa/codec/block_codecs.hpp

Line 151 in 35dd03b

uint8_t const* inbuf = in;

Why not just use in instead of inbuf?

On the other hand, in the next line:

pisa/include/pisa/codec/block_codecs.hpp

Line 152 in 35dd03b

if (sum_of_values == uint32_t(-1)) {

Could there be an improvement using numeric_limits<uint32_t>::max() instead of uint32_t(-1)?

Simple16

Given that 28 is the maximum batch size, I think that the buf size could be reduced to 4 * block_size + block_size % 28 instead of 2 * 8 * block_size:

pisa/include/pisa/codec/simple16.hpp

Line 14 in 35dd03b

thread_local std::vector<uint8_t> buf(2 * 8 * block_size);

Although the buffer size is not a problem for the current memory sizes, if you want to use a block_size greater than 128, the memory saving is more substantial. On the other hand, the Simple8b codec could be improved in a similar way.

Again on Simple16, on the decoding could be used a memcpy instruction instead a raw loop:

pisa/include/pisa/codec/simple16.hpp

Line 30 in 35dd03b

for (size_t i = 0; i < n; ++i) {

Alternatively, it could be used the std::copy function, but it should be slower than memcpy as with https://lemire.me/blog/2020/01/20/filling-large-arrays-with-zeroes-quickly-in-c/.

MaskedVByte

Although 4 bytes is the maximum size of a raw integer, in the MaskedVByte codec is required a continuation bit for each byte. Then, the maximum size of an integer encoded with this one will be 5 bytes. Then, buf could be redefined as 5 * block_size:

pisa/include/pisa/codec/maskedvbyte.hpp

Line 21 in 35dd03b

thread_local std::vector<uint8_t> buf(2 * block_size * sizeof(uint32_t));

In same way, the buffers of VarintGB and VarintG8IU codecs could be resized as 4 + 1 bytes, assuming (for simplicity) 1 byte descriptor for each integer. Currently the buffer size for these codecs is 2 * 4 * block_size, too.

The text was updated successfully, but these errors were encountered:

gustingonzalez · 2020-06-04T20:04:23Z

On another subtopic, after some quick tests I've also found that using thread_local adds a slight speed penalty, can it be? If so, is its use essential when decoding?

elshize · 2020-06-04T20:14:32Z

I haven't looked into all of them yet, but a few points:

Since the block size is always 128, I'd replace thread_local std::vector<uint8_t> with constexpr std::array<uint8_t, size>, which will also require to do static constexpr uint64_t block_size = 128;.
I don't see the reason for inbuf either.
I totally agree about numeric_limits
I agree with removing the raw loop but we should just use std::copy. We should measure just in case, but I'm sure it will be as fast as can be. There are a few problems with that post, which some people also point out in the comment. In short, the compilation was done on -O2 and also there was an implicit type casting that at the moment prevented gcc from optimizing it away (you can see how it can happen, if you try to copy, say, 16-bit integers to an array of 32-bit integers, then you can't just memset as is). From what I see in the comments, even that was already resolved. Anyway, we should have absolutely no issues copying between two arrays of the same type with std::copy. That said, we should test if there's any difference.

elshize · 2020-06-08T12:08:10Z

@amallia can you weigh in on these?

elshize · 2022-11-20T02:16:18Z

On another subtopic, after some quick tests I've also found that using thread_local adds a slight speed penalty, can it be? If so, is its use essential when decoding?

@gustingonzalez I can think of two ways it could affect performance. (1) because we are using dynamic memory, it is possible that it is implemented by the compiler as a lazy initialization, which would add a check each time. (2) even if not dynamic, thread local array would be still somewhere farther from the top of the execution stack, which may affect caching. On both accounts, it's hard to say without running some experiments.

I just submitted a PR: #492 As far as I can tell, there's nothing there that could slow anything down; however, I did change vector to array in the Simple codecs. This could potentially make it faster due to (1) from above.

I wasn't able to run any benchmarks at the moment. If you could check performance, that would help a lot.

There is nothing about the algorithm that makes thread local necessary. The idea is, as far as I can tell (I didn't write this code myself), is that we don't want to repeatedly allocate memory. If we change it to array, we won't have to allocate, so we might be able to remove it. But idk how that would affect caching and inlining. We would have to measure.

To be honest, I don't see how just removing thread_local in the current implementation (with vector) can improve performance, but maybe I'm missing something...

FYI @JMMackenzie @amallia

elshize · 2022-11-20T20:08:44Z

I performed some experiments on CW12B. I used CIFF query-only index, exported to PISA, then ran BMW with and without the changes in the related PR. I used Web Track queries from 12-14. For both baseline (master) and the changed version (improve-block-codecs) I ran all queries 20 times, and recorded the results. I only ran Simple16 at this time, but the changes I made for Simple8b are almost identical.

It seems that in fact the changes I made improve performance. I'm guessing it's because thread local on heap-allocated variable was doing some sort of lazy initialization which could have slowed it down + it was on the heap as opposed to on the stack as it is now.

Here is a box-and-whiskers plot for different quantiles (and average):

And zoomed in on the average:

I did this on my desktop, but I have to say that given my reasoning and those results, I am fairly convinced that this change at least won't slow us down.

I would suggest implementing static-length std::array for buffers for the rest of the codecs mentioned in this thread. I will want to run the same test for those as well eventually.

@amallia @JMMackenzie let me know what you think about all this.

gustingonzalez added the enhancement New feature or request label Jun 4, 2020

elshize mentioned this issue Nov 20, 2022

Improve block codecs #492

Merged

elshize self-assigned this Nov 20, 2022

elshize closed this as completed in #492 Nov 28, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve block codecs #407

Improve block codecs #407

gustingonzalez commented Jun 4, 2020

gustingonzalez commented Jun 4, 2020

elshize commented Jun 4, 2020

elshize commented Jun 8, 2020

elshize commented Nov 20, 2022

elshize commented Nov 20, 2022

Improve block codecs #407

Improve block codecs #407

Comments

gustingonzalez commented Jun 4, 2020

Interpolative

Simple16

MaskedVByte

gustingonzalez commented Jun 4, 2020

elshize commented Jun 4, 2020

elshize commented Jun 8, 2020

elshize commented Nov 20, 2022

elshize commented Nov 20, 2022