refactor: use `BlocksSerializer` to replace `StringBlock` to simplify the serialization #17667

KKould · 2025-03-28T08:44:22Z

I hereby agree to the terms of the CLA available at: https://docs.databend.com/dev/policies/cla/

Summary

fixes: Optimization: Eliminate StringBlock in HTTP Server Data Serialization #17662

When StringBlock was used to return large query results, data was materialized repeatedly and went through needless conversions. For example, a String value was first converted to Bytes and then back to String.

This PR uses BlocksSerializer to collect the DataBlocks that need to be serialized. During serde serialization it imitates the behavior of a Vec, so a DataBlock no longer has to be converted to strings row by row (as the original StringBlock did) and then collected into a Vec<Vec<Option<String>>>.

When extracting data from a DataBlock, BlocksSerializer tries to read values directly from String, Variant, and other string-backed types, avoiding the needless intermediate conversion to Bytes.
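As a rough illustration of the idea in the two paragraphs above, the sketch below writes rows straight from column-oriented data into the output buffer, without first building a Vec<Vec<Option<String>>>. It is not the PR's code: it uses plain std::fmt::Write rather than serde, and `Column`, `Block`, `write_json_str`, and `write_rows_json` are hypothetical names invented for this example.

```rust
use std::fmt::Write;

/// Hypothetical stand-in for a column of a DataBlock.
enum Column {
    Str(Vec<Option<String>>),
    Int(Vec<Option<i64>>),
}

/// Hypothetical stand-in for a DataBlock: column-oriented storage.
struct Block {
    columns: Vec<Column>,
    num_rows: usize,
}

/// Append `s` as a JSON string literal, escaping what JSON requires.
fn write_json_str(out: &mut String, s: &str) {
    out.push('"');
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => {
                // Writing to a String cannot fail, so the result is ignored.
                let _ = write!(out, "\\u{:04x}", c as u32);
            }
            c => out.push(c),
        }
    }
    out.push('"');
}

/// Emit all blocks as one JSON array of rows, every non-null cell as a string.
/// String cells are borrowed directly; no per-row Vec<Option<String>> is built.
fn write_rows_json(blocks: &[Block]) -> String {
    let mut out = String::from("[");
    let mut first_row = true;
    for block in blocks {
        for row in 0..block.num_rows {
            if !first_row {
                out.push(',');
            }
            first_row = false;
            out.push('[');
            for (i, col) in block.columns.iter().enumerate() {
                if i > 0 {
                    out.push(',');
                }
                match col {
                    Column::Str(v) => match &v[row] {
                        Some(s) => write_json_str(&mut out, s),
                        None => out.push_str("null"),
                    },
                    Column::Int(v) => match v[row] {
                        Some(n) => {
                            let _ = write!(out, "\"{}\"", n);
                        }
                        None => out.push_str("null"),
                    },
                }
            }
            out.push(']');
        }
    }
    out.push(']');
    out
}
```

The point of the sketch is only the data flow: cells are visited in row order at serialization time and written once, instead of being copied into an intermediate row-major structure first.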

Originally, when enforcing remaining_size and remaining_rows, the exact size of the data was calculated row by row, which adds overhead when the amount of data is large. This PR therefore uses proportional splitting: assuming that row sizes within a DataBlock are roughly uniform, take_rows = (remain_size * block_rows) / block_size estimates how many rows fit within the remaining_size limit, and the result is then further capped by remaining_rows.
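The proportional-splitting formula can be sketched as a small standalone function. This is a minimal illustration under assumptions, not the code in page_manager.rs: the function name, the parameter types, and the handling of empty or zero-sized blocks are all hypothetical.

```rust
/// Estimate how many rows of a block fit in the remaining page budget.
///
/// Hypothetical helper: parameter names follow the PR description,
/// but the signature and edge-case handling are assumptions.
fn take_rows(
    remain_size: usize,    // bytes still allowed in the current page
    remaining_rows: usize, // rows still allowed in the current page
    block_rows: usize,     // number of rows in the block
    block_size: usize,     // estimated size of the block in bytes
) -> usize {
    if block_rows == 0 {
        return 0;
    }
    if block_size == 0 {
        // Nothing measurable to split on; only the row limit applies.
        return block_rows.min(remaining_rows);
    }
    // Widen to u128 so the product cannot overflow.
    let by_size = (remain_size as u128 * block_rows as u128) / block_size as u128;
    // Never take more rows than the block actually holds.
    let by_size = by_size.min(block_rows as u128) as usize;
    // Finally apply the row limit.
    by_size.min(remaining_rows)
}
```

For example, with 500 bytes left and a 100-row block of 1000 bytes, the estimate is 50 rows. The sketch can return 0 when the budget is smaller than one average row; how the real implementation guarantees progress in that case is not stated in the description.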

test case:

CREATE TABLE test AS
SELECT 
    number AS id,
    JSON_OBJECT(
        'id', CAST(number % 100000000 AS STRING),
        'event_type', CONCAT('submit_abcd', CAST(number % 100 AS STRING)),
        'client_code', CONCAT('aaaaaaaaaaaaaaaaaaa', CAST(number % 100 AS STRING)),
        'activity_id', CONCAT('aaaaaaaaaaaaaaaaabbbbbbbbbbbbbbbbbbbbbbbbbccccccccccccccddddddddddddddddeeeeeeeeeeeeeeeee', CAST(number % 100 AS STRING))
    ) AS payload
FROM numbers(50000000);


SELECT * FROM test;

Release Mode:

Before: 50000000 rows read in 66.891 sec. Processed 50 million rows, 10.19 GiB (747.48 thousand rows/s, 155.92 MiB/s)
After: 50000000 rows read in 53.671 sec. Processed 50 million rows, 10.19 GiB (931.6 thousand rows/s, 194.33 MiB/s)

after.pb.gz
before.pb.gz

Tests

Unit Test
Logic Test
Benchmark Test
No Test - There are already tests covering this: test_max_size_per_page & test_max_size_per_page_total_rows

Type of change

Bug Fix (non-breaking change which fixes an issue)
New Feature (non-breaking change which adds functionality)
Breaking Change (fix or feature that could cause existing functionality not to work as expected)
Documentation Update
Refactoring
Performance Improvement
Other (please describe):

sundy-li · 2025-03-28T13:59:28Z

Maybe we need to add a new HTTPOutputFormat in https://github.com/datafuselabs/databend/blob/d57c648cfd564189ced1e080524b76d388775a9b/src/query/formats/src/output_format

KKould changed the title ~~perf: use `BlocksSerializer` to replace `StringBlock` to simplify the serialization~~ refactor: use `BlocksSerializer` to replace `StringBlock` to simplify the serialization Mar 28, 2025

github-actions bot added the pr-refactor label (this PR changes the code base without new features or bugfix) Mar 28, 2025

perf: use BlocksSerializer to replace StringBlock to simplify the serialization

7e83c17

Signed-off-by: Kould <[email protected]>

KKould force-pushed the perf/eliminate_string_block branch from 51bdc5e to 7e83c17 Compare March 28, 2025 08:51

b41sh reviewed Mar 28, 2025

src/query/service/src/servers/http/v1/query/blocks_serializer.rs

b41sh requested review from youngsofun and sundy-li March 28, 2025 09:15

KKould force-pushed the perf/eliminate_string_block branch 2 times, most recently from 1e09ad5 to 69edfa2 Compare March 28, 2025 11:08

chore: codefmt

637d083

Signed-off-by: Kould <[email protected]>

KKould force-pushed the perf/eliminate_string_block branch from 69edfa2 to 637d083 Compare March 28, 2025 12:30

KKould changed the title ~~refactor: use `BlocksSerializer` to replace `StringBlock` to simplify the serialization~~ refactor: use BlocksSerializer to replace StringBlock to simplify the serialization Mar 28, 2025

chore: fix ci

fd89bdf

Signed-off-by: Kould <[email protected]>

youngsofun reviewed Mar 29, 2025

src/query/service/src/servers/http/v1/query/blocks_serializer.rs (outdated)

youngsofun reviewed Mar 29, 2025

src/query/service/src/servers/http/v1/query/page_manager.rs

youngsofun approved these changes Mar 29, 2025

KKould added 2 commits March 30, 2025 12:32

chore: fix typo

086843b

Signed-off-by: Kould <[email protected]>

chore: resize remain_size for flightsql

002fb88

Signed-off-by: Kould <[email protected]>

sundy-li approved these changes Mar 31, 2025

b41sh approved these changes Mar 31, 2025
