refactor: use BlocksSerializer to replace StringBlock to simplify the serialization #17667

Open
wants to merge 5 commits into
base: main
from

Conversation

KKould
Member

@KKould KKould commented Mar 28, 2025

I hereby agree to the terms of the CLA available at: https://docs.databend.com/dev/policies/cla/

Summary

When StringBlock is used to serve queries that return large amounts of data, the data is materialized repeatedly and goes through pointless conversions. For example, a String value is first converted to Bytes and then back to String.

This PR uses BlocksSerializer to collect the DataBlocks that need to be serialized. During serde serialization, BlocksSerializer imitates how a Vec is serialized, so the DataBlocks no longer have to be converted to strings row by row, as the original StringBlock did, and then collected into a Vec<Vec<Option<String>>>.

When extracting data from a DataBlock, BlocksSerializer tries to read values directly from String, Variant, and other string-backed types, avoiding the pointless intermediate conversion to Bytes.
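To make the idea concrete, here is a minimal, std-only sketch of writing several blocks as one array of rows without first building a Vec<Vec<Option<String>>>. It is not the code in this PR: `Column`, `Block`, `write_json_str`, and `write_rows` are hypothetical stand-ins, the real DataBlock supports many more types, and the real BlocksSerializer goes through serde rather than writing into a `String` by hand.

```rust
use std::fmt::Write;

/// Hypothetical, simplified column type (an assumption for illustration).
pub enum Column {
    String(Vec<Option<String>>),
    Int(Vec<Option<i64>>),
}

/// Hypothetical block: every column holds `num_rows` values.
pub struct Block {
    pub columns: Vec<Column>,
    pub num_rows: usize,
}

/// Append `s` to `out` as a JSON string literal, escaping where required.
fn write_json_str(out: &mut String, s: &str) {
    out.push('"');
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => {
                let _ = write!(out, "\\u{:04x}", c as u32);
            }
            c => out.push(c),
        }
    }
    out.push('"');
}

/// Write all blocks as a single JSON array of rows, as if they were one
/// Vec<Vec<Option<String>>>, without ever allocating that nested Vec.
pub fn write_rows(blocks: &[Block], out: &mut String) {
    out.push('[');
    let mut first_row = true;
    for block in blocks {
        for row in 0..block.num_rows {
            if !first_row {
                out.push(',');
            }
            first_row = false;
            out.push('[');
            for (i, col) in block.columns.iter().enumerate() {
                if i > 0 {
                    out.push(',');
                }
                match col {
                    // String values are borrowed straight from the column.
                    Column::String(values) => match &values[row] {
                        Some(s) => write_json_str(out, s),
                        None => out.push_str("null"),
                    },
                    // Other values are rendered as strings on the fly.
                    Column::Int(values) => match values[row] {
                        Some(n) => {
                            let _ = write!(out, "\"{}\"", n);
                        }
                        None => out.push_str("null"),
                    },
                }
            }
            out.push(']');
        }
    }
    out.push(']');
}
```

In this sketch the only allocation that grows with the result is the output buffer itself; string cells are never copied into an intermediate per-row container.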

Originally, remaining_size and remaining_rows were enforced by computing the exact size of the data row by row, which adds noticeable overhead when the amount of data is large. This PR therefore uses proportional splitting: assuming that the rows in a DataBlock are uniform in size, take_rows = (remain_size * block_rows) / block_size estimates how many rows fit within the remaining_size limit, and the result is then further capped by remaining_rows.
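The proportional estimate can be sketched as a small standalone function. This is only an illustration of the formula above under the uniform-row assumption: the function name, parameter names, and the handling of empty or zero-sized blocks are assumptions, not the code in this PR.

```rust
/// Estimate how many rows of a block fit into the remaining byte budget,
/// assuming every row in the block is roughly the same size.
/// Hypothetical helper: names and edge-case handling are illustrative only.
pub fn take_rows(
    remaining_size: usize,
    remaining_rows: usize,
    block_rows: usize,
    block_size: usize,
) -> usize {
    if block_rows == 0 {
        return 0;
    }
    let by_size = if block_size == 0 {
        // A block that reports zero bytes cannot be limited by size.
        block_rows
    } else {
        // Widen to u128 so the product cannot overflow before the division.
        let rows = (remaining_size as u128 * block_rows as u128) / block_size as u128;
        // Never take more rows than the block actually holds.
        rows.min(block_rows as u128) as usize
    };
    // Apply the row limit on top of the size-based estimate.
    by_size.min(remaining_rows)
}
```

For example, with 500 bytes left and a 100-row block of 1000 bytes, the estimate is 500 * 100 / 1000 = 50 rows, at the cost of one multiplication and one division per block instead of a size computation per row.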

Test case:

CREATE TABLE test AS
SELECT 
    number AS id,
    JSON_OBJECT(
        'id', CAST(number % 100000000 AS STRING),
        'event_type', CONCAT('submit_abcd', CAST(number % 100 AS STRING)),
        'client_code', CONCAT('aaaaaaaaaaaaaaaaaaa', CAST(number % 100 AS STRING)),
        'activity_id', CONCAT('aaaaaaaaaaaaaaaaabbbbbbbbbbbbbbbbbbbbbbbbbccccccccccccccddddddddddddddddeeeeeeeeeeeeeeeee', CAST(number % 100 AS STRING))
    ) AS payload
FROM numbers(50000000);


SELECT * FROM test;

Release Mode:

  • Before: 50000000 rows read in 66.891 sec. Processed 50 million rows, 10.19 GiB (747.48 thousand rows/s, 155.92 MiB/s)
  • After: 50000000 rows read in 53.671 sec. Processed 50 million rows, 10.19 GiB (931.6 thousand rows/s, 194.33 MiB/s)


after.pb.gz
before.pb.gz

Tests

  • Unit Test
  • Logic Test
  • Benchmark Test
  • No Test - There are already tests covering this: test_max_size_per_page & test_max_size_per_page_total_rows

Type of change

  • Bug Fix (non-breaking change which fixes an issue)
  • New Feature (non-breaking change which adds functionality)
  • Breaking Change (fix or feature that could cause existing functionality not to work as expected)
  • Documentation Update
  • Refactoring
  • Performance Improvement
  • Other (please describe):


@KKould KKould changed the title perf: use ` to replace StringBlock` to simplify the serialization refactor: use ` to replace StringBlock` to simplify the serialization Mar 28, 2025
@github-actions github-actions bot added the pr-refactor this PR changes the code base without new features or bugfix label Mar 28, 2025
@KKould KKould force-pushed the perf/eliminate_string_block branch from 51bdc5e to 7e83c17 Compare March 28, 2025 08:51
@b41sh b41sh requested review from youngsofun and sundy-li March 28, 2025 09:15
@KKould KKould force-pushed the perf/eliminate_string_block branch 2 times, most recently from 1e09ad5 to 69edfa2 Compare March 28, 2025 11:08
Signed-off-by: Kould <[email protected]>
@KKould KKould force-pushed the perf/eliminate_string_block branch from 69edfa2 to 637d083 Compare March 28, 2025 12:30
@sundy-li
Member

Maybe we need to add a new HTTPOutputFormat in https://github.com/datafuselabs/databend/blob/d57c648cfd564189ced1e080524b76d388775a9b/src/query/formats/src/output_format

@KKould KKould changed the title refactor: use ` to replace StringBlock` to simplify the serialization refactor: use BlocksSerializer to replace StringBlock to simplify the serialization Mar 28, 2025
Signed-off-by: Kould <[email protected]>
Labels
pr-refactor this PR changes the code base without new features or bugfix
Development

Successfully merging this pull request may close these issues.

Optimization: Eliminate StringBlock in HTTP Server Data Serialization
4 participants