Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PanicOnWarnings option to detect SQL warnings and fail the copy process #1500

Open
wants to merge 10 commits into
base: master
Choose a base branch
from

Conversation

grodowski
Copy link

@grodowski grodowski commented Feb 28, 2025

Related issue: #1498

Description

This PR adds a CLI flag --panic-on-warnings that makes gh-ost fail after copying each batch when:

  • the expected, selected row count differs from actual copied row count (affectedRows)
  • SQL warnings are present, indicating why some rows were dropped

I also decided to always log all SQL warnings, which could be useful to find issues with data truncation or not null constraints even when the row count matches.

Opening as draft with a few remaining items:

  • review documentation
  • review test coverage
  • check explain analyze on the updated range select query

In case this PR introduced Go code changes:

  • contributed code is using same conventions as original code
  • script/cibuild returns with no formatting errors, build errors or unit test errors.

@grodowski
Copy link
Author

grodowski commented Mar 11, 2025

Ran gh-ost locally on a 1.7GiB table with 110k rows with generated blob data to check if there's any noticable impact of the added count(*) for each batch.

Migrate cmd:

gh-ost --allow-on-master --allow-master-master --table="large_books" --alter="add index author_index(author(255))" --database="schema_migrations_test_app_development" --user="root" --password="" --initially-drop-old-table --initially-drop-ghost-table --default-retries=3 --verbose --debug --host=$MYSQL_HOST --port=$MYSQL_PORT --execute

Generic time benchmark on two binaries build on master and grodowski/raise_on_warnings (gh-ost-new)

gh-ost-master --allow-on-master --allow-master-master ... 0.99s user 1.45s system 7% cpu 30.708 total
gh-ost-master --allow-on-master --allow-master-master ... 0.98s user 1.43s system 7% cpu 31.321 total
gh-ost-new --allow-on-master --allow-master-master ... 1.00s user 1.46s system 7% cpu 32.743 total
gh-ost-new --allow-on-master --allow-master-master ... 0.98s user 1.41s system 7% cpu 31.117 total

Then, ran these explains on the copy query (BuildUniqueKeyRangeEndPreparedQueryViaTemptable):

-- before
explain analyze select /* gh-ost mydb.tbl test */ id
from (
    select
        id
    from
        large_books
    where id >= 50000 and id <= 52000
    order by
        id asc
    limit 1000) select_osc_chunk
order by
    id desc
limit 1

| -> Limit: 1 row(s)  (cost=1021..1021 rows=1) (actual time=0.943..0.943 rows=1 loops=1)
    -> Sort: select_osc_chunk.id DESC, limit input to 1 row(s) per chunk  (cost=1021..1021 rows=1) (actual time=0.942..0.942 rows=1 loops=1)
        -> Table scan on select_osc_chunk  (cost=906..921 rows=1000) (actual time=0.746..0.843 rows=1000 loops=1)
            -> Materialize  (cost=906..906 rows=1000) (actual time=0.744..0.744 rows=1000 loops=1)
                -> Limit: 1000 row(s)  (cost=806 rows=1000) (actual time=0.137..0.64 rows=1000 loops=1)
                    -> Filter: ((large_books.id >= 50000) and (large_books.id <= 52000))  (cost=806 rows=3992) (actual time=0.136..0.579 rows=1000 loops=1)
                        -> Covering index range scan on large_books using PRIMARY over (50000 <= id <= 52000)  (cost=806 rows=3992) (actual time=0.132..0.446 rows=1000 loops=1)
-- after
explain analyze select /* gh-ost mydb.tbl test */
    id,
    (select count(*) from (
        select
            id
        from
            large_books
        where id >= 50000 and id <= 52000
        order by
            id asc
        limit 1000
    ) select_osc_chunk)
from (
    select
        id
    from
        large_books
    where id >= 50000 and id <= 52000
    order by
        id asc
    limit 1000
) select_osc_chunk
order by
    id desc
limit 1;

| -> Limit: 1 row(s)  (cost=1021..1021 rows=1) (actual time=1.21..1.21 rows=1 loops=1)
    -> Sort: select_osc_chunk.id DESC, limit input to 1 row(s) per chunk  (cost=1021..1021 rows=1) (actual time=1.21..1.21 rows=1 loops=1)
        -> Table scan on select_osc_chunk  (cost=906..921 rows=1000) (actual time=0.955..1.08 rows=1000 loops=1)
            -> Materialize  (cost=906..906 rows=1000) (actual time=0.953..0.953 rows=1000 loops=1)
                -> Limit: 1000 row(s)  (cost=806 rows=1000) (actual time=0.161..0.813 rows=1000 loops=1)
                    -> Filter: ((large_books.id >= 50000) and (large_books.id <= 52000))  (cost=806 rows=3992) (actual time=0.16..0.732 rows=1000 loops=1)
                        -> Covering index range scan on large_books using PRIMARY over (50000 <= id <= 52000)  (cost=806 rows=3992) (actual time=0.157..0.574 rows=1000 loops=1)
-> Select #2 (subquery in projection; run only once)
    -> Aggregate: count(0)  (cost=1021..1021 rows=1) (actual time=1.04..1.04 rows=1 loops=1)
        -> Table scan on select_osc_chunk  (cost=906..921 rows=1000) (actual time=0.894..0.992 rows=1000 loops=1)
            -> Materialize  (cost=906..906 rows=1000) (actual time=0.894..0.894 rows=1000 loops=1)
                -> Limit: 1000 row(s)  (cost=806 rows=1000) (actual time=0.14..0.774 rows=1000 loops=1)
                    -> Filter: ((large_books.id >= 50000) and (large_books.id <= 52000))  (cost=806 rows=3992) (actual time=0.14..0.694 rows=1000 loops=1)
                        -> Covering index range scan on large_books using PRIMARY over (50000 <= id <= 52000)  (cost=806 rows=3992) (actual time=0.139..0.535 rows=1000 loops=1)

Looks like there will be no significant change in perf?

@grodowski grodowski marked this pull request as ready for review March 11, 2025 14:57
@Copilot Copilot bot review requested due to automatic review settings March 11, 2025 14:57
Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR introduces a new CLI flag (panic-on-warnings) that causes gh-ost to fail when SQL warnings are detected during row copying and also logs all SQL warnings.

  • A new flag "--panic-on-warnings" is added in the main CLI and corresponding context fields.
  • The iteration range calculation now also returns an expected row count with adjusted logic to process SQL warnings.
  • Various SQL builder functions and tests have been updated to support the enhanced query structure and error handling.

Reviewed Changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
dev.yml Adds configuration for testing environments and dependency management.
go/logic/applier.go Updates range end calculation and SQL warning processing logic.
go/sql/builder_test.go Adjusts expected queries and argument lists in tests to match new behavior.
go/logic/migrator.go Incorporates expected range size into chunk iteration and warning checks.
go/sql/builder.go Modifies SQL builder queries with minor refactorings and marks an unused variable.
go/cmd/gh-ost/main.go Introduces the new CLI flag for panic-on-warnings.
go/base/context.go Adds new context fields for managing SQL warnings and panic configuration.
Comments suppressed due to low confidence (1)

go/sql/builder.go:278

  • [nitpick] The variable 'uniqueKeyColumnDescending' is declared but not used anywhere in the function. Removing it could improve code clarity.
uniqueKeyColumnDescending := make([]string, len(uniqueKeyColumnNames)) // TODO unused variable

for _, warning := range this.migrationContext.MigrationLastInsertSQLWarnings {
this.migrationContext.Log.Infof("ApplyIterationInsertQuery has SQL warnings! %s", warning)
}
if expectedRangeSize != rowsAffected {
Copy link
Preview

Copilot AI Mar 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error generated by terminateRowIteration is not being returned, which may inadvertently allow execution to continue even after detecting SQL warnings. Consider returning the error from terminateRowIteration to abort the operation as intended.

Copilot is powered by AI, so mistakes are possible. Review output carefully before use.

Positive Feedback
Negative Feedback

Provide additional feedback

Please help us improve GitHub Copilot by sharing more details about this comment.

Please select one or more of the options
@meiji163
Copy link
Contributor

meiji163 commented Mar 12, 2025

Looks good! I will test the performance of this internally (hopefully this week)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants