DataShard and SchemeShard: handle borrowed parts in data erasure #15451

Merged 4 commits on Apr 1, 2025

lex007in (Collaborator) commented Mar 7, 2025

Changelog entry

...

Changelog category

  • Not for changelog (changelog entry is not required)

Description for reviewers

Return an error when borrowed parts are present in DataShard. SchemeShard will retry these failed DataCleanup attempts.
In case of split/merge, SchemeShard will wait for the old tablet to be deleted.

github-actions bot commented Mar 7, 2025

🟢 2025-03-26 00:46:56 UTC The validation of the Pull Request description is successful.

github-actions bot commented Mar 7, 2025

2025-03-07 09:50:34 UTC Pre-commit check linux-x86_64-release-asan for a71b5d4 has started.
2025-03-07 09:50:49 UTC Artifacts will be uploaded here
2025-03-07 09:53:33 UTC ya make is running...
🟡 2025-03-07 11:28:50 UTC Some tests failed, follow the links below. This fail is not in blocking policy yet Going to retry failed tests...

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
12133 11881 0 184 32 36

2025-03-07 11:30:07 UTC ya make is running... (failed tests rerun, try 2)
🟡 2025-03-07 11:49:00 UTC Some tests failed, follow the links below. This fail is not in blocking policy yet Going to retry failed tests...

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
317 (only retried tests) 261 0 18 7 31

2025-03-07 11:49:13 UTC ya make is running... (failed tests rerun, try 3)
🟢 2025-03-07 12:01:19 UTC Tests successful.

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
92 (only retried tests) 60 0 0 4 28

🟢 2025-03-07 12:01:29 UTC Build successful.
🟡 2025-03-07 12:01:59 UTC ydbd size 3.7 GiB changed* by +322.7 KiB, which is >= 100.0 KiB vs main: Warning

ydbd size dash main: a790a6d merge: a71b5d4 diff diff %
ydbd size 3 994 271 672 Bytes 3 994 602 080 Bytes +322.7 KiB +0.008%
ydbd stripped size 1 388 750 024 Bytes 1 388 830 888 Bytes +79.0 KiB +0.006%

*please be aware that the difference is based on comparing your commit and the last completed build from the post-commit, check comparation

github-actions bot commented Mar 7, 2025

2025-03-07 09:51:08 UTC Pre-commit check linux-x86_64-relwithdebinfo for a71b5d4 has started.
2025-03-07 09:51:12 UTC Artifacts will be uploaded here
2025-03-07 09:54:03 UTC ya make is running...
🟡 2025-03-07 11:17:15 UTC Some tests failed, follow the links below. Going to retry failed tests...

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
26498 23916 0 2 2466 114

2025-03-07 11:19:33 UTC ya make is running... (failed tests rerun, try 2)
🟢 2025-03-07 11:36:08 UTC Tests successful.

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
161 (only retried tests) 57 0 0 0 104

🟢 2025-03-07 11:36:18 UTC Build successful.
🟢 2025-03-07 11:36:37 UTC ydbd size 2.1 GiB changed* by +2.2 KiB, which is < 100.0 KiB vs main: OK

ydbd size dash main: 1237379 merge: a71b5d4 diff diff %
ydbd size 2 293 645 624 Bytes 2 293 647 920 Bytes +2.2 KiB +0.000%
ydbd stripped size 480 484 512 Bytes 480 484 960 Bytes +448 Bytes +0.000%

@lex007in lex007in requested a review from snaury March 7, 2025 12:27
@lex007in lex007in marked this pull request as ready for review March 7, 2025 12:27
@lex007in lex007in self-assigned this Mar 7, 2025
@lex007in lex007in requested a review from a team as a code owner March 21, 2025 10:32
@lex007in lex007in changed the title from "LocalDB: add waiting for borrow parts returning in DataCleanup logic" to "DataShard and SchemeShard: handle borrowed parts in data erasure" Mar 21, 2025
github-actions bot commented Mar 21, 2025

2025-03-21 10:33:45 UTC Pre-commit check linux-x86_64-relwithdebinfo for 4ea03fe has started.
2025-03-21 10:34:00 UTC Artifacts will be uploaded here
2025-03-21 10:36:56 UTC ya make is running...
🔴 2025-03-21 11:04:20 UTC Build failed, see the logs. Also see fail summary

github-actions bot commented Mar 21, 2025

2025-03-21 10:35:10 UTC Pre-commit check linux-x86_64-release-asan for 4ea03fe has started.
2025-03-21 10:35:24 UTC Artifacts will be uploaded here
2025-03-21 10:38:07 UTC ya make is running...
🔴 2025-03-21 11:04:28 UTC Build failed, see the logs. Also see fail summary

github-actions bot commented Mar 21, 2025

2025-03-21 11:23:32 UTC Pre-commit check linux-x86_64-release-asan for a21d5c0 has started.
2025-03-21 11:23:37 UTC Artifacts will be uploaded here
2025-03-21 11:26:29 UTC ya make is running...
🟡 2025-03-21 12:40:41 UTC Some tests failed, follow the links below. This fail is not in blocking policy yet Going to retry failed tests...

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
12162 12076 0 29 22 35

2025-03-21 12:41:49 UTC ya make is running... (failed tests rerun, try 2)
🟡 2025-03-21 12:55:39 UTC Some tests failed, follow the links below. This fail is not in blocking policy yet Going to retry failed tests...

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
124 (only retried tests) 79 0 3 8 34

2025-03-21 12:55:50 UTC ya make is running... (failed tests rerun, try 3)
🟡 2025-03-21 13:08:31 UTC Some tests failed, follow the links below. This fail is not in blocking policy yet

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
73 (only retried tests) 30 0 3 6 34

🟢 2025-03-21 13:08:38 UTC Build successful.
🟢 2025-03-21 13:09:07 UTC ydbd size 3.8 GiB changed* by +9.4 KiB, which is < 100.0 KiB vs main: OK

ydbd size dash main: 6dc2e21 merge: a21d5c0 diff diff %
ydbd size 4 073 892 992 Bytes 4 073 902 624 Bytes +9.4 KiB +0.000%
ydbd stripped size 1 409 061 352 Bytes 1 409 064 040 Bytes +2.6 KiB +0.000%

@lex007in lex007in requested a review from molotkov-and March 21, 2025 11:23
github-actions bot commented Mar 21, 2025

2025-03-21 11:30:41 UTC Pre-commit check linux-x86_64-relwithdebinfo for a21d5c0 has started.
2025-03-21 11:31:08 UTC Artifacts will be uploaded here
2025-03-21 11:34:26 UTC ya make is running...
🟡 2025-03-21 12:39:46 UTC Some tests failed, follow the links below. Going to retry failed tests...

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
19670 18329 0 5 1229 107

2025-03-21 12:41:39 UTC ya make is running... (failed tests rerun, try 2)
🟡 2025-03-21 12:53:09 UTC Some tests failed, follow the links below. Going to retry failed tests...

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
154 (only retried tests) 48 0 1 0 105

2025-03-21 12:53:21 UTC ya make is running... (failed tests rerun, try 3)
🔴 2025-03-21 13:03:41 UTC Some tests failed, follow the links below.

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
147 (only retried tests) 42 0 1 0 104

🟢 2025-03-21 13:03:50 UTC Build successful.
🟢 2025-03-21 13:04:17 UTC ydbd size 2.2 GiB changed* by +4.4 KiB, which is < 100.0 KiB vs main: OK

ydbd size dash main: 6dc2e21 merge: a21d5c0 diff diff %
ydbd size 2 314 348 712 Bytes 2 314 353 208 Bytes +4.4 KiB +0.000%
ydbd stripped size 484 361 728 Bytes 484 362 624 Bytes +896 Bytes +0.000%

github-actions bot commented Mar 21, 2025

2025-03-21 14:39:32 UTC Pre-commit check linux-x86_64-relwithdebinfo for 0e825db has started.
2025-03-21 14:40:11 UTC Artifacts will be uploaded here
2025-03-21 14:43:31 UTC ya make is running...
🟢 2025-03-21 15:49:46 UTC ydbd size 2.2 GiB changed* by +7.7 KiB, which is < 100.0 KiB vs main: OK

ydbd size dash main: 22e1472 merge: 0e825db diff diff %
ydbd size 2 315 427 832 Bytes 2 315 435 744 Bytes +7.7 KiB +0.000%
ydbd stripped size 484 584 608 Bytes 484 585 888 Bytes +1.2 KiB +0.000%

github-actions bot commented Mar 21, 2025

2025-03-21 14:47:17 UTC Pre-commit check linux-x86_64-release-asan for 0e825db has started.
2025-03-21 14:47:32 UTC Artifacts will be uploaded here
2025-03-21 14:50:27 UTC ya make is running...
🟢 2025-03-21 16:14:39 UTC ydbd size 3.8 GiB changed* by +64.8 KiB, which is < 100.0 KiB vs main: OK

ydbd size dash main: 331ebd5 merge: 0e825db diff diff %
ydbd size 4 075 664 496 Bytes 4 075 730 880 Bytes +64.8 KiB +0.002%
ydbd stripped size 1 409 660 072 Bytes 1 409 671 144 Bytes +10.8 KiB +0.001%

Response = std::make_unique<TEvDataShard::TEvForceDataCleanupResult>(
    record.GetDataCleanupGeneration(),
    Self->TabletID(),
    NKikimrTxDataShard::TEvForceDataCleanupResult::FAILED);
Member

Minor: besides a bare status with no details, it would be good to also send some kind of ErrorReason that the client side (for example, SchemeShard) could log. It also bothers me a little that this error will keep recurring until something changes on the shard, yet there is no way to learn that something has changed. Will SchemeShard repeat the request over and over?

Collaborator Author

Adding an ErrorReason string would increase the message size for no real gain, though that may not be a big deal.
Yes, SchemeShard will keep retrying, and every error that DataCleanup can return can and should be retried. We could rename the status to something explicit, e.g. FAILED -> RETRYABLE_ERROR.

Collaborator Author

I added more distinct enum values and logged them.
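
To make the outcome of this thread concrete, here is a minimal sketch of a cleanup result that carries a specific status instead of a bare FAILED, so the caller can log the reason. It is an illustration under assumptions, not the code from this PR: `ECleanupStatus`, its value names, `TCleanupResult`, `MakeCleanupResult`, `IsRetryable`, and `DescribeStatus` are all hypothetical, and the real enum lives in the `NKikimrTxDataShard` protobuf.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical status values for illustration; the names in the real
// NKikimrTxDataShard protobuf may differ.
enum class ECleanupStatus {
    Ok,
    WrongShardState,  // assumed: shard is not ready to run cleanup
    BorrowedParts,    // assumed: shard still references borrowed parts
};

// Simplified stand-in for the cleanup result message.
struct TCleanupResult {
    std::uint64_t Generation;
    std::uint64_t TabletId;
    ECleanupStatus Status;
};

// Picks the status for a cleanup request from two simplified shard flags.
inline TCleanupResult MakeCleanupResult(std::uint64_t generation, std::uint64_t tabletId,
                                        bool shardReady, bool hasBorrowedParts) {
    if (!shardReady) {
        return {generation, tabletId, ECleanupStatus::WrongShardState};
    }
    if (hasBorrowedParts) {
        return {generation, tabletId, ECleanupStatus::BorrowedParts};
    }
    return {generation, tabletId, ECleanupStatus::Ok};
}

// Every non-OK status is treated as retryable, as stated in the thread.
inline bool IsRetryable(ECleanupStatus status) {
    return status != ECleanupStatus::Ok;
}

// Short text the caller can put into its log line instead of a bare "failed".
inline std::string DescribeStatus(ECleanupStatus status) {
    switch (status) {
        case ECleanupStatus::Ok: return "OK";
        case ECleanupStatus::WrongShardState: return "WRONG_SHARD_STATE";
        case ECleanupStatus::BorrowedParts: return "BORROWED_PARTS";
    }
    return "UNKNOWN";
}
```

An enum value adds only a small integer to the message, which addresses the message-size concern raised against a free-form ErrorReason string while still giving the caller something specific to log.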

LOG_DEBUG_S(ctx, NKikimrServices::FLAT_TX_SCHEMESHARD,
    "TTxCompleteDataErasureShard: data erasure failed at DataShard #" << record.GetTabletId()
    << ", schemestard: " << Self->TabletID());
return; // will be retried after timout in the queue
Member

The comment does not make clear what this timeout is or when the request will be retried. Are you sure nothing needs to be done with it in the queue?

Collaborator Author

This queue is designed so that OnDone() must be called explicitly for tasks whose processing has finished (inside OnDone the task is removed from the queue at that point). If OnDone is not called, the timeout handler is invoked later; for data erasure it is here: https://github.com/ydb-platform/ydb/blob/main/ydb/core/tx/schemeshard/schemeshard__tenant_data_erasure_manager.cpp#L182 and it is this handler that ends up retrying. The timeout itself is set in the cleanup config.
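
The queue behaviour described in this reply can be sketched as follows. This is a simplified model, not the SchemeShard operation queue: `TRetryQueue` and its methods other than the `OnDone` name are hypothetical, time is a plain integer, and the real timeout comes from configuration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical stand-in for a queue where a task stays tracked until
// OnDone() is called; tasks that are never completed get retried by
// the timeout handler.
class TRetryQueue {
public:
    explicit TRetryQueue(std::uint64_t timeout) : Timeout(timeout) {}

    // Starts (or restarts) a task at time `now` and arms its deadline.
    void Start(std::uint64_t shardId, std::uint64_t now) {
        Running[shardId] = now + Timeout;
    }

    // Must be called explicitly on success; removes the task from the queue.
    void OnDone(std::uint64_t shardId) {
        Running.erase(shardId);
    }

    // Timeout handler: re-arms every task whose deadline has passed and
    // returns the ids that should be sent again.
    std::vector<std::uint64_t> OnTimeout(std::uint64_t now) {
        std::vector<std::uint64_t> retried;
        for (auto& entry : Running) {
            if (entry.second <= now) {
                entry.second = now + Timeout;
                retried.push_back(entry.first);
            }
        }
        return retried;
    }

    std::size_t Size() const { return Running.size(); }

private:
    std::uint64_t Timeout;
    std::map<std::uint64_t, std::uint64_t> Running;  // shard id -> deadline
};
```

In this model, returning early on a failed reply without calling `OnDone` is enough to get a retry: the task simply stays in the map until its deadline passes.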

github-actions bot commented Mar 21, 2025

2025-03-21 17:05:32 UTC Pre-commit check linux-x86_64-release-asan for 0dfc973 has started.
2025-03-21 17:05:47 UTC Artifacts will be uploaded here
2025-03-21 17:08:58 UTC ya make is running...
🟡 2025-03-21 18:55:59 UTC Some tests failed, follow the links below. This fail is not in blocking policy yet Going to retry failed tests...

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
14132 14029 0 58 9 36

2025-03-21 18:57:15 UTC ya make is running... (failed tests rerun, try 2)
🟡 2025-03-21 19:09:33 UTC Some tests failed, follow the links below. This fail is not in blocking policy yet Going to retry failed tests...

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
153 (only retried tests) 115 0 3 2 33

2025-03-21 19:09:45 UTC ya make is running... (failed tests rerun, try 3)
🟡 2025-03-21 19:21:01 UTC Some tests failed, follow the links below. This fail is not in blocking policy yet

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
64 (only retried tests) 29 0 3 1 31

🟢 2025-03-21 19:21:08 UTC Build successful.
🟢 2025-03-21 19:21:38 UTC ydbd size 3.8 GiB changed* by +10.8 KiB, which is < 100.0 KiB vs main: OK

ydbd size dash main: a95ce0d merge: 0dfc973 diff diff %
ydbd size 4 075 721 248 Bytes 4 075 732 312 Bytes +10.8 KiB +0.000%
ydbd stripped size 1 409 668 456 Bytes 1 409 671 496 Bytes +3.0 KiB +0.000%

github-actions bot commented Mar 21, 2025

2025-03-21 17:07:12 UTC Pre-commit check linux-x86_64-relwithdebinfo for 0dfc973 has started.
2025-03-21 17:07:27 UTC Artifacts will be uploaded here
2025-03-21 17:10:27 UTC ya make is running...
🟡 2025-03-21 18:44:00 UTC Some tests failed, follow the links below. Going to retry failed tests...

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
28614 26004 0 3 2492 115

2025-03-21 18:47:52 UTC ya make is running... (failed tests rerun, try 2)
🟢 2025-03-21 18:58:36 UTC Tests successful.

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
166 (only retried tests) 63 0 0 0 103

🟢 2025-03-21 18:58:46 UTC Build successful.
🟢 2025-03-21 18:59:05 UTC ydbd size 2.2 GiB changed* by +5.6 KiB, which is < 100.0 KiB vs main: OK

ydbd size dash main: a95ce0d merge: 0dfc973 diff diff %
ydbd size 2 315 431 248 Bytes 2 315 436 944 Bytes +5.6 KiB +0.000%
ydbd stripped size 484 584 992 Bytes 484 586 048 Bytes +1.0 KiB +0.000%

@lex007in lex007in requested a review from snaury March 24, 2025 07:41
github-actions bot commented Mar 26, 2025

2025-03-26 00:46:58 UTC Pre-commit check linux-x86_64-release-asan for f688c04 has started.
2025-03-26 00:47:13 UTC Artifacts will be uploaded here
2025-03-26 00:50:15 UTC ya make is running...
🟡 2025-03-26 02:33:13 UTC Some tests failed, follow the links below. This fail is not in blocking policy yet Going to retry failed tests...

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
14150 14066 0 39 11 34

2025-03-26 02:34:22 UTC ya make is running... (failed tests rerun, try 2)
🟡 2025-03-26 02:46:42 UTC Some tests failed, follow the links below. This fail is not in blocking policy yet Going to retry failed tests...

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
122 (only retried tests) 83 0 5 3 31

2025-03-26 02:46:52 UTC ya make is running... (failed tests rerun, try 3)
🟡 2025-03-26 02:58:07 UTC Some tests failed, follow the links below. This fail is not in blocking policy yet

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
65 (only retried tests) 28 0 5 4 28

🟢 2025-03-26 02:58:14 UTC Build successful.
🟢 2025-03-26 02:58:42 UTC ydbd size 3.8 GiB changed* by +10.2 KiB, which is < 100.0 KiB vs main: OK

ydbd size dash main: f7971e1 merge: f688c04 diff diff %
ydbd size 4 085 067 968 Bytes 4 085 078 456 Bytes +10.2 KiB +0.000%
ydbd stripped size 1 411 362 856 Bytes 1 411 365 576 Bytes +2.7 KiB +0.000%

github-actions bot commented Mar 26, 2025

2025-03-26 00:47:08 UTC Pre-commit check linux-x86_64-relwithdebinfo for f688c04 has started.
2025-03-26 00:47:33 UTC Artifacts will be uploaded here
2025-03-26 00:51:19 UTC ya make is running...
🟡 2025-03-26 02:30:18 UTC Some tests failed, follow the links below. Going to retry failed tests...

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
28695 26026 0 4 2558 107

2025-03-26 02:32:38 UTC ya make is running... (failed tests rerun, try 2)
🟢 2025-03-26 02:45:15 UTC Tests successful.

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
168 (only retried tests) 66 0 0 0 102

🟢 2025-03-26 02:45:22 UTC Build successful.
🟢 2025-03-26 02:45:45 UTC ydbd size 2.2 GiB changed* by +5.5 KiB, which is < 100.0 KiB vs main: OK

ydbd size dash main: f7971e1 merge: f688c04 diff diff %
ydbd size 2 322 075 280 Bytes 2 322 080 896 Bytes +5.5 KiB +0.000%
ydbd stripped size 485 431 104 Bytes 485 432 096 Bytes +992 Bytes +0.000%

github-actions bot commented Mar 26, 2025

2025-03-26 13:20:06 UTC Pre-commit check linux-x86_64-release-asan for 6d2861a has started.
2025-03-26 13:20:13 UTC Artifacts will be uploaded here
2025-03-26 13:23:26 UTC ya make is running...
🟡 2025-03-26 15:18:39 UTC Some tests failed, follow the links below. This fail is not in blocking policy yet Going to retry failed tests...

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
14175 14042 0 84 14 35

2025-03-26 15:20:03 UTC ya make is running... (failed tests rerun, try 2)
🟡 2025-03-26 15:34:54 UTC Some tests failed, follow the links below. This fail is not in blocking policy yet Going to retry failed tests...

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
178 (only retried tests) 137 0 6 3 32

2025-03-26 15:35:05 UTC ya make is running... (failed tests rerun, try 3)
🟡 2025-03-26 15:50:00 UTC Some tests failed, follow the links below. This fail is not in blocking policy yet

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
65 (only retried tests) 29 0 4 3 29

🟢 2025-03-26 15:50:09 UTC Build successful.
🟡 2025-03-26 15:50:41 UTC ydbd size 3.8 GiB changed* by +231.3 KiB, which is >= 100.0 KiB vs main: Warning

ydbd size dash main: 22ccfe5 merge: 6d2861a diff diff %
ydbd size 4 110 540 160 Bytes 4 110 777 032 Bytes +231.3 KiB +0.006%
ydbd stripped size 1 420 396 008 Bytes 1 420 475 336 Bytes +77.5 KiB +0.006%

        }
    }
    if (DataErasureManager->GetStatus() == EDataErasureStatus::IN_PROGRESS) {
        Execute(CreateTxAddEntryToDataErasure(dataErasureShards), this->ActorContext());
Member

How safe is it to do this here at all? SetPartitioning is called from a split/merge operation inside a transaction, while here some other transaction gets scheduled. The split/merge completion may commit successfully while this transaction never even gets its turn. Would the shards then simply be lost? It also bothers me that SetPartitioning is called during schemeshard loading; it seems this place was never meant to host such complex logic or actions.

Collaborator

Yes, the main concern is that SetPartitioning() is being loaded with work that does not belong to it. I have already suggested looking into how this could be reworked.

There is no need to guard against SetPartitioning() being called during TxInit: it is key to resuming the data erasure process if it was already running before the restart. That said, comments describing the dependencies in the load order of the DataErasureManager state and the table shards are clearly missing.

"this transaction never even gets its turn"

Is that the case where schemeshard restarts?
Then the data erasure status will be IN_PROGRESS, and the SetPartitioning() calls during TxInit will run and refresh the current list of shards to clean. Logically correct, but it seems heavy. On restart it would be better to do a single pass over the full shard list instead of a separate transaction per table.

github-actions bot commented Mar 26, 2025

2025-03-26 13:29:05 UTC Pre-commit check linux-x86_64-relwithdebinfo for 6d2861a has started.
2025-03-26 13:29:12 UTC Artifacts will be uploaded here
2025-03-26 13:32:30 UTC ya make is running...
🟡 2025-03-26 15:13:44 UTC Some tests failed, follow the links below. Going to retry failed tests...

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
28720 26063 0 3 2542 112

2025-03-26 15:16:29 UTC ya make is running... (failed tests rerun, try 2)
🟢 2025-03-26 15:30:53 UTC Tests successful.

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
161 (only retried tests) 55 0 0 0 106

🟢 2025-03-26 15:31:06 UTC Build successful.
🟡 2025-03-26 15:31:26 UTC ydbd size 2.2 GiB changed* by +112.3 KiB, which is >= 100.0 KiB vs main: Warning

ydbd size dash main: 22ccfe5 merge: 6d2861a diff diff %
ydbd size 2 342 731 376 Bytes 2 342 846 344 Bytes +112.3 KiB +0.005%
ydbd stripped size 489 884 928 Bytes 489 906 048 Bytes +20.6 KiB +0.004%

@lex007in lex007in requested a review from ijon March 26, 2025 14:38
Comment on lines +183 to +185
if (Self->DataErasureManager->GetStatus() == EDataErasureStatus::IN_PROGRESS) {
    Self->Execute(Self->CreateTxCancelDataErasureShards({ShardIdx}));
}
Collaborator

Why did launching CancelDataErasureShards have to be moved here?

Collaborator Author

At the previous location the tablet had not been deleted yet, so the cleanup could complete successfully before the tablet was deleted, which is bad and violates the deletion guarantees. We only get here after Hive has replied that the tablet is deleted.
(Yes, there is still the problem that Hive actually replies before it has deleted the data in BlobStorage, but we plan to address that separately.)

Collaborator

@ijon ijon left a comment

OK for now.
But follow-up development is needed.

@lex007in lex007in merged commit 4556432 into ydb-platform:main Apr 1, 2025
15 checks passed
@lex007in lex007in deleted the borrow branch April 1, 2025 14:23
@lex007in lex007in mentioned this pull request Apr 1, 2025
lex007in added a commit that referenced this pull request Apr 4, 2025
Fixes for data cleanup edge cases:
- DataShard: clean readsets in DataCleanup (#15438)
- DataShard and SchemeShard: handle borrowed parts in data erasure (#15451)