[v24.3.x] datalake: shutdown datalake managers before stopping partitions #25901

mmaslankaprv
Member

Backport of PR #25307

Datalake managers may cache shared pointers to the state machines.
When partitions are stopped, the `raft::consensus` instance is destroyed.
This may lead to a situation in which a coordinator accesses a state
machine with a dangling pointer.

This needs a more systematic solution in the state machine itself, but
to fix the problem quickly and manage the service lifecycle correctly,
we simply stop the manager before stopping all partitions. This way, no
datalake service will try to access a `raft::consensus` instance that
has already been removed.
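
The ordering described above can be illustrated with a minimal, self-contained sketch. It is not the actual Redpanda code: `consensus`, `state_machine`, `partition`, `datalake_manager`, `partition_manager`, and `shutdown` are hypothetical stand-ins, and the sketch assumes the state machine holds a non-owning pointer to its consensus instance. It only shows why the component caching the state machines should stop first.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for raft::consensus; not the real Redpanda type.
struct consensus {};

// Hypothetical state machine. Assumption: it keeps a non-owning pointer to
// consensus, which dangles once the owning partition is destroyed.
struct state_machine {
    explicit state_machine(consensus* c) : raft(c) {}
    consensus* raft;
};

// A partition owns its consensus instance and shares its state machine.
struct partition {
    partition()
      : raft(std::make_unique<consensus>())
      , stm(std::make_shared<state_machine>(raft.get())) {}
    std::unique_ptr<consensus> raft;
    std::shared_ptr<state_machine> stm;
};

// Hypothetical manager that caches shared pointers to state machines. The
// cached shared_ptr keeps the state machine alive, but not its consensus.
struct datalake_manager {
    std::vector<std::shared_ptr<state_machine>> cached;
    void track(std::shared_ptr<state_machine> stm) {
        cached.push_back(std::move(stm));
    }
    void stop(std::vector<std::string>& log) {
        cached.clear(); // drop every cached state machine
        log.push_back("datalake_manager stopped");
    }
};

struct partition_manager {
    std::vector<std::unique_ptr<partition>> partitions;
    void stop_partitions(std::vector<std::string>& log) {
        partitions.clear(); // destroys each partition's consensus
        log.push_back("partitions stopped");
    }
};

// Stop the manager first, so nothing can reach a state machine whose
// consensus has already been destroyed.
std::vector<std::string> shutdown(datalake_manager& dl, partition_manager& pm) {
    std::vector<std::string> log;
    dl.stop(log);
    pm.stop_partitions(log);
    return log;
}
```

With the calls in `shutdown` reversed, `cached` would briefly hold state machines whose `raft` pointer refers to a destroyed object, which is the window this change closes.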

Fixes: CORE-8380

Signed-off-by: Michał Maślanka <[email protected]>
(cherry picked from commit 41fa87b)
@mmaslankaprv mmaslankaprv added this to the v24.3.x-next milestone Apr 23, 2025
@mmaslankaprv mmaslankaprv added the kind/backport PRs targeting a stable branch label Apr 23, 2025
@mmaslankaprv mmaslankaprv marked this pull request as ready for review April 23, 2025 14:15
@mmaslankaprv mmaslankaprv enabled auto-merge April 23, 2025 14:15
@vbotbuildovich
Collaborator

vbotbuildovich commented Apr 23, 2025

Retry command for Build#64949

please wait until all jobs are finished before running the slash command


/ci-repeat 1
tests/rptest/tests/datalake/compaction_gaps_test.py::CompactionGapsTest.test_translation_no_gaps@{"cloud_storage_type":1}
tests/rptest/tests/random_node_operations_test.py::RandomNodeOperationsTest.test_node_operations@{"cloud_storage_type":1,"enable_failures":false,"mixed_versions":false,"with_iceberg":true,"with_tiered_storage":true}

@vbotbuildovich
Collaborator

CI test results

test results on build#64949
| test_id | test_kind | job_url | test_status | passed |
| --- | --- | --- | --- | --- |
| gtest_raft_rpunit.gtest_raft_rpunit | unit | https://buildkite.com/redpanda/redpanda/builds/64949#01966301-e49e-48d9-8c72-82b976999902 | FLAKY | 1/2 |
| rptest.tests.compaction_recovery_test.CompactionRecoveryTest.test_index_recovery | ducktape | https://buildkite.com/redpanda/redpanda/builds/64949#01966346-8b89-4424-b610-662e720b950b | FLAKY | 17/21 |
| rptest.tests.datalake.compaction_gaps_test.CompactionGapsTest.test_translation_no_gaps.cloud_storage_type=CloudStorageType.S3 | ducktape | https://buildkite.com/redpanda/redpanda/builds/64949#01966346-8b8a-4fec-bae5-6bade3994655 | FLAKY | 12/21 |
| translator_test_rpfixture.translator_test_rpfixture | unit | https://buildkite.com/redpanda/redpanda/builds/64949#01966301-e49f-4ce9-9e21-7ce0af0abced | FLAKY | 1/2 |

@andrwng
Contributor

andrwng commented Apr 23, 2025

For posterity, seems like the conflict is a log line that didn't exist in this branch and is now pulled in

@mmaslankaprv mmaslankaprv disabled auto-merge April 24, 2025 06:23
@mmaslankaprv mmaslankaprv merged commit 8e1205c into redpanda-data:v24.3.x Apr 24, 2025
14 of 17 checks passed
@piyushredpanda piyushredpanda modified the milestones: v24.3.x-next, v24.3.11 Apr 24, 2025
Labels
area/redpanda kind/backport PRs targeting a stable branch