
[FLINK-37366] Allow configurable retry for Kafka topic metadata fetch #155

Status: Open · wants to merge 1 commit into base: main
Conversation

@suez1224 commented Feb 26, 2025

  • Add topic.metadata.max.retry and topic.metadata.retry.interval.ms options to configure Kafka topic metadata fetch retries (see the usage sketch below).
  • Implement the Kafka topic metadata fetch failure retry logic in KafkaSubscriberUtils.getTopicMetadata().
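For context, a minimal usage sketch of the proposed options (the values, topic, and bootstrap address are illustrative; it assumes the options are read from the regular source properties, matching the KafkaSourceOptions.getOption calls in the diff below):

    import java.util.Properties;
    import org.apache.flink.api.common.serialization.SimpleStringSchema;
    import org.apache.flink.connector.kafka.source.KafkaSource;

    Properties props = new Properties();
    // Retry the metadata fetch up to 5 times, waiting 1 s between attempts.
    props.setProperty("topic.metadata.max.retry", "5");
    props.setProperty("topic.metadata.retry.interval.ms", "1000");

    KafkaSource<String> source =
            KafkaSource.<String>builder()
                    .setBootstrapServers("localhost:9092")
                    .setTopics("input-topic")
                    .setValueOnlyDeserializer(new SimpleStringSchema())
                    .setProperties(props)
                    .build();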


boring-cyborg bot commented Feb 26, 2025

Thanks for opening this pull request! Please check out our contributing guidelines. (https://flink.apache.org/contributing/how-to-contribute.html)

@suez1224 force-pushed the FLINK-37366 branch 3 times, most recently from 926456f to 308cc0d, on February 26, 2025 at 08:24
@AHeise (Contributor) left a comment

Before doing a detailed review, I have a fundamental question on the approach, which I inlined. PTAL.

Comment on lines +68 to +99
    static Map<String, TopicDescription> getTopicMetadata(
            AdminClient adminClient, Set<String> topicNames, Properties properties) {
        // Read the retry settings from the user-provided source properties.
        int maxRetries =
                KafkaSourceOptions.getOption(
                        properties,
                        KafkaSourceOptions.TOPIC_METADATA_REQUEST_MAX_RETRY,
                        Integer::parseInt);
        long retryDelay =
                KafkaSourceOptions.getOption(
                        properties,
                        KafkaSourceOptions.TOPIC_METADATA_REQUEST_RETRY_INTERVAL_MS,
                        Long::parseLong);
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            try {
                return adminClient.describeTopics(topicNames).allTopicNames().get();
            } catch (Exception e) {
                if (attempt == maxRetries) {
                    throw new RuntimeException(
                            String.format("Failed to get metadata for topics %s.", topicNames),
                            e);
                } else {
                    LOG.warn(
                            "Attempt {} to get metadata for topics {} failed. Retrying in {} ms...",
                            attempt,
                            topicNames,
                            retryDelay);
                    try {
                        TimeUnit.MILLISECONDS.sleep(retryDelay);
                    } catch (InterruptedException ie) {
                        Thread.currentThread().interrupt(); // Restore interrupted state
                        LOG.error("Thread was interrupted during metadata fetch retry delay.", ie);
                    }
                }
            }
        }
        // Unreachable: the final attempt either returns or throws above.
        throw new IllegalStateException("Unexpected fall-through in metadata fetch retry loop.");
    }
@AHeise (Contributor) commented:

Instead of introducing our own retry logic, can't we reuse what's already implemented in the KafkaAdmin?

    public static final String RETRIES_CONFIG = "retries";
    public static final String RETRIES_DOC = "Setting a value greater than zero will cause the client to resend any request that fails with a potentially transient error." +
        " It is recommended to set the value to either zero or `MAX_VALUE` and use corresponding timeout parameters to control how long a client should retry a request.";

    public static final String RETRY_BACKOFF_MS_CONFIG = "retry.backoff.ms";
    public static final String RETRY_BACKOFF_MS_DOC = "The amount of time to wait before attempting to retry a failed request to a given topic partition. This avoids repeatedly sending requests in a tight loop under some failure scenarios.";

You should be able to set it even without your PR, like this:

  properties.retries = '10',
  properties.retry.backoff.ms = '30000',

But that also influences the consumer retry behavior.

We could think about supporting properties.admin.retry = 10. WDYT?
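For illustration, one shape that idea could take inside the connector: strip an admin. prefix off the user properties when building the AdminClient configuration, so prefixed keys override the shared client settings for the admin client only. This is a purely hypothetical sketch; the helper name and the prefix convention are assumptions, not existing behavior.

    import java.util.Properties;

    // Hypothetical helper: "admin.<key>" overrides "<key>" for the AdminClient
    // only (e.g. "admin.retries" -> "retries"), while unprefixed keys keep
    // applying to both the consumer and the admin client.
    static Properties adminClientProperties(Properties userProps) {
        Properties adminProps = new Properties();
        adminProps.putAll(userProps);
        for (String name : userProps.stringPropertyNames()) {
            if (name.startsWith("admin.")) {
                adminProps.setProperty(
                        name.substring("admin.".length()), userProps.getProperty(name));
                adminProps.remove(name);
            }
        }
        return adminProps;
    }

That way properties.retries would keep steering the consumer, while properties.admin.retries would steer only the metadata requests.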

@suez1224 (Author) commented:

Thanks for the suggestion, @AHeise. However, the default value for the retries config is already set to MAX_INT (see the Kafka code and the Confluent docs), and I believe Flink does not overwrite that value. Yet without my PR, the Flink job fails as soon as the metadata request fails, so I don't think this config controls the retry behavior for failed metadata requests from the AdminClient.

@AHeise (Contributor) commented:

Could you please share the failure? If it's something that the AdminClient doesn't deem retriable, then it's obvious that we need your solution.

It may actually be worth drafting a test that fails without your fix and succeeds with it. We have some tests that remove the Kafka broker (search for stopBroker).
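For illustration, a rough shape such a test could take. Everything here is hypothetical: stopBroker/startBroker stand in for the existing broker-control test utilities, adminClient is assumed to come from the test base, and assertions use AssertJ.

    import static org.assertj.core.api.Assertions.assertThat;

    import java.util.Map;
    import java.util.Properties;
    import java.util.Set;
    import java.util.concurrent.CompletableFuture;
    import java.util.concurrent.TimeUnit;
    import org.apache.kafka.clients.admin.TopicDescription;
    import org.junit.jupiter.api.Test;

    @Test
    void topicMetadataFetchSurvivesBrokerOutage() throws Exception {
        Properties props = new Properties();
        props.setProperty("topic.metadata.max.retry", "10");
        props.setProperty("topic.metadata.retry.interval.ms", "500");

        // Take the broker down, then bring it back while the fetch is retrying.
        stopBroker();
        CompletableFuture.delayedExecutor(2, TimeUnit.SECONDS).execute(this::startBroker);

        // Without the fix this throws on the first failed attempt; with it,
        // the call keeps retrying until the broker is back.
        Map<String, TopicDescription> metadata =
                KafkaSubscriberUtils.getTopicMetadata(adminClient, Set.of("test-topic"), props);
        assertThat(metadata).containsKey("test-topic");
    }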
