[FLINK-36611][pipeline-connector][kafka] Add schema info to output of Kafka sink #3791

MOBIN-F · 2024-12-11T02:57:05Z

Currently, the Kafka sink's output in Debezium JSON format looks like this:

{
  "before": {
    "id": 4,
    "name": "John",
    "address": "New York",
    "phone_number": "2222",
    "age": 12
  },
  "after": {
    "id": 4,
    "name": "John",
    "address": "New York",
    "phone_number": "1234",
    "age": 12
  },
  "op": "u",
  "source": {
    "db": null,
    "table": "customers"
  }
}

It contains the full before/after record data and the database/table info, but no schema info.

However, in some scenarios we need this information to determine the data types. For example, Paimon's Kafka CDC source requires this type information; otherwise all types are treated as String. See https://paimon.apache.org/docs/0.9/flink/cdc-ingestion/kafka-cdc/#supported-formats.

Because including the schema increases the size of each message, I suggest adding a parameter that controls whether it is enabled.
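A minimal sketch of the proposed opt-in behavior, assuming the usual Kafka Connect style layout in which the schema sits next to the record under `schema` and `payload`. The class name, method name, and `includeSchema` flag are hypothetical and not taken from this PR, and the schema content is heavily simplified.

```java
/** Illustrative only: class, method, and flag names are hypothetical, not from the PR. */
final class SchemaEnvelopeSketch {

    /**
     * Wraps an already-serialized record in a {"schema": ..., "payload": ...} envelope
     * when includeSchema is true; otherwise returns the record unchanged.
     */
    static String wrap(String recordJson, String schemaJson, boolean includeSchema) {
        if (!includeSchema) {
            return recordJson; // disabled: no extra bytes per message
        }
        return "{\"schema\":" + schemaJson + ",\"payload\":" + recordJson + "}";
    }

    public static void main(String[] args) {
        String record = "{\"before\":null,\"after\":{\"id\":4},\"op\":\"c\"}";
        // Simplified schema; a real one would describe the whole record structure.
        String schema = "{\"type\":\"struct\",\"fields\":[{\"field\":\"id\",\"type\":\"int32\"}]}";
        System.out.println(wrap(record, schema, false));
        System.out.println(wrap(record, schema, true));
    }
}
```

With the flag off the record is emitted as before, so existing consumers are unaffected and only users who need type information pay the size cost.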

...um/src/main/java/org/apache/flink/cdc/debezium/event/DebeziumEventDeserializationSchema.java

MOBIN-F · 2024-12-11T03:05:14Z

.../apache/flink/cdc/connectors/kafka/json/debezium/DebeziumJsonRowDataSerializationSchema.java

+                // escape characters such as "\"
+                String schemaValue = node.get("schema").asText();
+                JsonNode schemaNode = mapper.readTree(schemaValue);
+                node.set("schema", schemaNode);


Because the schema is passed downstream as a string and itself contains nested JSON, putting that string directly into a JsonNode leaves escaped quotes (\") in the output. Reading the value with JsonNode.asText() and re-parsing it into a node avoids this.
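The escaping problem can be shown without any JSON library. The sketch below is an illustration only: the class and method names are hypothetical, and the `quote` helper handles just backslashes and double quotes. The diff above does the equivalent with Jackson by reading the text value and parsing it again with `readTree`.

```java
/** Illustrative only: shows why a JSON document stored as a text value gets escaped. */
final class NestedJsonEscapingSketch {

    /** Minimal JSON string quoting: escapes backslashes and double quotes only. */
    static String quote(String raw) {
        return "\"" + raw.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    /** Embeds the schema as a text value, which is what produces the escaped quotes. */
    static String embedAsText(String schemaJson) {
        return "{\"schema\":" + quote(schemaJson) + "}";
    }

    /** Embeds the schema as a nested object, the shape a downstream reader can parse directly. */
    static String embedAsObject(String schemaJson) {
        return "{\"schema\":" + schemaJson + "}";
    }

    public static void main(String[] args) {
        String schema = "{\"type\":\"struct\"}";
        System.out.println(embedAsText(schema));   // schema value is one escaped string
        System.out.println(embedAsObject(schema)); // schema value is a nested object
    }
}
```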

lvyanquan · 2025-02-20T03:02:40Z

Hi, @MOBIN-F. Is there any blocker for this being ready?

flink-cdc-common/src/main/java/org/apache/flink/cdc/common/event/DataChangeEvent.java

...mysql/src/main/java/org/apache/flink/cdc/connectors/mysql/source/MySqlDataSourceOptions.java

...sql-cdc/src/main/java/org/apache/flink/cdc/connectors/mysql/table/MySqlReadableMetadata.java

flink-cdc-connect/flink-cdc-pipeline-connectors/flink-cdc-pipeline-connector-kafka/pom.xml

...ava/org/apache/flink/cdc/connectors/kafka/json/debezium/DebeziumJsonSerializationSchema.java

...tor-kafka/src/main/java/org/apache/flink/cdc/connectors/kafka/sink/KafkaDataSinkOptions.java

...tor-kafka/src/main/java/org/apache/flink/cdc/connectors/kafka/sink/KafkaDataSinkFactory.java

...fka/src/main/java/org/apache/flink/cdc/connectors/kafka/json/ChangeLogJsonFormatFactory.java

...ava/org/apache/flink/cdc/connectors/kafka/json/debezium/DebeziumJsonSerializationSchema.java

lvyanquan · 2025-03-03T02:48:08Z

...ava/org/apache/flink/cdc/connectors/kafka/json/debezium/DebeziumJsonSerializationSchema.java

+                case VARCHAR:
+                case VARBINARY:
+                default:
+                    field = SchemaBuilder.string();


This default branch also covers ARRAY/MAP/ROW. There may be issues if they are all converted to SchemaBuilder.string().
Please check whether there is a better type; if there is not, this is acceptable.

debezium-json does not seem to receive ARRAY/MAP/ROW data, because MySQL does not support these types.
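To make the fallback under discussion concrete, here is a small sketch. The `TypeRoot` enum is a hypothetical subset of CDC types, and the mappings for BOOLEAN, INTEGER, BIGINT, and DOUBLE are assumptions for illustration; only the VARCHAR/VARBINARY/default branch mirrors the diff above. The returned names follow Kafka Connect's lowercase schema type names.

```java
/** Illustrative only: the enum and the non-default mappings are assumptions, not the PR's code. */
final class ConnectTypeFallbackSketch {

    /** Hypothetical subset of CDC type roots. */
    enum TypeRoot { BOOLEAN, INTEGER, BIGINT, DOUBLE, VARCHAR, VARBINARY, ARRAY, MAP, ROW }

    /** Returns a Connect-style schema type name; any type without its own case becomes "string". */
    static String connectTypeName(TypeRoot root) {
        switch (root) {
            case BOOLEAN:
                return "boolean";
            case INTEGER:
                return "int32";
            case BIGINT:
                return "int64";
            case DOUBLE:
                return "float64";
            case VARCHAR:
            case VARBINARY:
            default:
                // ARRAY, MAP, and ROW end up here because they have no dedicated case.
                return "string";
        }
    }

    public static void main(String[] args) {
        for (TypeRoot root : TypeRoot.values()) {
            System.out.println(root + " -> " + connectTypeName(root));
        }
    }
}
```

If a source that does produce nested types is ever paired with this format, adding explicit cases ahead of the default would be the place to change the mapping.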

lvyanquan · 2025-03-03T02:55:25Z

Thanks @MOBIN-F for this update.
Overall, this modification looks good to me; I left some comments about the code structure.

Additionally, you can manually test the data types with the Paimon action to verify correctness.

MOBIN-F · 2025-03-05T07:20:32Z

docs/content/docs/connectors/pipeline-connectors/kafka.md

 <div class="wy-table-responsive">
 <table class="colwidths-auto docutils">
    <thead>
      <tr>
        <th class="text-left">CDC type</th>
        <th class="text-left">JSON type</th>
+        <th class="text-left">Literal type</th>
+        <th class="text-left">Semantic type</th>


Do we need to add Literal/Semantic type to the documentation? It seems like users won't care about it? @lvyanquan

~~Yes, I suggest not adding this, as users won't pay much attention to it.~~

I think adding more information is not a disadvantage, so we can keep them.

MOBIN-F · 2025-03-05T07:30:41Z

> Thanks @MOBIN-F for this update. Overall, this modification looks good to me; I left some comments about the code structure.
>
> Additionally, you can manually test the data types with the Paimon action to verify correctness.

I tested the chain [pipeline-kafka -> Kafka -> Paimon action -> Paimon table], and it worked as expected. Maybe I can add some e2e tests later.

.../apache/flink/cdc/connectors/kafka/json/debezium/DebeziumJsonRowDataSerializationSchema.java

docs/content/docs/connectors/pipeline-connectors/kafka.md

lvyanquan

LGTM.

Left some minor comments.

flink-cdc-connect/flink-cdc-pipeline-connectors/flink-cdc-pipeline-connector-kafka/pom.xml

lvyanquan

LGTM.

lvyanquan · 2025-03-13T11:04:44Z

Hi @leonardBang @ruanhang1993, could you help review this?

leonardBang

Thanks @MOBIN-F and @lvyanquan for the contribution. LGTM; waiting for CI to turn green.

MOBIN-F · 2025-03-14T09:38:31Z

CI failed, but the failure is unrelated to this PR, which only touches the kafka-pipeline module.

github-actions bot added values-pipeline-connector common runtime mysql-cdc-connector base e2e-tests mysql-pipeline-connector kafka-pipeline-connector labels Dec 11, 2024

MOBIN-F commented Dec 11, 2024

...um/src/main/java/org/apache/flink/cdc/debezium/event/DebeziumEventDeserializationSchema.java

MOBIN-F commented Dec 11, 2024

github-actions bot removed the base label Dec 12, 2024

MOBIN-F marked this pull request as ready for review February 20, 2025 03:10

lvyanquan reviewed Feb 20, 2025

flink-cdc-common/src/main/java/org/apache/flink/cdc/common/event/DataChangeEvent.java

lvyanquan reviewed Feb 20, 2025

...mysql/src/main/java/org/apache/flink/cdc/connectors/mysql/source/MySqlDataSourceOptions.java

MOBIN-F force-pushed the release-support-debezium-json-include-schema branch from 61ba472 to 1a2f93e Compare February 22, 2025 12:28

github-actions bot removed values-pipeline-connector common runtime mysql-pipeline-connector labels Feb 22, 2025

lvyanquan reviewed Feb 24, 2025

...sql-cdc/src/main/java/org/apache/flink/cdc/connectors/mysql/table/MySqlReadableMetadata.java

MOBIN-F added 3 commits February 28, 2025 18:51

support pipeline-connector-kafka output debezium schema info

fcfc130

Fix compatibility with Flink 1.19/1.20 RowDataToJsonConverters

ed149c6

add test

b7632e6

MOBIN-F force-pushed the release-support-debezium-json-include-schema branch from 1a2f93e to b7632e6 Compare February 28, 2025 10:53

github-actions bot removed the mysql-cdc-connector label Feb 28, 2025

lvyanquan reviewed Mar 3, 2025

...ava/org/apache/flink/cdc/connectors/kafka/json/debezium/DebeziumJsonSerializationSchema.java

lvyanquan reviewed Mar 3, 2025

MOBIN-F added 2 commits March 5, 2025 15:12

fix

e9d201b

add docs

03aa9bb

github-actions bot added the docs Improvements or additions to documentation label Mar 5, 2025

MOBIN-F commented Mar 5, 2025

lvyanquan reviewed Mar 12, 2025

.../apache/flink/cdc/connectors/kafka/json/debezium/DebeziumJsonRowDataSerializationSchema.java

docs/content/docs/connectors/pipeline-connectors/kafka.md

lvyanquan approved these changes Mar 12, 2025

github-actions bot added the reviewed label Mar 12, 2025

lvyanquan suggested changes Mar 12, 2025

flink-cdc-connect/flink-cdc-pipeline-connectors/flink-cdc-pipeline-connector-kafka/pom.xml

github-actions bot removed the reviewed label Mar 12, 2025

fix comment

395a201

lvyanquan approved these changes Mar 13, 2025

github-actions bot added the reviewed label Mar 13, 2025

leonardBang approved these changes Mar 13, 2025

github-actions bot added the approved label Mar 13, 2025
