[FLINK-36931][cdc] FlinkCDC YAML supports batch mode #3812

Open · wants to merge 24 commits into base: master
3 changes: 2 additions & 1 deletion docs/content.zh/docs/core-concept/data-pipeline.md
@@ -115,4 +115,5 @@ under the License.
|-----------------|---------------------------------------|-------------------|
| name | The name of the pipeline, which will be used as the job name in the Flink cluster. | optional |
| parallelism | The global parallelism of the pipeline. Defaults to 1. | optional |
| local-time-zone | The local time zone at the job level. | optional |
| local-time-zone | The local time zone at the job level. | optional |
| batch-mode.enabled | Only use batch processing mode to synchronize the current snapshot data. | optional |
7 changes: 1 addition & 6 deletions docs/content.zh/docs/core-concept/transform.md
@@ -466,9 +466,4 @@ pipeline:
|---------------|--------|-------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| openai.model | STRING | required | Name of model to be called, for example: "text-embedding-3-small", Available options are "text-embedding-3-small", "text-embedding-3-large", "text-embedding-ada-002". |
| openai.host | STRING | required | Host of the Model server to be connected, for example: `http://langchain4j.dev/demo/openai/v1`. |
| openai.apikey | STRING | required | Api Key for verification of the Model server, for example, "demo". |

# Known limitations
* Currently, transform cannot be used together with route rules. This will be supported in future versions.
* Computed columns cannot reference trimmed columns that do not appear in the final projection results. This will be fixed in future versions.
* Regular-expression matching of tables with different schemas is not supported. If needed, multiple rules have to be written.
| openai.apikey | STRING | required | Api Key for verification of the Model server, for example, "demo". |
3 changes: 2 additions & 1 deletion docs/content/docs/core-concept/data-pipeline.md
@@ -133,4 +133,5 @@ The following config options of Data Pipeline level are supported:
|-----------------|-----------------------------------------------------------------------------------------|-------------------|
| name | The name of the pipeline, which will be submitted to the Flink cluster as the job name. | optional |
| parallelism | The global parallelism of the pipeline. Defaults to 1. | optional |
| local-time-zone | The local time zone defines current session time zone id. | optional |
| local-time-zone | The local time zone defines current session time zone id. | optional |
| batch-mode.enabled | Only use batch processing mode to synchronize the current snapshot data. | optional |
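
For orientation, a minimal pipeline definition that exercises the new option might look like the sketch below. The source and sink blocks are placeholders chosen for illustration (the connector names and option values are assumptions, not part of this change), and whether a particular source accepts batch mode depends on the `verifyBatchMode` hook introduced further down in this diff.

```yaml
# Illustrative sketch only; connector names and options are placeholders.
source:
  type: mysql
  hostname: localhost
  port: 3306
  username: app_user
  password: "***"
  tables: app_db.\.*

sink:
  type: values

pipeline:
  name: Snapshot sync of app_db
  parallelism: 2
  # Added by this change: run the job as a bounded batch job that only
  # synchronizes the current snapshot data. Defaults to false (streaming).
  batch-mode.enabled: true
```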
8 changes: 1 addition & 7 deletions docs/content/docs/core-concept/transform.md
@@ -471,10 +471,4 @@ The following built-in models are provided:
|---------------|--------|-------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| openai.model | STRING | required | Name of model to be called, for example: "text-embedding-3-small", Available options are "text-embedding-3-small", "text-embedding-3-large", "text-embedding-ada-002". |
| openai.host | STRING | required | Host of the Model server to be connected, for example: `http://langchain4j.dev/demo/openai/v1`. |
| openai.apikey | STRING | required | Api Key for verification of the Model server, for example, "demo". |


# Known limitations
* Currently, transform doesn't work with route rules. It will be supported in future versions.
* Computed columns cannot reference trimmed columns that do not present in final projection results. This will be fixed in future versions.
* Regular matching of tables with different schemas is not supported. If necessary, multiple rules need to be written.
| openai.apikey | STRING | required | Api Key for verification of the Model server, for example, "demo". |
@@ -366,6 +366,7 @@ private void testSchemaEvolutionTypesParsing(
.put("parallelism", "4")
.put("schema.change.behavior", "evolve")
.put("schema-operator.rpc-timeout", "1 h")
.put("batch-mode.enabled", "false")
.build()));

@Test
@@ -502,6 +503,7 @@ void testParsingFullDefinitionFromString() throws Exception {
.put("parallelism", "4")
.put("schema.change.behavior", "evolve")
.put("schema-operator.rpc-timeout", "1 h")
.put("batch-mode.enabled", "false")
.build()));

private final PipelineDef defWithOptional =
@@ -58,6 +58,7 @@ pipeline:
parallelism: 4
schema.change.behavior: evolve
schema-operator.rpc-timeout: 1 h
batch-mode.enabled: false
Contributor

minor: since batchMode is disabled by default, maybe we can turn it on here to verify if it could be enabled correctly?

Contributor Author

Many parameters are not effective in batch mode, so "false" is written here.

model:
model-name: GET_EMBEDDING
class-name: OpenAIEmbeddingModel
@@ -28,4 +28,10 @@ public interface DataSourceFactory extends Factory {

/** Creates a {@link DataSource} instance. */
DataSource createDataSource(Context context);

/** Checks whether this {@link DataSource} can be created in batch mode. */
default void verifyBatchMode(Context context) {
throw new IllegalArgumentException(
"Pipeline batch mode couldn't be used with this source connector.");
}
}
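
To illustrate the new hook, a source connector that supports batch execution would override `verifyBatchMode` rather than inherit the throwing default. The following sketch is hypothetical: `MyBatchableSourceFactory`, the `scan.bounded` option, and the gating logic are invented for illustration and are not part of this PR.

```java
import org.apache.flink.cdc.common.configuration.ConfigOption;
import org.apache.flink.cdc.common.configuration.ConfigOptions;
import org.apache.flink.cdc.common.factories.DataSourceFactory;
import org.apache.flink.cdc.common.source.DataSource;

import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical factory for a source connector that can run in pipeline batch mode. */
public class MyBatchableSourceFactory implements DataSourceFactory {

    // Invented option for this sketch; a real connector would gate on its own semantics.
    private static final ConfigOption<Boolean> SCAN_BOUNDED =
            ConfigOptions.key("scan.bounded")
                    .booleanType()
                    .defaultValue(false)
                    .withDescription("Whether the source only reads a bounded snapshot.");

    @Override
    public DataSource createDataSource(Context context) {
        // Construction of the actual DataSource is omitted in this sketch.
        throw new UnsupportedOperationException("Sketch only; no real source is created.");
    }

    @Override
    public void verifyBatchMode(Context context) {
        // Permit batch mode only when the source is configured as bounded, instead of
        // inheriting the default implementation that always rejects batch mode.
        boolean bounded = context.getFactoryConfiguration().get(SCAN_BOUNDED);
        if (!bounded) {
            throw new IllegalArgumentException(
                    "Pipeline batch mode requires scan.bounded = true for this source.");
        }
    }

    @Override
    public String identifier() {
        return "my-batchable-source";
    }

    @Override
    public Set<ConfigOption<?>> requiredOptions() {
        return Collections.emptySet();
    }

    @Override
    public Set<ConfigOption<?>> optionalOptions() {
        Set<ConfigOption<?>> options = new HashSet<>();
        options.add(SCAN_BOUNDED);
        return options;
    }
}
```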
@@ -45,6 +45,12 @@ public class PipelineOptions {
.defaultValue(1)
.withDescription("Parallelism of the pipeline");

public static final ConfigOption<Boolean> PIPELINE_BATCH_MODE_ENABLED =
ConfigOptions.key("batch-mode.enabled")
.booleanType()
.defaultValue(false)
.withDescription("Run pipeline job in batch mode instead of streaming mode");

public static final ConfigOption<SchemaChangeBehavior> PIPELINE_SCHEMA_CHANGE_BEHAVIOR =
ConfigOptions.key("schema.change.behavior")
.enumType(SchemaChangeBehavior.class)
@@ -17,6 +17,7 @@

package org.apache.flink.cdc.composer.flink;

import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.cdc.common.annotation.Internal;
import org.apache.flink.cdc.common.configuration.Configuration;
import org.apache.flink.cdc.common.event.Event;
@@ -111,6 +112,13 @@ private void translate(StreamExecutionEnvironment env, PipelineDef pipelineDef)
SchemaChangeBehavior schemaChangeBehavior =
pipelineDefConfig.get(PipelineOptions.PIPELINE_SCHEMA_CHANGE_BEHAVIOR);

boolean isBatchMode = pipelineDefConfig.get(PipelineOptions.PIPELINE_BATCH_MODE_ENABLED);
if (isBatchMode) {
env.setRuntimeMode(RuntimeExecutionMode.BATCH);
} else {
env.setRuntimeMode(RuntimeExecutionMode.STREAMING);
}

// Initialize translators
DataSourceTranslator sourceTranslator = new DataSourceTranslator();
TransformTranslator transformTranslator = new TransformTranslator();
@@ -145,7 +153,8 @@ private void translate(StreamExecutionEnvironment env, PipelineDef pipelineDef)
pipelineDef.getUdfs(),
pipelineDef.getModels(),
dataSource.supportedMetadataColumns(),
isParallelMetadataSource);
isParallelMetadataSource,
isBatchMode);

// PreTransform ---> PostTransform
stream =
@@ -186,6 +195,7 @@ private void translate(StreamExecutionEnvironment env, PipelineDef pipelineDef)
schemaOperatorTranslator.translateRegular(
stream,
parallelism,
isBatchMode,
dataSink.getMetadataApplier()
.setAcceptedSchemaEvolutionTypes(
pipelineDef
@@ -199,13 +209,18 @@ private void translate(StreamExecutionEnvironment env, PipelineDef pipelineDef)
stream,
parallelism,
parallelism,
isBatchMode,
schemaOperatorIDGenerator.generate(),
dataSink.getDataChangeEventHashFunctionProvider(parallelism));
}

// Schema Operator -> Sink -> X
sinkTranslator.translate(
pipelineDef.getSink(), stream, dataSink, schemaOperatorIDGenerator.generate());
pipelineDef.getSink(),
stream,
dataSink,
isBatchMode,
schemaOperatorIDGenerator.generate());
}

private void addFrameworkJars() {
@@ -33,6 +33,7 @@
import org.apache.flink.cdc.composer.definition.SinkDef;
import org.apache.flink.cdc.composer.flink.FlinkEnvironmentUtils;
import org.apache.flink.cdc.composer.utils.FactoryDiscoveryUtils;
import org.apache.flink.cdc.runtime.operators.sink.DataBatchSinkFunctionOperator;
import org.apache.flink.cdc.runtime.operators.sink.DataSinkFunctionOperator;
import org.apache.flink.cdc.runtime.operators.sink.DataSinkWriterOperatorFactory;
import org.apache.flink.core.io.SimpleVersionedSerializer;
@@ -46,6 +47,7 @@
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.SinkFunction;
import org.apache.flink.streaming.api.operators.OneInputStreamOperatorFactory;
import org.apache.flink.streaming.api.operators.StreamSink;
import org.apache.flink.streaming.api.transformations.LegacySinkTransformation;
import org.apache.flink.streaming.api.transformations.PhysicalTransformation;

@@ -82,20 +84,29 @@ public void translate(
DataStream<Event> input,
DataSink dataSink,
OperatorID schemaOperatorID) {
translate(sinkDef, input, dataSink, false, schemaOperatorID);
}

public void translate(
SinkDef sinkDef,
DataStream<Event> input,
DataSink dataSink,
boolean isBatchMode,
OperatorID schemaOperatorID) {
// Get sink provider
EventSinkProvider eventSinkProvider = dataSink.getEventSinkProvider();
String sinkName = generateSinkName(sinkDef);
if (eventSinkProvider instanceof FlinkSinkProvider) {
// Sink V2
FlinkSinkProvider sinkProvider = (FlinkSinkProvider) eventSinkProvider;
Sink<Event> sink = sinkProvider.getSink();
sinkTo(input, sink, sinkName, schemaOperatorID);
sinkTo(input, sink, sinkName, isBatchMode, schemaOperatorID);
} else if (eventSinkProvider instanceof FlinkSinkFunctionProvider) {
// SinkFunction
FlinkSinkFunctionProvider sinkFunctionProvider =
(FlinkSinkFunctionProvider) eventSinkProvider;
SinkFunction<Event> sinkFunction = sinkFunctionProvider.getSinkFunction();
sinkTo(input, sinkFunction, sinkName, schemaOperatorID);
sinkTo(input, sinkFunction, sinkName, isBatchMode, schemaOperatorID);
}
}

@@ -104,6 +115,7 @@ void sinkTo(
DataStream<Event> input,
Sink<Event> sink,
String sinkName,
boolean isBatchMode,
OperatorID schemaOperatorID) {
DataStream<Event> stream = input;
// Pre-write topology
@@ -112,22 +124,27 @@ void sinkTo(
}

if (sink instanceof TwoPhaseCommittingSink) {
addCommittingTopology(sink, stream, sinkName, schemaOperatorID);
addCommittingTopology(sink, stream, sinkName, isBatchMode, schemaOperatorID);
} else {
stream.transform(
SINK_WRITER_PREFIX + sinkName,
CommittableMessageTypeInfo.noOutput(),
new DataSinkWriterOperatorFactory<>(sink, schemaOperatorID));
new DataSinkWriterOperatorFactory<>(sink, isBatchMode, schemaOperatorID));
}
}

private void sinkTo(
DataStream<Event> input,
SinkFunction<Event> sinkFunction,
String sinkName,
boolean isBatchMode,
OperatorID schemaOperatorID) {
DataSinkFunctionOperator sinkOperator =
new DataSinkFunctionOperator(sinkFunction, schemaOperatorID);
StreamSink<Event> sinkOperator;
if (isBatchMode) {
sinkOperator = new DataBatchSinkFunctionOperator(sinkFunction);
} else {
sinkOperator = new DataSinkFunctionOperator(sinkFunction, schemaOperatorID);
}
final StreamExecutionEnvironment executionEnvironment = input.getExecutionEnvironment();
PhysicalTransformation<Event> transformation =
new LegacySinkTransformation<>(
@@ -143,23 +160,23 @@ private <CommT> void addCommittingTopology(
Sink<Event> sink,
DataStream<Event> inputStream,
String sinkName,
boolean isBatchMode,
OperatorID schemaOperatorID) {
TypeInformation<CommittableMessage<CommT>> typeInformation =
CommittableMessageTypeInfo.of(() -> getCommittableSerializer(sink));
DataStream<CommittableMessage<CommT>> written =
inputStream.transform(
SINK_WRITER_PREFIX + sinkName,
typeInformation,
new DataSinkWriterOperatorFactory<>(sink, schemaOperatorID));
new DataSinkWriterOperatorFactory<>(sink, isBatchMode, schemaOperatorID));

DataStream<CommittableMessage<CommT>> preCommitted = written;
if (sink instanceof WithPreCommitTopology) {
preCommitted =
((WithPreCommitTopology<Event, CommT>) sink).addPreCommitTopology(written);
}

// TODO: Hard coding stream mode and checkpoint
boolean isBatchMode = false;
// TODO: Hard coding checkpoint
boolean isCheckpointingEnabled = true;
Contributor

Just curious, is it possible to enable checkpointing in batch mode?

Contributor Author

Some sinks need to rely on checkpointing to complete the writing.

DataStream<CommittableMessage<CommT>> committed =
preCommitted.transform(
@@ -84,11 +84,13 @@ public DataSource createDataSource(
// Add source JAR to environment
FactoryDiscoveryUtils.getJarPathByIdentifier(sourceFactory)
.ifPresent(jar -> FlinkEnvironmentUtils.addJar(env, jar));
return sourceFactory.createDataSource(
FactoryHelper.DefaultContext context =
new FactoryHelper.DefaultContext(
sourceDef.getConfig(),
pipelineConfig,
Thread.currentThread().getContextClassLoader()));
Thread.currentThread().getContextClassLoader());
sourceFactory.verifyBatchMode(context);
return sourceFactory.createDataSource(context);
}

private String generateDefaultSourceName(SourceDef sourceDef) {
@@ -26,11 +26,13 @@
import org.apache.flink.cdc.runtime.partitioning.PartitioningEvent;
import org.apache.flink.cdc.runtime.partitioning.PartitioningEventKeySelector;
import org.apache.flink.cdc.runtime.partitioning.PostPartitionProcessor;
import org.apache.flink.cdc.runtime.partitioning.RegularPrePartitionBatchOperator;
import org.apache.flink.cdc.runtime.partitioning.RegularPrePartitionOperator;
import org.apache.flink.cdc.runtime.typeutils.EventTypeInfo;
import org.apache.flink.cdc.runtime.typeutils.PartitioningEventTypeInfo;
import org.apache.flink.runtime.jobgraph.OperatorID;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;

/**
* Translator used to build {@link RegularPrePartitionOperator} or {@link
@@ -46,15 +48,40 @@ public DataStream<Event> translateRegular(
int downstreamParallelism,
OperatorID schemaOperatorID,
HashFunctionProvider<DataChangeEvent> hashFunctionProvider) {
return input.transform(
"PrePartition",
new PartitioningEventTypeInfo(),
new RegularPrePartitionOperator(
schemaOperatorID, downstreamParallelism, hashFunctionProvider))
.setParallelism(upstreamParallelism)
.partitionCustom(new EventPartitioner(), new PartitioningEventKeySelector())
.map(new PostPartitionProcessor(), new EventTypeInfo())
.name("PostPartition");
return translateRegular(
input,
upstreamParallelism,
downstreamParallelism,
false,
schemaOperatorID,
hashFunctionProvider);
}

public DataStream<Event> translateRegular(
DataStream<Event> input,
int upstreamParallelism,
int downstreamParallelism,
boolean isBatchMode,
OperatorID schemaOperatorID,
HashFunctionProvider<DataChangeEvent> hashFunctionProvider) {
SingleOutputStreamOperator<Event> singleOutputStreamOperator =
input.transform(
isBatchMode ? "BatchPrePartition" : "PrePartition",
new PartitioningEventTypeInfo(),
isBatchMode
? new RegularPrePartitionBatchOperator(
downstreamParallelism, hashFunctionProvider)
: new RegularPrePartitionOperator(
schemaOperatorID,
downstreamParallelism,
hashFunctionProvider))
.setParallelism(upstreamParallelism)
.partitionCustom(new EventPartitioner(), new PartitioningEventKeySelector())
.map(new PostPartitionProcessor(), new EventTypeInfo())
.name(isBatchMode ? "BatchPostPartition" : "PostPartition");
return isBatchMode
? singleOutputStreamOperator.setParallelism(downstreamParallelism)
: singleOutputStreamOperator;
}

public DataStream<PartitioningEvent> translateDistributed(