Optimize time between job creation and handling #11231
A quick summary (you can read more in the prototype issue): I think the prototype was a success. There are still many issues to solve, but it's clear the approach is much faster.
However, it's a more complex approach than simple polling, and we would likely want to continue supporting polling for the foreseeable future, which increases the complexity of the application. That said, the happy path is much, much faster, so it's worth pursuing.
I've updated the issue with a tentative breakdown for the alpha target. Next week, @deepthidevaki, @koevskinikola, and I will fix the scope for the alpha and production-ready targets and finalize a provisional timeline of deliverables. This will be a soft timeline to help us better organize our time, not a fixed one.
Alpha scope
Clients will be able to open so-called job streams. A job stream will target a specific job type and will define a set of job activation properties identical to those of the existing job worker. A client job stream will consist of a long-lived unidirectional stream between the client and the gateway, on which the gateway will push activated jobs. The gateway will aggregate clients by their activation properties: for the first client with logically equivalent activation properties, it will open a stream to each broker for this job type, passing these properties and a unique ID identifying the aggregated stream. Clients which open a stream after that, on this gateway, will not cause the gateway to send a request to the broker. When the last client closes a job stream of a given type, the gateway will send a request to the brokers to close its aggregated stream for that type.
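To make the aggregation concrete, here is a minimal sketch of gateway-side bookkeeping under the scheme described above. All names (`StreamAggregator`, `AggregatedStream`, `BrokerApi`, and so on) are hypothetical illustrations, not the actual implementation:

```java
import java.util.Map;
import java.util.Set;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArraySet;

/** Hypothetical sketch of gateway-side aggregation of client job streams. */
final class StreamAggregator {

  /** Streams are aggregated by job type plus activation properties. */
  record StreamKey(String jobType, String workerName, long timeoutMillis, Set<String> fetchVariables) {}

  record AggregatedStream(UUID id, Set<ClientStream> clients) {}

  /** A single client's long-lived unidirectional stream. */
  interface ClientStream {
    void push(byte[] activatedJob);
  }

  /** Hypothetical broker-facing API: open/close requests for aggregated streams. */
  interface BrokerApi {
    void openStream(StreamKey key, UUID aggregatedStreamId); // sent to every broker
    void closeStream(UUID aggregatedStreamId);
  }

  private final Map<StreamKey, AggregatedStream> streams = new ConcurrentHashMap<>();
  private final BrokerApi brokers;

  StreamAggregator(BrokerApi brokers) {
    this.brokers = brokers;
  }

  void addClient(StreamKey key, ClientStream client) {
    streams.compute(key, (k, aggregated) -> {
      if (aggregated == null) {
        // First logically equivalent client: register with the brokers exactly once.
        aggregated = new AggregatedStream(UUID.randomUUID(), new CopyOnWriteArraySet<>());
        brokers.openStream(k, aggregated.id());
      }
      aggregated.clients().add(client);
      return aggregated;
    });
  }

  void removeClient(StreamKey key, ClientStream client) {
    streams.computeIfPresent(key, (k, aggregated) -> {
      aggregated.clients().remove(client);
      if (aggregated.clients().isEmpty()) {
        // Last client closed its stream: tell the brokers to drop the aggregated stream.
        brokers.closeStream(aggregated.id());
        return null;
      }
      return aggregated;
    });
  }
}
```

The important property is that broker interactions happen only on the first open and the last close of a logically equivalent stream.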
The broker will have a new top-level service: the job stream service. It consists of a shared job stream registry, an API request handler to add/remove streams from the registry, and a streaming client to push activated jobs to registered streams. The job stream registry will provide a synchronous, thread-safe API to fetch a job stream, if any, for a given job type. It must be synchronous because the engine accesses it during command processing. A job stream in the registry will consist of an association of a job type, a set of job activation properties, and a list of possible recipients (gateway + unique aggregated stream ID). The API request handler will receive the gateway's open/close stream requests and add/remove the gateway from the possible recipients for activated jobs of a given type. When the last recipient is removed from a job stream, the stream is removed from the registry.

When a job would be made available, the engine will query the registry for an available stream. If there is none, the job is made activate-able and the job type is broadcast; this is important for compatibility with the current long-polling approach. If there is one, the job will be activated immediately during the same command processing, using the stream's job activation properties. Once activated, the job and its variables are handed over (as a side effect) to a specific consumer, which then forwards the job to the stream's gateway (including the job, its key, its variables, and the unique aggregated stream ID).

In more detail, the work breaks down as follows (a minimal registry sketch follows the list):

- Client
- Gateway
  - Stream management
  - Job proxy
- Broker
  - Stream management
  - Job push
  - Job activation
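As mentioned above, a minimal sketch of the registry, using hypothetical types and simplifying to one stream per job type (the real registry keys on job type plus activation properties):

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

/** Hypothetical sketch of the broker-side job stream registry. */
final class JobStreamRegistry {

  /** A possible recipient: a gateway plus its unique aggregated stream ID. */
  record Recipient(String gatewayId, UUID aggregatedStreamId) {}

  record JobStream(String jobType, String activationProperties, List<Recipient> recipients) {}

  // Simplification: one stream per job type. The real registry would key on
  // job type plus activation properties.
  private final Map<String, JobStream> streamsByType = new ConcurrentHashMap<>();

  /** Synchronous, thread-safe lookup, callable by the engine during command processing. */
  Optional<JobStream> streamFor(String jobType) {
    return Optional.ofNullable(streamsByType.get(jobType));
  }

  /** Called by the API request handler on a gateway's "open stream" request. */
  void add(String jobType, String activationProperties, Recipient recipient) {
    streamsByType
        .computeIfAbsent(jobType, t -> new JobStream(t, activationProperties, new CopyOnWriteArrayList<>()))
        .recipients()
        .add(recipient);
  }

  /** Called on a gateway's "close stream" request. */
  void remove(String jobType, Recipient recipient) {
    streamsByType.computeIfPresent(jobType, (t, stream) -> {
      stream.recipients().remove(recipient);
      // When the last recipient is removed, the stream leaves the registry.
      return stream.recipients().isEmpty() ? null : stream;
    });
  }
}
```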
Open questions relating to the production-ready scope. For each, we should come up with plausible test scenarios, which we will use to identify and define problems, as well as refine and iterate on potential solutions.

**Workload distribution**

In the alpha scope, we pick a random stream out of all registered streams, activate the job for it, and push it downstream; a minimal sketch of this selection follows below. This is sub-optimal, as it does not take slower workers into account (e.g. workers in a different network region, with fewer resources, unstable workers, or performance varying by input/payload). Furthermore, there is a risk of workers being overwhelmed by brokers. Possible test scenarios:
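As referenced above, the alpha behaviour amounts to roughly the following (hypothetical types; not the actual implementation):

```java
import java.util.List;
import java.util.UUID;
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical sketch of the alpha scope's naive recipient selection. */
final class RandomStreamPicker {

  record Recipient(String gatewayId, UUID aggregatedStreamId) {}

  /** Picks uniformly at random, ignoring worker load, latency, and capacity. */
  Recipient pick(List<Recipient> recipients) {
    if (recipients.isEmpty()) {
      throw new IllegalArgumentException("no registered streams for this job type");
    }
    return recipients.get(ThreadLocalRandom.current().nextInt(recipients.size()));
  }
}
```

A production version could instead weight recipients by observed round-trip latency or the number of outstanding pushed jobs.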
**Back-filling job streams**

In the alpha scope, we will rely on the existing long-polling solution (without modification) to activate jobs that were made activate-able but could not be pushed downstream. This is potentially sub-optimal if there is a large backlog of such jobs, as we could already start pushing them downstream without having to wait for a polling request. Possible test scenario:
**Intelligent failure handling**

In the alpha scope, we will adopt a naive approach to failures.
Since streams can be logically identical, as identified by their type and activation properties, it should be possible to intelligently forward a job to another equivalent stream on failure, minimizing interactions between gateway and broker, or broker and engine, and potentially removing a complete commit barrier from the latency profile.
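A hedged sketch of that idea: on a failed push, try the remaining logically equivalent recipients before falling back to yielding the job, avoiding an extra broker/engine round trip. All types are hypothetical:

```java
import java.util.ArrayDeque;
import java.util.Collection;
import java.util.Queue;

/** Hypothetical sketch: forward a failed push to another logically equivalent stream. */
final class FailoverPusher {

  record ActivatedJob(long key, String type, byte[] variables) {}

  interface Recipient {
    void push(ActivatedJob job) throws Exception;
  }

  /** Stand-in for the engine interaction that makes a job activatable again. */
  interface Engine {
    void yieldJob(long jobKey);
  }

  private final Engine engine;

  FailoverPusher(Engine engine) {
    this.engine = engine;
  }

  void push(ActivatedJob job, Collection<Recipient> equivalentRecipients) {
    Queue<Recipient> candidates = new ArrayDeque<>(equivalentRecipients);
    while (!candidates.isEmpty()) {
      Recipient recipient = candidates.poll();
      try {
        recipient.push(job);
        return; // delivered without any extra broker/engine round trip
      } catch (Exception e) {
        // This recipient failed; fall through and try the next equivalent stream.
      }
    }
    // No equivalent recipient succeeded: yield the job back for long polling.
    engine.yieldJob(job.key());
  }
}
```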
**Batch stream registry API**

For the alpha scope, we implemented simple single-item RPCs for the gateway/broker stream API. Gateways can add/remove one stream at a time, or remove all associated streams. It may be useful to allow adding/removing multiple streams at once in the future. For example, when a new broker joins the cluster, every gateway will want to add all of its streams to that broker; with a high enough number of streams, it is much more efficient to send fewer, batched requests. An illustrative request shape follows below. Possible test scenario:
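To illustrate the shape such a batched API could take (purely illustrative, not the actual protocol):

```java
import java.util.List;
import java.util.UUID;

/** Hypothetical sketch: batched requests for the gateway/broker stream API. */
final class StreamBatchApi {

  record StreamRegistration(String jobType, UUID aggregatedStreamId) {}

  // One request carrying all of a gateway's streams, e.g. when a new broker
  // joins the cluster, instead of one RPC per stream.
  record AddStreamsBatchRequest(String gatewayId, List<StreamRegistration> streams) {}

  record RemoveStreamsBatchRequest(String gatewayId, List<UUID> aggregatedStreamIds) {}
}
```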
An open question from #11713 (comment)
Another open question:
Added #12663 as a follow-up.
A note from our meeting with @koevskinikola and @npepinpe. In the initial phase, if pushing a job fails, we yield the job back and make it available (activatable) for long polling. In other words, initially there will not be a mechanism to retry pushing a job once a push has failed.
Another open question: when yielding jobs back on failure, we could run into the following problems:
What would happen right now if this occurs?
Exactly, 💥 We might want to include some form of sequencing or lease as part of the job yield, i.e. a yield would carry the sequence or lease under which the job was pushed, and would be ignored if it no longer matches.
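A rough sketch of what such a guarded yield could look like. This is entirely hypothetical; names and semantics are illustrative only:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch: sequence-guarded job yield that rejects stale yields. */
final class GuardedYield {

  // Sequence under which each pushed job is currently leased out.
  private final Map<Long, Long> currentSequenceByJobKey = new ConcurrentHashMap<>();

  /** Called when a job is pushed; returns the sequence the recipient must echo back. */
  long onPush(long jobKey) {
    return currentSequenceByJobKey.merge(jobKey, 1L, Long::sum);
  }

  /**
   * Called when a push fails and the job is yielded back. The yield is applied
   * only if it carries the current sequence; a stale yield (the job has since
   * been re-pushed or completed) is ignored.
   */
  boolean tryYield(long jobKey, long sequence) {
    Long current = currentSequenceByJobKey.get(jobKey);
    if (current == null || current != sequence) {
      return false; // stale yield: drop it
    }
    currentSequenceByJobKey.remove(jobKey);
    return true; // caller makes the job activatable again
  }
}
```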
I've included #12773 in the tasks related to this ticket so we don't lose track of the issue. ZPA will try to schedule time for it once work on the gateway/client side for this topic is done.
Another thought: we may want to preemptively close long-lived gRPC streams once they reach a certain age, to allow rebalancing between gateways and avoid having too many clients on the same gateway. This can be done in a future iteration, but it's likely that users will eventually run into the problem of too many clients converging on the same gateways if streams live long enough.
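A minimal sketch of such an age limit, assuming a hypothetical `ServerStream` handle; in practice one would add jitter so that many streams opened together do not all reconnect at once:

```java
import java.time.Duration;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch: close client streams after a maximum age to allow rebalancing. */
final class StreamAgeLimiter {

  /** Stand-in for a server-side gRPC stream handle; the client is expected to reconnect. */
  interface ServerStream {
    void closeGracefully();
  }

  private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
  private final Duration maxAge;

  StreamAgeLimiter(Duration maxAge) {
    this.maxAge = maxAge;
  }

  /** Called when a client opens a stream; schedules a graceful close at maxAge. */
  void register(ServerStream stream) {
    // On reconnect, the load balancer in front of the gateways can route the
    // client to a less loaded gateway, spreading long-lived streams over time.
    scheduler.schedule(stream::closeGracefully, maxAge.toMillis(), TimeUnit.MILLISECONDS);
  }
}
```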
As we're nearing the end of the alpha scope, I would propose the following for the production scope:
And of course fixing any bugs and issues we find in the meantime 🙃
Don't we already retry with an identical stream on both the gateway and the broker? What is missing?
Hm, I thought I had expanded on that. The missing part (which is not in the linked comment, even though I was sure I'd added it 😄) is about thundering herd issues. If we pushed out many jobs and suddenly all of them fail (e.g. all time out), we would be yielding many jobs back and possibly overloading the writer/processor, leading to high back pressure.
One caveat to anyone thinking of using this before it's production-ready: we already ran benchmarks with larger clusters, and without flow control, you not only have to scale your workers to accommodate the peak load at all times, you also have to scale your gateways. There is no failsafe in the gateway to reject requests, so it will keep allocating memory to cope and will eventually crash. When introducing flow control, we'll likely want to add two forms of limiting: per client, and per gateway (since the gateway can only do so much anyway).
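A sketch of what two-level limiting might look like, using semaphores as stand-ins for a real flow-control mechanism (all names hypothetical):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;

/** Hypothetical sketch: two levels of flow control for pushed jobs. */
final class PushLimiter {

  private final Semaphore gatewayPermits; // gateway-wide cap
  private final Map<String, Semaphore> clientPermits = new ConcurrentHashMap<>();
  private final int perClientLimit;

  PushLimiter(int gatewayLimit, int perClientLimit) {
    this.gatewayPermits = new Semaphore(gatewayLimit);
    this.perClientLimit = perClientLimit;
  }

  /** Returns true if the job may be pushed to this client; otherwise the caller should yield it. */
  boolean tryAcquire(String clientId) {
    var perClient = clientPermits.computeIfAbsent(clientId, id -> new Semaphore(perClientLimit));
    if (!perClient.tryAcquire()) {
      return false; // this client is saturated
    }
    if (!gatewayPermits.tryAcquire()) {
      perClient.release(); // roll back: the gateway as a whole is saturated
      return false;
    }
    return true;
  }

  /** Called when the client acknowledges (completes/fails) the pushed job. */
  void release(String clientId) {
    gatewayPermits.release();
    var perClient = clientPermits.get(clientId);
    if (perClient != null) {
      perClient.release();
    }
  }
}
```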
@npepinpe Do we have plans to support the `streamActivatedJobs` RPC in zeebe-process-test?
1012: Support using process instance migration from the zeebe client r=korthout a=korthout

## Description

This adds support for the new `MigrateProcessInstance` RPC to ZPT, such that users can try migrating process instances from ZPT. This allows testing migrations before doing so in production.

> [!NOTE]
> This does not add assertions for the migration. That is out of scope.

Additionally, this fixes a test case that no longer added value. The test case was intended to ensure that we don't forget supporting RPCs in ZPT, but its implementation no longer functioned correctly. Now it is able to correctly detect unsupported RPCs again, which [highlighted that `streamActivatedJobs` is not supported by ZPT](camunda/camunda#11231 (comment)).

## Related issues

closes #972

Co-authored-by: Nico Korthout <[email protected]>
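For reference, migrating a process instance from the Java client looks roughly like the following; the builder method names are recalled from the client API and may differ between versions, and the keys are made-up placeholders:

```java
import io.camunda.zeebe.client.ZeebeClient;

public final class MigrationExample {
  public static void main(String[] args) {
    try (ZeebeClient client = ZeebeClient.newClientBuilder().usePlaintext().build()) {
      long processInstanceKey = 2251799813685249L;          // assumed: an existing instance
      long targetProcessDefinitionKey = 2251799813685250L;  // assumed: the new version's key

      // Migrate the instance to the target definition, mapping active elements
      // from their source element IDs to their target element IDs.
      client.newMigrateProcessInstanceCommand(processInstanceKey)
          .migrationPlan(targetProcessDefinitionKey)
          .addMappingInstruction("serviceTaskA", "serviceTaskA2")
          .send()
          .join();
    }
  }
}
```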
We should, although it's perfectly usable without.
All tasks were completed or pushed to the backlog. 🎉
Description
We would like to reduce the time between when a job is created and when it is received by a `JobHandler`. One idea is to explore a push-based approach, where workers can open long-lived streams and receive jobs as soon as they are available, without having to poll for them.

To reiterate, the primary goal is to reduce the time between job creation and client handling. It is not to create a push-based pipeline; that is simply a proposed solution.
As a first step, we'll investigate a push-based approach. The first iteration will focus on getting feedback on the performance gain as early as possible. This means ignoring most failure paths, most edge cases, QoL features, etc.
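For a sense of what the push-based approach looks like from the worker side, here is a sketch using the Java client's job streaming command; the builder methods reflect the `streamActivatedJobs` RPC discussed in this thread and may differ across client versions:

```java
import io.camunda.zeebe.client.ZeebeClient;
import java.time.Duration;

public final class StreamingWorkerExample {
  public static void main(String[] args) {
    try (ZeebeClient client = ZeebeClient.newClientBuilder()
        .gatewayAddress("localhost:26500") // assumed local gateway address
        .usePlaintext()
        .build()) {

      // Open a long-lived job stream instead of polling with ActivateJobs.
      // Jobs are pushed to the consumer as soon as they become available.
      client.newStreamJobsCommand()
          .jobType("payment")
          .consumer(job -> {
            System.out.println("received job " + job.getKey());
            client.newCompleteCommand(job.getKey()).send();
          })
          .workerName("payment-worker")
          .timeout(Duration.ofSeconds(30))
          .send();

      // Keep the process alive while the stream is open (demo only).
      try {
        Thread.sleep(Duration.ofMinutes(5).toMillis());
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    }
  }
}
```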
Prototype
Organizational
The initial alpha target will implement end-to-end job pushing (from broker to client), with naive solutions for the more complex problems (log persistence, failure handling, and flow control), and will only implement support for the Java client.
Alpha
Production
Timeline
We'll approach this in two phases: alpha and production-ready. The target for the alpha scope is 8.3.0-alpha2 (the June release); the target for production-ready is 8.3 (September).