Optimize time between job creation and handling #11231
A quick summary (you can read more in the prototype issue): I think the prototype was a success. There are still many issues to solve, but it's clear the approach is much faster.
However, it's a more complex approach than simple polling, and we would likely want to continue supporting polling for the foreseeable future, which increases the complexity of the application. That said, the happy path is much, much faster, so it's worth pursuing.
I've updated the issue with a tentative breakdown for the alpha target. Next week, @deepthidevaki, @koevskinikola, and I will fix the scope for the alpha and production-ready targets and finalize a provisional timeline of deliverables. This will be a soft timeline to help us better organize our time, not a fixed one.
Alpha scope
Clients will be able to open so-called job streams. A job stream will target a specific job type and will define a set of job activation properties identical to those of the existing job worker. A client job stream will consist of a long-lived unidirectional stream between the client and the gateway, on which the gateway will push activated jobs. The gateway will aggregate clients by their activation properties: for the first client with logically equivalent activation properties, it will open a stream to each broker for this job type, passing these properties and a unique ID identifying the aggregated stream. Clients which open a stream after that, on this gateway, will not cause the gateway to send a request to the broker. When the last client closes a job stream of a given type, the gateway will send a request to the brokers to close its aggregated stream for that type.
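To make the aggregation concrete, here is a minimal sketch of gateway-side bookkeeping under the scheme described above. All names (`StreamAggregator`, `AggregatedStream`, `BrokerApi`, and so on) are hypothetical illustrations, not the actual implementation:

```java
import java.util.Map;
import java.util.Set;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArraySet;

/** Hypothetical sketch of gateway-side aggregation of client job streams. */
final class StreamAggregator {

  /** Streams are aggregated by job type plus activation properties. */
  record StreamKey(String jobType, String workerName, long timeoutMillis, Set<String> fetchVariables) {}

  record AggregatedStream(UUID id, Set<ClientStream> clients) {}

  /** A single client's long-lived unidirectional stream. */
  interface ClientStream {
    void push(byte[] activatedJob);
  }

  /** Hypothetical broker-facing API: open/close requests for aggregated streams. */
  interface BrokerApi {
    void openStream(StreamKey key, UUID aggregatedStreamId); // sent to every broker
    void closeStream(UUID aggregatedStreamId);
  }

  private final Map<StreamKey, AggregatedStream> streams = new ConcurrentHashMap<>();
  private final BrokerApi brokers;

  StreamAggregator(BrokerApi brokers) {
    this.brokers = brokers;
  }

  void addClient(StreamKey key, ClientStream client) {
    streams.compute(key, (k, aggregated) -> {
      if (aggregated == null) {
        // First logically equivalent client: register with the brokers exactly once.
        aggregated = new AggregatedStream(UUID.randomUUID(), new CopyOnWriteArraySet<>());
        brokers.openStream(k, aggregated.id());
      }
      aggregated.clients().add(client);
      return aggregated;
    });
  }

  void removeClient(StreamKey key, ClientStream client) {
    streams.computeIfPresent(key, (k, aggregated) -> {
      aggregated.clients().remove(client);
      if (aggregated.clients().isEmpty()) {
        // Last client closed its stream: tell the brokers to drop the aggregated stream.
        brokers.closeStream(aggregated.id());
        return null;
      }
      return aggregated;
    });
  }
}
```

The important property is that broker interactions happen only on the first open and the last close of a logically equivalent stream.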
The broker will have a new top-level service: the job stream service. It consists of a shared job stream registry, an API request handler to add/remove streams from the registry, and a streaming client to push activated jobs to registered streams. The job stream registry will provide a synchronous, thread-safe API to fetch a job stream, if any, for a given job type. It must be synchronous because the engine accesses it during command processing. A job stream in the registry will consist of an association of a job type, a set of job activation properties, and a list of possible recipients (gateway + unique aggregated stream ID). The API request handler will receive the gateway's open/close stream requests and add/remove the gateway from the possible recipients for activated jobs of a given type. When the last recipient is removed from a job stream, the stream is removed from the registry.

When a job would be made available, the engine will query the registry for an available stream. If there is none, the job is made activate-able and the job type is broadcast; this is important for compatibility with the current long-polling approach. If there is one, the job will be activated immediately during the same command processing, using the stream's job activation properties. Once activated, the job and its variables are handed over (as a side effect) to a specific consumer, which then forwards the job to the stream's gateway (including the job, its key, its variables, and the unique aggregated stream ID).

In more detail, the work breaks down as follows (a minimal registry sketch follows the list):

- Client
- Gateway
  - Stream management
  - Job proxy
- Broker
  - Stream management
  - Job push
  - Job activation
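As mentioned above, a minimal sketch of the registry, using hypothetical types and simplifying to one stream per job type (the real registry keys on job type plus activation properties):

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

/** Hypothetical sketch of the broker-side job stream registry. */
final class JobStreamRegistry {

  /** A possible recipient: a gateway plus its unique aggregated stream ID. */
  record Recipient(String gatewayId, UUID aggregatedStreamId) {}

  record JobStream(String jobType, String activationProperties, List<Recipient> recipients) {}

  // Simplification: one stream per job type. The real registry would key on
  // job type plus activation properties.
  private final Map<String, JobStream> streamsByType = new ConcurrentHashMap<>();

  /** Synchronous, thread-safe lookup, callable by the engine during command processing. */
  Optional<JobStream> streamFor(String jobType) {
    return Optional.ofNullable(streamsByType.get(jobType));
  }

  /** Called by the API request handler on a gateway's "open stream" request. */
  void add(String jobType, String activationProperties, Recipient recipient) {
    streamsByType
        .computeIfAbsent(jobType, t -> new JobStream(t, activationProperties, new CopyOnWriteArrayList<>()))
        .recipients()
        .add(recipient);
  }

  /** Called on a gateway's "close stream" request. */
  void remove(String jobType, Recipient recipient) {
    streamsByType.computeIfPresent(jobType, (t, stream) -> {
      stream.recipients().remove(recipient);
      // When the last recipient is removed, the stream leaves the registry.
      return stream.recipients().isEmpty() ? null : stream;
    });
  }
}
```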
Open questions relating to the production-ready scope. For each, we should come up with plausible test scenarios, which we will use to identify and define problems, as well as refine and iterate on potential solutions.

**Workload distribution**

In the alpha scope, we pick a random stream out of all registered streams, activate the job for it, and push it downstream; a minimal sketch of this selection follows below. This is sub-optimal, as it does not take slower workers into account (e.g. workers in a different network region, with fewer resources, unstable workers, or performance varying by input/payload). Furthermore, there is a risk of workers being overwhelmed by brokers. Possible test scenarios:
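As referenced above, the alpha behaviour amounts to roughly the following (hypothetical types; not the actual implementation):

```java
import java.util.List;
import java.util.UUID;
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical sketch of the alpha scope's naive recipient selection. */
final class RandomStreamPicker {

  record Recipient(String gatewayId, UUID aggregatedStreamId) {}

  /** Picks uniformly at random, ignoring worker load, latency, and capacity. */
  Recipient pick(List<Recipient> recipients) {
    if (recipients.isEmpty()) {
      throw new IllegalArgumentException("no registered streams for this job type");
    }
    return recipients.get(ThreadLocalRandom.current().nextInt(recipients.size()));
  }
}
```

A production version could instead weight recipients by observed round-trip latency or the number of outstanding pushed jobs.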
**Back-filling job streams**

In the alpha scope, we will rely on the existing long-polling solution (without modification) to activate jobs that were made activate-able but could not be pushed downstream. This is potentially sub-optimal if there is a large backlog of such jobs, as we could already start pushing them downstream without having to wait for a polling request. Possible test scenario:
**Intelligent failure handling**

In the alpha scope, we will adopt a naive approach to failures.
Since streams can be logically identical, as identified by their type and activation properties, it should be possible to intelligently forward a job to another equivalent stream on failure, minimizing interactions between gateway and broker, or broker and engine, and potentially removing a complete commit barrier from the latency profile.
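A hedged sketch of that idea: on a failed push, try the remaining logically equivalent recipients before falling back to yielding the job, avoiding an extra broker/engine round trip. All types are hypothetical:

```java
import java.util.ArrayDeque;
import java.util.Collection;
import java.util.Queue;

/** Hypothetical sketch: forward a failed push to another logically equivalent stream. */
final class FailoverPusher {

  record ActivatedJob(long key, String type, byte[] variables) {}

  interface Recipient {
    void push(ActivatedJob job) throws Exception;
  }

  /** Stand-in for the engine interaction that makes a job activatable again. */
  interface Engine {
    void yieldJob(long jobKey);
  }

  private final Engine engine;

  FailoverPusher(Engine engine) {
    this.engine = engine;
  }

  void push(ActivatedJob job, Collection<Recipient> equivalentRecipients) {
    Queue<Recipient> candidates = new ArrayDeque<>(equivalentRecipients);
    while (!candidates.isEmpty()) {
      Recipient recipient = candidates.poll();
      try {
        recipient.push(job);
        return; // delivered without any extra broker/engine round trip
      } catch (Exception e) {
        // This recipient failed; fall through and try the next equivalent stream.
      }
    }
    // No equivalent recipient succeeded: yield the job back for long polling.
    engine.yieldJob(job.key());
  }
}
```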
**Batch stream registry API**

For the alpha scope, we implemented simple single-item RPCs for the gateway/broker stream API. Gateways can add/remove one stream at a time, or remove all associated streams. It may be useful to allow adding/removing multiple streams at once in the future. For example, when a new broker joins the cluster, every gateway will want to add all of its streams to that broker; with a high enough number of streams, it is much more efficient to send fewer, batched requests. An illustrative request shape follows below. Possible test scenario:
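To illustrate the shape such a batched API could take (purely illustrative, not the actual protocol):

```java
import java.util.List;
import java.util.UUID;

/** Hypothetical sketch: batched requests for the gateway/broker stream API. */
final class StreamBatchApi {

  record StreamRegistration(String jobType, UUID aggregatedStreamId) {}

  // One request carrying all of a gateway's streams, e.g. when a new broker
  // joins the cluster, instead of one RPC per stream.
  record AddStreamsBatchRequest(String gatewayId, List<StreamRegistration> streams) {}

  record RemoveStreamsBatchRequest(String gatewayId, List<UUID> aggregatedStreamIds) {}
}
```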
An open question from #11713 (comment)
Another open question:
Added #12663 as a follow-up.
A note from our meeting with @koevskinikola and @npepinpe. In the initial phase, if pushing a job fails, we yield the job back and make it available (activatable) for long polling. In other words, initially there will not be a mechanism to retry pushing a job once a push has failed.
Another open question: when yielding jobs back on failure, we could run into the following problems:
What would happen right now if this occurs?
Exactly, 💥 We might want to include some form of sequencing or lease as part of the job yield, i.e. a yield would carry the sequence or lease under which the job was pushed, and would be ignored if it no longer matches.
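A rough sketch of what such a guarded yield could look like. This is entirely hypothetical; names and semantics are illustrative only:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch: sequence-guarded job yield that rejects stale yields. */
final class GuardedYield {

  // Sequence under which each pushed job is currently leased out.
  private final Map<Long, Long> currentSequenceByJobKey = new ConcurrentHashMap<>();

  /** Called when a job is pushed; returns the sequence the recipient must echo back. */
  long onPush(long jobKey) {
    return currentSequenceByJobKey.merge(jobKey, 1L, Long::sum);
  }

  /**
   * Called when a push fails and the job is yielded back. The yield is applied
   * only if it carries the current sequence; a stale yield (the job has since
   * been re-pushed or completed) is ignored.
   */
  boolean tryYield(long jobKey, long sequence) {
    Long current = currentSequenceByJobKey.get(jobKey);
    if (current == null || current != sequence) {
      return false; // stale yield: drop it
    }
    currentSequenceByJobKey.remove(jobKey);
    return true; // caller makes the job activatable again
  }
}
```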
I've included #12773 in the tasks related to this ticket so we don't lose track of the issue. ZPA will try to schedule time for it once work on the gateway/client side for this topic is done.
Another thought: we may want to preemptively close long-lived gRPC streams once they reach a certain age, to allow rebalancing between gateways and avoid having too many clients on the same gateway. This can be done in a future iteration, but it's likely that users will eventually run into the problem of too many clients converging on the same gateways if streams live long enough.
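A minimal sketch of such an age limit, assuming a hypothetical `ServerStream` handle; in practice one would add jitter so that many streams opened together do not all reconnect at once:

```java
import java.time.Duration;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch: close client streams after a maximum age to allow rebalancing. */
final class StreamAgeLimiter {

  /** Stand-in for a server-side gRPC stream handle; the client is expected to reconnect. */
  interface ServerStream {
    void closeGracefully();
  }

  private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
  private final Duration maxAge;

  StreamAgeLimiter(Duration maxAge) {
    this.maxAge = maxAge;
  }

  /** Called when a client opens a stream; schedules a graceful close at maxAge. */
  void register(ServerStream stream) {
    // On reconnect, the load balancer in front of the gateways can route the
    // client to a less loaded gateway, spreading long-lived streams over time.
    scheduler.schedule(stream::closeGracefully, maxAge.toMillis(), TimeUnit.MILLISECONDS);
  }
}
```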
As we're nearing the end of the alpha scope, I would propose the following for the production scope:
And of course fixing any bugs and issues we find in the meantime 🙃
Don't we already retry with an identical stream on both the gateway and the broker? What is missing?
Hm, I thought I had expanded on that. The missing part (which is not in the linked comment, even though I was sure I'd added it 😄) is about thundering herd issues. If we pushed out many jobs and suddenly all of them fail (e.g. all time out), we would be yielding many jobs back and possibly overloading the writer/processor, leading to high back pressure.
One caveat to anyone thinking of using this before it's production-ready: we already ran benchmarks with larger clusters, and without flow control, you not only have to scale your workers to accommodate the peak load at all times, you also have to scale your gateways. There is no failsafe in the gateway to reject requests, so it will keep allocating memory to cope and will eventually crash. When introducing flow control, we'll likely want to add two forms of limiting: per client, and per gateway (since the gateway can only do so much anyway).
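A sketch of what two-level limiting might look like, using semaphores as stand-ins for a real flow-control mechanism (all names hypothetical):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;

/** Hypothetical sketch: two levels of flow control for pushed jobs. */
final class PushLimiter {

  private final Semaphore gatewayPermits; // gateway-wide cap
  private final Map<String, Semaphore> clientPermits = new ConcurrentHashMap<>();
  private final int perClientLimit;

  PushLimiter(int gatewayLimit, int perClientLimit) {
    this.gatewayPermits = new Semaphore(gatewayLimit);
    this.perClientLimit = perClientLimit;
  }

  /** Returns true if the job may be pushed to this client; otherwise the caller should yield it. */
  boolean tryAcquire(String clientId) {
    var perClient = clientPermits.computeIfAbsent(clientId, id -> new Semaphore(perClientLimit));
    if (!perClient.tryAcquire()) {
      return false; // this client is saturated
    }
    if (!gatewayPermits.tryAcquire()) {
      perClient.release(); // roll back: the gateway as a whole is saturated
      return false;
    }
    return true;
  }

  /** Called when the client acknowledges (completes/fails) the pushed job. */
  void release(String clientId) {
    gatewayPermits.release();
    var perClient = clientPermits.get(clientId);
    if (perClient != null) {
      perClient.release();
    }
  }
}
```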
@npepinpe Do we have plans to support the `streamActivatedJobs` RPC in zeebe-process-test?
1012: Support using process instance migration from the zeebe client r=korthout a=korthout

## Description

This adds support for the new `MigrateProcessInstance` RPC to ZPT, such that users can try migrating process instances from ZPT. This allows testing migrations before doing so in production.

> [!NOTE]
> This does not add assertions for the migration. That is out of scope.

Additionally, this fixes a test case that no longer added value. The test case was intended to ensure that we don't forget supporting RPCs in ZPT, but its implementation no longer functioned correctly. Now it is able to correctly detect unsupported RPCs again, which [highlighted that `streamActivatedJobs` is not supported by ZPT](camunda/camunda#11231 (comment)).

## Related issues

closes #972

Co-authored-by: Nico Korthout <[email protected]>
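For reference, migrating a process instance from the Java client looks roughly like the following; the builder method names are recalled from the client API and may differ between versions, and the keys are made-up placeholders:

```java
import io.camunda.zeebe.client.ZeebeClient;

public final class MigrationExample {
  public static void main(String[] args) {
    try (ZeebeClient client = ZeebeClient.newClientBuilder().usePlaintext().build()) {
      long processInstanceKey = 2251799813685249L;          // assumed: an existing instance
      long targetProcessDefinitionKey = 2251799813685250L;  // assumed: the new version's key

      // Migrate the instance to the target definition, mapping active elements
      // from their source element IDs to their target element IDs.
      client.newMigrateProcessInstanceCommand(processInstanceKey)
          .migrationPlan(targetProcessDefinitionKey)
          .addMappingInstruction("serviceTaskA", "serviceTaskA2")
          .send()
          .join();
    }
  }
}
```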
We should, although it's perfectly usable without.
All tasks were completed or pushed to the backlog. 🎉
Description
We would like to reduce the time between when a job is created and when it is received by a `JobHandler`. One idea is to explore a push-based approach, where workers can open long-lived streams and receive jobs as soon as they are available, without having to poll for them.

To reiterate, the primary goal is to reduce the time between job creation and client handling. It is not to create a push-based pipeline; that is simply a proposed solution.
As a first step, we'll investigate a push-based approach. The first iteration will focus on getting feedback on the performance gain as early as possible. This means ignoring most failure paths, most edge cases, QoL features, etc.
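For a sense of what the push-based approach looks like from the worker side, here is a sketch using the Java client's job streaming command; the builder methods reflect the `streamActivatedJobs` RPC discussed in this thread and may differ across client versions:

```java
import io.camunda.zeebe.client.ZeebeClient;
import java.time.Duration;

public final class StreamingWorkerExample {
  public static void main(String[] args) {
    try (ZeebeClient client = ZeebeClient.newClientBuilder()
        .gatewayAddress("localhost:26500") // assumed local gateway address
        .usePlaintext()
        .build()) {

      // Open a long-lived job stream instead of polling with ActivateJobs.
      // Jobs are pushed to the consumer as soon as they become available.
      client.newStreamJobsCommand()
          .jobType("payment")
          .consumer(job -> {
            System.out.println("received job " + job.getKey());
            client.newCompleteCommand(job.getKey()).send();
          })
          .workerName("payment-worker")
          .timeout(Duration.ofSeconds(30))
          .send();

      // Keep the process alive while the stream is open (demo only).
      try {
        Thread.sleep(Duration.ofMinutes(5).toMillis());
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    }
  }
}
```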
Prototype
Organizational
The initial alpha target will implement end-to-end job pushing (from broker to client), with naive solutions for the more complex problems (log persistence, failure handling, and flow control), and will only implement support for the Java client.
Alpha
Production
Timeline
We'll approach this in two phases: alpha and production-ready. The target for the alpha scope is 8.3.0-alpha2 (the June release); the target for production-ready is 8.3 (September).