-
Notifications
You must be signed in to change notification settings - Fork 151
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(17180): EventCreator ignores HealthMonitor update when squelching enabled #18387
base: main
Are you sure you want to change the base?
Conversation
Signed-off-by: mxtartaglia <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change introduces a race condition. The following conditions must be met in order to safely flush the wiring framework:
- no new data may be entering the system
- all cycles in the wiring graph must be broken. This is done by squelching.
- data flowing within the wiring diagram is now a DAG. Flush components in topological order.
I'm not sure if squelching is going to cause a bug since didn't do a deep dive on the reported issue. But I'm fairly confident that removing it is going to open the door for different bugs.
The complex workflow the system currently uses is in a large part due to the fact that it reuses things after a reconnect. It would be WAY simpler and safer if things were simply discarded and reconstructed after a reconnect.
Hey @cody-littley, thanks for the insight. Truly appreciated. background: Oleg's analysis exposed a situation where, after a reconnect, when squelching is enabled, the EventCreator, if previously registered from the health monitor an unhealthy duration, would ignore all incoming inputs, even the one that, from the health monitor, would inform that the platform is in a healthy status again, failing to start creating events once more.
I see your point; I think that could happen if we completely remove the squelching feature (as is happening here PR). I am removing the squelching ONLY from the EventCreator in the current PR. The drawback is that it relies too much on how things work today.
Agreed, although we are far from that to be possible, other options that appeared are:
Do you think I'm missing something? Eager to hear back from you... |
I agree with Cody. Even if the fix would solve the observed issue, it would introduce much risk. |
@netopyr
I wouldn't say it was initially intended for any other use. @netopyr, which of the alternatives would you prefer, if not this one? A 3rd option proposed by Oleg is to have a different type of wire that would enable a direct modification of component's internal state from other tasksSchedulers belonging to connected components. |
@cody-littley @mxtartaglia-sl @netopyr |
If I understand the issue correctly, the squelched component is accidentally squelching the message that informs it that things are now in a healthy state again? If that's the case, a simple hack that wouldn't introduce new bugs would be to have the health monitor periodically send an "everything is good" message... perhaps once a second or so. (Note, this message probably shouldn't be sent while the node is reconnecting... don't want to clog up queues.) Currently, it only sends that message when the status changes, but it wouldn't hurt anything to send it more frequently. |
@cody-littley correct, that is the issue. |
We considered this approach, but we concluded that it will cause a lot more tasks to be created, which might have a performance impact. Because of this, we thought the current approach would be safer |
It depends on the frequency. There are currently thousands, if not tens of thousands, of tasks flowing through the wiring framework each second. A few extra every second or two is unlikely to produce a measurable delta. |
I agree with this. |
Currently, the check is done 1000 times per second, there are 3 wires bound to its output. This is at least an additional 3000 tasks per second, I don't think this is insignificant. @OlegMazurov @poulok Thoughts on this? |
Is a |
Depends on the scheduler type, but for most used cases, yes. |
It is an instance that is sent to the ForkJoinPool to be executed asynchronously. Its sort of like a queue, but not exactly. |
I believe we can afford that as a tentative solution (we have resources available even at high TPS). |
we are going to move forward with this approach:
|
Signed-off-by: mxtartaglia <[email protected]>
Signed-off-by: mxtartaglia <[email protected]>
@@ -71,6 +86,7 @@ public HealthMonitor( | |||
|
|||
this.metrics = new HealthMonitorMetrics(metrics, healthLogThreshold); | |||
this.schedulers = new ArrayList<>(); | |||
this.healthyReportThreshold = Duration.ofSeconds(1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we have this configurable?
Description:
When squelched, the EventCreator, if previously registered an unhealthy status by the health monitor, ignores the message that informs it that things are now in a healthy state again. Thus, it fails to start creating events again, given that the monitor doesn't repeat reports if they are unchanged.
After this change, the monitor tracks the transitions to healthy and reports the healthy state repeatedly (even if unchanged) every 1 second.
Related issue(s):
Fixes #17180
Notes for reviewer:
Checklist