fix(17180): EventCreator ignores HealthMonitor update when squelching enabled #18387

Open · wants to merge 7 commits into base: main
Conversation

mxtartaglia-sl
Contributor

@mxtartaglia-sl mxtartaglia-sl commented Mar 18, 2025

Description:

While squelched, the EventCreator, if it previously registered an unhealthy status from the health monitor, ignores the message informing it that things are now in a healthy state again. It therefore fails to resume creating events, because the monitor does not repeat reports that are unchanged.

After this change, the monitor tracks the transitions to healthy and reports the healthy state repeatedly (even if unchanged) every 1 second.
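As a sketch of the repeated-report idea, the logic might look like the following. This is illustrative only: the method name, report values, and surrounding state are assumptions, not the actual `HealthMonitor` code; only `healthyReportThreshold = Duration.ofSeconds(1)` mirrors the diff in this PR.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical sketch of a health monitor that re-reports a healthy status
// every healthyReportThreshold, so a component that was squelched when the
// healthy transition happened still eventually observes it.
public class HealthReportSketch {
    private final Duration healthyReportThreshold = Duration.ofSeconds(1);
    private Instant lastHealthyReport = Instant.MIN;
    private boolean healthy = false;

    /**
     * Returns the duration to report, or null if no report is needed.
     * Unhealthy reports are only sent on change (as before); healthy
     * reports are repeated once per threshold even if unchanged.
     */
    public Duration checkSystemHealth(final Instant now, final boolean nowHealthy) {
        final boolean changed = nowHealthy != healthy;
        healthy = nowHealthy;
        if (!nowHealthy) {
            // unchanged unhealthy status: no repeated report
            return changed ? Duration.ofSeconds(5) : null;
        }
        if (changed || Duration.between(lastHealthyReport, now).compareTo(healthyReportThreshold) >= 0) {
            lastHealthyReport = now;
            return Duration.ZERO; // ZERO conventionally meaning "healthy"
        }
        return null;
    }
}
```

The key difference from a pure change-detection scheme is the second condition: even with `changed == false`, a healthy report goes out once the threshold has elapsed.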

Related issue(s):

Fixes #17180

Notes for reviewer:

Checklist

  • Documented (Code comments, README, etc.)
  • Tested (unit, integration, etc.)

@mxtartaglia-sl mxtartaglia-sl added this to the v0.61 milestone Mar 18, 2025
@mxtartaglia-sl mxtartaglia-sl self-assigned this Mar 18, 2025
Contributor

@cody-littley cody-littley left a comment


This change introduces a race condition. The following conditions must be met in order to safely flush the wiring framework:

  • no new data may be entering the system
  • all cycles in the wiring graph must be broken. This is done by squelching.
  • data flowing within the wiring diagram is now a DAG. Flush components in topological order.

I'm not sure if squelching is going to cause a bug, since I didn't do a deep dive on the reported issue. But I'm fairly confident that removing it is going to open the door for different bugs.

The complex workflow the system currently uses is in a large part due to the fact that it reuses things after a reconnect. It would be WAY simpler and safer if things were simply discarded and reconstructed after a reconnect.
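For illustration, the squelching idea described above can be sketched as a scheduler that discards inputs while squelched, which is what breaks cycles in the wiring graph so the remaining DAG can be flushed in topological order. This is a toy model under assumed names, not the real `TaskScheduler` API:

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.function.Consumer;

// Toy model of squelching: while squelched, incoming work is silently
// dropped instead of queued, so no data flows around the cycle this
// scheduler participates in. Not the actual wiring-framework classes.
public class SquelchingScheduler<T> {
    private final Queue<T> queue = new ArrayDeque<>();
    private volatile boolean squelching = false;

    public void startSquelching() { squelching = true; }
    public void stopSquelching()  { squelching = false; }

    /** Accepts work unless squelched; squelched input is discarded. */
    public void put(final T item) {
        if (!squelching) {
            queue.add(item);
        }
    }

    /** Drains all queued work through the handler (a "flush"). */
    public void flush(final Consumer<T> handler) {
        T item;
        while ((item = queue.poll()) != null) {
            handler.accept(item);
        }
    }
}
```

The bug under discussion follows directly from this model: any message arriving between `startSquelching()` and `stopSquelching()` is lost, including a one-shot "healthy again" notification.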

@mxtartaglia-sl
Contributor Author

mxtartaglia-sl commented Mar 19, 2025

Hey @cody-littley, thanks for the insight. Truly appreciated.

Background: Oleg's analysis exposed a situation where, after a reconnect with squelching enabled, the EventCreator, having previously registered an unhealthy duration from the health monitor, would ignore all incoming inputs, including the one from the health monitor informing it that the platform is healthy again, and would therefore fail to resume creating events.

> I'm not sure if squelching is going to cause a bug since I didn't do a deep dive on the reported issue. But I'm fairly confident that removing it is going to open the door for different bugs.

I see your point; I think that could happen if we completely removed the squelching feature (which is not what is happening in this PR).

I am removing squelching ONLY from the EventCreator in the current PR.
The rationale is:
a) the eventCreator's #maybeCreateEvent wire is already not emitting new events, given the platform status.
b) the eventCreator's "health info" wire must accept input regardless of whether a reconnect is being processed.
c) the eventCreator's "PlatformStatus" wire comes from the statusStateMachine, which is flushed before squelching used to be enabled on the eventCreator component, so it is safe to register the new input.
d) the eventCreator's "event window" is replaced in ReconnectStateLoader#loadReconnectState().
e) whatever happens with the eventCreator's "PlatformEvent" wire is later cleared using the "clear" wire.

The drawback is that it relies too much on how things work today.

> It would be WAY simpler and safer if things were simply discarded and reconstructed after a reconnect.

Agreed, although we are far from that being possible. Other options that have come up are:

  • The health monitor reports the duration even if it is unchanged, ensuring that the event creator eventually receives the input when it stops squelching.
  • Squelching becomes configurable per wire instead of per task scheduler.

Do you think I'm missing something? Eager to hear back from you...

@netopyr
Contributor

netopyr commented Mar 20, 2025

I agree with Cody. Even if the fix solves the observed issue, it would introduce considerable risk: PlatformCoordinator.clear() could never be used in any other scenario. Can we ensure this is, and always will be, the case?

@mxtartaglia-sl
Contributor Author

@netopyr PlatformCoordinator#clear javadoc states:

    /**
     * Safely clears the system in preparation for reconnect. After this method is called, there should be no work
     * sitting in any of the wiring queues, and all internal data structures within wiring components that need to be
     * cleared to prepare for a reconnect should be cleared.
     */

I wouldn't say it was initially intended for any other use.
That said, I'm all for not introducing the indirect coupling and leaving the squelching in place (at least if removing it is not strictly necessary) as documentation of the relationship. @lpetrovic05 any thoughts?

@netopyr, which of the alternatives would you prefer, if not this one?

A third option, proposed by Oleg, is to have a different type of wire that would enable direct modification of a component's internal state from task schedulers belonging to connected components.

@mxtartaglia-sl mxtartaglia-sl marked this pull request as ready for review March 20, 2025 21:18
@mxtartaglia-sl mxtartaglia-sl requested a review from a team as a code owner March 20, 2025 21:18
@lpetrovic05
Contributor

@cody-littley @mxtartaglia-sl @netopyr
My thought is this: what Cody said about reconstructing is definitely the best and simplest approach, and it is what we plan to do. But in the meantime, we should probably fix this bug, and this seems to me the simplest and safest way to do so until we get around to rebuilding everything.

@cody-littley
Contributor

cody-littley commented Mar 21, 2025

If I understand the issue correctly, the squelched component is accidentally squelching the message that informs it that things are now in a healthy state again?

If that's the case, a simple hack that wouldn't introduce new bugs would be to have the health monitor periodically send an "everything is good" message... perhaps once a second or so. (Note, this message probably shouldn't be sent while the node is reconnecting... don't want to clog up queues.) Currently, it only sends that message when the status changes, but it wouldn't hurt anything to send it more frequently.

@mxtartaglia-sl
Contributor Author

@cody-littley correct, that is the issue.

@lpetrovic05
Contributor

> If I understand the issue correctly, the squelched component is accidentally squelching the message that informs it that things are now in a healthy state again?

> If that's the case, a simple hack that wouldn't introduce new bugs would be to have the health monitor periodically send an "everything is good" message... perhaps once a second or so. (Note, this message probably shouldn't be sent while the node is reconnecting... don't want to clog up queues.) Currently, it only sends that message when the status changes, but it wouldn't hurt anything to send it more frequently.

We considered this approach, but we concluded that it would cause a lot more tasks to be created, which might have a performance impact. Because of this, we thought the current approach would be safer.

@cody-littley
Contributor

> a lot more tasks to be created

It depends on the frequency. There are currently thousands, if not tens of thousands, of tasks flowing through the wiring framework each second. A few extra every second or two is unlikely to produce a measurable delta.

@edward-swirldslabs
Contributor

> It depends on the frequency. There are currently thousands, if not tens of thousands, of tasks flowing through the wiring framework each second. A few extra every second or two is unlikely to produce a measurable delta.

I agree with this.

@lpetrovic05
Contributor

> > It depends on the frequency. There are currently thousands, if not tens of thousands, of tasks flowing through the wiring framework each second. A few extra every second or two is unlikely to produce a measurable delta.

> I agree with this.

Currently, the check is done 1000 times per second, and there are 3 wires bound to its output. That is at least an additional 3000 tasks per second; I don't think this is insignificant.

@OlegMazurov @poulok Thoughts on this?

@edward-swirldslabs
Contributor

edward-swirldslabs commented Mar 24, 2025

> > > It depends on the frequency. There are currently thousands, if not tens of thousands, of tasks flowing through the wiring framework each second. A few extra every second or two is unlikely to produce a measurable delta.

> > I agree with this.

> Currently, the check is done 1000 times per second, there are 3 wires bound to its output. This is at least an additional 3000 tasks per second, I don't think this is insignificant.

> @OlegMazurov @poulok Thoughts on this?

Is a task something that generates an item on a queue that has to be handled, or Java bytecode executing a branch test?

@mxtartaglia-sl
Contributor Author

mxtartaglia-sl commented Mar 24, 2025

> Is a task something that is generating an item on a queue that has to be handled

It depends on the scheduler type, but for the most commonly used cases, yes.

@lpetrovic05
Contributor

> > > > It depends on the frequency. There are currently thousands, if not tens of thousands, of tasks flowing through the wiring framework each second. A few extra every second or two is unlikely to produce a measurable delta.

> > > I agree with this.

> > Currently, the check is done 1000 times per second, there are 3 wires bound to its output. This is at least an additional 3000 tasks per second, I don't think this is insignificant.
> > @OlegMazurov @poulok Thoughts on this?

> Is a task something that is generating an item on a queue that has to be handled, or java byte code executing to do a branch test?

It is an instance that is sent to the ForkJoinPool to be executed asynchronously. It's sort of like a queue, but not exactly.

@OlegMazurov
Contributor

> Currently, the check is done 1000 times per second, there are 3 wires bound to its output. This is at least an additional 3000 tasks per second, I don't think this is insignificant.

I believe we can afford that as a tentative solution (we have resources available even at high TPS).

@mxtartaglia-sl
Contributor Author

We are going to move forward with this approach:

> have the health monitor periodically send an "everything is good" message... perhaps once a second or so.

@mxtartaglia-sl mxtartaglia-sl changed the title fix(17180): Squelching is removed from eventCreator fix(17180): Health monitor reports healthy status every 1-second Mar 26, 2025
@mxtartaglia-sl mxtartaglia-sl changed the title fix(17180): Health monitor reports healthy status every 1-second fix(17180): EventCreator ignores HealthMonitor update when squelching enabled Mar 26, 2025
@@ -71,6 +86,7 @@ public HealthMonitor(

this.metrics = new HealthMonitorMetrics(metrics, healthLogThreshold);
this.schedulers = new ArrayList<>();
this.healthyReportThreshold = Duration.ofSeconds(1);
Member


Should we make this configurable?

Development

Successfully merging this pull request may close these issues.

Node 2 never returns to ACTIVE after reconnect in FCM-VM-MultiSBReconnect-3R-1k-20m JRS Test
7 participants