fix(17180): EventCreator ignores HealthMonitor update when squelching enabled #18387

Open · wants to merge 7 commits into base: main
Conversation

mxtartaglia-sl
Contributor

@mxtartaglia-sl mxtartaglia-sl commented Mar 18, 2025

Description:

While squelched, the EventCreator, if it previously registered an unhealthy status from the health monitor, ignores the message informing it that things are now in a healthy state again. It therefore fails to resume creating events, because the monitor does not repeat reports that are unchanged.

After this change, the monitor tracks the transitions to healthy and reports the healthy state repeatedly (even if unchanged) every 1 second.
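As a sketch of the repeated-report idea, the logic might look like the following. This is illustrative only: the method name, report values, and surrounding state are assumptions, not the actual `HealthMonitor` code; only `healthyReportThreshold = Duration.ofSeconds(1)` mirrors the diff in this PR.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical sketch of a health monitor that re-reports a healthy status
// every healthyReportThreshold, so a component that was squelched when the
// healthy transition happened still eventually observes it.
public class HealthReportSketch {
    private final Duration healthyReportThreshold = Duration.ofSeconds(1);
    private Instant lastHealthyReport = Instant.MIN;
    private boolean healthy = false;

    /**
     * Returns the duration to report, or null if no report is needed.
     * Unhealthy reports are only sent on change (as before); healthy
     * reports are repeated once per threshold even if unchanged.
     */
    public Duration checkSystemHealth(final Instant now, final boolean nowHealthy) {
        final boolean changed = nowHealthy != healthy;
        healthy = nowHealthy;
        if (!nowHealthy) {
            // unchanged unhealthy status: no repeated report
            return changed ? Duration.ofSeconds(5) : null;
        }
        if (changed || Duration.between(lastHealthyReport, now).compareTo(healthyReportThreshold) >= 0) {
            lastHealthyReport = now;
            return Duration.ZERO; // ZERO conventionally meaning "healthy"
        }
        return null;
    }
}
```

The key difference from a pure change-detection scheme is the second condition: even with `changed == false`, a healthy report goes out once the threshold has elapsed.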

Related issue(s):

Fixes #17180

Notes for reviewer:

Checklist

  • Documented (Code comments, README, etc.)
  • Tested (unit, integration, etc.)

@mxtartaglia-sl mxtartaglia-sl added this to the v0.61 milestone Mar 18, 2025
@mxtartaglia-sl mxtartaglia-sl self-assigned this Mar 18, 2025
Contributor

@cody-littley cody-littley left a comment


This change introduces a race condition. The following conditions must be met in order to safely flush the wiring framework:

  • no new data may be entering the system
  • all cycles in the wiring graph must be broken. This is done by squelching.
  • data flowing within the wiring diagram is now a DAG. Flush components in topological order.

I'm not sure if squelching is going to cause a bug, since I didn't do a deep dive on the reported issue. But I'm fairly confident that removing it is going to open the door for different bugs.

The complex workflow the system currently uses is in a large part due to the fact that it reuses things after a reconnect. It would be WAY simpler and safer if things were simply discarded and reconstructed after a reconnect.
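For illustration, the squelching idea described above can be sketched as a scheduler that discards inputs while squelched, which is what breaks cycles in the wiring graph so the remaining DAG can be flushed in topological order. This is a toy model under assumed names, not the real `TaskScheduler` API:

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.function.Consumer;

// Toy model of squelching: while squelched, incoming work is silently
// dropped instead of queued, so no data flows around the cycle this
// scheduler participates in. Not the actual wiring-framework classes.
public class SquelchingScheduler<T> {
    private final Queue<T> queue = new ArrayDeque<>();
    private volatile boolean squelching = false;

    public void startSquelching() { squelching = true; }
    public void stopSquelching()  { squelching = false; }

    /** Accepts work unless squelched; squelched input is discarded. */
    public void put(final T item) {
        if (!squelching) {
            queue.add(item);
        }
    }

    /** Drains all queued work through the handler (a "flush"). */
    public void flush(final Consumer<T> handler) {
        T item;
        while ((item = queue.poll()) != null) {
            handler.accept(item);
        }
    }
}
```

The bug under discussion follows directly from this model: any message arriving between `startSquelching()` and `stopSquelching()` is lost, including a one-shot "healthy again" notification.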

@mxtartaglia-sl
Contributor Author

mxtartaglia-sl commented Mar 19, 2025

Hey @cody-littley, thanks for the insight. Truly appreciated.

Background: Oleg's analysis exposed a situation where, after a reconnect with squelching enabled, the EventCreator, having previously registered an unhealthy duration from the health monitor, would ignore all incoming inputs, including the one from the health monitor informing it that the platform is healthy again, and would therefore fail to resume creating events.

> I'm not sure if squelching is going to cause a bug since I didn't do a deep dive on the reported issue. But I'm fairly confident that removing it is going to open the door for different bugs.

I see your point; I think that could happen if we completely removed the squelching feature (which is not what is happening in this PR).

I am removing squelching ONLY from the EventCreator in the current PR.
The rationale is:
a) the eventCreator's #maybeCreateEvent wire is already not emitting new events, given the platform status.
b) the eventCreator's "health info" wire must accept input regardless of whether a reconnect is being processed.
c) the eventCreator's "PlatformStatus" wire comes from the statusStateMachine, which is flushed before squelching used to be enabled on the eventCreator component, so it is safe to register the new input.
d) the eventCreator's "event window" is replaced in ReconnectStateLoader#loadReconnectState().
e) whatever happens with the eventCreator's "PlatformEvent" wire is later cleared using the "clear" wire.

The drawback is that it relies too much on how things work today.

> It would be WAY simpler and safer if things were simply discarded and reconstructed after a reconnect.

Agreed, although we are far from that being possible. Other options that have come up are:

  • The health monitor reports the duration even if it is unchanged, ensuring that the event creator eventually receives the input when it stops squelching.
  • Squelching becomes configurable per wire instead of per task scheduler.

Do you think I'm missing something? Eager to hear back from you...

@netopyr
Contributor

netopyr commented Mar 20, 2025

I agree with Cody. Even if the fix solves the observed issue, it would introduce considerable risk: PlatformCoordinator.clear() could never be used in any other scenario. Can we ensure this is, and always will be, the case?

@mxtartaglia-sl
Contributor Author

@netopyr PlatformCoordinator#clear javadoc states:

    /**
     * Safely clears the system in preparation for reconnect. After this method is called, there should be no work
     * sitting in any of the wiring queues, and all internal data structures within wiring components that need to be
     * cleared to prepare for a reconnect should be cleared.
     */

I wouldn't say it was initially intended for any other use.
That said, I'm all for not introducing the indirect coupling and leaving the squelching in place (at least if removing it is not strictly necessary) as documentation of the relationship. @lpetrovic05 any thoughts?

@netopyr, which of the alternatives would you prefer, if not this one?

A third option, proposed by Oleg, is to have a different type of wire that would enable direct modification of a component's internal state from task schedulers belonging to connected components.

@mxtartaglia-sl mxtartaglia-sl marked this pull request as ready for review March 20, 2025 21:18
@mxtartaglia-sl mxtartaglia-sl requested a review from a team as a code owner March 20, 2025 21:18
@lpetrovic05
Contributor

@cody-littley @mxtartaglia-sl @netopyr
My thought is this: what Cody said about reconstructing is definitely the best and simplest approach, and it is what we plan to do. But in the meantime, we should probably fix this bug, and this seems to me the simplest and safest way to do so until we get around to rebuilding everything.

@cody-littley
Contributor

cody-littley commented Mar 21, 2025

If I understand the issue correctly, the squelched component is accidentally squelching the message that informs it that things are now in a healthy state again?

If that's the case, a simple hack that wouldn't introduce new bugs would be to have the health monitor periodically send an "everything is good" message... perhaps once a second or so. (Note, this message probably shouldn't be sent while the node is reconnecting... don't want to clog up queues.) Currently, it only sends that message when the status changes, but it wouldn't hurt anything to send it more frequently.

@mxtartaglia-sl
Contributor Author

@cody-littley correct, that is the issue.

@lpetrovic05
Contributor

> If I understand the issue correctly, the squelched component is accidentally squelching the message that informs it that things are now in a healthy state again?

> If that's the case, a simple hack that wouldn't introduce new bugs would be to have the health monitor periodically send an "everything is good" message... perhaps once a second or so. (Note, this message probably shouldn't be sent while the node is reconnecting... don't want to clog up queues.) Currently, it only sends that message when the status changes, but it wouldn't hurt anything to send it more frequently.

We considered this approach, but we concluded that it would cause a lot more tasks to be created, which might have a performance impact. Because of this, we thought the current approach would be safer.

@cody-littley
Contributor

> a lot more tasks to be created

It depends on the frequency. There are currently thousands, if not tens of thousands, of tasks flowing through the wiring framework each second. A few extra every second or two is unlikely to produce a measurable delta.

@edward-swirldslabs
Contributor

> It depends on the frequency. There are currently thousands, if not tens of thousands, of tasks flowing through the wiring framework each second. A few extra every second or two is unlikely to produce a measurable delta.

I agree with this.

@lpetrovic05
Contributor

> > It depends on the frequency. There are currently thousands, if not tens of thousands, of tasks flowing through the wiring framework each second. A few extra every second or two is unlikely to produce a measurable delta.

> I agree with this.

Currently, the check is done 1000 times per second, and there are 3 wires bound to its output. That is at least an additional 3000 tasks per second; I don't think this is insignificant.

@OlegMazurov @poulok Thoughts on this?

@edward-swirldslabs
Contributor

edward-swirldslabs commented Mar 24, 2025

> > > It depends on the frequency. There are currently thousands, if not tens of thousands, of tasks flowing through the wiring framework each second. A few extra every second or two is unlikely to produce a measurable delta.

> > I agree with this.

> Currently, the check is done 1000 times per second, there are 3 wires bound to its output. This is at least an additional 3000 tasks per second, I don't think this is insignificant.

> @OlegMazurov @poulok Thoughts on this?

Is a task something that generates an item on a queue that has to be handled, or Java bytecode executing a branch test?

@mxtartaglia-sl
Contributor Author

mxtartaglia-sl commented Mar 24, 2025

> Is a task something that is generating an item on a queue that has to be handled

It depends on the scheduler type, but for the most commonly used cases, yes.

@lpetrovic05
Contributor

> > > > It depends on the frequency. There are currently thousands, if not tens of thousands, of tasks flowing through the wiring framework each second. A few extra every second or two is unlikely to produce a measurable delta.

> > > I agree with this.

> > Currently, the check is done 1000 times per second, there are 3 wires bound to its output. This is at least an additional 3000 tasks per second, I don't think this is insignificant.
> > @OlegMazurov @poulok Thoughts on this?

> Is a task something that is generating an item on a queue that has to be handled, or java byte code executing to do a branch test?

It is an instance that is sent to the ForkJoinPool to be executed asynchronously. It's sort of like a queue, but not exactly.

@OlegMazurov
Contributor

> Currently, the check is done 1000 times per second, there are 3 wires bound to its output. This is at least an additional 3000 tasks per second, I don't think this is insignificant.

I believe we can afford that as a tentative solution (we have resources available even at high TPS).

@mxtartaglia-sl
Contributor Author

We are going to move forward with this approach:

> have the health monitor periodically send an "everything is good" message... perhaps once a second or so.

@mxtartaglia-sl mxtartaglia-sl changed the title fix(17180): Squelching is removed from eventCreator fix(17180): Health monitor reports healthy status every 1-second Mar 26, 2025
@mxtartaglia-sl mxtartaglia-sl changed the title fix(17180): Health monitor reports healthy status every 1-second fix(17180): EventCreator ignores HealthMonitor update when squelching enabled Mar 26, 2025
@@ -71,6 +86,7 @@ public HealthMonitor(

this.metrics = new HealthMonitorMetrics(metrics, healthLogThreshold);
this.schedulers = new ArrayList<>();
this.healthyReportThreshold = Duration.ofSeconds(1);
Member


Should we make this configurable?

Development

Successfully merging this pull request may close these issues.

Node 2 never returns to ACTIVE after reconnect in FCM-VM-MultiSBReconnect-3R-1k-20m JRS Test
7 participants