Increase instance size used for RHEL tests [DI-435] #917

JackPGreen · 2025-03-11T22:47:35Z

The RHEL smoke test regularly fails due to:

> error: timed out waiting for the condition on pods/test-13631969388-1-17-hazelcast-enterprise-mancenter-0

After extensive investigation, I believe the root cause is an inadequate instance size.

Specifically, my hypothesis is:

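- Management Center (MC) [regularly takes ~30 seconds to start](https://github.com/hazelcast/hazelcast-docker/actions/runs/13631962459) on our current instances, even when it ultimately succeeds; locally it takes <5 seconds.
- In the Helm chart, [we allow 30 seconds before beginning liveness probes](https://github.com/hazelcast/charts/blob/7cf90413100187335332ebccefd58781873fd696/stable/hazelcast-enterprise/values.yaml#L499).
- If a liveness probe fails (i.e. MC is still starting), the instance is restarted.
- This leads to regular MC test instance restarts. Normally _an_ invocation eventually starts up quickly enough, but sometimes none does.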
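
The timing race in this hypothesis can be sketched with a small model. This is only an illustration, not the kubelet's actual logic: the function name, `period_seconds`, and `failure_threshold` are assumptions introduced here (only the 30-second initial delay comes from the chart), and `failure_threshold=1` is chosen to mirror the "one failed probe means a restart" description above. The chart's real values may give more headroom.

```python
from typing import Optional


def first_restart_time(
    startup_seconds: float,
    initial_delay_seconds: float = 30,  # from the Helm chart, per the description
    period_seconds: float = 10,         # assumption: not stated in this PR
    failure_threshold: int = 1,         # assumption: mirrors "a failed probe restarts"
) -> Optional[float]:
    """Return when (seconds after container start) a restart would be triggered,
    or None if the application finishes starting before that happens.

    Simplified model: a probe fails if it fires before startup has finished;
    probe timeouts and restart back-off are ignored.
    """
    consecutive_failures = 0
    probe_time = initial_delay_seconds
    while probe_time < startup_seconds:
        consecutive_failures += 1
        if consecutive_failures >= failure_threshold:
            return probe_time
        probe_time += period_seconds
    return None


# A fast local start never races the probe; a ~30 second start sits right on the edge.
for startup in (5, 29, 31):
    print(f"startup={startup}s -> restart at {first_restart_time(startup)}")
```

Under these assumptions a start that takes 29 seconds survives and one that takes 31 seconds is restarted, which is consistent with the intermittent behaviour described. Raising `failure_threshold` in the model pushes the restart point later.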

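Because the failure is intermittent, it's difficult to be certain, but I tested:

- with the existing runner, [failed after 3 re-runs](https://github.com/hazelcast/hazelcast-docker/actions/runs/13783658071/job/38574527022)
- with a faster runner, [did not fail _with this error_ after 8 re-runs](https://github.com/hazelcast/hazelcast-docker/actions/runs/13792764938)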
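
To put the "difficult to be certain" caveat in numbers, here is a rough sketch of how likely a streak of passing runs is when the underlying failure is still present. The per-run failure rates below are hypothetical, not measured from these workflows, and the model assumes runs are independent.

```python
def chance_of_clean_streak(failure_rate: float, runs: int) -> float:
    """Probability that `runs` independent runs all pass, given a per-run failure rate."""
    if not 0.0 <= failure_rate <= 1.0:
        raise ValueError("failure_rate must be between 0 and 1")
    if runs < 0:
        raise ValueError("runs must be non-negative")
    return (1.0 - failure_rate) ** runs


# Hypothetical failure rates; 8 matches the number of re-runs on the faster runner.
for rate in (0.1, 0.2, 0.33):
    print(f"failure rate {rate:.2f}: P(8 clean runs) = {chance_of_clean_streak(rate, 8):.3f}")
```

For example, a test that fails one run in ten would still pass eight runs in a row roughly 43% of the time, so eight clean re-runs are suggestive rather than conclusive.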

Fixes: [DI-435](https://hazelcast.atlassian.net/browse/DI-435)

Post-merge:

- [ ] backport
- [ ] retag

.github/workflows/tag_image_push_rhel.yml


Reverts #917. Although testing could not reproduce the issue on the faster runner, as soon as the change was merged the smoke test [failed in the same way as before](https://github.com/hazelcast/hazelcast-docker/actions/runs/13858020706), showing that this fix does nothing.

JackPGreen requested review from ldziedziul and nishaatr March 11, 2025 22:47

JackPGreen requested a review from a team as a code owner March 11, 2025 22:47

nishaatr approved these changes Mar 12, 2025

.github/workflows/tag_image_push_rhel.yml (outdated review comment, resolved)

Update tag_image_push_rhel.yml

466b910

JackPGreen enabled auto-merge (squash) March 14, 2025 13:03

ldziedziul approved these changes Mar 14, 2025

Merge branch 'master' into Increase-instance-size-used-for-RHEL-tests…

7b22562

…-DI-435]

JackPGreen merged commit dc05127 into master Mar 14, 2025
17 checks passed

JackPGreen deleted the Increase-instance-size-used-for-RHEL-tests-DI-435] branch March 14, 2025 13:12

JackPGreen added a commit that referenced this pull request Mar 14, 2025

Increase instance size used for RHEL tests [5.5.z] (#920)

ce35d75

Backport of #917

JackPGreen added a commit that referenced this pull request Mar 14, 2025

Increase instance size used for RHEL tests [5.4.z] (#921)

a70f4ec

Backport of #917

JackPGreen added a commit that referenced this pull request Mar 14, 2025

Increase instance size used for RHEL tests [5.5.5] (#924)

0ef82a7

Backport of #917

JackPGreen added a commit that referenced this pull request Mar 14, 2025

Increase instance size used for RHEL tests [5.3.z] (#922)

4988494

Backport of #917

JackPGreen added a commit that referenced this pull request Mar 14, 2025

Increase instance size used for RHEL tests [5.3.8] (#923)

7146a00

Backport of #917

JackPGreen added a commit that referenced this pull request Mar 14, 2025

Increase instance size used for RHEL tests [5.4.1] (#925)

807f5d1

Backport of #917

JackPGreen added a commit that referenced this pull request Mar 14, 2025

Revert "Increase instance size used for RHEL tests [DI-435] (#917)"

06aea56

This reverts commit dc05127.

JackPGreen mentioned this pull request Mar 14, 2025

Revert "Increase instance size used for RHEL tests [DI-435]" #926

Merged
