Increase instance size used for RHEL tests [DI-435] #917

Merged
merged 3 commits into from
Mar 14, 2025

@JackPGreen JackPGreen commented Mar 11, 2025

The RHEL smoke test regularly fails due to:
> error: timed out waiting for the condition on pods/test-13631969388-1-17-hazelcast-enterprise-mancenter-0

After extensive investigation, I _believe_ the root cause is an inadequate instance size.

Specifically, my hypothesis is:
- MC (Management Center) [regularly takes ~30 seconds to start, even when it ultimately succeeds](https://github.com/hazelcast/hazelcast-docker/actions/runs/13631962459) on our current instances; locally it takes <5 seconds
- In the Helm chart, [we allow 30 seconds before liveness probes begin](https://github.com/hazelcast/charts/blob/7cf90413100187335332ebccefd58781873fd696/stable/hazelcast-enterprise/values.yaml#L499)
- If a liveness probe fails (i.e. MC is still starting), the instance is restarted
- This leads to regular restarts of the MC test instance. Normally _an_ invocation eventually starts up quickly enough, but sometimes none does.
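
The timing argument in this hypothesis can be sketched as a toy model. This is a simplified approximation, not the kubelet's actual scheduling. Only the 30-second initial delay comes from the chart; `period_seconds` and `failure_threshold` are hypothetical parameters, and `failure_threshold=1` is an assumption chosen to mirror the "one failed probe restarts the instance" description above.

```python
def is_restarted(
    startup_seconds: float,
    initial_delay_seconds: float = 30.0,  # from the Helm chart link above
    period_seconds: float = 10.0,         # assumption, for illustration only
    failure_threshold: int = 1,           # assumption, mirrors the hypothesis
) -> bool:
    """Return True if liveness probes would restart the container
    before it has finished starting.

    A probe is treated as failing whenever it fires while the
    application is still starting (probe time < startup time).
    """
    consecutive_failures = 0
    probe_time = initial_delay_seconds
    while probe_time < startup_seconds:
        consecutive_failures += 1
        if consecutive_failures >= failure_threshold:
            return True
        probe_time += period_seconds
    return False


# A ~5 second local start is finished long before the first probe,
# while a start slightly over 30 seconds is caught by it.
fast_start_restarted = is_restarted(5)
slow_start_restarted = is_restarted(32)
```

Under these assumptions, a startup time hovering around the 30-second mark would pass or fail depending on small variations between runs, which is consistent with the intermittent failures described.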

Because this behaviour is transient, it's difficult to be certain, but I tested:
- with the existing runner, [failed after 3 re-runs](https://github.com/hazelcast/hazelcast-docker/actions/runs/13783658071/job/38574527022)
- with a faster runner, [did not fail _with this error_ after 8 re-runs](https://github.com/hazelcast/hazelcast-docker/actions/runs/13792764938)
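
As a rough illustration of why a handful of clean re-runs is weak evidence for a transient failure, the sketch below computes the chance of seeing only passes when the underlying failure rate is unchanged. The per-run failure rate is a hypothetical input, not a measured value, and the model assumes each run fails independently.

```python
def probability_all_pass(per_run_failure_rate: float, runs: int) -> float:
    """Chance that `runs` independent runs all pass, given a fixed
    per-run failure rate (a hypothetical value, not measured)."""
    if not 0.0 <= per_run_failure_rate <= 1.0:
        raise ValueError("per_run_failure_rate must be between 0 and 1")
    if runs < 0:
        raise ValueError("runs must be non-negative")
    return (1.0 - per_run_failure_rate) ** runs


# Hypothetical: if 1 in 4 runs failed, 8 clean runs in a row would
# still occur about 10% of the time without any fix.
chance = probability_all_pass(0.25, 8)
```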


Fixes: [DI-435](https://hazelcast.atlassian.net/browse/DI-435)

Post-merge:
- [ ] backport
- [ ] retag
@JackPGreen JackPGreen requested a review from a team as a code owner March 11, 2025 22:47
@JackPGreen JackPGreen enabled auto-merge (squash) March 14, 2025 13:03
@JackPGreen JackPGreen merged commit dc05127 into master Mar 14, 2025
17 checks passed
@JackPGreen JackPGreen deleted the Increase-instance-size-used-for-RHEL-tests-DI-435] branch March 14, 2025 13:12
JackPGreen added a commit that referenced this pull request Mar 14, 2025
Reverts #917

Although testing could not reproduce the issue, as soon as this was merged
[it failed in the same way as before](https://github.com/hazelcast/hazelcast-docker/actions/runs/13858020706),
showing that this fix does nothing.