template list.peerings timeout #25508

Open
EtienneBruines opened this issue Mar 24, 2025 · 2 comments
Assignees
Labels
stage/waiting-on-upstream This issue is waiting on an upstream PR review theme/consul-template type/bug

Comments


EtienneBruines commented Mar 24, 2025

Nomad version

Nomad v1.9.7
BuildDate 2025-03-11T09:07:15Z
Revision f869597+CHANGES

Operating system and Environment details

Ubuntu 24.04.2 LTS on amd64

Issue

Nomad's template runner does not handle the wait period for the list.peerings query correctly. Although it issues a blocking query to pick up the latest changes, it is not patient enough to wait for them. I am not sure whether this logic lives in Nomad and how it uses Consul Template, or whether it is an issue inside Consul Template v0.40.0 itself. It only occurs on Nomad v1.9.7 and above, i.e. ever since the Consul Template dependency was bumped to v0.40.0.

Template failed: list.peerings: Get "http://127.0.0.1:8500/v1/peerings?index=41156796&stale=&wait=300000ms": context deadline exceeded

Reproduction steps

  • Have a Consul cluster with cluster peering enabled (not sure if important)
  • Make that peered cluster export a service (not sure if important)
  • Have a Nomad job template block like this (see below)
  • Wait 20-30 minutes (12 retry attempts by default) for the job to fail with the context deadline exceeded error
template {
  data = <<EOF
{{ range peerings }}
{{ range .StreamStatus.ImportedServices }}
# {{ . }}
{{ end }}
{{ end }}
EOF
  destination   = "local/backends.cfg"
  env           = false
  change_mode   = "signal"
  change_signal = "SIGUSR2"
}

Expected Result

No timeouts.

Actual Result

Timeouts which cause the job to fail.

Job file (if appropriate)

See related snippet above.

Nomad Server logs (if appropriate)

Not applicable.

Nomad Client logs (if appropriate)

These logs may include a few unrelated allocations, but the list.peerings failure shows up clearly.

{"@level":"info","@message":"(runner) rendered \"(dynamic)\" =\u003e \"/opt/nomad/data/alloc/b3e17da0-8fec-fb9b-51b9-b14ea2c404e2/haproxy/local/backends.cfg\"","@module":"agent","@timestamp":"2025-03-24T15:48:44.400586Z"}
{"@level":"info","@message":"Task event","@module":"client.alloc_runner.task_runner","@timestamp":"2025-03-24T15:48:44.823143Z","alloc_id":"b3e17da0-8fec-fb9b-51b9-b14ea2c404e2","failed":false,"msg":"Template re-rendered","task":"haproxy","type":"Signaling"}
{"@level":"warn","@message":"(view) list.peerings: Get \"http://127.0.0.1:8500/v1/peerings?index=41156796\u0026stale=\u0026wait=300000ms\": context deadline exceeded (retry attempt 1 after \"250ms\")","@module":"agent","@timestamp":"2025-03-24T15:48:53.277751Z"}
{"@level":"error","@message":"(runner) sending server error back to caller","@module":"agent","@timestamp":"2025-03-24T15:48:53.277803Z"}
{"@level":"info","@message":"(runner) rendered \"(dynamic)\" =\u003e \"/opt/nomad/data/alloc/b3e17da0-8fec-fb9b-51b9-b14ea2c404e2/haproxy/local/backends.cfg\"","@module":"agent","@timestamp":"2025-03-24T15:48:54.377757Z"}
{"@level":"info","@message":"Task event","@module":"client.alloc_runner.task_runner","@timestamp":"2025-03-24T15:48:55.970820Z","alloc_id":"b3e17da0-8fec-fb9b-51b9-b14ea2c404e2","failed":false,"msg":"Template re-rendered","task":"haproxy","type":"Signaling"}
{"@level":"error","@message":"(runner) sending server error back to caller","@module":"agent","@timestamp":"2025-03-24T15:49:53.529783Z"}
{"@level":"warn","@message":"(view) list.peerings: Get \"http://127.0.0.1:8500/v1/peerings?index=41156796\u0026stale=\u0026wait=300000ms\": context deadline exceeded (retry attempt 2 after \"500ms\")","@module":"agent","@timestamp":"2025-03-24T15:49:53.529873Z"}
{"@level":"warn","@message":"(view) list.peerings: Get \"http://127.0.0.1:8500/v1/peerings?index=41156796\u0026stale=\u0026wait=300000ms\": context deadline exceeded (retry attempt 3 after \"1s\")","@module":"agent","@timestamp":"2025-03-24T15:50:54.030772Z"}
{"@level":"error","@message":"(runner) sending server error back to caller","@module":"agent","@timestamp":"2025-03-24T15:50:54.030826Z"}
{"@level":"info","@message":"(runner) rendered \"(dynamic)\" =\u003e \"/opt/nomad/data/alloc/b3e17da0-8fec-fb9b-51b9-b14ea2c404e2/haproxy/local/backends.cfg\"","@module":"agent","@timestamp":"2025-03-24T15:51:29.773247Z"}
{"@level":"info","@message":"Task event","@module":"client.alloc_runner.task_runner","@timestamp":"2025-03-24T15:51:32.146192Z","alloc_id":"b3e17da0-8fec-fb9b-51b9-b14ea2c404e2","failed":false,"msg":"Template re-rendered","task":"haproxy","type":"Signaling"}
{"@level":"warn","@message":"(view) list.peerings: Get \"http://127.0.0.1:8500/v1/peerings?index=41156796\u0026stale=\u0026wait=300000ms\": context deadline exceeded (retry attempt 4 after \"2s\")","@module":"agent","@timestamp":"2025-03-24T15:51:55.032151Z"}
{"@level":"error","@message":"(runner) sending server error back to caller","@module":"agent","@timestamp":"2025-03-24T15:51:55.032215Z"}
{"@level":"info","@message":"(runner) rendered \"(dynamic)\" =\u003e \"/opt/nomad/data/alloc/b3e17da0-8fec-fb9b-51b9-b14ea2c404e2/haproxy/local/backends.cfg\"","@module":"agent","@timestamp":"2025-03-24T15:52:25.322546Z"}
{"@level":"info","@message":"Task event","@module":"client.alloc_runner.task_runner","@timestamp":"2025-03-24T15:52:26.641251Z","alloc_id":"b3e17da0-8fec-fb9b-51b9-b14ea2c404e2","failed":false,"msg":"Template re-rendered","task":"haproxy","type":"Signaling"}
{"@level":"info","@message":"(runner) rendered \"(dynamic)\" =\u003e \"/opt/nomad/data/alloc/b3e17da0-8fec-fb9b-51b9-b14ea2c404e2/haproxy/local/backends.cfg\"","@module":"agent","@timestamp":"2025-03-24T15:52:41.801646Z"}
{"@level":"info","@message":"Task event","@module":"client.alloc_runner.task_runner","@timestamp":"2025-03-24T15:52:45.268276Z","alloc_id":"b3e17da0-8fec-fb9b-51b9-b14ea2c404e2","failed":false,"msg":"Template re-rendered","task":"haproxy","type":"Signaling"}
{"@level":"warn","@message":"(view) list.peerings: Get \"http://127.0.0.1:8500/v1/peerings?index=41156796\u0026stale=\u0026wait=300000ms\": context deadline exceeded (retry attempt 5 after \"4s\")","@module":"agent","@timestamp":"2025-03-24T15:52:57.033830Z"}
{"@level":"error","@message":"(runner) sending server error back to caller","@module":"agent","@timestamp":"2025-03-24T15:52:57.033891Z"}
{"@level":"warn","@message":"(view) list.peerings: Get \"http://127.0.0.1:8500/v1/peerings?index=41156796\u0026stale=\u0026wait=300000ms\": context deadline exceeded (retry attempt 6 after \"8s\")","@module":"agent","@timestamp":"2025-03-24T15:54:01.035274Z"}
{"@level":"error","@message":"(runner) sending server error back to caller","@module":"agent","@timestamp":"2025-03-24T15:54:01.035344Z"}
{"@level":"warn","@message":"(view) list.peerings: Get \"http://127.0.0.1:8500/v1/peerings?index=41156796\u0026stale=\u0026wait=300000ms\": context deadline exceeded (retry attempt 7 after \"16s\")","@module":"agent","@timestamp":"2025-03-24T15:55:09.037003Z"}
{"@level":"error","@message":"(runner) sending server error back to caller","@module":"agent","@timestamp":"2025-03-24T15:55:09.037086Z"}
{"@level":"warn","@message":"(view) list.peerings: Get \"http://127.0.0.1:8500/v1/peerings?index=41156796\u0026stale=\u0026wait=300000ms\": context deadline exceeded (retry attempt 8 after \"32s\")","@module":"agent","@timestamp":"2025-03-24T15:56:25.039093Z"}
{"@level":"error","@message":"(runner) sending server error back to caller","@module":"agent","@timestamp":"2025-03-24T15:56:25.039145Z"}
{"@level":"warn","@message":"(view) list.peerings: Get \"http://127.0.0.1:8500/v1/peerings?index=41156796\u0026stale=\u0026wait=300000ms\": context deadline exceeded (retry attempt 9 after \"1m0s\")","@module":"agent","@timestamp":"2025-03-24T15:57:57.040381Z"}
{"@level":"error","@message":"(runner) sending server error back to caller","@module":"agent","@timestamp":"2025-03-24T15:57:57.040425Z"}
{"@level":"warn","@message":"(view) list.peerings: Get \"http://127.0.0.1:8500/v1/peerings?index=41156796\u0026stale=\u0026wait=300000ms\": context deadline exceeded (retry attempt 10 after \"1m0s\")","@module":"agent","@timestamp":"2025-03-24T15:59:57.041280Z"}
{"@level":"error","@message":"(runner) sending server error back to caller","@module":"agent","@timestamp":"2025-03-24T15:59:57.041330Z"}
{"@level":"warn","@message":"(view) list.peerings: Get \"http://127.0.0.1:8500/v1/peerings?index=41156796\u0026stale=\u0026wait=300000ms\": context deadline exceeded (retry attempt 11 after \"1m0s\")","@module":"agent","@timestamp":"2025-03-24T16:01:57.042940Z"}
{"@level":"error","@message":"(runner) sending server error back to caller","@module":"agent","@timestamp":"2025-03-24T16:01:57.042997Z"}
{"@level":"warn","@message":"(view) list.peerings: Get \"http://127.0.0.1:8500/v1/peerings?index=41156796\u0026stale=\u0026wait=300000ms\": context deadline exceeded (retry attempt 12 after \"1m0s\")","@module":"agent","@timestamp":"2025-03-24T16:03:57.044784Z"}
{"@level":"error","@message":"(runner) sending server error back to caller","@module":"agent","@timestamp":"2025-03-24T16:03:57.044858Z"}
{"@level":"error","@message":"(view) list.peerings: Get \"http://127.0.0.1:8500/v1/peerings?index=41156796\u0026stale=\u0026wait=300000ms\": context deadline exceeded (exceeded maximum retries)","@module":"agent","@timestamp":"2025-03-24T16:05:57.045917Z"}
{"@level":"error","@message":"(runner) sending server error back to caller","@module":"agent","@timestamp":"2025-03-24T16:05:57.045988Z"}
{"@level":"error","@message":"(runner) watcher reported error: list.peerings: Get \"http://127.0.0.1:8500/v1/peerings?index=41156796\u0026stale=\u0026wait=300000ms\": context deadline exceeded","@module":"agent","@timestamp":"2025-03-24T16:05:57.046013Z"}
{"@level":"info","@message":"Task event","@module":"client.alloc_runner.task_runner","@timestamp":"2025-03-24T16:05:57.046067Z","alloc_id":"b3e17da0-8fec-fb9b-51b9-b14ea2c404e2","failed":true,"msg":"Template failed: list.peerings: Get \"http://127.0.0.1:8500/v1/peerings?index=41156796\u0026stale=\u0026wait=300000ms\": context deadline exceeded","task":"haproxy","type":"Killing"}

EtienneBruines commented Mar 24, 2025

This seems to be an issue in the Consul Template code. I submitted a PR (hashicorp/consul-template#2042) that should fix this behavior. Once that is merged, we can get that fix in Nomad.

As a workaround, we are pinning the clients to Nomad v1.9.6 until this is resolved. The issue is causing downtime on clients running Nomad v1.9.7 and above.

As a better workaround, we are now setting block_query_wait to 55s, which lets us keep updating our Nomad releases until the issue is fixed.
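The block_query_wait override above would be set in the Nomad client agent configuration, roughly like this (a sketch only; the key name follows the comment above, so verify the exact spelling and placement against your Nomad version's client template configuration docs):

```hcl
# Sketch of the workaround in the Nomad client agent configuration.
client {
  template {
    # Keep each blocking query comfortably under consul-template's
    # request deadline so quiet long polls return before the context
    # expires.
    block_query_wait = "55s"
  }
}
```

Choosing a value just under one minute keeps the blocking query inside whatever internal deadline is cutting the 5m default short, at the cost of slightly more frequent polling.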


tgross commented Mar 24, 2025

Hi @EtienneBruines! I think your assessment of the problem is accurate. I've left some comments on your PR hashicorp/consul-template#2042 (review). I'm not an approver for that repo, but if we can work together to get that patch hammered into shape I can nudge it along with our colleagues.

@tgross tgross moved this from Needs Triage to Triaging in Nomad - Community Issues Triage Mar 24, 2025
@tgross tgross added stage/waiting-on-upstream This issue is waiting on an upstream PR review theme/consul-template labels Mar 24, 2025
@tgross tgross self-assigned this Mar 24, 2025
@tgross tgross moved this from Triaging to In Progress in Nomad - Community Issues Triage Mar 24, 2025

2 participants