Handling errors of InfluxDB under high load

When InfluxDB reaches its load limits, it starts returning various errors that signal the client to back off and retry once the load is lower. This is common for clusters where nobody wants to pay for unused cluster nodes.

Telegraf Implementation Status

Telegraf retries writing failed measurements until they are written. There is no limit on the number of retries. See line 126 of the following file:

https://github.com/influxdata/telegraf/blob/master/internal/models/running_output.go/#L126

Telegraf checks for specific errors from InfluxDB so that measurement writes that cannot be fixed by retrying are not attempted over and over.

https://github.com/influxdata/telegraf/blob/master/plugins/outputs/influxdb/influxdb.go
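The check described above can be sketched roughly as follows. This is a minimal illustration, assuming errors are matched by substring; the function name and the marker strings are illustrative assumptions, not Telegraf's actual list.

```python
# Illustrative markers only (assumed, not copied from Telegraf):
# errors that re-sending the same payload cannot fix.
NON_RETRYABLE_MARKERS = (
    "field type conflict",
    "unable to parse",
)


def should_retry(error_message):
    """Return False when the error cannot be fixed by retrying the same write."""
    msg = error_message.lower()
    return not any(marker in msg for marker in NON_RETRYABLE_MARKERS)
```

A write rejected because of load would be retried, while a write rejected because of its content would be dropped instead of filling the retry buffer.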

The buffer of failed writes is limited (10000 entries by default). Once it is full, additional failed writes replace the oldest entries in it.
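A bounded buffer with this drop-oldest behavior can be sketched as below. The class and method names are hypothetical and only illustrate the idea; this is not Telegraf's implementation.

```python
from collections import deque


class FailedWriteBuffer:
    """Bounded buffer that overwrites the oldest entry when full (hypothetical name)."""

    def __init__(self, limit=10000):
        # A deque with maxlen discards items from the left when a new one is appended.
        self._entries = deque(maxlen=limit)
        self.dropped = 0  # number of entries lost to overflow

    def add(self, metric):
        if len(self._entries) == self._entries.maxlen:
            self.dropped += 1
        self._entries.append(metric)

    def drain(self):
        """Return all buffered entries, oldest first, and empty the buffer."""
        batch = list(self._entries)
        self._entries.clear()
        return batch
```

With a limit of 3, adding five entries leaves only the last three in the buffer and counts two as dropped.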

The default client HTTP timeout for Telegraf is 5 seconds, as shown here:

https://github.com/influxdata/telegraf/blob/master/plugins/outputs/influxdb/client/http.go

and

https://blog.cloudflare.com/the-complete-guide-to-golang-net-http-timeouts/#clienttimeouts

Write interval Jitter

Telegraf implements a variable flush interval on its internal buffers so that the load from multiple nodes is spread over time, and there are no repeated spikes caused by multiple clients writing their batches to InfluxDB in sync.
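The idea can be sketched as adding a random offset to a fixed interval. The parameter names below are modeled on Telegraf's flush_interval and flush_jitter agent settings, but the function itself is a hypothetical illustration, not Telegraf code.

```python
import random


def next_flush_delay(flush_interval, flush_jitter, rng=random):
    """Return the delay in seconds before the next flush.

    The result lies between flush_interval and flush_interval + flush_jitter,
    so clients started at the same moment drift apart over time.
    """
    return flush_interval + rng.uniform(0.0, flush_jitter)
```

For example, an interval of 10 seconds with a jitter of 5 seconds yields a flush every 10 to 15 seconds.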

cache-max-memory-size

InfluxDB maintains a cache on input (its size is configured by the cache-max-memory-size option). When the limit is reached, InfluxDB rejects further writes with the message: ("cache-max-memory-size exceeded: (%d/%d)", n, limit)
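A client that wants to recognize this rejection could match the message format quoted above. The following is a minimal sketch; the function name is hypothetical, and it assumes the message reaches the client with the (n/limit) part intact.

```python
import re

# Matches the format string quoted above: "cache-max-memory-size exceeded: (%d/%d)"
_CACHE_EXCEEDED = re.compile(r"cache-max-memory-size exceeded: \((\d+)/(\d+)\)")


def parse_cache_exceeded(message):
    """Return (n, limit) if the message reports an exceeded cache, else None."""
    match = _CACHE_EXCEEDED.search(message)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))
```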

Influx DB Java

The Java client uses the default Retrofit timeouts:

Connection timeout: ten seconds
Read timeout: ten seconds
Write timeout: ten seconds

https://futurestud.io/tutorials/retrofit-2-customize-network-timeouts

These timeouts can't be configured right now.

No flush interval jittering

No retry on backoff messages

Benchmarking tool

There is a benchmarking tool used to measure InfluxDB performance. Its behavior should match how real clients behave so that the results can be reproduced in the real world. The benchmarking tool listens for error messages from the server and slows down when any of the following errors are reported:

engine: cache maximum memory size exceeded
write failed: hinted handoff queue not empty
write failed: read message type: read tcp
i/o timeout
write failed: engine: cache-max-memory-size exceeded
timeout
write failed: can not exceed max connections of 500
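The slow-down logic can be sketched as a substring match against the messages listed above. The exponential growth of the delay, the function names, and the default values are assumptions made for illustration; the benchmarking tool may slow down differently.

```python
# Error substrings taken from the list above.
BACKOFF_MARKERS = (
    "engine: cache maximum memory size exceeded",
    "write failed: hinted handoff queue not empty",
    "write failed: read message type: read tcp",
    "i/o timeout",
    "write failed: engine: cache-max-memory-size exceeded",
    "timeout",
    "write failed: can not exceed max connections of 500",
)


def is_backoff_error(message):
    """Return True when the server message asks the client to slow down."""
    return any(marker in message for marker in BACKOFF_MARKERS)


def next_delay(current_delay, message, base=0.1, factor=2.0, max_delay=10.0):
    """Return the pause (seconds) before the next write.

    Assumed policy: start at `base`, multiply by `factor` on each
    back-off error up to `max_delay`, and reset to zero otherwise.
    """
    if not is_backoff_error(message):
        return 0.0
    if current_delay <= 0:
        return base
    return min(current_delay * factor, max_delay)
```

Passing an empty message for a successful write resets the delay, so the client returns to full speed as soon as the server stops reporting overload.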
