Handling errors of InfluxDB under high load
When InfluxDB reaches its load limits, it starts returning various errors that signal the client to back off and retry once the load drops. This is common in clusters, where nobody wants to pay for unused cluster nodes.
Telegraf retries failed writes until they succeed. There is no limit on the number of retries. See line 126 below:
Telegraf checks for specific errors from InfluxDB so that writes which cannot be fixed by retrying are not attempted over and over.
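The retry behaviour described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not Telegraf's actual code: `writeFunc`, `errBusy`, `errPermanent`, and `writeWithRetry` are hypothetical names, and the real client decides which errors are retryable by inspecting specific InfluxDB responses.

```go
package main

import (
	"errors"
	"time"
)

// Hypothetical error values used only for this sketch.
var (
	// errBusy stands for a transient overload error that is worth retrying.
	errBusy = errors.New("server busy")
	// errPermanent stands for an error that retrying cannot fix.
	errPermanent = errors.New("permanent write error")
)

// writeFunc is a hypothetical stand-in for a single write attempt.
type writeFunc func() error

// writeWithRetry calls write until it succeeds, with no limit on the number
// of attempts. It gives up immediately when the error is marked permanent.
func writeWithRetry(write writeFunc, pause time.Duration) error {
	for {
		err := write()
		if err == nil {
			return nil
		}
		if errors.Is(err, errPermanent) {
			return err
		}
		time.Sleep(pause)
	}
}
```

Because the loop is unbounded, a caller would normally pair it with a bounded buffer so that memory use stays limited while the server is unavailable.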
The buffer of failed writes is limited (10000 entries by default). Once it is full, each additional failed write replaces the oldest entry.
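A drop-oldest buffer of this kind might look like the sketch below. The type `failedBuffer` and its methods are hypothetical and are not taken from Telegraf. A slice is used for brevity, where a production implementation would more likely use a ring buffer.

```go
package main

// failedBuffer is a hypothetical bounded buffer of failed writes.
// When it is full, adding an entry drops the oldest one.
type failedBuffer struct {
	entries []string
	limit   int
}

// newFailedBuffer creates a buffer that holds at most limit entries.
func newFailedBuffer(limit int) *failedBuffer {
	return &failedBuffer{limit: limit}
}

// add appends entry and evicts the oldest entries while the buffer is over
// its limit. It returns the number of entries that were dropped.
func (b *failedBuffer) add(entry string) int {
	b.entries = append(b.entries, entry)
	dropped := 0
	for len(b.entries) > b.limit {
		b.entries = b.entries[1:]
		dropped++
	}
	return dropped
}
```

With a limit of 3, adding "a", "b", "c", and then "d" leaves "b", "c", "d" in the buffer, and the last call reports one dropped entry.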
The default client HTTP timeout for Telegraf is 5 seconds, see below:
and
https://blog.cloudflare.com/the-complete-guide-to-golang-net-http-timeouts/#clienttimeouts
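In Go, a client-wide timeout of this kind is set through the `Timeout` field of `http.Client`. The sketch below shows only that mechanism. The names `newWriteClient` and `defaultWriteTimeout` are hypothetical and do not come from Telegraf.

```go
package main

import (
	"net/http"
	"time"
)

// defaultWriteTimeout mirrors the 5 second default mentioned above.
const defaultWriteTimeout = 5 * time.Second

// newWriteClient returns an HTTP client whose Timeout covers the whole
// exchange: connecting, any redirects, and reading the response body.
func newWriteClient(timeout time.Duration) *http.Client {
	return &http.Client{Timeout: timeout}
}
```

A request that exceeds the timeout fails with an error on the client side, which the caller can then treat as a retryable write failure.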
Telegraf varies the flush interval of its internal buffers so that the load from multiple nodes is spread over time. This avoids repeated spikes when multiple clients become synchronized and write their batches to InfluxDB at the same moment.
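One common way to vary a flush interval is to add a random offset to each delay. The sketch below assumes that approach. The functions `nextFlushDelay` and `newJitterSource` are hypothetical and are not Telegraf's implementation.

```go
package main

import (
	"math/rand"
	"time"
)

// newJitterSource returns a seeded random source for the sketch.
func newJitterSource(seed int64) *rand.Rand {
	return rand.New(rand.NewSource(seed))
}

// nextFlushDelay returns interval plus a random offset in [0, jitter).
// With jitter <= 0 the interval is returned unchanged.
func nextFlushDelay(interval, jitter time.Duration, rng *rand.Rand) time.Duration {
	if jitter <= 0 {
		return interval
	}
	return interval + time.Duration(rng.Int63n(int64(jitter)))
}
```

Since every client draws its own offset, clients that start at the same moment drift apart instead of flushing in lockstep.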
InfluxDB maintains a cache on input, with its size configured by the cache-max-memory-size option. When the limit is reached, InfluxDB rejects further writes with the message: ("cache-max-memory-size exceeded: (%d/%d)", n, limit)
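The rejection can be pictured as a simple admission check. The following is a sketch rather than InfluxDB's source: `checkCacheLimit` and its parameters are hypothetical, and treating a limit of 0 as "no limit" is an assumption. Only the message format is taken from the text above.

```go
package main

import "fmt"

// checkCacheLimit rejects a write of writeSize bytes when it would push the
// cache past limit. A limit of 0 is assumed to mean "no limit".
func checkCacheLimit(cacheSize, writeSize, limit uint64) error {
	n := cacheSize + writeSize
	if limit > 0 && n > limit {
		return fmt.Errorf("cache-max-memory-size exceeded: (%d/%d)", n, limit)
	}
	return nil
}
```

For example, a 20 byte write into a cache holding 90 bytes with a limit of 100 is rejected with `cache-max-memory-size exceeded: (110/100)`.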
The Retrofit-based client uses the default Retrofit timeouts:
- Connection timeout: ten seconds
- Read timeout: ten seconds
- Write timeout: ten seconds
https://futurestud.io/tutorials/retrofit-2-customize-network-timeouts
These timeouts cannot currently be configured.
There is a benchmarking tool used to measure InfluxDB performance. Its behavior should match how real clients behave so that the results can be reproduced in the real world. The benchmarking tool listens for error messages from the server and slows down when any of the following errors is reported:
- engine: cache maximum memory size exceeded
- write failed: hinted handoff queue not empty
- write failed: read message type: read tcp
- i/o timeout
- write failed: engine: cache-max-memory-size exceeded
- timeout
- write failed: can not exceed max connections of 500
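The slow-down logic for the errors listed above could be sketched as follows. This is not the benchmarking tool's code: `shouldBackOff` and `nextDelay` are hypothetical names, substring matching is an assumption, and doubling the delay is just one possible back-off policy.

```go
package main

import (
	"strings"
	"time"
)

// backoffErrors holds the message fragments from the list above.
var backoffErrors = []string{
	"engine: cache maximum memory size exceeded",
	"write failed: hinted handoff queue not empty",
	"write failed: read message type: read tcp",
	"i/o timeout",
	"write failed: engine: cache-max-memory-size exceeded",
	"timeout",
	"write failed: can not exceed max connections of 500",
}

// shouldBackOff reports whether msg contains one of the listed fragments.
func shouldBackOff(msg string) bool {
	for _, fragment := range backoffErrors {
		if strings.Contains(msg, fragment) {
			return true
		}
	}
	return false
}

// nextDelay doubles the current delay after a back-off error, keeping it
// between base and ceiling, and resets it to base after a success.
func nextDelay(current, base, ceiling time.Duration, backOff bool) time.Duration {
	if !backOff {
		return base
	}
	next := current * 2
	if next < base {
		next = base
	}
	if next > ceiling {
		next = ceiling
	}
	return next
}
```

A write loop would call `shouldBackOff` on each server error and sleep for `nextDelay(...)` before sending the next batch, so the offered load drops while the server is saturated.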