Describe the bug
We are up against a really strange and frustrating problem. I do not have any experience with Fluentd at all, so I will try to describe it as completely as possible.
We have deployed Fluentd as a DaemonSet in a Kubernetes cluster. Fluentd is configured to gather logs from multiple sources (Docker daemon, network, etc.) and send them to a hosted AWS Elasticsearch cluster.
Along with the logging mentioned above, we have in-app mechanisms that log directly to Fluentd through a separate @type forward source created only for this in-app logging, which is then routed through a match block with @type elasticsearch.
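For reference, this part of the pipeline looks roughly like the sketch below. It is a simplified approximation rather than our exact configuration: the app.** tag prefix, the port and the Elasticsearch endpoint are placeholders.

```
# Dedicated forward input used only by the in-app loggers (default forward port).
<source>
  @type forward
  port 24224
  bind 0.0.0.0
</source>

# Ship only the in-app tags to the hosted Elasticsearch.
<match app.**>
  @type elasticsearch
  host my-domain.eu-west-1.es.amazonaws.com   # placeholder endpoint
  port 443
  scheme https
  logstash_format true
</match>
```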
The problem is that this in-app log flow creates a steady but slow memory leak on the node it runs on. The even stranger thing is that the leak is not happening in userspace application memory: both the apps' and Fluentd's process memory remain stable. What constantly increases is kernel memory, which steadily reduces the node's available memory until memory starvation problems begin. Note that I am referring to non-cache kernel memory that is not freed when reclaim is requested. The applications are not that logging-heavy; maximum throughput should be around 10 log lines/sec in total.
This is not happening with any of the other log configurations in Fluentd, where Docker, system, Kubernetes, etc. logs are scraped. If I turn off this in-app mechanism, there is no memory leak!
I have installed different monitoring tools on the server to see whether some other metric's trend correlates with the memory decrease. The only one I found that correlates closely is IPv4 TCP memory usage, which makes some sense, since the in-app logs reach Fluentd over TCP and that memory is kernel-allocated. However, although the trend is similar, the actual amounts do not match: in the screenshots attached below, covering the same time period, system memory decreases by around 700 MB while TCP memory usage increases by only about 30 MB. The trend itself, though, is a complete match.
Any help with this problem would be really appreciated! Feel free to ask for any extra information you might need.
Below are the details of my configuration and setup.
To Reproduce
A simple pod running a Node.js app that sends logs directly to Fluentd using the fluent-logger npm package is enough to cause the memory problem.
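For illustration, a minimal sketch of such an app is shown below. This is not our real application; the tag, host and emit interval are placeholder values, assuming the Fluentd forward input listens on its default port.

```js
// Minimal repro sketch using the fluent-logger npm package.
const logger = require('fluent-logger');

// Point the logger at the Fluentd forward input (default port 24224).
logger.configure('app.repro', {
  host: 'fluentd.logging.svc.cluster.local', // placeholder service address
  port: 24224,
  timeout: 3.0,
  reconnectInterval: 600000 // 10 minutes
});

// Emit a few log lines per second, roughly matching our real throughput.
setInterval(() => {
  logger.emit('test', { message: 'hello from the repro app', ts: Date.now() });
}, 200);
```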
Expected behavior
I expect the kernel memory to remain stable when usage is also stable, as is the case with the rest of the logging configuration.
We didn't actually touch any extra Fluentd metrics, since the service operated normally without any problematic behaviour. The problems shown were observed in the Node.js client apps.
What seems to be related is this issue in Node.js, which was fixed in recent versions: nodejs/node#36650
Since then, the clients' behaviour appears to be normal again.
Your Environment
Your Configuration
The Fluentd DaemonSet is deployed using the latest chart version (v11.3.0) from https://github.com/kokuwaio/helm-charts/blob/main/charts/fluentd-elasticsearch/Chart.yaml
Since there is a lot of configuration, I will only include the parts relevant to the problem here. If the full configuration is needed, let me know and I will paste it in a pastebin or similar.
Additional context