Describe the bug
We are up against a really strange and frustrating problem. I do not have any experience with Fluentd at all, so I will try to describe it as completely as possible.
We have deployed Fluentd as a DaemonSet in a Kubernetes cluster. Fluentd is configured to gather logs from multiple sources (Docker daemon, network, etc.) and send them to a hosted AWS Elasticsearch cluster.
Along with the logging mentioned above, we have in-app mechanisms that log directly to Fluentd through a separate @type forward source created only for this in-app logging, which is then routed through a match block with @type elasticsearch.
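For reference, this part of the pipeline looks roughly like the sketch below. It is a simplified approximation rather than our exact configuration: the app.** tag prefix, the port and the Elasticsearch endpoint are placeholders.

```
# Dedicated forward input used only by the in-app loggers (default forward port).
<source>
  @type forward
  port 24224
  bind 0.0.0.0
</source>

# Ship only the in-app tags to the hosted Elasticsearch.
<match app.**>
  @type elasticsearch
  host my-domain.eu-west-1.es.amazonaws.com   # placeholder endpoint
  port 443
  scheme https
  logstash_format true
</match>
```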
The problem is that this in-app log flow creates a steady but slow memory leak on the node it runs on. The even stranger thing is that the leak is not happening in userspace application memory: both the apps' and Fluentd's process memory remain stable. What constantly increases is kernel memory, which steadily reduces the node's available memory until memory starvation problems begin. Note that I am referring to non-cache kernel memory that is not freed when reclaim is requested. The applications are not that logging-heavy; maximum throughput should be around 10 log lines/sec in total.
This is not happening with any of the other log configurations in Fluentd, where Docker, system, Kubernetes, etc. logs are scraped. If I turn off this in-app mechanism, there is no memory leak!
I have installed different monitoring tools on the server to see whether some other metric's trend correlates with the memory decrease. The only one I found that correlates closely is IPv4 TCP memory usage, which makes some sense, since the in-app logs reach Fluentd over TCP and that memory is kernel-allocated. However, although the trend is similar, the actual amounts do not match: in the screenshots attached below, covering the same time period, system memory decreases by around 700 MB while TCP memory usage increases by only about 30 MB. The trend itself, though, is a complete match.
Any help with this problem would be really appreciated! Feel free to ask for any extra information you might need.
Below are the details of my configuration and setup.
To Reproduce
A simple pod running a Node.js app that sends logs directly to Fluentd using the fluent-logger npm package is enough to cause the memory problem.
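For illustration, a minimal sketch of such an app is shown below. This is not our real application; the tag, host and emit interval are placeholder values, assuming the Fluentd forward input listens on its default port.

```js
// Minimal repro sketch using the fluent-logger npm package.
const logger = require('fluent-logger');

// Point the logger at the Fluentd forward input (default port 24224).
logger.configure('app.repro', {
  host: 'fluentd.logging.svc.cluster.local', // placeholder service address
  port: 24224,
  timeout: 3.0,
  reconnectInterval: 600000 // 10 minutes
});

// Emit a few log lines per second, roughly matching our real throughput.
setInterval(() => {
  logger.emit('test', { message: 'hello from the repro app', ts: Date.now() });
}, 200);
```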
Expected behavior
I expect the kernel memory to remain stable when usage is also stable, as is the case with the rest of the logging configuration.
We didn't actually touch any extra Fluentd metrics, since the service operated normally without any problematic behaviour. The problems shown were observed in the Node.js client apps.
What seems to be related is this issue in Node.js, which was fixed in recent versions: nodejs/node#36650
Since then, the clients' behaviour appears to be normal again.
Your Environment
Your Configuration
The Fluentd DaemonSet is deployed using the latest chart version (v11.3.0) from https://github.com/kokuwaio/helm-charts/blob/main/charts/fluentd-elasticsearch/Chart.yaml
Since there is a lot of configuration, I will only include the parts relevant to the problem here. If the full configuration is needed, let me know and I will paste it in a pastebin or similar.
Additional context