
OTLP: Error when outputting OTLP in 3.2.7 (and 3.2.8) #10071

Open
nortenorte opened this issue Mar 12, 2025 · 6 comments

nortenorte commented Mar 12, 2025

Bug Report

Describe the bug
Error when outputting OTLP in version 3.2.7 (and 3.2.8). The error happens for metrics, logs, and traces.

The error has never happened on older versions; for example, 3.2.6 works with exactly the same test setup.

[error] [engine] chunk '422061-1741761815.979790809.flb' cannot be retried: task_id=0, input=opentelemetry.0 > output=opentelemetry.0

See more details below.

To Reproduce

Build fluent-bit with the following defines:
cmake .. -DFLB_SIGNV4=Off -DFLB_AWS=Off -DFLB_FILTER_AWS=Off -DFLB_OUT_S3=Off -DFLB_OUT_KINESIS_FIREHOSE=Off -DFLB_OUT_KINESIS_STREAMS=Off -DFLB_OUT_CLOUDWATCH_LOGS=Off -DFLB_OUT_BIGQUERY=Off

  1. Start the opentelemetry-collector. Official repo: https://github.com/open-telemetry/opentelemetry-collector
    Installation according to: https://opentelemetry.io/docs/collector/installation/

Start:
bin> ./otelcorecol_linux_amd64 --config otelconfig.yaml

otelconfig.yaml

receivers:
  otlp:
    protocols:
      http:
        endpoint: 127.0.0.1:4319
 
processors:
  batch:
  memory_limiter:
    # 75% of maximum memory up to 2G
    limit_mib: 1536
    # 25% of limit up to 2G
    spike_limit_mib: 512
    check_interval: 5s
 
exporters:
  debug:
    verbosity: detailed
 
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [debug]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [debug]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [debug]
  2. Start Fluent Bit.
    ./fluent-bit -c ~/fluent-bit.yaml

fluent-bit.yaml

service:
  log_level: trace 

pipeline:
  inputs:
    - name: opentelemetry
      tag: input_open
      listen: 127.0.0.1
      port: 4318
      successful_response_code: 200
  outputs:
    - name: opentelemetry
      host: 127.0.0.1
      port: 4319
      match: '*' 
  3. Send an OTLP message.

Install protocurl: https://github.com/qaware/protocurl

Use protocurl to send a standard OTLP logs message. (Proto definitions: https://github.com/open-telemetry/opentelemetry-proto)

opentelemetry-proto> protocurl \
-i opentelemetry.proto.collector.logs.v1.ExportLogsServiceRequest \
-o opentelemetry.proto.collector.logs.v1.ExportLogsServiceResponse \
-u http://localhost:4318/v1/logs \
-d @examples/logs.json \
-I .
  4. Check the result. Fluent Bit log output:
[2025/03/12 07:43:36] [debug] [task] created task=0x7e5244033dd0 id=0 OK
[2025/03/12 07:43:36] [debug] [upstream] KA connection #36 to 127.0.0.1:4319 is connected
[2025/03/12 07:43:36] [ warn] [output:opentelemetry:opentelemetry.0] error performing HTTP request, remote host=127.0.0.1:4319 connection error
[2025/03/12 07:43:36] [debug] [upstream] KA connection #36 to 127.0.0.1:4319 is now available
[2025/03/12 07:43:36] [debug] [upstream] KA connection #36 to 127.0.0.1:4319 has been disconnected by the remote service
[2025/03/12 07:43:36] [debug] [out flush] cb_destroy coro_id=0
[2025/03/12 07:43:36] [debug] [retry] new retry created for task_id=0 attempts=1
[2025/03/12 07:43:36] [ warn] [engine] failed to flush chunk '422061-1741761815.979790809.flb', retry in 9 seconds: task_id=0, input=opentelemetry.0 > output=opentelemetry.0 (out_id=0)
[2025/03/12 07:43:45] [ warn] [output:opentelemetry:opentelemetry.0] error performing HTTP request, remote host=127.0.0.1:4319 connection error
[2025/03/12 07:43:45] [debug] [out flush] cb_destroy coro_id=1
[2025/03/12 07:43:45] [debug] [task] task_id=0 reached retry-attempts limit 1/1
[2025/03/12 07:43:45] [error] [engine] chunk '422061-1741761815.979790809.flb' cannot be retried: task_id=0, input=opentelemetry.0 > output=opentelemetry.0
[2025/03/12 07:43:45] [debug] [task] destroy task=0x7e5244033dd0 (task_id=0)

Expected behavior

Result from v3.2.6:

[2025/03/12 07:47:58] [debug] [task] created task=0x756318033e90 id=0 OK
[2025/03/12 07:47:58] [debug] [upstream] KA connection #36 to 127.0.0.1:4319 is connected
[2025/03/12 07:47:58] [ info] [output:opentelemetry:opentelemetry.0] 127.0.0.1:4319, HTTP status=200
[2025/03/12 07:47:58] [debug] [upstream] KA connection #36 to 127.0.0.1:4319 is now available
[2025/03/12 07:47:58] [debug] [out flush] cb_destroy coro_id=0
[2025/03/12 07:47:58] [debug] [task] destroy task=0x756318033e90 (task_id=0)

Your Environment

  • Version used: 3.2.7 (and 3.2.8)
  • Configuration: See above configuration
  • Environment name and version (e.g. Kubernetes? What version?):
  • Server type and version: N/A
  • Operating System and version: Ubuntu 24.04 (x86), target system is embedded Linux on arm64. Same problem on both OS and hardware.
  • Filters and plugins: See above configuration

Additional context
Cannot use v3.2.7 and v3.2.8, as we cannot get any logs out.

patrick-stephens (Contributor) commented Mar 12, 2025

Does it work with the official builds/containers?

patrick-stephens added the waiting-for-user label Mar 12, 2025
nortenorte (Author) commented Mar 12, 2025

@patrick-stephens Does not work using the containers (works on 3.2.6).

Recreate using container:

fluent-bit_stdout_otlp.yaml

service:
    flush: 1
    log_level: trace
 
pipeline:
    inputs:
        - name: opentelemetry
          listen: 0.0.0.0
          port: 4318
          successful_response_code: 200
    outputs:
        - name: stdout
          match: '*'
        - name: opentelemetry
          match: '*'
          host: 127.0.0.1
          port: 4319

Create a directory, put fluent-bit_stdout_otlp.yaml in it, then run the following command:
docker run -v $(pwd):$(pwd) -w $(pwd) --rm -it -p 4318:4318 --network host fluent/fluent-bit:3.2.8 /fluent-bit/bin/fluent-bit -c fluent-bit_stdout_otlp.yaml

Start the opentelemetry-collector. Official repo: https://github.com/open-telemetry/opentelemetry-collector
Installation according to: https://opentelemetry.io/docs/collector/installation/

Start:
bin> ./otelcorecol_linux_amd64 --config otelconfig.yaml

otelconfig.yaml

receivers:
  otlp:
    protocols:
      http:
        endpoint: 127.0.0.1:4319
  
processors:
  batch:
  memory_limiter:
    # 75% of maximum memory up to 2G
    limit_mib: 1536
    # 25% of limit up to 2G
    spike_limit_mib: 512
    check_interval: 5s
  
exporters:
  debug:
    verbosity: detailed
  
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [debug]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [debug]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [debug] 

Send a simple curl command:
curl --header "Content-Type: application/json" --request POST --data '{"resourceLogs":[{"resource":{},"scopeLogs":[{"scope":{},"logRecords":[{"timeUnixNano":"1660296023390371588","body":{"stringValue":"{\"message\":\"dummy\"}"},"traceId":"","spanId":""}]}]}]}' http://0.0.0.0:4318/v1/logs
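
The same payload, pretty-printed for readability:

{
  "resourceLogs": [
    {
      "resource": {},
      "scopeLogs": [
        {
          "scope": {},
          "logRecords": [
            {
              "timeUnixNano": "1660296023390371588",
              "body": { "stringValue": "{\"message\":\"dummy\"}" },
              "traceId": "",
              "spanId": ""
            }
          ]
        }
      ]
    }
  ]
}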

Output:

[2025/03/12 15:01:05] [debug] [input:opentelemetry:opentelemetry.0] attributes missing
[2025/03/12 15:01:06] [debug] [task] created task=0x7bde54a36640 id=0 OK
[2025/03/12 15:01:06] [debug] [output:stdout:stdout.0] task_id=0 assigned to thread #0
[0] v1_logs: [[-1.000000000, {"schema"=>"otlp", "resource_id"=>0, "scope_id"=>0}], {"resource"=>{}, "scope"=>{}}]
[1] v1_logs: [[1660296023.1698112429, {"otlp"=>{"trace_id"=>"", "span_id"=>""}}], {"log"=>"{"message":"dummy"}"}]
[2] v1_logs: [[-2.000000000, {}], {}]
[2025/03/12 15:01:06] [debug] [out flush] cb_destroy coro_id=0
[2025/03/12 15:01:06] [debug] [upstream] KA connection #50 to 127.0.0.1:4319 is connected
[2025/03/12 15:01:06] [ warn] [output:opentelemetry:opentelemetry.1] error performing HTTP request, remote host=127.0.0.1:4319 connection error
[2025/03/12 15:01:06] [debug] [out flush] cb_destroy coro_id=0
[2025/03/12 15:01:06] [debug] [retry] new retry created for task_id=0 attempts=1
[2025/03/12 15:01:06] [ warn] [engine] failed to flush chunk '1-1741791665.723364920.flb', retry in 6 seconds: task_id=0, input=opentelemetry.0 > output=opentelemetry.1 (out_id=1)
[2025/03/12 15:01:12] [ warn] [output:opentelemetry:opentelemetry.1] error performing HTTP request, remote host=127.0.0.1:4319 connection error
[2025/03/12 15:01:12] [debug] [out flush] cb_destroy coro_id=1
[2025/03/12 15:01:12] [debug] [task] task_id=0 reached retry-attempts limit 1/1
[2025/03/12 15:01:12] [error] [engine] chunk '1-1741791665.723364920.flb' cannot be retried: task_id=0, input=opentelemetry.0 > output=opentelemetry.1
[2025/03/12 15:01:12] [debug] [task] destroy task=0x7bde54a36640 (task_id=0)

It's easy to switch between 3.2.6 and 3.2.8 and see the different result (success vs failure).
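
To switch versions, only the image tag in the docker command changes (same fluent-bit_stdout_otlp.yaml and working directory as above):

# 3.2.6: succeeds
docker run -v $(pwd):$(pwd) -w $(pwd) --rm -it -p 4318:4318 --network host fluent/fluent-bit:3.2.6 /fluent-bit/bin/fluent-bit -c fluent-bit_stdout_otlp.yaml

# 3.2.8: fails
docker run -v $(pwd):$(pwd) -w $(pwd) --rm -it -p 4318:4318 --network host fluent/fluent-bit:3.2.8 /fluent-bit/bin/fluent-bit -c fluent-bit_stdout_otlp.yaml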

patrick-stephens removed the waiting-for-user label Mar 13, 2025
edsiper (Member) commented Mar 17, 2025

wondering if this PR is the fix: #10083

nortenorte (Author) commented Mar 17, 2025

@edsiper Unfortunately no.

I checked out the commit below on branch 3.2.

commit 554c0cd (HEAD -> 3.2, origin/3.2)
Author: Leonardo Alminana <[email protected]>
Date: Fri Mar 14 11:42:24 2025 +0100

out_opentelemetry: restored old group meta record processing mechanism

Signed-off-by: Leonardo Alminana <[email protected]>

Still the same issue:

[2025/03/17 08:38:27] [debug] [task] created task=0x700340035a70 id=0 OK
[2025/03/17 08:38:27] [debug] [upstream] KA connection #36 to 127.0.0.1:4319 is connected
[2025/03/17 08:38:27] [ warn] [output:opentelemetry:opentelemetry.0] error performing HTTP request, remote host=127.0.0.1:4319 connection error
[2025/03/17 08:38:27] [debug] [out flush] cb_destroy coro_id=0
[2025/03/17 08:38:27] [debug] [retry] new retry created for task_id=0 attempts=1
[2025/03/17 08:38:27] [ warn] [engine] failed to flush chunk '121723-1742197106.492297243.flb', retry in 7 seconds: task_id=0, input=opentelemetry.0 > output=opentelemetry.0 (out_id=0)
[2025/03/17 08:38:34] [ warn] [output:opentelemetry:opentelemetry.0] error performing HTTP request, remote host=127.0.0.1:4319 connection error
[2025/03/17 08:38:34] [debug] [out flush] cb_destroy coro_id=1
[2025/03/17 08:38:34] [debug] [task] task_id=0 reached retry-attempts limit 1/1
[2025/03/17 08:38:34] [error] [engine] chunk '121723-1742197106.492297243.flb' cannot be retried: task_id=0, input=opentelemetry.0 > output=opentelemetry.0
[2025/03/17 08:38:34] [debug] [task] destroy task=0x700340035a70 (task_id=0)

edsiper (Member) commented Mar 17, 2025

It's not a bug: the Fluent Bit OpenTelemetry output enables HTTP/2 by default, and it seems the collector drops the connection.

Setting http2: off will make it work.

Considering turning that into the new default.
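
For reference, a minimal sketch of the workaround applied to the output section of the reporter's fluent-bit.yaml (only the http2 line is new; the option name is taken from the comment above):

pipeline:
  outputs:
    - name: opentelemetry
      host: 127.0.0.1
      port: 4319
      match: '*'
      # Workaround: force HTTP/1.1 so the collector's OTLP HTTP receiver
      # does not drop the connection
      http2: off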

edsiper (Member) commented Mar 17, 2025

#10089
