Improve context propagation for brave integration #6139

Draft: wants to merge 3 commits into base: main

jrhee17
Contributor

@jrhee17 jrhee17 commented Mar 5, 2025

Motivation:

Recently we received a report where brave's TraceContext behavior is different depending on the thread which reaches a BraveService. This is mainly due to how RequestContextCurrentTraceContext is implemented; the TraceContext is stored either in a ThreadLocal or RequestContext depending on the invoking thread.

e.g.

// will use ctx.attrs to propagate the trace context
sb.decorator(BraveService.newDecorator(tracing));
sb.decorator(ReschedulingService.newDecorator());

// will use the thread local to propagate the trace context
sb.decorator(ReschedulingService.newDecorator());
sb.decorator(BraveService.newDecorator(tracing));

Previously, the eventLoop check was added to avoid concurrent modification of the TraceContext stored in RequestContext#attrs.

However, given the introduction of serviceWorkers and blockingTaskExecutors, I think this logic makes less sense. By design, 1) the lifecycles of RequestContext and TraceContext differ, and 2) RequestContext has no facility to propagate multiple thread-local contexts together.

Given the current limitations, I propose that we simplify the logic so that the reasoning/expectation is more straightforward.

  • Once a RequestContext goes through a BraveService or BraveClient, the corresponding TraceContext is bound to the RequestContext
  • If a user sets a separate TraceContext via brave APIs, the user-specified TraceContext always has priority.
  • If a user does not set a separate TraceContext, the TraceContext bound to the RequestContext will always have priority

I believe this covers most cases since:

  1. Logic triggered from a request will be bound to the TraceContext assigned for a single request (via BraveClient or BraveService). This will be the default behavior if users don't use brave APIs directly which I think is reasonable.
  2. If users choose to use brave APIs directly, it is reasonable to assume that the lifecycle of RequestContext differs from the custom TraceContext.

Modifications:

  • Modified RequestContextCurrentTraceContext so that:
    • Opening a new scope always sets the thread local
    • Getting a scope always checks the thread local. If the thread local doesn't exist, the TraceContext set by Brave[Client|Service] is used
  • It doesn't really matter now if RequestContextCurrentTraceContext is pushed/popped from the designated event loop. Added deprecated annotations for RequestContextCurrentTraceContextBuilder#nonRequestThread
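The lookup rule in the list above can be modeled in plain Java without brave or Armeria on the classpath. This is only a minimal sketch under stated assumptions, not the code in this PR: `FakeRequestContext`, `boundTraceContext`, `Scope`, and the string-valued trace contexts are hypothetical stand-ins for `RequestContext`, the attribute set by Brave[Client|Service], and brave's scope and `TraceContext` types.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class TraceLookupSketch {
    /** Hypothetical stand-in for a RequestContext carrying one bound trace context. */
    static final class FakeRequestContext {
        volatile String boundTraceContext; // set once by the (simulated) BraveService
    }

    /** Hypothetical scope handle whose close() does not throw checked exceptions. */
    interface Scope extends AutoCloseable {
        @Override
        void close();
    }

    // Stand-in for the thread-local that an open scope writes to.
    private static final ThreadLocal<String> THREAD_LOCAL = new ThreadLocal<>();

    /** Opening a scope always sets the thread local; closing restores the previous value. */
    static Scope newScope(String traceContext) {
        final String previous = THREAD_LOCAL.get();
        THREAD_LOCAL.set(traceContext);
        return () -> THREAD_LOCAL.set(previous);
    }

    /** Lookup checks the thread local first, then falls back to the context-bound value. */
    static String get(FakeRequestContext ctx) {
        final String local = THREAD_LOCAL.get();
        if (local != null) {
            return local;
        }
        return ctx != null ? ctx.boundTraceContext : null;
    }

    /** Runs the same lookup on a different thread that only received the context. */
    static String getOnOtherThread(FakeRequestContext ctx) {
        final ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            final Callable<String> task = () -> get(ctx);
            return pool.submit(task).get();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        final FakeRequestContext ctx = new FakeRequestContext();
        ctx.boundTraceContext = "server-span";
        try (Scope ignored = newScope("internal")) {
            System.out.println(get(ctx));              // prints "internal"
            System.out.println(getOnOtherThread(ctx)); // prints "server-span"
        }
    }
}
```

In this model, a scope opened by user code is visible only on the thread that opened it, while any thread holding the context still finds the value bound by the decorator.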

Result:

  • The behavior of Brave[Client|Service] is more straightforward.

@jrhee17 jrhee17 added this to the 1.33.0 milestone Mar 5, 2025
@jrhee17 jrhee17 marked this pull request as ready for review March 5, 2025 02:44
@ikhoon
Contributor

ikhoon commented Mar 6, 2025

@codefromthecrypt Would you mind reviewing this PR?

Contributor

@codefromthecrypt codefromthecrypt left a comment

Somewhere in brave we have a strict trace scope decorator that makes sure no trace context leaks, which is a safeguard in case assumptions are incorrect.

So I would say, if you want to be very safe, do something like this as an after-test hook; if all tests pass, all good. If a scope issue happens later, you will already have the infra to prove you can fix the bug.

My 2p

Contributor

@minwoox minwoox left a comment

I think this approach will definitely work and I like it. Thanks!

@minwoox
Contributor

minwoox commented Mar 25, 2025

@anuraaga Would you mind taking a look? 😆

@anuraaga
Collaborator

If users choose to use brave APIs directly, it is reasonable to assume that the lifecycle of RequestContext differs from the custom TraceContext.

IIUC this is the main change - I think initially the intent of this is that a user can use brave to create spans while using only one context propagation mechanism, armeria's. I think this basically means that users have to use two context propagation methods always, armeria and brave. It seems like quite a breaking change but if it's ok / intended, I don't know if we need this adapter at all since it means brave usage isn't safe without using brave propagation directly.

I'm a bit unsure about the user report, but shouldn't it always be an issue of user missing armeria request context propagation? Is it because they use brave propagation themselves?

Not sure if I fully understood so let me know if I missed something.

@anuraaga
Collaborator

Concretely, I think this is the case I am thinking of. The context bridge was developed mainly so it could work.

void handler() {
  ScopedSpan span = tracer.startScopedSpan("internal");
  try {
    // The span is in "scope" meaning downstream code such as loggers can see trace IDs
    RequestContext.current().makeContextAware(threadPoolExecutor).submit(() -> {
      // Named differently from `span`: a lambda cannot redeclare an enclosing local variable
      ScopedSpan offthread = tracer.startScopedSpan("offthread");
      // do stuff
      offthread.finish();
    });
  } catch (RuntimeException | Error e) {
    span.error(e); // Unless you handle exceptions, you might not know the operation failed!
    throw e;
  } finally {
    span.finish(); // always finish the span
  }
}

If offthread is still a child of internal, then things look good but I have a feeling it will be a child of handler which would be bad.

@minwoox
Contributor

minwoox commented Mar 25, 2025

I think initially the intent of this is that a user can use brave to create spans while using only one context propagation mechanism, armeria's. I think this basically means that users have to use two context propagation methods always, armeria and brave.

Well, I thought brave context propagation is only used when the BraveClient retrieves the parent TraceContext so I thought it shouldn't be a problem.

final Span span = handler.handleSend(braveReq);

@jrhee17 Would you mind explaining your intention, please?

@jrhee17
Contributor Author

jrhee17 commented Mar 25, 2025

I'm a bit unsure about the user report, but shouldn't it always be an issue of user missing armeria request context propagation? Is it because they use brave propagation themselves?

I should've explained the user report in the issue, sorry about that. The report was from an internal user.

A user using armeria server added a BraveService to be executed after an async decorator.

i.e.

ServerBuilder sb = Server.builder();
sb.decorator(BraveService.newDecorator());
sb.decorator(MyAsyncDecorator);
sb.service(MyService);

As a result BraveService#serve was sometimes executed on the event loop and sometimes on a different thread. Due to this, propagating the RequestContext didn't guarantee the TraceContext was propagated as well.
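The thread-dependent behavior described above can be illustrated with a small standalone model. This is a simplified sketch of the storage rule as described in this PR, not Armeria's or brave's actual code: `FakeRequestContext`, `traceContextAttr`, and `store` are hypothetical names, and trace contexts are plain strings.

```java
public class ThreadDependentStorageSketch {
    /** Hypothetical stand-in for a RequestContext: one attribute plus its event loop thread. */
    static final class FakeRequestContext {
        final Thread eventLoopThread;
        volatile String traceContextAttr; // models the trace context kept in ctx attrs

        FakeRequestContext(Thread eventLoopThread) {
            this.eventLoopThread = eventLoopThread;
        }
    }

    // Models the thread-local used when the caller is not the context's event loop.
    static final ThreadLocal<String> THREAD_LOCAL = new ThreadLocal<>();

    /** Stores the trace context and reports which storage was chosen for the calling thread. */
    static String store(FakeRequestContext ctx, String traceContext) {
        if (Thread.currentThread() == ctx.eventLoopThread) {
            ctx.traceContextAttr = traceContext;
            return "ctx.attrs";
        }
        THREAD_LOCAL.set(traceContext);
        return "thread-local";
    }

    public static void main(String[] args) {
        // Case 1: the decorator runs on the context's event loop thread.
        FakeRequestContext onLoop = new FakeRequestContext(Thread.currentThread());
        System.out.println(store(onLoop, "trace-1")); // prints "ctx.attrs"
        System.out.println(onLoop.traceContextAttr);  // prints "trace-1"

        // Case 2: the decorator runs on a thread that is not the event loop.
        FakeRequestContext offLoop = new FakeRequestContext(new Thread(() -> { }));
        System.out.println(store(offLoop, "trace-2")); // prints "thread-local"
        System.out.println(offLoop.traceContextAttr);  // prints "null"
        THREAD_LOCAL.remove();
    }
}
```

In the second case the context carries no trace context at all, so code that only propagates the context to another thread has nothing to pick up there.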

poc ref: jrhee17@bf8a738

The motivation of this PR was to at least guarantee TraceContext is always propagated with a RequestContext if a request goes through a Brave[Client|Service] regardless of the executing thread.

Concretely, I think this is the case I am thinking of. The context bridge was developed mainly so it could work.

Thanks for the example. I assume the handler() method is executed in the event loop with the ctx pushed.

I can see that users may lose a link between the TraceContext set outside of a WebClient if an async decorator is used.
i.e.

// not created by `BraveService`, but created on `ctx.eventLoop()`
ScopedSpan span = tracer.startScopedSpan("internal");
// maybe won't work
WebClient.builder()
         .decorator(BraveClient.newDecorator())
         .decorator(MyAsyncDecorator.newDecorator())
         .build()
         .get("/");

// will work
WebClient.builder()
         .decorator(BraveClient.newDecorator())
         .decorator(MyAsyncDecorator.newDecorator())
         .build()
         .get("/");
span.finish();

Users will need to add a decorator which invokes the TraceContextUtil#setTraceContext before invoking BraveClient to avoid losing the link.

Having said this, if the above snippet was created within BraveService then ClientRequestContext.attr(TRACE_CONTEXT_KEY) will invoke ServiceRequestContext.attr(TRACE_CONTEXT_KEY).
Hence, there is no breaking change for users who are using clients within a BraveService.

// ServiceRequestContext#attr#TRACE_CONTEXT_KEY is set due to `BraveService`
ScopedSpan span = tracer.startScopedSpan("internal");
// will always work
WebClient.builder()
         .decorator(BraveClient.newDecorator())
         .decorator(MyAsyncDecorator.newDecorator())
         .build()
         .get("/");
span.finish();

Collaborator

@anuraaga anuraaga left a comment

I think in my example the "internal" span will call newScope from this PR which only sets to thread local. Doesn't it make it invisible to any other thread like the thread pool in my example? By the fallback to ServiceRequestContext you mention, indeed it will find handler, but it should find internal, would it be able to?

@jrhee17
Contributor Author

jrhee17 commented Mar 25, 2025

I think in my example the "internal" span will call newScope from this PR which only sets to thread local. Doesn't it make it invisible to any other thread like the thread pool in my example?

Right, assuming this PR is applied internal won't be visible from offthread.

I'm not suggesting the fix in this PR will perfectly handle all cases. From the users' perspective, the biggest difference will probably be that

  1. previously, starting the span on ctx.eventLoop was enough to guarantee propagation
  // should be run inside ctx.eventLoop
  ScopedSpan span = tracer.startScopedSpan("internal");
  try {
    // The span is in "scope" meaning downstream code such as loggers can see trace IDs
    RequestContext.current().makeContextAware(threadPoolExecutor).submit(() -> {
      ScopedSpan offthread = tracer.startScopedSpan("offthread");
...
  2. now, calling TraceContextUtil#setTraceContext guarantees propagation
  // thread doesn't matter
  ScopedSpan span = Tracing.currentTracer().startScopedSpan("internal");
  RequestContext ctx = RequestContext.current();
  TraceContextUtil.setTraceContext(ctx, span.context());
  try {
    // The span is in "scope" meaning downstream code such as loggers can see trace IDs
    RequestContext.current().makeContextAware(threadPoolExecutor).submit(() -> {
      ScopedSpan offthread = tracer.startScopedSpan("offthread");
...

By the fallback to ServiceRequestContext you mention, indeed it will find handler, but it should find internal, would it be able to?

I'm not sure what handler is, but as long as the ServiceRequestContext went through a BraveService (or the user set TraceContextUtil explicitly), the TraceContext will be propagated to the ClientRequestContext

Having said this, if the above snippet was created within BraveService then ClientRequestContext.attr(TRACE_CONTEXT_KEY) will invoke ServiceRequestContext.attr(TRACE_CONTEXT_KEY).
Hence, there is no breaking change for users who are using clients within a BraveService.

The below illustrates my previous comment:

        ServiceRequestContext sctx = ServiceRequestContext.builder(HttpRequest.of(HttpMethod.GET, "/")).build();

        try (SafeCloseable ignored = sctx.push()) {
            ScopedSpan span = tracer.startScopedSpan("internal");
            TraceContextUtil.setTraceContext(sctx, span.context());

            RequestContext.current().makeContextAware(eventLoop.get()).execute(() -> {
                ClientRequestContext cctx = ClientRequestContext.builder(HttpRequest.of(HttpMethod.GET, "/")).build();
                System.out.println(TraceContextUtil.traceContext(cctx));
...

@jrhee17
Contributor Author

jrhee17 commented Mar 26, 2025

Organized my thoughts:

The reported issue:

  • TraceContext isn't propagated with RequestContext depending on whether ctx.eventLoop executes BraveClient or BraveService

As anuraaga pointed out, the following case won't be supported with the changeset introduced in this PR:

  • If a TraceContext was created in ctx.eventLoop with ctx pushed, the TraceContext will be available as long as ctx is pushed

Armeria users can be affected if:

  • The user isn't using BraveService
  • The user used Tracer APIs directly to create a TraceContext in ctx.eventLoop with ctx pushed
  • The BraveClient is executed after an async call; hence losing the thread-local TraceContext

This can be worked around by:

  • Users setting TraceContextUtil#setTraceContext directly
  • Using the TraceContextPropagation#inject API

My thoughts on this PR:

Overall, I don't think this is a very critical breaking change.

I don't think users already expect that the TraceContext is automatically propagated since it is dependent on the calling thread. (as reported by #4075)
Also, using BraveService will automatically propagate the TraceContext so that's a major use-case covered.

As more users adopt blockingTaskExecutor/serviceWorkerGroup/virtual threads, I expect we will see more cases where Brave[Service|Client] isn't called by ctx.eventLoop.

I think it's fine that Armeria doesn't handle every case related to tracing automatically as long as the behavior is easy to reason about. The current changeset is simple in that: If a request goes through Brave[Service|Client], the TraceContext will be propagated as long as the ctx is pushed.

I still think this PR has value, but let me know if anyone feels differently.

@anuraaga
Collaborator

I don't think users already expect that the TraceContext is automatically propagated

The main point for me is that this is why we have RequestContextTraceContext in the first place - to have trace context automatically propagated using only armeria propagation, not brave propagation. It's not needed to connect the BraveClient span to BraveService span but is needed to connect to other brave spans and what I think is a common use case as my example (I guess it's more realworld if the internal span is called batch and under it are multiple submissions to a thread pool) probably shouldn't break due to an armeria version upgrade.

Also note that it's not necessarily a user calling brave apis, they may just be using a brave instrumented client like redis client. It was nice that armeria context propagation was enough for these all to work in most circumstances.

One other note is that due to the nature of tracing, there is almost always a "correct order" that shouldn't be changed, with brave etc almost always being the very first decorator no matter what (OTel Java agent does this). I don't know if it helps but perhaps there are less issues if there is an assumption that brave is always first?

If the new threading models in Armeria make it too difficult to maintain this class, then it's fine though since things happen and this is a very old class. IMO it's better to deprecate the class and indicate that users should use brave propagation APIs in their code than to change the existing behavior.

@jrhee17
Contributor Author

jrhee17 commented Mar 26, 2025

One other note is that due to the nature of tracing, there is almost always a "correct order" that shouldn't be changed, with brave etc almost always being the very first decorator no matter what (OTel Java agent does this). I don't know if it helps but perhaps there are less issues if there is an assumption that brave is always first?

That was my thought at first as well, but the user was using their own version of BraveRpcService which we will also support in #6084. As RpcServices are by design executed after HttpServices, it's difficult to expect users to enforce this.

The main point for me is that this is why we have RequestContextTraceContext in the first place - to have trace context automatically propagated using only armeria propagation, not brave propagation.

I agree it's useful for users to only have to propagate RequestContext without also worrying about TraceContext. I think RequestContextTraceContext has a lot of value in this sense.

It's not needed to connect the BraveClient span to BraveService span but is needed to connect to other brave spans and what I think is a common use case as my example (I guess it's more realworld if the internal span is called batch and under it are multiple submissions to a thread pool)

I see, I understood your concern.
As long as the ctx in question went through a BraveService, I do think the behavior will be identical before/after this changeset. (in the sense that pushing the ctx maintains the TraceContext link)

probably shouldn't break due to an armeria version upgrade.

I agree. Worst case scenario, trace links can be lost.

Having said this:

  • I assume most users using the RequestContextTraceContext module are already using BraveService anyways.
  • I do think the fix (which is simply adding a BraveService) is simple enough

The alternative is us maintaining/troubleshooting both versions of RequestContextTraceContext, which I personally don't prefer since we're already understaffed at the moment. Maybe others have differing opinions though.

@anuraaga
Collaborator

I assume most users using the RequestContextTraceContext module are already using BraveService anyways.

Ah I agree it's fine to expect using BraveService. But in your example you add TraceContextUtil.setTraceContext so it's not only adding BraveService right (because the internal span is in ThreadLocal, not request context)?

@jrhee17
Contributor Author

jrhee17 commented Mar 26, 2025

But in your example you add TraceContextUtil.setTraceContext so it's not only adding BraveService right (because the internal span is in ThreadLocal, not request context)?

The proposed changeset calls TraceContextUtil.setTraceContext inside BraveService. If BraveService isn't used, users can still call TraceContextUtil.setTraceContext manually though.

ref:

TraceContextUtil.setTraceContext(ctx, span.context());

@anuraaga
Collaborator

The proposed changeset calls TraceContextUtil.setTraceContext inside BraveService

I think that sets the server span to the request context, but any other span such as an internal one, or maybe some instrumented library, is in ThreadLocal and not automatically propagated when using makeContextAware right? It does allow all spans to still connect back to the server span somehow but I'm still missing what allows the new thread to have the current span context rather than the server span context, which may be an ancestor.

@jrhee17
Contributor Author

jrhee17 commented Mar 26, 2025

but any other span such as an internal one, or maybe some instrumented library, is in ThreadLocal and not automatically propagated when using makeContextAware right?

Right, good point. I guess that's a design choice I made - when using RequestContextCurrentTraceContext, the TraceContext is bound to the RequestContext.

I assume that other libraries aren't relying on ctx.push since they should be implemented agnostic to CurrentTraceContext.

As for internal traces, propagating the ctx won't carry over internal traces created directly like you said.

@anuraaga
Collaborator

As for internal traces, propagating the ctx won't carry over internal traces created directly like you said.

So to go back, this is the major change, and it seems like a big one. Note that if the goal is to have BraveClient always link back to the BraveService, this doesn't require overriding brave's default CurrentTraceContext - it could just read brave's current span and use it as the parent if available, or fall back to reading directly from RequestContext if not available. So I think RequestContextCurrentTraceContext becomes just a backup mechanism for non-armeria client spans - if brave propagation is forgotten, at least a redis client span, for example, can still link to the server span even if that's not the correct parent. It seems sort of reasonable to have that compared to nothing, but compared to the current implementation it feels like a big change in behavior - I haven't used Armeria much lately but when I did I would create internal spans around operations on arbitrary context-aware threads similar to the example relatively frequently, and things worked quite nicely. If that broke on a version update it would be pretty painful to have to add TraceContextUtil and I think I would prefer at that point to not have CurrentTraceContext overridden at all and just use brave's APIs for such propagation.

@jrhee17
Contributor Author

jrhee17 commented Mar 26, 2025

As for internal traces, propagating the ctx won't carry over internal traces created directly like you said.

A little more from my point of view.
Semantically, I do think there is an argument that binding to the service span is correct.

  // within the service span
  ScopedSpan span1 = tracer.startScopedSpan("internal");
  try {
    RequestContext.current().makeContextAware(threadPoolExecutor).submit(() -> {
      ScopedSpan span2 = tracer.startScopedSpan("offthread");
      // do stuff offthread
      span2.finish();
    });
    // do stuff internally
  } finally {
    span1.finish(); // always finish the span
  }

Since offthread is executed asynchronously, I think there is an argument for both sides as to which span is considered the parent.

On one hand, since the RequestContext bound to the service is pushed on the thread, the parent could be considered the service span.

On the other hand, if users consider internal to be the parent of offthread, they could explicitly set it (like any other brave user would). Since the user is already using brave APIs, he/she is probably used to this type of pattern.

So the user can decide which span is the parent.

  ScopedSpan span1 = tracer.startScopedSpan("internal");
  try {
    RequestContext.current().makeContextAware(threadPoolExecutor).submit(() -> {
      ScopedSpan span2 = tracing.tracer().startScopedSpanWithParent("offthread", span1.context());

As you pointed out, this PR will change the default behavior though which I guess could be a bigger breaking change than I originally thought.

If that broke on a version update it would be pretty painful to have to add TraceContextUtil and I think I would prefer at that point to not have CurrentTraceContext overridden at all and just use brave's APIs for such propagation.

I agree. I would also not encourage users to use TraceContextUtil if possible.

So to go back, this is the major change, and it seems like a big one.

I would just like to point out that even now, traces that aren't created from ctx.eventLoop aren't linked correctly.

e.g. When not run from ctx.eventLoop, span2.parent != span1

        // not run from ctx.eventLoop
        final ServiceRequestContext ctx =
                ServiceRequestContext.builder(HttpRequest.of(HttpMethod.GET, "/")).build();
        try (SafeCloseable closeable = ctx.push()) {
            ScopedSpan span1 = tracing.tracer().startScopedSpan("span1");
            ctx.eventLoop().execute(() -> {
                ScopedSpan span2 = tracing.tracer().startScopedSpan("span2");
                span2.finish();

I do understand that this may be a bigger breaking change than expected since the upgrade path may not be trivial if users wish to retain their span hierarchy.

Perhaps I could add an option to selectively enable this mode then if there are many users who will be affected by this change. Just curious, do you have any services serving armeria at the moment?

@anuraaga
Collaborator

Since offthread is executed asynchronously, I think there is an argument for both sides as to which span is considered the parent.

I think in the tracing community it's consistent knowledge that internal would be the parent in that situation, i.e., if using the OTel javaagent, it'd make sure that happens even if not making the thread context aware explicitly. Context isn't about what thread is executing something but about the flow of execution within a call graph, the caller of some code is its parent.

I would just like to point out that even now, traces that aren't created from ctx.eventLoop aren't linked correctly.

Yeah agree that even with the current implementation there are corner cases, eventLoop is being used as an approximation for onRequestThread and the approximation may have gotten less precise over time (maybe it could be made more precise to support async decorators? dunno). So understand the motivation to try to clean it up.

Perhaps I could add an option to selectively enable this mode then if there are many users who will be affected by this change. Just curious, do you have any services serving armeria at the moment?

While I currently don't maintain any, ones I've created before I think still are maintained, though don't know if traces are viewed much. Note, I assume I was brought in to check versus the intent of the class as one of the original authors, and the new implementation doesn't preserve the intent which is for it to be possible to write Armeria servers with fully-correct (hierarchy-respecting) tracing without propagating of TraceContext, only RequestContext. That's probably enough said on the topic, of course if you still want to make the change you can ;)

@jrhee17
Contributor Author

jrhee17 commented Mar 26, 2025

Note, I assume I was brought in to check versus the intent of the class as one of the original authors, and the new implementation doesn't preserve the intent which is for it to be possible to write Armeria servers with fully-correct (hierarchy-respecting) tracing without propagating of TraceContext, only RequestContext.

This was a lot of help (sorry for taking so much of your time by the way!). I think I was able to learn a lot from your input.

I think in the tracing community it's consistent knowledge that internal would be the parent in that situation, i.e., if using the OTel javaagent, it'd make sure that happens even if not making the thread context aware explicitly.

I see, I wasn't aware of that. Thanks for the explanation.

Context isn't about what thread is executing something but about the flow of execution within a call graph, the caller of some code is its parent.

This makes me think adding a hook to AbstractContextAwareExecutor so that users can propagate additional contexts (e.g. TraceContext) may make more sense.

The previous behavior of propagation for the same ctx.eventLoop can be handled as well with this approach. (so hopefully no breaking change)
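To make the idea concrete, here is a rough standalone sketch of an executor wrapper that captures an extra thread-local value on the submitting thread and mounts it on the worker thread. It is an assumption-laden illustration only: `CapturingExecutor`, `CURRENT_SPAN`, and `parentSeenOffThread` are hypothetical names, spans are plain strings, and nothing here reflects the actual shape of a hook on AbstractContextAwareExecutor.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class CapturingExecutorSketch {
    // Stand-in for the tracer's "current span" thread-local.
    static final ThreadLocal<String> CURRENT_SPAN = new ThreadLocal<>();

    /** Hypothetical wrapper: captures the caller's span at submit time, mounts it for the task. */
    static final class CapturingExecutor implements Executor {
        private final Executor delegate;

        CapturingExecutor(Executor delegate) {
            this.delegate = delegate;
        }

        @Override
        public void execute(Runnable task) {
            final String captured = CURRENT_SPAN.get(); // read on the submitting thread
            delegate.execute(() -> {
                final String previous = CURRENT_SPAN.get();
                CURRENT_SPAN.set(captured);             // mount on the worker thread
                try {
                    task.run();
                } finally {
                    CURRENT_SPAN.set(previous);         // unmount, restoring the old value
                }
            });
        }
    }

    /** Returns the span that a task submitted under callerSpan observes on the worker thread. */
    static String parentSeenOffThread(String callerSpan) {
        final ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            CURRENT_SPAN.set(callerSpan);
            final CompletableFuture<String> seen = new CompletableFuture<>();
            new CapturingExecutor(pool).execute(() -> seen.complete(CURRENT_SPAN.get()));
            return seen.join();
        } finally {
            CURRENT_SPAN.remove();
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(parentSeenOffThread("internal")); // prints "internal"
    }
}
```

With this shape the task sees the span that was current where it was submitted, which matches the "caller is the parent" expectation, and the cost is paid on every submission rather than only when the trace context is read.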

@anuraaga
Collaborator

This makes me think adding a hook to AbstractContextAwareExecutor so that users can propagate additional contexts (e.g. TraceContext) may make more sense.

Thanks, this reminds me of some history now. Actually RequestContextCurrentTraceContext is mostly an optimization - before it we propagated brave context with enter/exit hooks on request context. We found it cool to be able to reduce overhead on context switches by instead overriding CurrentTraceContext, only incurring when accessing trace context - for something like reactor, context switches can be so much more common than span creation that it does add up.

But we did find some threading corner cases and now it seems there are more - so I guess it might make sense to replace it back to just hooks that mount and unmount brave's current trace context. Then there should be no change to existing users and other threads I guess will behave the same. We can chalk up any added overhead to server CPUs being much faster than 5 years ago and reactor users are probably using micrometer anyways ;)

@jrhee17
Contributor Author

jrhee17 commented Mar 27, 2025

Thanks for the input @anuraaga - this was a lot of help 👍
Let me think of a way to preserve the span hierarchy you mentioned and update this PR 🙇

@jrhee17 jrhee17 marked this pull request as draft March 27, 2025 06:21
5 participants