-
Notifications
You must be signed in to change notification settings - Fork 939
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve context propagation for brave integration #6139
base: main
Are you sure you want to change the base?
Conversation
@codefromthecrypt Would you mind reviewing this PR? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In brave somewhere we have a strict trace scope decorator which makes sure there is no leak of trace context which is a safeguard in case assumptions are incorrect.
So I would say if you want to be very safe? Do something like this as an after test hook and if all tests pass all good. If a scope issue happens later you will already have the infra to prove you can fix the bug.
My 2p
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this approach will definitely work and I like it. Thanks!
@anuraaga Would you mind taking a look? 😆 |
IIUC this is the main change - I think initially the intent of this is that a user can use brave to create spans while using only one context propagation mechanism, armeria's. I think this basically means that users have to use two context propagation methods always, armeria and brave. It seems like quite a breaking change but if it's ok / intended, I don't know if we need this adapter at all since it means brave usage isn't safe without using brave propagation directly. I'm a bit unsure about the user report, but shouldn't it always be an issue of user missing armeria request context propagation? Is it because they use brave propagation themselves? Not sure if I fully understood so let me know if I missed something. |
Concretely, I think this is the case I am thinking. The context bridge was developed mainly so it could work. handler() {
ScopedSpan span = tracer.startScopedSpan("internal");
try {
// The span is in "scope" meaning downstream code such as loggers can see trace IDs
RequestContext.current().makeContextAware(threadPoolExecutor).submit(() -> {
ScopedSpan span = tracer.startScopedSpan("offthread");
// do stuff
span.finis();
});
} catch (RuntimeException | Error e) {
span.error(e); // Unless you handle exceptions, you might not know the operation failed!
throw e;
} finally {
span.finish(); // always finish the span
}
} If offthread is still a child of internal, then things look good but I have a feeling it will be a child of |
Well, I thought brave context propagation is only used when the armeria/brave/brave6/src/main/java/com/linecorp/armeria/client/brave/BraveClient.java Line 122 in 5eafb19
@jrhee17 Would you mind explaining your intention, please? |
I should've explained the user report in the issue, sorry about that. The report was from an internal user. A user using armeria server added a i.e. ServerBuilder sb = Server.builder();
sb.decorator(BraveService.newDecorator());
sb.decorator(MyAsyncDecorator);
sb.service(MyService) As a result poc ref: jrhee17@bf8a738 The motvation of this PR was to at least guarantee
Thanks for the example. I assume the I can see that users may lose a link between the // not created by `BraveService`, but created on `ctx.eventLoop()`
ScopedSpan span = tracer.startScopedSpan("internal");
// maybe won't work
WebClient.builder()
.decorator(BraveClient.newDecorator())
.decorator(MyAsyncDecorator.newDecorator())
.build()
.get("/");
// will work
WebClient.builder()
.decorator(BraveClient.newDecorator())
.decorator(MyAsyncDecorator.newDecorator())
.build()
.get("/");
span.finish(); Users will need to add a decorator which invokes the Having said this, if the above snippet was created within // ServiceRequestContext#attr#TRACE_CONTEXT_KEY is set due to `BraveService`
ScopedSpan span = tracer.startScopedSpan("internal");
// will always work
WebClient.builder()
.decorator(BraveClient.newDecorator())
.decorator(MyAsyncDecorator.newDecorator())
.build()
.get("/");
span.finish(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think in my example the "internal" span will call newScope
from this PR which only sets to thread local. Doesn't it make it invisible to any other thread like the thread pool in my example? By the fallback to ServiceRequestContext you mention, indeed it will find handler, but it should find internal, would it be able to?
Right, assuming this PR is applied I'm not suggesting the fix in this PR will perfectly handle all cases. From the perspective of users', the biggest difference will probably be that
// should be run inside ctx.eventLoop
ScopedSpan span = tracer.startScopedSpan("internal");
try {
// The span is in "scope" meaning downstream code such as loggers can see trace IDs
RequestContext.current().makeContextAware(threadPoolExecutor).submit(() -> {
ScopedSpan span = tracer.startScopedSpan("offthread");
...
// thread doesn't matter
ScopedSpan span = Tracing.currentTracer().startScopedSpan("internal");
RequestContext ctx;
TraceContextUtil.setTraceContext(ctx, span.context());;
try {
// The span is in "scope" meaning downstream code such as loggers can see trace IDs
RequestContext.current().makeContextAware(threadPoolExecutor).submit(() -> {
ScopedSpan span = tracer.startScopedSpan("offthread");
...
I'm not sure what handler is, but as long as the
The below illustrates my previous comment: ServiceRequestContext sctx = ServiceRequestContext.builder(HttpRequest.of(HttpMethod.GET, "/")).build();
try (SafeCloseable ignored = sctx.push()) {
ScopedSpan span = tracer.startScopedSpan("internal");
TraceContextUtil.setTraceContext(sctx, span.context());
RequestContext.current().makeContextAware(eventLoop.get()).execute(() -> {
ClientRequestContext cctx = ClientRequestContext.builder(HttpRequest.of(HttpMethod.GET, "/")).build();
System.out.println(TraceContextUtil.traceContext(cctx));
... |
Organized my thoughts: The reported issue:
As anuraaga pointed out, the following case won't be supported with the changeset introduced in this PR:
Armeria users can be affected if:
This can be worked around by:
My thoughts on this PR: Overall, I don't think this is a very critical breaking change. I don't think users already expect that the I do think as more users use I think it's fine that Armeria doesn't handle every case related to tracing automatically as long as the behavior is easy to reason about. The current changeset is simple in that: If a request goes through I still think this PR has value, but let me know if anyone feels differently. |
The main point for me is that this is why we have Also note that it's not necessarily a user calling brave apis, they may just be using a brave instrumented client like redis client. It was nice that armeria context propagation was enough for these all to work in most circumstances. One other note is that due to the nature of tracing, there is almost always a "correct order" that shouldn't be changed, with brave etc almost always being the very first decorator no matter what (OTel Java agent does this). I don't know if it helps but perhaps there are less issues if there is an assumption that brave is always first? If the new threading models in Armeria make it too difficult to maintain this class, then it's fine though since things happen and this is a very old class. IMO it's better to deprecate the class then and indicate users should use brave propagation apis in their code then to change the existing behavior. |
That was my thought at first as well, but the user was using their own version of
I agree it's useful for users to only have to propagate
I see, I understood your concern.
I agree. Worst case scenario, trace links can be lost. Having said this:
The alternative is us maintaining/troubleshooting both versions of |
Ah I agree it's fine to expect using |
The proposed changeset calls ref: armeria/brave/brave6/src/main/java/com/linecorp/armeria/server/brave/BraveService.java Line 103 in f2f0d90
|
I think that sets the server span to the request context, but any other span such as an internal one, or maybe some instrumented library, is in |
Right, good point. I guess that's a design choice I made - when using I assume that other libraries aren't relying on As for internal traces, propagating the |
So to go back, this is the major change, and it seems like a big one. Note that if the goal is to have |
A little more from my point of view. // within the service span
ScopedSpan span1 = tracer.startScopedSpan("internal");
try {
RequestContext.current().makeContextAware(threadPoolExecutor).submit(() -> {
ScopedSpan span2 = tracer.startScopedSpan("offthread");
// do stuff offthread
span2.finish();
}
// do stuff internally
span1.finish();
} Since On one hand, since the On the other hand, if users consider So the user can decide which span is the parent. ScopedSpan span1 = tracer.startScopedSpan("internal");
try {
RequestContext.current().makeContextAware(threadPoolExecutor).submit(() -> {
ScopedSpan span2 = tracing.tracer().startScopedSpanWithParent("offthread", span1.context()); As you pointed out, this PR will change the default behavior though which I guess could be a bigger breaking change than I originally thought.
I agree. I would also not encourage users to use
I would just like to point out that even now, traces that aren't created from e.g. When not run from // not run from ctx.eventLoop
final ServiceRequestContext ctx =
ServiceRequestContext.builder(HttpRequest.of(HttpMethod.GET, "/")).build();
try (SafeCloseable closeable = ctx.push()) {
ScopedSpan span1 = tracing.tracer().startScopedSpan("span1");
ctx.eventLoop().execute(() -> {
ScopedSpan span2 = tracing.tracer().startScopedSpan("span2");
span2.finish(); I do understand that this may be a bigger breaking change than expected since the upgrade path may not be trivial if users wish to retain their span hierarchy. Perhaps I could add an option to selectively this enable this mode then if there are many users who will be affected by this change. Just curious, do you have any services serving |
I think in the tracing community it's consistent knowledge that
Yeah agree that even with the current implementation there are corner cases,
While I currently don't maintain any, ones I've created before I think still are maintained, though don't know if traces are viewed much. Note, I assume I was brought in to check versus the intent of the class as one of the original authors, and the new implementation doesn't preserve the intent which is for it to be possible to write Armeria servers with fully-correct (hierarchy-respecting) tracing without propagating of |
This was a lot of help (sorry for taking so much of your time by the way!). I think I was able to learn a lot from your input.
I see, I wasn't aware of that. Thanks for the explanation.
This makes me think adding a hook to AbstractContextAwareExecutor so that users can propagate additional contexts (e.g. The previous behavior of propagation for the same |
Thanks, this reminds me of some history now. Actually But we did find some threading corner cases and now it seems there are more - so I guess it might make sense to replace it back to just hooks that mount and unmount brave'd current trace context. Then there should be no change to existing users and other threads I guess will behave the same. We can chock up any added overhead to server CPUs being much faster than 5 years ago and reactor users are probably using micrometer anyways ;) |
Thanks for the input @anuraaga - this was a lot of help 👍 |
Motivation:
Recently we received a report where brave's
TraceContext
behavior is different depending on the thread which reaches aBraveService
. This is mainly due to howRequestContextCurrentTraceContext
is implemented; theTraceContext
is stored either in aThreadLocal
orRequestContext
depending on the invoking thread.e.g.
Previously, the
eventLoop
check was added to avoid concurrent modification of theTraceContext
stored inRequestContext#attrs
.armeria/brave/brave6/src/main/java/com/linecorp/armeria/common/brave/RequestContextCurrentTraceContext.java
Line 165 in 0960d09
However, given the introduction of
serviceWorker
s orblockingTaskExecutor
s I think this logic makes less sense. By design 1) the lifecycle ofRequestContext
andTraceContext
is different 2)RequestContext
does not have the facility to propagate multiple thread-local contexts together.Given the current limitations, I propose that we simplify the logic so that the reasoning/expectation is more straightforward.
RequestContext
goes through aBraveService
orBraveClient
, the correspondingTraceContext
is bound to theRequestContext
TraceContext
viabrave
APIs, the user-specifiedTraceContext
always has priority.TraceContext
, theTraceContext
bound to theRequestContext
will always have priorityI believe this covers most cases since:
TraceContext
assigned for a single request (viaBraveClient
orBraveService
). This will be the default behavior if users don't usebrave
APIs directly which I think is reasonable.brave
APIs directly, it is reasonable to assume that the lifecycle ofRequestContext
differs from the customTraceContext
.Modifications:
RequestContextCurrentTraceContext
so that:TraceContext
set byBrave[Client|Service]
is usedRequestContextCurrentTraceContext
is pushed/popped from the designated event loop. Added deprecated annotations forRequestContextCurrentTraceContextBuilder#nonRequestThread
Result:
Brave[Client|Service]
is more straightforward.