
ref(spans): Add spans buffer v2 #85856

Merged
merged 65 commits into master from spans-buffer-consumer-v2 on Mar 14, 2025
Conversation

untitaker (Member) commented Feb 25, 2025

The current process-spans consumer assumes that each span has a segment
ID. In the new world we need to construct segments and correlate spans
purely based on their parent-child relationship + timeouts.

Build a new redis-based spans buffer, patch it into the existing consumer behind a
CLI flag.

See https://github.com/getsentry/streaming-planning/issues/18

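The segment-construction idea described above (correlate spans purely by parent-child relationship, flush on timeout) can be sketched in plain Python. This is a hypothetical in-memory model for illustration only -- the actual buffer keeps this state in Redis, and all names here are made up:

```python
import json
import time
from collections import defaultdict

# In-memory sketch of segment assembly: spans are grouped under their
# top-most known ancestor, and a segment is flushed once no new span
# for it has arrived within `timeout` seconds.
class SegmentBuffer:
    def __init__(self, timeout=60.0):
        self.timeout = timeout
        self.segments = defaultdict(list)   # root span_id -> list of payloads
        self.last_seen = {}                 # root span_id -> last arrival time
        self.redirects = {}                 # child span_id -> ancestor span_id

    def _resolve_root(self, span_id):
        # Follow redirect keys until we hit the top-most known parent.
        while span_id in self.redirects:
            span_id = self.redirects[span_id]
        return span_id

    def add(self, payload, now=None):
        span = json.loads(payload)
        now = time.monotonic() if now is None else now
        parent = span.get("parent_span_id")
        root = self._resolve_root(parent) if parent else span["span_id"]
        if parent:
            self.redirects[span["span_id"]] = root
        self.segments[root].append(payload)
        self.last_seen[root] = now

    def flush(self, now=None):
        # Return segments whose idle timeout has expired.
        now = time.monotonic() if now is None else now
        done = [k for k, t in self.last_seen.items() if now - t >= self.timeout]
        out = {}
        for k in done:
            out[k] = self.segments.pop(k)
            del self.last_seen[k]
        return out
```

Note the absence of any segment ID: grouping relies only on `parent_span_id` plus the idle timeout, which is the core contrast with the old consumer.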
@untitaker untitaker requested review from a team as code owners February 25, 2025 15:52
@github-actions github-actions bot added the Scope: Backend Automatically applied to PRs that change backend components label Feb 25, 2025
codecov bot commented Feb 25, 2025

Codecov Report

Attention: Patch coverage is 98.88579% with 4 lines in your changes missing coverage. Please review.

✅ All tests successful. No failed tests found.

Files with missing lines                      | Patch % | Lines
src/sentry/spans/consumers/process/flusher.py | 92.85%  | 4 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           master   #85856       +/-   ##
===========================================
+ Coverage   33.16%   87.74%   +54.57%     
===========================================
  Files        8314     9834     +1520     
  Lines      463026   556594    +93568     
  Branches    21939    21939               
===========================================
+ Hits       153582   488373   +334791     
+ Misses     309013    67790   -241223     
  Partials      431      431               

mcannizz (Member) left a comment

Here's some high level feedback; mostly questions I thought of as I read the code.

@untitaker untitaker marked this pull request as draft February 25, 2025 19:47
untitaker (Member, Author) commented
hey @mcannizz, this wasn't supposed to be reviewable yet; the next step for us is to test it in the sandbox, but there's a lot of stuff to clean up in the code. I'll partially address your comments though.

evanh (Member) left a comment

Some other questions:

This code doesn't seem to deal with the fact that a parent might arrive after a child. In that case the child won't be consolidated into the parent set in Redis. Is that something that will be addressed later/not at all?

@untitaker untitaker removed request for a team February 27, 2025 16:46
untitaker (Member, Author) commented
> This code doesn't seem to deal with the fact that a parent might arrive after a child. In that case the child won't be consolidated into the parent set in Redis. Is that something that will be addressed later/not at all?

You're right -- there are a few other edge cases that are not properly handled. Right now we're mostly concerned with performance, and I don't think those cases will change our perf profile.
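For reference, the parent-after-child case can be handled with redirect keys: if children arrive first, they get bucketed under their (not-yet-seen) parent's ID, and when the parent finally arrives, its orphan bucket is folded into the real root. A hypothetical in-memory sketch (the real buffer would do this in Redis):

```python
# Sketch of late-parent consolidation. `buckets` maps the top-most known
# span ID to the span IDs collected under it; `redirects` maps a span to
# its known ancestor so lookups can chase up to the current root.
def add_span(buckets, redirects, span):
    def resolve(sid):
        while sid in redirects:
            sid = redirects[sid]
        return sid

    parent = span.get("parent_span_id")
    root = resolve(parent) if parent else span["span_id"]
    buckets.setdefault(root, []).append(span["span_id"])
    if parent:
        redirects[span["span_id"]] = root
    # Consolidation step: children that arrived before this span were
    # bucketed under its own ID; fold that orphan bucket into the root.
    orphan_key = span["span_id"]
    if orphan_key != root and orphan_key in buckets:
        buckets[root].extend(buckets.pop(orphan_key))
        redirects[orphan_key] = root
```

This only illustrates the bookkeeping; doing it atomically under concurrent consumers is the hard part the Lua script would have to solve.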

parent_span_id=val.get("parent_span_id"),
project_id=val["project_id"],
payload=payload.value,
# TODO: validate, this logic may not be complete.
Member

Indeed; however, we can follow up with a dedicated PR for this.

  • We shouldn't use span.op or access sentry_tags here; instead, check the span's kind (Relay PR).
  • We should call out that, after flushing, the buffer (or the flusher?) checks whether the parent span is in a different project; otherwise it is treated the same as is_remote being True.

untitaker (Member, Author)

Ahh, I couldn't find any code in relay master that showed how to use those attributes -- didn't know this was still WIP.


payload_nums.append(len(payloads))
for payload in payloads:
span_id = rapidjson.loads(payload)["span_id"]
Member

Would it be possible to avoid this load by getting the span_id from somewhere else?

untitaker (Member, Author)

We don't store individual span IDs in Redis, only payloads, so when we flush we eventually have to get the ID out of the payload one way or another.
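If the full deserialization ever shows up in profiles, one hedged alternative is a targeted scan of the raw bytes with a fallback to a full parse. This assumes `span_id` is a top-level hex-string field; a nested object containing its own `"span_id"` key earlier in the payload would defeat the shortcut, so treat this as a sketch, not a drop-in:

```python
import json
import re

# Pull span_id out of the raw payload without deserializing the whole
# document. Falls back to a full parse if the pattern doesn't match.
_SPAN_ID_RE = re.compile(rb'"span_id"\s*:\s*"([0-9a-fA-F]+)"')

def extract_span_id(payload: bytes) -> str:
    m = _SPAN_ID_RE.search(payload)
    if m:
        return m.group(1).decode("ascii")
    return json.loads(payload)["span_id"]
```

Whether the regex actually beats rapidjson on typical span payloads would need to be measured before adopting it.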

# redis-cluster-py sentrywide. this probably leaves a bit of
# perf on the table as we send the full lua sourcecode with every span.
p.eval(
add_buffer_script.script,
Member

We are highly likely to find spans of the same trace in a batch. This means we could pre-process them in memory and minimize the numbers of mutations in Redis. For example, we should build partial trees in memory, then insert redirect keys for the top-most parent known at the current time.
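The pre-processing idea suggested here -- collapse a consumer batch into per-trace groups in memory so each trace costs one Redis round trip rather than one per span -- might look like this (span shapes are hypothetical):

```python
from collections import defaultdict

# Group a consumer batch by trace_id before touching Redis, so writes
# can be issued once per trace instead of once per span.
def group_batch(spans):
    by_trace = defaultdict(list)
    for span in spans:
        by_trace[span["trace_id"]].append(span)
    return dict(by_trace)
```

Building the partial trees and redirect keys per group would then happen on top of this grouping; the sketch only covers the batching step itself.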

untitaker (Member, Author)

I generally don't believe that reducing the amount of work done in Redis will pay off if we pay the cost of the same CPU-bound work executed in Python instead. So while Redis is a scaling bottleneck, I do believe it's more cost-efficient to run things like tree construction within Redis, and scaling out Redis to the degree we need is actually not a problem.

Of course I say this without having tested it rigorously, but I don't think we should spend time on that. If this consumer were Rust, I would've been all-in on the local buffer approach.
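On the "full Lua source with every span" point from the snippet above: redis-py's `register_script` helper addresses exactly this by loading the script once and invoking it by its SHA-1 thereafter (EVALSHA, with an automatic fallback if the script cache was flushed). A server-free sketch of the caching principle -- `ScriptCache` is a stand-in, not a real redis-py class:

```python
import hashlib

# Minimal stand-in for the SCRIPT LOAD / EVALSHA pattern: the full source
# is transferred once; subsequent calls send only the 40-byte SHA-1 digest.
class ScriptCache:
    def __init__(self):
        self.loaded = {}          # sha -> source
        self.bytes_sent = 0

    def eval(self, source):
        sha = hashlib.sha1(source.encode()).hexdigest()
        if sha not in self.loaded:
            self.loaded[sha] = source    # SCRIPT LOAD: ship source once
            self.bytes_sent += len(source)
        else:
            self.bytes_sent += len(sha)  # EVALSHA: ship only the digest
        return sha
```

With a real client this would roughly be `add_buffer = client.register_script(add_buffer_script.script)` followed by `add_buffer(keys=..., args=...)` per batch; whether that works through Sentry's redis-cluster wrapper is the open question the original comment alludes to.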

@untitaker untitaker merged commit 1a8e073 into master Mar 14, 2025
49 checks passed
@untitaker untitaker deleted the spans-buffer-consumer-v2 branch March 14, 2025 15:53
Labels
Scope: Backend Automatically applied to PRs that change backend components