-
Notifications
You must be signed in to change notification settings - Fork 31.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
tracing: prevent race conditions during shutdown #21335
Conversation
/cc @nodejs/trace-events |
It would be good to add the failing test case as well. |
@@ -66,16 +70,15 @@ void Agent::Start() { | |||
started_ = true; | |||
} | |||
|
|||
Agent::ClientHandle Agent::AddClient(const std::set<std::string>& categories, | |||
std::unique_ptr<AsyncTraceWriter> writer) { | |||
std::unique_ptr<ClientHandle> Agent::AddClient( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
~~nit: need two spaces at the start here. looks like it's only one?~~~
Sigh... nevermind... misread the diff lol...sigh.
main change looks good, test case would make it better. |
} | ||
|
||
void AgentHandle::AppendTraceEvent(TraceObject* trace_event) { | ||
Mutex::ScopedLock scoped_lock(mutex_); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we benchmark the impact of this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left some comments. It's not entirely clear to me how this fixes #21383.
src/tracing/agent.cc
Outdated
void AgentHandle::Disconnect(int client) { | ||
// This is supposed to be only callable from a main thread, same as Reset | ||
// This code needs to be updated to enforce that. | ||
if (agent_ != nullptr) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this method take out the lock first?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I cannot add lock without significant rework (which would delay this PR) because node::Mutex is not reentrant... Currently this is only called on the isolate thread (i.e. it will be be correct for webworkers) so it is correct without the lock... But it is not obvious
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, can you update the comment and maybe make a bit more imperative? "Is supposed to" isn't strong enough.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
src/tracing/agent.h
Outdated
static ClientHandle EmptyClientHandle() { | ||
return ClientHandle(nullptr, DisconnectClient); | ||
private: | ||
node::Mutex mutex_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not necessary to qualify, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
src/tracing/agent.h
Outdated
ClientHandle(const ClientHandle&) = delete; | ||
ClientHandle& operator=(const ClientHandle&) = delete; | ||
~ClientHandle(); | ||
int id() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
int id() const
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
src/tracing/agent.h
Outdated
|
||
private: | ||
std::shared_ptr<AgentHandle> agent_; | ||
int id_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And maybe make this const
as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
src/tracing/agent.cc
Outdated
void AgentHandle::Disconnect(int client) { | ||
// This is supposed to be only callable from a main thread, same as Reset | ||
// This code needs to be updated to enforce that. | ||
if (agent_ != nullptr) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, can you update the comment and maybe make a bit more imperative? "Is supposed to" isn't strong enough.
src/tracing/agent.h
Outdated
using ClientHandle = std::unique_ptr<std::pair<Agent*, int>, | ||
void (*)(std::pair<Agent*, int>*)>; | ||
explicit AgentHandle(Agent*); | ||
void Reset(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe call Reset() from the destructor so you don't have to call it explicitly in Agent::~Agent()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here, AgentHandle is supposed to be a weak pointer - it allows accessing the tracing agent but does not prevent its destruction. I had a naive implementation with shared_ptr/weak_ptr but the issue there is that it may cause Agent to be deleted on a wrong thread.
src/tracing/agent.h
Outdated
// Resetting the pointer disconnects client | ||
using ClientHandle = std::unique_ptr<std::pair<Agent*, int>, | ||
void (*)(std::pair<Agent*, int>*)>; | ||
explicit AgentHandle(Agent*); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agent* agent
and consider changing the data member to Agent* const agent_;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See comment above.
src/tracing/agent.h
Outdated
ClientHandle& operator=(const ClientHandle&) = delete; | ||
~ClientHandle(); | ||
int id() const { | ||
return id_; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The style guide allows doing this on one line.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
@matthiaskrgr Would you be able to test this patch against your build with AddressSanitizer and see if it fixes the issue? Thanks! |
src/tracing/agent.h
Outdated
|
||
// These 3 methods operate on a "default" client, e.g. the file writer | ||
void Enable(const std::string& categories); | ||
void Enable(const std::set<std::string>& categories); | ||
void Disable(const std::set<std::string>& categories); | ||
std::string GetEnabledCategories(); | ||
|
||
void AppendTraceEvent(TraceObject* trace_event); | ||
// void AppendTraceEvent(TraceObject* trace_event); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be removed rather than commented out?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I pushed a wrong version - this method is still in use...
PR-URL: #21335 Reviewed-By: James M Snell <[email protected]> Reviewed-By: Ben Noordhuis <[email protected]>
Test failure needs to be debugged on Windows. |
Test failure on windows is actually a crash:
3221225477 is 0xC0000005, an access violation on windows. |
I have been unable to reproduce the crash on my windows machine so far. |
New CI: https://ci.nodejs.org/job/node-test-pull-request/15642/. All green. |
Second CI is green on windows as well: https://ci.nodejs.org/job/node-test-pull-request/15656/ and I cannot reproduce the crashes on windows. |
Launched a stress of the single test on windows: https://ci.nodejs.org/job/node-stress-single-test/1929/ |
The stress test definitely shows crashes. @eugeneo |
This will be reimplemented from scratch, current design cannot work. |
Checklist
make -j4 test
(UNIX), orvcbuild test
(Windows) passes