Protect against corrupting interrupts #67
Conversation
@mperham looks like Travis isn't set up for this repo :(

That's nuts. Timeouts bypass ensures???

Yup, "nuts" is an understatement.

Btw, I never recommend using Timeout.timeout in a multithreaded system; quirks like this are exactly why. Use socket timeouts only.
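For anyone reading along, here is a rough sketch of what "socket timeouts only" can look like in practice; the host, port, and timeout values are made up for illustration, and it assumes a Ruby new enough to support `Socket.tcp`'s `connect_timeout:` option:

```ruby
require 'socket'

# Bound the connect at the socket level instead of wrapping it in Timeout.timeout.
socket = Socket.tcp("example.com", 80, connect_timeout: 5)

socket.write("GET / HTTP/1.0\r\nHost: example.com\r\n\r\n")

# Bound the read by waiting for readability; the timeout stays local to this
# I/O call instead of raising asynchronously into unrelated code.
if IO.select([socket], nil, nil, 5)
  puts socket.readpartial(4096)
else
  socket.close
end
```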
That's sound advice. On another note, this fix is insufficient because the connection being checked back in can still be corrupted -- another commit is on the way.

@mperham PTAL

This takes a different approach now, and simply discards any connection which is deemed unsafe to reuse.
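For readers following along, a rough sketch of that idea (simplified, not the actual patch; the method and variable names are illustrative):

```ruby
def with(options = {})
  conn = checkout(options)
  completed = false
  begin
    result = yield conn
    completed = true
    result
  ensure
    if completed
      checkin # block finished normally, safe to reuse the connection
    else
      # the block raised or was interrupted mid-use; the connection may be in
      # an unknown protocol state, so discard it and let the pool build a new one
      @available.discard!(conn)
    end
  end
end
```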
Ugh, I hate this type of issue. Near impossible to test or understand. Is the best I can do to cut a new release and wait for issues?
Here's a script that reliably reproduces the issue against the current code: https://gist.github.com/ggilder/b531bdceea44a449c840
@mperham It's not too bad to test, but it does make the test suite more complicated. We do it through acceptance tests against an HTTP server, but you can do the same thing with a basic TCPServer that sleeps and echoes. Quick drycode example (assumes `socket` and `connection_pool` are loaded and that this runs inside an RSpec example):

```ruby
server_thread = Thread.new do
  server = TCPServer.new("127.0.0.1", 10_000)
  loop do
    client = server.accept
    Thread.new do
      while (line = client.gets)
        sleep 4            # simulate a slow server
        client.write(line) # echo the request back
      end
    end
  end
end
sleep 0.1 # give the server a moment to start listening

pool = ConnectionPool.new { TCPSocket.new("127.0.0.1", 10_000) }

thread = Thread.new do
  pool.with do |conn|
    conn.puts("1")
    conn.gets # blocks on the slow echo; the thread is killed mid-read
  end
end
thread.join(1)
thread.kill

pool.with do |conn|
  conn.puts("2")
  expect(conn.gets).to eq("2\n")
end
```

If it's handled properly, you'll get back a 2 rather than a 1.
Protect against corrupting interrupts
Thanks @mperham! Could you set up Travis for this repo? It seems that some of the tests have rotted under JRuby...

Who wants the commit bit for this? You guys can set it up yourselves. @tamird?

Yup, sounds good.

Done, please stick with PRs rather than committing directly to master.

You got it. Thanks for the fast triage here!

@mperham turns out I can't set up Travis because I need administrative access:

Oy vey, activated. BTW I'll cut a new release on Monday unless y'all find something wrong.
I'd appreciate it if someone can check my work since I'm super jet lagged, but here is a script that will make the connection pool fail to correctly check the connection back in.

```ruby
require 'connection_pool'
require 'thread'
require 'monitor'

# Add a latch class so we can manipulate thread order
class Latch
  def initialize(count = 1)
    @count = count
    @lock = Monitor.new
    @cv = @lock.new_cond
  end

  def release
    @lock.synchronize do
      @count -= 1 if @count > 0
      @cv.broadcast if @count.zero?
    end
  end

  def await
    @lock.synchronize do
      @cv.wait_while { @count > 0 }
    end
  end
end

memcached = Class.new(ConnectionPool) {
  def checkin
    # add a puts so we can give the scheduler a chance to switch
    puts __method__
    super
  end

  def pop_connection
    # add a puts so we can give the scheduler a chance to switch
    puts __method__
    super
  end
}.new(size: 5, timeout: 5) { Object.new }

Thread.abort_on_exception = true
queue = SizedQueue.new 2

2.times.map {
  Thread.new {
    loop do
      job_start_latch = Latch.new
      job_finish_latch = Latch.new
      worker = Thread.new do
        # This latch is to ensure the thread is up and running before we have
        # a different thread attempt to kill it.
        job_start_latch.await
        memcached.with do |connection|
          # This latch is to notify the "killer" thread that the work is done,
          # and we're ready to be killed
          job_finish_latch.release
        end
      end
      queue << [job_start_latch, job_finish_latch, worker]
    end
  }
}

2.times.map {
  Thread.new {
    loop do
      start, finish, thread = *queue.pop
      start.release # let the target thread know that we're ready to kill it
      finish.await  # wait on the target thread to tell us it's time
      thread.kill   # kill the target thread
    end
  }
}.each(&:join)
```

You may notice the weird subclass. I added a `puts` to `checkin` and `pop_connection` so we can give the scheduler a chance to switch threads. If you run this against master, or against #70, you'll see a timeout error on checkout.
@tenderlove @tamird's work certainly helps a lot. I've gotten half a dozen reports from Sidekiq users about really weird Redis connection behavior (here's the latest) over the last 6 months. I suspect this fix will solve their problems in all but the rarest of edge cases, i.e. maybe it gets us from 99% to 99.99% reliability. I do want to get to 100%, but this is still forward progress. If anyone has ideas on what the right long-term solution is, please chime in on #70. I want to avoid splitting the conversation across two issues.
I've had several Sidekiq users complain that this change is leading to connection exhaustion. I believe the issue might be due to

As a suggestion, perhaps the shutdown block could be used to close connections when they're discarded?

@zanker The shutdown block cannot be passed right now. It's only passed when shutting down the entire pool.

@mperham Ah! Right, sorry, forgot that's only set on shutdown, not all the time.
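For context on the shutdown block being discussed, a small sketch (using the redis gem purely as an example connection):

```ruby
require 'connection_pool'
require 'redis'

pool = ConnectionPool.new(size: 5, timeout: 5) { Redis.new }

pool.with { |redis| redis.ping }

# The block given to #shutdown runs for pooled connections when the whole pool
# is shut down; it is not invoked when a single connection is discarded after
# an interrupted block, which is the distinction being made above.
pool.shutdown { |conn| conn.disconnect! }
```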
```ruby
  # everything is roses, we can safely check the connection back in
  checkin
else
  @available.discard!(pop_connection)
```
@tamird @mperham Sorry for the dumb question here, but why is the connection considered corrupted at this point?

We use connection_pool to share our database connection between test and capybara threads, like https://gist.github.com/josevalim/470808#comment-474499:

```ruby
class ActiveRecord::Base
  def self.connection
    @@shared_connection || ConnectionPool::Wrapper.new(:size => 1) { retrieve_connection }
  end
end
```

And since this update, we're getting very, very weird random errors, mostly in the form of `Mysql2::Error: This connection is in use by: #<Thread:0x007fe86c3263b8 sleep>`.

By adding print statements into the code, I'm seeing that we hit this code a lot. What I'm guessing is happening is that the connection is discarded, then recreated when needed again. It uses Rails's internal database connection pool and picks up the same database connection again, which is handed off to the "other" thread without blocking. Everything goes back to normal again by reverting this ensure block to `checkin`.

I'm trying to figure out if there's something lacking in this line of code that is causing this problem now. I read through Headius's post and somewhat understand the Timeout::Error problem, but I could use a hand pointing out what I'm missing.

Thanks guys.
My guess is that you're throwing an exception while the connection is held. The problem here is really that connection_pool is one level too high to address the `Timeout::timeout` problem: connection_pool can't know if the block failed while the connection was actually being used or just before or after, so it errs on the side of eagerly discarding connections which might still be healthy:

```ruby
pool = ConnectionPool::Wrapper.new(:size => 1) { retrieve_connection }
pool.with do |conn|
  conn.do_stuff
  # this marks the connection as corrupted because
  # `with` cannot know if the error came during a
  # method on the connection or not
  raise 'throw away the connection'
end
```
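One illustration of that trade-off (my own sketch, not guidance from this thread; `SomeApplicationError` and `do_stuff` are placeholders): if the application handles its own errors inside the block, `with` completes normally and the connection is checked back in rather than discarded.

```ruby
pool.with do |conn|
  begin
    conn.do_stuff
  rescue SomeApplicationError => e
    # handled here, so the exception never escapes the block and the
    # still-healthy connection is returned to the pool
    warn(e.message)
  end
end
```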
Thanks @tamird, that helps.

Do you know why `checkin` and `@available.discard!(pop_connection)` would have a different wait behavior between threads, then? I'm trying to explore to make sense of it, but I'm not understanding it.

I even tried this to see if I could force a block in the `discard` branch, but it still failed:

```ruby
class CloseableMysqlConnection < ::BasicObject
  METHODS = [:close]

  def initialize(connection)
    @connection = connection
  end

  def close
    sleep 4 # just to see if this would work.
    @connection.close
  end

  def respond_to?(id, *args)
    METHODS.include?(id) || @connection.respond_to?(id, *args)
  end

  def method_missing(name, *args, &block)
    @connection.send(name, *args, &block)
  end
end

class ActiveRecord::Base
  def self.connection
    @@shared_connection || ConnectionPool::Wrapper.new(:size => 1) { CloseableMysqlConnection.new(retrieve_connection) }
  end
end
```

I'm having a heck of a time trying to put together a small, lightweight example using threads to show the difference in behavior.
What are you trying to demonstrate here? This is not a self-contained example.
You're right. I'm trying to put a self-contained example together now.

The fact that I'm seeing `Mysql2::Error: This connection is in use by: #<Thread:0x007fe86c3263b8 sleep>` now makes me think that ConnectionPool is not waiting with `@available.discard!(pop_connection)` the same way it did with `checkin`, and it's allowing another thread to use a resource too soon.

I was trying to show in the above snippet that I even tried to force a wait when `discard!` is called (because it calls `close` or `disconnect!`):

connection_pool/lib/connection_pool/timed_stack.rb, lines 130 to 133 in efaef8b:

```ruby
if obj.respond_to?(:close) # Dalli::Client
  obj.close rescue nil
elsif obj.respond_to?(:disconnect!) # Redis
  obj.disconnect! rescue nil
```
@zmoazeni At work, we encountered connection-in-use errors from Mysql2 for a long time. Only after adding an explicit mutex lock, as per my comment at https://gist.github.com/josevalim/470808#comment-1377343, did the problem disappear.
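For readers who don't want to click through, the mutex-lock approach looks roughly like this (my reconstruction; the exact code in the linked gist comment may differ, and the class name here is illustrative):

```ruby
require 'monitor'

# Wrap the one shared connection so every call on it is serialized through a
# single reentrant lock; concurrent threads then block on each other instead
# of tripping MySQL's "This connection is in use by" check.
class SharedConnection < BasicObject
  def initialize(connection)
    @connection = connection
    @lock = ::Monitor.new
  end

  def method_missing(name, *args, &block)
    @lock.synchronize { @connection.send(name, *args, &block) }
  end
end

class ActiveRecord::Base
  @@shared_connection = nil

  def self.connection
    @@shared_connection ||= SharedConnection.new(retrieve_connection)
  end
end
```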
@aprescott Thanks for the tip, but ConnectionPool still doesn't help us there with version 2.1.3, as we still get errors, though slightly different ones: `ActiveRecord::StatementInvalid: Mysql2::Error: Lost connection to MySQL server during query: SELECT ...`
We've seen this a lot in production, where a well-intentioned `Timeout.timeout` will interrupt `ensure` blocks and cause serious problems. See: http://headius.blogspot.com/2008/02/rubys-threadraise-threadkill-timeoutrb.html
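To make the hazard concrete, here is a minimal sketch (not taken from this PR) in which the timeout fires while the `ensure` block is still running, so the cleanup at the end of the `ensure` never executes:

```ruby
require 'timeout'

def with_fake_connection
  yield
ensure
  sleep 2                 # slow cleanup, still inside the ensure block
  puts "checked back in"  # skipped if Timeout raises during the sleep above
end

begin
  Timeout.timeout(1) do
    with_fake_connection { sleep 0.5 } # the work finishes; cleanup starts...
  end
rescue Timeout::Error
  puts "timed out; the ensure was cut short and the connection never returned"
end
```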