Protect against corrupting interrupts #67

Merged: 8 commits into mperham:master, Jan 18, 2015

Conversation

@tamird (Contributor) commented Jan 17, 2015

We've seen this a lot in production, where a well-intentioned Timeout.timeout will interrupt ensure blocks and cause serious problems. See: http://headius.blogspot.com/2008/02/rubys-threadraise-threadkill-timeoutrb.html
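
A minimal, self-contained sketch of the failure mode (not code from this PR): Timeout.timeout raises asynchronously via Thread#raise, so the exception can land inside an ensure block and skip the cleanup it was supposed to run.

require 'timeout'

cleaned_up = false

begin
  Timeout.timeout(0.1) do
    begin
      sleep 0.05         # the "work" finishes before the deadline...
    ensure
      sleep 1            # ...but cleanup is still running when the timer fires
      cleaned_up = true  # skipped: Timeout::Error is raised inside the ensure
    end
  end
rescue Timeout::Error
end

puts cleaned_up # => false; in connection_pool's case the skipped cleanup is checkin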

@tamird (Contributor, Author) commented Jan 17, 2015

@mperham looks like travis isn't set up for this repo :(

@mperham (Owner) commented Jan 17, 2015

That's nuts. Timeouts bypass ensures???

@tamird (Contributor, Author) commented Jan 17, 2015

Yup, nuts is an understatement

@mperham (Owner) commented Jan 17, 2015

Btw I never recommend using Timeout.timeout in a multithreaded system. Quirks like this are exactly why. Use socket timeouts only.
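
For contrast, a quick sketch (not from this PR) of letting the socket layer enforce the deadline instead of an external Thread#raise; a socket timeout raises from inside the blocking call itself, so ensure blocks run normally:

require 'net/http'

http = Net::HTTP.new('example.com', 80)
http.open_timeout = 2 # seconds allowed to establish the TCP connection
http.read_timeout = 5 # seconds allowed for each blocking read
response = http.get('/') # raises Net::OpenTimeout / Net::ReadTimeout on expiry
puts response.code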

@tamird (Contributor, Author) commented Jan 17, 2015

That's sound advice, but Timeout.timeout is sometimes used to wrap multiple operations where individual socket timeouts are insufficiently expressive.

On another note, this fix is insufficient because the connection being checked back in can still be corrupted -- another commit on the way.

@tamird (Contributor, Author) commented Jan 18, 2015

@mperham PTAL

@tamird (Contributor, Author) commented Jan 18, 2015

This takes a different approach now, and simply discards any connection which is deemed unsafe to reuse.
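
Roughly, the shape of that approach looks like the sketch below. This is an illustration with assumptions, not the actual diff (the real check-in/discard branch is quoted in the review excerpt near the end of this thread): only check the connection back in when the block is known to have finished cleanly, otherwise discard it.

def with(options = {})
  conn = checkout(options)
  clean = false
  begin
    result = yield conn
    clean = true  # only reached if the block returned normally
    result
  ensure
    if clean
      # everything is roses, we can safely check the connection back in
      checkin
    else
      # the block may have been interrupted mid-operation; don't trust conn
      @available.discard!(pop_connection)
    end
  end
end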

@mperham (Owner) commented Jan 18, 2015

Ugh, I hate this type of issue. Near impossible to test or understand. Is the best I can do to cut a new release and wait for issues?

@ggilder commented Jan 18, 2015

Here's a script that reliably reproduces the issue: https://gist.github.com/ggilder/b531bdceea44a449c840

Against current connection_pool (or, of course, switched to just use one Redis connection), it will consistently return mismatched responses within 1-3 timeouts. With @tamird's patch, it throws away the connection after nearly every timeout, but never returns a bad response.

@zanker commented Jan 18, 2015

@mperham It's not too bad to test, but it does make the test suite more complicated.

We do it through acceptance tests against an HTTP server, but you can do the same thing with a basic TCPServer that sleeps and echoes. Quick dry-code example (assumes an RSpec context for expect):

require "socket"
require "connection_pool"

server_thread = Thread.new do
  server = TCPServer.new("127.0.0.1", 10_000)
  loop do
    client = server.accept
    Thread.new do
      # echo each line back, but only after a long delay
      while (line = client.gets)
        sleep 4
        client.write(line)
      end
    end
  end
end
sleep 0.1 # give the server a moment to start listening

pool = ConnectionPool.new { TCPSocket.new("127.0.0.1", 10_000) }

thread = Thread.new do
  pool.with do |conn|
    conn.puts("1")
    conn.gets # blocks on the delayed echo until the kill below interrupts it
  end
end

thread.join(1)
thread.kill

pool.with do |conn|
  conn.puts("2")
  expect(conn.gets.chomp).to eq("2")
end

If it's handled properly, you'll get back "2" rather than "1".

mperham added a commit that referenced this pull request Jan 18, 2015: Protect against corrupting interrupts

@mperham merged commit e1e3934 into mperham:master Jan 18, 2015
@ggilder commented Jan 18, 2015

Thanks @mperham! Could you set up Travis for this repo? It seems that some of the tests have rotted under JRuby...

@mperham (Owner) commented Jan 18, 2015

Who wants commit bit for this? You guys can set it up yourselves. @tamird?

@tamird (Contributor, Author) commented Jan 18, 2015

yup, sounds good

@mperham (Owner) commented Jan 18, 2015

Done, please stick with PRs rather than committing directly to master.

@tamird (Contributor, Author) commented Jan 18, 2015

You got it. Thanks for the fast triage here!

@tamird (Contributor, Author) commented Jan 18, 2015

@mperham it turns out I can't set up Travis because I need administrative access:
http://docs.travis-ci.com/user/getting-started/#Step-two%3A-Activate-GitHub-Webhook

@mperham (Owner) commented Jan 18, 2015

Oy vey, activated.

BTW I'll cut a new release on Monday unless y'all find something wrong.

@mperham (Owner) commented Jan 18, 2015

@tenderlove

@tenderlove commented:

I'd appreciate it if someone can check my work since I'm super jet lagged, but here is a script that will make the connection pool fail to correctly check the connection back in.

require 'connection_pool'
require 'thread'
require 'monitor'

# Add a latch class so we can manipulate thread order
class Latch
  def initialize count = 1
    @count = count
    @lock = Monitor.new
    @cv = @lock.new_cond
  end

  def release
    @lock.synchronize do
      @count -= 1 if @count > 0
      @cv.broadcast if @count.zero?
    end
  end

  def await
    @lock.synchronize do
      @cv.wait_while { @count > 0 }
    end
  end
end

memcached = Class.new(ConnectionPool) {
  def checkin
    # add a puts so we can give the scheduler a chance to switch
    puts __method__
    super
  end

  def pop_connection
    # add a puts so we can give the scheduler a chance to switch
    puts __method__
    super
  end
}.new(size: 5, timeout: 5) { Object.new }

Thread.abort_on_exception = true
queue = SizedQueue.new 2

2.times.map {
  Thread.new {
    loop do
      job_start_latch = Latch.new
      job_finish_latch = Latch.new
      worker = Thread.new do
        # This latch is to ensure the thread is up and running before we have
        # a different thread attempt to kill it.
        job_start_latch.await

        memcached.with do |connection|
          # This latch is to notify the "killer" thread that the work is done,
          # and we're ready to be killed
          job_finish_latch.release
        end
      end
      queue << [job_start_latch, job_finish_latch, worker]
    end
  }
}

2.times.map {
  Thread.new {
    loop do
      start, finish, thread = *queue.pop
      start.release # let the target thread know that we're ready to kill it
      finish.await  # wait on the target thread to tell us it's time
      thread.kill   # kill the target thread
    end
  }
}.each(&:join)

You may notice the weird subclass. I added a puts so that I could give the scheduler a chance to switch to a different thread which would increase the odds of showing off the bug. Without the puts, I couldn't get the scheduler to switch, so I never saw the error.

If you run this against master, or against #70, you'll see a timeout error on checkout from timed_stack. I think adding the puts to encourage thread switching is valid, but I'm not 100% sure.

@mperham (Owner) commented Jan 19, 2015

@tenderlove @tamird's work certainly helps a lot. I've gotten half a dozen reports from Sidekiq users about really weird Redis connection behavior (here's the latest) over the last 6 months. I suspect this fix will solve their problems in all but the rarest of edge cases, i.e. maybe it gets us from 99% to 99.99% reliability. I do want to get to 100% but this is still forward progress.

If anyone has ideas on what the right solution is long-term, please chime in on #70. I want to avoid splitting the conversation across two issues.

mperham added a commit that referenced this pull request Feb 25, 2015

@mperham (Owner) commented Feb 25, 2015

I've had several Sidekiq users complain that this change is leading to connection exhaustion. I believe the issue is due to discard!: the connection pool does not close the connections it discards, so they rely on Ruby's GC to close the socket, or on the socket naturally timing out. I'm going to put in a bit of a hack to work around the issue.
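
A sketch of the kind of hack described, under the assumption that discard! simply closes whatever it can before dropping the connection (a similar fragment is quoted in the review discussion below):

# Illustration only, not the actual commit: close the raw connection when it
# is discarded instead of waiting for GC or a server-side idle timeout.
def discard!(obj)
  if obj.respond_to?(:close)          # e.g. Dalli::Client
    obj.close rescue nil
  elsif obj.respond_to?(:disconnect!) # e.g. Redis
    obj.disconnect! rescue nil
  end
  # ...then drop obj instead of returning it to the available stack
end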

@zanker commented Feb 25, 2015

As a suggestion: discard! already calls shutdown. Any reason not to just pass a shutdown block in Sidekiq that does the same thing? That avoids having to special-case it.

@mperham (Owner) commented Feb 25, 2015

@zanker The shutdown block cannot be passed right now. It's only passed when shutting down the entire pool.
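
For context, the shutdown block in question is the one supplied when tearing down the whole pool, roughly like this (illustration, not code from this PR):

require "socket"
require "connection_pool"

pool = ConnectionPool.new(size: 5) { TCPSocket.new("127.0.0.1", 6379) }
# ... use the pool ...
# The block runs once per pooled connection, and only at full-pool shutdown;
# at the time of this discussion there was no per-discard hook.
pool.shutdown { |conn| conn.close }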

mperham added a commit that referenced this pull request Feb 25, 2015

@zanker commented Feb 26, 2015

@mperham Ah! Right, sorry, I forgot that's only set on shutdown, not all the time.

  # everything is roses, we can safely check the connection back in
  checkin
else
  @available.discard!(pop_connection)
@zmoazeni commented:

@tamird @mperham Sorry for the dumb question here, but why is the connection considered corrupted at this point?

We use connection_pool to share our database connection between the test and Capybara threads, as in https://gist.github.com/josevalim/470808#comment-474499:

class ActiveRecord::Base
  def self.connection
    @@shared_connection || ConnectionPool::Wrapper.new(:size => 1) { retrieve_connection }
  end
end

Since this update, we're getting very weird random errors, mostly of the form Mysql2::Error: This connection is in use by: #<Thread:0x007fe86c3263b8 sleep>.

By adding print statements to the code, I can see we hit this branch a lot. My guess is that the connection is discarded and then recreated when needed again; because it uses Rails's internal database connection pool, it picks up the same database connection, which is then handed off to the "other" thread without blocking. Everything goes back to normal if I revert this ensure block to checkin.

I'm trying to figure out whether there's something lacking in this line of code that is causing the problem now. I read through Headius's post and somewhat understand the Timeout::Error problem, but I could use a hand pointing out what I'm missing.

Thanks guys.

@tamird (Contributor, Author) commented:

My guess is that you're throwing an exception while the connection is held. The problem here is really that connection_pool is one level too high to address the Timeout::timeout problem - connection_pool can't know if the block failed while the connection was actually being used or just before or after, so it errs on the side of eagerly discarding connections which might still be healthy:

pool = ConnectionPool::Wrapper.new(:size => 1) { retrieve_connection }

pool.with do |conn|
  conn.do_stuff
  # this marks the connection as corrupted because
  # `with` cannot know if the error came during a
  # method on the connection or not
  raise 'throw away the connection'
end

@zmoazeni commented:

Thanks @tamird, that helps.

Do you know why checkin and @available.discard!(pop_connection) would have different wait behavior between threads, then? I'm trying to explore it to make sense of it, but I don't understand it yet.

I even tried the following to see if I could force blocking in the discard branch, but it still failed:

class CloseableMysqlConnection < ::BasicObject
  METHODS = [:close]

  def initialize(connection)
    @connection = connection
  end

  def close
    sleep 4 # just to see if this would work.
    @connection.close
  end

  def respond_to?(id, *args)
    METHODS.include?(id) || @connection.respond_to?(id, *args)
  end

  def method_missing(name, *args, &block)
    @connection.send(name, *args, &block)
  end
end


class ActiveRecord::Base
  def self.connection
    @@shared_connection || ConnectionPool::Wrapper.new(:size => 1) { CloseableMysqlConnection.new(retrieve_connection) }
  end
end

I'm having a heck of a time trying to put together a small lightweight example using threads to show the difference in behavior.

@tamird (Contributor, Author) commented:

What are you trying to demonstrate here? This is not a self-contained example.

@zmoazeni commented:

You're right. I'm trying to put a self-contained example together now.

The fact that I'm now seeing Mysql2::Error: This connection is in use by: #<Thread:0x007fe86c3263b8 sleep>. makes me think that ConnectionPool is not waiting with @available.discard!(pop_connection) the way it did with checkin, and that it's allowing another thread to use a resource too soon.

I was trying to show in the above snippet that I even tried to force a wait when discard! is called, because it calls close or disconnect!:

if obj.respond_to?(:close)          # Dalli::Client
  obj.close rescue nil
elsif obj.respond_to?(:disconnect!) # Redis
  obj.disconnect! rescue nil
end

And I still see the error.

@aprescott commented:

@zmoazeni at work, we encountered connection-in-use errors from Mysql2 for a long time. Only after adding an explicit mutex lock as per my comment at https://gist.github.com/josevalim/470808#comment-1377343 did the problem disappear.
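
For readers hitting the same Mysql2 error: a minimal sketch of that shared-connection-plus-lock idea, assuming a hypothetical wrapper class (the exact code lives in the linked gist comment):

require 'monitor'

# Hypothetical LockingConnection: every call on the shared AR connection takes
# a re-entrant lock, so the test thread and the Capybara server thread never
# touch the underlying connection concurrently.
class LockingConnection < BasicObject
  def initialize(conn)
    @conn = conn
    @lock = ::Monitor.new
  end

  def respond_to?(id, *args)
    @conn.respond_to?(id, *args)
  end

  def method_missing(name, *args, &block)
    @lock.synchronize { @conn.send(name, *args, &block) }
  end
end

class ActiveRecord::Base
  def self.connection
    @@shared_connection ||= LockingConnection.new(retrieve_connection)
  end
end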

@zmoazeni commented:

@aprescott Thanks for the tip, but ConnectionPool 2.1.3 still doesn't help us there; we still get errors, though slightly different ones:

ActiveRecord::StatementInvalid: Mysql2::Error: Lost connection to MySQL server during query: SELECT ...
