Protect against corrupting interrupts #67

Merged: 8 commits into mperham:master, Jan 18, 2015

Conversation

@tamird (Contributor) commented Jan 17, 2015

We've seen this a lot in production, where a well-intentioned Timeout.timeout will interrupt ensure blocks and cause serious problems. See: http://headius.blogspot.com/2008/02/rubys-threadraise-threadkill-timeoutrb.html
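
A minimal, self-contained sketch of the failure mode (not code from this PR): Timeout.timeout raises asynchronously via Thread#raise, so the exception can land inside an ensure block and skip the cleanup it was supposed to run.

require 'timeout'

cleaned_up = false

begin
  Timeout.timeout(0.1) do
    begin
      sleep 0.05         # the "work" finishes before the deadline...
    ensure
      sleep 1            # ...but cleanup is still running when the timer fires
      cleaned_up = true  # skipped: Timeout::Error is raised inside the ensure
    end
  end
rescue Timeout::Error
end

puts cleaned_up # => false; in connection_pool's case the skipped cleanup is checkin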

@tamird (Contributor, Author) commented Jan 17, 2015

@mperham looks like travis isn't set up for this repo :(

@mperham (Owner) commented Jan 17, 2015

That's nuts. Timeouts bypass ensures???

@tamird (Contributor, Author) commented Jan 17, 2015

Yup, nuts is an understatement

@mperham (Owner) commented Jan 17, 2015

Btw I never recommend using Timeout.timeout in a multithreaded system. Quirks like this are exactly why. Use socket timeouts only.
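
For contrast, a quick sketch (not from this PR) of letting the socket layer enforce the deadline instead of an external Thread#raise; a socket timeout raises from inside the blocking call itself, so ensure blocks run normally:

require 'net/http'

http = Net::HTTP.new('example.com', 80)
http.open_timeout = 2 # seconds allowed to establish the TCP connection
http.read_timeout = 5 # seconds allowed for each blocking read
response = http.get('/') # raises Net::OpenTimeout / Net::ReadTimeout on expiry
puts response.code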

@tamird (Contributor, Author) commented Jan 17, 2015

That's sound advice, but Timeout.timeout is sometimes used to wrap multiple operations where individual socket timeouts are insufficiently expressive.

On another note, this fix is insufficient because the connection being checked back in can still be corrupted -- another commit on the way.

@tamird (Contributor, Author) commented Jan 18, 2015

@mperham PTAL

@tamird (Contributor, Author) commented Jan 18, 2015

This takes a different approach now, and simply discards any connection which is deemed unsafe to reuse.
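
Roughly, the shape of that approach looks like the sketch below. This is an illustration with assumptions, not the actual diff (the real check-in/discard branch is quoted in the review excerpt near the end of this thread): only check the connection back in when the block is known to have finished cleanly, otherwise discard it.

def with(options = {})
  conn = checkout(options)
  clean = false
  begin
    result = yield conn
    clean = true  # only reached if the block returned normally
    result
  ensure
    if clean
      # everything is roses, we can safely check the connection back in
      checkin
    else
      # the block may have been interrupted mid-operation; don't trust conn
      @available.discard!(pop_connection)
    end
  end
end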

@mperham (Owner) commented Jan 18, 2015

Ugh, I hate this type of issue. Near impossible to test or understand. Is the best I can do to cut a new release and wait for issues?

@ggilder commented Jan 18, 2015

Here's a script that reliably reproduces the issue: https://gist.github.com/ggilder/b531bdceea44a449c840

Against current connection_pool (or, of course, switched to just use one Redis connection), it will consistently return mismatched responses within 1-3 timeouts. With @tamird's patch, it throws away the connection after nearly every timeout, but never returns a bad response.

@zanker commented Jan 18, 2015

@mperham It's not too bad to test, but it does make the test suite more complicated.

We do it through acceptance tests against an HTTP server, but you can do the same thing with a basic TCPServer that sleeps and echoes. Quick dry-code example (assumes an RSpec context for expect):

require "socket"
require "connection_pool"

server_thread = Thread.new do
  server = TCPServer.new("127.0.0.1", 10_000)
  loop do
    client = server.accept
    Thread.new do
      # echo each line back, but only after a long delay
      while (line = client.gets)
        sleep 4
        client.write(line)
      end
    end
  end
end
sleep 0.1 # give the server a moment to start listening

pool = ConnectionPool.new { TCPSocket.new("127.0.0.1", 10_000) }

thread = Thread.new do
  pool.with do |conn|
    conn.puts("1")
    conn.gets # blocks on the delayed echo until the kill below interrupts it
  end
end

thread.join(1)
thread.kill

pool.with do |conn|
  conn.puts("2")
  expect(conn.gets.chomp).to eq("2")
end

If it's handled properly, you'll get back "2" rather than "1".

mperham added a commit that referenced this pull request Jan 18, 2015: Protect against corrupting interrupts

@mperham merged commit e1e3934 into mperham:master Jan 18, 2015
@ggilder commented Jan 18, 2015

Thanks @mperham! Could you set up Travis for this repo? It seems that some of the tests have rotted under JRuby...

@mperham (Owner) commented Jan 18, 2015

Who wants commit bit for this? You guys can set it up yourselves. @tamird?

@tamird (Contributor, Author) commented Jan 18, 2015

yup, sounds good

@mperham (Owner) commented Jan 18, 2015

Done, please stick with PRs rather than committing directly to master.

@tamird (Contributor, Author) commented Jan 18, 2015

You got it. Thanks for the fast triage here!

@tamird (Contributor, Author) commented Jan 18, 2015

@mperham it turns out I can't set up Travis because I need administrative access:
http://docs.travis-ci.com/user/getting-started/#Step-two%3A-Activate-GitHub-Webhook

@mperham (Owner) commented Jan 18, 2015

Oy vey, activated.

BTW I'll cut a new release on Monday unless y'all find something wrong.

@mperham (Owner) commented Jan 18, 2015

@tenderlove

@tenderlove commented:

I'd appreciate it if someone can check my work since I'm super jet lagged, but here is a script that will make the connection pool fail to correctly check the connection back in.

require 'connection_pool'
require 'thread'
require 'monitor'

# Add a latch class so we can manipulate thread order
class Latch
  def initialize count = 1
    @count = count
    @lock = Monitor.new
    @cv = @lock.new_cond
  end

  def release
    @lock.synchronize do
      @count -= 1 if @count > 0
      @cv.broadcast if @count.zero?
    end
  end

  def await
    @lock.synchronize do
      @cv.wait_while { @count > 0 }
    end
  end
end

memcached = Class.new(ConnectionPool) {
  def checkin
    # add a puts so we can give the scheduler a chance to switch
    puts __method__
    super
  end

  def pop_connection
    # add a puts so we can give the scheduler a chance to switch
    puts __method__
    super
  end
}.new(size: 5, timeout: 5) { Object.new }

Thread.abort_on_exception = true
queue = SizedQueue.new 2

2.times.map {
  Thread.new {
    loop do
      job_start_latch = Latch.new
      job_finish_latch = Latch.new
      worker = Thread.new do
        # This latch is to ensure the thread is up and running before we have
        # a different thread attempt to kill it.
        job_start_latch.await

        memcached.with do |connection|
          # This latch is to notify the "killer" thread that the work is done,
          # and we're ready to be killed
          job_finish_latch.release
        end
      end
      queue << [job_start_latch, job_finish_latch, worker]
    end
  }
}

2.times.map {
  Thread.new {
    loop do
      start, finish, thread = *queue.pop
      start.release # let the target thread know that we're ready to kill it
      finish.await  # wait on the target thread to tell us it's time
      thread.kill   # kill the target thread
    end
  }
}.each(&:join)

You may notice the weird subclass. I added a puts so that I could give the scheduler a chance to switch to a different thread which would increase the odds of showing off the bug. Without the puts, I couldn't get the scheduler to switch, so I never saw the error.

If you run this against master, or against #70, you'll see a timeout error on checkout from timed_stack. I think adding the puts to encourage thread switching is valid, but I'm not 100% sure.

@mperham (Owner) commented Jan 19, 2015

@tenderlove @tamird's work certainly helps a lot. I've gotten half a dozen reports from Sidekiq users about really weird Redis connection behavior (here's the latest) over the last 6 months. I suspect this fix will solve their problems in all but the rarest of edge cases, i.e. maybe it gets us from 99% to 99.99% reliability. I do want to get to 100% but this is still forward progress.

If anyone has ideas on what the right solution is long-term, please chime in on #70. I want to avoid splitting the conversation across two issues.

mperham added a commit that referenced this pull request Feb 25, 2015

@mperham (Owner) commented Feb 25, 2015

I've had several Sidekiq users complain that this change is leading to connection exhaustion. I believe the issue is due to discard!: the connection pool does not close the connections it discards, so they rely on Ruby's GC to close the socket, or on the socket naturally timing out. I'm going to put in a bit of a hack to work around the issue.
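
A sketch of the kind of hack described, under the assumption that discard! simply closes whatever it can before dropping the connection (a similar fragment is quoted in the review discussion below):

# Illustration only, not the actual commit: close the raw connection when it
# is discarded instead of waiting for GC or a server-side idle timeout.
def discard!(obj)
  if obj.respond_to?(:close)          # e.g. Dalli::Client
    obj.close rescue nil
  elsif obj.respond_to?(:disconnect!) # e.g. Redis
    obj.disconnect! rescue nil
  end
  # ...then drop obj instead of returning it to the available stack
end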

@zanker commented Feb 25, 2015

As a suggestion: discard! already calls shutdown. Any reason not to just pass a shutdown block in Sidekiq that does the same thing? That avoids having to special-case it.

@mperham (Owner) commented Feb 25, 2015

@zanker The shutdown block cannot be passed right now. It's only passed when shutting down the entire pool.
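
For context, the shutdown block in question is the one supplied when tearing down the whole pool, roughly like this (illustration, not code from this PR):

require "socket"
require "connection_pool"

pool = ConnectionPool.new(size: 5) { TCPSocket.new("127.0.0.1", 6379) }
# ... use the pool ...
# The block runs once per pooled connection, and only at full-pool shutdown;
# at the time of this discussion there was no per-discard hook.
pool.shutdown { |conn| conn.close }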

mperham added a commit that referenced this pull request Feb 25, 2015

@zanker commented Feb 26, 2015

@mperham Ah! Right, sorry, I forgot that's only set on shutdown, not all the time.

  # everything is roses, we can safely check the connection back in
  checkin
else
  @available.discard!(pop_connection)
@zmoazeni commented:

@tamird @mperham Sorry for the dumb question here, but why is the connection considered corrupted at this point?

We use connection_pool to share our database connection between the test and Capybara threads, as in https://gist.github.com/josevalim/470808#comment-474499:

class ActiveRecord::Base
  def self.connection
    @@shared_connection || ConnectionPool::Wrapper.new(:size => 1) { retrieve_connection }
  end
end

Since this update, we're getting very weird random errors, mostly of the form Mysql2::Error: This connection is in use by: #<Thread:0x007fe86c3263b8 sleep>.

By adding print statements to the code, I can see we hit this branch a lot. My guess is that the connection is discarded and then recreated when needed again; because it uses Rails's internal database connection pool, it picks up the same database connection, which is then handed off to the "other" thread without blocking. Everything goes back to normal if I revert this ensure block to checkin.

I'm trying to figure out whether there's something lacking in this line of code that is causing the problem now. I read through Headius's post and somewhat understand the Timeout::Error problem, but I could use a hand pointing out what I'm missing.

Thanks guys.

@tamird (Contributor, Author) commented:

My guess is that you're throwing an exception while the connection is held. The problem here is really that connection_pool is one level too high to address the Timeout::timeout problem - connection_pool can't know if the block failed while the connection was actually being used or just before or after, so it errs on the side of eagerly discarding connections which might still be healthy:

pool = ConnectionPool::Wrapper.new(:size => 1) { retrieve_connection }

pool.with do |conn|
  conn.do_stuff
  # this marks the connection as corrupted because
  # `with` cannot know if the error came during a
  # method on the connection or not
  raise 'throw away the connection'
end

@zmoazeni commented:

Thanks @tamird, that helps.

Do you know why checkin and @available.discard!(pop_connection) would have different wait behavior between threads, then? I'm trying to explore it to make sense of it, but I don't understand it yet.

I even tried the following to see if I could force blocking in the discard branch, but it still failed:

class CloseableMysqlConnection < ::BasicObject
  METHODS = [:close]

  def initialize(connection)
    @connection = connection
  end

  def close
    sleep 4 # just to see if this would work.
    @connection.close
  end

  def respond_to?(id, *args)
    METHODS.include?(id) || @connection.respond_to?(id, *args)
  end

  def method_missing(name, *args, &block)
    @connection.send(name, *args, &block)
  end
end


class ActiveRecord::Base
  def self.connection
    @@shared_connection || ConnectionPool::Wrapper.new(:size => 1) { CloseableMysqlConnection.new(retrieve_connection) }
  end
end

I'm having a heck of a time trying to put together a small lightweight example using threads to show the difference in behavior.

@tamird (Contributor, Author) commented:

What are you trying to demonstrate here? This is not a self-contained example.

@zmoazeni commented:

You're right. I'm trying to put a self-contained example together now.

The fact that I'm now seeing Mysql2::Error: This connection is in use by: #<Thread:0x007fe86c3263b8 sleep>. makes me think that ConnectionPool is not waiting with @available.discard!(pop_connection) the way it did with checkin, and that it's allowing another thread to use a resource too soon.

I was trying to show in the above snippet that I even tried to force a wait when discard! is called, because it calls close or disconnect!:

if obj.respond_to?(:close)          # Dalli::Client
  obj.close rescue nil
elsif obj.respond_to?(:disconnect!) # Redis
  obj.disconnect! rescue nil
end

And I still see the error.

@aprescott commented:

@zmoazeni at work, we encountered connection-in-use errors from Mysql2 for a long time. Only after adding an explicit mutex lock as per my comment at https://gist.github.com/josevalim/470808#comment-1377343 did the problem disappear.
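
For readers hitting the same Mysql2 error: a minimal sketch of that shared-connection-plus-lock idea, assuming a hypothetical wrapper class (the exact code lives in the linked gist comment):

require 'monitor'

# Hypothetical LockingConnection: every call on the shared AR connection takes
# a re-entrant lock, so the test thread and the Capybara server thread never
# touch the underlying connection concurrently.
class LockingConnection < BasicObject
  def initialize(conn)
    @conn = conn
    @lock = ::Monitor.new
  end

  def respond_to?(id, *args)
    @conn.respond_to?(id, *args)
  end

  def method_missing(name, *args, &block)
    @lock.synchronize { @conn.send(name, *args, &block) }
  end
end

class ActiveRecord::Base
  def self.connection
    @@shared_connection ||= LockingConnection.new(retrieve_connection)
  end
end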

@zmoazeni commented:

@aprescott Thanks for the tip, but ConnectionPool 2.1.3 still doesn't help us there; we still get errors, though slightly different ones:

ActiveRecord::StatementInvalid: Mysql2::Error: Lost connection to MySQL server during query: SELECT ...
