py-redis doesn't acknowledge disconnection from the server #3547

Open
adilbenameur opened this issue Mar 7, 2025 · 6 comments


adilbenameur commented Mar 7, 2025

Hi,
I have a Redis server hosted on Azure (Azure Redis PaaS). From time to time, I get connection errors to this server with py-redis. From the network capture I made with tcpdump, py-redis does not seem to respond to the server asking to drop the connection; it only responds to keepalive packets sent by the server. Finally, the connection is dropped by the server with an RST packet. I think the socket is broken on the py-redis side but still stays in the connection pool, and when py-redis tries to reuse it, a redis.exceptions.ConnectionError is thrown because the socket is broken.

Until commit acac4db, I think py-redis used a selector to handle network events. But since then, I don't think there is anything in the code to handle network events such as a disconnection.
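
To illustrate the kind of check I mean (a sketch of the general technique, not redis-py's actual code): a socket that the peer has closed becomes readable and returns an empty read, which a pre-reuse health check could detect:

# Sketch of the general technique, not redis-py's actual code: a socket the
# peer has closed selects as readable and recv() returns b"", so a pre-reuse
# check can detect the dead connection and drop it from the pool.
import select
import socket

def looks_disconnected(sock: socket.socket) -> bool:
    readable, _, _ = select.select([sock], [], [], 0)  # non-blocking poll
    if not readable:
        return False  # nothing pending, connection looks healthy
    try:
        data = sock.recv(1, socket.MSG_PEEK)  # peek without consuming data
    except OSError:
        return True  # socket error, treat as disconnected
    return data == b""  # empty read means the peer closed the connection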

Issue is related to #3509.

[Image: tcpdump capture]

For now, I think I can use retry_on_error to reconnect after a ConnectionError, but it might be worth implementing a real fix.
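
A minimal sketch of that workaround, assuming a plain redis-py client (host and port are placeholders for the Azure endpoint):

# Minimal sketch of the retry_on_error workaround; host/port are placeholders.
from redis import Redis
from redis.backoff import ExponentialBackoff
from redis.exceptions import ConnectionError
from redis.retry import Retry

client = Redis(
    host="my-cache.redis.cache.windows.net",  # placeholder Azure Redis host
    port=6380,
    ssl=True,
    # Retry the command a few times when the stale pooled socket fails.
    retry=Retry(ExponentialBackoff(), 3),
    retry_on_error=[ConnectionError],
    # Optionally ping connections idle longer than 30s before reusing them.
    health_check_interval=30,
)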

@adilbenameur (Author)

Also, regarding the solution @vladvildanov proposed on #3509: Azure Redis does not make CLIENT NO-EVICT or CONFIG SET timeout available.

@vladvildanov (Collaborator)

@adilbenameur How is it possible to reproduce this issue? Should I be able to reproduce it by executing CLIENT KILL on the server side?

@adilbenameur (Author)

@vladvildanov Yes, logically it should be reproducible this way.
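
A rough sketch of that reproduction (connection details are placeholders; the second client only exists to issue CLIENT KILL):

# Rough reproduction sketch: kill a pooled connection server-side, then reuse it.
import redis

victim = redis.Redis(host="localhost", port=6379)  # placeholder connection details
victim.ping()                                      # puts a live connection in the pool
victim_id = victim.client_id()                     # CLIENT ID of that connection

admin = redis.Redis(host="localhost", port=6379)   # second client used only for CLIENT KILL
admin.client_kill_filter(_id=str(victim_id))       # drop the victim connection server-side

# The pooled socket is now dead; without retry_on_error the next command is
# expected to raise redis.exceptions.ConnectionError when the socket is reused.
victim.ping()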

@adilbenameur (Author)

Also, to give more context: I use Redis for a Django app (as a cache and as the message broker for Celery). The error seems to occur only when Celery tries to talk to Redis (and not every time).

This is a typical traceback:

Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/redis/connection.py", line 454, in send_packed_command
    self._sock.sendall(item)
  File "/usr/local/lib/python3.11/ssl.py", line 1273, in sendall
    v = self.send(byte_view[count:])
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/ssl.py", line 1242, in send
    return self._sslobj.write(data)
           ^^^^^^^^^^^^^^^^^^^^^^^^
ssl.SSLEOFError: EOF occurred in violation of protocol (_ssl.c:2427)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/django/core/handlers/exception.py", line 47, in inner
    response = get_response(request)
               ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/django/core/handlers/base.py", line 181, in _get_response
    response = wrapped_callback(request, *callback_args, **callback_kwargs)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/django/views/generic/base.py", line 70, in view
    return self.dispatch(request, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/nautobot/core/views/mixins.py", line 166, in dispatch
    return super().dispatch(request, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/django/views/generic/base.py", line 98, in dispatch
    return handler(request, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/nautobot/extras/views.py", line 1283, in get
    job_form = job_class.as_form(initial=initial)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/nautobot/extras/jobs.py", line 487, in as_form
    form.fields["_task_queue"].choices = task_queues_as_choices(task_queues)
                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/nautobot/extras/utils.py", line 364, in task_queues_as_choices
    celery_queues = get_celery_queues()
                    ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/nautobot/extras/utils.py", line 333, in get_celery_queues
    active_queues = celery_inspect.active_queues()
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/celery/app/control.py", line 338, in active_queues
    return self._request('active_queues')
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/celery/app/control.py", line 106, in _request
    return self._prepare(self.app.control.broadcast(
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/celery/app/control.py", line 776, in broadcast
    return self.mailbox(conn)._broadcast(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/kombu/pidbox.py", line 346, in _broadcast
    return self._collect(reply_ticket, limit=limit,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/kombu/pidbox.py", line 388, in _collect
    self.connection.drain_events(timeout=timeout)
  File "/usr/local/lib/python3.11/site-packages/kombu/connection.py", line 341, in drain_events
    return self.transport.drain_events(self.connection, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/kombu/transport/virtual/base.py", line 997, in drain_events
    get(self._deliver, timeout=timeout)
  File "/usr/local/lib/python3.11/site-packages/kombu/transport/redis.py", line 584, in get
    self._register_BRPOP(channel)
  File "/usr/local/lib/python3.11/site-packages/kombu/transport/redis.py", line 525, in _register_BRPOP
    channel._brpop_start()
  File "/usr/local/lib/python3.11/site-packages/kombu/transport/redis.py", line 957, in _brpop_start
    self.client.connection.send_command(*command_args)
  File "/usr/local/lib/python3.11/site-packages/redis/connection.py", line 476, in send_command
    self.send_packed_command(
  File "/usr/local/lib/python3.11/site-packages/redis/connection.py", line 465, in send_packed_command
    raise ConnectionError(f"Error {errno} while writing to socket. {errmsg}.")
redis.exceptions.ConnectionError: Error 8 while writing to socket. EOF occurred in violation of protocol (_ssl.c:2427).

Note that the SSL error is in reality a network error.


adilbenameur commented Mar 7, 2025

The retry_on_error option is not implemented in the Celery config. I am trying to see if I can pass it through the Redis URL, but I don't understand how the code that parses retry_on_error works. For example, how can the function parse_url parse ConnectionError as part of a list?
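
As far as I understand, retry_on_error expects a list of exception classes, so it cannot be expressed inside the URL string itself. A sketch of passing it as a keyword argument next to the URL instead (the URL is a placeholder; this does not solve the Celery case, but shows why the URL parser alone cannot carry it):

# Sketch: retry_on_error takes exception classes, so pass it as a keyword
# argument alongside the URL rather than encoding it in the URL string.
from redis import Redis
from redis.backoff import ExponentialBackoff
from redis.exceptions import ConnectionError
from redis.retry import Retry

client = Redis.from_url(
    "rediss://my-cache.redis.cache.windows.net:6380/0",  # placeholder URL
    retry=Retry(ExponentialBackoff(), 3),
    retry_on_error=[ConnectionError],
)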

@adilbenameur (Author)

Were you able to reproduce the bug, @vladvildanov?

If needed, I can make a PR for this:

The retry_on_error option is not implemented in the Celery config. I am trying to see if I can pass it through the Redis URL, but I don't understand how the code that parses retry_on_error works. For example, how can the function parse_url parse ConnectionError as part of a list?
