Dead-Lock of async_tcp task and tcp/ip task when _async_queue fills up completely #876
Comments
Looking further into this, I realise the dead-lock is just what happens when the _async_queue of the async_tcp event loop fills up to 100%, causing xQueueGenericSend() to block indefinitely.

Anyway, the _async_queue filling up completely seems to happen easily when periodic push messages are sent via SSE / the AsyncEventSource. I still have to look into why that happens, but I suspect one of the default or user-defined AsyncCallbackWebHandlers blocks for a certain amount of time while SSE push messages are acknowledged. If this is correct, then there should be graceful handling of the event queue filling to the top (e.g. a sensible error message plus discarding further events). Also, what helps in my case is increasing the _async_queue length, which by default is only 32. (This could be a configure option.) Please let me know if you need more input.
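For illustration, a minimal sketch of what such graceful handling could look like in AsyncTCP's _send_async_event() helper, assuming the stock pattern of an xQueueSend() with portMAX_DELAY; the timeout, the log text and the choice to simply drop the event are examples only, not a committed fix:

```cpp
// Sketch only: bounded wait instead of blocking forever when _async_queue is full.
// The stock AsyncTCP.cpp passes portMAX_DELAY here; the LWIP callbacks free the
// event packet themselves when this returns false. Note that dropping data-carrying
// events (e.g. LWIP_TCP_RECV) loses data, so this only avoids the dead-lock.
static inline bool _send_async_event(lwip_event_packet_t **e) {
    if (!_async_queue) {
        return false;
    }
    if (xQueueSend(_async_queue, e, pdMS_TO_TICKS(100)) != pdPASS) {
        ets_printf("AsyncTCP: _async_queue full, discarding event %d\n", (int)(*e)->event);
        return false;
    }
    return true;
}
```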
You could measure in _handle_async_event() how long an event needs for handling, especially in the case of LWIP_TCP_RECV. If there are peaks greater than 300 ms (as I see in your JavaScript), you could dig in there. Usually SPIFFS access takes quite long.
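For reference, a sketch of what such a measurement could look like for the LWIP_TCP_RECV case inside _handle_async_event() in AsyncTCP.cpp; the field names follow the stock lwip_event_packet_t layout, and the 300 ms threshold is just the value mentioned above:

```cpp
// Excerpt-style sketch of the LWIP_TCP_RECV branch of _handle_async_event(),
// wrapped with a timing check (esp_timer_get_time() returns microseconds since boot).
if (e->event == LWIP_TCP_RECV) {
    int64_t t0 = esp_timer_get_time();
    AsyncClient::_s_recv(e->arg, e->recv.pcb, e->recv.pb, e->recv.err);
    int64_t dt_ms = (esp_timer_get_time() - t0) / 1000;
    if (dt_ms > 300) {  // threshold from the comment above, adjust as needed
        ets_printf("AsyncTCP: LWIP_TCP_RECV handling took %lld ms\n", (long long)dt_ms);
    }
}
```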
Thank you Andi, I did just that and the guess was correct. I noticed blocking times of up to seven (!) seconds for some of the initial requests in the AsyncClient::_s_recv() handler. I did not go into more detail on why the search for a nonexistent SPIFFS file could possibly cause a multi-second timeout, but at least the result is clear now: the event timer task was pushing SSE events for the whole massive timeout and the _async_queue was filling up. So this needs to be fixed.

Also nice to have would be a configure option for the queue length, I guess. I realize this is more than one issue in one thread, but if this is appreciated, I can post individual issues for these and maybe even provide a fix or two with a PR next week.
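As an illustration of the suggested configure option, a sketch against the stock _init_async_event_queue() in AsyncTCP.cpp; the macro name CONFIG_ASYNC_TCP_QUEUE_SIZE is made up here, the library currently hard-codes the length:

```cpp
// Hypothetical build-time override for the async event queue depth.
#ifndef CONFIG_ASYNC_TCP_QUEUE_SIZE
#define CONFIG_ASYNC_TCP_QUEUE_SIZE 32   // the stock default length
#endif

static inline bool _init_async_event_queue() {
    if (!_async_queue) {
        _async_queue = xQueueCreate(CONFIG_ASYNC_TCP_QUEUE_SIZE,
                                    sizeof(lwip_event_packet_t *));
        if (!_async_queue) {
            return false;
        }
    }
    return true;
}
```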
[STALE_SET] This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.
So although the original issue is the _async_queue filling up due to some SPIFFS or other time lag, it would be nice to know if there will be any official fix for this dead-lock issue.
[STALE_CLR] This issue has been removed from the stale queue. Please ensure activity to keep it open in the future.
How can I tell if #921 is the same issue as this one?
Hi Brandon, you could try whether the version from my fork fixes your issue. Regards,
@ul-gh your fork seems to be identical to master?
That was a fast reply. Pardon me, there is a "dev" branch which combines the patches for #837, #884 and #888.
@brandonros, I now had a look at the changes I made: I did /not/ commit a fix for the async_tcp task dead-lock issue, because there was a different underlying issue of the SPIFFS driver looking for nonexistent files and causing long delays. However, have a look at AsyncTCP.cpp around lines 105...120:
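(The code block originally quoted here is not preserved in this thread. The region referenced contains the queue-send helpers, which look roughly like the following; this is reproduced from memory of the stock AsyncTCP.cpp, so exact line numbers and details may differ between versions.)

```cpp
// Approximate reproduction of the queue-send helpers in AsyncTCP.cpp.
// The portMAX_DELAY is the crux: when _async_queue is full, these calls block
// the sending task forever, which is the dead-lock discussed in this issue.
static inline bool _send_async_event(lwip_event_packet_t **e) {
    return _async_queue && xQueueSend(_async_queue, e, portMAX_DELAY) == pdPASS;
}

static inline bool _prepend_async_event(lwip_event_packet_t **e) {
    return _async_queue && xQueueSendToFront(_async_queue, e, portMAX_DELAY) == pdPASS;
}
```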
You could add some debugging to this to see if this is your issue. You could also check the return value of the FreeRTOS xQueueSend() call and see if the queue overflows in your application. Regards, Ulrich
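A sketch of what such debugging could look like; uxQueueMessagesWaiting() is standard FreeRTOS, while the helper name and log text are illustrative:

```cpp
// Illustrative debugging aid: log the high-water mark of _async_queue so an
// impending overflow becomes visible before the sender blocks. Could be called
// e.g. at the top of _handle_async_event() in AsyncTCP.cpp.
static void _log_async_queue_level() {
    static UBaseType_t high_water = 0;
    UBaseType_t level = uxQueueMessagesWaiting(_async_queue);
    if (level > high_water) {
        high_water = level;
        ets_printf("AsyncTCP: _async_queue high-water mark: %u of 32 slots\n",
                   (unsigned)level);
    }
}
```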
It seems to happen whenever I try to transmit more than 40-50 bytes. I am not sure if the WiFi part is choking or the TCP part is fighting for contention with my “CAN/GPIO” section. I will try these fixes. Your help is beyond appreciated. I have tried async, not async, a scheduler with timeout, pinning tasks to cores. I am all ears for any possible solutions. I will link the problem code for a skim if you don’t mind: https://gist.github.com/brandonros/c4288c12beb171747258d6a1120b22bc
Hi. I've been battling stability issues around ESPAsyncWebServer and BLE (related to what h2zero posted above), with crashes usually related to errors in AsyncTCP. I have tried your suggestion and it seems to have made an immediate difference and may even have resolved the issue. I'll leave it running overnight and report back. Thank you.
Glad to hear this is helping. But I am afraid the call trace above does not show any unusual behavior. It does not say why the panic handler was called; this should have been in one of the adjacent lines of text output. If you open up a new issue thread with more information, we might be able to help.
My observation is that using
[STALE_SET] This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.
Still not fixed (possibly unfixable) in main branch. I am using the following combination for my projects involving ESP32 and web server:
Both are needed - one or the other is not enough. As I wrote AsyncTCPSock specifically to be a drop-in replacement for AsyncTCP, I am dogfooding it on my projects. |
[STALE_CLR] This issue has been removed from the stale queue. Please ensure activity to keep it open in the future.
Wrapper class for PZEM004T/PZEM004Tv30 libs, controlled with the USE_PZEMv3 definition. Note: the Async Server has lots of issues under esp32: me-no-dev/ESPAsyncWebServer#876 me-no-dev/ESPAsyncWebServer#900 espressif/arduino-esp32#1101 me-no-dev/ESPAsyncWebServer#324 me-no-dev/ESPAsyncWebServer#932 Signed-off-by: Emil Muratov <[email protected]>
[STALE_SET] This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.
Still not fixed.
[STALE_CLR] This issue has been removed from the stale queue. Please ensure activity to keep it open in the future.
Thank you @avillacis and @ul-gh for your work here. I've been chasing an issue where large events were not being sent, and your work has helped me immensely.
[STALE_SET] This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.
Still not fixed.
[STALE_CLR] This issue has been removed from the stale queue. Please ensure activity to keep it open in the future.
[STALE_SET] This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.
Still not fixed.
[STALE_CLR] This issue has been removed from the stale queue. Please ensure activity to keep it open in the future.
[STALE_SET] This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.
Keep open!
[STALE_CLR] This issue has been removed from the stale queue. Please ensure activity to keep it open in the future.
Also have a problem with this!
[STALE_SET] This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.
Still not fixed. Feeding the stalebot.
[STALE_CLR] This issue has been removed from the stale queue. Please ensure activity to keep it open in the future.
[STALE_SET] This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.
avillacis, thanks for the modified libs. I've been having a hard time tracking down a problem with an ESP-hosted websocket glitching to high latency at random. It's meant to spit out ~500 bytes at 100 ms intervals, but every now and then it would glitch out for a second or two. After lots of mucking around with Wireshark and writing diagnostic code, I was convinced it was something to do with TCP. Whatever you've changed in those modified libraries, it looks to have fixed things.
Hi,
when using an AsyncEventSource instance on the ESP32 platform, multiple issues arise due to multi-threaded access to the AsyncTCP event queue (_async_queue), the AsyncEventSource-internal _messageQueue, and locked usage of the LWIP callback API.
One of these issues was already addressed by Arjan Filius by adding mutex locking of access to the AsyncEventSource _messageQueue (see iafilius/ESPAsyncWebServer/commit/d924de1..)
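For illustration, the general shape of such mutex locking (a sketch only, not necessarily the exact patch from that commit); every access to the shared message queue is serialized with a FreeRTOS mutex:

```cpp
#include <list>
#include <freertos/FreeRTOS.h>
#include <freertos/semphr.h>

class AsyncEventSourceMessage;   // from ESPAsyncWebServer's AsyncEventSource.h

// Sketch only: serialize access to the SSE message queue that is shared between
// the async_tcp task and the application task calling send().
class LockedMessageQueue {
public:
    LockedMessageQueue() : _lock(xSemaphoreCreateMutex()) {}

    void push(AsyncEventSourceMessage *msg) {
        xSemaphoreTake(_lock, portMAX_DELAY);
        _queue.push_back(msg);
        xSemaphoreGive(_lock);
    }

    AsyncEventSourceMessage *pop() {
        AsyncEventSourceMessage *msg = nullptr;
        xSemaphoreTake(_lock, portMAX_DELAY);
        if (!_queue.empty()) {
            msg = _queue.front();
            _queue.pop_front();
        }
        xSemaphoreGive(_lock);
        return msg;
    }

private:
    SemaphoreHandle_t _lock;
    std::list<AsyncEventSourceMessage *> _queue;
};
```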
On my AJAX API project, I tried the above commit plus multiple other fixes, but after extensive tracing of heap memory issues etc. in AsyncTCP and ESPAsyncWebServer when using the SSE event source, I now ran into a deadlock issue which causes the async_tcp task watchdog timer to reboot the whole system.
If my observation is correct, the issue is that enabling the AsyncEventSource causes arbitrarily timed accesses to the AsyncTCP event queue from the LWIP tcp_ip task (by activating the periodic LWIP _tcp_poll callback and by time-interleaved data arriving via the network/IP layer), while at the same time the LWIP API is called by the async_tcp task (sending responses) and also by the user application task calling the AsyncEventSource send() methods (AsyncEventSourceClient::_queueMessage(), ultimately ending in an AsyncClient client->write() call, etc.).
==> I recorded an example call trace using a hardware JTAG debugger; please see the following attached PNG image:

For the application code triggering the issue, please excuse that I did not yet prepare a minimal code example; however, please let me know if this would be helpful.
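(For readers who want the general shape of the trigger without the full application, a minimal sketch follows; it is not the actual code from the project linked below, and the SSID, interval and task parameters are placeholders.)

```cpp
#include <WiFi.h>
#include <ESPAsyncWebServer.h>

// Minimal sketch of the access pattern described above: an SSE endpoint plus an
// application task pushing events periodically while the async_tcp task and the
// LWIP tcp/ip task are also active.
AsyncWebServer server(80);
AsyncEventSource events("/events");

static void sse_push_task(void *arg) {
    for (;;) {
        events.send("heartbeat", "message", millis());  // application-task access
        vTaskDelay(pdMS_TO_TICKS(100));
    }
}

void setup() {
    WiFi.begin("ssid", "password");                     // placeholder credentials
    while (WiFi.status() != WL_CONNECTED) { delay(100); }

    server.addHandler(&events);
    server.begin();

    xTaskCreatePinnedToCore(sse_push_task, "sse_push", 4096, nullptr, 1, nullptr, 1);
}

void loop() {}
```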
The relevant changes I made to ESPAsyncWebServer and AsyncTCP are on the "dev" branches under:
https://github.com/ul-gh/ESPAsyncWebServer/tree/dev and:
https://github.com/ul-gh/AsyncTCP/tree/dev
The whole application using the ESPAsyncWebServer is:
https://github.com/ul-gh/esp_ajax_if