This repository was archived by the owner on Jan 20, 2025. It is now read-only.

Dead-Lock of async_tcp task and tcp/ip task when _async_queue fills up completely #876

Open
ul-gh opened this issue Nov 5, 2020 · 39 comments

Comments

@ul-gh

ul-gh commented Nov 5, 2020

Hi,

when using an AsyncEventSource instance on the ESP32 platform, multiple issues arise from multi-threaded access to the AsyncTCP event queue (_async_queue) and to the AsyncEventSource-internal _messageQueue, and from the locking required around the LWIP callback API.

One of the issues was already addressed by Arjan Filius by adding mutex locking around access to the AsyncEventSource _messageQueue (see iafilius/ESPAsyncWebServer/commit/d924de1..)
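For illustration, the general shape of that fix is to guard every access to the shared message list with a mutex, so the user task (queueing messages) and the async_tcp task (acknowledging and sending them) cannot interleave. The following is only a minimal sketch with made-up class and member names; it approximates, but is not, the real AsyncEventSource code:

#include <list>
#include <memory>
#include <mutex>
#include <string>

// Hypothetical stand-in for AsyncEventSourceMessage
struct SseMessage {
    std::string payload;
};

class SseClientSketch {
    std::list<std::unique_ptr<SseMessage>> _messageQueue;
    std::mutex _queueMutex;  // protects _messageQueue against concurrent access

public:
    // Called from the user/application task (the AsyncEventSource send() path)
    void queueMessage(std::unique_ptr<SseMessage> msg) {
        std::lock_guard<std::mutex> lock(_queueMutex);
        _messageQueue.push_back(std::move(msg));
    }

    // Called from the async_tcp task once the previous message was acknowledged
    void onAck() {
        std::lock_guard<std::mutex> lock(_queueMutex);
        if (!_messageQueue.empty()) {
            _messageQueue.pop_front();
        }
    }
};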

In my AJAX API project, I tried the above commit plus multiple other fixes, but after extensive tracing of heap memory issues etc. in AsyncTCP and ESPAsyncWebServer when using the SSE event source, I have now run into a dead-lock issue which causes the async_tcp task watchdog timer to reboot the whole system.

As far as my observation is correct, the issue is that enabling the AsyncEventSource causes arbitrarily timed accesses to the AsyncTCP event queue from the LWIP tcp/ip task (via the periodic _tcp_poll callback and via data arriving time-interleaved from the network/IP layer), while at the same time the LWIP API is called by the async_tcp task (sending responses) and also by the user application task calling the AsyncEventSource send() methods (AsyncEventSourceClient::_queueMessage() ... ultimately ending in an AsyncClient "client->write()" call etc.).

==> I recorded an example call trace using a hardware JTAG debugger; please see the attached PNG image:
AsyncEventSource_multithreading_issue
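To make the interaction easier to picture, here is a minimal, self-contained FreeRTOS sketch of the dead-lock pattern described above, assuming an ESP32 Arduino environment. All task, queue, and function names are made up; this is not AsyncTCP code, only a stand-alone illustration of the same blocking pattern (task A stands in for the LWIP tcp/ip task, task B for async_tcp, and the synchronous "API call" mimics what tcpip_api_call() does):

#include <Arduino.h>
#include "freertos/FreeRTOS.h"
#include "freertos/task.h"
#include "freertos/queue.h"
#include "freertos/semphr.h"

static QueueHandle_t event_queue;   // stands in for _async_queue (deliberately tiny here)
static QueueHandle_t api_queue;     // stands in for the tcpip_api_call() mailbox
static SemaphoreHandle_t api_done;  // completion handshake of the "API call"

// Task A ("tcpip"): services pending API requests, then pushes the next event.
// Once event_queue is full, xQueueSend() with portMAX_DELAY blocks forever,
// so API requests are never serviced again.
static void taskA(void *) {
    for (;;) {
        int req;
        if (xQueueReceive(api_queue, &req, 0) == pdTRUE) {
            xSemaphoreGive(api_done);                 // acknowledge the API call
        }
        int ev = 0;
        xQueueSend(event_queue, &ev, portMAX_DELAY);  // <-- blocks when the queue is full
    }
}

// Task B ("async_tcp"): handles one event (slowly), then makes a synchronous
// API call to task A and waits for the acknowledgement. If task A is already
// stuck on the full event_queue, this wait never returns, task B stops
// draining event_queue, and both tasks are dead-locked until the task
// watchdog fires.
static void taskB(void *) {
    for (;;) {
        int ev;
        xQueueReceive(event_queue, &ev, portMAX_DELAY);
        vTaskDelay(pdMS_TO_TICKS(500));               // simulate a slow handler (e.g. SPIFFS access)
        int req = 0;
        xQueueSend(api_queue, &req, portMAX_DELAY);   // post the "API call"...
        xSemaphoreTake(api_done, portMAX_DELAY);      // ...and wait for task A to handle it
    }
}

void setup() {
    event_queue = xQueueCreate(4, sizeof(int));
    api_queue   = xQueueCreate(4, sizeof(int));
    api_done    = xSemaphoreCreateBinary();
    xTaskCreate(taskA, "tcpip", 4096, nullptr, 3, nullptr);
    xTaskCreate(taskB, "async_tcp", 4096, nullptr, 3, nullptr);
}

void loop() {}

With a queue of only four entries this sketch locks up within seconds; the real _async_queue is longer (32 entries by default, see below), so it takes a slow handler or sustained SSE traffic to reach the same state.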

For the application code triggering the issue, please excuse that I have not yet prepared a minimal code example; please let me know if that would be helpful.

The relevant changes I made to ESPAsyncWebServer and AsyncTCP are on the "dev" branches under:
https://github.com/ul-gh/ESPAsyncWebServer/tree/dev and:
https://github.com/ul-gh/AsyncTCP/tree/dev

The whole application using the ESPAsyncWebServer is:
https://github.com/ul-gh/esp_ajax_if

@ul-gh
Author

ul-gh commented Nov 12, 2020

Looking further into this, I realise the dead-lock is simply what happens when the _async_queue of the async_tcp event loop fills up to 100%, causing xQueueGenericSend() to block indefinitely.

Anyway, the _async_queue seems to fill up completely quite easily when periodic push messages are sent via SSE / the AsyncEventSource. I still have to look into why that happens, but I suspect that one of the default or user-defined AsyncCallbackWebHandlers blocks for a certain amount of time while SSE push messages are being acknowledged.

If this is correct, then the event queue filling up to the top should be handled gracefully (e.g. a sensible error message plus discarding further events).

Also, what helps in my case is increasing the _async_queue length, which by default is only 32 (this could be a configuration option).
I set it to 256, and now I see up to approx. 100 events queuing up from time to time, finally without the dead-lock issue.
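A compile-time option could look roughly like the sketch below. The surrounding function follows the queue-creation code in AsyncTCP.cpp as far as I recall it; only the macro is new, and its name CONFIG_ASYNC_TCP_QUEUE_SIZE is made up for this illustration:

#ifndef CONFIG_ASYNC_TCP_QUEUE_SIZE
#define CONFIG_ASYNC_TCP_QUEUE_SIZE 32   // stock value; I am running with 256
#endif

static inline bool _init_async_event_queue(){
    if(!_async_queue){
        _async_queue = xQueueCreate(CONFIG_ASYNC_TCP_QUEUE_SIZE, sizeof(lwip_event_packet_t *));
        if(!_async_queue){
            return false;
        }
    }
    return true;
}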

Please let me know if you need more input.

@ul-gh ul-gh changed the title AsyncEventSource unusable on ESP32 due to dead-locking of tcpip_api_call by async_tcp task while tcp/ip task waits for _async_queue to be processed Dead-Lock of async_tcp task and tcp/ip task when _async_queue fills up completely Nov 12, 2020
@BlueAndi

You could measure in _handle_async_event() how long an event takes to handle, especially in the case of LWIP_TCP_RECV. If there are peaks greater than 300 ms (as I see in your JavaScript), you could dig in there. Usually SPIFFS access takes quite long.
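One possible way to take that measurement is sketched below. The wrapper name, the 300 ms threshold, and the log call are made up; it assumes _handle_async_event() keeps its current signature and would be called from the task loop that drains _async_queue:

#include "esp_timer.h"
#include "esp32-hal-log.h"

// Wrapper around the existing handler; call this in place of the direct call
// to _handle_async_event() in the async_tcp service task.
static void _handle_async_event_timed(lwip_event_packet_t *e) {
    const int64_t t0 = esp_timer_get_time();       // microseconds since boot
    _handle_async_event(e);                        // existing handler (in current AsyncTCP it frees e itself)
    const int64_t dt_ms = (esp_timer_get_time() - t0) / 1000;
    if (dt_ms > 300) {                             // log only the slow peaks
        log_w("slow async event handler: %lld ms", (long long)dt_ms);
    }
}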

@ul-gh
Author

ul-gh commented Nov 13, 2020

Thank you Andi, I did just that and the guess was correct.

I noticed blocking times of up to seven (!) seconds for some of the initial requests in the AsyncClient::_s_recv() handler.
And yes, I could ultimately trace it down to the AsyncStaticWebHandler doing some SPIFFS access. What happened was that I had the wrong registration order for the AsyncWebHandlers: the static handler serving the root (/) URI was registered before the specialised API endpoint handlers. The massive timeouts originated below the .canHandle() method of the AsyncStaticWebHandler, which was called first in line for every API endpoint, looking for a local SPIFFS file with the same name, which of course did not exist.

I did not look into why the search for a nonexistent SPIFFS file can cause a multi-second timeout, but at least the result is clear now: the event timer task kept pushing SSE events for the whole duration of the timeout, and the _async_queue filled up.
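For reference, a sketch of the registration order that avoids this; the endpoint path and file names are only examples, not this project's actual API. The specific handlers are registered first and the catch-all static handler last, so AsyncStaticWebHandler::canHandle() no longer hits SPIFFS for every API request:

#include <ESPAsyncWebServer.h>
#include <SPIFFS.h>

AsyncWebServer server(80);
AsyncEventSource events("/events");

void setup_server() {
    // Specific API endpoints first...
    server.on("/cmd", HTTP_GET, [](AsyncWebServerRequest *request) {
        request->send(200, "text/plain", "OK");
    });
    // ...then the SSE event source...
    server.addHandler(&events);
    // ...and the catch-all static handler serving SPIFFS files last.
    server.serveStatic("/", SPIFFS, "/").setDefaultFile("index.html");
    server.begin();
}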

So this needs to be fixed:

  • If or when the async event queue fills up, for whatever reason, this should not result in an infinite dead-lock of the tasks (this is a clear denial-of-service if it can be triggered by a request for an invalid file name)

  • There should be a warning or error in the log output when a queue overflow happens

  • For the AsyncStaticWebHandler, if the massive delays are not caused by some bug or misconfiguration (i.e. SPIFFS just /is/ that slow), then it would help a lot if this were documented.

Also, a configuration option for the queue length would be nice to have, I guess.

I realize this is more than one issue in a single thread, but if that is appreciated, I can open individual issues for these and maybe even provide a fix or two as a PR next week.

  • Anyway, it would be nice to hear from others whether these issues can be confirmed, or what else should be considered.

@ul-gh
Author

ul-gh commented Nov 15, 2020

I now need to add that a lot of what is reported in issue #825 looks very much like the same issue.
See e.g. #825 (comment)

So fixing this will likely also fix #825.

@stale

stale bot commented Jan 15, 2021

[STALE_SET] This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Jan 15, 2021
@ul-gh
Author

ul-gh commented Jan 17, 2021

So although the original issue is the _async_queue filling up due to SPIFFS or some other time lag, it would be nice to know if there will be any official fix for this dead-lock issue.

@stale

stale bot commented Jan 17, 2021

[STALE_CLR] This issue has been removed from the stale queue. Please ensure activity to keep it open in the future.

@stale stale bot removed the stale label Jan 17, 2021
@brandonros

How can I tell if #921 is this?

@ul-gh
Author

ul-gh commented Jan 17, 2021

Hi Brandon,

you could try whether the version from my fork fixes your issue:
https://github.com/ul-gh/ESPAsyncWebServer/

Regards,
Ulrich

@brandonros

@ul-gh your fork seems to be identical to master?

@ul-gh
Author

ul-gh commented Jan 17, 2021

That was a fast reply...

Pardon me, there is a "dev" branch which combines the patches for #837, #884 and #888.
https://github.com/ul-gh/ESPAsyncWebServer/tree/dev

@ul-gh
Author

ul-gh commented Jan 17, 2021

@brandonros, I have now had a look at the changes I made: I did /not/ commit a fix for the async_tcp task dead-lock issues, because in my case the underlying issue was the SPIFFS driver causing long delays when looking for nonexistent files.

However, if you have a look at the file AsyncTCP.cpp, lines 105...120:

static inline bool _send_async_event(lwip_event_packet_t ** e_buf){
    return _async_queue && xQueueSend(_async_queue, e_buf, portMAX_DELAY) == pdPASS;
}

You could add some debugging here to see whether this is your issue:
If you change portMAX_DELAY to zero or some small value, then _send_async_event() will no longer block indefinitely.

You could also check the return value of the FreeRTOS xQueueSend() call and see if the queue overflows in your application.
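Put together, a debug variant could look like the sketch below; the timeout value and the log text are arbitrary, and note that the callers of _send_async_event() then have to deal with the dropped packet (e.g. free it), otherwise it leaks:

static inline bool _send_async_event(lwip_event_packet_t ** e_buf){
    if(!_async_queue){
        return false;
    }
    // Wait at most ~100 ms instead of blocking with portMAX_DELAY.
    if(xQueueSend(_async_queue, e_buf, pdMS_TO_TICKS(100)) != pdPASS){
        log_e("_async_queue is full, dropping event");
        return false;
    }
    return true;
}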

Regards, Ulrich

@brandonros

brandonros commented Jan 17, 2021

It seems to happen whenever I try to transmit more than 40-50 bytes.

I am not sure if the WiFi part is choking or the TCP part is fighting for contention with my “CAN/GPIO” section.

I will try these fixes. Your help is beyond appreciated. I have tried async, not async, a scheduler with timeout, and pinning tasks to cores. I am all ears for any possible solutions. I will link the problem code for a skim if you don’t mind.

https://gist.github.com/brandonros/c4288c12beb171747258d6a1120b22bc

@justoke

justoke commented Jan 18, 2021

If you change portMAX_DELAY to zero or some small value, then _send_async_event() will no longer block indefinitely.

Hi. I've been battling stability issues around ESPAsyncWebServer and BLE (related to what h2zero posted above), with crashes usually related to errors in AsyncTCP. I have tried your suggestion and it seems to have made an immediate difference and may even have resolved the issue. I'll leave it running overnight and report back. Thank you.

@justoke

justoke commented Jan 19, 2021

Well, the performance is much improved. Requests and responses are near instant. However, it did not resolve the seemingly random crashes. This is a common error trace:

(screenshot of the error back-trace)

@ul-gh
Author

ul-gh commented Jan 19, 2021

Glad to hear this is helping.

But I am afraid the call trace above does not show any unusual behavior.

It does not show the reason why the panic handler was called; that should have been in one of the adjacent lines of the text output.
Also, even though this might be the task which ultimately crashed, it is not necessarily the root cause of the crash, as any other task could have caused memory corruption etc.

If you open a new issue thread with more information, we might be able to help.

@CelliesProjects

My observation is that using String in AsyncTCP or AsyncWebServer will at some point result in a crash.

@stale

stale bot commented Mar 26, 2021

[STALE_SET] This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Mar 26, 2021
@avillacis
Contributor

Still not fixed (possibly unfixable) in main branch.

I am using the following combination for my projects involving ESP32 and web server:

Both are needed - one or the other is not enough. As I wrote AsyncTCPSock specifically to be a drop-in replacement for AsyncTCP, I am dogfooding it on my projects.

@stale

stale bot commented Mar 26, 2021

[STALE_CLR] This issue has been removed from the stale queue. Please ensure activity to keep it open in the future.

@stale stale bot removed the stale label Mar 26, 2021
vortigont added a commit to vortigont/espem that referenced this issue Apr 24, 2021
Wrapper class for PZEM004T/PZEM004Tv30 libs, controlled with USE_PZEMv3 definition

Note: Async Server has lots of issues under esp32

me-no-dev/ESPAsyncWebServer#876
me-no-dev/ESPAsyncWebServer#900
espressif/arduino-esp32#1101
me-no-dev/ESPAsyncWebServer#324
me-no-dev/ESPAsyncWebServer#932

Signed-off-by: Emil Muratov <[email protected]>
@stale

stale bot commented Jun 2, 2021

[STALE_SET] This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

@avillacis
Contributor

Still not fixed.

@stale

stale bot commented Jun 2, 2021

[STALE_CLR] This issue has been removed from the stale queue. Please ensure activity to keep it open in the future.

@stale stale bot removed the stale label Jun 2, 2021
@rdnn

rdnn commented Jun 8, 2021

Thank you @avillacis and @ul-gh for your work here. I've been chasing an issue where large events were not being sent, and your work has helped me immensely.

@stale

stale bot commented Aug 13, 2021

[STALE_SET] This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Aug 13, 2021
@avillacis
Contributor

Still not fixed.

@stale

stale bot commented Aug 14, 2021

[STALE_CLR] This issue has been removed from the stale queue. Please ensure activity to keep it open in the future.

@stale

stale bot commented Mar 30, 2022

[STALE_SET] This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Mar 30, 2022
@avillacis
Contributor

Still not fixed.

@stale

stale bot commented Mar 30, 2022

[STALE_CLR] This issue has been removed from the stale queue. Please ensure activity to keep it open in the future.

@stale stale bot removed the stale label Mar 30, 2022
@stale

stale bot commented Jun 12, 2022

[STALE_SET] This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Jun 12, 2022
@CelliesProjects

Keep open!

@stale

stale bot commented Jun 17, 2022

[STALE_CLR] This issue has been removed from the stale queue. Please ensure activity to keep it open in the future.

@stale stale bot removed the stale label Jun 17, 2022
@Adesin-fr

I also have a problem with this!
I have a simple sketch with multiple web entry points.
If I hit one endpoint with a curl bash loop, it is OK, but when I hit two entry points (which both just reply "OK"), it crashes with a watchdog not being reset on async_tcp...

@stale

stale bot commented Nov 2, 2022

[STALE_SET] This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Nov 2, 2022
@avillacis
Contributor

Still not fixed. Feeding the stalebot.

@stale

stale bot commented Nov 8, 2022

[STALE_CLR] This issue has been removed from the stale queue. Please ensure activity to keep it open in the future.

@stale stale bot removed the stale label Nov 8, 2022
@stale

stale bot commented May 22, 2023

[STALE_SET] This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

@Bevanj86

Bevanj86 commented Dec 13, 2023

@avillacis, thanks for the modified libs. I've been having a hard time tracking down a problem with an ESP-hosted websocket glitching to high latency at random. It's meant to spit out ~500 bytes at 100 ms intervals, but every now and then it would glitch out for a second or two. Lots of mucking around with Wireshark and writing diagnostic code; I was convinced it was something to do with TCP.

Whatever you've changed in those modified libraries, it looks to have fixed things.
