This repository was archived by the owner on Jan 20, 2025. It is now read-only.

Dead-Lock of async_tcp task and tcp/ip task when _async_queue fills up completely #876

Open
ul-gh opened this issue Nov 5, 2020 · 39 comments

Comments

@ul-gh

ul-gh commented Nov 5, 2020

Hi,

when using an AsyncEventSource instance on the ESP32 platform, multiple issues arise from multi-threaded access to the AsyncTCP event queue (_async_queue) and to the AsyncEventSource-internal _messageQueue, and from the locking required around the LWIP callback API.

One of the issues was already addressed by Arjan Filius by adding mutex locking around access to the AsyncEventSource _messageQueue (see iafilius/ESPAsyncWebServer/commit/d924de1..)
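For illustration, the general shape of that fix is to guard every access to the shared message list with a mutex, so the user task (queueing messages) and the async_tcp task (acknowledging and sending them) cannot interleave. The following is only a minimal sketch with made-up class and member names; it approximates, but is not, the real AsyncEventSource code:

#include <list>
#include <memory>
#include <mutex>
#include <string>

// Hypothetical stand-in for AsyncEventSourceMessage
struct SseMessage {
    std::string payload;
};

class SseClientSketch {
    std::list<std::unique_ptr<SseMessage>> _messageQueue;
    std::mutex _queueMutex;  // protects _messageQueue against concurrent access

public:
    // Called from the user/application task (the AsyncEventSource send() path)
    void queueMessage(std::unique_ptr<SseMessage> msg) {
        std::lock_guard<std::mutex> lock(_queueMutex);
        _messageQueue.push_back(std::move(msg));
    }

    // Called from the async_tcp task once the previous message was acknowledged
    void onAck() {
        std::lock_guard<std::mutex> lock(_queueMutex);
        if (!_messageQueue.empty()) {
            _messageQueue.pop_front();
        }
    }
};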

In my AJAX API project, I tried the above commit plus multiple other fixes, but after extensive tracing of heap memory issues etc. in AsyncTCP and ESPAsyncWebServer when using the SSE event source, I have now run into a dead-lock issue which causes the async_tcp task watchdog timer to reboot the whole system.

As far as my observation is correct, the issue is that enabling the AsyncEventSource causes arbitrarily timed accesses to the AsyncTCP event queue from the LWIP tcp/ip task (via the periodic _tcp_poll callback and via data arriving time-interleaved from the network/IP layer), while at the same time the LWIP API is called by the async_tcp task (sending responses) and also by the user application task calling the AsyncEventSource send() methods (AsyncEventSourceClient::_queueMessage() ... ultimately ending in an AsyncClient "client->write()" call etc.).

==> I recorded an example call trace using a hardware JTAG debugger; please see the attached PNG image:
AsyncEventSource_multithreading_issue
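To make the interaction easier to picture, here is a minimal, self-contained FreeRTOS sketch of the dead-lock pattern described above, assuming an ESP32 Arduino environment. All task, queue, and function names are made up; this is not AsyncTCP code, only a stand-alone illustration of the same blocking pattern (task A stands in for the LWIP tcp/ip task, task B for async_tcp, and the synchronous "API call" mimics what tcpip_api_call() does):

#include <Arduino.h>
#include "freertos/FreeRTOS.h"
#include "freertos/task.h"
#include "freertos/queue.h"
#include "freertos/semphr.h"

static QueueHandle_t event_queue;   // stands in for _async_queue (deliberately tiny here)
static QueueHandle_t api_queue;     // stands in for the tcpip_api_call() mailbox
static SemaphoreHandle_t api_done;  // completion handshake of the "API call"

// Task A ("tcpip"): services pending API requests, then pushes the next event.
// Once event_queue is full, xQueueSend() with portMAX_DELAY blocks forever,
// so API requests are never serviced again.
static void taskA(void *) {
    for (;;) {
        int req;
        if (xQueueReceive(api_queue, &req, 0) == pdTRUE) {
            xSemaphoreGive(api_done);                 // acknowledge the API call
        }
        int ev = 0;
        xQueueSend(event_queue, &ev, portMAX_DELAY);  // <-- blocks when the queue is full
    }
}

// Task B ("async_tcp"): handles one event (slowly), then makes a synchronous
// API call to task A and waits for the acknowledgement. If task A is already
// stuck on the full event_queue, this wait never returns, task B stops
// draining event_queue, and both tasks are dead-locked until the task
// watchdog fires.
static void taskB(void *) {
    for (;;) {
        int ev;
        xQueueReceive(event_queue, &ev, portMAX_DELAY);
        vTaskDelay(pdMS_TO_TICKS(500));               // simulate a slow handler (e.g. SPIFFS access)
        int req = 0;
        xQueueSend(api_queue, &req, portMAX_DELAY);   // post the "API call"...
        xSemaphoreTake(api_done, portMAX_DELAY);      // ...and wait for task A to handle it
    }
}

void setup() {
    event_queue = xQueueCreate(4, sizeof(int));
    api_queue   = xQueueCreate(4, sizeof(int));
    api_done    = xSemaphoreCreateBinary();
    xTaskCreate(taskA, "tcpip", 4096, nullptr, 3, nullptr);
    xTaskCreate(taskB, "async_tcp", 4096, nullptr, 3, nullptr);
}

void loop() {}

With a queue of only four entries this sketch locks up within seconds; the real _async_queue is longer (32 entries by default, see below), so it takes a slow handler or sustained SSE traffic to reach the same state.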

For the application code triggering the issue, please excuse that I have not yet prepared a minimal code example; please let me know if that would be helpful.

The relevant changes I made to ESPAsyncWebServer and AsyncTCP are on the "dev" branches under:
https://github.com/ul-gh/ESPAsyncWebServer/tree/dev and:
https://github.com/ul-gh/AsyncTCP/tree/dev

The whole application using the ESPAsyncWebServer is:
https://github.com/ul-gh/esp_ajax_if

@ul-gh
Author

ul-gh commented Nov 12, 2020

Looking further into this, I realise the dead-lock is simply what happens when the _async_queue of the async_tcp event loop fills up to 100%, causing xQueueGenericSend() to block indefinitely.

Anyway, the _async_queue seems to fill up completely quite easily when periodic push messages are sent via SSE / the AsyncEventSource. I still have to look into why that happens, but I suspect that one of the default or user-defined AsyncCallbackWebHandlers blocks for a certain amount of time while SSE push messages are being acknowledged.

If this is correct, then the event queue filling up to the top should be handled gracefully (e.g. a sensible error message plus discarding further events).

Also, what helps in my case is increasing the _async_queue length, which by default is only 32 (this could be a configuration option).
I set it to 256, and now I see up to approx. 100 events queuing up from time to time, finally without the dead-lock issue.
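A compile-time option could look roughly like the sketch below. The surrounding function follows the queue-creation code in AsyncTCP.cpp as far as I recall it; only the macro is new, and its name CONFIG_ASYNC_TCP_QUEUE_SIZE is made up for this illustration:

#ifndef CONFIG_ASYNC_TCP_QUEUE_SIZE
#define CONFIG_ASYNC_TCP_QUEUE_SIZE 32   // stock value; I am running with 256
#endif

static inline bool _init_async_event_queue(){
    if(!_async_queue){
        _async_queue = xQueueCreate(CONFIG_ASYNC_TCP_QUEUE_SIZE, sizeof(lwip_event_packet_t *));
        if(!_async_queue){
            return false;
        }
    }
    return true;
}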

Please let me know if you need more input.

@ul-gh ul-gh changed the title AsyncEventSource unusable on ESP32 due to dead-locking of tcpip_api_call by async_tcp task while tcp/ip task waits for _async_queue to be processed Dead-Lock of async_tcp task and tcp/ip task when _async_queue fills up completely Nov 12, 2020
@BlueAndi

You could measure in _handle_async_event() how long an event takes to handle, especially in the case of LWIP_TCP_RECV. If there are peaks greater than 300 ms (as I see in your JavaScript), you could dig in there. Usually SPIFFS access takes quite long.
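One possible way to take that measurement is sketched below. The wrapper name, the 300 ms threshold, and the log call are made up; it assumes _handle_async_event() keeps its current signature and would be called from the task loop that drains _async_queue:

#include "esp_timer.h"
#include "esp32-hal-log.h"

// Wrapper around the existing handler; call this in place of the direct call
// to _handle_async_event() in the async_tcp service task.
static void _handle_async_event_timed(lwip_event_packet_t *e) {
    const int64_t t0 = esp_timer_get_time();       // microseconds since boot
    _handle_async_event(e);                        // existing handler (in current AsyncTCP it frees e itself)
    const int64_t dt_ms = (esp_timer_get_time() - t0) / 1000;
    if (dt_ms > 300) {                             // log only the slow peaks
        log_w("slow async event handler: %lld ms", (long long)dt_ms);
    }
}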

@ul-gh
Author

ul-gh commented Nov 13, 2020

Thank you Andi, I did just that and the guess was correct.

I noticed blocking times of up to seven (!) seconds for some of the initial requests in the AsyncClient::_s_recv() handler.
And yes, I could ultimately trace it down to the AsyncStaticWebHandler doing some SPIFFS access. What happened was that I had the wrong registration order for the AsyncWebHandlers: the static handler serving the root (/) URI was registered before the specialised API endpoint handlers. The massive timeouts originated below the .canHandle() method of the AsyncStaticWebHandler, which was called first in line for every API endpoint, looking for a local SPIFFS file with the same name, which of course did not exist.

I did not look into why the search for a nonexistent SPIFFS file can cause a multi-second timeout, but at least the result is clear now: the event timer task kept pushing SSE events for the whole duration of the timeout, and the _async_queue filled up.
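For reference, a sketch of the registration order that avoids this; the endpoint path and file names are only examples, not this project's actual API. The specific handlers are registered first and the catch-all static handler last, so AsyncStaticWebHandler::canHandle() no longer hits SPIFFS for every API request:

#include <ESPAsyncWebServer.h>
#include <SPIFFS.h>

AsyncWebServer server(80);
AsyncEventSource events("/events");

void setup_server() {
    // Specific API endpoints first...
    server.on("/cmd", HTTP_GET, [](AsyncWebServerRequest *request) {
        request->send(200, "text/plain", "OK");
    });
    // ...then the SSE event source...
    server.addHandler(&events);
    // ...and the catch-all static handler serving SPIFFS files last.
    server.serveStatic("/", SPIFFS, "/").setDefaultFile("index.html");
    server.begin();
}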

So this needs to be fixed:

  • If or when the async event queue fills up, for whatever reason, this should not result in an infinite dead-lock of the tasks (this is a clear denial-of-service if it can be triggered by a request for an invalid file name)

  • There should be a warning or error in the log output when a queue overflow happens

  • For the AsyncStaticWebHandler, if the massive delays are not caused by some bug or misconfiguration (i.e. SPIFFS just /is/ that slow), then it would help a lot if this were documented.

Also, a configuration option for the queue length would be nice to have, I guess.

I realize this is more than one issue in a single thread, but if that is appreciated, I can open individual issues for these and maybe even provide a fix or two as a PR next week.

  • Anyway, it would be nice to hear from others whether these issues can be confirmed, or what else should be considered.

@ul-gh
Author

ul-gh commented Nov 15, 2020

I now need to add that a lot of what is reported in issue #825 looks very much like the same issue.
See e.g. #825 (comment)

So fixing this will likely also fix #825.

@stale

stale bot commented Jan 15, 2021

[STALE_SET] This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Jan 15, 2021
@ul-gh
Author

ul-gh commented Jan 17, 2021

So although the original issue is the _async_queue filling up due to SPIFFS or some other time lag, it would be nice to know if there will be any official fix for this dead-lock issue.

@stale

stale bot commented Jan 17, 2021

[STALE_CLR] This issue has been removed from the stale queue. Please ensure activity to keep it open in the future.

@stale stale bot removed the stale label Jan 17, 2021
@brandonros

How can I tell if #921 is this?

@ul-gh
Author

ul-gh commented Jan 17, 2021

Hi Brandon,

you could try whether the version from my fork fixes your issue:
https://github.com/ul-gh/ESPAsyncWebServer/

Regards,
Ulrich

@brandonros

@ul-gh your fork seems to be identical to master?

@ul-gh
Author

ul-gh commented Jan 17, 2021

That was a fast reply...

Pardon me, there is a "dev" branch which combines the patches for #837, #884 and #888.
https://github.com/ul-gh/ESPAsyncWebServer/tree/dev

@ul-gh
Author

ul-gh commented Jan 17, 2021

@brandonros, I have now had a look at the changes I made: I did /not/ commit a fix for the async_tcp task dead-lock issues, because in my case the underlying issue was the SPIFFS driver causing long delays when looking for nonexistent files.

However, if you have a look at the file AsyncTCP.cpp, lines 105...120:

static inline bool _send_async_event(lwip_event_packet_t ** e_buf){
    return _async_queue && xQueueSend(_async_queue, e_buf, portMAX_DELAY) == pdPASS;
}

You could add some debugging here to see whether this is your issue:
If you change portMAX_DELAY to zero or some small value, then _send_async_event() will no longer block indefinitely.

You could also check the return value of the FreeRTOS xQueueSend() call and see if the queue overflows in your application.
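Put together, a debug variant could look like the sketch below; the timeout value and the log text are arbitrary, and note that the callers of _send_async_event() then have to deal with the dropped packet (e.g. free it), otherwise it leaks:

static inline bool _send_async_event(lwip_event_packet_t ** e_buf){
    if(!_async_queue){
        return false;
    }
    // Wait at most ~100 ms instead of blocking with portMAX_DELAY.
    if(xQueueSend(_async_queue, e_buf, pdMS_TO_TICKS(100)) != pdPASS){
        log_e("_async_queue is full, dropping event");
        return false;
    }
    return true;
}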

Regards, Ulrich

@brandonros

brandonros commented Jan 17, 2021

It seems to happen whenever I try to transmit more than 40-50 bytes.

I am not sure if the WiFi part is choking or the TCP part is fighting for contention with my “CAN/GPIO” section.

I will try these fixes. Your help is beyond appreciated. I have tried async, not async, a scheduler with timeout, and pinning tasks to cores. I am all ears for any possible solutions. I will link the problem code for a skim if you don’t mind.

https://gist.github.com/brandonros/c4288c12beb171747258d6a1120b22bc

@justoke

justoke commented Jan 18, 2021

If you change portMAX_DELAY to zero or some small value, then _send_async_event() will no longer block indefinitely.

Hi. I've been battling stability issues around ESPAsyncWebServer and BLE (related to what h2zero posted above), with crashes usually related to errors in AsyncTCP. I have tried your suggestion and it seems to have made an immediate difference and may even have resolved the issue. I'll leave it running overnight and report back. Thank you.

@justoke

justoke commented Jan 19, 2021

Well, the performance is much improved. Requests and responses are near instant. However, it did not resolve the seemingly random crashes. This is a common error trace:

(screenshot of the error back-trace)

@ul-gh
Author

ul-gh commented Jan 19, 2021

Glad to hear this is helping.

But I am afraid the call trace above does not show any unusual behavior.

It does not show the reason why the panic handler was called; that should have been in one of the adjacent lines of the text output.
Also, even though this might be the task which ultimately crashed, it is not necessarily the root cause of the crash, as any other task could have caused memory corruption etc.

If you open a new issue thread with more information, we might be able to help.

@CelliesProjects

My observation is that using String in AsyncTCP or AsyncWebServer will at some point result in a crash.

@stale

stale bot commented Mar 26, 2021

[STALE_SET] This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Mar 26, 2021
@avillacis
Contributor

Still not fixed (possibly unfixable) in main branch.

I am using the following combination for my projects involving ESP32 and web server:

Both are needed - one or the other is not enough. As I wrote AsyncTCPSock specifically to be a drop-in replacement for AsyncTCP, I am dogfooding it on my projects.

@stale

stale bot commented Mar 26, 2021

[STALE_CLR] This issue has been removed from the stale queue. Please ensure activity to keep it open in the future.

@stale stale bot removed the stale label Mar 26, 2021
vortigont added a commit to vortigont/espem that referenced this issue Apr 24, 2021
Wrapper class for PZEM004T/PZEM004Tv30 libs, controlled with USE_PZEMv3 definition

Note: Async Server has lots of issues under esp32

me-no-dev/ESPAsyncWebServer#876
me-no-dev/ESPAsyncWebServer#900
espressif/arduino-esp32#1101
me-no-dev/ESPAsyncWebServer#324
me-no-dev/ESPAsyncWebServer#932

Signed-off-by: Emil Muratov <[email protected]>
@stale

stale bot commented Jun 2, 2021

[STALE_SET] This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

@avillacis
Contributor

Still not fixed.

@stale

stale bot commented Jun 2, 2021

[STALE_CLR] This issue has been removed from the stale queue. Please ensure activity to keep it open in the future.

@stale stale bot removed the stale label Jun 2, 2021
@rdnn

rdnn commented Jun 8, 2021

Thank you @avillacis and @ul-gh for your work here. I've been chasing an issue where large events were not being sent, and your work has helped me immensely.

@stale

stale bot commented Aug 13, 2021

[STALE_SET] This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Aug 13, 2021
@avillacis
Contributor

Still not fixed.

@stale

stale bot commented Aug 14, 2021

[STALE_CLR] This issue has been removed from the stale queue. Please ensure activity to keep it open in the future.

@stale

stale bot commented Mar 30, 2022

[STALE_SET] This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Mar 30, 2022
@avillacis
Contributor

Still not fixed.

@stale

stale bot commented Mar 30, 2022

[STALE_CLR] This issue has been removed from the stale queue. Please ensure activity to keep it open in the future.

@stale stale bot removed the stale label Mar 30, 2022
@stale

stale bot commented Jun 12, 2022

[STALE_SET] This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Jun 12, 2022
@CelliesProjects

Keep open!

@stale

stale bot commented Jun 17, 2022

[STALE_CLR] This issue has been removed from the stale queue. Please ensure activity to keep it open in the future.

@stale stale bot removed the stale label Jun 17, 2022
@Adesin-fr

I also have a problem with this!
I have a simple sketch with multiple web entry points.
If I hit one endpoint with a curl bash loop, it is OK, but when I hit two entry points (which both just reply "OK"), it crashes with a watchdog not being reset on async_tcp...

@stale

stale bot commented Nov 2, 2022

[STALE_SET] This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Nov 2, 2022
@avillacis
Contributor

Still not fixed. Feeding the stalebot.

@stale

stale bot commented Nov 8, 2022

[STALE_CLR] This issue has been removed from the stale queue. Please ensure activity to keep it open in the future.

@stale stale bot removed the stale label Nov 8, 2022
@stale

stale bot commented May 22, 2023

[STALE_SET] This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

@Bevanj86

Bevanj86 commented Dec 13, 2023

@avillacis, thanks for the modified libs. I've been having a hard time tracking down a problem with an ESP-hosted websocket glitching to high latency at random. It's meant to spit out ~500 bytes at 100 ms intervals, but every now and then it would glitch out for a second or two. Lots of mucking around with Wireshark and writing diagnostic code; I was convinced it was something to do with TCP.

Whatever you've changed in those modified libraries, it looks to have fixed things.
