Watchdog timer triggered when serving large files from SPIFFS #825
Hi. If you are the same person as @zekageri (same build-machine path /home/runner/work/esp32-arduino-lib-builder in the posted logs...), you already know all this. If not, I apologize for my prejudice :) and hope this helps. |
I kinda solved that problem with the following. All my JS files are combined into one file. With 7-Zip on Windows, I compressed that file to its absolute minimum size. I switched from SPIFFS to LITTLEFS and serve it as described below. I will post some code soon. |
So with 7-Zip on Windows, I'm zipping my web files in such a way that I delete the .gz extension from the end of the file name. The reason is that iPhones can't recognize or can't decode the files if there is a .gz extension (stupid Apple...). After uploading them into LITTLEFS, I'm serving, for example, the main page like this:
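A minimal sketch of this pattern, assuming the pre-gzipped main page is stored as /index_html in LITTLEFS (the path and MIME type here are placeholders, not necessarily the original code):

```cpp
// Serve a pre-compressed file and tell the browser explicitly that the body
// is gzip-encoded, since the file name no longer carries a .gz extension.
server.on("/", HTTP_GET, [](AsyncWebServerRequest *request) {
    AsyncWebServerResponse *response =
        request->beginResponse(LITTLEFS, "/index_html", "text/html");
    response->addHeader("Content-Encoding", "gzip");
    request->send(response);
});
```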
|
Better use serveStatic. |
I don't see the difference between serveStatic and server.on. Where is the benefit? |
@zekageri Add a file /js/test.txt with "hello" text inside. Then try to access it from the browser at http:// .... /js/test.txt and let me know if there is a difference between the two. I could be wrong, too. |
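For reference, a sketch of the serveStatic variant (the directory names here are just an example):

```cpp
// serveStatic maps a whole URL prefix onto a filesystem directory, so any
// file under /js/ (including /js/test.txt) is served without writing a
// per-file server.on() handler; it also looks for a ".gz" sibling of the
// requested file and serves that with Content-Encoding: gzip.
server.serveStatic("/js/", LITTLEFS, "/js/");
```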
Thanks so much for your thoughts.
I am not.
With any problem that is aggravated by concurrency, as this one is, I feel I need to identify the root cause before concluding that reducing (not eliminating) concurrency is a real solution. It could just be that by consolidating files I reduce concurrency from the Chrome-defined limit of 6 to, say, 3, and the problem doesn't occur during my limited testing - but then occurs some time later. If instead I identify the root cause as, say, a bug in dealing with the limit of 5 concurrently open SPIFFS files, then I know that so long as I keep concurrency at or below 5 I have a real workaround.
Thanks. I was surprised to read online that the default is 5. That's low! I'll look into the implications of this!
I intend to do this, as I understand SPIFFS to be deprecated. My understanding though is that while work-arounds exist, a formal solution for building a LittleFS image under PlatformIO is still a work-in-progress? Maybe I misunderstood, or am out-of-date?
I did this, and have LOTS of free heap throughout. I tinkered with a few things while waiting for the JTAG hardware to arrive from Amazon, and may have stumbled upon a solution. For example I:
With the original queue size limit of 32 the app consistently hangs within 10 page loads (and generally within 3). With the larger queue size limit I have reloaded my most complex pages maybe 100 times with no issue. I'm hoping to get time later today to instrument the code around the xQueue* calls to more fully understand what is happening, and where some useful debug messages might have helped. I also have the JTAG hardware arriving, and I'm hoping that will allow me to set a breakpoint at _abort(), then backtrace the async_tcp tasks to better understand the nature of the hang. I'll keep you posted as my fumblings progress. |
That's interesting. I think Me_no_Dev has already worked on that issue, but found that the root cause is not the async library but maybe the TCP/IP layer at the IDF level? That queue size problem looks interesting to me though. I wonder if Me_no_Dev has already seen this or not. |
In the following log, my QMW macro is inserted before the xQueue* calls, and displays the number of messages already in the queue (as returned by uxQueueMessagesWaiting(_async_queue)) prior to the queue operation. It was running with the original 32-entry queue:
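A sketch of that kind of instrumentation macro (the QMW name and output format are illustrative, not necessarily the original code; the macro is meant to be pasted into AsyncTCP.cpp, where _async_queue is in scope):

```cpp
#include <stdio.h>
#include "freertos/FreeRTOS.h"
#include "freertos/queue.h"

// Place QMW("send") / QMW("recv") immediately before the xQueueSend /
// xQueueReceive calls of interest to log the current queue occupancy.
#define QMW(tag)                                                         \
    do {                                                                 \
        if (_async_queue) {                                              \
            printf("QMW %s: %u waiting\n", (tag),                        \
                   (unsigned)uxQueueMessagesWaiting(_async_queue));      \
        }                                                                \
    } while (0)
```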
So I'm pretty convinced the watchdog timer expiry occurs when the _async_queue fills. Why?
Clearly async_tcp told the xQueue code to sleep if the queue is full - while the watchdog timer is ticking! I wonder why we're putting things into the queue so much faster than we're taking them out? Is the processing/removal of the entries in the queue stalling for some reason? That'll take more work... BTW, when running with a queue of 256, the largest number of items I see in the queue is in the mid 40s. I'd be very interested in learning from any investigation me_no_dev has already done. Any pointers? |
That is the reason for the queue, isn't it? |
@choness Good work! |
I am also affected by the watchdog trigger bug when serving files. I had this problem when the browser tried to request multiple resource files concurrently, and I eventually solved it by organizing my index.htm file to contain some JavaScript that serialized the loading of all the other resource files. However, I don't think the open-file limit on the SPIFFS object is the root cause of the bug. In fact, the watchdog trigger bug could even happen with libraries other than ESPAsyncWebServer that also use the underlying AsyncTCP. This is my understanding of why: When using AsyncTCP (and any other library that requires it, such as ESPAsyncWebServer), there are two relevant threads:
The async_tcp thread creates and maintains a queue of pointers to message structs (queue type lwip_event_packet_t*) that represent lwip events the async_tcp thread receives from the tiT thread. The tiT thread is the one that interfaces with the network hardware and implements the TCP/IP stack. Whenever that thread determines there is something happening on the application socket, it calls a callback previously registered on the tcp_pcb representing the socket. For AsyncTCP, each callback (still running on the tiT thread) eventually posts a message into the 32-entry queue. Then, the async_tcp thread removes each message from the queue, starts the watchdog, runs the corresponding application-level action, and after returning from the application handler, stops the watchdog. If the application decides it needs to write to the socket, it calls the write() method on the AsyncClient object. This eventually executes a variant of tcpip_api_call(), which also happens to involve an inter-thread message into the tiT thread. This implies that the tiT thread must (eventually) be runnable in order to handle a tcpip_api_call(). From examination of the code, many of the other application-initiated operations on the socket also require one or more calls into tcpip_api_call().

The watchdog trigger scenario is then a deadlock between the tiT thread and the async_tcp thread that happens like this: the tiT thread posts recv/poll events for the concurrent requests until the 32-entry queue is full and then blocks in xQueueSend() (with no timeout), waiting for a slot to free up; meanwhile the async_tcp thread - inside an application handler, with the watchdog already started - needs to write to a socket and blocks in tcpip_api_call(), waiting for the tiT thread to service the call. Neither thread can make progress, the watchdog started for that handler is never stopped, and so it expires and resets the chip.
Merely increasing the queue entry limit only increases the number of messages needed before the code deadlocks, and can be considered a poor workaround. I am not really sure what would actually fix the problem - maybe refusing to post to the queue from the socket poll callback (and possibly the recv callback) if the queue is full. If anybody wishes to run tests with the AsyncTCP/ESPAsyncWebServer combo, you probably need to check out this pull request. It fixes a bug in ESPAsyncWebServer that leaves the socket open, contrary to the expectation set by the Connection: close response header. This bug, in turn, breaks the Apache "ab" benchmarking program. I needed to fix this first before making the experiments that allowed me to figure out the explanation above. |
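A sketch of the "refuse to post when the queue is full" idea (the helper name and the event-freeing detail are assumptions about AsyncTCP's internals, not the library's actual code):

```cpp
// Instead of letting the tiT thread block forever on a full queue, try a
// zero-timeout send and drop the (re-issuable) poll event on failure.
static bool send_async_event_nonblocking(QueueHandle_t queue,
                                         lwip_event_packet_t *event) {
    if (xQueueSend(queue, &event, 0) != pdPASS) {
        // Queue full: discard the event rather than wedge the TCP/IP task;
        // lwIP will generate another poll event on its next poll interval.
        free(event);
        return false;
    }
    return true;
}
```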
My deadlock problems are gone since I moved to LITTLEFS. I have had no crash since then. My webpage heavily requests files from LITTLEFS at boot and nothing bad happens. |
Do you have a hard number on how many simultaneous requests is "heavily requesting"? Have you tried pointing a benchmarking tool such as Apache ab to your device? At some point the device just has to spend time reading a buffer from flash, which takes time, no matter how efficient the filesystem implementation is. Using ab, I have managed to trigger the watchdog even with webserver routes that do not touch the filesystem at all. Of course, this needs a much higher concurrency level than a route that touches the filesystem, but it is possible. The issue is that the current AsyncTCP design is inherently deadlock-prone because the xQueueSend waits without a timeout while the async_tcp thread eventually hits a network operation that requires the TCP/IP thread to run. Just a random idea - make the async_tcp thread keep a private (as in, not accessed anywhere by the TCP/IP thread) linked list, then, instead of pulling ONE event from the queue, pull ALL of the events, stash them in the linked list, and process them from there instead. Possibly repeat this transfer of events just before calling into tcpip_api_call(). |
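A sketch of that drain-into-a-private-list idea (the names are illustrative; this is not AsyncTCP's actual code, and it would live inside the async_tcp task):

```cpp
#include <deque>
#include "freertos/FreeRTOS.h"
#include "freertos/queue.h"

// Touched only by the async_tcp task, so the tiT thread can never block on it.
static std::deque<lwip_event_packet_t *> local_events;

// Move every event currently sitting in the shared queue into the private
// list without blocking, freeing slots so the tiT thread's xQueueSend()
// succeeds even while we are busy processing or inside tcpip_api_call().
static void drain_async_queue(QueueHandle_t shared_queue) {
    lwip_event_packet_t *e;
    while (xQueueReceive(shared_queue, &e, 0) == pdPASS) {
        local_events.push_back(e);
    }
}
```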
@avillacis Thanks so much for your detailed description of the issue, and for providing a more complete picture. While I haven't confirmed your diagnosis in detail myself, it sounds well investigated, well articulated, and quite plausible. I congratulate you. So:
Stepping back from the detail for a moment: when a queue pertaining to a single network connection becomes full, forcing a watchdog system reset seems like a dramatic over-reaction! Rather, for UDP I might expect events to be discarded, and for TCP the connection to be dropped - with a helpful message written to the log. Well-written apps should respond gracefully (in the context of the async web server this might mean the user having to reload the page). The programmer should consider the queue size a tunable, and adjust it based on the workload. The devil is always in the detail, but conceptually doesn't this seem more appropriate? I wonder if what happened here (a guess, I have no insight) is that the deadlock was encountered and time wasn't available to get to root cause, so the watchdog was introduced as a stopgap to bounce the system?
It's a nice idea, maybe a little better than simply increasing the queue size in that it empties the queue immediately before tcpip_api_call(). But "immediately before" is only in the context of the async_tcp thread; what does it mean in the broader context? Could a flurry of network activity cause _async_queue to get filled, in interrupt context, before or during tiT's servicing of the request issued through tcpip_api_call()? Could there be other requests ahead of this one in the mbox feeding into tiT, widening the window? Other crazy schemes might include adding a "realloc" feature to _async_queue, which could be used to automatically expand the queue when it fills. Ah, I drank too much coffee this morning... :-) My case is actually quite simple: I expect at most one user; I might have two if that user puts down one device and picks up another, but that would be very rare, so I don't have hugely variable concurrency introducing uncertainty. I've measured the maximum queue occupancy through load testing, and have easily enough memory to configure the queue at many times that size for safety. Additionally, the pages and resources of my app never change; the dynamic content arrives through a WebSocket, so aggressive caching of the static components not only dramatically improves performance, but also eliminates most of the network activity. So I'm happy with the following. In AsyncTCP.cpp:
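A sketch of the kind of edit meant here, based on my reading of AsyncTCP.cpp (the function name and surrounding code may differ between library versions):

```cpp
// Enlarge the shared event queue so bursts of lwIP events are less likely to
// fill it; 256 pointer-sized entries cost about 1 KB of RAM.
static inline bool _init_async_event_queue() {
    if (!_async_queue) {
        _async_queue = xQueueCreate(256, sizeof(lwip_event_packet_t *));  // was 32
        if (!_async_queue) {
            return false;
        }
    }
    return true;
}
```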
And in my code:
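And a sketch of the application-side caching setup (the directories, filesystem, and max-age values are placeholders; pick whatever suits your assets):

```cpp
// Static assets never change, so let the browser cache them aggressively;
// cached hits never reach the ESP32 at all, which keeps the event queue quiet.
server.serveStatic("/css/", SPIFFS, "/css/").setCacheControl("max-age=604800");
server.serveStatic("/res/", SPIFFS, "/res/").setCacheControl("max-age=864000");
```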
I suppose I could set the expiry of every file individually, but I haven't had quite that much coffee... |
Is this supposed to be a number? Thanks, your solution is working for me! |
Sure. max-age specifies how long files are to be cached in the browser (in seconds), so they won't need to be re-loaded from the server (the ESP). In the right circumstances it can dramatically reduce traffic and response times. The example in the ESPAsyncWebServer readme is commented "// Cache responses for 10 minutes (600 seconds)". I used different large numbers for the various components to reduce the probability that they would all need to be reloaded at the same time. In my case caching is very effective, as my html, css, js, and svg files don't change. (Note that in my source I enable caching for more than just the css and res subdirectories, but I only cut-and-pasted a couple of lines into my post to give the flavor.) The biggest problem with caching is that if you make changes on the server they may not be seen by the browser. This is particularly unhelpful during development, and also presents challenges when you "release version 2" of your site! These issues are all addressable. You might want to read up on website caching best practice before committing. |
Thanks for your detailed response. I have a few easy questions! :-) Q2: What if I have subdirectories - do I repeat server.serveStatic("/css/".. for every subdirectory, here "css"? Q3: What's the largest number you can set for "max-age"? I ask because I have a font file that I'm serving and I want the browser to cache it indefinitely. Q4: In your experience, which file system is better, SPIFFS or LITTLEFS? Q5: Have you thought about an HTTPS implementation of this library? Almost all browsers support geolocation and the Progressive Web App installation prompt, but HTTPS is required for that. What are your thoughts on this? |
@johnkelton |
Thanks lorol! |
I would be curious about the HTTPS part. Of course, it would be a big game changer for ESP32 web things. For the LITTLEFS part I'm using the LITTLEFS library. I have a question myself too. I'm serving my files like this now:
My files are gzipped with 7-Zip on Windows in a way that deletes the .gz extension from the end of the file name, because if I leave it there, iOS-based devices can't recognize (or can't decode) my files and the user gets garbage in the browser. So the question is, can I set a header like this: |
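If the intent is to mark responses as gzip-encoded globally, one possibility (a sketch; only sensible if every file the server sends really is gzipped) is a server-wide default header:

```cpp
// Attach Content-Encoding: gzip to every response the server produces, so
// pre-compressed files without a .gz extension are still decoded by browsers.
DefaultHeaders::Instance().addHeader("Content-Encoding", "gzip");
```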
No, I have not tested with such things. When the page loads, it loads around 7 JS files and 3-4 CSS files. After these are loaded, I do HTTP GETs from JS for things like the config file, a big JSON of weather info, the user program, etc. - around 5 more HTTP requests on page load. After that I have dynamically generated buttons on the page, built from the user config JSON file. These buttons use HTTP requests to do things on the back-end. If I tap these buttons like crazy, nothing crashes. The requests safely reach the back-end and are queued up. |
I was able to fix this issue by gzipping all the large JS files, and switching to FatFS. Thank you, @lorol for the very helpful sketch data upload tool! I've been desperate for easy Arduino FatFS upload support for Windows, and you've answered my prayers! |
@rorosaurus you are welcome! |
I'm using ESP32 (I assume you are ESP8266?), and Arduino board manager only offered presets for SPIFFS or FatFS. :) I left the default max files open setting. |
I am using ESP32 lately :) See my LITTLEFS library which allows this FS for ESP32. https://github.com/lorol/LITTLEFS |
Hello guys, I have the same situation with an ESP32 (AP mode) using the latest libraries (ESPAsyncWebServer, AsyncTCP). With LITTLEFS I solved the timeout timer issue, but I am still getting:
The behaviour is strange: when I open the ESP32 URL in the browser (Android or laptop), the HTML, CSS and JS libraries load fast enough, but at a certain point the browser keeps "loading" although I don't see any files still being transferred; then, after about 80 seconds (always 80 s), the browser finishes the transaction and shows the page correctly. After that I always get:
Any idea about what is freezing the browser? |
You can observe the files requested by the browser if you open the developer console. On Windows you press Ctrl+Shift+I; in the "Network" tab you can see all the files the browser is requesting and which file is timing out. It is usually the favicon or the manifest.json, something like that. As for the "pcb is NULL" and RX timeout messages, they are still happening to me too. |
A week ago, I decided that the current design of AsyncTCP is not salvageable, and a complete reimplementation (with the same external API) is required. So here it is: This is a very rough draft of a reimplementation of the AsyncTCP API for ESP32, but using the high-level BSD socket API provided by LWIP, instead of the low-level packet/callback API used by AsyncTCP. The main goal of the reimplementation is to get rid of the event queue that fills up and causes the deadlocks, and instead use standard select() calls on network sockets in order to call the required callbacks. So far, I have succeeded in making the following libraries work with my reimplementation: ESPAsyncWebServer, and async-mqtt-client, which are the two main libraries I use for my work projects. I will be implementing more of the API as needed by libraries I need, but not all of the API calls are functional - the ack() call is a no-op, and I have no plans whatsoever to implement the onPacket() callback, since it requires a pointer to a struct pbuf, which I cannot in good conscience fake when the onData() callback is just better. Anyone brave enough to test this reimplementation is invited to do so. I have run torture tests with Apache ab changing nothing in the firmware but the library (AsyncTCP versus my AsyncTCPSock), and concurrency tests (50 requests at once) that cause resets with AsyncTCP are successfully completed with my library. This library is only for ESP32. I have no ESP8266 devices available to test. In order to test this library, you must move aside or delete your copy of AsyncTCP - the two libraries will collide with each other, since they declare the same header files and the same class names, for API compatibility. At some unspecified time in the future, I might add SSL/TLS support. Do not count on it (yet), though. |
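For a flavor of the approach, here is an illustrative sketch of a select()-based dispatcher (not the actual AsyncTCPSock code):

```cpp
#include <vector>
#include <sys/socket.h>
#include <sys/select.h>

// Sockets currently owned by the dispatcher task; nothing else touches them,
// so there is no shared event queue for the TCP/IP thread to block on.
static std::vector<int> client_fds;

// One pass of the loop: wait up to 1 s for readable sockets, then hand each
// readable descriptor to a callback (here a plain function pointer).
static void poll_clients_once(void (*on_readable)(int fd)) {
    if (client_fds.empty()) {
        return;
    }
    fd_set readset;
    FD_ZERO(&readset);
    int maxfd = -1;
    for (int fd : client_fds) {
        FD_SET(fd, &readset);
        if (fd > maxfd) maxfd = fd;
    }
    struct timeval tv = {1, 0};  // 1 second timeout so housekeeping still runs
    if (select(maxfd + 1, &readset, nullptr, nullptr, &tv) <= 0) {
        return;  // timeout or error: nothing to dispatch this pass
    }
    for (int fd : client_fds) {
        if (FD_ISSET(fd, &readset)) {
            on_readable(fd);  // e.g. recv() the data and invoke onData()
        }
    }
}
```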
Hello @avillacis, I tried your library AsyncTCPSock with ESPAsyncWebServer today; sometimes I'm still getting timeouts:
|
[STALE_SET] This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions. |
Do you have a minimal example that displays this behavior? Do you have a test that makes it more likely to trigger these messages? |
[STALE_CLR] This issue has been removed from the stale queue. Please ensure activity to keep it open in the future.
Hi, same problems here. I'm trying to upgrade to LittleFS. Can anyone specify how this guide from zekageri should be followed? Thank you! |
[STALE_SET] This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions. |
The partition file does not need to be edited. You must choose the partition table that suits your ESP32. If you have a standard 4 MB ESP32, go with default.csv.
[STALE_CLR] This issue has been removed from the stale queue. Please ensure activity to keep it open in the future.
I have an ESP8266 implementation of a web site that uses radio buttons, and therefore jQuery to update their values. The problem I have is that this operation occasionally creates a copy of the jQuery file (I have also seen .css files and a couple of the HTML files duplicated, and not always perfectly). Currently it is SPIFFS based, using server.on. Based on this thread I have many questions that I hope can be answered.
Any help is appreciated, and otherwise it's my worklist. |
Going to open a separate issue. |
[STALE_SET] This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions. |
[STALE_DEL] This stale issue has been automatically closed. Thank you for your contributions. |
Hello, just to let you know that there has been a lengthy discussion around this problem here: mathieucarbou/ESPAsyncWebServer#165 A fix was implemented: (we could call that a workaround or a patch, though, because the real fix would be to get rid of the current queue design)
There are also some perf tests available in my fork.
I've reduced my code to the absolute minimum:
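A sketch of this kind of minimal setup (the file layout and Wi-Fi credentials are placeholders, not the reporter's actual code):

```cpp
#include <WiFi.h>
#include <SPIFFS.h>
#include <ESPAsyncWebServer.h>

AsyncWebServer server(80);

void setup() {
    Serial.begin(115200);
    WiFi.begin("my-ssid", "my-password");
    while (WiFi.status() != WL_CONNECTED) {
        delay(250);
    }
    SPIFFS.begin();
    // All static assets live in SPIFFS; a page that pulls in several large
    // .js/.css files concurrently is enough to trigger the watchdog reset.
    server.serveStatic("/", SPIFFS, "/").setDefaultFile("index.html");
    server.begin();
}

void loop() {}
```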
The application is dependent upon jQuery, Bootstrap, ion.rangeSlider and popper. Initially my pages were loading online copies,
but because my device must be accessible offline I took to serving minimized local copies from the ESP, and that's when the problems started. So my pages include:
Often the first one or two page loads succeed, but then the ESP resets, although it isn't strictly reproducible.
The error message isn't particularly useful as the backtrace is of the watchdog mechanism rather than of the task that didn't reset the watchdog:
I see several similar issues logged in the past, but all slightly different and none with a resolution (excepting using the non-async web server instead, which I'd rather not do). I do notice that if I serve up one HUGE file rather than several large files the system appears stable.
Is there anything I can do before investigating JTAG debugging as a means to determine what async_tcp is doing?