This repository was archived by the owner on Jan 20, 2025. It is now read-only.

Watchdog timer triggered when serving large files from SPIFFS #825

Closed
choness opened this issue Aug 15, 2020 · 43 comments

@choness

choness commented Aug 15, 2020

I've reduced my code to the absolute minimum:

// Context assumed for completeness (not shown in the original report):
#include <SPIFFS.h>
#include <ESPAsyncWebServer.h>

AsyncWebServer server(80);  // port is an assumption

void setupWebServer() {
  
  // Initialize SPIFFS
  if(!SPIFFS.begin(true)) {
    // ... logging and stuff...
    return;
  }

  // Start WiFi. Connect to, or become, an AP
  if (!setupWiFi())
    return;

  server.serveStatic("/", SPIFFS, "/").setDefaultFile("status.html"); 
  server.onNotFound([](AsyncWebServerRequest *request){request->send(404);});
  server.begin();
}

The application depends on jQuery, Bootstrap, ion.rangeSlider and Popper. Initially my pages loaded online copies,
but because my device must be accessible offline I switched to serving minified local copies from the ESP, and that's when the problems started. So my pages include:

<head>
    <meta charset="utf-8">
    <meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">

    <link rel="stylesheet" href="/css/bootstrap.min.css">
    <link rel="stylesheet" href="/css/ion.rangeSlider.min.css">
    <link rel="stylesheet" href="/css/styles.css">
    
    <script src="/js/jquery-3.4.1.slim.min.js"></script>
    <script src="/js/popper.min.js"></script>
    <script src="/js/bootstrap.min.js"></script>
    <script src="/js/ion.rangeSlider.min.js"></script>
    <script src="/js/shared.js"></script>
    <script src="/js/configure.js"></script>

Often the first one or two page loads succeed, but then the ESP resets; although it isn't strictly reproducible.

The error message isn't particularly useful as the backtrace is of the watchdog mechanism rather than of the task that didn't reset the watchdog:

E (107592) task_wdt: Task watchdog got triggered. The following tasks did not reset the watchdog in time:
E (107592) task_wdt:  - async_tcp (CPU 1)
E (107592) task_wdt: Tasks currently running:
E (107592) task_wdt: CPU 0: IDLE0
E (107592) task_wdt: CPU 1: IDLE1
E (107592) task_wdt: Aborting.
abort() was called at PC 0x400edcef on core 0

Backtrace: 0x4008c434:0x3ffbe170 0x4008c665:0x3ffbe190 0x400edcef:0x3ffbe1b0 0x40084771:0x3ffbe1d0 0x40186a5f:0x3ffbc280 0x400ef0da:0x3ffbc2a0 0x4008a361:0x3ffbc2c0 0x40088b7d:0x3ffbc2e0
  #0  0x4008c434:0x3ffbe170 in invoke_abort at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/esp32/panic.c:707
  #1  0x4008c665:0x3ffbe190 in abort at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/esp32/panic.c:707
  #2  0x400edcef:0x3ffbe1b0 in task_wdt_isr at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/esp32/task_wdt.c:252
  #3  0x40084771:0x3ffbe1d0 in _xt_lowint1 at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/freertos/xtensa_vectors.S:1154
  #4  0x40186a5f:0x3ffbc280 in esp_pm_impl_waiti at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/esp32/pm_esp32.c:492
  #5  0x400ef0da:0x3ffbc2a0 in esp_vApplicationIdleHook at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/esp32/freertos_hooks.c:108
  #6  0x4008a361:0x3ffbc2c0 in prvIdleTask at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/freertos/tasks.c:3507
  #7  0x40088b7d:0x3ffbc2e0 in vPortTaskWrapper at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/freertos/port.c:355 (discriminator 1)

Rebooting...

I see several similar issues logged in the past, but all are slightly different and none has a resolution (except switching to the non-async web server, which I'd rather not do). I do notice that if I serve up one HUGE file rather than several large files, the system appears stable.

Is there anything I can do before investigating JTAG debugging as a means to determine what async_tcp is doing?

@lorol
Contributor

lorol commented Aug 16, 2020

Hi,
Yes, you can consolidate several js files into one and also gzip-compress them.
There is a SPIFFS limitation on the maximum number of files open at a time.
You can use LittleFS instead of SPIFFS.
You can check the heap while running.

But if you are the same person as @zekageri (same build-machine path /home/runner/work/esp32-arduino-lib-builder in the posted logs...) you already know all this. If not, I apologize for my prejudice :) and I hope that helps.
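
For the heap check, a minimal sketch (standard Arduino-ESP32 call; the 5-second interval is arbitrary):

void loop() {
  static uint32_t lastPrint = 0;
  if (millis() - lastPrint > 5000) {
    lastPrint = millis();
    Serial.printf("Free heap: %u bytes\n", ESP.getFreeHeap());
  }
}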

@zekageri

I kind of solved that problem with the following. All my JS files are combined into one file. With 7-Zip on Windows, I compressed the file to its absolute minimum size. I switched from SPIFFS to LITTLEFS and serve it like this; I will post some code soon.

@zekageri

So with 7-Zip on Windows, I gzip my web files and then delete the .gz extension from the end of the file name. The reason is that iPhones can't recognize or decode the files if there is a .gz extension. (stupid Apple...) After uploading them to LITTLEFS, I serve, for example, the main page like this:

server.on("/", HTTP_GET, [](AsyncWebServerRequest *request) {
    AsyncWebServerResponse* response = request->beginResponse(LITTLEFS, "/Home_index.html", "text/html");
    response->addHeader("Content-Encoding", "gzip");
    request->send(response);
});

@lorol
Contributor

lorol commented Aug 17, 2020

Better to use server.serveStatic("/", LITTLEFS, "/").setDefaultFile("Home_index.html"); so everything under / will be served automatically when the browser asks for it (you can change it to a sub-folder). See here: https://github.com/lorol/ESPAsyncWebServer/blob/master/examples/ESP_AsyncFSBrowser/ESP_AsyncFSBrowser.ino#L189
Hm, I also have an old iPad and a not extremely old iPhone... things work well with .gz. In your HTML, you reference the js/css filenames without .gz, e.g. <script src="/js/popper.min.js"></script>, but on the ESP (data folder) the real compressed file should be /js/popper.min.js.gz; leave the server to pick up what is necessary. By default SPIFFS can have only 5 files open at a time; this can be changed.
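
A minimal sketch of that layout (filenames are just examples; the maxOpenFiles argument to SPIFFS.begin() is one way to raise the open-file limit in the Arduino-ESP32 core, but treat the exact signature and default as an assumption to verify against your core version):

// data/ folder uploaded to the filesystem image:
//   /Home_index.html
//   /js/popper.min.js.gz        <- compressed on flash
//   /css/bootstrap.min.css.gz
//
// The HTML keeps the uncompressed names; the static handler looks for a ".gz"
// sibling and serves it with "Content-Encoding: gzip" automatically.

SPIFFS.begin(true /*formatOnFail*/, "/spiffs" /*basePath*/, 10 /*maxOpenFiles, assumed parameter*/);
server.serveStatic("/", SPIFFS, "/").setDefaultFile("Home_index.html");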

@zekageri

Better to use server.serveStatic("/", LITTLEFS, "/").setDefaultFile("Home_index.html"); so everything under / will be served automatically when the browser asks for it [...]

I don't see the difference between serveStatic and server.on.

Where is the benefit?

@lorol
Contributor

lorol commented Aug 17, 2020

@zekageri Add a file /js/test.txt with "hello" text inside. Then try to access it from the browser at http:// .... /js/test.txt and let me know if there is a difference between the two approaches. I could be wrong, too.

@choness
Author

choness commented Aug 17, 2020

Thanks so much for your thoughts.

But if you are same person as @zekageri

I am not.

Yes, you can consolidate several js files to one and also gz compress.

With any problem that is aggravated by concurrency, as this one is, I feel I need to identify the root cause before concluding that reducing (not eliminating) concurrency is a real solution. It could just be that by consolidating files I reduce concurrency from the Chrome-defined limit of 6 to, say, 3, and the problem doesn't occur during my limited testing, but then does occur some time later. If, though, I identify the root cause as, say, a bug in handling the limit of 5 concurrently open SPIFFS files, then I know that as long as I keep concurrency at or below 5 I have a real workaround.

There is a SPIFFS limitation of max files opened at a time.

Thanks. I was surprised to read online that the default is 5. That's low! I'll look into the implications of this!

You can use LittleFS instead of SPIFFS.

I intend to do this, as I understand SPIFFS to be deprecated. My understanding though is that while work-arounds exist, a formal solution for building a LittleFS image under PlatformIO is still a work-in-progress? Maybe I misunderstood, or am out-of-date?

You can check the heap while running.

I did this, and have LOTS of free heap throughout.

I tinkered with a few things while waiting for the JTAG hardware to arrive from Amazon, and may have stumbled upon a solution. For example I:

  1. Increased the debugging to ..._VERBOSE. No help.

  2. Disabled the watchdog (in _async_service_task()), to see if the pages would load eventually - i.e. is the watchdog interval too short? In this case the app would hang indefinitely; so it's definitely a hang and not just long-running code that doesn't tickle the watchdog

  3. Pushed the watchdog enablement down into _handle_async_event(), enabling it for some event types and not for others. My thinking was to try to narrow it down to maybe a single event type that exhibited the problem (using the fact that the failure mode was different depending on whether the watchdog was enabled for the code in which the hang occurred). I concluded that multiple event types exhibit the problem.

  4. I instrumented the semaphore code, because it smells of a locking problem. No fault found.

  5. I started looking for places where queues might overflow. Here I had some success. Specifically if I increase the size of the async event queue from its default 32 to 256 the problem seems to go away completely, with no other changes:

    if(!_async_queue){
       // ORIGINAL: _async_queue = xQueueCreate(32, sizeof(lwip_event_packet_t *));
       _async_queue = xQueueCreate(256, sizeof(lwip_event_packet_t *));

With the original queue size limit of 32 the app consistently hangs within 10 page loads (and generally within 3). With the larger queue size limit I have reloaded my most complex pages maybe 100 times with no issue.

I'm hoping to get time later today to instrument the code around the xQueue* calls to more fully understand what is happening, and where some useful debug messages might have helped. I also have the JTAG hardware arriving, and I'm hoping that will allow me to set a breakpoint at _abort(), then backtrace the async_tcp tasks to better understand the nature of the hang.

I'll keep you posted as my fumblings progress.

@zekageri

[...] I started looking for places where queues might overflow. Here I had some success. Specifically, if I increase the size of the async event queue from its default 32 to 256, the problem seems to go away completely, with no other changes.

With the original queue size limit of 32 the app consistently hangs within 10 page loads (and generally within 3). With the larger queue size limit I have reloaded my most complex pages maybe 100 times with no issue.

That's interesting. I think me-no-dev has already worked on that issue, but found that the root cause is not the async library but maybe the TCP/IP layer at the IDF level? That queue size problem looks interesting to me though. I wonder if me-no-dev has already seen this or not.

@choness
Author

choness commented Aug 18, 2020

In the following log, my QMW macro is inserted before the xQueue* calls and displays the number of messages already in the queue (as returned by uxQueueMessagesWaiting(_async_queue)) prior to the queue operation; a sketch of such a macro is shown below. The log itself was captured with the original 32-entry queue:
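
(A sketch of what such a macro might look like; QMW is the name used in this comment, but this particular implementation is assumed rather than the original code.)

#define QMW(op) Serial.printf("QMW: " op " %u\n", (unsigned)uxQueueMessagesWaiting(_async_queue))

// placed immediately before the queue operations in AsyncTCP.cpp, e.g.:
//   QMW("send"); xQueueSend(...);
//   QMW("get");  xQueueReceive(...);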

QMW: send 29
QMW: get 30
QMW: send 29
QMW: send 30
QMW: send 31
QMW: send 32
E (17981) task_wdt: Task watchdog got triggered. The following tasks did not reset the watchdog in time:
E (17981) task_wdt:  - async_tcp (CPU 1)
E (17981) task_wdt: Tasks currently running:
E (17981) task_wdt: CPU 0: IDLE0
E (17981) task_wdt: CPU 1: IDLE1
E (17981) task_wdt: Aborting.
abort() was called at PC 0x400edf2f on core 0

So I'm pretty convinced the watchdog timer expiry occurs when the _async_queue fills. Why?

static inline bool _send_async_event(lwip_event_packet_t ** e){
    return _async_queue && xQueueSend(_async_queue, e, portMAX_DELAY) == pdPASS;
}
#define portMAX_DELAY ( TickType_t ) 0xffffffffUL

Clearly async_tcp told the xQueue code to sleep if the queue is full - while the watchdog timer is ticking!

I wonder why we're putting things in the queue so much faster than we're taking out? Is the processing/removal of the entries in the queue stalling for some reason? That'll take more work...

BTW, when running with a queue of 256, the largest number of items I see in the queue is in the mid 40s.

I'd be very interested in learning from any investigation me_no_dev has already done. Any pointers?

@zekageri

That is the reason for the queue, isn't it?
If I put items into the queue at the same rate as I take them out, why would I need a queue? ;D But it's interesting that I never heard about that problem before.

@lorol
Contributor

lorol commented Aug 18, 2020

@choness Good work!
You have shared the cause and a fix. For the rest, I hope the author @me-no-dev will one day have some time to look into the details. Thank you.

@avillacis
Contributor

I am also affected by the watchdog trigger bug when serving files. I had this problem when the browser tried to request multiple resource files concurrently, and I eventually solved it by organizing my index.htm file to contain some JavaScript that serialized the loading of all the other resource files. However, I don't think the open-file limit on the SPIFFS object is the root cause of the bug. In fact, the watchdog trigger bug can even happen with libraries other than ESPAsyncWebServer that also use the underlying AsyncTCP. This is my understanding of why:

When using AsyncTCP (and any other library that requires it such as ESPAsyncWebServer), there are two relevant threads:

  1. The TCP/IP thread, created by the lwip component of the ESP32 SDK. From my experiments, this thread appears with the name of "tiT".
  2. The AsyncTCP thread, created by the library. This is the "async_tcp" thread on which the watchdog triggers.

The async_tcp thread creates and maintains a queue of pointers to message structs (queue element type lwip_event_packet_t*) that represent lwip events that the async_tcp thread receives from the tiT thread. The tiT thread is the one that interfaces with the network hardware and implements the TCP/IP stack. Whenever that thread determines there is something happening on the application socket, it calls a callback previously registered on the tcp_pcb representing the socket. For AsyncTCP, each callback (still running on the tiT thread) eventually posts a message into the 32-entry queue. Then the async_tcp thread removes each message from the queue, starts the watchdog, runs the corresponding application-level action, and, after returning from the application handler, stops the watchdog.
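
A simplified sketch of the consumer side just described (the shape is inferred from this description; names and details are assumed, not the library's exact code):

// async_tcp task body (simplified): pull one event, guard the application
// callback with the task watchdog, then release the watchdog again.
static void async_service_task_sketch(void *arg) {
    lwip_event_packet_t *packet = NULL;
    for (;;) {
        if (xQueueReceive(_async_queue, &packet, portMAX_DELAY) == pdPASS) {
            esp_task_wdt_add(NULL);      // start watching this task
            _handle_async_event(packet); // runs the application-level handlers
            esp_task_wdt_delete(NULL);   // stop watching once the handler returns
        }
    }
}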

If the application decides it needs to write to the socket, it runs the write() method on the AsyncClient object. This eventually executes a variant of tcpip_api_call(), which happens to involve inter-thread messaging into the tiT thread. This implies that the tiT thread must (eventually) be runnable in order to handle a tcpip_api_call(). From examination of the code, many of the other application-initiated operations on the socket also require one or more calls into tcpip_api_call().

The watchdog trigger scenario is then a deadlock between the tiT thread and the async_tcp thread that happens like this:

  1. There is heavy socket activity going on. The most probable cause is a bunch of open connections simultaneously transmitting or receiving data.
  2. One or more of the application-level socket handlers (running in the async_tcp thread) require time before returning. It does not matter if the application yields in any way or not. The most common way to require time (but not the only way) is filesystem I/O.
  3. Since the async_tcp thread is busy running the application code, the event queue (filled by the tiT thread) eventually fills up. At this point, since the xQueueSend() call has an infinite timeout, the tiT thread is not runnable.
  4. At some point, the application code running in the async_tcp thread requires some socket operation (possibly a write(), but even a close() will require a tcpip_api_call()).
  5. When invoking tcpip_api_call(), the async_tcp thread waits for the response, in the expectation that the tiT thread will run and yield a result.
  6. However, the tiT thread is frozen waiting for the async_tcp event queue. Therefore the application deadlocks and the watchdog eventually triggers.

Merely increasing the queue entry limit only increases the number of messages before the code deadlocks, and can be considered a poor workaround. I am not really sure what would work to actually fix the problem. Maybe refusing to post to the queue on the socket poll callback (and possibly the recv callback) if the queue is full.
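
A rough sketch of that idea (self-contained stand-ins for the real structs and names; an illustration of the approach, not a drop-in patch for the library):

#include "freertos/FreeRTOS.h"
#include "freertos/queue.h"
#include <stdlib.h>

typedef struct {            // simplified stand-in for lwip_event_packet_t
    int event;              // event type id
    void *data;
} event_packet_t;

#define EVT_POLL 1          // placeholder id for the periodic poll event

static QueueHandle_t s_async_queue;   // e.g. xQueueCreate(32, sizeof(event_packet_t *))

// Called from the lwip ("tiT") thread. Never blocks forever: it waits a bounded
// time, and if the queue is still full it drops poll events (they are periodic
// and will fire again) and reports failure for everything else.
static bool send_event_bounded(event_packet_t **e) {
    if (!s_async_queue) return false;
    if (xQueueSend(s_async_queue, e, pdMS_TO_TICKS(100)) == pdPASS) {
        return true;
    }
    if ((*e)->event == EVT_POLL) {
        free(*e);
        *e = NULL;
    }
    return false;
}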

If anybody wishes to run tests using the AsyncTCP/ESPAsyncWebServer combo, you probably need to check out this pull request. This pull request fixes a bug in ESPAsyncWebServer that leaves the socket open, contrary to the expectation set by the Connection: close response header. This bug, in turn, breaks the Apache "ab" benchmarking program. I needed to fix this first before making the experiments that allowed me to figure out the above explanation.

@zekageri

My problems with the deadlocks are gone since I moved to LITTLEFS. I have had no crash since then. My webpage at boot-up heavily requests files from LITTLEFS and nothing bad happens.
I have nearly 15 HTTP GETs at onload. Some request small files and some are big, heavy JS files.
So yeah, that's interesting.

@avillacis
Contributor

My problems with the deadlocks are gone since I moved to LITTLEFS. I have had no crash since then. My webpage at boot-up heavily requests files from LITTLEFS and nothing bad happens.
I have nearly 15 HTTP GETs at onload. Some request small files and some are big, heavy JS files.
So yeah, that's interesting.

Do you have a hard number on how many simultaneous requests is "heavily requesting"? Have you tried pointing a benchmarking tool such as Apache ab to your device? At some point the device just has to spend time reading a buffer from flash, which takes time, no matter how efficient the filesystem implementation is.

Using ab, I have managed to trigger the watchdog even with webserver routes that do not touch the filesystem at all. Of course, this needs a much higher concurrency level than a route that touches the filesystem, but it is possible. The issue is that the current AsyncTCP design is inherently deadlock-prone because the xQueueSend waits without a timeout while the async_tcp thread eventually hits a network operation that requires the TCP/IP thread to run.
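
For reference, a typical invocation that exercises this kind of concurrency (the URL and numbers here are only examples):

ab -n 200 -c 50 http://192.168.4.1/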

Just a random idea - make the async_tcp thread keep a private (as in, not accessed anywhere by the TCP/IP thread) linked list, then, instead of pulling ONE event from the queue, pull ALL of the events, stash them in the linked list, and process them from there instead. Possibly repeat this transfer of events just before calling into tcpip_api_call().
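
A sketch of that drain-then-process idea (names assumed; std::deque is used here purely for illustration):

#include <deque>
#include "freertos/FreeRTOS.h"
#include "freertos/queue.h"

// Empty the shared queue in one non-blocking pass so the producer (tiT) always
// regains space quickly, then let the consumer work from its private backlog.
template <typename T>
static void drain_queue(QueueHandle_t q, std::deque<T *> &backlog) {
    T *item = nullptr;
    while (xQueueReceive(q, &item, 0) == pdPASS) {   // zero timeout: never block
        backlog.push_back(item);
    }
}

// In the consumer loop: drain first, handle events from the backlog, and repeat
// the drain just before any call that needs the TCP/IP thread (e.g. tcpip_api_call()).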

@choness
Author

choness commented Aug 27, 2020

@avillacis Thanks so much for your detailed description of the issue, and for providing a more complete picture. While I haven't confirmed your diagnosis in detail myself, it sounds well investigated, well articulated, and quite plausible. I congratulate you.

So:

  1. the async_tcp thread fails to remove items from _async_queue for a while for any of a variety of reasons;
  2. _async_queue eventually fills up and tiT thread blocks, effectively waiting for async_tcp thread;
  3. the async_tcp thread makes a synchronous request back to the tiT thread and so blocks waiting for the thread that in turn is waiting for it.

Stepping back from the detail for a moment, when a queue pertaining to a single network connection becomes full, forcing a watchdog system reset seems like a dramatic over-reaction! Rather for UDP I might expect events to be discarded, and for TCP the connection to be dropped - with a helpful message generated to the log. Well-written apps should respond gracefully (in the context of asyncwebserver this might mean the user having to reload the page). The programmer should consider the queue size a tunable, and adjust it based on the workload. The devil is always in the detail, but conceptually doesn't this seem more appropriate?

I wonder if what happened here (guess, no insight) is that the deadlock was encountered and time wasn't available to get to root cause, so the watchdog was introduced to bounce the system as a stopgap measure?

Just a random idea - make the async_tcp thread keep a private (as in, not accessed anywhere by the TCP/IP thread) linked list, then, instead of pulling ONE event from the queue, pull ALL of the events, stash them in the linked list, and process them from there instead. Possibly repeat this transfer of events just before calling into tcpip_api_call().

It's a nice idea, maybe a little better than simply increasing queue size in that it empties the queue immediately before tcpip_api_call(). But "immediately before" is only in the context of the async_tcp thread; what does it mean in the broader context? Could a flurry of network activity cause _async_queue to get filled in interrupt context before or during tiT's servicing of the request issued through tcpip_api_call()? Could there be other requests ahead of this one in the mbox feeding into tiT widening the window?

Other crazy schemes might include adding a "realloc" feature to the _async_queue, which could be used to automatically expand the queue when it fills. Ah, I drank too much coffee this morning... :-)

My case is actually quite simple: I expect at most one user; I might have two if that user puts down one device and picks up another, but that would be very rare, so I don't have hugely variable concurrency introducing uncertainty. I've measured the maximum queue occupancy through load testing, and have easily enough memory to configure the queue at many times that size for safety. Additionally, the pages and resources of my app never change; the dynamic component arrives through a web socket, so aggressive caching of the static components not only dramatically improves performance, but also eliminates most of the network activity.

So I'm happy with:

In AsyncTCP.cpp:

static inline bool _init_async_event_queue(){
    if(!_async_queue){
        // COL:
        _async_queue = xQueueCreate(256, sizeof(lwip_event_packet_t *));
        if(!_async_queue){
            return false;
        }
    }
    return true;
}

And in my code:

  server.serveStatic("/css/", SPIFFS, "/css/").setCacheControl("max-age=<large prime number>"); 
  server.serveStatic("/res/", SPIFFS, "/res/").setCacheControl("max-age=<different large prime number>"); 

I suppose I could set the expiry of every file individually, but I haven't had quite that much coffee...

@johnkelton

johnkelton commented Aug 29, 2020

Is this supposed to be a number? <large prime number> and <different large prime number>

Thanks your solution is working for me!

@choness
Author

choness commented Aug 29, 2020

Sure. max-age specifies how long files are to be cached in the browser (in seconds), so they won't need to be re-loaded from the server (the ESP). In the right circumstances it can dramatically reduce traffic and response times. The example from the ESPAsyncWebServer readme:

// Cache responses for 10 minutes (600 seconds)
server.serveStatic("/", SPIFFS, "/www/").setCacheControl("max-age=600");

I used different large numbers for various components to reduce the probability that they all needed to be reloaded at the same time. In my case caching is very effective, as my html, css, js, svg files don't change. (Note that in my source I enable caching for more than just the css and res subdirectories, but I only cut-and-pasted a couple of lines into my post to give the flavor.)

The biggest problem with caching is that if you make changes on the server they may not be seen by the browser. This is particularly unhelpful during development, and also presents challenges when you "release version 2" of your site! These issues are all addressable. You might want to read up on website caching best practice before committing.
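
A common way to address this (general web practice, nothing specific to this library; the version numbers are illustrative) is to version the resource URLs, so a new release forces a fresh download while unchanged files stay cached:

<!-- bump ?v= whenever the file actually changes -->
<link rel="stylesheet" href="/css/styles.css?v=3">
<script src="/js/shared.js?v=3"></script>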

@johnkelton

Thanks for your detailed response. I have a few easy questions! :-)
Q1: I don't have any sub-directories, so I will just change server.serveStatic("/css/", ... to server.serveStatic("/", ..., I believe!?

Q2: What if I have subdirectories? Will I repeat server.serveStatic("/css/", ... for every subdirectory (here it's "css")?

Q3: What's the largest number you can set for "max-age"? I ask because I have a font file that I am serving and I want the browser to cache it indefinitely.

Q4: In your experience, which file system is better, SPIFFS or LITTLEFS?

Q5: Have you thought about an HTTPS implementation of this library? Almost all browsers support geolocation and the Progressive Web App installation prompt, but HTTPS is required for that. What are your thoughts on this?

@lorol
Contributor

lorol commented Aug 30, 2020

@johnkelton
A4: Most examples you copy from the Internet use SPIFFS because it came to the game earlier, so they will work out-of-the-box with SPIFFS.
LittleFS is not even official yet for ESP32 because the Espressif team is working on more important things.
Some comparison: https://arduino-esp8266.readthedocs.io/en/latest/filesystem.html#spiffs-and-littlefs
Everything else is an advantage for LittleFS.

@johnkelton

Thanks lorol!

@zekageri

I would be curious about the HTTPS part. Of course it would be a big game changer for ESP32 web things.
LittleFS is far better than SPIFFS. I have used SPIFFS before with a lot of trouble. I wasn't able to set a partition for my flash bigger than 1 or at most 1.5 MB. When I did change it to more than 2 MB, the file system was so slow that the ESP crashed when serving the files.

For LittleFS I am using board_build.partitions = large_spiffs_16MB.csv in PlatformIO.
Even with this large partition, writes and reads are visibly faster than with a custom 1 MB SPIFFS partition.
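
For reference, that PlatformIO setting lives in platformio.ini (the environment name and board below are just examples):

; platformio.ini
[env:esp32dev]
platform = espressif32
board = esp32dev
framework = arduino
board_build.partitions = large_spiffs_16MB.csv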

I would have a question myself too.

I'm serving my files like this now:

server.on("/", HTTP_GET, [](AsyncWebServerRequest *request) {
    AsyncWebServerResponse* response = request->beginResponse(LITTLEFS, "/Home_index.html", "text/html");
    response->addHeader("Content-Encoding", "gzip");
    request->send(response);
});

My files are gzipped with 7-Zip on Windows, and then I delete the .gz extension from the file name. That is because if I leave it there, iOS-based devices can't recognize (or can't decode) my files and the user gets garbage in the browser.

So the question is: can I set a header like response->addHeader("Content-Encoding", "gzip");
with this file-serving approach: server.serveStatic("/css/", SPIFFS, "/css/").setCacheControl("max-age=<large prime number>"); ?

@zekageri

Do you have a hard number on how many simultaneous requests is "heavily requesting"? Have you tried pointing a benchmarking tool such as Apache ab at your device? [...]

No, I have not tested with such things.
Currently the front-end is based on HTTP requests. I have many buttons and sliders and a lot of CSS and JS files. Some of them are big ones.

When the page loads, it loads about 7 JS files and 3-4 CSS files. After these are loaded, I do HTTP GETs from JS for things like the config file, a big JSON of weather info, the user program, etc., roughly 5 more HTTP requests on page load. After that I have dynamically generated buttons on the page, built from the user config JSON file. These buttons use HTTP requests to do things on the back-end. If I tap these buttons like crazy, nothing crashes. The requests safely reach the back-end and are queued up.

@rorosaurus

I was able to fix this issue by gzipping all the large JS files, and switching to FatFS.

Thank you, @lorol for the very helpful sketch data upload tool! I've been desperate for easy Arduino FatFS upload support for Windows, and you've answered my prayers!

@lorol
Contributor

lorol commented Sep 9, 2020

@rorosaurus you are welcome!
What is the reason you preferred FatFS, not LittleFS? Do you have enough RAM (free heap), especially if you leave the default maximum of 10 files open at a time?

@rorosaurus

I'm using an ESP32 (I assume you are on ESP8266?), and the Arduino board manager only offered presets for SPIFFS or FatFS. :) I left the default max-open-files setting.

@lorol
Contributor

lorol commented Sep 19, 2020

I am using ESP32 lately :) See my LITTLEFS library, which enables this FS for ESP32: https://github.com/lorol/LITTLEFS
Now it's available in the Arduino Library Manager.
In the README.md there you will find info about the tool to upload the filesystem from the data folder. Enjoy.

@johnnytolengo

Hello guys, I have the same situation with an ESP32 (AP mode) using the latest libraries (ESPAsyncWebServer, AsyncTCP). With LITTLEFS I solved the watchdog timeout issue, but I am still getting:

[W][AsyncTCP.cpp:949] _poll(): pcb is NULL
[AsyncTCP.cpp:969] _poll(): rx timeout 4
[AsyncTCP.cpp:969] _poll(): rx timeout 4

The behaviour is odd: when I open the browser (Android or laptop) with the ESP32 URL, the HTML, CSS and JS libraries load fast enough, but at a certain point the browser keeps "loading" even though I don't see any files being loaded; then after about 80 seconds (always 80 s) the browser finishes the transaction and shows the page correctly.

After that I always get:

[W][AsyncTCP.cpp:949] _poll(): pcb is NULL
[W][AsyncTCP.cpp:969] _poll(): rx timeout 4
[W][AsyncTCP.cpp:969] _poll(): rx timeout 4

Any idea about what is freezing the browser?

@zekageri

zekageri commented Dec 4, 2020

[...] at a certain point the browser keeps "loading" even though I don't see any files being loaded; then after about 80 seconds (always 80 s) the browser finishes the transaction and shows the page correctly. [...] Any idea about what is freezing the browser?

You can observe the files requested by the browser if you open the developer console. On Windows you press Ctrl+Shift+I; in the "Network" tab you can see all the files the browser is requesting and which file is timing out. It is usually the favicon or the manifest.json, something like that.

About the "pcb is NULL" and the rx timeout: they are still happening to me too.

@avillacis
Contributor

Using ab, I have managed to trigger the watchdog even with webserver routes that do not touch the filesystem at all. Of course, this needs a much higher concurrency level than a route that touches the filesystem, but it is possible. The issue is that the current AsyncTCP design is inherently deadlock-prone because the xQueueSend waits without a timeout while the async_tcp thread eventually hits a network operation that requires the TCP/IP thread to run.

A week ago, I decided that the current design of AsyncTCP is not salvageable, and a complete reimplementation (with the same external API) is required. So here it is:

AsyncTCPSock

This is a very rough draft of a reimplementation of the AsyncTCP API for ESP32, but using the high-level BSD socket API provided by LWIP, instead of the low-level packet/callback API used by AsyncTCP. The main goal of the reimplementation is to get rid of the event queue that fills up and causes the deadlocks, and instead use standard select() calls on network sockets in order to call the required callbacks.
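
Conceptually, the select()-based loop looks something like this (a generic BSD-sockets sketch, not AsyncTCPSock's actual code):

#include "lwip/sockets.h"   // lwip's BSD socket API on the ESP32

// One task watches every client socket; no cross-thread event queue involved.
static void socket_service_loop(const int *clients, size_t count) {
    for (;;) {
        fd_set readset;
        FD_ZERO(&readset);
        int maxfd = -1;
        for (size_t i = 0; i < count; i++) {
            FD_SET(clients[i], &readset);
            if (clients[i] > maxfd) maxfd = clients[i];
        }
        struct timeval tv = { 0, 100000 };                 // wake up every 100 ms
        int ready = select(maxfd + 1, &readset, NULL, NULL, &tv);
        if (ready <= 0) continue;                          // timeout or error
        for (size_t i = 0; i < count; i++) {
            if (FD_ISSET(clients[i], &readset)) {
                // read the data and invoke the onData()-style callback here
            }
        }
    }
}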

So far, I have succeeded in making the following libraries work with my reimplementation: ESPAsyncWebServer, and async-mqtt-client, which are the two main libraries I use for my work projects. I will be implementing more of the API as needed by libraries I need, but not all of the API calls are functional - the ack() call is a no-op, and I have no plans whatsoever to implement the onPacket() callback, since it requires a pointer to a struct pbuf, which I cannot in good conscience fake when the onData() callback is just better.

Anyone brave enough to test this reimplementation is invited to do so. I have run torture tests with Apache ab changing nothing in the firmware but the library (AsyncTCP versus my AsyncTCPSock), and concurrency tests (50 requests at once) that cause resets with AsyncTCP are successfully completed with my library.

This library is only for ESP32. I have no ESP8266 devices available to test.

In order to test this library, you must move aside or delete your copy of AsyncTCP - the two libraries will collide with each other, since they declare the same header files and the same class names, for API compatibility.

At some unspecified time in the future, I might add SSL/TLS support. Do not count on it (yet), though.

@johnnytolengo

Hello @avillacis, I tried your library AsyncTCPSock with ESPAsyncWebServer today; sometimes I'm still getting timeouts:
Serving files of 360 kB

[W][AsyncTCP.cpp:963] _poll(): ack timeout 4

[W][AsyncTCP.cpp:970] _poll(): rx timeout 4

@stale

stale bot commented Mar 19, 2021

[STALE_SET] This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Mar 19, 2021
@avillacis
Contributor

Hello @avillacis, I tried your library AsyncTCPSock with ESPAsyncWebServer today; sometimes I'm still getting timeouts:
Serving files of 360 kB

[W][AsyncTCP.cpp:963] _poll(): ack timeout 4

[W][AsyncTCP.cpp:970] _poll(): rx timeout 4

Do you have a minimal example that displays this behavior? Do you have a test that makes it more likely to trigger these messages?

@stale

stale bot commented Mar 19, 2021

[STALE_CLR] This issue has been removed from the stale queue. Please ensure activity to keep it open in the future.

@stale stale bot removed the stale label Mar 19, 2021
@AllanTee

AllanTee commented Mar 21, 2021

Hi,

Same problems. I'm trying to upgrade to LittleFS. Can anyone specify how this guide from zekageri should be followed:
<<For LittleFS I am using board_build.partitions = large_spiffs_16MB.csv in PlatformIO.>>
Obviously the "partitions_custom.csv" file needs editing, but how exactly?

Thank You!
[az-delivery-devkit-v4, PlatformIO, Visual Studio Code]

@stale

stale bot commented Jun 2, 2021

[STALE_SET] This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Jun 2, 2021
@zekageri

zekageri commented Jun 9, 2021

Same problems. I'm trying to upgrade to LittleFS. Can anyone specify how this guide from zekageri should be followed:
<<For LittleFS I am using board_build.partitions = large_spiffs_16MB.csv in PlatformIO.>>
Obviously the "partitions_custom.csv" file needs editing, but how exactly?

The partition file does not need to be edited. You must choose the partition table that suits your ESP32. If you have a standard 4 MB ESP32, you go for the default.csv.
About porting to LITTLEFS, here is a link. There is plenty of explanation there, but if you get stuck you can ask questions.

@stale

stale bot commented Jun 9, 2021

[STALE_CLR] This issue has been removed from the stale queue. Please ensure activity to keep it open in the future.

@stale stale bot removed the stale label Jun 9, 2021
@dduehren

I have an ESP8266 implementation of a web site that uses radio buttons and therefore jQuery to update their values. The problem I have is that this operation occasionally creates a copy of the jQuery file (I have also seen .css files and a couple of the HTML files duplicated, and not always perfectly). Currently it is SPIFFS-based, using server.on. Based on this thread I have many questions that I hope can be answered.

  1. I am naive about debugging. What are the tools you are using for the traces, etc.?
  2. You mention that the SPIFFS max-open-files limit can be changed. I have not been able to figure out how to do this.
  3. The local jQuery file is large, ~98 KB. Can I use a compressed version, and/or should I load the file in chunks?
  4. How would moving to LittleFS help?
  5. Is it recommended to use the addHeader approach when using jQuery and CSS rather than having them load from the HTML page?
  6. I need to learn about server.serveStatic.

Any help is appreciated, and otherwise it's my worklist.
Thanks,

@dduehren

Going to open a separate issue.

@stale

stale bot commented Apr 16, 2022

[STALE_SET] This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Apr 16, 2022
@stale

stale bot commented Apr 30, 2022

[STALE_DEL] This stale issue has been automatically closed. Thank you for your contributions.

@mathieucarbou
Contributor

Hello,

Just to let you know that there has been a lengthy discussion around this problem here: mathieucarbou/ESPAsyncWebServer#165

A fix was implemented (we could call it a workaround or patch, though, because the real fix would be to get rid of the current queue design).

There are also some performance tests available in my fork.
