Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

deleting BLEClient is not safe #4047

Closed
jackjansen opened this issue Jun 1, 2020 · 18 comments
Closed

deleting BLEClient is not safe #4047

jackjansen opened this issue Jun 1, 2020 · 18 comments
Labels
Status: Stale Issue is stale stage (outdated/stuck)

Comments

@jackjansen
Copy link

My BLE client code was structured more-or-less as

BLEClient *client = new BLEClient();
client->connect(address, type);
// interact with client
client->disconnect();
delete client;

but this is unsafe. Sometimes I get a corrupted heap, sometimes I get other crashes, possibly after running this code again (I am creating many connections to many BLE devices over a long time).

I think the problem may be that disconnect() is asynchronous, because often my crashes are in BLEClient::gattClientEventHandler() in the ESP_GATTC_DISCONNECT_EVT case handler, but they can also happen in the BLE thread, or in completely unrelated code.

The workaround I have is to never delete my BLEClient objects but to cache and reuse them, but from the code (and the minimal documentation) it seems like I should have been able to create and delete them on-the-fly.

@jamesi8086
Copy link
Contributor

jamesi8086 commented Jun 1, 2020

From what I tested, you have to wait until client->isConnected returns false before you can delete it. I added a spinloop which checks every 50ms before deleting.
Other methods such as keeping old instances in a list for later garbage-collection does not yeild anything stable. Seems risky to have more than one BLEClient at any time.

Caching and reusing may work, but it did not work for me when I tried to use the same instance to reconnect to different devices.

@chegewara
Copy link
Contributor

BLEClient is deleted inside library after disconnecting.

@jamesi8086
Copy link
Contributor

oh? is it a recent change?

@chegewara
Copy link
Contributor

@buxtronix
Copy link
Contributor

I have just had this issue also. It turns out that GATTC_DISCONNECT (and other events) are propagated to ALL clients, not just the one that disconnected.

Fix is in #4055 - my multi-client code has been pretty stable since i applied it.

@jackjansen
Copy link
Author

jackjansen commented Jun 5, 2020

Hmm. Looking at the BLEClient code a bit more.

I think that maybe adding a BLEDevice::removePeerDevice(m_appId, true); to the destructor is enough to fix the original issue (but unfortunately given the structure of my code I cannot test this hypothesis easily, sorry....).

But connect() looks a bit suspicious to me as well, especially in the case of a second connect. The global m_appid is incremented and the new value registered with BLEDevice::addDevice and with the esp-idf esp_ble_gattc_app_register(), but a previously registered connection id is not unregistered. And because the previous value of the local m_appId is lost that value will never be unregistered...

Together with what @yongkimleng and @buxtronix found this means that you can never delete the BLEClient, and moreover there may always be "ghost callbacks from past connections" when you reuse the BLEClient.

@buxtronix
Copy link
Contributor

The 'ghost callbacks' I've been seeing (addressed in #4064) happen even when no connection was ever closed (ie two new clients connecting after a reboot). Something to note (maybe two issues here?)

@TheNitek
Copy link
Contributor

You probably ran into this problem: #3973

@buxtronix
Copy link
Contributor

No, #3973 is about cleanup with deleted clients. The problem I have seen is with new connections after boot, before any disconnect event has occurred (but also happens on disconnect, but this is for similar reasons relating to bad callbacks ).

@stale
Copy link

stale bot commented Aug 28, 2020

[STALE_SET] This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the Status: Stale Issue is stale stage (outdated/stuck) label Aug 28, 2020
@stale
Copy link

stale bot commented Sep 11, 2020

[STALE_DEL] This stale issue has been automatically closed. Thank you for your contributions.

@stale stale bot closed this as completed Sep 11, 2020
@llemtt
Copy link

llemtt commented Oct 21, 2020

The BLEClient multi connect implementation is still buggy.

I started with reusing and reconnecting multiple BLEClients but it requires at least the following fix in "gattClientEventHandler" or something equivalent:

	case ESP_GATTC_DISCONNECT_EVT: {
			// If we receive a disconnect event, set the class flag that indicates that we are
			// no longer connected.
			m_isConnected = false;
			if (m_pClientCallbacks != nullptr) {
				m_pClientCallbacks->onDisconnect(this);
			}
			esp_ble_gattc_app_unregister(m_gattc_if);
			m_semaphoreOpenEvt.give(ESP_GATT_IF_NONE);
			m_semaphoreRssiCmplEvt.give();
			m_semaphoreSearchCmplEvt.give(1);
			BLEDevice::removePeerDevice(m_appId, true);
		-->     m_gattc_if = ESP_GATT_IF_NONE; // let Client multi reconnection working
			break;
	} // ESP_GATTC_DISCONNECT_EVT

This because upon disconnection BLEClients aren't properly cleaned up.

I tried also to use a new BLEClient each time I need a connection but BLEClient deletion is even more buggy.

@kind3r
Copy link
Contributor

kind3r commented Jan 24, 2021

Any reason why the disconnect client callback is before all the cleanup ?
Like for example:

case ESP_GATTC_DISCONNECT_EVT: {
				if (evtParam->disconnect.conn_id != getConnId()) break;
				// If we receive a disconnect event, set the class flag that indicates that we are
				// no longer connected.
				bool m_wasConnected = m_isConnected;
				m_isConnected = false;
				esp_ble_gattc_app_unregister(m_gattc_if);
				m_semaphoreOpenEvt.give(ESP_GATT_IF_NONE);
				m_semaphoreRssiCmplEvt.give();
				m_semaphoreSearchCmplEvt.give(1);
				BLEDevice::removePeerDevice(m_appId, true);
				if (m_wasConnected && m_pClientCallbacks != nullptr) {
					m_pClientCallbacks->onDisconnect(this);
				}
				break;
		} // ESP_GATTC_DISCONNECT_EVT

kind3r added a commit to kind3r/arduino-esp32 that referenced this issue Jan 24, 2021
Cleanup before calling disconnect callback for safe delete.
Reject other events during registration.
Adresses espressif#4047, espressif#4055
@chegewara
Copy link
Contributor

Any reason why the disconnect client callback is before all the cleanup ?

Probably its my mistake.

kind3r added a commit to kind3r/arduino-esp32 that referenced this issue Jan 25, 2021
@kind3r
Copy link
Contributor

kind3r commented Jan 25, 2021

I think I've reached some degree of stability in my fork. I've been running some re-connection tests in my wannabe BLE gateway development and it seems to be fine so far with very little memory fragmentation.

My client does the following:

  1. Scan to find the device I want to connect to and stop the scan once found
  2. Connect to the device
  3. Discover services
  4. Discover characteristics on some of the services
  5. Read some of the characteristics I need and subscribe to one of them
  6. When the remote devices disconnects (after being idle for 3s) wait 1s and repeat from step 2

Sometimes the device connection fails (device is in a metal encosure and at some distance) and the automatic retry kicks in, this reuses the same BLEClient and seems to work fine.

Sometimes the device disconnects randomly, the BLE gateway deletes the BLEClient uppon disconnection so that also works fine.

I've been running this for the past 30 min, there is still some memory fragmentation from time to time that I can't quite pinpoint yet and I'm not sure if it will be possible to do anything about it. This does not happen after each re-connection cycle but once in a while the heap_caps_get_minimum_free_size(MALLOC_CAP_INTERNAL | MALLOC_CAP_8BIT) just drops. One way I reduced some of the fragmentation in my gateway code was by replacing std::map with fixed sized pre-allocated arrays as adding and deleting items from map allocates and de-allocates memory and is both time consumming and leads to fragmentation. Since the number of possible client connections is limited by the ESP BLEDevice::m_connectedClientsMap is a perfect candidate for for this at some point.

UPDATE: after about 1h of this, WiFiClient failed :D

[E][WiFiClient.cpp:395] write(): fail on fd 58, errno: 11, "No more processes"
||   Miniumum Free DRAM |   Minimum Free IRAM   || 
||      1996            |       41320           ||

@kind3r
Copy link
Contributor

kind3r commented Jan 26, 2021

The NimBLE implementation seems more stable and a whole lot more memory efficient (saves about 70K RAM and 400K flash) and is more suitable for my use case.

@llemtt
Copy link

llemtt commented Jan 26, 2021

@kind3r Your wifi/ble gateway looks interesting, I have already implemented an udp/ble gateway and I ended up using NimBLE because there was no way to let multi clients work.

@kind3r
Copy link
Contributor

kind3r commented Jan 26, 2021

I have not tested multiple clients yet but after the fixes I made to BLE lib I see no reason why it should not be working. My tests basically do the connect/reconnect/disconnect etc. on one device but with a new BLEClient each time. So applying the same techinique to multiple different devices should be fine, as long as there is a limit implemented on the simultaneous connections. Nevertheless, I think the RAM constraints would prevent more than 3 connections at a time, which is neither needed for my project nor practical as BLE sessions should be kept short to save the battery on the devices. Still, with NimBLE there is a lot of room for more connections and other stuff that I want to implement.

me-no-dev pushed a commit that referenced this issue Feb 16, 2021

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature. The key has expired.
- Remove client from the list of devices in case registration fails
- Filter other events not related to registration during registration phase
- Cleanup if connect fails
- Reset if after disconnect
- Disconnect callback *after* cleanup is done so object can be deleted

This fixes some of the issues I had like:
- `BLEClient::connect` hangs up and never recovered because registration failed
- `BLEClient` could not be deleted after disconnect or deletion creating ghost events #4047
- `BLEClient` could not be properly reused after a connection was attempted (successful or not) 

* Cleanup in case of registration and connect failure.
Cleanup before calling disconnect callback for safe delete.
Reject other events during registration.
Adresses #4047, #4055

* Clear if after unregister #4047
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: Stale Issue is stale stage (outdated/stuck)
Projects
None yet
Development

No branches or pull requests

7 participants