bug: filter delay errors #4459
This can be ignored, as it is very much possible that the unsubscribe is sent for each pubsubTopic separately and includes all contentTopics that may have been subscribed to independently. But I did notice that when I bring my status-desktop down, the number of subscriptions does not reduce on the fleet node, whereas I do see a new subscription when I bring status-desktop back up. This indicates that some of the filter subscriptions are not being unsubscribed explicitly when the node goes down, leaving open subscriptions on the fleet node.
Created this issue, which would help with such debugging in the future.
This is what probably needs to be done in status-go, or fix #4459 (comment), since when desktop is brought down gracefully the filter subscription is not cleared in nwaku. This can be solved by sending a filter unsubscribeAll when status-go is shutting down, or by unsubscribing from the topics that are currently not being unsubscribed from.
Do note that, as per @richard-ramos, we should not wait for the response to unsubscribeAll; rather, this should be done in a fire-and-forget manner.
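For illustration, here is a minimal sketch of what a fire-and-forget unsubscribeAll on shutdown could look like. The `FilterClient` interface and the `UnsubscribeAll` signature are hypothetical stand-ins, not the actual status-go/go-waku API:

```go
package shutdown

import (
	"context"
	"log"
	"time"
)

// FilterClient is a hypothetical minimal interface for the filter light client;
// the real client type in status-go/go-waku may differ.
type FilterClient interface {
	UnsubscribeAll(ctx context.Context) error
}

// shutdownFilter sends unsubscribeAll in a fire-and-forget manner: the request
// is bounded by a short timeout and shutdown never blocks on the response.
func shutdownFilter(fc FilterClient) {
	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	go func() {
		defer cancel()
		if err := fc.UnsubscribeAll(ctx); err != nil {
			// Best effort: log and keep shutting down.
			log.Printf("filter unsubscribeAll failed (ignored): %v", err)
		}
	}()
	// The caller proceeds with the rest of the shutdown immediately.
}
```

The timeout only bounds the background request so the goroutine cannot linger; the node itself does not wait for the service node's reply.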
I have not been able to reproduce this issue in my testing; I have not noticed any significant delays in receiving messages. The maximum delay I observed was ~3 seconds. @churik @anastasiyaig I would need more information as to when the delay with Filter is noticed (it was mentioned that a delay of about 60 seconds was observed).
We did have a try yesterday, and I don't see any delay from my end. However, I know that the 'founder' of that issue was the autotests on the mobile side; maybe we can try to launch them again? @pavloburykh wdyt? Can we keep this issue open for a while? Perhaps massive dogfooding will confirm the absence/presence of it; so far I can't reproduce it either.
Hey @anastasiyaig @chaitanyaprem!
Totally agree. Let's keep it open for a while, as our e2e tests are still catching issues that might be related. Here is an example from a nightly e2e run (December 27):
Receiver_device.log.zip I am currently on vacation till January 2. I will try to investigate and provide more details when I am back, but from what I noticed, this issue is kinda random. Most of the time messages are delivered instantly, but sometimes a delay occurs. Please note that I do not take into consideration cases with online/offline handling or returning from the background, as those are separate issues, described here by @churik.
After going through the logs, it looks like the connection to the filter node was dropped after 1 or 2 messages were received. This could be the reason why the rest of the messages did not get delivered, although I could not find a reason as to why the connection was terminated and not re-established. I don't have enough info to know if other filter peers are connected at this point. I am also noticing that the node is connecting to the go-waku.test and go-waku.prod fleets; not sure if that is also causing some issue. @richard-ramos, I don't think it is a good idea to connect these fleets to the shards.test fleet.
Looks like the lightpush delay fix is already present in this version, so that could not be the cause of the issue.
Hello everyone! Sharing fresh logs from the failed e2e tests, which are still catching data delivery issues. Here is an example from a PR e2e run (January 3): the sender's device sends a "something to reply to" message to the community channel
Can we try with this PR? status-im/status-mobile#18395 It adds extra logging to go-waku and also updates it to the latest version.
Thanks @richard-ramos! We will re-run the tests and provide new logs based on the results.
@richard-ramos I have checked the e2e results in status-im/status-mobile#18395. This time I do not see cases where data has not been delivered at all, although there are a couple of examples where data has been delivered with a rather long delay. For example:
Thanks for the logs. Taking a further look at the logs for both Receiver and Sender for peer disconnections, I have noticed a lot of peer disconnections in both logs. Oddly, the time intervals between disconnections towards fleet nodes are the same for both receiver and sender. Towards peer
Related issue in the infra repo, status-im/infra-status#27, that should address this issue.
Hey team! The situation with data delivery has definitely improved after deploying status-im/infra-status#27, but unfortunately we still encounter some issues. Here is an example caught by today's e2e run: a contact request has not been delivered to the receiver
I think I saw that recently too but was not able to reproduce it consistently... status-im/status-desktop#13270
A quick look at the logs shows that the receiver did not have any filterPeers, which is why the contact request was not received.
That is something I was observing a lot yesterday, actually. The app was saying that there is no internet connection when that was not the case at all, and the peer count was jumping like crazy (from 1 to 6 and back).
Also, one more observation I had: on mobile there is no notification showing we are offline even when the peer count is 0.
@chaitanyaprem thank you for this observation. I will create a related issue in the mobile repo.
After looking into the logs further, I am finding something suspicious: after losing all peer connections, discv5 seems to have been stopped and not restarted, which is preventing the node from discovering and connecting to peers. It could be due to the issue @qfrank has identified here, or something else. In the meantime, I will try to find out if there is something else going on.
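As a rough illustration of the kind of safeguard that would avoid staying in that state, a watchdog could restart discovery once all peers are lost. The `PeerCounter` and `Discovery` interfaces below are hypothetical stand-ins, not the real go-waku discv5 types:

```go
package discwatch

import (
	"context"
	"time"
)

// PeerCounter and Discovery are hypothetical interfaces for illustration only.
type PeerCounter interface {
	PeerCount() int
}

type Discovery interface {
	IsRunning() bool
	Start(ctx context.Context) error
}

// watchDiscovery periodically checks whether the node has lost all peers while
// discovery is stopped, and restarts discovery in that case so the node can
// find and connect to peers again.
func watchDiscovery(ctx context.Context, peers PeerCounter, disc Discovery) {
	ticker := time.NewTicker(30 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if peers.PeerCount() == 0 && !disc.IsRunning() {
				_ = disc.Start(ctx) // best effort restart
			}
		}
	}
}
```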
Hi @chaitanyaprem! Here is an example of a failed e2e run from this PR: a message has not been delivered in a community channel
Thanks!
After analyzing the logs further, it looks like for some reason the fleet node thinks the subscription is not there, whereas the app thinks it is subscribed. One possible reason I can think of is that some jumbling of subscribe and unsubscribe for the same contentTopic happened and the requests went out of sequence to the fleet node, which may have caused this. [Screenshot: logs showing unsubscribe and subscribe around the same time, which may cause the messages to go out of sequence.]
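One way to avoid that kind of reordering, sketched under the assumption that subscribe/unsubscribe requests can be serialized per contentTopic before being sent to the service node. The types and `send` callbacks below are hypothetical, not the actual go-waku filter client:

```go
package filterops

import (
	"context"
	"sync"
)

// topicOps serializes subscribe/unsubscribe requests per contentTopic so they
// reach the service node in the order they were issued.
type topicOps struct {
	mu    sync.Mutex
	locks map[string]*sync.Mutex
}

func newTopicOps() *topicOps {
	return &topicOps{locks: make(map[string]*sync.Mutex)}
}

func (t *topicOps) lockFor(contentTopic string) *sync.Mutex {
	t.mu.Lock()
	defer t.mu.Unlock()
	l, ok := t.locks[contentTopic]
	if !ok {
		l = &sync.Mutex{}
		t.locks[contentTopic] = l
	}
	return l
}

// Subscribe and Unsubscribe hold the per-topic lock for the duration of the
// request, so an unsubscribe issued after a subscribe for the same contentTopic
// cannot overtake it on the wire. The send callback stands in for the real
// filter client request.
func (t *topicOps) Subscribe(ctx context.Context, contentTopic string, send func(context.Context) error) error {
	l := t.lockFor(contentTopic)
	l.Lock()
	defer l.Unlock()
	return send(ctx)
}

func (t *topicOps) Unsubscribe(ctx context.Context, contentTopic string, send func(context.Context) error) error {
	l := t.lockFor(contentTopic)
	l.Lock()
	defer l.Unlock()
	return send(ctx)
}
```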
This could be due to the same issue mentioned here. Do you have any debug logs around this time? I can take a look and confirm.
Hey @chaitanyaprem! Just to make sure I understand correctly: you need some more logs, specifically from this PR status-im/status-mobile#18492, correct?
@pavloburykh in case you have an option to apply
No, I am assuming the issue @anastasiyaig faced could be similar to what has been observed in this case (which was fixed by @qfrank). But I can only confirm that if I have logs from @anastasiyaig from when she faced the issue.
Hey @anastasiyaig! We have debug level logging applied by default in mobile PR builds.
I just reproduced it again. Attaching a fresh geth.log @chaitanyaprem
Based on the updated logs shared by you (unable to upload them here because the file size is too big), it looks like the message loss is being caused by peer scores going down. I am not sure why this could be happening. Any thoughts @cammellos @vitvly @richard-ramos?
I believe this problem might be happening due to status-im/status-desktop#13267. I suspect that a shard assignment removal or change happened before, but the private key used originally for the community was not removed, so the topic validation in relay would still be applying a wrong key and end up penalizing peers by mistake. If the problem is due to that, then it is likely fixed in #4627 (this is already in desktop
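To illustrate the failure mode being described (this is not the actual status-go validator): a gossipsub topic validator that still holds a stale key will reject every message, and in go-libp2p-pubsub a ValidationReject result counts against the forwarding peer's score. The ed25519 key and the `extractSig` helper are simplifications for the sketch; the real community validation uses different key material and message layout:

```go
package relayvalidation

import (
	"context"
	"crypto/ed25519"

	pubsub "github.com/libp2p/go-libp2p-pubsub"
	"github.com/libp2p/go-libp2p/core/peer"
)

// communityValidator returns a topic validator bound to a (possibly stale)
// public key. If the community's key changed but this key was not updated,
// every legitimate message fails verification, gets rejected, and the peers
// forwarding it are penalized by gossipsub's peer scoring.
func communityValidator(
	pubKey ed25519.PublicKey,
	extractSig func(*pubsub.Message) (payload, sig []byte), // hypothetical helper
) func(context.Context, peer.ID, *pubsub.Message) pubsub.ValidationResult {
	return func(ctx context.Context, from peer.ID, msg *pubsub.Message) pubsub.ValidationResult {
		payload, sig := extractSig(msg)
		if !ed25519.Verify(pubKey, payload, sig) {
			// With a wrong/stale key this branch is hit for every message.
			return pubsub.ValidationReject
		}
		return pubsub.ValidationAccept
	}
}
```

A validator like this would typically be registered on the pubsub instance for the shard's pubsub topic via `RegisterTopicValidator`; with the correct, current key the same code accepts messages and no peers are penalized.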
The changes mentioned by @richard-ramos have been merged to status-desktop master; the release binary for this PR (status-im/status-desktop#13372) should be good to use for testing. @pavloburykh @anastasiyaig could you confirm whether this issue has been resolved or not?
The changes mentioned by @richard-ramos fix only one of the issues observed above, i.e. why relay peer scores go down, causing message loss. The issue with Filter subscribe/unsubscribe is separate and still needs to be analyzed to figure out where the problem is. Ref #4459 (comment). This issue is becoming a dumping ground for multiple issues; let us close it and open separate issues to track them independently and avoid confusion.
Created a separate issue for Filter unsubscribe and subscribe going out of sequence: #4659
Created a separate issue for the relay message loss observed due to lower peer scores: #4660. Hence closing this one to avoid more issues being reported here. Please open new issues for anything else that is observed.
Quoting @richard-ramos from a Discord conversation:
Quoting @chaitanyaprem from a Discord conversation:
As of now, I see a lot of subscribe requests when I start my desktop app (but those could just be for different contentTopics) and unsubscribe requests only per pubsubTopic (which could just include all contentTopics that were subscribed).
Unable to debug further because contentTopics are not logged and there is no API to confirm which FilterSubscriptions are currently active.
Maybe it would be good to have some Admin APIs to query the list of Filter subscriptions, and similar useful APIs for monitoring other protocols as well. This may help node operators, and in turn the Waku team, to debug issues.
For now, I will try to figure out another way to determine what could be causing this issue, and in the meantime recommend using unsubscribeAll while shutting down or exiting status-go.
A couple more observations we noticed during this session:
In case peers are subscribed but are not available/alive, for each filterPush we try to dial the peers, which seems like a bad strategy. If peers are down for a few minutes, this leads to a lot of dials from the filterFullNode (which doesn't look optimal).
Secondly, the idle/inactive period after which a filterFullNode considers a subscription to be invalid may need to be reduced, to maybe 30 minutes or so (not sure what the optimal value is at this point). Having such idle subscriptions hang around for a long time could be a minor attack vector on a service node: an attacker could just spawn peers randomly, keep hitting it with subscriptions and never unsubscribe, leaving the filterFullNode wasting resources dialing peers for each message. Not sure if this is a big issue, but something to think about.
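A small sketch of both suggestions, using hypothetical types rather than nwaku's actual filter implementation: prune subscriptions that have been idle longer than a configurable period, and skip redialing peers that recently failed instead of dialing them on every filterPush:

```go
package filterserver

import (
	"sync"
	"time"
)

// subscription is a hypothetical record a service node could keep per subscriber.
type subscription struct {
	peerID     string
	lastActive time.Time // last subscribe/ping or successful push
	lastFailed time.Time // last failed dial attempt
}

type subscriptionStore struct {
	mu          sync.Mutex
	subs        map[string]*subscription
	maxIdle     time.Duration // e.g. 30 * time.Minute, as suggested above
	dialBackoff time.Duration // skip redial attempts within this window
}

// pruneIdle removes subscriptions that have seen no activity for maxIdle, so
// stale subscribers do not keep consuming dial attempts on the service node.
func (s *subscriptionStore) pruneIdle(now time.Time) {
	s.mu.Lock()
	defer s.mu.Unlock()
	for id, sub := range s.subs {
		if now.Sub(sub.lastActive) > s.maxIdle {
			delete(s.subs, id)
		}
	}
}

// shouldDial returns false while a peer is inside its backoff window after a
// failed dial, avoiding one dial per filterPush towards unreachable peers.
func (s *subscriptionStore) shouldDial(sub *subscription, now time.Time) bool {
	return now.Sub(sub.lastFailed) > s.dialBackoff
}
```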
QA Test Scenario Description
Scenario 1
1.
2.
Scenario 2
1.
2.