Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Private messages no longer received after websocket reconnect #1760

Closed
laszlovl opened this issue Oct 18, 2019 · 11 comments
Closed

Private messages no longer received after websocket reconnect #1760

laszlovl opened this issue Oct 18, 2019 · 11 comments
Labels

Comments

@laszlovl
Copy link
Contributor

Test setup:

  • Latest Converse.js master
  • Firefox
  • A bot that sends a message to a MUC and a private chat every 5 seconds
  • The following config:
        websocket_url: "wss://xxx/ws",
        view_mode: "fullscreen",
        clear_messages_on_reconnection: true,

Everything functions fine at first. Then on the websocket server, I issue a service nginx restart to force the websocket connection to restart. This generates log messages:

 ERROR: Websocket closed unexpectedly converse-core.js:284:17
 2019-10-18T12:10:24.476Z INFO: Status changed to: DISCONNECTED converse-core.js:293:21
 2019-10-18T12:10:24.478Z INFO: RECONNECTING: the connection has dropped, attempting to reconnect. converse-core.js:293:21
 2019-10-18T12:10:24.504Z WARNING: Could not restore session for jid: Error message: The "restore" method can only be used with a BOSH connection. converse-core.js:286:17
 2019-10-18T12:10:24.504Z INFO: Status changed to: CONNECTING converse-core.js:293:21
 2019-10-18T12:10:25.120Z INFO: SASL authentication succeeded. converse-core.js:293:21
 2019-10-18T12:10:25.162Z INFO: Status changed to: undefined converse-core.js:293:21
 2019-10-18T12:10:25.285Z INFO: Status changed to: CONNECTED converse-core.js:293:21
 2019-10-18T12:10:25.287Z INFO: Reconnected 

After this, the bot's messages are still received in the MUC but no longer in the private conversation.

The same problem does not occur when using BOSH instead of websockets.

@laszlovl
Copy link
Contributor Author

FYI, setting enable_smacks: true doesn't make a difference.

@guusdk
Copy link
Member

guusdk commented Oct 25, 2019

I'm unsure if this makes any difference, but to be clear: When you say 'private message', do you mean a direct one-on-one message, or a 'private message' as defined in https://xmpp.org/extensions/xep-0045.html#privatemessage ?

@laszlovl
Copy link
Contributor Author

Right, I just meant a regular one-on-one message.

@laszlovl
Copy link
Contributor Author

I found an easier way to reproduce the disconnect/reconnect: tcpkill. Simply tcpkill port 7443 or tcpkill host xmpp.test.im, wait until tcpkill detects data being sent over the TCP connection (at which time it'lll print some console output and kill the connection), then close tcpkill again.

Using this method, the connection to my server (ejabberd) behaves the same way as initially described in this ticket: it'll succesfully reconnect, but after that one-to-one messages are silently discarded.

@guusdk kindly provided an Openfire environment for me to test on. There, I can't reproduce this exact issue because Converse doesn't even succesfully recover from a websocket disconnection:

The connection to wss://laszlo.goodbytes.im:7443/ws was interrupted while the page was loading. websocket.js:159
 ERROR: Websocket error [object Event] converse-core.js:284:17
 2019-10-25T16:12:04.936Z INFO: Status changed to: CONNFAIL converse-core.js:293:21
 2019-10-25T16:12:04.937Z INFO: Status changed to: DISCONNECTED converse-core.js:293:21
 2019-10-25T16:12:06.937Z INFO: RECONNECTING: the connection has dropped, attempting to reconnect. converse-core.js:293:21
 2019-10-25T16:12:06.942Z WARNING: Could not restore session for jid: [email protected]/converse.js-98066644 Error message: The "restore" method can only be used with a BOSH connection. converse-core.js:286:17
 2019-10-25T16:12:06.942Z INFO: Status changed to: CONNECTING converse-core.js:293:21
 2019-10-25T16:12:07.137Z INFO: SASL authentication succeeded. converse-core.js:293:21
 2019-10-25T16:12:07.159Z INFO: Status changed to: undefined converse-core.js:293:21
 2019-10-25T16:12:07.200Z INFO: Status changed to: CONNECTED converse-core.js:293:21
 2019-10-25T16:12:07.200Z INFO: Reconnected converse-core.js:293:21
 2019-10-25T16:12:07.220Z INFO: Strophe.Connection.prototype.bind called but "do_bind" is false converse-core.js:293:21
 ERROR: null

... after which an empty window with spinner appears indefinitely.

Almost every fatal Converse bug I encountered this year was triggered by a connection disconnect/reconnect, be it BOSH or websocket. Note that also without doing server restarts or tcpkill it's very easy to lose your connection to the server; just move out of wifi range, put your laptop in standby and wake up again, etc. Mac browsers are even worse because they are eager on killing all of your tab's connections after it has been in the background for a few minutes.

I would urge everyone working on either BOSH or websocket (re)connection methods in Converse to employ this tcpkill method during testing to verify that your code properly survives a reconnect.

Any suggestions on adding automated testcases for situations like this are welcome too.

@licaon-kter
Copy link
Contributor

Seems related to #1727 but I'm on ejabberd.

@guusdk
Copy link
Member

guusdk commented Oct 26, 2019

Unsure if this is related: After the connection hiccup, Openfire starts logging messages that suggest that the client is sending data, before it is authenticated.

I'm noticing this message in the logs that @laszlovl provided:

The "restore" method can only be used with a BOSH connection.

Openfire doesn't generate that, so I'm assuming that it is Converse that logs this.

The issue might be explained by a scenario where 'restore' does not working, but Converse nonetheless continues as if the restore was successful.

@jcbrand jcbrand added the bug label Nov 9, 2019
@laszlovl
Copy link
Contributor Author

The difference is that converse-muc has:

     async onReconnection () {
         // ... some lines removed ...
         this.registerHandlers();
     },

while converse-chat has no registerHandlers:

    onReconnection () {
        if (_converse.clear_messages_on_reconnection) {
            this.clearMessages();
        }
        this.announceReconnection();
    },

but only:

_converse.api.listen.on('presencesInitialized', (reconnecting) => (!reconnecting && registerMessageHandlers()));

Adding some loglines and triggering the reconnect shows that indeed converse-muc's registerHandlers is triggered while converse-chat's is not. Any subsequent MUC message is handled properly, while the regular/private message never reaches _converse.handleMessageStanza.

But as I mentioned in the chat a couple of times, this is as far as I can get. I don't understand why converse-muc and converse-chat have different ways of handling their handlers, so I don't know how to fix it properly and need some help here.

I really hate to be that guy that criticizes an overworked and underpaid open source maintainer, but if a bug like this is ignored for a month I don't know if this project has any future. This issue affects a dozen of Converse instances out there, since websocket connections are preferred over bosh connections and the bug is present since at least v5.0.

Just go to any public XMPP/Converse instance such as https://web.404.city/ and login, trigger a reconnect (with tcpkill, or just close & reopen your laptop, or whatever) then observe any private messages sent to it are silently discarded. There are probably hundreds or thousands of users currently suffering from this issue and missing messages, often without even realising it.

For a chat client, it's absolutely paramount that the chat messages you send & receive actually fscking arrive. Infinitely more important than emojis, foreign language support or whatever other gimmick. I am very happy to contribute time and resources towards this goal, but if I can't get the rest here to share that vision I'm just wasting my time.

@jcbrand
Copy link
Member

jcbrand commented Nov 18, 2019

@laszlovl: Thanks for doing some more research, I think you've identified the problem, which is that the message handlers for private chats are explicitly not re-registered upon reconnection.

Looks like this bug was introduced in commit 7200967, and then taken forward during refactoring.

I believe you could have fixed this bug yourself if you used git bisect to find the offending commit and then read the commit message to understand that not registering the handlers was an unintended side-effect.

@laszlovl
Copy link
Contributor Author

@jcbrand Thanks for looking into this! I can confirm that it fixes the problem in my quick initial testing.

I did this research a month ago and discussed it on the chat, but I repeatedly couldn't find you (or anyone else) willing to assist me further. I get that it's impossible to dedicate hours of your time to everybody and their favorite issue/feature request, but if one is not able to get some help in investigating complicated (and severe!) issues like these, it'll be pretty damn hard to get anyone other than you involved in development.

Again, I laud all your work and really don't mean for this to be a personal attack; just trying to convey my point of view and ultimately improve the state of Converse. :)

@marclaporte
Copy link
Member

We saw an issue (using Openfire), which is probably this one. We will test.

@jcbrand
Copy link
Member

jcbrand commented Nov 20, 2019

Fix released in version 5.0.5

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

5 participants