In June 2018 users started reporting slack-side disconnects in their Ruby bots, reported via slack-ruby-client#208, regardless of the concurrency library (Eventmachine or Celluloid) used. I was already observing this behavior in my larger production bots, especially playplay.io with hundreds of teams.
Because this was a new behavior I assumed that the client was not seeing or handling some new type of communication error correctly at the websocket library layer. This has been an issue before, and in the case of Celluloid, IO exceptions had to be handled by emitting a
:close event here. That is, the websocket library was seeing a problem and raising an exception, but the client didn’t properly emit a close event, so the bots just sat there waiting for something to happen.
Faye::WebSocket::Client implementation has not seen a similar problem as it deferred reading data to
WebSocket::Driver::StreamReader, which seemed to bubble up disconnects properly until these reports.
The server-side disconnects had started happening without any code changes, so naturally I assumed this was new behavior at Slack and opened a ticket on June 13. Slack quickly looped in @aoberoi from the SDKs team to help, but he found nothing wrong on the Slack side.
To relieve the immediate problem I stuck a restart into my bots (eg. slack-strava@bc29324) and began seeing about a dozen of these server-side disconnects every hour.
Gross, but effective.
Because both the asynchronous library and the websocket implementation were used by thousands of projects, I naturally assumed slack-ruby-client would be more likely to have a bug. The celluloid-io code that I wrote was not detecting server-side disconnects occasionally, therefore we must have not been using celluloid-io correctly in concurrency/celluloid.rb. I summarized my conclusion in celluloid-io#187 and asked for help.
It turned out that Celluloid::IO was effectively unmaintained and I was told to switch to async. While that was a viable idea, replacing the concurrency library could easily create many new bugs without fixing the underlying issue. I wanted to understand the root cause, first.
Websocket Ruby Driver
Interestingly, both Celluloid and EventMachine used websocket-ruby-driver and, since both concurrency libraries exhibited a similar problem, I opened websocket-driver-ruby#61 and asked for ideas. I got a really helpful response from @ioquatix, the author of async.
I didn’t realize the socket
read could return
nil, meaning something like a disconnect. A read of length 0 returned
nil. You also needed to handle empty strings, which may have meant that the client was disconnected. That’s
Adding the code to handle all of the above in slack-ruby-client#216 solved nothing.
What else could go wrong?
There was some speculation that the bug was in
Celluloid::IO::SSLSocket. It wasn’t.
Having no good ideas I ventured into replacing EventMachine and Celluloid with Async in slack-ruby-client#219 with a lot of help. That solved nothing, as expected, but replaced an unsupported library with a supported one.
Websocket-Level Ping Monitoring
We naturally reached the obvious conclusion that there was no bug and that the protocol required an active heartbeat, and that there was none implemented.
@RodneyU215 from Slack stepped in and added ping monitoring in slack-ruby-client#226 following Slack’s RTM guidelines (see Ping and Pong). The entire discussion and the evolution of that code is worth a read. I notably learned that Ruby
Time.now was not monotonic.
I found the final code in slack-ruby-client#226 to be a rather elegant and straightforward example of a complex asynchronous process.
The active ping monitoring was released in slack-ruby-client 0.14.1, slack-ruby-bot 0.12.0 and slack-ruby-bot-server 0.9.0 and seemed to have worked well, short of slack-ruby-client#257 being watched.
Again, much thanks to Slack’s team for debugging, support and code, including Rodney Urquhart, Ankur Oberoi and Jason Roche, as well as Samuel Williams for his Async library, his adult supervision and for catching tons of non-obvious hard bugs across multiple code reviews.