The fastest WebSocket implementation

Introduction
Adapted photo of William Warby on Unsplash.

This post measures the performance of wtx and other projects to figure out which one is the fastest. If any metrics or procedures described here are flawed, feel free to point them out.

Metrics

Conformance testing has autobahn, the standard automated test suite that verifies client and server implementations, but there isn't an easy and comprehensive benchmark suite for the WebSocket Protocol (at least I couldn't find one), so let's create one.

Enter ws-bench! The program applies three parameters to listening servers, in combinations that try to replicate what could happen in a production environment.

| Parameter | Low | Medium | High |
|---|---|---|---|
| Number of connections | 1 | 128 | 256 |
| Number of messages | 1 | 64 | 128 |
| Transfer memory (KiB) | 1 | 64 | 128 |
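As a rough illustration of how the three parameters combine, the sketch below enumerates every level triple. The values come from the table above; the dictionary layout and the names `LEVELS` and `combinations` are assumptions made for this example and are not taken from ws-bench itself.

```python
from itertools import product

# Values copied from the parameter table; the structure itself is illustrative.
LEVELS = {
    "connections": {"low": 1, "mid": 128, "high": 256},
    "messages": {"low": 1, "mid": 64, "high": 128},
    "memory_kib": {"low": 1, "mid": 64, "high": 128},
}


def combinations():
    """Yield each (connections, messages, memory) level triple with its concrete values."""
    names = ("low", "mid", "high")
    for conn, msg, mem in product(names, repeat=3):
        yield (conn, msg, mem), (
            LEVELS["connections"][conn],
            LEVELS["messages"][msg],
            LEVELS["memory_kib"][mem],
        )


grid = list(combinations())
print(len(grid))  # 27 possible combinations
```

Not every one of these combinations ends up in the final results, since some were discarded during the investigation.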

Number of connections

Tells how well a server can handle multiple connections concurrently. For example, there are single-threaded, concurrent single-threaded and multi-threaded implementations.

In some cases this metric is also influenced by the underlying mechanism responsible for scheduling the execution of workers/tasks.

Number of messages

When a payload is very large, it can be sent as several sequential frames, each holding a portion of the original payload. The logical unit formed by these smaller frames is called a "message" here, and the number of "messages" can measure an implementation's ability to handle their encoding and decoding as well as the network latency (round-trip time).
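To make the frame/message relationship concrete, here is a minimal sketch of fragmentation as described by RFC 6455: the first frame carries the data opcode, the following frames use the continuation opcode, and only the last frame has the FIN flag set. The function names and the tuple representation are assumptions for illustration; real implementations also encode headers, lengths and masking, which are left out here.

```python
from typing import List, Tuple

# Opcodes defined by RFC 6455.
OP_CONTINUATION = 0x0
OP_BINARY = 0x2


def fragment(payload: bytes, frame_size: int) -> List[Tuple[bool, int, bytes]]:
    """Split a payload into (fin, opcode, chunk) tuples forming one message."""
    if frame_size <= 0:
        raise ValueError("frame_size must be positive")
    # An empty payload still produces a single empty frame.
    chunks = [payload[i:i + frame_size] for i in range(0, len(payload), frame_size)] or [b""]
    frames = []
    for index, chunk in enumerate(chunks):
        opcode = OP_BINARY if index == 0 else OP_CONTINUATION
        fin = index == len(chunks) - 1
        frames.append((fin, opcode, chunk))
    return frames


def reassemble(frames: List[Tuple[bool, int, bytes]]) -> bytes:
    """Concatenate the chunks of a fragmented message back into the payload."""
    return b"".join(chunk for _, _, chunk in frames)


frames = fragment(b"abcdefgh", 3)
print(frames)  # [(False, 2, b'abc'), (False, 0, b'def'), (True, 0, b'gh')]
```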

Transfer memory

It is not rare to hear that the cost of a round trip is higher than the cost of allocating memory, which is generally true. Unfortunately, based on this idea some people call the heap allocator indiscriminately without investigating whether doing so hurts performance.

Frames tend to be small but there are applications using WebSocket to transfer different types of real-time blobs. That said, let's investigate the impact of larger payload sizes.
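One common way to avoid an allocation per message is to read into a buffer that is allocated once and reused. The sketch below shows the idea with Python's standard `socket.recv_into`; the helper names are hypothetical, it assumes fixed-size messages for simplicity, and it is not how any of the benchmarked projects manage memory.

```python
import socket


def read_exact_into(sock: socket.socket, view: memoryview) -> None:
    """Fill `view` completely from `sock` without creating intermediate bytes objects."""
    received = 0
    while received < len(view):
        n = sock.recv_into(view[received:])
        if n == 0:
            raise ConnectionError("peer closed the connection early")
        received += n


def receive_messages(sock: socket.socket, count: int, size: int) -> list:
    """Receive `count` messages of `size` bytes, reusing one buffer for all of them."""
    buffer = bytearray(size)  # allocated once, outside the loop
    view = memoryview(buffer)
    first_bytes = []
    for _ in range(count):
        read_exact_into(sock, view)
        first_bytes.append(buffer[0])  # stand-in for real message processing
    return first_bytes


if __name__ == "__main__":
    a, b = socket.socketpair()
    for byte in (1, 2, 3):
        a.sendall(bytes([byte]) * 1024)
    print(receive_messages(b, 3, 1024))  # [1, 2, 3]
    a.close()
    b.close()
```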

Investigation

| Project | Language | Fork | Application |
|---|---|---|---|
| uWebSockets | C++ | https://github.com/c410-f3r/uWebSockets | examples/EchoServer.cpp |
| fastwebsockets | Rust | https://github.com/c410-f3r/fastwebsockets | examples/echo_server.rs |
| gorilla/websocket | Go | https://github.com/c410-f3r/websocket | examples/echo/server.go |
| tokio-tungstenite | Rust | https://github.com/c410-f3r/tokio-tungstenite | examples/echo-server.rs |
| websockets | Python | https://github.com/c410-f3r/regular-crates/blob/main/ws-bench/_websockets.py | _websockets.py |
| wtx | Rust | https://github.com/c410-f3r/regular-crates/tree/main/wtx | examples/web_socket_echo_server_raw_tokio.rs |

In order to try to ensure some level of fairness, all six projects had their files modified to remove writes to stdout, impose optimized builds where applicable and remove SSL or compression configurations.

The benchmark procedure is quite simple: servers listen for incoming requests on different ports, the ws-bench binary is called with all URIs and the resulting chart is generated. In fact, everything is declared in a bash script.
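The shape of such a measurement can be sketched as follows. To stay self-contained, this example uses a plain TCP echo server from Python's `asyncio` as a stand-in for a WebSocket server, so it is not ws-bench and its timings are not comparable to the results in this post. The function names and the `bench(connections, messages, memory_kib)` signature are assumptions for illustration.

```python
import asyncio
import time


async def echo(reader, writer):
    # Stand-in echo server: plain TCP, NOT the WebSocket protocol.
    try:
        while data := await reader.read(65536):
            writer.write(data)
            await writer.drain()
    finally:
        writer.close()


async def client(port, messages, payload):
    reader, writer = await asyncio.open_connection("127.0.0.1", port)
    for _ in range(messages):
        writer.write(payload)
        await writer.drain()
        # Wait for the full echo and verify it before sending the next message.
        if await reader.readexactly(len(payload)) != payload:
            raise RuntimeError("echoed payload does not match")
    writer.close()
    await writer.wait_closed()


async def bench(connections, messages, memory_kib):
    """Return the elapsed time in ms for all clients to finish their echo loops."""
    server = await asyncio.start_server(echo, "127.0.0.1", 0)
    port = server.sockets[0].getsockname()[1]
    payload = bytes(memory_kib * 1024)
    start = time.perf_counter()
    await asyncio.gather(*(client(port, messages, payload) for _ in range(connections)))
    elapsed_ms = (time.perf_counter() - start) * 1000
    server.close()
    return elapsed_ms


if __name__ == "__main__":
    print(f"{asyncio.run(bench(4, 8, 64)):.2f} ms")
```

Each client waits for and checks the echoed payload before sending the next message, so the messages on a given connection are processed strictly one after another.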

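Elapsed time in ms. ❗ marks the fastest result in each row.

| Connections | Messages | Memory | fastwebsockets | gorilla/websockets | tokio_tungstenite | uWebsockets | websockets | wtx_hyper | wtx_raw_async_std | wtx_raw_tokio |
|---|---|---|---|---|---|---|---|---|---|---|
| low | mid | high | 104 | 273 | 102 | 88 | 232 | 64❗ | 67 | 65 |
| low | high | low | 5759 | 5783 | 5784 | 5760 | 5728❗ | 5802 | 5764 | 5736 |
| low | high | mid | 336 | 546 | 235 | 192 | 526 | 160 | 163 | 159❗ |
| low | high | high | 331 | 960 | 360 | 325 | 725 | 250 | 282 | 249❗ |
| mid | low | high | 18 | 22 | 18 | 15 | 31 | 14 | 12❗ | 13 |
| mid | mid | high | 4503 | 5724 | 3959 | 4816 | 9754 | 3514 | 3474❗ | 3498 |
| mid | high | low | 5684❗ | 5800 | 5721 | 5687 | 6681 | 5689 | 5764 | 5684❗ |
| mid | high | mid | 11020 | 13735 | 8365 | 9072 | 19874 | 6937 | 6895❗ | 6933 |
| mid | high | high | 19808 | 23178 | 15471 | 19821 | 38327 | 13759 | 13693❗ | 13749 |
| high | low | low | 52 | 71 | 98 | 46 | 1053 | 52 | 41❗ | 88 |
| high | low | mid | 84 | 86 | 74 | 51 | 1043 | 60 | 50 | 48❗ |
| high | low | high | 124 | 82 | 78 | 57 | 1059 | 55 | 54❗ | 58 |
| high | mid | low | 2987 | 3051 | 3027 | 2955 | 5071 | 2981 | 3000 | 2942❗ |
| high | mid | mid | 20150 | 21475 | 14593 | 18931 | 41368 | 11172 | 10987❗ | 11268 |
| high | mid | high | 41846 | 43514 | 20706 | 21779 | 41091 | 16118 | 15555 | 15524❗ |
| high | high | low | 5828 | 5941 | 5830 | 5790 | 9400 | 5778❗ | 5877 | 5808 |
| high | high | mid | 53756 | 55063 | 44829 | 47312 | 107758 | 36628 | 34333❗ | 37000 |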

Tested on a laptop with an i5-1135G7, a 256GB SSD and 32GB of RAM. Combinations of low and mid were discarded because they showed almost zero values in all instances.

soketto and ws-tools were initially tested but later abandoned due to frequent shutdowns. I didn't dive into the root causes, but they can be added back once the underlying problems are fixed by the authors.

Result

Introduction

wtx as a whole averaged 6350.31 ms, followed by tokio-tungstenite with 7602.94 ms, uWebSockets with 8393.94 ms, fastwebsockets with 10140.58 ms, gorilla/websocket with 10900.23 ms and finally websockets with 17042.41 ms.
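These averages can be reproduced from the results table with a few lines of arithmetic. The sketch below copies the 17 rows in the table's column order and computes the mean of each column, plus one mean over the three wtx columns together. Treating "wtx as a whole" as the mean of all three wtx columns is an assumption, although it does match the quoted figure. The last digit can differ by 0.01 from the numbers above depending on rounding versus truncation.

```python
# Column order follows the results table; all values are in ms.
COLUMNS = [
    "fastwebsockets", "gorilla/websockets", "tokio_tungstenite", "uWebsockets",
    "websockets", "wtx_hyper", "wtx_raw_async_std", "wtx_raw_tokio",
]
ROWS = [
    [104, 273, 102, 88, 232, 64, 67, 65],
    [5759, 5783, 5784, 5760, 5728, 5802, 5764, 5736],
    [336, 546, 235, 192, 526, 160, 163, 159],
    [331, 960, 360, 325, 725, 250, 282, 249],
    [18, 22, 18, 15, 31, 14, 12, 13],
    [4503, 5724, 3959, 4816, 9754, 3514, 3474, 3498],
    [5684, 5800, 5721, 5687, 6681, 5689, 5764, 5684],
    [11020, 13735, 8365, 9072, 19874, 6937, 6895, 6933],
    [19808, 23178, 15471, 19821, 38327, 13759, 13693, 13749],
    [52, 71, 98, 46, 1053, 52, 41, 88],
    [84, 86, 74, 51, 1043, 60, 50, 48],
    [124, 82, 78, 57, 1059, 55, 54, 58],
    [2987, 3051, 3027, 2955, 5071, 2981, 3000, 2942],
    [20150, 21475, 14593, 18931, 41368, 11172, 10987, 11268],
    [41846, 43514, 20706, 21779, 41091, 16118, 15555, 15524],
    [5828, 5941, 5830, 5790, 9400, 5778, 5877, 5808],
    [53756, 55063, 44829, 47312, 107758, 36628, 34333, 37000],
]


def mean(values):
    return sum(values) / len(values)


# Mean per column, then a single mean over the three wtx columns (indices 5 to 7).
per_column = {name: mean([row[i] for row in ROWS]) for i, name in enumerate(COLUMNS)}
wtx_overall = mean([row[i] for row in ROWS for i in (5, 6, 7)])

for name, value in per_column.items():
    print(f"{name}: {value:.2f}")
print(f"wtx (all three variants): {wtx_overall:.2f}")
```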

websockets performed the worst in several tests, but it is unknown whether this behavior could be improved. Perhaps with some modification to the _websockets.py file? Let me know if that is the case.

Among the three metrics, the number of messages was the most impactful because the client always verifies the content sent back from a server, leading to sequential-like behavior. Perhaps the number of messages is not a good parameter for benchmarking purposes.

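To finish, wtx was the fastest in all but one test (low connections, high messages and low memory, where websockets came first by a small margin) and can indeed be labeled the fastest WebSocket implementation, at least among the presented projects and under this methodology.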