gRPC Throughput: Channel Reuse
Introduction
gRPC models network connections as channels. These channels use HTTP/2 multiplexing, among other techniques, to move data efficiently between a gRPC client and server. One of the struggles the team had during benchmarking was saturating RogueDB with enough data to eliminate gRPC as the bottleneck. Using multiple threads per channel did not perform as expected, and quick ad-hoc experiments only sowed more confusion. Adding threads did not produce a linear increase in throughput. In fact, in stripped-down tests both the client and the server settled at a steady state of approximately 30% CPU utilization with an active bidirectional stream.
After weeks of delays caused by poor performance and tweaking based on intuition, we developed a series of benchmarks to understand the impact of multiple threads per channel versus multiple single-threaded channels. The following benchmarks break down the effects of adding channels and threads on both CPU utilization and total throughput.
Setup
For this benchmark, we use bidirectional streams that match RogueDB's CRUD APIs. The benchmark was run on a single machine (via localhost), specifically an XPS 17 with an Intel i7-13700H and 32GB of RAM, rather than between two machines over a full-fledged network. The communication pattern and message types for the API are not critical because we focus on relative performance. For the full code, see our public GitHub repo.
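To make the op/s figures concrete, here is a minimal sketch of how a throughput number of this kind can be measured. It is not the benchmark from the repository: the function name `measure_ops_per_second` and the `make_worker_op` factory are hypothetical, and the stand-in operation does no network I/O. In a real run, each operation would be one request/response exchange on a bidirectional stream, with the factory deciding whether workers share a channel or each open their own.

```python
import threading
import time


def measure_ops_per_second(make_worker_op, num_workers, duration_s=0.2):
    """Run num_workers threads for duration_s and return total operations per second.

    make_worker_op(index) is a hypothetical factory that returns a zero-argument
    callable performing one operation for that worker.
    """
    counts = [0] * num_workers  # one slot per worker, so no lock is needed
    stop = threading.Event()

    def run(index):
        op = make_worker_op(index)
        while not stop.is_set():
            op()
            counts[index] += 1

    workers = [threading.Thread(target=run, args=(i,)) for i in range(num_workers)]
    start = time.perf_counter()
    for worker in workers:
        worker.start()
    time.sleep(duration_s)
    stop.set()
    for worker in workers:
        worker.join()
    elapsed = time.perf_counter() - start
    return sum(counts) / elapsed


if __name__ == "__main__":
    # Stand-in operation; a real benchmark would issue one RPC exchange here.
    rate = measure_ops_per_second(lambda i: (lambda: sum(range(100))), num_workers=4)
    print(f"{rate:,.0f} op/s")
```

Because only relative performance matters, the same harness can be pointed at both configurations and the results compared against the single-worker baseline.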
Throughput Results
Throughput with multiple threads sharing a single channel:
- 1 Thread: 34,904 op/s
- 2 Threads: 28,181 op/s
- 3 Threads: 41,064 op/s
- 4 Threads: 51,548 op/s
- 5 Threads: 63,042 op/s
- 6 Threads: 69,867 op/s
- 7 Threads: 71,208 op/s
- 8 Threads: 71,667 op/s
- 9 Threads: 70,886 op/s
- 10 Threads: 72,012 op/s
- 11 Threads: 73,629 op/s
- 12 Threads: 74,782 op/s
- 13 Threads: 74,341 op/s
- 14 Threads: 74,836 op/s
- 15 Threads: 75,667 op/s
- 16 Threads: 76,546 op/s

Throughput with multiple single-threaded channels:
- 1 Channel: 37,502 op/s
- 2 Channels: 59,479 op/s
- 3 Channels: 81,193 op/s
- 4 Channels: 102,096 op/s
- 5 Channels: 120,082 op/s
- 6 Channels: 136,999 op/s
- 7 Channels: 157,149 op/s
- 8 Channels: 177,184 op/s
- 9 Channels: 193,363 op/s
- 10 Channels: 196,655 op/s
- 11 Channels: 207,227 op/s
- 12 Channels: 210,675 op/s
- 13 Channels: 208,955 op/s
- 14 Channels: 209,923 op/s
- 15 Channels: 207,747 op/s
- 16 Channels: 205,045 op/s

A multi-threaded channel increases total throughput by ~2x, with most of the gain reached at around 6 threads. Multiple channels increase total throughput by ~5.5x at about 11 channels. CPU utilization with multiple threads on a single channel peaked near 30%, while each additional channel raised baseline usage by around 20%. Although multiple threads do increase total throughput, at least 3 threads are required to overcome the associated performance penalty: 2 threads (28,181 op/s) were slower than 1 thread (34,904 op/s).
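The ratios quoted above can be reproduced directly from the two result lists. The short script below copies the measured op/s values and divides each by the first entry of its own list; the helper name `speedups` is ours and is only for illustration.

```python
# Measured throughput in op/s, copied from the two result lists above.
THREADS_OPS = [34904, 28181, 41064, 51548, 63042, 69867, 71208, 71667,
               70886, 72012, 73629, 74782, 74341, 74836, 75667, 76546]
CHANNELS_OPS = [37502, 59479, 81193, 102096, 120082, 136999, 157149, 177184,
                193363, 196655, 207227, 210675, 208955, 209923, 207747, 205045]


def speedups(ops):
    """Return throughput relative to the first entry, rounded to 2 decimals."""
    baseline = ops[0]
    return [round(value / baseline, 2) for value in ops]


thread_speedups = speedups(THREADS_OPS)
channel_speedups = speedups(CHANNELS_OPS)

if __name__ == "__main__":
    print("2 threads vs 1: ", thread_speedups[1])    # below 1.0, a slowdown
    print("6 threads vs 1: ", thread_speedups[5])
    print("11 channels vs 1:", channel_speedups[10])
```

Index `n - 1` holds the speedup for `n` threads or channels, so the same lists can be used to check any other row.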
Discussion
There is no single method for maximizing gRPC throughput when choosing between reusing a channel and creating new channels in multi-threaded applications. When optimizing for CPU utilization, adding threads to a channel (up to 6) appears to be the best option. When optimizing for raw throughput, creating new channels achieves this at the cost of more CPU usage. Applications aiming for the best utilization with the most throughput ought to consider filling each channel with the optimal number of threads and creating a new channel only after reaching that limit. In theory, combining the ~2x gain from threads with the ~5.5x gain from channels should yield around 11x the single-threaded, single-channel throughput. We did not benchmark this combination.
Our YCSB General Purpose benchmarks only create new channels, for ease of maintenance and understanding. However, a second pass to implement a hybrid approach would likely give a significant boost in overall throughput. Customers using RogueDB who follow this practice also avoid flooding the compute, leaving multiple unrelated applications an equal opportunity to maximize their individual workloads. The penalty for missing the perfect ratio or exceeding the ideal number of channels and threads is minimal, as the benchmarks show.
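One way to picture the hybrid approach is a small pool that hands each worker thread a channel and opens a new channel only once the current one has reached its thread limit. The sketch below is an illustration under stated assumptions, not RogueDB code: the `ChannelPool` class, its `acquire` method, and the `channel_factory` argument are all hypothetical, and the default of 6 threads per channel is taken from the results above.

```python
import threading


class ChannelPool:
    """Hand out channels, filling each to a thread limit before opening another.

    channel_factory is a hypothetical zero-argument callable that creates one
    channel; the pool itself never touches the network.
    """

    def __init__(self, channel_factory, threads_per_channel=6):
        if threads_per_channel < 1:
            raise ValueError("threads_per_channel must be at least 1")
        self._factory = channel_factory
        self._limit = threads_per_channel
        self._channels = []
        self._users = []  # number of workers assigned to each channel
        self._lock = threading.Lock()

    def acquire(self):
        """Return the channel the calling worker should use."""
        with self._lock:
            if not self._channels or self._users[-1] >= self._limit:
                self._channels.append(self._factory())
                self._users.append(0)
            self._users[-1] += 1
            return self._channels[-1]

    @property
    def channel_count(self):
        return len(self._channels)


if __name__ == "__main__":
    # Stand-in factory; with the Python gRPC package this could be something
    # like lambda: grpc.insecure_channel("localhost:50051") (address assumed).
    pool = ChannelPool(channel_factory=object)
    for _ in range(13):
        pool.acquire()
    print(pool.channel_count)  # 13 workers at 6 per channel need 3 channels
```

Depending on the gRPC implementation, channels created with identical arguments to the same target may share an underlying connection. The gRPC performance guidance suggests giving each pooled channel a distinct channel argument to prevent that reuse, so a real factory may need to account for it.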
Conclusion
Using multiple channels, rather than reusing a single channel, provides the expected near-linear increase in throughput: roughly 20,000 op/s per added channel, up to about 9 channels. The trade-off is higher server-side CPU utilization, which leaves a theoretical 2x efficiency gain from multiple threads per channel on the table. Multiple threads per channel use the server-side CPU more efficiently but deliver a smaller overall throughput increase. Best practice is to assign multiple threads per channel (up to 6) and to create a new channel only after the previous channel has reached that limit.
