Clients time out after broker reaches around 5000 connections

crnivader · April 12, 2023, 1:15pm

Hello,

I am currently stress testing my EMQX MQTT cluster, which is set up on our GKE Kubernetes cluster using Helm. For this testing I am using the emqtt-bench program. However, client connections stop at around 5,000, with an error message indicating a timeout. I have not changed any of the limit configurations, so this seems unusual to me. I also checked the EMQX logs and found nothing that indicates why this is happening.

I even attempted to run the bench program simultaneously from multiple computers, but still encountered the same problem. Any assistance you can provide would be greatly appreciated.

Thank you for your time.

dmif · April 12, 2023, 9:30pm

Hello,

Could you please share the following details:

- The precise version of the EMQX broker and Helm chart that you’re using.
- What kind of traffic are you generating with emqtt_bench? Please share the full command.
- What hardware resources (CPU, RAM) are allocated to the EMQX container?
- Basic OS-level metrics, like CPU load and free memory.

Could you share the logs anyway?

5000 connections is not nearly enough to stress EMQX to the point of becoming unresponsive. Also, timeouts on the client side without any logs showing up on the broker side indicate that the problem likely lies somewhere in the load-balancer setup. Perhaps there is a hard limit on the number of network connections, or something along those lines.

crnivader · April 13, 2023, 1:37pm

Hi, thanks for your help. We kind of “fixed” the issue. We are not really sure what went wrong, but we believe it had something to do with our home routers or something like that. My colleague and I were both trying to connect to the broker. We got to around 11k connections in total from 2 different networks: mine got stuck again at 5k and his at around 6k. When I tried doing the same from 2 different PCs on the same network, they both got to around 2.5k, and then clients started timing out again.

Right now I have a deployment on our cluster that executes this command, instead of running it on our laptops. I only set it to 10k connections, and it worked perfectly. I will be doing a real stress test tomorrow, and I can report back on how it went.

Thanks again, and I hope this helps somebody.

dmif · April 15, 2023, 7:38pm

I see, thanks for the reply. Your home router is likely behind NAT, and I imagine your ISP might be limiting the number of TCP connections.
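
For anyone hitting something similar, one rough way to check whether the network path (rather than the broker) caps connections is to open plain TCP connections from the client machine and count how many succeed before the first failure. The Python sketch below is only an illustration: the function name `count_tcp_connections` and the host in the usage comment are hypothetical, and it does not speak MQTT at all, so it only exercises the TCP layer. Note that the local open-file limit on the client can also cap the count, since each socket uses one file descriptor.

```python
import socket


def count_tcp_connections(host, port, target, timeout=5.0):
    """Open up to `target` TCP connections, holding each one open.

    Returns how many connections were established before the first
    failure (timeout, refusal, or a local resource limit).
    """
    socks = []
    try:
        for _ in range(target):
            try:
                s = socket.create_connection((host, port), timeout=timeout)
            except OSError:
                # Timeouts and refused connections are both OSError subclasses.
                break
            socks.append(s)
        return len(socks)
    finally:
        # Release every socket, whether or not the loop finished.
        for s in socks:
            s.close()


# Hypothetical usage (replace host and port with your own broker endpoint):
#   print(count_tcp_connections("broker.example.com", 1883, 10000))
```

If the count stalls at roughly the same number from one network but goes higher from inside the cluster, that would point at the client-side network rather than at EMQX.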