
Instant Messaging at LinkedIn: Scaling to Hundreds of Thousands of Persistent Connections on One Machine

We recently introduced Instant Messaging on LinkedIn, complete with typing indicators and read receipts. To make this happen, we needed a way to push data from the server to mobile and web clients over persistent connections instead of the traditional request-response paradigm that most modern applications are built on. In this post, we’ll describe the mechanisms we use to instantly send messages, typing indicators, and read receipts to clients as soon as they arrive. We’ll describe how we used the Play Framework and the Akka Actor Model to manage Server-sent events-based persistent connections. We’ll also provide insights into how we did load testing on our server to manage hundreds of thousands of concurrent persistent connections in production. Finally, we’ll share optimization techniques that we picked up along the way.

Server-sent events

Server-sent events (SSE) is a technology where a client establishes a normal HTTP connection with a server and the server pushes a continuous stream of data on the same connection as events happen, without the need for the client to make subsequent requests. The EventSource interface is used to receive server-sent events or chunks in text/event-stream format without closing the connection. Every modern web browser supports the EventSource interface, and there are readily-available libraries for iOS and Android.

In our initial implementation, we chose SSE over Websockets because it works over traditional HTTP, and we wanted to start with a protocol that would provide the maximum compatibility with LinkedIn’s widespread member base, which accesses our products from a variety of networks. Having said that, Websockets is a much more powerful technology for bi-directional, full-duplex communication, and we will be upgrading to it as the protocol of choice when possible.

Play Framework and Server-sent events

At LinkedIn, we use the Play Framework for our server applications. Play is an open source, lightweight, fully asynchronous framework for Java and Scala applications. It provides out-of-the-box support for EventSource and Websockets. To maintain hundreds of thousands of persistent SSE connections in a scalable fashion, we use Play’s integration with Akka. Akka allows us to raise the abstraction model and use the Actor Model to assign an Actor to each connection that the server accepts.

The code snippet below demonstrates the use of Play’s EventSource API to accept an incoming connection in the application controller and assign it to be managed by an Akka Actor. The Actor is now responsible for the lifecycle of this connection, and so sending a chunk of data to the client as soon as an event happens is as simple as sending a message to the Akka Actor.
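The embedded code from the original post did not survive this repost, so here is a minimal sketch of the idea, assuming a recent Play version with Akka Streams-based EventSource support; the controller name, buffer size, and the ConnectionActor wrapper (sketched after the next paragraph) are illustrative assumptions rather than LinkedIn's actual code.

import akka.actor.ActorSystem
import akka.stream.OverflowStrategy
import akka.stream.scaladsl.Source
import javax.inject.Inject
import play.api.http.ContentTypes
import play.api.libs.EventSource
import play.api.mvc.{AbstractController, ControllerComponents}

class ClientConnectionController @Inject()(cc: ControllerComponents, system: ActorSystem)
    extends AbstractController(cc) {

  // Accept an SSE connection and hand its lifecycle over to an Akka Actor.
  def listen = Action {
    val source = Source
      .actorRef[String](bufferSize = 32, overflowStrategy = OverflowStrategy.dropHead)
      .mapMaterializedValue { sseConnection =>
        // Wrap the raw connection handle in a dedicated actor (see sketch below).
        system.actorOf(ConnectionActor.props(sseConnection))
      }
    Ok.chunked(source via EventSource.flow).as(ContentTypes.EVENT_STREAM)
  }
}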

Notice how the only way to interact with the connection is to send a message to the Akka Actor managing that connection. This is fundamental to what makes Akka asynchronous, non-blocking, highly performant, and designed for a distributed environment. The Akka Actor in turn handles the incoming message by forwarding it to the EventSource connection that it manages.
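Continuing the sketch (again with assumed names, not LinkedIn's actual implementation), the per-connection actor is little more than a forwarder:

import akka.actor.{Actor, ActorRef, Props}

object ConnectionActor {
  def props(sseConnection: ActorRef): Props = Props(new ConnectionActor(sseConnection))
}

// Owns exactly one EventSource connection: every payload this actor receives is
// forwarded to the ActorRef materialized by Source.actorRef, which emits it to
// the client as a server-sent event.
class ConnectionActor(sseConnection: ActorRef) extends Actor {
  def receive: Receive = {
    case payload: String => sseConnection ! payload
  }
}

Pushing a new message, typing indicator, or read receipt to a member then reduces to looking up the right ConnectionActor and sending it a message.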

This is it. It’s this simple to manage concurrent EventSource connections using the Play Framework and the Akka Actor model.

How do we know that this works well at scale? Read the next few sections to find out.

Load-testing with real production traffic

There is only so much that one can simulate with load-testing tools. Ultimately, a system needs to be tested against not-easily-replicable production traffic patterns. But how do we test against real production traffic before we actually launch our product? For this, we used a technique that we like to call a “dark launch.” This will be discussed in more detail in a later post.

For the purposes of this post, let’s say that we are able to generate real production traffic on a cluster of machines running our server. An effective way to test the limits of the system is to direct increasing amounts of traffic to a single node to uncover problems that you would have faced if traffic had increased manifold on the entire cluster.

As with anything else, we hit some limits, and the following sections are a fun story of how we eventually reached a hundred thousand connections per machine with simple optimizations.

Limit I: Maximum number of pending connections on a socket

During some of our initial load testing, we ran into a strange problem where we were unable to open more than approximately 128 concurrent connections at once. Please note that the server could easily hold thousands of concurrent connections, but we could not add more than about 128 connections simultaneously to that pool of connections. In the real world, this would be the equivalent of having 128 members initiate a connection to the same machine at the same time.

After some investigation, we learned about the following kernel parameter.

net.core.somaxconn

This kernel parameter is the size of the backlog of TCP connections waiting to be accepted by the application. If a connection indication arrives when the queue is full, the connection is refused. The default value for this parameter is 128 on most modern operating systems.

Bumping up this limit in /etc/sysctl.conf helped us get rid of the “connection refused” issues on our Linux machines.
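For example, the change amounts to a single line in /etc/sysctl.conf (the value below is illustrative, not necessarily the one we used), followed by a reload:

net.core.somaxconn = 1024

$ sudo sysctl -p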

Please note that Netty 4.x and above automatically pick up the OS value for this parameter and use it while creating the Java ServerSocket. However, if you would rather configure this on the application level too, you can set the following configuration parameter in your Play application.

play.server.netty.option.backlog=1024

Limit II: JVM thread count

A few hours after we allowed a significant percentage of production traffic to hit our server for the first time, we were alerted to the fact that the load balancer was unable to connect to a few of our machines. On further investigation, we saw the following all over our server logs.

java.lang.OutOfMemoryError: unable to create new native thread

A graph of the JVM thread count on our machines corroborated the fact that we were dealing with a thread leak and running out of memory.

We took a thread dump of the JVM process and saw a large number of sleeping threads, all parked in the same state.

On further investigation, we found that we had a bug in Netty’s idle timeout support on LinkedIn’s fork of the Play framework where a new HashedWheelTimer instance was being created for each incoming connection. This patch demonstrates the bug pretty clearly.
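To make the shape of the bug concrete, here is an illustrative sketch (not the patch referenced above, and with hypothetical names): a HashedWheelTimer spawns its own worker thread, so creating one per connection leaks a thread per connection, while the fix is to share a single timer instance.

import io.netty.util.{HashedWheelTimer, Timeout, TimerTask}
import java.util.concurrent.TimeUnit

object IdleTimeouts {
  // Buggy pattern: a new HashedWheelTimer per connection means one extra
  // native thread per connection.
  //   val timer = new HashedWheelTimer()   // inside the per-connection path

  // Fixed pattern: one shared timer instance for all connections.
  private val sharedTimer = new HashedWheelTimer()

  def scheduleIdleTimeout(onIdle: () => Unit, idleSeconds: Long): Timeout =
    sharedTimer.newTimeout(new TimerTask {
      def run(timeout: Timeout): Unit = onIdle()
    }, idleSeconds, TimeUnit.SECONDS)
}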

If you hit the JVM thread limit, chances are that there is a thread leak in your code that needs to be fixed. However, if you find that all your threads are actually doing useful work, is there a way to tweak the system to let you create more threads and accept more connections?

The answer is an interesting one: it comes down to how available memory limits the number of threads that can be created on a JVM. The stack size of a thread determines the memory available for static memory allocation. Thus, the absolute theoretical maximum number of threads is a process’s user address space divided by the thread stack size. However, the reality is that the JVM also uses memory for dynamic allocation on the heap. With a few quick tests with a small Java process, we could verify that as more memory is allocated for the heap, less is available for the stack. Thus, the limit on the number of threads decreases with increasing heap size.

To summarize, you can increase the thread count limit by decreasing the stack size per thread (-Xss) or by decreasing the memory allocated to the heap (-Xms, -Xmx).
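Purely as an illustration (the jar name and sizes here are hypothetical), both knobs show up on the launch command line:

$ java -Xss256k -Xms2g -Xmx2g -jar your-server.jar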

Limit III: Ephemeral port exhaustion

We did not actually hit this limit, but we wanted to mention it here because it’s a common limit that one can hit when maintaining hundreds of thousands of persistent connections on a single node. Every time a load balancer connects to a server node, it uses an ephemeral port. The port is associated with the connection only for the duration of the connection and thus referred to as “ephemeral.” When the connection is terminated, the ephemeral port is available to be reused. Since persistent connections don’t terminate like usual HTTP connections, the pool of available ephemeral ports on the load balancer can get exhausted. It’s a condition where new connections cannot be created because the OS has run out of the port numbers allocated to establish new local sockets. There are various techniques to overcome ephemeral port exhaustion on modern load balancers, but those are outside the scope of this post.
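For reference, on a Linux host the ephemeral range being drawn from can be inspected with the command below (tuning it on the load balancer itself is, as noted, outside the scope of this post):

$ cat /proc/sys/net/ipv4/ip_local_port_range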

We were lucky to have a very high limit of 250K connections per host possible from the load balancer. However, if you run into this limit, work with the team managing your load balancers to increase the limit on the number of open connections between the load balancer and your server nodes.

Limit IV: File descriptors

Once we had significant production traffic flowing to about 16 nodes in one data center, we decided to test the limit for the number of persistent connections each node could hold. We did this by shutting down a couple of nodes at a time so that the load balancer directed more and more traffic to the remaining nodes. This produced a beautiful graph of the number of file descriptors used by our server process on each machine, which we internally dubbed the “caterpillar graph.”

A file descriptor is an abstract handle in Unix-based operating systems that is used to access a network socket, among other things. As expected, more persistent connections per node meant more allocated file descriptors. When only 2 of the 16 nodes were live, each of them was using 20K file descriptors. When we shut down one of them, we saw the following error in the logs of the remaining one.

java.net.SocketException: Too many open files

We had hit the per-process file descriptor limit when all connections were directed to one node. The file descriptor limit for a running process can be seen in the following file under Max open files.

$ cat /proc/<pid>/limits

Max open files 30000

This can be bumped to 200K (as an example) by adding the following lines to /etc/security/limits.conf:

<process username> soft nofile 200000

<process username> hard nofile 200000

Note that there is also a system-wide file descriptor limit, a kernel parameter that can be tweaked in /etc/sysctl.conf.

fs.file-max
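For example (the value is illustrative), this also lives in /etc/sysctl.conf:

fs.file-max = 500000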

We bumped the per-process file descriptor limit on all our machines and, voila, we could easily throw more than 30K connections to each node now. What limit do we hit next…?

Limit V: JVM heap

Next up, we repeated the above process with about 60K connections directed to each of two nodes, and things started to go south again. The number of allocated file descriptors, and correspondingly, the number of active persistent connections, suddenly tanked and latencies spiked to unacceptable levels.

On further investigation, we found that we had run out of our 4GB JVM heap space. This produced another beautiful graph demonstrating that each GC run was able to reclaim less and less heap space till we spiraled toward being maxed out.

We use TLS for all internal communication inside our data centers for our instant messaging services. In practice, each TLS connection seems to consume about 20KB of memory on the JVM, and that can add up quickly as active persistent connections grow, leading to an out-of-memory situation like the above.
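As a rough illustration of how quickly that adds up: at 100K concurrent connections, 20KB each works out to roughly 2GB of heap for TLS connection overhead alone, before counting any of the application’s own objects.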

We bumped up the JVM heap space to 8GB (-Xms8g, -Xmx8g) and re-ran our tests to direct more and more connections to a single node, till we ran out of memory again at about 90K connections on one node and connections started to get dropped.

Indeed, we had run out of heap space once again, this time at 8GB.

We had not run out of processing power, as CPU utilization was still well under 80%.

What did we do next? Well, our server nodes have the luxury of 64GB of RAM, and thus we bumped up the JVM heap space to 16GB. Since then, we haven’t hit the memory limit in our perf tests and have successfully held 100K concurrent persistent connections on each node in production. However, as you have seen in the sections above, we will hit some limit again as traffic increases. What do you think it will be? Memory? CPU? Tell us what you think by tweeting to me at @agupta03.

Conclusion

In this post, we presented an overview of how we used Server-sent events for maintaining persistent connections with LinkedIn’s Instant Messaging clients. We also showed how Akka’s Actor Model can be a powerful tool for managing these connections on the Play Framework.

Pushing the boundaries of what we can do with our systems in production is something we love to do here at LinkedIn. We shared some of the most interesting limits that we hit during our quest to hold hundreds of thousands of connections per node on our Instant Messaging servers. We shared details to help you understand the reasons behind each limit and techniques for extracting the maximum performance out of your systems. We hope that you will be able to apply some of these learnings to your own systems.

Acknowledgements

The development of Instant Messaging technology at LinkedIn was a huge team effort involving a number of exceptional engineers. Swapnil Ghike, Zaheer Mohiuddin, Aditya Modi, Jingjing Sun, and Jacek Suliga pioneered the development of a lot of the technology that is discussed in this particular post along with us.

This article is shared from the WeChat public account 首席架构师智库 (jiagoushipro). Author: Akhilesh Gupta.


Originally published: 2016-11-10.
