
Huanxin: Technical practice for chat rooms with tens of millions of users, based on large-scale edge computing



  Live streaming is now mainstream. Live commerce, influencer-led sales, online celebrity concerts and similar formats have made the live chat room a must-have capability, and the technology behind it keeps iterating. Compared with centralized, single-data-center solutions, the large-scale edge chat room not only improves service stability and significantly increases the number of users that can be carried, it also greatly reduces cost and improves the user experience. The CDN-based chat room solutions that the industry has been trying have their own limitations: unlike audio and video, a single chat message is small, instantaneous traffic is high, and the same content is almost never requested twice, so a CDN-based solution cannot meet the needs of this scenario.

  1. How does the large-scale edge chat room work?

 

  The workflow of a large-scale edge chat room is simple. UserA joins chat room X, and UserB also joins chat room X. UserA then sends the message "hello" to the chat room. On receiving the message, the server sends an acknowledgement to UserA; once the message has been accepted successfully, the server spreads it to everyone in the same chat room, which in this example is UserB.
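
  The join, send, acknowledge and fan-out steps described above can be sketched as a toy in-memory server. This is only an illustration: the class and method names are hypothetical, and skipping the sender during fan-out is an assumption, not something the original states.

```python
from collections import defaultdict


class ChatServer:
    """Toy in-memory chat room server (all names are hypothetical)."""

    def __init__(self):
        self.rooms = defaultdict(set)     # room id -> set of user ids
        self.inboxes = defaultdict(list)  # user id -> delivered messages

    def join(self, user, room):
        self.rooms[room].add(user)

    def send(self, sender, room, text):
        # Only members of the room may send to it.
        if sender not in self.rooms[room]:
            return "rejected"
        # Fan the message out to every other member (assumption: the
        # sender does not receive a copy of its own message).
        for member in self.rooms[room]:
            if member != sender:
                self.inboxes[member].append((room, sender, text))
        # The return value stands in for the acknowledgement sent to the sender.
        return "ok"


server = ChatServer()
server.join("UserA", "X")
server.join("UserB", "X")
ack = server.send("UserA", "X", "hello")
```

  In a real deployment the members of one room are spread over many servers, which is what the rest of the article deals with.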

  2. The scenario is simple, but production is not

  Every step needs extra attention:

  1. How to maintain millions or even tens of millions of long-lived connections stably and efficiently

  2. How to maintain the status of chat room members

  3. How to choose message routing

  3. How to stabilize ultra-large-scale connections?

  There are two main directions for solving this problem: the number of connections on a single machine, and the size of the cluster.

  1. Single-machine load

  A single machine can support a very large number of connections, but what matters is whether those connections are effective: high-load and low-load connections are completely different things. Even setting business logic aside, heartbeat maintenance alone places a heavy burden on CPU and I/O, and none of that work serves the business logic. For this reason the load on a single machine is generally kept at no more than 100,000 (10W) connections.

  2. Cluster size

  The horizontal scalability of the cluster determines how large it can grow. The following figure shows the overall deployment structure of the server side:

 

  The green area in the figure above handles the clients' long-lived connections. All IMS (IM Server) instances provide the same service and have no dependencies on one another. The yellow area is deployed in the IDC; its main job is chat room route management and message route distribution. IMS is designed to scale horizontally without limit, so the size of the cluster can be considered unbounded. As the cluster grows, load scheduling, especially on the edge side, becomes a new problem. Building on the company's stable and efficient edge scheduling, close cooperation between client and server delivers an efficient and user-friendly connection experience.

  4. The key: the instantaneous number of connections carried

  For long-lived connections, keeping connections alive is important, but the number of connections that can be established at one instant is also a critical metric. This too has a single-machine dimension and a cluster dimension. The cluster view is much the same as for connections above. On a single machine, creating and tearing down a connection is more complex than merely maintaining it: login requires authentication, the chat room scenario has its own handling, and logout requires chat room cleanup.

 

  Here is the overall strategy in brief. The work done when a connection is created or closed falls into two categories: actions that must run synchronously, and actions that can be processed asynchronously and are placed on an asynchronous queue. The strategy itself is simple, but carrying it out in practice, and keeping it intact as versions iterate, is harder. To that end we follow one principle: anything added to the synchronous logic must come with a stated reason and must be discussed within the team; otherwise it has to go through the asynchronous queue. This does not mean that asynchronous work needs no review or discussion; it means that additions to the synchronous path need a specific justification. Only in this way can the clarity of the synchronous logic and the sustainability of the strategy be guaranteed.
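
  A minimal sketch of this split, using only the Python standard library, might look as follows. Which actions sit on the synchronous path (authentication, session removal) and which are queued (login bookkeeping, chat room cleanup) is an assumption made for illustration, as are all class, method and task names.

```python
import queue
import threading


class ConnectionHandler:
    """Splits connect/disconnect work into a synchronous and an asynchronous part."""

    def __init__(self):
        self.sessions = set()
        self.async_tasks = queue.Queue()
        self.done = []  # tasks finished by the background worker
        threading.Thread(target=self._drain, daemon=True).start()

    def on_connect(self, user, token):
        # Synchronous: login must not succeed before authentication finishes.
        if token != f"token-{user}":  # stand-in for a real credential check
            return False
        self.sessions.add(user)
        # Asynchronous: bookkeeping that the client does not have to wait for.
        self.async_tasks.put(("record_login", user))
        return True

    def on_disconnect(self, user):
        # Synchronous: drop the session immediately.
        self.sessions.discard(user)
        # Asynchronous: chat room cleanup may lag slightly behind.
        self.async_tasks.put(("cleanup_rooms", user))

    def _drain(self):
        while True:
            task = self.async_tasks.get()
            self.done.append(task)  # a real worker would execute the task here
            self.async_tasks.task_done()


handler = ConnectionHandler()
logged_in = handler.on_connect("UserA", "token-UserA")
rejected = handler.on_connect("UserB", "bad-token")
handler.on_disconnect("UserA")
handler.async_tasks.join()  # wait until the queue has been drained
```

  Keeping the synchronous part this small is what lets a single machine absorb bursts of logins and logouts; everything else is smoothed out by the queue.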

  5. How to maintain the membership status?

  A chat room is a specific form of multi-person chat. Messages are delivered only to online users. A user who goes offline automatically leaves the chat room and does not receive the missed messages after coming back online. Scenarios in which a user keeps watching the live stream after a disconnect and reconnect, or sees some message history on entering, are implemented by other means, namely automatic re-subscription and history pulling.

  Hierarchical management of members

 

  The left side of the figure above is the relationship table that maps users to chat rooms directly, labelled user chat room information in the figure. Maintaining it also generates the chat room's member information, which message routing queries frequently to find the members a message should be spread to. The core of the layering is the maintenance of the node chat room information: it is modified only when the list of chat rooms on the current node changes, and the change is then synchronized to the IDC, that is, to the final routing table. This logic is triggered only when the first user joins a chat room on the node and when the last user leaves it.

 

  The figure above abstracts the hierarchical registration logic further. Each level registers the chat room relationships of its own level and keeps them alive in the chat room relationships of the level above. We have only verified three levels. More levels are theoretically feasible but not recommended: each additional level doubles the complexity and the amount of exception handling. For the message delivery described later, every level must be working normally for a message to be delivered.

  Member heartbeat maintenance

 

  Chat room information on the local node is handled in memory, so problems are unlikely and consistency is relatively easy to guarantee. Network interaction across nodes, and especially across data centers and regions, cannot be guaranteed to succeed every time. Synchronization of this information therefore adds a keep-alive mechanism, an asynchronous queue, a retry mechanism and so on to further ensure business stability. There is also timely exception handling: a user must never be left in a chat room where no messages arrive and the state cannot be recovered.
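
  As a rough sketch of the keep-alive and retry ideas, the snippet below expires registrations that have not been refreshed within a TTL and retries a failing synchronization call a fixed number of times. The TTL value, the retry count and all names are assumptions for illustration; timestamps are passed in explicitly so the example is deterministic.

```python
class KeepAliveRegistry:
    """Upper-level registry whose entries expire unless refreshed."""

    def __init__(self, ttl):
        self.ttl = ttl
        self.last_seen = {}  # (room, node) -> time of last keep-alive

    def keep_alive(self, room, node, now):
        self.last_seen[(room, node)] = now

    def expire(self, now):
        # Remove every registration that has been silent for longer than ttl.
        stale = [key for key, seen in self.last_seen.items() if now - seen > self.ttl]
        for key in stale:
            del self.last_seen[key]
        return stale


def sync_with_retry(send, attempts=3):
    """Call send() until it succeeds; return the attempt number, or None."""
    for attempt in range(1, attempts + 1):
        try:
            send()
            return attempt
        except ConnectionError:
            continue
    return None


registry = KeepAliveRegistry(ttl=30)
registry.keep_alive("X", "ims-1", now=0)
registry.keep_alive("X", "ims-2", now=0)
registry.keep_alive("X", "ims-1", now=25)  # ims-1 refreshes, ims-2 goes silent
expired = registry.expire(now=40)

calls = {"n": 0}


def flaky_send():
    # Simulated cross-region call that fails twice before succeeding.
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated network failure")


attempts_used = sync_with_retry(flaky_send)
```

  Expiry is what lets the upper level recover when a node disappears without unregistering, so stale routes do not linger.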

  6. How to choose message routing?

  Multi-level routing

 

  Following the hierarchical registration logic above, messages are also delivered level by level. This design reduces the delivery burden on each layer. For example, with 200 IMS instances and 10 Edges, in the extreme case the IDC distributes to 10 targets and each Edge distributes to 20. With only one level, the IDC would have to distribute 200 requests. That does not look like a large number, but it is the cost of a single message; with 5,000 requests it becomes 200 × 5,000 = 1,000,000 distributions. Layering effectively reduces the complexity of each level and minimizes cross-data-center and cross-region calls, further reducing risk. The price of these benefits is that path maintenance becomes more complicated.
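
  The arithmetic in this example can be checked with a few lines of code. The helper below is hypothetical and assumes the IMS instances are spread evenly across the Edges.

```python
def fanout(ims_count, edge_count, messages):
    """Number of sends per component for flat versus two-level distribution."""
    ims_per_edge = ims_count // edge_count  # assumes an even spread
    return {
        "flat_idc_sends": ims_count * messages,
        "tiered_idc_sends": edge_count * messages,
        "tiered_sends_per_edge": ims_per_edge * messages,
    }


result = fanout(ims_count=200, edge_count=10, messages=5000)
# result["flat_idc_sends"]        -> 1_000_000
# result["tiered_idc_sends"]      -> 50_000
# result["tiered_sends_per_edge"] -> 100_000
```

  The total number of sends does not shrink, but no single component has to issue all of them, and only the IDC-to-Edge hop crosses regions.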

  Combining message push and pull

 

  In chat rooms, messages are pushed directly in most scenarios. How to filter, screen and discard messages in large chat rooms is itself a complex issue. At the delivery stage, pushing directly to the client does guarantee immediacy, but clients differ: device configuration, the device's running state at the time and network conditions all vary. The client therefore also needs to be able to pull a certain number of messages according to its own situation, which adapts more flexibly to different scenarios. These strategies are simple, but many details have to be considered when they are actually rolled out to online services, and stability remains hard to achieve; good monitoring is hard to come by in this specific scenario. We too started with a simple fixed-mode push-pull scheme and then tuned it in detail according to the specific situation.
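
  One possible shape of a fixed-mode push-pull scheme is sketched below. The per-client mode flag, the bounded buffer that drops the oldest messages, and the rule that a pull returns only the newest `limit` messages are all assumptions chosen for illustration.

```python
from collections import deque


class RoomBuffer:
    """Server-side buffer holding the most recent messages of one room."""

    def __init__(self, capacity):
        self.messages = deque(maxlen=capacity)  # oldest entries are discarded
        self.next_seq = 0

    def publish(self, text):
        seq = self.next_seq
        self.messages.append((seq, text))
        self.next_seq += 1
        return seq

    def pull(self, after_seq, limit):
        # Return at most `limit` of the newest messages after `after_seq`.
        newer = [m for m in self.messages if m[0] > after_seq]
        return newer[-limit:]


class Client:
    def __init__(self, mode):
        self.mode = mode  # "push" or "pull", fixed per client
        self.received = []
        self.last_seq = -1

    def on_push(self, seq, text):
        self.received.append(text)
        self.last_seq = seq

    def poll(self, buffer, limit):
        for seq, text in buffer.pull(self.last_seq, limit):
            self.received.append(text)
            self.last_seq = seq


def publish(buffer, clients, text):
    seq = buffer.publish(text)
    for client in clients:
        if client.mode == "push":  # pull clients fetch when they are ready
            client.on_push(seq, text)


buffer = RoomBuffer(capacity=100)
fast, slow = Client("push"), Client("pull")
for i in range(10):
    publish(buffer, [fast, slow], f"msg-{i}")
slow.poll(buffer, limit=3)  # the slow client only takes what it can handle
```

  Here the fast client sees all ten messages as they are published, while the slow client receives only the three most recent ones when it polls.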

  Fixed routing

 

  For certain large-scale live broadcast scenarios, a simpler routing method is also provided. The chat room route management described above can still run into problems. For a chat room known in advance to be extremely large, its members can be assumed to cover all IMS instances, so hierarchical registration of the chat room is somewhat redundant. Registration at the chat room level is therefore replaced by registration at the node level, which relies by default on the system's service registration. This makes the whole thing simple and efficient.

  This method has its advantages in such super-large chat rooms, but it also has its own bottleneck: a message is delivered to every node whether or not any user on that node has joined the chat room, so every IMS must process all messages, even though many of them do not need to be delivered. There is no one-size-fits-all solution; the two approaches are complementary, not mutually exclusive.
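
  How the two approaches can coexist is shown in the small sketch below: rooms flagged as fixed-route go to every registered node, and all other rooms use the hierarchical routing table. The function name, the `fixed_rooms` set and the sample data are hypothetical.

```python
def target_nodes(room, all_nodes, room_routes, fixed_rooms):
    """Pick the nodes a message for `room` should be delivered to."""
    if room in fixed_rooms:
        # Fixed routing: every node registered with the system gets the message.
        return set(all_nodes)
    # Hierarchical routing: only nodes that registered for this room.
    return set(room_routes.get(room, ()))


all_nodes = {"ims-1", "ims-2", "ims-3"}
room_routes = {"small-room": {"ims-2"}}
fixed_rooms = {"world-cup-final"}

big = target_nodes("world-cup-final", all_nodes, room_routes, fixed_rooms)
small = target_nodes("small-room", all_nodes, room_routes, fixed_rooms)
```

  The fixed route trades wasted deliveries to nodes with no members for the removal of per-room registration traffic.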

  7. Large-scale edge chat rooms VS central clusters

  Compared with the traditional central-cluster chat room, the large-scale edge chat room solution is not fundamentally different in technical architecture: it still relies on multi-level routing and a combination of message push and pull.

  The difference lies in the form of deployment, and it is exactly this difference that changes many things. Large-scale edge chat rooms improve connectivity at the edge: they can be deployed close to where users are, solving the last five kilometers. They can also draw on the resources of every data center to reach millions, tens of millions or even more users. In practice the solution also plays a key role in reducing cost: to guarantee availability and stability, central data centers generally use BGP networks, which cost far more than the non-BGP networks of edge data centers. In terms of overall system availability, large-scale edge chat rooms also offer better disaster recovery against data center failures than central-cluster chat rooms. This section has mainly presented the advantages, but no solution is a cure-all; each has its own strengths and weaknesses. Deploying large-scale edge chat rooms is complex, the demands on the operations system are relatively high, and network stability between services is harder to guarantee. The approach is therefore generally suited to large-scale, public chat rooms; for relatively fine-grained or small-scale scenarios it is not a good fit and only adds a lot of uncertainty.

  8. Large-scale edge chat room VS CDN mode

  For large-scale chat rooms we also considered using the relatively mature CDN distribution technology in the industry. In practice we found that chat messages are small packets that are not distributed repeatedly: the same message is unlikely to be fetched again and again over a period of time. Chat room messages are generally received as they arrive; a client that misses one does not expect to receive it later. Moreover, a CDN solution aggregates messages and has clients pull them periodically, and the resulting message duplication and latency cannot meet customers' needs.

  In terms of technical difficulty, a chat room with 10 million members that refreshes messages every 10 s can approach 1,000,000 (100W) QPS, which is a very big challenge for a CDN system; even with multiple CDNs connected, a relatively high proportion of requests will time out. What is more, a delay of 10 s is clearly perceptible in some scenarios.
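
  The request rate quoted here follows from a simple division, assuming the periodic pulls are spread evenly over the refresh interval (the variable names are illustrative):

```python
members = 10_000_000       # users in the chat room
refresh_interval_s = 10    # each client pulls once per interval

pull_qps = members // refresh_interval_s  # 1_000_000 requests per second
```

  Shortening the interval to reduce the perceived delay raises the rate proportionally; a 1 s interval would mean 10 million requests per second.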

  Compared with centralized, single-data-center solutions, large-scale edge chat rooms not only improve service stability and significantly increase the number of users carried, they also greatly reduce cost and improve the user experience. The CDN-based chat room solutions that the industry has been trying have their own limitations: unlike audio and video, a single chat message is small, instantaneous traffic is high, and repeated requests for the same content are almost nonexistent, so a CDN-based solution cannot meet the requirements of this scenario.

  9. Typical case: the Qatar World Cup. Edge network plus low latency, supporting 18 million concurrent users in a large-scale chat room and delivering 40 million messages per second

  The 2022 World Cup in Qatar has come to a successful conclusion. During the tournament, Huanxin re-engineered the World Cup live chat room for a carrier customer and provided the technical support for a chat room at the ten-million-user scale, meeting the customer's need for a very large number of concurrent users in the World Cup live broadcast. After architectural adjustments, the system supported 18 million users online at the same time, with message processing capacity reaching 5,000 QPS and message delivery volume reaching the level of 40 million+ per second. Huanxin's overall solution not only supports this scale; its cost is also comparable to that of a CDN solution, and machines can be scaled out and in efficiently.


