Inspur Information's Liu Jun: Intelligent Computing System Innovation Accelerates the Development of the Generative AI Industry


2025-08-18 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)12/24 Report--

On November 29, at the 2023 Artificial Intelligence Computing Conference (AICC) held in Beijing, Liu Jun, Senior Vice President of Inspur Information, shared the company's thinking on intelligent computing system innovation and the development of the AI industry in his keynote speech, "Intelligent Computing System Innovation Accelerates the Development of the AI Industry".

The following is a transcript of the speech.

Generative AI and large models are driving rapid growth in demand for computing power. How to better support AI innovation and application through intelligent computing systems has become the key question for the intelligent computing industry. To meet the opportunities and challenges of generative AI, we need to think comprehensively across four areas: computing power systems, AI software infrastructure (AI Infra), algorithms and models, and the industry ecosystem, so as to accelerate the adoption of AI across industries.

Intelligent computing system innovation: solving the computing challenges of generative AI

At the computing system level, the challenges facing generative AI come mainly from three areas: computing, data and interconnection.

At the computing level, computing power is becoming increasingly diversified, which leads to long development cycles, heavy investment in custom development, and long business migration times for AI computing systems. In addition, large-model training demands computing power at very large scale. Because the computing power of a single chip is limited, larger clusters must be built to scale performance.

At the data level, large models are evolving from single modalities such as text and images to multimodal and cross-modal forms. Training datasets reach the TB or even PB level, and data storage requirements differ across the stages of the workflow.

At the interconnection level, more than 40% of network bandwidth in traditional RoCE networks is wasted because of uneven ECMP hashing, and high tail latency causes network communication to account for 40% of training time, which greatly reduces computing efficiency. At the same time, the network is a resource shared by the whole cluster. Once the cluster reaches a certain scale, fluctuations in network performance affect the utilization of all computing resources, and a network failure can affect the connectivity of dozens of accelerator cards or more.
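To illustrate why static ECMP hashing can leave bandwidth idle, here is a minimal Python sketch. It assumes a hypothetical workload of eight equal, long-lived RoCEv2 flows spread over eight uplinks, and it uses SHA-256 as a stand-in for a switch's hash function. Real switches use vendor-specific hashes, so the result is illustrative only and is not a reproduction of the 40% figure quoted in the speech.

```python
import hashlib

def ecmp_link(flow, n_links):
    """Pick an uplink by hashing the flow 5-tuple (static, per-flow ECMP)."""
    digest = hashlib.sha256(repr(flow).encode()).digest()
    return int.from_bytes(digest[:8], "big") % n_links

def link_loads(flows, n_links):
    """Sum the offered load (Gb/s) that lands on each uplink."""
    loads = [0.0] * n_links
    for flow, gbps in flows:
        loads[ecmp_link(flow, n_links)] += gbps
    return loads

# Hypothetical workload: 8 flows of 100 Gb/s each, 8 uplinks.
# 5-tuple = (src IP, dst IP, src port, dst port, protocol); 4791 is the RoCEv2 UDP port.
flows = [((f"10.0.0.{i}", f"10.0.1.{i}", 49152 + i, 4791, "UDP"), 100.0)
         for i in range(8)]

loads = link_loads(flows, 8)
idle_links = sum(1 for load in loads if load == 0.0)
print("per-link load (Gb/s):", loads)
print("idle links:", idle_links, "| busiest link (Gb/s):", max(loads))
```

Because every packet of a flow follows the same hashed path, any two large flows that collide on one link share its capacity while another link sits unused. With only a few large flows, as is typical of collective communication in training, such collisions are likely.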

In response to these three challenges, Inspur Information has drawn on many years of experience in product development and customer service to propose a three-part solution.

In computing, the first step in dealing with diversified computing power is to adopt a unified system architecture and unified interface specifications that are compatible with all kinds of AI accelerator cards, so that the chips' computing power can be released efficiently. Inspur Information began working on the open design of diversified AI computing platforms as early as 2018. The newly released G7 diversified computing platform is the only AI computing platform in the industry that is compatible with both SXM and OAI accelerator cards and supports 8-card fully interconnected, 16-card fully interconnected and hybrid cube interconnected system topologies. To ensure performance when scaling out to a large number of nodes, the open accelerated computing architecture developed by Inspur Information supports PCIe, RoCE and a variety of proprietary interconnect protocols, with card-to-card interconnect bandwidth of up to 896 GB/s. Efficient scaling across nodes is achieved through NIC-free RDMA, and the cluster performance speedup ratio exceeds 90%.
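A speedup ratio of this kind is usually read as scaling efficiency: measured cluster throughput divided by the ideal linear throughput. The sketch below shows that arithmetic under that assumption. The function name and the throughput numbers are hypothetical and are not taken from the speech.

```python
def scaling_efficiency(single_node_throughput, cluster_throughput, n_nodes):
    """Measured cluster throughput as a fraction of ideal linear scaling."""
    if n_nodes <= 0 or single_node_throughput <= 0:
        raise ValueError("n_nodes and single_node_throughput must be positive")
    ideal = n_nodes * single_node_throughput
    return cluster_throughput / ideal

# Hypothetical measurement: one node trains 1,000 samples/s,
# and a 16-node cluster trains 14,720 samples/s.
efficiency = scaling_efficiency(1000.0, 14720.0, 16)
print(f"scaling efficiency: {efficiency:.0%}")
```

An efficiency above 90% means that less than a tenth of the added hardware is lost to communication and synchronization overhead.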

In data storage, to meet the storage requirements of large models, Inspur Information is the first in the industry to implement a cluster system that supports lossless interoperability among file, object, big data and other unstructured-data protocols. It also supports four storage media (flash, disk, tape and optical disc) and four tiers of storage management (hot, warm, cold and ice) across the whole data life cycle, so that one storage architecture can serve an entire data center, truly achieving data convergence and management convergence.
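As a rough illustration of four-tier life-cycle management, the following Python sketch assigns data to a tier by the time since its last access. The age thresholds and the tier-to-media mapping are assumptions made for the example. The speech names the tiers and the media but does not describe the placement policy.

```python
from datetime import datetime, timedelta

# Hypothetical age thresholds; not taken from the speech.
TIER_LIMITS = [
    ("hot", timedelta(days=7)),
    ("warm", timedelta(days=30)),
    ("cold", timedelta(days=365)),
]

# Assumed mapping of tiers onto the four media named in the speech.
TIER_MEDIA = {"hot": "flash", "warm": "disk", "cold": "tape", "ice": "optical disc"}

def pick_tier(last_access, now):
    """Return the storage tier for data last accessed at `last_access`."""
    age = now - last_access
    for name, limit in TIER_LIMITS:
        if age <= limit:
            return name
    return "ice"  # anything older than the last threshold is archived

now = datetime(2024, 1, 1)
for days in (1, 10, 100, 1000):
    tier = pick_tier(now - timedelta(days=days), now)
    print(f"last accessed {days} days ago -> {tier} ({TIER_MEDIA[tier]})")
```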

In network interconnection, Inspur Information has released a flagship 51.2T high-performance switch for AI computing scenarios. It gives enterprises intelligent computing network products and solutions with high throughput, high scalability and high reliability, and it addresses the common problems of traditional RoCE schemes, such as low effective bandwidth, high tail latency and slow fault convergence. It improves large-model training performance by more than 38%, with performance close to that of InfiniBand, helping AI users release large-model productivity efficiently.

AI Infra full-stack optimization: releasing diversified computing power and improving the computing efficiency of large models

The development chain for large-model algorithms is long, which means that many engineering tools are needed to support it. Therefore, in addition to the computing system, AI software infrastructure (AI Infra) is also in urgent need of innovation.

Developing an AIGC large model is an extremely complex piece of systems engineering. Even when the underlying supply of computing power is solved, developers still face the problems of platforms that are hard to build and hard to use well. "Hard to build" means that constructing a computing platform requires not only integrating hardware such as servers, storage and networking, but also managing compatibility and version selection across different hardware and software to ensure that drivers and tools are compatible and stable.

To accelerate the production and application of models, Inspur Information has developed OGAI (Open GenAI Infra), a large-model intelligent computing software stack, at the AI Infra level. For computing power deployment, OGAI has open-sourced PODsys, the industry's first deployment solution for AI computing cluster system environments. To keep long-running large-scale training stable, it implements automatic resume-from-checkpoint training at the computing power scheduling platform layer. For access to diversified computing power, it provides standardized, modular and stable access to more than 40 chips. For data governance, it provides a streamlined, customizable data cleaning pipeline that shortens data cleaning time and improves the accuracy of text review and filtering. For computing efficiency, extreme optimization of distributed parallel algorithms has raised the training computing efficiency of hundred-billion-parameter models to 54%. For multi-model management, it supports more than 10 mainstream open source models and Metabrain ecosystem models. Practice has shown that innovation in full-stack AI Infra software and workflows is the key to releasing diversified computing power efficiently and improving the computing efficiency of large models.
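The idea behind automatic resume-from-checkpoint training can be sketched in a few lines of Python. This is a toy illustration and not OGAI's implementation: the function names, the JSON checkpoint format and the simulated failure are all assumptions, and the single `weight` value stands in for real model and optimizer state.

```python
import json
import os
import tempfile

def save_checkpoint(path, state):
    """Write state atomically so a crash never leaves a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
    os.replace(tmp_path, path)  # atomic rename on the same filesystem

def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if not os.path.exists(path):
        return {"step": 0, "weight": 0.0}
    with open(path) as handle:
        return json.load(handle)

def train(path, total_steps, save_every, fail_at=None):
    """Run (or resume) training; `fail_at` simulates a node failure."""
    state = load_checkpoint(path)
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated node failure")
        state["weight"] += 0.5  # stand-in for one optimizer update
        state["step"] += 1
        if state["step"] % save_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state

with tempfile.TemporaryDirectory() as workdir:
    ckpt = os.path.join(workdir, "ckpt.json")
    try:
        train(ckpt, total_steps=10, save_every=2, fail_at=5)
    except RuntimeError:
        pass  # a scheduler would detect the failure and restart the job here
    final_state = train(ckpt, total_steps=10, save_every=2)
    print(final_state)
```

In this run the job fails after the fifth step, restarts from the checkpoint saved at step 4, and repeats only one step instead of the whole run. A scheduling platform automates the detect-and-restart part that the `try`/`except` stands in for here.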

Foundation large models: the core support for putting generative AI into practice

Large-model technology is driving the rapid development of the generative AI industry, and the key capabilities of a foundation model underpin how well large models perform in industries and applications. However, as they continue to evolve, foundation models still face challenges and constraints in key factors such as data, algorithms and computing power.

Driven by favorable factors such as policy support, growing computing power, vast data resources and stronger research capability, China has made some achievements in foundation models. It still needs more original breakthroughs in fundamental technology to consolidate its underlying model and algorithm capabilities.

Inspur Information starts from practice. It has increased its investment in model architecture innovation, high-quality data preparation and efficient use of computing power, and has applied these technologies to the "Yuan 2.0" large model, which shows advanced capabilities in programming, reasoning, logic and other areas.

In algorithms, Yuan 2.0 proposes and adopts a new attention algorithm structure, which effectively improves the model's natural language expressiveness and generation accuracy. In data, Yuan 2.0 introduces innovations across training data sources, data augmentation and data synthesis methods, which strengthen the model's mathematical and logical abilities. In computing power, Yuan 2.0 adopts a strategy of non-uniform pipeline parallelism + optimizer parameter parallelism + data parallelism + blockwise loss computation, which significantly reduces the large model's demand for inter-chip interconnect bandwidth and allows training to run efficiently on computing power of "limited" scale.

As a foundation model at the hundred-billion-parameter scale, Yuan 2.0 was tested on code generation, mathematical problem solving and factual question answering in public industry evaluations, and showed relatively advanced performance. To meet the capability requirements of different industries and scenarios, Inspur Information has fully open-sourced the entire Yuan 2.0 series of large models, so that users can build their own intelligent products and capabilities in the most convenient way. Users can vertically integrate frameworks, models and data according to the characteristics of their industry to improve the accuracy and usability of the foundation model.
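To make non-uniform pipeline parallelism concrete, here is a simplified Python sketch of an uneven layer split. It is not the model's actual partitioning algorithm. It assumes a toy memory model in which stage i of a 1F1B-style schedule holds weights plus activations for (n_stages - i) in-flight micro-batches, so earlier stages are more memory-hungry per layer. The function name and the cost constants are hypothetical.

```python
def nonuniform_layer_split(n_layers, n_stages, weight_mem=1.0, act_mem=0.5):
    """Greedily assign identical layers to pipeline stages to limit peak memory.

    Returns (layers per stage, estimated memory per stage).
    """
    # Assumed per-layer memory on each stage: weights plus activations for
    # the micro-batches that stage keeps in flight.
    per_layer = [weight_mem + act_mem * (n_stages - i) for i in range(n_stages)]
    counts = [0] * n_stages
    for _ in range(n_layers):
        # Place the next layer on the stage whose memory stays lowest afterwards.
        best = min(range(n_stages), key=lambda i: (counts[i] + 1) * per_layer[i])
        counts[best] += 1
    memory = [count * mem for count, mem in zip(counts, per_layer)]
    return counts, memory

counts, memory = nonuniform_layer_split(n_layers=24, n_stages=4)
uniform_peak = (24 // 4) * (1.0 + 0.5 * 4)  # 6 layers on the first stage
print("layers per stage:", counts)  # -> [4, 5, 6, 9]
print("peak memory:", max(memory), "vs uniform split:", uniform_peak)
```

Under this toy model the uneven split lowers the peak stage memory from 18.0 to 13.5 units. The trade-off is that later stages now do more compute per micro-batch, so a real partitioner has to balance memory against pipeline bubbles.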

Ecosystem integration: joint innovation to accelerate the adoption of AI applications

With a strong foundation model in place, we need to go further into application scenarios. Applying the capabilities of large models within industries requires deep cooperation among many vendors. To address the challenges of a complex, fragmented ecosystem and the difficulty of putting AI into practice in industry, Inspur Information has proposed the Metabrain ecosystem, which brings together high-quality partners for collaborative innovation and lets different vendors complement one another's strengths through "technical support, solution alliance and platform sharing".

Today the Metabrain ecosystem, supported by Inspur Information's AI computing platform, AI resource platform and AI algorithm platform, has connected more than 40 chip manufacturers, more than 400 algorithm vendors and more than 4,000 system integrators. Through a diversified supply of computing power, a full-stack AI Infra software stack and extensive large-model experience, it connects "hundreds of models" with "thousands of industries", helping all kinds of industries accelerate generative AI innovation and release productivity efficiently.

Intelligent computing power is the power to innovate. The deep integration of AIGC, the digital economy and the real economy will create more disruptive social and economic value. Inspur Information will adhere to the principles of openness, sharing and co-construction, seize the opportunity of the AIGC market, and work with partners to promote the adoption of artificial intelligence.
