DiShan Technology 10,000P Computing Power Center Technical Solution: Building the Computing Foundation for the Intelligent Era

Overview

In the era of large AI models, computing power has become a core element of national technological competitiveness and industrial innovation. According to a report by Gartner, a world-renowned consulting firm, the global computing power market reached 1.25 billion US dollars in 2022 and is projected to grow to 1.8 billion US dollars by 2025. Global tech giants are scrambling to deploy ultra-large-scale computing infrastructure to seize the commanding heights of AI technology. For instance, Google's Tensor Processing Unit (TPU) has been widely applied in its AI research and services, significantly enhancing its machine learning and data analysis capabilities.

China has also incorporated computing power networks into its "new infrastructure" strategy, emphasizing the construction of an independent, controllable, efficient and green computing power system. As a pioneer in the AI field, DiShan Technology has deep insight into industry trends. Combining its own technological accumulation and industrial resources, the company has launched the 10,000P high-performance computing power center construction project. Guided by the tenets of "technological leadership, ecological empowerment, security and credibility, and low-carbon development", the project aims to break through computing power bottlenecks, drive algorithmic innovation in fields such as large-model training, autonomous driving, biomedicine and weather forecasting, and facilitate the deep integration of the digital economy and the real economy.

As a leading intelligent computing infrastructure in China, the DiShan Technology Computing Power Center will undertake three core missions: first, to build a "testbed" for AI technological innovation, providing large-scale computing resources for universities, research institutions and enterprises to accelerate algorithm iteration; second, to construct an "accelerator" for industrial empowerment, lowering the threshold for enterprises' AI applications through inclusive computing power services; third, to set a "new benchmark" for green intelligent computing, exploring low-carbon technology paths and practicing sustainable development. In the future, the center will become a key hub for AI technology breakthroughs and industrial implementation.

Overall Technical Architecture

The computing power center adopts an overall technical architecture of "three-layer integration and collaborative linkage". Through the in-depth integration of the Infrastructure as a Service (IaaS) layer, the Platform as a Service (PaaS) layer and the intelligent Operation and Maintenance (O&M) layer, it achieves efficient resource utilization and full-process intelligent management.

 1. Infrastructure as a Service (IaaS) Layer

**Computing Cluster**: Adopts a heterogeneous "10,000-card-level GPU + CPU" architecture, with NVIDIA H100/A100 and domestic high-performance accelerator cards at its core. Single-card computing power exceeds 300 TFLOPS, and overall FP16/FP32 mixed-precision computing power reaches 10,000P. The cluster supports dynamic scheduling and can flexibly switch computing modes according to task requirements.

**Storage System**: Builds an EB-level all-flash distributed storage array using the NVMe protocol and RDMA network acceleration. Read/write bandwidth exceeds 10 TB/s with latency below 100 μs, meeting the stringent data-throughput requirements of training 100-billion-parameter models.

**Network Architecture**: Constructs a lossless network based on 800G Ethernet + HDR InfiniBand in a Fat-Tree topology, achieving microsecond-level communication latency between nodes. The network supports Traffic Engineering (TE) and congestion control to keep large-scale parallel training stable.

 2. Platform as a Service (PaaS) Layer

**Resource Scheduling Platform**: Integrates Kubernetes container orchestration with Slurm job scheduling to pool GPU, CPU and storage resources uniformly. The platform supports intelligent sharding scheduling, dynamically mapping a single task onto the optimal resource combination to raise utilization; a minimal sketch of the idea follows.
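To make the "sharding scheduling" idea concrete, here is a minimal, hypothetical sketch: a job's GPU demand is split across pooled nodes with a greedy best-fit rule. The node names, capacities and selection rule are illustrative assumptions, not the platform's actual scheduler.

```python
# Hypothetical sketch: split a job's GPU demand across pooled nodes,
# preferring the nodes with the most free capacity to minimise shard count.
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    free_gpus: int
    free_cpus: int

def shard_job(nodes, gpus_needed, cpus_per_gpu=8):
    """Greedily pick the nodes whose free capacity best matches the request."""
    plan = {}
    for node in sorted(nodes, key=lambda n: n.free_gpus, reverse=True):
        if gpus_needed == 0:
            break
        take = min(node.free_gpus, gpus_needed, node.free_cpus // cpus_per_gpu)
        if take > 0:
            plan[node.name] = take
            node.free_gpus -= take
            node.free_cpus -= take * cpus_per_gpu
            gpus_needed -= take
    if gpus_needed > 0:
        raise RuntimeError("insufficient pooled capacity for this job")
    return plan

pool = [Node("gpu-node-01", 8, 96), Node("gpu-node-02", 4, 64), Node("gpu-node-03", 8, 96)]
print(shard_job(pool, gpus_needed=14))  # {'gpu-node-01': 8, 'gpu-node-03': 6}
```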

**AI Development Platform**: Ships with efficient training frameworks such as ModelScope and DeepSpeed, pre-configured environment images and a JupyterLab interactive interface. An integrated AutoML toolchain supports automatic model search and hyperparameter optimization, illustrated below.
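As a rough illustration of what the AutoML toolchain automates, the following sketch runs a plain random search over a toy hyperparameter space. The search space and the stand-in objective are invented for demonstration and do not describe the platform's actual toolchain.

```python
# Illustrative random-search sketch of automatic hyperparameter optimisation.
import random

search_space = {
    "learning_rate": lambda: 10 ** random.uniform(-5, -2),
    "batch_size": lambda: random.choice([64, 128, 256, 512]),
    "warmup_steps": lambda: random.randint(100, 2000),
}

def toy_objective(cfg):
    # Stand-in for a real validation metric returned by a training run.
    return -abs(cfg["learning_rate"] - 3e-4) - cfg["warmup_steps"] / 1e6

best_cfg, best_score = None, float("-inf")
for _ in range(50):
    cfg = {k: sample() for k, sample in search_space.items()}
    score = toy_objective(cfg)
    if score > best_score:
        best_cfg, best_score = cfg, score

print("best configuration found:", best_cfg)
```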

**Model Service Platform**: Built on a cloud-native architecture, it supports full-scenario deployment of online inference, batch inference and edge inference. Built-in monitoring dashboards and elastic scaling policies safeguard the service SLA; a simplified scaling rule is sketched below.
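A minimal sketch of one possible elastic-scaling policy: the replica count follows observed request rate with headroom and upper/lower bounds. The per-replica capacity and bounds are illustrative assumptions, not production values.

```python
# Assumed scaling rule: replicas = ceil(QPS * headroom / per-replica capacity), bounded.
import math

def target_replicas(current_qps: float, qps_per_replica: float = 200.0,
                    headroom: float = 1.2, min_replicas: int = 2,
                    max_replicas: int = 64) -> int:
    """Return the replica count needed to serve current_qps with spare headroom."""
    needed = math.ceil(current_qps * headroom / qps_per_replica)
    return max(min_replicas, min(max_replicas, needed))

print(target_replicas(1500))  # 9 replicas for 1,500 QPS
print(target_replicas(50))    # floor of 2 replicas keeps the service warm
```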

 3. Intelligent Operation and Maintenance & Security Management Layer

**Intelligent Monitoring System**: Deploys an AIOps system that uses machine learning to analyze device temperature, power consumption and load data in real time, predict hardware failures and trigger early warnings. The system supports a 3D visualized heat map of the data center; a minimal detection sketch follows.
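The sketch below illustrates the predictive-maintenance idea with an unsupervised detector that flags unusual combinations of temperature, power draw and utilisation. It uses scikit-learn's IsolationForest as a stand-in; the simulated telemetry and thresholds are assumptions, and the real AIOps pipeline is not shown.

```python
# Unsupervised anomaly detection on simulated device telemetry.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Simulated healthy telemetry: [temperature (deg C), board power (W), utilisation (%)]
normal = np.column_stack([
    rng.normal(55, 3, 5000),
    rng.normal(400, 30, 5000),
    rng.normal(80, 10, 5000),
])

detector = IsolationForest(contamination=0.01, random_state=0).fit(normal)

samples = np.array([
    [56, 410, 85],   # typical operating point
    [88, 690, 99],   # overheating under abnormal power draw
])
print(detector.predict(samples))  # 1 = normal, -1 = anomaly
```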

**Automated Operation and Maintenance (AIOps)**: Develops an O&M knowledge graph that combines an expert-experience database with real-time data to locate and repair faults automatically. O&M robots can perform more than 75% of routine maintenance operations.

**Security Protection System**: Builds a "Zero Trust" security architecture, combining national cryptographic algorithms, Hardware Security Modules (HSM) and trusted computing technology to form a protection network covering the entire data lifecycle.

 Core System Design Solution

 1. HighPerformance Computing Cluster Design

Adopts a modular "blade server + GPU module" design. Each module integrates 8 B300 cards and supports hot-swappable maintenance for flexible expansion and management. The cluster interconnects cards through NVLink 4.0 with bandwidth of up to 600 GB/s, ensuring high-speed data transfer.

Introduces a Parameter Server (PS) architecture that shards model parameters across high-speed SSDs, effectively reducing communication bottlenecks and improving overall performance.

Supports mixed-precision training and quantization compression. Mixed-precision training reduces compute consumption by performing part of the calculation in lower-precision data types (such as FP16) while preserving model accuracy; quantization further shrinks model size and computational cost by reducing the bit width of model parameters. Combined, the two techniques typically improve training efficiency by more than 30%; a minimal sketch follows.
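The following PyTorch sketch shows both techniques in their most common form: automatic mixed precision during training, followed by post-training dynamic quantization of the Linear layers. The tiny model and data are illustrative, and the AMP loop assumes a CUDA device is available.

```python
# Minimal mixed-precision training loop plus post-training dynamic quantization.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 10)).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
scaler = torch.cuda.amp.GradScaler()          # scales the loss to avoid FP16 underflow

x = torch.randn(64, 1024, device="cuda")
y = torch.randint(0, 10, (64,), device="cuda")

for _ in range(10):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():           # forward/backward run in mixed precision
        loss = nn.functional.cross_entropy(model(x), y)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()

# Post-training dynamic quantization: INT8 weights for Linear layers at inference time.
quantized = torch.quantization.quantize_dynamic(
    model.cpu(), {nn.Linear}, dtype=torch.qint8
)
print(quantized)
```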

 2. Liquid Cooling System

Innovatively adopts a dual-circulation system of "cold-plate liquid cooling + micro-channel spray cooling", keeping coolant temperature at 15-25 °C and reducing the PUE to 1.12. The system is equipped with backup cooling towers and energy-storage batteries to ensure stable operation under extreme weather conditions.

Develops an intelligent temperature-control algorithm that dynamically adjusts coolant flow according to load, achieving an energy-saving rate of 18% (see the sketch at the end of this subsection).

Cooperates with energy companies to build a waste-heat recovery system, reusing waste heat for campus heating and achieving cascade utilization of energy.
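Below is a deliberately simplified sketch of the load-aware temperature-control idea mentioned above: a proportional controller nudges coolant flow toward a target outlet temperature while tracking IT load. The gains, setpoint and flow limits are illustrative assumptions, not the deployed control law.

```python
# Simplified proportional control of coolant flow based on load and outlet temperature.
def target_flow(outlet_temp_c: float, it_load_kw: float,
                setpoint_c: float = 25.0, kp: float = 2.0,
                min_flow: float = 50.0, max_flow: float = 400.0) -> float:
    """Return a coolant flow rate (litres/minute) for the current load and temperature."""
    baseline = 0.8 * it_load_kw                      # flow roughly tracks IT load
    correction = kp * (outlet_temp_c - setpoint_c)   # proportional term corrects drift
    return max(min_flow, min(max_flow, baseline + correction))

print(target_flow(outlet_temp_c=27.5, it_load_kw=250))  # hotter than setpoint -> more flow
print(target_flow(outlet_temp_c=22.0, it_load_kw=120))  # cooler than setpoint -> less flow
```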

 3. Open and Decoupled Intelligent Computing Network Architecture

Complies with the Open Compute Project (OCP) standard, with network devices supporting white-box switches and open-source protocol stacks, reducing hardware procurement costs by 30%.

Introduces P4-programmable switching chips to enable flexible customization of network functions and intelligent traffic scheduling.

 Reserves 400G/1.6T optical module interfaces to provide redundancy for future network upgrades.

 4. Full Data Lifecycle Management

Builds an integrated data lakehouse architecture on technologies such as Hadoop and Delta Lake, supporting PB-level real-time data analysis.

 Develops a data governance platform, providing functions such as data lineage tracking, quality assessment and compliance auditing to meet the requirements of the *Data Security Law*.

Applies a federated learning framework to support secure cross-institutional data sharing and break down "data silos"; a minimal aggregation sketch follows.
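The sketch below illustrates the core of federated learning with simple federated averaging (FedAvg): each institution trains locally, and only model weights, never raw data, are shared and aggregated. The clients, parameter vectors and sample counts are illustrative assumptions.

```python
# Federated averaging: combine locally trained models weighted by sample count.
import numpy as np

def federated_average(client_weights, client_sample_counts):
    """Aggregate local models, weighting each client by its number of samples."""
    total = sum(client_sample_counts)
    stacked = np.stack(client_weights)
    coeffs = np.array(client_sample_counts, dtype=float) / total
    return (coeffs[:, None] * stacked).sum(axis=0)

# Three institutions report locally trained parameter vectors of the same shape.
local_models = [np.array([0.9, 1.1, 0.5]),
                np.array([1.0, 0.9, 0.7]),
                np.array([1.2, 1.0, 0.6])]
samples = [10_000, 40_000, 50_000]

print(federated_average(local_models, samples))  # weighted global model
```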

 Security and Compliance Assurance System

 1. Data Security

 Implements "data classification and grading" management. Homomorphic encryption technology is adopted for core data to achieve "usable but not visible".

Deploys a Data Loss Prevention (DLP) system that uses AI algorithms to identify sensitive data flows and block them automatically.

 2. Access Control

 Establishes a "Zero Trust Network Architecture", adopting MultiFactor Authentication (MFA) and dynamic access policies to restrict the permissions of privileged accounts.

Builds a User and Entity Behavior Analytics (UEBA) system that uses machine learning to detect abnormal operations and generate audit reports; a simplified risk-scoring sketch follows.
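As a rough illustration of how zero-trust signals can combine into a per-request decision, the sketch below scores each access request from context signals (MFA result, device posture, a UEBA-style anomaly score) instead of trusting network location. The signals, weights and thresholds are invented assumptions, not the center's actual policy engine.

```python
# Hypothetical risk-based access decision combining zero-trust context signals.
from dataclasses import dataclass

@dataclass
class AccessRequest:
    mfa_passed: bool
    device_compliant: bool
    behaviour_anomaly: float   # 0.0 (normal) .. 1.0 (highly unusual), e.g. from UEBA
    privileged_action: bool

def decide(req: AccessRequest) -> str:
    risk = 0.0
    risk += 0.0 if req.mfa_passed else 0.5
    risk += 0.0 if req.device_compliant else 0.3
    risk += 0.4 * req.behaviour_anomaly
    if req.privileged_action:
        risk += 0.2
    if risk >= 0.6:
        return "deny"
    return "step-up-auth" if risk >= 0.3 else "allow"

print(decide(AccessRequest(True, True, 0.05, False)))   # allow
print(decide(AccessRequest(True, False, 0.8, True)))    # deny
```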

 3. Network Security

 Deploys a honeypot system and threat intelligence platform to capture Advanced Persistent Threats (APT) in real time.

Sets up micro-segmentation firewalls at network boundaries, combined with SD-WAN technology to achieve dynamic traffic isolation.

 4. Disaster Recovery and High Availability

 Adopts a "threesite fivecenter" disaster recovery architecture, with core data synchronously replicated to offsite backup centers in real time.

Key business systems support active-active cluster deployment, reducing the Recovery Time Objective (RTO) to 5 minutes.

Conducts regular red-team/blue-team confrontation drills to verify the effectiveness of emergency response plans.

 Green and Sustainable Development Strategy

 1. Energy-Saving Design

Selects high-efficiency, titanium-grade UPS units with 98% conversion efficiency; the air-conditioning system uses variable-frequency technology to intelligently adjust cooling capacity according to load.

Optimizes cabinet layout through CFD simulation, achieving a 90% hot/cold aisle containment rate to reduce ineffective heat dissipation.

 2. Green Energy Application

Cooperates with new-energy groups to connect self-built photovoltaic power stations and direct wind power supply, with renewable energy accounting for 35% of the electricity supply.

 Participates in the carbon trading market, offsetting residual carbon emissions by purchasing green certificates and carbon sinks to achieve carbon neutrality goals.

 3. Intelligent Carbon Management

Develops a carbon-footprint tracking system that monitors equipment energy consumption and carbon emissions in real time and generates carbon-reduction optimization suggestions.

Explores collaborative scheduling of computing power and green electricity, executing high-energy-consumption tasks during off-peak power periods to reduce electricity costs; a scheduling sketch follows.
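The sketch below illustrates the computing-power / green-electricity co-scheduling idea: deferrable, energy-hungry jobs are shifted into hours with cheaper tariffs and a larger share of renewable supply. The hourly tariff and green-share profiles, and the scoring weights, are illustrative assumptions.

```python
# Pick a start hour for a deferrable job that minimises a cost- and carbon-aware score.
def pick_start_hour(duration_h: int, green_share: list[float], price: list[float]):
    """Choose the start hour with the cheapest, greenest contiguous window."""
    best_hour, best_score = None, float("inf")
    for start in range(24 - duration_h + 1):
        window_price = sum(price[start:start + duration_h])
        window_green = sum(green_share[start:start + duration_h])
        score = window_price - 0.5 * window_green   # prefer cheap and green hours
        if score < best_score:
            best_hour, best_score = start, score
    return best_hour

# Hypothetical 24-hour profiles: photovoltaic output peaks at midday, tariffs peak in the evening.
green = [0.1] * 8 + [0.6, 0.8, 0.9, 0.9, 0.9, 0.8, 0.6, 0.4] + [0.2] * 8
tariff = [0.3] * 8 + [0.4] * 8 + [0.8] * 6 + [0.5] * 2

print(pick_start_hour(4, green, tariff))  # hour 9 with these example profiles
```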

 4. Modularity and Scalability

Adopts a Prefabricated Modular Data Center (PMDC) approach, shortening the construction period by 50% and supporting on-demand expansion.

Reserves interfaces for cutting-edge technologies such as AI photonic computing and quantum computing, maintaining technological foresight for more than 10 years.

 Implementation Assurance and Operation Mode

 1. Construction Period and Milestones

**Phase I (6 months)**: Complete infrastructure construction, deploy 5,000P of computing power, and launch the basic service platform.

**Phase II (8 months)**: Expand computing power to 8,000P, and launch the AI development toolchain and industry-specific solutions.

**Phase III (6 months)**: Achieve full-load operation at 10,000P and establish an ecosystem partner network.

2. Construction Mode

Adopts the EPC+O (Engineering, Procurement and Construction + Operation) model, jointly implemented with XX Construction Group and XX Cloud Service Provider.

 Introduces BIM technology for digital construction to reduce construction errors and material waste.

 

 3. Operation and Services

 Establishes a "computing power supermarket" mode, providing three service plans: payasyougo, annual/monthly subscription and computing power leasing.

Builds a professional expert service team to provide value-added services such as model optimization, performance tuning and scenario adaptation.

 Regularly holds AI developer competitions and technical salons to build a vibrant industrial ecosystem.

 Expected Benefits and Social Value

**Economic Benefits**: With a total investment of 8 billion yuan, the project is expected to generate average annual revenue of 10 billion yuan after entering operation, driving the upstream and downstream industrial scale beyond 30 billion yuan. Computing power sharing can cut enterprises' R&D costs by more than 40%.

**Technological Innovation**: Supports the training of GPT-4-level large models, promoting cutting-edge breakthroughs in fields such as brain-inspired computing and protein structure prediction, and incubating more than 10 AI unicorn enterprises within three years.

**Industrial Empowerment**: Provides customized solutions for industries such as automotive, finance and healthcare, helping 100 enterprises achieve intelligent transformation.

**Social Benefits**: Reduces carbon emissions by 900,000 tons annually, equivalent to planting 500,000 trees, and narrows the regional digital divide through inclusive computing power services.

**Strategic Value**: Enhances China's international voice in the AI field and helps achieve the strategic goal of becoming a "computing power powerhouse".

 Ecological Cooperation and Future Outlook

DiShan Technology will work hand in hand with industry chain partners to build a computing power ecosystem.

The DiShan Technology 10,000P Computing Power Center is not only a technology project but also strategic infrastructure for the future. Adhering to the core concepts of openness, innovation, green development and security, we will create an efficient, intelligent and sustainable new paradigm of computing power, injecting strong impetus into technological progress and industrial transformation in the era of artificial intelligence. Let us join hands to embrace the intelligent future!