DeepSeek-V3/R1 Inference System - A high-performance AI inference system designed to maximize throughput and minimize latency.
## Primary Goal of DeepSeek-V3/R1 Inference System
The primary goal of the DeepSeek-V3/R1 Inference System is to achieve higher throughput and lower latency when serving the DeepSeek-V3 and DeepSeek-R1 models.
## Hardware Used in DeepSeek-V3/R1 Inference System
The DeepSeek-V3/R1 Inference System runs on H800 GPUs. Matrix multiplications and dispatch transmissions use the FP8 format, while core MLA computations and combine transmissions use BF16, matching the precision formats used in training.
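The mapping between operation type and numeric format can be summarized as a small policy table. The following sketch is purely illustrative (the names and structure are assumptions, not DeepSeek's code):

```python
# Illustrative precision-policy table reflecting the formats described above.
# Names and structure are hypothetical; this is not DeepSeek's implementation.
PRECISION_POLICY = {
    "matmul":            "fp8",   # matrix multiplications use FP8
    "dispatch_transfer": "fp8",   # all-to-all dispatch transmissions use FP8
    "core_mla":          "bf16",  # core MLA computations use BF16
    "combine_transfer":  "bf16",  # all-to-all combine transmissions use BF16
}

def dtype_for(op: str) -> str:
    """Return the numeric format used for a given kind of operation."""
    return PRECISION_POLICY[op]

if __name__ == "__main__":
    for op in PRECISION_POLICY:
        print(f"{op}: {dtype_for(op)}")
```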
## Performance Optimization via Expert Parallelism
The system uses large-scale cross-node expert parallelism (EP) to optimize performance: spreading experts across many GPUs lets each GPU hold only a small subset of experts, reducing memory-access demands and lowering latency, while the larger aggregate batch size per expert improves GPU matrix-multiplication efficiency and raises throughput.
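The routing idea behind EP can be shown with a minimal sketch: the router picks top-k experts per token, and tokens are grouped by whichever GPU hosts each expert before an all-to-all dispatch. The expert/GPU counts and the round-robin placement below are assumptions for illustration, not DeepSeek's implementation:

```python
# Minimal, illustrative sketch of cross-node expert-parallel (EP) routing.
from collections import defaultdict

NUM_EXPERTS = 256   # assumed number of routed experts
NUM_GPUS = 32       # assumed GPUs in one EP group
TOP_K = 8           # assumed experts selected per token

def expert_to_gpu(expert_id: int) -> int:
    # Each GPU hosts only NUM_EXPERTS / NUM_GPUS experts, so per-GPU weight
    # memory (and memory-access pressure) shrinks as the EP group grows.
    return expert_id % NUM_GPUS

def dispatch(token_routes: dict) -> dict:
    """Group (token, expert) pairs by the GPU hosting each expert.

    token_routes maps token_id -> list of expert ids chosen by the router;
    the result is what an all-to-all 'dispatch' would send to each GPU.
    """
    per_gpu = defaultdict(list)
    for token_id, experts in token_routes.items():
        for expert_id in experts:
            per_gpu[expert_to_gpu(expert_id)].append((token_id, expert_id))
    return per_gpu

if __name__ == "__main__":
    routes = {0: list(range(0, TOP_K)), 1: list(range(100, 100 + TOP_K))}
    for gpu, work in sorted(dispatch(routes).items()):
        print(f"GPU {gpu}: {work}")
```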
## Strategies to Reduce Communication Latency
The system hides communication latency behind computation through computation-communication overlap: during the prefill phase, two micro-batches are processed alternately so that one batch's communication overlaps with the other's computation (dual-batch overlap), and during the decode phase a 5-stage pipeline achieves a similar overlap.
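The dual-batch idea can be demonstrated with a toy sketch in which a simulated communication for micro-batch A runs in the background while micro-batch B computes. This is a conceptual illustration only; the timings and structure are assumptions, not the production pipeline:

```python
# Illustrative dual micro-batch overlap: communication of one micro-batch is
# hidden behind the computation of the other. Durations are simulated.
import time
from concurrent.futures import ThreadPoolExecutor

def compute(batch: str) -> None:
    time.sleep(0.05)              # simulated GEMM / attention work
    print(f"computed {batch}")

def communicate(batch: str) -> None:
    time.sleep(0.05)              # simulated all-to-all dispatch/combine
    print(f"communicated {batch}")

def overlapped_step(a: str, b: str) -> None:
    # Launch A's communication in the background, compute B meanwhile.
    with ThreadPoolExecutor(max_workers=1) as pool:
        fut = pool.submit(communicate, a)
        compute(b)
        fut.result()

if __name__ == "__main__":
    start = time.time()
    overlapped_step("micro-batch A", "micro-batch B")
    print(f"elapsed: {time.time() - start:.2f}s (about one phase, not two)")
```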
## Dynamic Resource Allocation in DeepSeek-V3/R1 Inference System
The system dynamically allocates resources based on service load, deploying all nodes for inference during peak daytime hours and reducing inference nodes during low-load nighttime hours to allocate resources for research and training.
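A simple way to picture this policy is a scheduler that returns the number of nodes dedicated to serving given the time of day and current load. The thresholds and the peak window below are assumptions for illustration; only the peak node count comes from the published figures:

```python
# Hedged sketch of load-based node allocation (illustrative only).
from datetime import time as dtime

TOTAL_NODES = 278  # peak node occupancy reported for the service

def inference_nodes(now: dtime, load_fraction: float) -> int:
    """Nodes dedicated to inference: all nodes during peak daytime hours,
    a reduced pool at night so freed nodes can serve research and training."""
    daytime = dtime(8, 0) <= now <= dtime(23, 0)   # assumed peak window
    if daytime or load_fraction > 0.5:             # assumed threshold
        return TOTAL_NODES
    return max(1, int(TOTAL_NODES * load_fraction))

if __name__ == "__main__":
    print(inference_nodes(dtime(14, 0), 0.9))  # peak hours -> 278
    print(inference_nodes(dtime(3, 30), 0.2))  # night, low load -> 55
```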
## Key Performance Statistics of DeepSeek-V3/R1 Inference System
Key performance statistics include:
- Total input tokens: 608B (56.3% cache hit rate).
- Total output tokens: 168B.
- Average output speed: 20-22 tokens per second.
- Throughput per H800 node: 73.7k input tokens per second (prefill), 14.8k output tokens per second (decode).
- Daily cost: $87,072, assuming a leasing price of $2 per H800 GPU-hour (peak occupancy: 278 nodes; average: 226.75 nodes; 8 GPUs per node).
- Theoretical daily revenue: $562,027 (if all tokens were billed at DeepSeek-R1 API pricing), for a cost-profit margin of 545%; see the arithmetic check below.
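The cost and margin figures above can be reproduced directly from the reported node count and the stated $2 per GPU-hour leasing assumption:

```python
# Arithmetic check of the reported cost and profit-margin figures.
avg_nodes = 226.75            # average node occupancy
gpus_per_node = 8             # H800 GPUs per node
price_per_gpu_hour = 2.0      # assumed leasing price in USD

daily_cost = avg_nodes * gpus_per_node * price_per_gpu_hour * 24
theoretical_revenue = 562_027

margin = (theoretical_revenue - daily_cost) / daily_cost

print(f"daily cost:    ${daily_cost:,.0f}")   # -> $87,072
print(f"profit margin: {margin:.0%}")         # -> 545%
```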
## Main Functions of DeepSeek-V3/R1 Inference System
The main functions of the system include managing the inference process of the DeepSeek-V3/R1 model, efficiently handling the prefill and decode stages, and providing AI model inference services via API or web interface.
## Load Balancing in DeepSeek-V3/R1 Inference System
The system achieves load balancing through specialized load balancers for the prefill, decode, and expert-parallelism stages: balancing core-attention computation and input-token counts across GPUs during prefill, KV-cache usage and request counts across GPUs during decode, and per-expert computational load across GPUs under expert parallelism.
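The common pattern behind these balancers is greedy least-loaded placement. The sketch below illustrates that idea only; it is not DeepSeek's balancer, and the "load" could stand for token counts (prefill), KV-cache usage (decode), or expert computation (EP):

```python
# Illustrative least-loaded assignment: keep per-instance load roughly even.
import heapq

def assign(request_loads, num_instances):
    """Greedily place each request (by its load) onto the currently
    least-loaded instance and return the resulting placement."""
    heap = [(0, i) for i in range(num_instances)]  # (current load, instance)
    heapq.heapify(heap)
    placement = [[] for _ in range(num_instances)]
    for load in sorted(request_loads, reverse=True):  # largest first
        current, idx = heapq.heappop(heap)
        placement[idx].append(load)
        heapq.heappush(heap, (current + load, idx))
    return placement

if __name__ == "__main__":
    print(assign([1200, 300, 800, 50, 700, 400], num_instances=3))
```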
## Theoretical Profit Margin of DeepSeek-V3/R1 Inference System
The theoretical profit margin of the system is 545%, derived from the theoretical daily revenue of $562,027 against the daily cost of $87,072. Actual revenue is considerably lower because DeepSeek-V3 pricing is below R1 pricing, web and app access is free, and nighttime discounts apply during off-peak hours.
## Documentation and Resources for DeepSeek-V3/R1 Inference System
Users can find documentation and resources for the system at the following URLs:
- [DeepSeek-V3/R1 Inference System Overview](https://github.com/deepseek-ai/open-infra-index/blob/main/202502OpenSourceWeek/day_6_one_more_thing_deepseekV3R1_inference_system_overview.md)
- [DeepSeek-R1 GitHub Repository](https://github.com/deepseek-ai/DeepSeek-R1)
### Citation sources:
- [DeepSeek-V3/R1 Inference System](https://github.com/deepseek-ai/open-infra-index/blob/main/202502OpenSourceWeek/day_6_one_more_thing_deepseekV3R1_inference_system_overview.md) - Official URL
Updated: 2025-03-31