docs: update design.md project structure and module documentation

This commit is contained in:
ViperEkura 2026-04-02 20:11:19 +08:00
parent 912d7c7f54
commit 8b6509b305
1 changed file with 105 additions and 96 deletions


@ -9,117 +9,126 @@ Thus, the AstrAI project was born - 1B parameters, Chinese-English bilingual, su
The system is divided into the following modules:
```mermaid
flowchart TB
    %% Style definitions
    classDef config fill:#e1f5fe,stroke:#01579b,stroke-width:2px;
    classDef data fill:#e8f5e8,stroke:#1b5e20,stroke-width:2px;
    classDef model fill:#fff3e0,stroke:#e65100,stroke-width:2px;
    classDef trainer fill:#f3e5f5,stroke:#4a148c,stroke-width:2px;
    classDef inference fill:#fce4ec,stroke:#880e4f,stroke-width:2px;
    classDef parallel fill:#e0f2f1,stroke:#004d40,stroke-width:2px;
    classDef scripts fill:#fffbe6,stroke:#f57f17,stroke-width:2px;

    subgraph Config["Config Module (config/)"]
        direction LR
        C1[model_config.py<br/>Model Architecture]
        C2[train_config.py<br/>Training Params]
        C3[param_config.py<br/>Hyperparameters]
        C4[schedule_config.py<br/>Scheduler Config]
    end
    class Config config;

    subgraph Data["Data Module (data/)"]
        direction LR
        D1[dataset.py<br/>Dataset]
        D2[sampler.py<br/>Sampler]
        D3[serialization.py<br/>Serialization]
        D4[tokenizer.py<br/>Tokenizer]
    end
    class Data data;

    subgraph Model["Model Module (model/)"]
        direction LR
        M1[transformer.py<br/>Transformer Architecture]
        M2[module.py<br/>Model Components]
    end
    class Model model;

    subgraph Trainer["Trainer Module (trainer/)"]
        direction TB
        T1[trainer.py<br/>Trainer Entry]
        T2[train_context.py<br/>Training Context]
        T3[strategy.py<br/>Training Strategy]
        T4[schedule.py<br/>LR Scheduler]
        T5[train_callback.py<br/>Callbacks]
        T6[metric_util.py<br/>Metrics]
    end
    class Trainer trainer;

    subgraph Inference["Inference Module (inference/)"]
        direction LR
        I1[generator.py<br/>Text Generation]
        I2[core.py<br/>Inference Core]
        I3[server.py<br/>API Service]
    end
    class Inference inference;

    subgraph Parallel["Parallel Module (parallel/)"]
        direction LR
        P1[setup.py<br/>Parallel Init]
        P2[module.py<br/>Parallel Components]
    end
    class Parallel parallel;

    subgraph Scripts["Scripts (scripts/)"]
        direction LR
        S1[tools/<br/>Train & Inference]
        S2[demo/<br/>Demos]
    end
    class Scripts scripts;

    %% External config input
    Config --> Trainer

    %% Training flow
    Trainer -->|Load Model| Model
    Trainer -->|Load Data| Data
    Trainer -->|Setup| Parallel

    %% Inference flow
    Inference -->|Use Model| Model

    %% Data dependency
    Data -->|Data Pipeline| Model

    %% Parallel dependency
    Parallel -->|Distributed| Trainer

    %% Scripts
    Scripts -->|Execute| Trainer
    Scripts -->|Execute| Inference
```
### 1. Configuration Module (config/)
- **model_config.py**: Defines model structure parameters (layers, heads, dimensions, etc.), managed through `ModelConfig`.
- **train_config.py**: Sets training parameters (batch size, training stages PT/SFT/DPO, optimizers, etc.), loaded by `TrainConfig`.
- **param_config.py**: Manages hyperparameters for training and inference.
- **schedule_config.py**: Controls learning rate strategies (cosine annealing) and training progress.
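A minimal sketch of how these configuration classes might look as plain dataclasses. Only `ModelConfig` and `TrainConfig` are named in the document; every field name and default value below is an illustrative assumption, not the project's actual schema.

```python
from dataclasses import dataclass


@dataclass
class ModelConfig:
    """Model structure parameters (field names are illustrative assumptions)."""
    n_layers: int = 24
    n_heads: int = 16
    d_model: int = 2048
    vocab_size: int = 64000


@dataclass
class TrainConfig:
    """Training parameters; the stages mirror PT/SFT/DPO from the document."""
    stage: str = "PT"   # one of "PT", "SFT", "DPO"
    batch_size: int = 32
    lr: float = 3e-4


cfg = ModelConfig()
print(cfg.n_layers)  # 24
```

Keeping each concern in its own dataclass lets the trainer and the model be constructed from independent config objects, as the module split above suggests.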
### 2. Data Module (data/)
- **dataset.py**: Dataset handling and loading.
- **sampler.py**: Data sampling for different training stages.
- **serialization.py**: Data serialization and deserialization.
- **tokenizer.py**: Text tokenization and encoding.
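To make the tokenizer-to-dataset pipeline concrete, here is a toy sketch under stated assumptions: a stand-in whitespace tokenizer and a dataset that yields fixed-length, one-token-shifted input/target windows, as is typical for language-model training. None of these class internals are taken from the project's actual code.

```python
class Tokenizer:
    """Toy whitespace tokenizer standing in for data/tokenizer.py (illustrative)."""
    def __init__(self):
        self.vocab = {}

    def encode(self, text):
        # Assign each new token the next free id.
        return [self.vocab.setdefault(tok, len(self.vocab)) for tok in text.split()]


class TextDataset:
    """Fixed-length training windows over a token stream (sketch of dataset.py)."""
    def __init__(self, ids, seq_len):
        self.ids, self.seq_len = ids, seq_len

    def __len__(self):
        return max(0, len(self.ids) - self.seq_len)

    def __getitem__(self, i):
        # Input/target pair shifted by one token, as in next-token prediction.
        return self.ids[i:i + self.seq_len], self.ids[i + 1:i + 1 + self.seq_len]


tok = Tokenizer()
ids = tok.encode("the quick brown fox jumps over the lazy dog")
ds = TextDataset(ids, seq_len=4)
x, y = ds[0]
```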
### 3. Model Module (model/)
- **transformer.py**: Transformer architecture implementation.
- **module.py**: Model components and layers.
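The introduction mentions a roughly 1B-parameter model. A standard back-of-the-envelope count for a decoder-only Transformer shows how the config fields determine that size; the shape numbers below are illustrative assumptions, since the document does not state the exact architecture.

```python
def estimate_params(n_layers, d_model, vocab_size, ffn_mult=4):
    """Rough decoder-only Transformer parameter count (biases/norms ignored)."""
    embed = vocab_size * d_model             # token embedding (often tied with the output head)
    attn = 4 * d_model * d_model             # Q, K, V, and output projections per layer
    mlp = 2 * ffn_mult * d_model * d_model   # FFN up- and down-projection per layer
    return embed + n_layers * (attn + mlp)


# Illustrative shape only: 24 layers, d_model=2048, 64k vocab lands near 1.3B.
total = estimate_params(n_layers=24, d_model=2048, vocab_size=64000)
print(f"{total / 1e9:.2f}B")  # 1.34B
```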
### 4. Trainer Module (trainer/)
- **trainer.py**: Main training entry point.
- **train_context.py**: Training context management (model, optimizer, scheduler, metrics).
- **strategy.py**: Training strategies for PT/SFT/DPO stages.
- **schedule.py**: Learning rate scheduler.
- **train_callback.py**: Training callbacks (checkpoint, early stopping, etc.).
- **metric_util.py**: Metrics calculation and tracking.
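The stage-switching design above is essentially the strategy pattern: one training loop, with the per-stage loss swapped out. A minimal sketch, assuming hypothetical class names and using placeholder strings in place of real loss computations:

```python
class PretrainStrategy:
    """Next-token loss over the raw corpus (illustrative)."""
    def loss(self, batch):
        return f"lm_loss({batch})"


class SFTStrategy:
    """Supervised fine-tuning: loss computed only on response tokens (illustrative)."""
    def loss(self, batch):
        return f"masked_lm_loss({batch})"


class DPOStrategy:
    """Preference optimization over chosen/rejected pairs (illustrative)."""
    def loss(self, batch):
        return f"dpo_loss({batch})"


STRATEGIES = {"PT": PretrainStrategy, "SFT": SFTStrategy, "DPO": DPOStrategy}


class Trainer:
    """Same training loop for every stage; only the strategy object changes."""
    def __init__(self, stage):
        self.strategy = STRATEGIES[stage]()

    def step(self, batch):
        return self.strategy.loss(batch)


trainer = Trainer("SFT")
print(trainer.step("batch0"))  # masked_lm_loss(batch0)
```

Because the loop never branches on the stage itself, adding a new stage means registering one new strategy class rather than touching `trainer.py`.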
### 5. Inference Module (inference/)
- **generator.py**: Text generation with various methods (sync, batch, streaming).
- **core.py**: Inference core with KV cache optimization.
- **server.py**: API service for inference.

### 6. Parallel Module (parallel/)
- **setup.py**: Distributed initialization for multi-GPU/multi-machine training.
- **module.py**: Parallel communication components.
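The KV cache mentioned for `core.py` works by storing each layer's keys and values as tokens are generated, so every decoding step attends over cached history and only computes the newest token's projections. A toy sketch of that bookkeeping (the real cache holds tensors, not strings; this class is an assumption for illustration):

```python
class KVCache:
    """Toy per-layer key/value cache (concept behind core.py's KV caching)."""
    def __init__(self, n_layers):
        self.keys = [[] for _ in range(n_layers)]
        self.values = [[] for _ in range(n_layers)]

    def append(self, layer, k, v):
        # During autoregressive decoding only the newest token's K/V is computed;
        # earlier positions are reused from the cache instead of being recomputed.
        self.keys[layer].append(k)
        self.values[layer].append(v)
        return self.keys[layer], self.values[layer]


cache = KVCache(n_layers=2)
for step in range(3):
    ks, vs = cache.append(0, f"k{step}", f"v{step}")
print(len(ks))  # 3
```

This turns each decoding step's attention cost from quadratic in the sequence length to linear, which is why the cache matters for generation throughput.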
### 7. Scripts (scripts/)
- **tools/**: Main scripts for training and inference (train.py, generate.py, etc.).
- **demo/**: Demo scripts for interactive chat, batch generation, etc.
## 3. Training Process