docs: update design.md project structure and module documentation

This commit is contained in:
ViperEkura 2026-04-02 20:11:19 +08:00
parent 912d7c7f54
commit 8b6509b305
1 changed file with 105 additions and 96 deletions


@ -9,117 +9,126 @@ Thus, the AstrAI project was born - 1B parameters, Chinese-English bilingual, su
The system is divided into the following modules:
```mermaid
flowchart TB
    %% Style definitions
    classDef config fill:#e1f5fe,stroke:#01579b,stroke-width:2px;
    classDef data fill:#e8f5e8,stroke:#1b5e20,stroke-width:2px;
    classDef model fill:#fff3e0,stroke:#e65100,stroke-width:2px;
    classDef trainer fill:#f3e5f5,stroke:#4a148c,stroke-width:2px;
    classDef inference fill:#fce4ec,stroke:#880e4f,stroke-width:2px;
    classDef parallel fill:#e0f2f1,stroke:#004d40,stroke-width:2px;
    classDef scripts fill:#fffbe6,stroke:#f57f17,stroke-width:2px;

    subgraph Config["Config Module (config/)"]
        direction LR
        C1[model_config.py<br/>Model Architecture]
        C2[train_config.py<br/>Training Params]
        C3[param_config.py<br/>Hyperparameters]
        C4[schedule_config.py<br/>Scheduler Config]
    end
    class Config config;

    subgraph Data["Data Module (data/)"]
        direction LR
        D1[dataset.py<br/>Dataset]
        D2[sampler.py<br/>Sampler]
        D3[serialization.py<br/>Serialization]
        D4[tokenizer.py<br/>Tokenizer]
    end
    class Data data;

    subgraph Model["Model Module (model/)"]
        direction LR
        M1[transformer.py<br/>Transformer Architecture]
        M2[module.py<br/>Model Components]
    end
    class Model model;

    subgraph Trainer["Trainer Module (trainer/)"]
        direction TB
        T1[trainer.py<br/>Trainer Entry]
        T2[train_context.py<br/>Training Context]
        T3[strategy.py<br/>Training Strategy]
        T4[schedule.py<br/>LR Scheduler]
        T5[train_callback.py<br/>Callbacks]
        T6[metric_util.py<br/>Metrics]
    end
    class Trainer trainer;

    subgraph Inference["Inference Module (inference/)"]
        direction LR
        I1[generator.py<br/>Text Generation]
        I2[core.py<br/>Inference Core]
        I3[server.py<br/>API Service]
    end
    class Inference inference;

    subgraph Parallel["Parallel Module (parallel/)"]
        direction LR
        P1[setup.py<br/>Parallel Init]
        P2[module.py<br/>Parallel Components]
    end
    class Parallel parallel;

    subgraph Scripts["Scripts (scripts/)"]
        direction LR
        S1[tools/<br/>Train & Inference]
        S2[demo/<br/>Demos]
    end
    class Scripts scripts;

    %% External config input
    Config --> Trainer

    %% Training flow
    Trainer -->|Load Model| Model
    Trainer -->|Load Data| Data
    Trainer -->|Setup| Parallel

    %% Inference flow
    Inference -->|Use Model| Model

    %% Data dependency
    Data -->|Data Pipeline| Model

    %% Parallel dependency
    Parallel -->|Distributed| Trainer

    %% Scripts
    Scripts -->|Execute| Trainer
    Scripts -->|Execute| Inference
```
### 1. Configuration Module (config/)
- **model_config.py**: Defines model structure parameters (layers, heads, dimensions, etc.), managed through `ModelConfig`.
- **train_config.py**: Sets training parameters (batch size, training stages PT/SFT/DPO, optimizers, etc.), loaded by `TrainConfig`.
- **param_config.py**: Manages hyperparameters for training and inference.
- **schedule_config.py**: Controls learning rate strategies (cosine annealing) and training progress.
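A minimal sketch of how these configuration classes might look as plain dataclasses. Only `ModelConfig` and `TrainConfig` are named in the document; every field name and default value below is an illustrative assumption, not the project's actual schema.

```python
from dataclasses import dataclass


@dataclass
class ModelConfig:
    """Model structure parameters (field names are illustrative assumptions)."""
    n_layers: int = 24
    n_heads: int = 16
    d_model: int = 2048
    vocab_size: int = 64000


@dataclass
class TrainConfig:
    """Training parameters; the stages mirror PT/SFT/DPO from the document."""
    stage: str = "PT"   # one of "PT", "SFT", "DPO"
    batch_size: int = 32
    lr: float = 3e-4


cfg = ModelConfig()
print(cfg.n_layers)  # 24
```

Keeping each concern in its own dataclass lets the trainer and the model be constructed from independent config objects, as the module split above suggests.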
### 2. Data Module (data/)
- **dataset.py**: Dataset handling and loading.
- **sampler.py**: Data sampling for different training stages.
- **serialization.py**: Data serialization and deserialization.
- **tokenizer.py**: Text tokenization and encoding.
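To make the tokenizer-to-dataset pipeline concrete, here is a toy sketch under stated assumptions: a stand-in whitespace tokenizer and a dataset that yields fixed-length, one-token-shifted input/target windows, as is typical for language-model training. None of these class internals are taken from the project's actual code.

```python
class Tokenizer:
    """Toy whitespace tokenizer standing in for data/tokenizer.py (illustrative)."""
    def __init__(self):
        self.vocab = {}

    def encode(self, text):
        # Assign each new token the next free id.
        return [self.vocab.setdefault(tok, len(self.vocab)) for tok in text.split()]


class TextDataset:
    """Fixed-length training windows over a token stream (sketch of dataset.py)."""
    def __init__(self, ids, seq_len):
        self.ids, self.seq_len = ids, seq_len

    def __len__(self):
        return max(0, len(self.ids) - self.seq_len)

    def __getitem__(self, i):
        # Input/target pair shifted by one token, as in next-token prediction.
        return self.ids[i:i + self.seq_len], self.ids[i + 1:i + 1 + self.seq_len]


tok = Tokenizer()
ids = tok.encode("the quick brown fox jumps over the lazy dog")
ds = TextDataset(ids, seq_len=4)
x, y = ds[0]
```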
### 3. Model Module (model/)
- **transformer.py**: Transformer architecture implementation.
- **module.py**: Model components and layers.
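The introduction mentions a roughly 1B-parameter model. A standard back-of-the-envelope count for a decoder-only Transformer shows how the config fields determine that size; the shape numbers below are illustrative assumptions, since the document does not state the exact architecture.

```python
def estimate_params(n_layers, d_model, vocab_size, ffn_mult=4):
    """Rough decoder-only Transformer parameter count (biases/norms ignored)."""
    embed = vocab_size * d_model             # token embedding (often tied with the output head)
    attn = 4 * d_model * d_model             # Q, K, V, and output projections per layer
    mlp = 2 * ffn_mult * d_model * d_model   # FFN up- and down-projection per layer
    return embed + n_layers * (attn + mlp)


# Illustrative shape only: 24 layers, d_model=2048, 64k vocab lands near 1.3B.
total = estimate_params(n_layers=24, d_model=2048, vocab_size=64000)
print(f"{total / 1e9:.2f}B")  # 1.34B
```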
### 4. Trainer Module (trainer/)
- **trainer.py**: Main training entry point.
- **train_context.py**: Training context management (model, optimizer, scheduler, metrics).
- **strategy.py**: Training strategies for PT/SFT/DPO stages.
- **schedule.py**: Learning rate scheduler.
- **train_callback.py**: Training callbacks (checkpoint, early stopping, etc.).
- **metric_util.py**: Metrics calculation and tracking.
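The stage-switching design above is essentially the strategy pattern: one training loop, with the per-stage loss swapped out. A minimal sketch, assuming hypothetical class names and using placeholder strings in place of real loss computations:

```python
class PretrainStrategy:
    """Next-token loss over the raw corpus (illustrative)."""
    def loss(self, batch):
        return f"lm_loss({batch})"


class SFTStrategy:
    """Supervised fine-tuning: loss computed only on response tokens (illustrative)."""
    def loss(self, batch):
        return f"masked_lm_loss({batch})"


class DPOStrategy:
    """Preference optimization over chosen/rejected pairs (illustrative)."""
    def loss(self, batch):
        return f"dpo_loss({batch})"


STRATEGIES = {"PT": PretrainStrategy, "SFT": SFTStrategy, "DPO": DPOStrategy}


class Trainer:
    """Same training loop for every stage; only the strategy object changes."""
    def __init__(self, stage):
        self.strategy = STRATEGIES[stage]()

    def step(self, batch):
        return self.strategy.loss(batch)


trainer = Trainer("SFT")
print(trainer.step("batch0"))  # masked_lm_loss(batch0)
```

Because the loop never branches on the stage itself, adding a new stage means registering one new strategy class rather than touching `trainer.py`.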
### 5. Inference Module (inference/)
- **generator.py**: Text generation with various methods (sync, batch, streaming).
- **core.py**: Inference core with KV cache optimization.
- **server.py**: API service for inference.

### 6. Parallel Module (parallel/)
- **setup.py**: Distributed initialization for multi-GPU/multi-machine training.
- **module.py**: Parallel communication components.
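The KV cache mentioned for `core.py` works by storing each layer's keys and values as tokens are generated, so every decoding step attends over cached history and only computes the newest token's projections. A toy sketch of that bookkeeping (the real cache holds tensors, not strings; this class is an assumption for illustration):

```python
class KVCache:
    """Toy per-layer key/value cache (concept behind core.py's KV caching)."""
    def __init__(self, n_layers):
        self.keys = [[] for _ in range(n_layers)]
        self.values = [[] for _ in range(n_layers)]

    def append(self, layer, k, v):
        # During autoregressive decoding only the newest token's K/V is computed;
        # earlier positions are reused from the cache instead of being recomputed.
        self.keys[layer].append(k)
        self.values[layer].append(v)
        return self.keys[layer], self.values[layer]


cache = KVCache(n_layers=2)
for step in range(3):
    ks, vs = cache.append(0, f"k{step}", f"v{step}")
print(len(ks))  # 3
```

This turns each decoding step's attention cost from quadratic in the sequence length to linear, which is why the cache matters for generation throughput.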
### 7. Scripts (scripts/)
- **tools/**: Main scripts for training and inference (train.py, generate.py, etc.).
- **demo/**: Demo scripts for interactive chat, batch generation, etc.
## 3. Training Process