chore: simplify formatting and update docs

ViperEkura 2026-03-31 00:28:58 +08:00
parent eb57e55fca
commit 50488bd659
14 changed files with 506 additions and 582 deletions

README.md

@@ -1,286 +1,147 @@
<div align="center">
  <img src="assets/images/project_logo.png" width="auto" alt="Logo">
  <h1>KHAOSZ</h1>
  <div>
    <a href="#english">English</a> |
    <a href="#chinese">中文</a>
  </div>
  <p>
    <strong>A lightweight Transformer training & inference framework</strong>
  </p>
</div>

## 📖 Table of Contents | 目录

<div align="center">

| English | 中文 |
|---------|------|
| [Installation](#installation) | [安装](#安装) |
| [Quick Start](#quick-start) | [快速开始](#快速开始) |
| [Documentation](#documentation) | [文档](#文档) |
| [License](#license) | [许可证](#许可证) |

</div>
---

<a id="english"></a>
## English

### Features

- 🚀 **High Performance**: Optimized for both training and inference
- 🔧 **Flexible**: Supports seq/sft/dpo training
- 💡 **Easy to Use**: Simple API with comprehensive examples
- 📦 **Lightweight**: Minimal dependencies
### Installation

```bash
git clone https://github.com/username/khaosz.git
cd khaosz
pip install -e .
```
### Quick Start

```bash
# Train
python tools/train.py \
    --train_type=seq \
    --data_root_path=/path/to/dataset \
    --param_path=/path/to/param_path

# Generate
python tools/generate.py --param_path=/path/to/param_path
```
### Demo

```bash
# run download before using
python demo/download.py

# run demo
python demo/stream_chat.py
python demo/generate_batch.py
python demo/generate_ar.py
```
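The chat demo follows the incremental-printing pattern from the project's earlier README: a `stream_generate`-style interface yields the full response generated so far, and the caller prints only the new suffix. A minimal sketch of that pattern, with a stubbed generator standing in for the model (`fake_stream` and `collect_deltas` are illustrative names, not the project's API):

```python
def fake_stream(query):
    """Stand-in for a streaming interface that yields the full
    response generated so far (a growing string)."""
    text = ""
    for piece in ["Hello", " there", "!"]:
        text += piece
        yield text

def collect_deltas(query):
    """Emit only the newly generated suffix of each partial response,
    tracking how many characters have already been shown."""
    shown = 0
    deltas = []
    for response in fake_stream(query):
        deltas.append(response[shown:])  # i.e. print(response[shown:], end="")
        shown = len(response)
    return deltas
```

Joining the deltas reproduces the final response, which is why this pattern gives flicker-free terminal streaming.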
- [bilibili](https://www.bilibili.com/video/BV1z5RPYHEkd)

<a id="license"></a>
### License

GPL-3.0

---

<a id="chinese"></a>
## 中文

### 特性

- 🚀 **高性能**: 训练与推理双向优化
- 🔧 **灵活**: 支持 seq/sft/dpo 多种训练方式
- 💡 **易用**: 简洁的 API 与丰富的示例
- 📦 **轻量**: 依赖少,部署简单

### 安装

```bash
git clone https://github.com/username/khaosz.git
cd khaosz
pip install -e .
```
### 快速开始

```bash
# 训练
python tools/train.py \
    --train_type=seq \
    --data_root_path=/path/to/dataset \
    --param_path=/path/to/param_path

# 生成
python tools/generate.py --param_path=/path/to/param_path
```

### 演示

```bash
# 使用前先下载模型
python demo/download.py

# 运行示例
python demo/stream_chat.py
python demo/generate_batch.py
python demo/generate_ar.py
```
- [bilibili](https://www.bilibili.com/video/BV1z5RPYHEkd)

### 许可证

GPL-3.0

---

<a id="documentation"></a>
## 📚 Documentation | 文档

| Document | 说明 |
|----------|------|
| [参数说明](assets/docs/params.md) | Training & inference parameters |
| [设计文档](assets/docs/design.md) | Framework design |
| [数据流程](assets/docs/dataflow.md) | Data processing pipeline |
| [模型介绍](assets/docs/introduction.md) | Model architecture |

### Download | 下载

- [HuggingFace](https://huggingface.co/ViperEk/KHAOSZ)
- `python demo/download.py`
---

@@ -1,205 +1,205 @@
# KHAOSZ Data Flow Documentation

This document describes the data flow of the KHAOSZ project (a training and inference framework for autoregressive Transformer language models). It covers the complete flow from raw data to model training and inference.

## Overview

KHAOSZ adopts a modular design with the following main components:

- **Data Module** (`khaosz/data/`): Dataset, sampler, tokenizer, serialization tools
- **Model Module** (`khaosz/model/`): Transformer model and its submodules
- **Training Module** (`khaosz/trainer/`): Trainer, training context, strategies, schedulers
- **Inference Module** (`khaosz/inference/`): Generation core, KV cache management, streaming generation
- **Config Module** (`khaosz/config/`): Model, training, scheduler, and other configurations
- **Parallel Module** (`khaosz/parallel/`): Distributed training support

The data flow divides into two main lines: the **training data flow** and the **inference data flow**.

## Data Flow Diagram

```mermaid
flowchart LR
    subgraph A[Data Preparation]
        direction TB
        A1[Raw Text] --> A2[BBPE Tokenizer]
        A2 --> A3[Serialize to .h5 files]
        A3 --> A4[Dataset Loading<br/>BaseDataset]
        A4 --> A5[Resumable Distributed Sampler<br/>ResumableDistributedSampler]
        A5 --> A6[DataLoader Batch Loading]
    end

    subgraph B[Training Loop]
        direction TB
        B1[Batch Data] --> B2[Training Strategy<br/>BaseStrategy]
        B2 --> B3[Transformer Model]
        B3 --> B4[Output logits]
        B4 --> B5[Loss Calculation]
        B5 --> B6[Backpropagation]
        B6 --> B7[Optimizer Update]
        B7 --> B8[Learning Rate Scheduler]
        B8 --> B9[Checkpoint Save]
    end

    subgraph C[Inference Generation]
        direction TB
        C1[Checkpoint Loading] --> C2[Inference Model Loading]
        C2 --> C3[Generation Core<br/>GeneratorCore]
        C3 --> C4[Sampling Strategy<br/>Temperature/top-k/top-p]
        C4 --> C5[Generate Next Token]
        C5 --> C6[KV Cache Update]
        C6 --> C7{Max Length Reached?}
        C7 -->|No| C5
        C7 -->|Yes| C8[Output Generated Text]
    end

    A --> B
    B --> C
```

## Detailed Module Descriptions

### 1. Data Module

#### 1.1 Tokenizer (`tokenizer.py`)

- Implements byte-level BPE (BBPE)
- Supports special tokens: `<bos>`, `<eos>`, `<pad>`, `<|im_start|>`, `<|im_end|>`
- Provides `encode`/`decode` methods to convert between text and token IDs
- Learns its vocabulary from the corpus during training and saves it as a `.json` file

#### 1.2 Serialization (`serialization.py`)

- **`save_h5`**: Saves multiple tensors, grouped by key, to an HDF5 (`.h5`) file; each key maps to a list of tensors
- **`load_h5`**: Loads an `.h5` file and returns `Dict[str, List[Tensor]]`; supports shared memory (`share_memory=True`)
- **`Checkpoint` class**: Encapsulates the model state dict, training epoch, and iteration count; supports saving and loading in safetensors format

#### 1.3 Dataset (`dataset.py`)

- **`BaseDataset`**: Abstract base class defining common logic such as window sampling and stride
- **`BaseSegmentFetcher`** and **`MultiSegmentFetcher`**: Efficiently fetch index ranges spanning multiple segments
- **`DatasetFactory`**: Factory pattern supporting dynamic registration of dataset types (`seq`, `sft`, `dpo`, `grpo`)
- After loading, a dataset manages multiple data keys (such as `"sequence"` and `"mask"`) through a `MultiSegmentFetcher`
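The window-and-stride logic reduces to plain index arithmetic; a sketch under illustrative names (`window_samples` is not the project's actual API):

```python
def window_samples(total_tokens, window_size, stride):
    """Return (start, end) index pairs for each training sample
    carved out of a token stream by a sliding window."""
    samples = []
    start = 0
    while start + window_size <= total_tokens:
        samples.append((start, start + window_size))
        start += stride
    return samples
```

With overlap (stride < window), adjacent samples share tokens, which trades more training examples for some redundancy.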
#### 1.4 Sampler (`sampler.py`)

- **`ResumableDistributedSampler`**: A resumable sampler that supports distributed training
- Records the current epoch and iteration position so training can resume from a breakpoint
- Supports `shuffle` and `drop_last` options
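The core idea behind a resumable distributed sampler: every rank shuffles with the same epoch-dependent seed, takes its own shard, and skips indices already consumed before the resume point. A minimal sketch (function name and signature are illustrative, not the project's implementation):

```python
import random

def resumable_indices(n, rank, world_size, epoch, start_iter=0, seed=0):
    """Sketch of a resumable distributed sampler."""
    order = list(range(n))
    random.Random(seed + epoch).shuffle(order)  # identical on all ranks
    shard = order[rank::world_size]             # this rank's slice
    return shard[start_iter:]                   # resume mid-epoch
```

Because every rank derives the same permutation from `seed + epoch`, the shards are disjoint and cover the whole dataset, and resuming with `start_iter=k` yields exactly the unconsumed tail.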
### 2. Model Module

#### 2.1 Transformer (`transformer.py`)

- Core autoregressive decoder architecture
- Contains an embedding layer, stacked `DecoderBlock` layers, RMSNorm, and a linear output head
- Supports weight tying (`tie_weight=True`) to reduce the parameter count
- Uses Rotary Position Embedding (RoPE) to inject position information
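RoPE rotates each (even, odd) pair of query/key dimensions by an angle proportional to the token's position; the useful property is that the resulting attention score depends only on the relative offset between positions. A pure-Python sketch for a single dimension pair (not the project's `RotaryEmbedding` implementation):

```python
import math

def rope_rotate_pair(x0, x1, pos, theta):
    """Rotate one (even, odd) dimension pair by angle pos * theta."""
    a = pos * theta
    return (x0 * math.cos(a) - x1 * math.sin(a),
            x0 * math.sin(a) + x1 * math.cos(a))

def rope_dot(q, k, m, n, theta):
    """Dot product of a rotated query at position m and key at position n."""
    qm = rope_rotate_pair(*q, m, theta)
    kn = rope_rotate_pair(*k, n, theta)
    return qm[0] * kn[0] + qm[1] * kn[1]
```

Shifting both positions by the same amount leaves the score unchanged, which is the relative-position property that makes RoPE attractive for decoders.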
#### 2.2 Submodules (`module.py`)

- **`RotaryEmbedding`**: Generates the RoPE cos/sin cache
- **`DecoderBlock`**: Contains multi-head attention (with GQA support), a feed-forward network (FFN), and residual connections
- **`RMSNorm`**: A layer-normalization variant
- **`Linear`**, **`Embedding`**: Custom linear and embedding layers that support parallelism wrappers

### 3. Training Module

#### 3.1 Training Context (`train_context.py`)

- **`TrainContext`**: A data class encapsulating everything training needs (model, optimizer, data loader, strategy, etc.)
- **`TrainContextBuilder`**: Builder pattern that assembles the training context step by step and supports resuming from a checkpoint

#### 3.2 Trainer (`trainer.py`)

- **`Trainer`**: The main training loop; manages callbacks (progress bar, checkpointing, metric logging, gradient clipping, scheduler)
- Supports distributed training (launches multiple processes via `spawn_parallel_fn`)
- A training step proceeds as:
  1. `on_train_begin` → 2. `on_epoch_begin` → 3. `on_batch_begin` → 4. Forward/loss calculation → 5. `on_batch_end` → 6. Gradient accumulation → 7. `on_step_begin` → 8. Optimizer update → 9. `on_step_end` → 10. `on_epoch_end`

#### 3.3 Strategy (`strategy.py`)

- **`BaseStrategy`**: Defines the training-strategy interface (implementations include `SeqStrategy`, `SFTStrategy`, `DPOStrategy`)
- A strategy receives batch data, runs the model's forward pass and loss calculation, and returns the loss tensor
- Strategies are created dynamically by `StrategyFactory` according to the configuration

#### 3.4 Scheduler (`schedule.py`)

- **`BaseScheduler`**: Abstract base class defining the learning-rate scheduling interface
- **`SchedulerFactory`**: Factory pattern supporting registration of multiple schedulers (such as `cosine` and `sgdr`)
- The scheduler is created automatically from the configuration and bound to the optimizer
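A `cosine` schedule of the kind referenced here is typically a linear warmup to the peak learning rate followed by cosine decay; a sketch of the schedule as a function of the step (the exact shape the project uses may differ):

```python
import math

def lr_at(step, max_lr, warmup_steps, total_steps, min_lr=0.0):
    """Warmup + cosine decay: linear ramp to max_lr over warmup_steps,
    then cosine anneal from max_lr down to min_lr."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

The schedule peaks exactly at the end of warmup and reaches `min_lr` at `total_steps`, with the half-way point sitting at the mean of the two.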
### 4. Inference Module

#### 4.1 Generation Core (`core.py`)

- **`GeneratorCore`**: Provides the `generate_iterator` method, which performs single-step generation
- Applies sampling strategies (temperature, top-k, top-p) to filter the logits
- Supports a KV cache to accelerate autoregressive generation
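Temperature scaling, top-k truncation, and top-p (nucleus) truncation can be sketched in plain Python — an illustration of the standard technique, not the project's `apply_sampling_strategies` implementation:

```python
import math

def filter_logits(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return sampling probabilities after temperature scaling,
    top-k truncation, and top-p (nucleus) truncation."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)                      # stabilize the softmax
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    keep = set(order[:top_k] if top_k > 0 else order)

    cum, nucleus = 0.0, set()
    for i in order:                      # smallest set whose mass >= top_p
        if i in keep:
            nucleus.add(i)
            cum += probs[i]
            if cum >= top_p:
                break
    kept = sum(probs[i] for i in nucleus)
    return [probs[i] / kept if i in nucleus else 0.0 for i in range(len(probs))]
```

Setting `top_p` close to 0 collapses to greedy decoding; `top_k=0` and `top_p=1.0` leave the distribution untouched apart from temperature.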
#### 4.2 KV Cache Management (`core.py`)

- **`KVCacheManager`**: Manages the K and V caches for each layer; supports batch generation and length extension
- The cache shape is `[batch_size, n_kv_heads, seq_len, head_dim]`

#### 4.3 Generator (`generator.py`)

- **`GenerationRequest`**: Encapsulates generation request parameters (`top_k`, `top_p`, `temperature`, `max_len`, `query`, `history`, etc.)
- **`build_prompt`**: Converts the query and history into a ChatML-format prompt string
- **`pad_sequence`**: Pads input IDs to a consistent length
- Provides streaming and non-streaming generation interfaces
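A ChatML prompt built from the special tokens listed in §1.1 looks roughly like the sketch below; the exact template the project's `build_prompt` uses is an assumption here:

```python
def build_prompt(query, history):
    """Sketch: render (user, assistant) history plus the new query as a
    ChatML-style prompt ending with an open assistant turn."""
    parts = []
    for user_msg, assistant_msg in history:
        parts.append(f"<|im_start|>user\n{user_msg}<|im_end|>\n")
        parts.append(f"<|im_start|>assistant\n{assistant_msg}<|im_end|>\n")
    parts.append(f"<|im_start|>user\n{query}<|im_end|>\n")
    parts.append("<|im_start|>assistant\n")  # model completes from here
    return "".join(parts)
```

The trailing open `assistant` turn is what makes the model's continuation land in the right role; generation stops when it emits `<|im_end|>`.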
## Training Data Flow: Detailed Steps

1. **Data Preparation**
   - Raw text is converted into token ID sequences by the BBPE tokenizer
   - The token ID sequences (possibly with masks, labels, etc.) are saved, grouped by key, as `.h5` files
   - A file can contain multiple segments, each corresponding to one tensor

2. **Dataset Loading**
   - `BaseDataset`'s `load` method calls `load_h5` to obtain a `segments` dictionary
   - A `MultiSegmentFetcher` is created to manage the data for each key
   - The total sample count is computed, and each sample's start/end indices are determined from the window size and stride

3. **Sampling and Batch Loading**
   - `ResumableDistributedSampler` generates an index sequence based on the current epoch and iteration position
   - `DataLoader` uses the sampler's indices and calls the dataset's `__getitem__` to fetch the actual data
   - The batch shape is `[batch_size, window_size]` (it may vary with the dataset type)

4. **Strategy Forward Pass and Loss Calculation**
   - Batch data is passed to a strategy (such as `SeqStrategy`)
   - The strategy calls the `Transformer` model to obtain logits
   - The cross-entropy loss (or DPO loss, etc.) is computed according to the task type
   - The loss tensor is returned

5. **Backpropagation and Optimization**
   - The loss is normalized by the number of accumulation steps, then `loss.backward()` is executed
   - After `accumulation_steps` batches have accumulated, the optimizer's `step()` and `zero_grad()` are executed
   - The learning-rate scheduler updates the learning rate after each step

6. **Checkpoint Saving**
   - `CheckpointCallback` saves checkpoints at the configured interval
   - Checkpoints contain the model state dict plus metadata such as the current epoch and iteration
   - Checkpoints are saved in safetensors format for safety and efficiency
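Step 5's accumulation logic, reduced to arithmetic with stand-in numbers instead of real gradients:

```python
def train_with_accumulation(batch_losses, accumulation_steps):
    """Sketch of gradient accumulation: each batch contributes
    loss / accumulation_steps to the pending 'gradient'; the optimizer
    steps once every accumulation_steps batches."""
    pending = 0.0          # stands in for accumulated gradients
    optimizer_steps = []   # effective loss applied at each step
    for i, loss in enumerate(batch_losses, start=1):
        pending += loss / accumulation_steps   # loss.backward() equivalent
        if i % accumulation_steps == 0:
            optimizer_steps.append(pending)    # optimizer.step()
            pending = 0.0                      # optimizer.zero_grad()
    return optimizer_steps
```

Dividing by `accumulation_steps` before accumulating is what makes an accumulated step equivalent to one step over a batch `accumulation_steps` times larger.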
## Inference Data Flow: Detailed Steps

1. **Model Loading**
   - The `Transformer` model and tokenizer are loaded from a checkpoint
   - The model is set to evaluation mode (`model.eval()`) and inference mode is enabled (`torch.inference_mode`)

2. **Prompt Construction and Encoding**
   - The user query and history are converted into a ChatML-format string via `build_prompt`
   - The tokenizer encodes the prompt string into the token ID sequence `input_ids`
   - For batch generation, `pad_sequence` pads the inputs

3. **Autoregressive Generation Loop**
   - The KV cache is initialized (optional)
   - The loop runs until `max_len` tokens have been generated or a stop token is produced:
     - The current `input_ids` (or, with caching, just the new token) are fed to the model to obtain `logits`
     - `apply_sampling_strategies` (temperature, top-k, top-p) is applied to the `logits`
     - The next token ID is sampled from the resulting distribution
     - The new token is appended to `input_ids` and the KV cache is updated
     - In streaming generation, each token is yielded to the caller as soon as it is produced

4. **Decoding and Output**
   - The generated token ID sequence is decoded to text by the tokenizer
   - Special tokens are removed and the plain-text response is returned
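The generation loop of step 3, with a toy next-token function standing in for the model forward pass and sampling:

```python
def generate(input_ids, next_token_fn, max_len, stop_token):
    """Sketch of the autoregressive loop: produce a token, append it,
    stop after max_len new tokens or at the stop token."""
    out = list(input_ids)
    for _ in range(max_len):
        token = next_token_fn(out)   # model forward + sampling stand-in
        out.append(token)
        if token == stop_token:
            break
    return out

def toy_next_token(ids):
    """Toy 'model': the next token is (last token + 1) mod 5."""
    return (ids[-1] + 1) % 5
```

The two exit conditions mirror the real loop: a length budget and an end-of-sequence token.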
## Checkpoints and Serialization

- **Training checkpoint**: Saves model parameters, optimizer state, scheduler state, and the current epoch and iteration
- **Model parameters**: Supports safetensors format; special logic such as weight tying is handled automatically during loading
- **Dataset serialization**: HDF5 supports efficient random access and shared memory, making it well suited to large-scale pre-training data

## Summary

KHAOSZ's data flow is designed to be modular, extensible, and resumable. The training data flow supports large-scale distributed training through mechanisms such as chunked loading, resumable sampling, and gradient accumulation; the inference data flow achieves efficient text generation with a KV cache and sampling strategies. Clear interfaces between modules make the system easy to customize and extend.

> Document updated: 2026-03-30
> Corresponding code version: see the version number defined in `pyproject.toml`
---

@@ -1,16 +1,16 @@
## 1. Why I Created This Project

There are many large language models on the market today, such as GPT and LLaMA, with tens of billions or even hundreds of billions of parameters. Honestly, though, their hardware requirements put them far out of reach for ordinary developers. So I wondered: **can we build a model that is both useful and able to run on an ordinary computer?** That is what many people want right now: a small, locally deployable AI project that is fully private yet still reasonably capable.

Thus the KHAOSZ project was born: 1B parameters, Chinese-English bilingual, supporting dialogue, text generation, and RAG retrieval, with fully open-source training code!

## 2. System Architecture

The system is divided into the following modules:

```mermaid
graph LR
    %% Style definitions
    classDef config fill:#e1f5fe,stroke:#01579b;
    classDef trainer fill:#f3e5f5,stroke:#4a148c;
    classDef data fill:#e8f5e8,stroke:#1b5e20;
@@ -18,16 +18,16 @@ graph LR
    classDef inference fill:#fce4ec,stroke:#880e4f;
    classDef parallel fill:#e0f2f1,stroke:#004d40;

    %% Config module
    subgraph Config["Config"]
        C1[model_config.py]
        C2[train_config.py]
        C3[scheduler_config.py]
    end
    class Config config;

    %% Trainer module
    subgraph Trainer["Trainer"]
        T1[trainer.py]
        T2[train_content.py]
        T3[schedule.py]
@@ -36,8 +36,8 @@ graph LR
    end
    class Trainer trainer;

    %% Data module
    subgraph Data["Data"]
        D1[dataset.py]
        D2[sampler.py]
        D3[mmap.py]
@@ -46,175 +46,159 @@ graph LR
    end
    class Data data;

    %% Model module
    subgraph Model["Model"]
        M1[transformer.py]
        M2[module.py]
    end
    class Model model;

    %% Inference module
    subgraph Inference["Inference"]
        I1[generator.py]
        I2[core.py]
    end
    class Inference inference;

    %% Parallel module
    subgraph Parallel["Parallel"]
        P1[setup.py]
        P2[module.py]
    end
    class Parallel parallel;

    %% Config dependencies
    C2 -.-> T1
    C1 -.-> M1
    C3 -.-> T3

    %% Trainer internal dependencies
    T1 --> T5
    T1 --> T2
    T2 --> T3
    T2 --> T4

    %% Data flow
    D1 --> D2
    D1 --> D3
    D1 --> D4
    D1 --> D5

    %% Model dependencies
    M1 --> M2

    %% Inference dependencies
    I1 --> I2

    %% Cross-module dependencies
    T2 -.-> M1
    I1 -.-> M1
    T2 -.-> D1
    T1 -.-> P1
```
### 1. Configuration Management (/config/)

- **Model Configuration**: Defines model structure parameters (layers, heads, dimensions, etc.), managed uniformly through `ModelConfig`.
- **Training Configuration**: Sets training parameters (batch size, training stages PT/SFT/DPO, optimizer, etc.), loaded by `TrainConfig`.
- **Scheduler Configuration**: Controls learning-rate strategies (such as cosine annealing) and training progress.

### 2. Hardware and Parallelism (/parallel/)

- **Distributed Initialization**: The `setup_parallel` function initializes multi-GPU/multi-machine training environments according to the configuration.

### 3. Data Processing (/data/)

- **Efficient Loading**: Uses memory mapping (mmap) to load massive corpora, avoiding memory overflow and achieving zero-copy reads.

### 4. Model and Training (/model/, /trainer/)

- **Unified Model Architecture**: Transformer-based, flexibly configurable to different scales (such as 7B or 13B).
- **Strategy-based Trainer**: `Trainer` automatically switches training strategies for each stage (PT/SFT/DPO) while reusing the same training loop.
- **Training Context Management**: Manages the model, optimizer, scheduler, and metrics in one place, supporting seamless multi-stage transitions.

### 5. Inference Service (/inference/, /utils/)

- **Unified Generation Interface**: Provides synchronous, batch, and streaming generation methods, adapted to all training stages.
- **KV Cache Optimization**: Caches Key/Value during autoregressive generation, using fast on-chip memory on NVIDIA GPUs for acceleration.
- **RAG Support**: Combines a retriever and an embedding model to inject relevant information from external knowledge bases, improving answer quality.
- **Intelligent Text Segmentation**:
  - **Structure-first segmentation**: Splits by titles, paragraphs, etc.
  - **Semantic segmentation**: Based on sentence-embedding similarity, ensuring each fragment is semantically complete and improving fine-tuning results.
## 3. Training Process

The common training process for large language models (LLMs) typically includes three stages: **pre-training (PT)**, **supervised fine-tuning (SFT)**, and **reinforcement learning from human feedback (RLHF)**. This system supports the full process end to end: modular strategies enable efficient switching and state management across stages, so the model's capability evolves step by step from general language understanding to human-preference-aligned dialogue and instruction following.

### **2.1 Pre-training Stage**

The pre-training stage builds the model's foundational language capability and general knowledge representation. It performs self-supervised learning on a large-scale, unlabeled corpus (typically hundreds of GB to several TB of text). The model is a standard Transformer decoder trained with a causal language modeling objective, through which it learns vocabulary, grammar, semantics, and the world knowledge embedded in text.
**核心公式因果语言建模Causal Language Modeling**
$$
L_{\text{PT}} = - \sum_{t=1}^{T} \log P(x_t \mid x_{<t}; \theta)
$$
**Symbol Description:**

- $T$: sequence length
- $x_t$: the $t$-th token in the sequence
- $x_{<t}$: all tokens before position $t$
- $\theta$: model parameters
- $P(x_t \mid x_{<t}; \theta)$: the probability the model assigns to token $x_t$ given the preceding context
The core of this stage is to leverage distributed parallel computing resources for stable optimization of the model parameters. The `PTStrategy` in the trainer module manages pre-training-specific data sampling, long-sequence segmentation, and gradient-accumulation logic, while the hardware adaptation module automatically selects the optimal parallel communication backend (such as NCCL) for the runtime environment (such as an NVIDIA GPU cluster) and performs computation-graph optimization to maximize hardware utilization and training throughput.
Additionally, the system achieves zero-copy reading of massive data through the efficient memory-mapped loader (`MmapFileHandler`) in the data module, overcoming traditional IO bottlenecks.
### **3.2 Supervised Fine-Tuning Stage**

Although pre-trained models possess powerful language generation capabilities, they are not yet aligned with following human instructions and engaging in safe, helpful dialogue. The supervised fine-tuning stage bridges this gap using high-quality instruction-response pair datasets carefully written by humans.

**Core Formula: Sequence-to-Sequence Conditional Language Modeling**

Let the complete sequence be $S = [s_1, s_2, \ldots, s_{P+L}]$, where:

- the first $P$ tokens are the prompt and its control tokens: $X = [s_1, \ldots, s_P]$
- the last $L$ tokens are the response and its control tokens: $Y = [s_{P+1}, \ldots, s_{P+L}]$

The loss function is computed over the response tokens only:

$$
L_{\text{SFT}} = - \sum_{t=P+1}^{P+L} \log P(s_t \mid s_{<t}; \theta)
$$

The trainer module dynamically switches to the `SFTStrategy`. The core of this strategy is a sequence-level supervised objective: predicting the complete, correct response for a given instruction. The training context manager (`TrainContext`) smoothly loads the model state from a PT-stage checkpoint and initializes a new optimizer and learning-rate scheduler. This stage not only optimizes model parameters but, more importantly, teaches the model the "dialogue" task paradigm, so that its output style, content, and format match human expectations.
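A minimal sketch of this response-only loss: prompt positions are excluded via `ignore_index`, so only the last $L$ target tokens contribute, exactly as in the $L_{\text{SFT}}$ sum. The tensor shapes and the use of `ignore_index=-100` are illustrative assumptions here, not the exact khaosz implementation.

```python
import torch
import torch.nn.functional as F

def sft_loss(logits, target_ids, prompt_len):
    """L_SFT: cross-entropy over the response tokens only; the P prompt
    positions are masked with ignore_index so they contribute no loss."""
    labels = target_ids.clone()
    labels[:, :prompt_len] = -100  # mask the prompt positions
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # [bsz * seq_len, vocab]
        labels.reshape(-1),                   # [bsz * seq_len]
        ignore_index=-100,
    )

# Toy batch: bsz=1, seq_len=4, vocab=10; the first 2 positions are prompt.
logits = torch.randn(1, 4, 10)
targets = torch.randint(0, 10, (1, 4))
loss = sft_loss(logits, targets, prompt_len=2)
print(loss.item())
```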
### **3.3 Reinforcement Learning from Human Feedback Stage**
To generate outputs that are more helpful, harmless, and aligned with human preferences, the system further integrates a reinforcement learning stage. The traditional RLHF process includes two core steps: **Reward Model Training** and **Policy Model Fine-tuning**. The system supports policy fine-tuning via the Direct Preference Optimization (DPO) algorithm, with several engineering optimizations for stability and convergence.
#### **3.3.1 Traditional RLHF (Reward Model Training)**

$$
L_{\text{RM}} = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma\left( r_\phi(x, y_w) - r_\phi(x, y_l) \right) \right]
$$

**Symbol Description:**

- $r_\phi(x, y)$: the scalar score given by the reward model with parameters $\phi$
- $y_w, y_l$: the preferred and dispreferred responses for the same prompt $x$
- $\sigma$: the sigmoid function
- $\mathcal{D}$: the human preference dataset
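The pairwise reward-model loss translates directly into code. A sketch with toy scalar scores (a real RM would produce these scores from the $(x, y)$ pairs):

```python
import torch
import torch.nn.functional as F

def reward_model_loss(r_chosen, r_rejected):
    """L_RM = -E[log sigmoid(r(x, y_w) - r(x, y_l))]: the reward of the
    preferred answer should exceed the reward of the rejected one."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Toy scores for a batch of 3 preference pairs.
r_w = torch.tensor([2.0, 1.5, 0.3])
r_l = torch.tensor([1.0, 1.8, -0.5])
print(reward_model_loss(r_w, r_l).item())
```

The loss shrinks as the margin $r_\phi(x, y_w) - r_\phi(x, y_l)$ grows, which is exactly the ranking behavior the formula encodes.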
#### **3.3.2 DPO (Direct Preference Optimization, Recommended)**
$$
L_{\text{DPO}} = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right) \right]
$$

**Symbol Description:**

- $\pi_\theta(y \mid x)$: the probability of the current policy model generating response $y$ given prompt $x$
- $\pi_{\text{ref}}(y \mid x)$: the probability of the frozen reference model generating response $y$
- $\beta$: temperature parameter (typically 0.1-0.5)
- Note: DPO implicitly learns the reward function $r(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)}$
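A sketch of $L_{\text{DPO}}$ given per-sequence log-probabilities (summed over response tokens) from the policy and the frozen reference; batching and log-prob extraction are assumed to happen elsewhere:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """L_DPO: push the policy to widen the log-prob margin of the preferred
    answer over the rejected one, relative to the frozen reference model."""
    logits = beta * ((policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l))
    return -F.logsigmoid(logits).mean()

# Toy sequence log-probs for a batch of 2 preference pairs.
pol_w = torch.tensor([-4.0, -3.0]); pol_l = torch.tensor([-6.0, -3.5])
ref_w = torch.tensor([-5.0, -3.2]); ref_l = torch.tensor([-5.0, -3.4])
print(dpo_loss(pol_w, pol_l, ref_w, ref_l).item())
```

When the policy equals the reference, the inner logits are zero and the loss is $\log 2$; training below that value means the policy has started to prefer $y_w$ relative to the reference.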
In this stage, the trainer module enables the `RLHFStrategy` (or similar `DPOStrategy` direct preference optimization strategy). This strategy manages a complex training loop containing the policy model (LLM to be optimized), reference model (usually an SFT model snapshot), and reward model. The system flow is as follows:
在本阶段,训练器模块启用`RLHFStrategy`策略(或类似的`DPOStrategy`直接偏好优化策略。该策略管理一个复杂的训练循环其中包含策略模型待优化的LLM、参考模型通常为SFT后的模型快照和奖励模型。系统流程如下 1. **Preference Data Collection and Reward Modeling**: First, by collecting human annotators' ranking preferences for multiple model-generated results for the same prompt, a separate reward model (RM) is trained. This model learns to output a scalar reward score for generated text to quantify the degree of alignment with human preferences.
2. **Policy Optimization**: Then, using the reward model as the optimization signal, the SFT model (as the policy) is fine-tuned through reinforcement learning algorithms. The goal of policy optimization is to maximize the expected cumulative reward obtained from the reward model, while constraining the output distribution of the policy model and reference model from deviating too much through a KL divergence penalty term, preventing mode collapse and maintaining generation diversity. The training context manager maintains the states of the policy model, reference model, and reward model (or value function model) simultaneously at this stage, and coordinates complex multi-stage gradient computations.
1. **偏好数据收集与奖励建模**首先通过收集人类标注员对同一提示词下多个模型生成结果的排序偏好数据训练一个独立的奖励模型Reward Model, RM。该模型学习为生成文本输出一个标量奖励分数以量化其符合人类偏好的程度。 Through the above three-stage progressive training, the model completes its evolution from a general language foundation to a specialized, highly-aligned dialogue intelligence. The system, through unified `Trainer` interface and strategy pattern design, makes each stage of training highly reusable at the code level, clearly decoupled at the process level, providing an efficient, flexible, and scalable engineering foundation for large-scale language model research and iteration.
2. **策略优化**随后使用奖励模型作为优化信号通过强化学习算法对SFT模型作为策略进行微调。策略优化的目标是最大化从奖励模型获得的期望累计奖励同时通过KL散度惩罚项约束策略模型与参考模型的输出分布不过度偏离以防止模式崩溃并保持生成多样性。训练上下文管理器在此阶段同时维护策略模型、参考模型和奖励模型或价值函数模型的状态并协调复杂的多阶段梯度计算。
通过上述三阶段的递进式训练,模型完成了从通用语言基座到专业化、高对齐度对话智能体的进化。系统通过统一的`Trainer`接口和策略模式设计,使得各阶段训练在代码层面高度复用,在流程层面清晰解耦,为大规模语言模型的研发与迭代提供了高效、灵活且可扩展的工程基础。
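The strategy-pattern dispatch described above can be sketched as follows. The method name and the `Trainer`/`TrainContext` wiring are simplified assumptions; only the `seq`/`sft`/`dpo` switch mirrors the documented `--train_type` options.

```python
from abc import ABC, abstractmethod

class TrainStrategy(ABC):
    """Common interface the Trainer dispatches to for each stage."""
    @abstractmethod
    def compute_loss(self, batch):
        ...

class PTStrategy(TrainStrategy):
    def compute_loss(self, batch):
        return f"causal LM loss on {batch}"

class SFTStrategy(TrainStrategy):
    def compute_loss(self, batch):
        return f"response-only loss on {batch}"

class DPOStrategy(TrainStrategy):
    def compute_loss(self, batch):
        return f"preference loss on {batch}"

# --train_type values from the parameter docs map onto the strategies.
STRATEGIES = {"seq": PTStrategy, "sft": SFTStrategy, "dpo": DPOStrategy}

def make_strategy(train_type: str) -> TrainStrategy:
    return STRATEGIES[train_type]()

print(make_strategy("sft").compute_loss("batch_0"))
```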


@@ -1,50 +1,26 @@
## Model Introduction
### 1. Model Architecture
This model uses the Transformer architecture with a GQA mechanism (q_head=24, kv_head=4), which saves KV-cache memory compared to traditional MHA (although KV cache is not currently implemented). The model is built by stacking 24 Transformer layers, with 1.0B parameters in total. The Transformer is an autoregressive model: it computes the relationships with all preceding tokens to obtain the probability distribution of the next token.
![structure](../images/structure.png)
What is an autoregressive model? After a sentence is split into tokens, the model predicts the probability distribution of the next token: given the context (the sequence of tokens that have already appeared), it assigns a probability to each possible next token.
#### 1. Autoregression
In autoregressive modeling, when a sentence is tokenized into a sequence of tokens, the model learns to predict what comes next. Given a sequence of tokens as input, the model calculates a probability distribution over all possible next tokens. This distribution tells us how likely each potential next token is, given the current context.
For instance, if the input sequence contains tokens representing a question, the model might predict that certain response tokens have higher probabilities than others. The sampling process then selects one token from this distribution (controlled by parameters such as top_k, top_p, and temperature) to serve as the next token in the sequence.
Once a token is selected, it is appended to the input sequence, and the model repeats the process: the updated sequence is fed back in to predict the following token. This continues until either a special end-of-sequence token is generated or the maximum sequence length is reached. These control tokens are essential; without them the model would keep generating tokens indefinitely and eventually exhaust available memory.
#### 2. Causal Mask
Transformers use the attention mechanism. The input shape is generally [bsz, seq_len] and the output is [bsz, seq_len, n_dim]. To predict the next token, the model's inputs and targets must be offset by one position, and during training we use the same offset-by-one scheme:
```
sequence  : [[1, 2, 3, 4, 5, 6]]
input_ids : [[1, 2, 3, 4, 5]]
target_ids: [[2, 3, 4, 5, 6]]
```
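The offset-by-one construction above is just two slices of the same sequence:

```python
def shift_for_training(sequence):
    """Build the offset-by-one (input, target) pair used for causal LM training."""
    return sequence[:-1], sequence[1:]

input_ids, target_ids = shift_for_training([1, 2, 3, 4, 5, 6])
print(input_ids)   # → [1, 2, 3, 4, 5]
print(target_ids)  # → [2, 3, 4, 5, 6]
```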
The attention score calculation formula is:
$$ \tilde{s}_{ij} = \frac{q_i^T k_j}{\sqrt{d_k}} + mask_{ij} $$
$$ s_{ij} = \text{softmax}\left( \tilde{s}_{ij} \right) $$
Here, the attention score represents the degree to which the model attends to the similarity between two tokens.
For decoder-only models, a mask must be added during attention computation to prevent the model from "stealing" information from future positions. The mask is applied to the attention scores before the softmax; it is typically a lower-triangular matrix of shape [n, n] for a sequence of length n. Below is an example of such a causal mask for a sequence of length 5:
```
[[0, -inf, -inf, -inf, -inf],
 [0,    0, -inf, -inf, -inf],
 [0,    0,    0, -inf, -inf],
 [0,    0,    0,    0, -inf],
 [0,    0,    0,    0,    0]]
```
In this matrix, 0 marks positions that may be attended to and -inf marks positions that must be masked. This guarantees that after the softmax, the attention scores at positions with $j > i$ become 0 (from `-inf`), so the model cannot see future information.
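Such a mask can be built directly, for example in PyTorch (a sketch; khaosz's own mask construction may differ):

```python
import torch

def causal_mask(n):
    """Lower-triangular causal mask: 0 where attention is allowed,
    -inf strictly above the diagonal (future positions)."""
    mask = torch.full((n, n), float("-inf"))
    return torch.triu(mask, diagonal=1)  # zero out the diagonal and below

print(causal_mask(5))
```

Adding this matrix to the pre-softmax scores makes every row's softmax distribute probability only over positions $j \le i$.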
#### 3. Rotary Position Embedding
Rotary Position Embedding (RoPE) is a position encoding method designed to solve the problem of lacking direct modeling of sequence position information in Transformer models. Unlike traditional position encodings (such as sine and cosine function position encodings), RoPE embeds position information directly into the Query (Q) and Key (K) vectors, allowing the model to more naturally handle relative position relationships in sequences.
$$ q_i = R_i W_q x_i $$
$$ k_j = R_j W_k x_j $$
$$ q_i^T k_j = (R_i W_q x_i)^T (R_j W_k x_j) = x_i^T W_q^T R_{i-j} W_k x_j $$
Here $R_{i-j}$ controls how attention between tokens decays with relative distance: the larger $|i - j|$ is, the stronger the attenuation. This lets the model learn relative positional relationships, allowing it to extend and adapt to longer sequences.
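The relative-position property can be checked numerically with a single 2-D rotation pair (one RoPE frequency); the angle scale `theta` is an arbitrary illustrative choice:

```python
import math

def rotate(vec, pos, theta=1.0):
    """Rotate a 2-D vector by angle pos * theta (one RoPE frequency pair)."""
    x, y = vec
    c, s = math.cos(pos * theta), math.sin(pos * theta)
    return (x * c - y * s, x * s + y * c)

def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]

q, k = (1.0, 2.0), (0.5, -1.0)
# The score depends only on the relative offset between positions:
score_a = dot(rotate(q, 5), rotate(k, 3))   # positions (5, 3), offset 2
score_b = dot(rotate(q, 9), rotate(k, 7))   # positions (9, 7), offset 2
print(abs(score_a - score_b) < 1e-9)  # → True
```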
## KV Cache Implementation
According to the attention calculation formula:
$$
\begin{align*}
s_{ij} &= \text{softmax}\left( \frac{q_{i} k_{j}}{\sqrt{d_k}} \right) \\
o_{i} &= \sum_j s_{ij} v_{j}
\end{align*}
$$
Since the model is autoregressive, we only need to compute the output for the last position of the sequence; that is, the index $i$ is fixed to the last element, and we compute $o_{n}$:
$$
\begin{align*}
s_{j} &= \text{softmax}\left(\frac{q_n k_{j}}{\sqrt{d_k}} \right) \\
o_{n} &= \sum_j s_{j} v_{j}
\end{align*}
$$
If we expand the expression:
$$
o_n = \sum_j \text{softmax}\left(\frac{q_n k_{j}}{\sqrt{d_k}}\right)v_{j}
$$
In this expression only $k$ and $v$ carry a position index, while $q$ does not. During decoding, the $q$ input is therefore fixed to the last token of the previous step, while $k$ and $v$ must be cached across the growing sequence. Note that rotary position encoding must be applied before the KV cache is updated; otherwise the cached keys will carry incorrect position information.
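The cached decode step described above can be sketched for a single attention head (the shapes and cache layout are illustrative assumptions, and positional encoding is omitted):

```python
import torch
import torch.nn.functional as F

def decode_step(q_new, k_new, v_new, k_cache, v_cache, d_k):
    """One autoregressive step with a KV cache: append the new key/value,
    then attend from the single new query over all cached positions."""
    k_cache = torch.cat([k_cache, k_new], dim=0)  # grows to [t, d_k]
    v_cache = torch.cat([v_cache, v_new], dim=0)
    scores = (k_cache @ q_new) / d_k ** 0.5       # [t]
    weights = F.softmax(scores, dim=-1)
    o_n = weights @ v_cache                       # [d_v]
    return o_n, k_cache, v_cache

d = 4
k_cache, v_cache = torch.zeros(0, d), torch.zeros(0, d)
for _ in range(3):  # decode 3 tokens; RoPE would be applied to q/k before caching
    q, k, v = torch.randn(d), torch.randn(1, d), torch.randn(1, d)
    o, k_cache, v_cache = decode_step(q, k, v, k_cache, v_cache, d)
print(k_cache.shape)  # the cache grows by one row per decoded token
```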

assets/docs/params.md (new file, 115 lines)

@@ -0,0 +1,115 @@
# Parameter Documentation
## Training Parameters
### Basic Parameters
| Parameter | Description | Default Value |
|-----------|-------------|---------------|
| `--train_type` | Training type (seq, sft, dpo) | required |
| `--data_root_path` | Dataset root directory | required |
| `--param_path` | Model parameters or checkpoint path | required |
| `--n_epoch` | Total training epochs | 1 |
| `--batch_size` | Batch size | 1 |
| `--accumulation_steps` | Gradient accumulation steps | 1 |
### Learning Rate Scheduling
| Parameter | Description | Default Value |
|-----------|-------------|---------------|
| `--warmup_steps` | Warmup steps | 1000 |
| `--max_lr` | Maximum learning rate (warmup + cosine decay) | 3e-4 |
| `--max_grad_norm` | Maximum gradient norm | 1.0 |
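The documented warmup + cosine-decay behavior of `--max_lr` can be sketched as below. The decay floor, total-step handling, and exact shape in khaosz are assumptions; only the warmup-then-cosine structure is taken from the table above.

```python
import math

def lr_at(step, warmup_steps=1000, total_steps=100_000, max_lr=3e-4, min_lr=0.0):
    """Warmup + cosine decay: linear ramp to max_lr, then cosine to min_lr."""
    if step < warmup_steps:
        return max_lr * step / warmup_steps  # linear warmup
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

print(lr_at(500))   # mid-warmup: half of max_lr
print(lr_at(1000))  # warmup ends exactly at max_lr
```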
### Checkpoint
| Parameter | Description | Default Value |
|-----------|-------------|---------------|
| `--ckpt_interval` | Checkpoint save interval (iterations) | 5000 |
| `--ckpt_dir` | Checkpoint save directory | checkpoint |
| `--resume_dir` | Resume training from specified path | - |
### Optimizer Parameters
| Parameter | Description | Default Value |
|-----------|-------------|---------------|
| `--adamw_beta1` | AdamW beta1 | 0.9 |
| `--adamw_beta2` | AdamW beta2 | 0.95 |
| `--adamw_weight_decay` | AdamW weight decay | 0.01 |
### Data Loading
| Parameter | Description | Default Value |
|-----------|-------------|---------------|
| `--random_seed` | Random seed | 3407 |
| `--num_workers` | DataLoader workers | 4 |
| `--no_pin_memory` | Disable pin_memory | - |
### Distributed Training
| Parameter | Description | Default Value |
|-----------|-------------|---------------|
| `--nprocs` | Number of GPUs | 1 |
| `--device_type` | Device type (cuda/cpu) | cuda |
### Other Parameters
| Parameter | Description | Default Value |
|-----------|-------------|---------------|
| `--window_size` | Maximum input sequence length | model config max_len |
| `--stride` | Input sequence stride | - |
| `--dpo_beta` | DPO beta value | 0.1 |
| `--label_smoothing` | Label smoothing parameter | 0.1 |
| `--start_epoch` | Starting epoch | 0 |
| `--start_batch` | Starting batch | 0 |
---
## Generation Parameters
### GenerationRequest Parameters
| Parameter | Description | Default Value |
|-----------|-------------|---------------|
| `query` | Input text or text list | required |
| `history` | Conversation history | None |
| `system_prompt` | System prompt | None |
| `temperature` | Sampling temperature (higher = more random) | required |
| `top_p` | Nucleus sampling threshold | required |
| `top_k` | Top-k sampling count | required |
| `max_len` | Maximum generation length | model config max_len |
| `stream` | Whether to stream output | False |
### Usage Example
```python
import torch

from khaosz.config.param_config import ModelParameter
from khaosz.inference.generator import StreamGenerator, GenerationRequest
# Load model
param = ModelParameter.load("your_model_dir")
param.to(device="cuda", dtype=torch.bfloat16)
# Create generator
generator = StreamGenerator(param)
# Build request
request = GenerationRequest(
query="Hello",
history=[],
temperature=0.8,
top_p=0.95,
top_k=50,
)
# Generate
response = generator.generate(request)
```
### Three Types of Generators
| Generator | Usage |
|-----------|-------|
| `StreamGenerator` | Streaming output, returns word by word |
| `LoopGenerator` | Non-streaming output, returns at once |
| `BatchGenerator` | Batch generation, processes multiple queries simultaneously |


@@ -1,13 +1,12 @@
-import os
+from pathlib import Path
 from huggingface_hub import snapshot_download
 
-PROJECT_ROOT = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
+PROJECT_ROOT = Path(__file__).parent.parent
+PARAMETER_ROOT = Path(PROJECT_ROOT, "params")
 
 if __name__ == "__main__":
     snapshot_download(
         repo_id="ViperEk/KHAOSZ",
-        local_dir=os.path.join(PROJECT_ROOT, "params"),
+        local_dir=PARAMETER_ROOT,
         force_download=True,
     )


@@ -1,20 +1,19 @@
-import os
 import torch
+from pathlib import Path
 from khaosz.config.param_config import ModelParameter
 from khaosz.inference.core import disable_random_init
-from khaosz.inference.generator import LoopGenerator, GenerationRequest
+from khaosz.inference.generator import GeneratorFactory, GenerationRequest
 
-PROJECT_ROOT = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
+PROJECT_ROOT = Path(__file__).parent.parent
+PARAMETER_ROOT = Path(PROJECT_ROOT, "params")
 
 def generate_text():
     with disable_random_init():
-        model_dir = os.path.join(PROJECT_ROOT, "params")
-        param = ModelParameter.load(model_dir)
+        param = ModelParameter.load(PARAMETER_ROOT)
         param.to(device="cuda", dtype=torch.bfloat16)
 
     query = input(">> ")
     request = GenerationRequest(
@@ -26,7 +25,7 @@ def generate_text():
         history=None,
         system_prompt=None,
     )
-    generator = LoopGenerator(param)
+    generator = GeneratorFactory.create(param, request)
     response = generator.generate(request)
     print(response)


@@ -1,19 +1,19 @@
-import os
 import torch
+from pathlib import Path
 from khaosz.config.param_config import ModelParameter
 from khaosz.inference.core import disable_random_init
-from khaosz.inference.generator import BatchGenerator, GenerationRequest
+from khaosz.inference.generator import GeneratorFactory, GenerationRequest
 
-PROJECT_ROOT = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
+PROJECT_ROOT = Path(__file__).parent.parent
+PARAMETER_ROOT = Path(PROJECT_ROOT, "params")
 
 def batch_generate():
     with disable_random_init():
-        model_dir = os.path.join(PROJECT_ROOT, "params")
-        param = ModelParameter.load(model_dir)
+        param = ModelParameter.load(PARAMETER_ROOT)
         param.to(device="cuda", dtype=torch.bfloat16)
-        generator = BatchGenerator(param)
 
     inputs = [
         "你好",
         "请问什么是人工智能",
@@ -31,6 +31,7 @@ def batch_generate():
         history=None,
         system_prompt=None,
     )
+    generator = GeneratorFactory.create(param, request)
     responses = generator.generate(request)
     for q, r in zip(inputs, responses):


@@ -1,21 +1,18 @@
-import os
 import torch
+from pathlib import Path
 from khaosz.config.param_config import ModelParameter
 from khaosz.inference.core import disable_random_init
-from khaosz.inference.generator import StreamGenerator, GenerationRequest
+from khaosz.inference.generator import GeneratorFactory, GenerationRequest
 
-PROJECT_ROOT = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
+PROJECT_ROOT = Path(__file__).parent.parent
+PARAMETER_ROOT = Path(PROJECT_ROOT, "params")
 
 def chat():
     with disable_random_init():
-        model_dir = os.path.join(PROJECT_ROOT, "params")
-        param = ModelParameter.load(model_dir)
+        param = ModelParameter.load(PARAMETER_ROOT)
         param.to(device="cuda", dtype=torch.bfloat16)
-        generator = StreamGenerator(param)
 
     history = []
     while True:
@@ -32,6 +29,7 @@ def chat():
         history=history,
         system_prompt=None,
     )
+    generator = GeneratorFactory.create(param, request)
     response_size = 0
     full_response = ""


@@ -7,7 +7,7 @@ from abc import ABC, abstractmethod
 from torch import Tensor
 from torch.utils.data import Dataset
 from khaosz.data.serialization import load_h5
-from typing import Callable, List, Dict, Literal, Optional, Union
+from typing import List, Dict, Optional, Union
 
 class BaseSegmentFetcher:


@@ -75,7 +75,7 @@ class Checkpoint:
         with open(save_path / "meta.json", "w") as f:
             json.dump(meta, f, indent=2)
 
-        st.save_file(self.state_dict, save_path / f"state_dict.safetensors")
+        st.save_file(self.state_dict, save_path / "state_dict.safetensors")
 
     @classmethod
     def load(
@@ -96,7 +96,7 @@ class Checkpoint:
             dist.broadcast_object_list(meta_list, src=0)
             meta = meta_list[0]
 
-        state_dict = st.load_file(save_path / f"state_dict.safetensors")
+        state_dict = st.load_file(save_path / "state_dict.safetensors")
 
         return cls(
             state_dict=state_dict,


@@ -219,7 +219,7 @@ class BatchGenerator(GeneratorCore):
                 ids_list[i].append(token)
                 c_ids += 1
 
-                is_active = not token in self.tokenizer.stop_ids
+                is_active = token not in self.tokenizer.stop_ids
                 activate_task_mask[i] = is_active
                 active_mask.append(is_active)


@@ -7,7 +7,7 @@ import torch.nn.functional as F
 from torch.nn.parallel import DistributedDataParallel as DDP
 from torch import Tensor
-from typing import Any, Callable, Dict, Union, Optional
+from typing import Any, Callable, Dict, Union
 from abc import ABC, abstractmethod


@@ -6,7 +6,6 @@ import torch.nn as nn
 from pathlib import Path
 from tqdm import tqdm
 from torch.nn.utils import clip_grad_norm_
-from torch.optim.lr_scheduler import LRScheduler
 from typing import Callable, List, Optional, Protocol
 from khaosz.parallel import only_on_rank