

KHAOSZ

English Version

This is a Chinese-English bilingual Transformer model. The repository contains the model configuration and training workflow; training loads the parameters defined in param_path/config.json. The training script train.py parses command-line arguments, including the dataset root directory, number of training epochs, batch size, checkpoint interval, and checkpoint directory.

Model Download Options (Choose One):

  1. Visit HuggingFace to access Files and versions
  2. Run scripts/download.py to download parameters

Demo Video: bilibili

Training dataset sources are listed in the Model Card section of the HuggingFace download link.

License: The code is released under the Apache-2.0 license. Please credit the source when using it.

  • 📊 Device Selection: Code defaults to CUDA training
  • 🌐 Performance Optimization: dtype=torch.bfloat16 is enabled to accelerate training and reduce memory usage. Ensure hardware supports this feature.
  • 🤖 Language Support: The model supports training on Chinese and English. The BBPE tokenizer was trained only on Chinese and English text, so out-of-vocabulary (OOV) issues are rare for these languages but may occur for others.

📌 Training Guide

To train this Transformer model, follow these steps:

(1). Prepare Dataset:

Place datasets in the designated root directory. Files should be text documents in Chinese, English, or mixed. The format should match the model's input requirements, preferably pre-tokenized token_ids stored as a torch.Tensor (a tensor with a compact integer dtype uses far less memory than a Python list, whose int elements each carry full object overhead).
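
The pre-tokenization step above can be sketched as follows. This is a minimal illustration, not the repository's data-loading code; `tokenizer` stands in for the project's BBPE tokenizer and is assumed to expose an `encode(text) -> list[int]` method:

```python
import torch

def pretokenize(texts, tokenizer):
    """Flatten raw documents into one compact tensor of token ids.

    `tokenizer` is a stand-in for the repository's BBPE tokenizer and is
    assumed to expose encode(text) -> list[int].
    """
    ids = [tid for text in texts for tid in tokenizer.encode(text)]
    # int32 ids take 4 bytes each; a Python list of ints costs far more per element.
    return torch.tensor(ids, dtype=torch.int32)
```

The resulting tensor can then be written to the dataset root with torch.save and memory-mapped or loaded at training time.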

(2). Install Dependencies:

pip install -r requirements.txt
pip install .

(3). Run Training Script:

python train.py \
--train_type=sft \
--data_root_path=/path/to/dataset \
--param_path=/path/to/param_path \
--n_epoch=5 \
--batch_size=8 \
--max_lr=2e-4 \
--checkpoint_interval=10000 \
--checkpoint_dir=checkpoints

Parameter Descriptions:

  • --train_type: Training type (seq, sft, dpo)
  • --data_root_path: Root directory of the dataset
  • --param_path: Path to the model training parameters
  • --n_epoch: Total number of training epochs
  • --batch_size: Batch size
  • --accumulation_steps: Number of gradient-accumulation batches per optimizer step
  • --warmup_steps: Number of warmup steps
  • --max_lr: Maximum learning rate (using warmup + cosine decay)
  • --checkpoint_interval: Checkpoint saving interval
  • --checkpoint_dir: Directory to save checkpoints
  • --resume_dir: Resume training from the specified path

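The warmup-plus-cosine-decay schedule behind --max_lr and --warmup_steps can be sketched as follows. This is a minimal illustration of the schedule described above, not necessarily the repository's exact implementation; total_steps is a stand-in for the run length derived from n_epoch and the dataset size:

```python
import math

def lr_at_step(step, max_lr, warmup_steps, total_steps):
    """Learning rate at a given step: linear warmup, then cosine decay to 0."""
    if step < warmup_steps:
        # Linear warmup from max_lr / warmup_steps up to max_lr.
        return max_lr * (step + 1) / warmup_steps
    # Cosine decay over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * max_lr * (1.0 + math.cos(math.pi * progress))
```
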
Training logs will be saved in train_log.txt. Checkpoints will be saved in the specified directory for resuming training or evaluation.

👉 Usage Guide

(1). Chatting with the Model:

Open chat.py or use streaming/non-streaming interfaces:

Streaming Output:

import torch
from khaosz import Khaosz

model_dir = "your_model_parameter_dir"
model = Khaosz(model_dir).to(device='cuda', dtype=torch.bfloat16)
history = []

while True:
    query = input(">> ")
    if query == "!exit":
        break
    
    response_size = 0
    for response, history in model.stream_generate(
        query=query, 
        history=history,
        temperature=0.85,
        top_p=0.95,
        top_k=50
    ):
        print(response[response_size:], end="", flush=True)
        response_size = len(response)

Non-streaming Output:

import torch
from khaosz import Khaosz

model_dir = "your_model_parameter_dir"
model = Khaosz(model_dir).to(device='cuda', dtype=torch.bfloat16)
history = []

while True:
    query = input(">> ")
    if query == "!exit":
        break
    
    response = model.generate(
        query=query, 
        history=history,
        temperature=0.85,
        top_p=0.95,
        top_k=50
    )
    print(response)

(2). Retrieval-Augmented Generation (RAG):

import torch
from khaosz import Khaosz

model_dir = "your_model_parameter_dir"
model = Khaosz(model_dir).to(device='cuda', dtype=torch.bfloat16)

query = "your question here"
retrieved_content = model.retrieve_generate(
    query=query,
    retrieve_top_k=5,
    temperature=0.6,
    top_k=30,
    top_p=0.95
)
print(retrieved_content)

📌 Model Specifications

This model is a 24-layer Transformer whose configuration is defined in config.json, totaling approximately 1.0 billion (1.0B) parameters.

Key Design Choices:

  • Weight tying between embedding and final linear layers (standard for small models to save parameters)
  • Embedding layer optimization: Without weight tying, a 10,000-word vocabulary would consume ~102M parameters (0.1B)
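
The parameter saving from weight tying can be sketched in PyTorch; the sizes below are illustrative stand-ins, not the values from config.json:

```python
import torch.nn as nn

# Illustrative sizes; the real values live in config.json.
vocab_size, d_model = 10_000, 1024

embedding = nn.Embedding(vocab_size, d_model)
lm_head = nn.Linear(d_model, vocab_size, bias=False)
lm_head.weight = embedding.weight  # weight tying: one matrix serves both layers

# Count parameters once per unique tensor: the tied matrix is shared, not duplicated.
unique = {id(p): p for m in (embedding, lm_head) for p in m.parameters()}
total = sum(p.numel() for p in unique.values())
print(total)  # vocab_size * d_model counted once instead of twice
```

Without the tying assignment, the embedding and output projection would each hold a separate vocab_size x d_model matrix, doubling this count.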

Limitations:

  • May struggle with complex language phenomena due to smaller parameter size
  • Prone to overfitting on specialized datasets
  • Limited multilingual capabilities

Advantages:

  • Runs efficiently on lower-spec hardware
  • Shorter training time compared to larger models

Training Pipeline: The model has completed pre-training + SFT (Supervised Fine-Tuning) + DPO (Direct Preference Optimization) workflows. All corresponding training code is included in the repository.
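
The DPO stage mentioned above optimizes a preference loss over chosen/rejected response pairs. The following is a textbook sketch of that objective (per Rafailov et al.), not the repository's exact training code; inputs are summed log-probabilities of each response under the policy and the frozen reference model:

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair: -log(sigmoid(beta * reward margin))."""
    # Implicit reward margin: how much more the policy prefers the chosen
    # response over the rejected one, relative to the reference model.
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

A zero margin gives a loss of log 2; the loss shrinks as the policy widens the preference gap beyond the reference model's.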
