docs: update network interface documentation

This commit is contained in:
ViperEkura 2026-04-06 13:39:51 +08:00
parent 64b78ecce3
commit 408f0cb513
3 changed files with 169 additions and 1 deletions


@@ -84,6 +84,38 @@ python scripts/tools/train.py \
python scripts/tools/generate.py --param_path=/path/to/param_path
```
#### Start HTTP Server
Start the inference server with an OpenAI-compatible HTTP API:
```bash
python -m scripts.tools.server --port 8000 --device cuda
```
Make requests:
```bash
# Chat API (OpenAI compatible)
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"messages": [{"role": "user", "content": "Hello"}],
"max_tokens": 512
}'
# Streaming response
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"messages": [{"role": "user", "content": "Tell a story"}],
"stream": true,
"max_tokens": 500
}'
# Health check
curl http://localhost:8000/health
```
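The same chat request can be issued from Python using only the standard library. This is a minimal sketch mirroring the curl calls above; the helper names (`build_chat_payload`, `chat`) are illustrative, not part of the project.

```python
import json
import urllib.request

SERVER = "http://localhost:8000"  # assumes the server started as shown above


def build_chat_payload(content, stream=False, max_tokens=512):
    """Assemble the JSON body expected by /v1/chat/completions."""
    return {
        "messages": [{"role": "user", "content": content}],
        "stream": stream,
        "max_tokens": max_tokens,
    }


def chat(content):
    """POST a non-streaming chat request and return the raw response text."""
    body = json.dumps(build_chat_payload(content)).encode("utf-8")
    req = urllib.request.Request(
        f"{SERVER}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")
```

With the server running, `chat("Hello")` returns the same JSON the curl example prints.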
#### Demo
Check out the demos in the `scripts/demo/` folder:


@@ -85,6 +85,38 @@ python scripts/tools/train.py \
python scripts/tools/generate.py --param_path=/path/to/param_path
```
#### Start HTTP Server
Start the inference server with an OpenAI-compatible HTTP API:
```bash
python -m scripts.tools.server --port 8000 --device cuda
```
Make requests:
```bash
# Chat API (OpenAI compatible)
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"messages": [{"role": "user", "content": "你好"}],
"max_tokens": 512
}'
# Streaming response
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"messages": [{"role": "user", "content": "讲个故事"}],
"stream": true,
"max_tokens": 500
}'
# Health check
curl http://localhost:8000/health
```
#### Demo
Check out the demos in the `scripts/demo/` folder:


@@ -190,4 +190,108 @@ for token in engine.generate_with_request(request):
print(token, end="", flush=True)
```
The continuous batching feature allows dynamic batch composition where new requests can join at any time and completed requests are released immediately.
## HTTP API Usage
The inference server provides HTTP endpoints for remote inference. Start the server first:
```bash
python -m scripts.tools.server --port 8000
```
### OpenAI-Compatible Endpoint
The server provides an OpenAI-compatible chat completion endpoint at `/v1/chat/completions`:
```bash
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Hello, how are you?"}
],
"temperature": 0.8,
"max_tokens": 2048,
"stream": false
}'
```
**Request Parameters:**
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `messages` | List[dict] | Required | Chat messages with role and content |
| `temperature` | float | 0.8 | Sampling temperature (0.0-2.0) |
| `top_p` | float | 0.95 | Nucleus sampling threshold |
| `top_k` | int | 50 | Top-k sampling parameter |
| `max_tokens` | int | 2048 | Maximum tokens to generate |
| `stream` | bool | false | Enable streaming response |
| `system_prompt` | str | None | System prompt override |
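The defaults in the table above can be captured in a small client-side helper that fills in any sampling parameters the caller omits. `make_request_body` and `DEFAULTS` are illustrative names, not part of the server code:

```python
# Default sampling parameters, taken from the request-parameter table.
DEFAULTS = {
    "temperature": 0.8,
    "top_p": 0.95,
    "top_k": 50,
    "max_tokens": 2048,
    "stream": False,
}


def make_request_body(messages, **overrides):
    """Merge caller overrides onto the documented defaults.

    `messages` is the only required field; everything else falls
    back to DEFAULTS when not supplied.
    """
    if not messages:
        raise ValueError("messages is required")
    body = {**DEFAULTS, **overrides}
    body["messages"] = messages
    return body
```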
**Response (non-streaming):**
```json
{
"id": "chatcmpl-1234567890",
"object": "chat.completion",
"created": 1234567890,
"model": "astrai",
"choices": [
{
"index": 0,
"message": {"role": "assistant", "content": "Hello! I'm doing well..."},
"finish_reason": "stop"
}
]
}
```
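Extracting the assistant text from a non-streaming response is a matter of indexing into `choices`. A small helper (illustrative, not part of the project) guards against an empty choice list:

```python
def extract_reply(response: dict) -> str:
    """Return the assistant message text from a chat.completion response."""
    choices = response.get("choices", [])
    if not choices:
        raise ValueError("response contains no choices")
    return choices[0]["message"]["content"]
```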
### Streaming Response
Enable streaming for real-time token-by-token output:
```bash
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"messages": [{"role": "user", "content": "Write a story"}],
"stream": true,
"max_tokens": 500
}'
```
The server uses Server-Sent Events (SSE) with content type `text/event-stream`.
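Each SSE event arrives as a `data: <json>` line, and OpenAI-style streams conventionally end with a `data: [DONE]` sentinel. Whether this server emits `[DONE]` and delta-style chunks is an assumption; the parser below is a sketch that handles both cases:

```python
import json


def parse_sse_line(line: str):
    """Decode one SSE line.

    Returns the parsed JSON chunk, the string "[DONE]" for the
    end-of-stream sentinel, or None for blank/keep-alive lines.
    """
    line = line.strip()
    if not line.startswith("data:"):
        return None  # comments and blank keep-alive lines
    payload = line[len("data:"):].strip()
    if payload == "[DONE]":
        return "[DONE]"
    return json.loads(payload)
```

A client would read the response line by line, stop on `"[DONE]"`, and print each chunk's content as it arrives.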
### Simple Generation Endpoint
For basic text generation without chat format:
```bash
curl -X POST "http://localhost:8000/generate?query=Hello&max_len=1000" \
-H "Content-Type: application/json"
```
Or with conversation history:
```bash
curl -X POST "http://localhost:8000/generate" \
-H "Content-Type: application/json" \
-d '{
"query": "What is AI?",
"history": [["Hello", "Hi there!"], ["How are you?", "I'm doing well"]],
"temperature": 0.8,
"max_len": 2048
}'
```
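The `history` field is a list of `[user, assistant]` turn pairs. Building the request body for a multi-turn conversation can be sketched as follows (helper names are illustrative):

```python
def append_turn(history, user_msg, assistant_msg):
    """Append one [user, assistant] pair, returning a new history list."""
    return history + [[user_msg, assistant_msg]]


def build_generate_body(query, history=None, temperature=0.8, max_len=2048):
    """Body for POST /generate, mirroring the curl example above."""
    return {
        "query": query,
        "history": history or [],
        "temperature": temperature,
        "max_len": max_len,
    }
```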
### Health Check
Monitor server and model status:
```bash
curl http://localhost:8000/health
# {"status": "ok", "model_loaded": true, "engine_ready": true}
curl http://localhost:8000/stats
# {"requests_total": 10, "tokens_generated": 5000, ...}
```
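A readiness probe should treat the server as up only when all three fields in the `/health` response check out. A minimal sketch, assuming the response shape shown above:

```python
def is_ready(health: dict) -> bool:
    """True only when status is ok and both components report ready."""
    return bool(
        health.get("status") == "ok"
        and health.get("model_loaded", False)
        and health.get("engine_ready", False)
    )
```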