nanoClaw/docs/CrawlerAndTools.md


Crawler Pipeline and Tool System Design

Overview

This document describes how to build a crawler pipeline and tool system for NanoClaw so that GLM models can:

  1. Fetch real-time information from the web through tool calls
  2. Perform structured data-collection tasks
  3. Interact with external systems

1. Overall Architecture

flowchart TB
    subgraph Frontend["Frontend (Vue 3)"]
        ChatView["ChatView (chat interface)"]
        ToolCallUI["ToolCallUI (tool call display)"]
        CrawlerPanel["CrawlerStatusPanel (crawler status panel)"]
    end

    subgraph Backend["Backend API (Flask)"]
        ChatAPI["/api/chat (chat completion)"]
        ToolsAPI["/api/tools (tool management)"]
        CrawlerAPI["/api/crawler (crawler tasks)"]
        Orchestrator["Tool Orchestrator<br/>(tool-call orchestration engine)"]
    end

    subgraph Services["Service Layer"]
        CrawlerService["Web Crawler (crawler service)"]
        BuiltInTools["Built-in Tools"]
        ExternalAPIs["External APIs (third-party services)"]
    end

    Frontend -->|"HTTP/SSE"| Backend
    ChatAPI --> Orchestrator
    ToolsAPI --> Orchestrator
    CrawlerAPI --> Orchestrator
    Orchestrator --> CrawlerService
    Orchestrator --> BuiltInTools
    Orchestrator --> ExternalAPIs

2. Tool System Design

2.1 Tool Definition Specification

Tools are defined with JSON Schema and are compatible with OpenAI Function Calling:

# backend/tools/registry.py

from dataclasses import dataclass
from typing import Callable, Any
import json

@dataclass
class ToolDefinition:
    """工具定义"""
    name: str                    # 工具名称,如 "web_search"
    description: str             # 工具描述,供模型理解用途
    parameters: dict             # JSON Schema 格式的参数定义
    handler: Callable            # 实际执行函数

    def to_openai_format(self) -> dict:
        """转换为 GLM/OpenAI 兼容格式"""
        return {
            "type": "function",
            "function": {
                "name": self.name,
                "description": self.description,
                "parameters": self.parameters
            }
        }


# Tool registry
class ToolRegistry:
    def __init__(self):
        self._tools: dict[str, ToolDefinition] = {}

    def register(self, tool: ToolDefinition):
        self._tools[tool.name] = tool

    def get(self, name: str) -> ToolDefinition | None:
        return self._tools.get(name)

    def list_all(self) -> list[dict]:
        return [t.to_openai_format() for t in self._tools.values()]

    def execute(self, name: str, arguments: dict) -> Any:
        tool = self.get(name)
        if not tool:
            raise ValueError(f"Tool not found: {name}")
        return tool.handler(**arguments)


# Global registry
registry = ToolRegistry()
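
A minimal usage sketch of the registry (the "echo" tool below is hypothetical and only illustrates the API; it is not part of the built-in tool set, and the import path is an assumption):

# Illustrative only: register a throwaway tool, then list and execute it
from backend.tools.registry import registry, ToolDefinition  # assumed import path

registry.register(ToolDefinition(
    name="echo",
    description="Return the input text unchanged (demo tool).",
    parameters={
        "type": "object",
        "properties": {"text": {"type": "string"}},
        "required": ["text"]
    },
    handler=lambda text: {"echo": text}
))

print(registry.list_all())                            # definitions in OpenAI/GLM tool format
print(registry.execute("echo", {"text": "hello"}))    # {'echo': 'hello'}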

2.2 Built-in Tool Definitions

2.2.1 Web Search Tool

# backend/tools/builtin/web_search.py

from ..registry import registry, ToolDefinition

def web_search(query: str, max_results: int = 5) -> dict:
    """
    Run a web search

    Args:
        query: search keywords
        max_results: maximum number of results to return

    Returns:
        A list of search results
    """
    # Delegate the search to the crawler service
    from ..crawler import search_service
    results = search_service.search(query, max_results)
    return {
        "success": True,
        "results": results
    }

# Register the tool
registry.register(ToolDefinition(
    name="web_search",
    description="Search the internet for real-time information. Use when the user asks about current events, news, or needs up-to-date data.",
    parameters={
        "type": "object",
        "properties": {
            "query": {
                "type": "string",
                "description": "Search keywords"
            },
            "max_results": {
                "type": "integer",
                "description": "Number of results to return (default 5)",
                "default": 5
            }
        },
        "required": ["query"]
    },
    handler=web_search
))

2.2.2 Page Content Fetch Tool

# backend/tools/builtin/fetch_page.py

from ..registry import registry, ToolDefinition

def fetch_page(url: str, extract_type: str = "text") -> dict:
    """
    Fetch the content of a web page

    Args:
        url: target page URL
        extract_type: extraction type (text/links/images/structured)

    Returns:
        The extracted content
    """
    from ..crawler import fetch_service
    return fetch_service.fetch(url, extract_type)

registry.register(ToolDefinition(
    name="fetch_page",
    description="Fetch the page at the given URL and extract its text, links, or structured data.",
    parameters={
        "type": "object",
        "properties": {
            "url": {
                "type": "string",
                "description": "URL of the page to fetch"
            },
            "extract_type": {
                "type": "string",
                "enum": ["text", "links", "images", "structured"],
                "description": "Extraction type",
                "default": "text"
            }
        },
        "required": ["url"]
    },
    handler=fetch_page
))

2.2.3 Batch Crawl Task Tool

# backend/tools/builtin/crawl_batch.py

from ..registry import registry, ToolDefinition

def crawl_batch(
    urls: list[str],
    extract_type: str = "text",
    parallel: int = 3
) -> dict:
    """
    Crawl multiple web pages in one batch

    Args:
        urls: list of URLs
        extract_type: extraction type
        parallel: number of concurrent fetches

    Returns:
        Task ID and status
    """
    from ..crawler import crawl_manager
    task_id = crawl_manager.create_task(
        urls=urls,
        extract_type=extract_type,
        parallel=parallel
    )
    return {
        "task_id": task_id,
        "status": "pending",
        "message": f"Crawl task created for {len(urls)} URLs"
    }

registry.register(ToolDefinition(
    name="crawl_batch",
    description="Crawl the content of multiple web pages in a batch. Use when several pages need to be collected.",
    parameters={
        "type": "object",
        "properties": {
            "urls": {
                "type": "array",
                "items": {"type": "string"},
                "description": "List of URLs to crawl"
            },
            "extract_type": {
                "type": "string",
                "enum": ["text", "links", "images", "structured"],
                "default": "text"
            },
            "parallel": {
                "type": "integer",
                "description": "Concurrency (default 3)",
                "default": 3
            }
        },
        "required": ["urls"]
    },
    handler=crawl_batch
))

2.2.4 Crawl Task Query Tool

# backend/tools/builtin/query_task.py

from ..registry import registry, ToolDefinition

def query_crawl_task(task_id: str) -> dict:
    """
    Query the status and results of a crawl task

    Args:
        task_id: task ID

    Returns:
        Task status and results
    """
    from ..crawler import crawl_manager
    return crawl_manager.get_task_status(task_id)

registry.register(ToolDefinition(
    name="query_crawl_task",
    description="Query the execution status and results of a crawl task.",
    parameters={
        "type": "object",
        "properties": {
            "task_id": {
                "type": "string",
                "description": "Task ID"
            }
        },
        "required": ["task_id"]
    },
    handler=query_crawl_task
))

2.3 Tool Call Flow

flowchart TD
    A[User message] --> B[Build message context + tool definitions]
    B --> C[Call GLM API<br/>with tools enabled]
    C --> D{Response type?}

    D -->|Plain text| E[Return to user]
    D -->|Tool call request| F[Execute tool call<br/>registry.execute]

    F --> G[Append tool result to message history]
    G --> H[Call GLM API again<br/>with tool results]
    H --> I[Return final reply]

    style A fill:#e1f5fe
    style E fill:#c8e6c9
    style I fill:#c8e6c9
    style F fill:#fff3e0
    style D fill:#fce4ec

2.4 Backend Implementation: Tool Call Handling

# backend/tools/executor.py

import json
from typing import Generator
from .registry import registry

class ToolExecutor:
    """工具调用执行器"""

    def __init__(self, api_url: str, api_key: str):
        self.api_url = api_url
        self.api_key = api_key

    def build_messages_with_tools(
        self, 
        messages: list[dict],
        tools: list[dict] | None = None
    ) -> dict:
        """构建带工具定义的请求体"""
        body = {
            "model": "glm-5",
            "messages": messages,
            "tools": tools or registry.list_all(),
            "tool_choice": "auto"
        }
        return body

    def process_tool_calls(
        self, 
        tool_calls: list[dict],
        messages: list[dict]
    ) -> list[dict]:
        """处理工具调用,返回工具结果消息"""
        results = []

        for call in tool_calls:
            tool_name = call["function"]["name"]
            tool_args = json.loads(call["function"]["arguments"])
            call_id = call["id"]

            try:
                # Execute the tool
                result = registry.execute(tool_name, tool_args)
                content = json.dumps(result, ensure_ascii=False)
            except Exception as e:
                content = json.dumps({
                    "error": True,
                    "message": str(e)
                }, ensure_ascii=False)

            # Append the tool-result message
            results.append({
                "role": "tool",
                "tool_call_id": call_id,
                "name": tool_name,
                "content": content
            })

        return results

    def chat_with_tools(
        self,
        messages: list[dict],
        model: str = "glm-5",
        max_iterations: int = 5,
        stream: bool = True
    ) -> Generator:
        """
        支持工具调用的对话补全

        Args:
            messages: 对话历史
            model: 模型名称
            max_iterations: 最大工具调用迭代次数
            stream: 是否流式输出

        Yields:
            SSE 格式的事件
        """
        import requests

        tools = registry.list_all()

        for iteration in range(max_iterations):
            # Call the model
            body = self.build_messages_with_tools(messages, tools)
            body["model"] = model
            body["stream"] = stream

            resp = requests.post(
                self.api_url,
                headers={
                    "Content-Type": "application/json",
                    "Authorization": f"Bearer {self.api_key}"
                },
                json=body,
                stream=stream,
                timeout=120
            )

            if stream:
                # Streaming mode
                tool_calls_buffer = {}
                full_content = ""

                for line in resp.iter_lines():
                    if not line:
                        continue
                    line = line.decode("utf-8")
                    if not line.startswith("data: "):
                        continue
                    data_str = line[6:]
                    if data_str == "[DONE]":
                        break

                    chunk = json.loads(data_str)
                    delta = chunk["choices"][0].get("delta", {})

                    # Accumulate tool-call deltas
                    if "tool_calls" in delta:
                        for tc in delta["tool_calls"]:
                            idx = tc.get("index", 0)
                            if idx not in tool_calls_buffer:
                                tool_calls_buffer[idx] = {
                                    "id": tc.get("id", ""),
                                    "type": "function",
                                    "function": {"name": "", "arguments": ""}
                                }
                            if tc.get("id"):
                                tool_calls_buffer[idx]["id"] = tc["id"]
                            if "function" in tc:
                                if tc["function"].get("name"):
                                    tool_calls_buffer[idx]["function"]["name"] = tc["function"]["name"]
                                if tc["function"].get("arguments"):
                                    tool_calls_buffer[idx]["function"]["arguments"] += tc["function"]["arguments"]

                    # Handle text content
                    if "content" in delta and delta["content"]:
                        full_content += delta["content"]
                        yield f"event: message\ndata: {json.dumps({'content': delta['content']}, ensure_ascii=False)}\n\n"

                # Check whether any tool calls were made
                if tool_calls_buffer:
                    tool_calls = list(tool_calls_buffer.values())

                    # Emit a tool_call event (for the frontend to display)
                    yield f"event: tool_call\ndata: {json.dumps({'calls': tool_calls}, ensure_ascii=False)}\n\n"

                    # Append the assistant message to the history
                    messages.append({
                        "role": "assistant",
                        "content": full_content or None,
                        "tool_calls": tool_calls
                    })

                    # Execute the tool calls
                    tool_results = self.process_tool_calls(tool_calls, messages)

                    # Emit a tool_result event
                    yield f"event: tool_result\ndata: {json.dumps({'results': tool_results}, ensure_ascii=False)}\n\n"

                    # Append the tool results to the message history
                    messages.extend(tool_results)

                    # Continue with the next round of conversation
                    continue

                # No tool calls; finish
                yield f"event: done\ndata: {json.dumps({})}\n\n"
                return

            else:
                # Non-streaming mode
                result = resp.json()
                choice = result["choices"][0]
                message = choice["message"]

                if "tool_calls" not in message:
                    # No tool calls; return the final message directly
                    yield f"event: done\ndata: {json.dumps({'message': message}, ensure_ascii=False)}\n\n"
                    return

                # Tool calls present
                tool_calls = message["tool_calls"]

                # Append the assistant message to the history
                messages.append(message)

                # Execute the tools
                tool_results = self.process_tool_calls(tool_calls, messages)
                messages.extend(tool_results)

                # Continue with the next round
                continue

        # The iteration limit was reached without a final answer; close the stream
        yield f"event: done\ndata: {json.dumps({'warning': 'max_iterations reached'}, ensure_ascii=False)}\n\n"

3. Crawler Pipeline Design

3.1 Crawler Service Architecture

flowchart TB
    subgraph CrawlerService["🕷️ Crawler Service"]
        subgraph Engines["Core Engines"]
            Search["Search Engine"]
            Fetcher["Fetcher Engine"]
            TaskMgr["Task Manager"]
        end

        subgraph Pipeline["Content Pipeline"]
            Parser["Parser"] --> Cleaner["Cleaner"]
            Cleaner --> Extractor["Extractor"]
            Extractor --> Structurer["Structurer"]
        end

        subgraph Storage["Storage Layer"]
            Cache["Cache<br/>(Redis)"]
            DB["DB<br/>(MySQL)"]
            FileStore["File Storage<br/>(local/OSS)"]
        end

        Search --> Pipeline
        Fetcher --> Pipeline
        TaskMgr --> Pipeline
        Pipeline --> Storage
    end

    style Parser fill:#e3f2fd
    style Cleaner fill:#e8f5e9
    style Extractor fill:#fff3e0
    style Structurer fill:#fce4ec
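
The Parser → Cleaner → Extractor → Structurer stages in the diagram are not spelled out elsewhere in this document; the sketch below shows one way they could be chained. The module name and function signatures are assumptions for illustration only:

# Hypothetical backend/crawler/pipeline.py: a minimal content pipeline sketch
from bs4 import BeautifulSoup

def parse(html: str) -> BeautifulSoup:
    # Parser: turn raw HTML into a DOM tree
    return BeautifulSoup(html, "html.parser")

def clean(soup: BeautifulSoup) -> BeautifulSoup:
    # Cleaner: strip tags that rarely carry useful content
    for tag in soup(["script", "style", "nav", "footer", "header"]):
        tag.decompose()
    return soup

def extract(soup: BeautifulSoup) -> dict:
    # Extractor: pull out the pieces downstream consumers care about
    return {
        "title": soup.title.string if soup.title and soup.title.string else "",
        "text": soup.get_text(separator="\n", strip=True),
    }

def structure(raw: dict) -> dict:
    # Structurer: normalize into the shape handed to the storage layer
    return {"title": raw["title"].strip(), "text": raw["text"][:10000]}

def run_pipeline(html: str) -> dict:
    # Chain the four stages in order
    return structure(extract(clean(parse(html))))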

3.2 Core Module Design

3.2.1 Search Service

# backend/crawler/search.py

from dataclasses import dataclass
from typing import Protocol
import asyncio

@dataclass
class SearchResult:
    title: str
    url: str
    snippet: str
    source: str

class SearchEngine(Protocol):
    """搜索引擎协议"""
    async def search(self, query: str, max_results: int) -> list[SearchResult]:
        ...

class DuckDuckGoSearch:
    """DuckDuckGo 搜索实现"""

    async def search(self, query: str, max_results: int = 5) -> list[SearchResult]:
        from duckduckgo_search import DDGS

        results = []
        with DDGS() as ddgs:
            for r in ddgs.text(query, max_results=max_results):
                results.append(SearchResult(
                    title=r.get("title", ""),
                    url=r.get("href", ""),
                    snippet=r.get("body", ""),
                    source="duckduckgo"
                ))
        return results

class SearchService:
    """搜索服务"""

    def __init__(self, engine: SearchEngine | None = None):
        self.engine = engine or DuckDuckGoSearch()

    def search(self, query: str, max_results: int = 5) -> list[dict]:
        """同步搜索接口"""
        return asyncio.run(self._search_async(query, max_results))

    async def _search_async(self, query: str, max_results: int) -> list[dict]:
        results = await self.engine.search(query, max_results)
        return [
            {
                "title": r.title,
                "url": r.url,
                "snippet": r.snippet,
                "source": r.source
            }
            for r in results
        ]
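
Because SearchService depends only on the SearchEngine protocol, a different backend can be swapped in without changing the service. A hypothetical stub engine (useful in tests; not part of the codebase) illustrates this:

# Illustrative only: any object with a matching async search() satisfies the protocol
class StaticSearch:
    def __init__(self, canned: list[SearchResult]):
        self.canned = canned

    async def search(self, query: str, max_results: int = 5) -> list[SearchResult]:
        # Ignore the query and return pre-baked results
        return self.canned[:max_results]

service = SearchService(engine=StaticSearch([
    SearchResult(title="Example", url="https://example.com", snippet="stub", source="static")
]))
print(service.search("anything", max_results=1))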

3.2.2 Page Fetch Service

# backend/crawler/fetcher.py

import asyncio
from dataclasses import dataclass
from typing import Literal
from bs4 import BeautifulSoup
import httpx
from urllib.parse import urljoin, urlparse

@dataclass
class FetchResult:
    url: str
    status: int
    content: dict
    metadata: dict

class FetchService:
    """网页抓取服务"""

    def __init__(
        self,
        timeout: float = 30.0,
        max_retries: int = 2,
        user_agent: str = "Mozilla/5.0 (compatible; NanoClawBot/1.0)"
    ):
        self.timeout = timeout
        self.max_retries = max_retries
        self.user_agent = user_agent

    async def fetch_async(
        self, 
        url: str, 
        extract_type: Literal["text", "links", "images", "structured"] = "text"
    ) -> FetchResult:
        """异步抓取网页"""
        headers = {"User-Agent": self.user_agent}

        async with httpx.AsyncClient(timeout=self.timeout) as client:
            for attempt in range(self.max_retries + 1):
                try:
                    resp = await client.get(url, headers=headers, follow_redirects=True)
                    resp.raise_for_status()
                    break
                except httpx.HTTPError as e:
                    if attempt == self.max_retries:
                        return FetchResult(
                            url=url,
                            status=500,
                            content={"error": str(e)},
                            metadata={}
                        )
                    await asyncio.sleep(1 * (attempt + 1))

        # Parse the response body
        soup = BeautifulSoup(resp.text, "html.parser")
        content = self._extract(soup, url, extract_type)

        metadata = {
            "title": soup.title.string if soup.title else "",
            "status_code": resp.status_code,
            "content_type": resp.headers.get("content-type", ""),
            "final_url": str(resp.url)
        }

        return FetchResult(url=url, status=resp.status_code, content=content, metadata=metadata)

    def _extract(self, soup: BeautifulSoup, base_url: str, extract_type: str) -> dict:
        """提取内容"""
        if extract_type == "text":
            # 移除脚本和样式
            for tag in soup(["script", "style", "nav", "footer", "header"]):
                tag.decompose()
            text = soup.get_text(separator="\n", strip=True)
            return {"text": text[:10000]}  # 限制长度

        elif extract_type == "links":
            links = []
            for a in soup.find_all("a", href=True):
                href = urljoin(base_url, a["href"])
                if urlparse(href).scheme in ("http", "https"):
                    links.append({
                        "text": a.get_text(strip=True),
                        "url": href
                    })
            return {"links": links[:100]}

        elif extract_type == "images":
            images = []
            for img in soup.find_all("img", src=True):
                src = urljoin(base_url, img["src"])
                images.append({
                    "alt": img.get("alt", ""),
                    "src": src
                })
            return {"images": images[:50]}

        elif extract_type == "structured":
            # Extract structured data
            structured = {
                "title": soup.title.string if soup.title else "",
                "meta": {},
                "headings": [],
                "paragraphs": []
            }

            # Meta information
            for meta in soup.find_all("meta"):
                name = meta.get("name") or meta.get("property", "")
                if name:
                    structured["meta"][name] = meta.get("content", "")

            # Headings
            for i in range(1, 7):
                for h in soup.find_all(f"h{i}"):
                    structured["headings"].append({
                        "level": i,
                        "text": h.get_text(strip=True)
                    })

            # Paragraphs
            for p in soup.find_all("p"):
                text = p.get_text(strip=True)
                if len(text) > 20:
                    structured["paragraphs"].append(text)

            return {"structured": structured}

        return {}

    def fetch(self, url: str, extract_type: str = "text") -> dict:
        """同步抓取接口"""
        result = asyncio.run(self.fetch_async(url, extract_type))
        return {
            "success": result.status == 200,
            "url": result.url,
            "content": result.content,
            "metadata": result.metadata
        }

3.2.3 Task Manager

# backend/crawler/task_manager.py

import asyncio
import uuid
from datetime import datetime
from typing import Literal
from dataclasses import dataclass, field
from enum import Enum
from concurrent.futures import ThreadPoolExecutor

class TaskStatus(Enum):
    PENDING = "pending"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"

@dataclass
class CrawlTask:
    id: str
    urls: list[str]
    extract_type: str
    parallel: int
    status: TaskStatus = TaskStatus.PENDING
    progress: int = 0
    total: int = 0
    results: list[dict] = field(default_factory=list)
    errors: list[dict] = field(default_factory=list)
    created_at: datetime = field(default_factory=datetime.utcnow)
    completed_at: datetime | None = None

class CrawlTaskManager:
    """爬虫任务管理器"""

    def __init__(self, max_workers: int = 3):
        self.tasks: dict[str, CrawlTask] = {}
        self.max_workers = max_workers
        self.executor = ThreadPoolExecutor(max_workers=max_workers)
        self._fetch_service = None

    @property
    def fetch_service(self):
        if self._fetch_service is None:
            from .fetcher import FetchService
            self._fetch_service = FetchService()
        return self._fetch_service

    def create_task(
        self,
        urls: list[str],
        extract_type: Literal["text", "links", "images", "structured"] = "text",
        parallel: int = 3
    ) -> str:
        """创建爬虫任务"""
        task_id = str(uuid.uuid4())[:8]
        task = CrawlTask(
            id=task_id,
            urls=urls,
            extract_type=extract_type,
            parallel=min(parallel, self.max_workers),
            total=len(urls)
        )
        self.tasks[task_id] = task

        # Run asynchronously in the worker pool
        self.executor.submit(self._execute_task, task_id)

        return task_id

    def _execute_task(self, task_id: str):
        """执行爬虫任务"""
        task = self.tasks.get(task_id)
        if not task:
            return

        task.status = TaskStatus.RUNNING

        async def run():
            semaphore = asyncio.Semaphore(task.parallel)

            async def fetch_one(url: str):
                async with semaphore:
                    try:
                        result = await self.fetch_service.fetch_async(url, task.extract_type)
                        return {"url": url, "data": result}
                    except Exception as e:
                        return {"url": url, "error": str(e)}

            tasks = [fetch_one(url) for url in task.urls]
            results = await asyncio.gather(*tasks)

            for r in results:
                task.progress += 1
                if "error" in r:
                    task.errors.append(r)
                else:
                    task.results.append(r)

        try:
            asyncio.run(run())
            task.status = TaskStatus.COMPLETED
        except Exception as e:
            task.status = TaskStatus.FAILED
            task.errors.append({"error": str(e)})
        finally:
            task.completed_at = datetime.utcnow()

    def get_task_status(self, task_id: str) -> dict:
        """获取任务状态"""
        task = self.tasks.get(task_id)
        if not task:
            return {"error": "Task not found"}

        return {
            "id": task.id,
            "status": task.status.value,
            "progress": task.progress,
            "total": task.total,
            "results": task.results if task.status == TaskStatus.COMPLETED else [],
            "errors": task.errors,
            "created_at": task.created_at.isoformat(),
            "completed_at": task.completed_at.isoformat() if task.completed_at else None
        }

# Global task manager
crawl_manager = CrawlTaskManager()

3.3 Data Model Extensions

# backend/models.py: new models

class CrawlTaskRecord(db.Model):
    """爬虫任务记录(持久化)"""
    __tablename__ = "crawl_tasks"

    id = db.Column(db.String(32), primary_key=True)
    user_id = db.Column(db.BigInteger, db.ForeignKey("users.id"))
    conversation_id = db.Column(db.String(64), db.ForeignKey("conversations.id"))
    urls = db.Column(db.JSON)  # list of URLs
    extract_type = db.Column(db.String(32))
    status = db.Column(db.String(16), default="pending")
    result_count = db.Column(db.Integer, default=0)
    error_count = db.Column(db.Integer, default=0)
    created_at = db.Column(db.DateTime, default=datetime.utcnow)
    completed_at = db.Column(db.DateTime)


class CrawlResult(db.Model):
    """爬虫结果"""
    __tablename__ = "crawl_results"

    id = db.Column(db.BigInteger, primary_key=True, autoincrement=True)
    task_id = db.Column(db.String(32), db.ForeignKey("crawl_tasks.id"))
    url = db.Column(db.String(1024))
    content = db.Column(db.JSON)  # extracted content
    # "metadata" is a reserved attribute name on SQLAlchemy declarative models, so map the column under a different attribute
    page_metadata = db.Column("metadata", db.JSON)
    status_code = db.Column(db.Integer)
    created_at = db.Column(db.DateTime, default=datetime.utcnow)
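
The CrawlTaskManager above keeps tasks in memory only; below is a hedged sketch of how a finished task could be copied into these tables. The helper function and its import paths are assumptions, and it expects to run inside a Flask application context:

# Hypothetical persistence helper (not existing code)
from backend.models import db, CrawlTaskRecord, CrawlResult  # assumed import path

def persist_task(task, user_id=None, conversation_id=None) -> None:
    """Copy a finished in-memory CrawlTask into crawl_tasks / crawl_results."""
    db.session.add(CrawlTaskRecord(
        id=task.id,
        user_id=user_id,
        conversation_id=conversation_id,
        urls=task.urls,
        extract_type=task.extract_type,
        status=task.status.value,
        result_count=len(task.results),
        error_count=len(task.errors),
        created_at=task.created_at,
        completed_at=task.completed_at,
    ))
    for r in task.results:
        fetch_result = r.get("data")  # a FetchResult from FetchService.fetch_async
        db.session.add(CrawlResult(
            task_id=task.id,
            url=r["url"],
            content=getattr(fetch_result, "content", None),
            page_metadata=getattr(fetch_result, "metadata", None),
            status_code=getattr(fetch_result, "status", None),
        ))
    db.session.commit()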

4. API Design

4.1 Tool APIs

List available tools

GET /api/tools

Response:

{
  "code": 0,
  "data": {
    "tools": [
      {
        "name": "web_search",
        "description": "搜索互联网获取实时信息",
        "parameters": { ... }
      }
    ]
  }
}
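
A minimal Flask route sketch for this endpoint (the blueprint wiring and the code/data envelope follow the response example above; treat the module path as an assumption):

# Hypothetical backend/api/tools.py
from flask import Blueprint, jsonify
from backend.tools.registry import registry  # assumed import path

tools_bp = Blueprint("tools", __name__)

@tools_bp.get("/api/tools")
def list_tools():
    # Flatten the OpenAI-style definitions into the documented response shape
    tools = [
        {
            "name": t["function"]["name"],
            "description": t["function"]["description"],
            "parameters": t["function"]["parameters"],
        }
        for t in registry.list_all()
    ]
    return jsonify({"code": 0, "data": {"tools": tools}})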

4.2 Crawler APIs

Create a crawl task

POST /api/crawler/tasks

Request body:

{
  "urls": ["https://example.com/page1", "https://example.com/page2"],
  "extract_type": "text",
  "parallel": 3
}

Response:

{
  "code": 0,
  "data": {
    "task_id": "abc12345",
    "status": "pending",
    "total": 2
  }
}

Query task status

GET /api/crawler/tasks/:task_id

Response:

{
  "code": 0,
  "data": {
    "id": "abc12345",
    "status": "completed",
    "progress": 2,
    "total": 2,
    "results": [
      {
        "url": "https://example.com/page1",
        "data": { "content": { "text": "..." }, "metadata": { ... } }
      }
    ]
  }
}

List tasks

GET /api/crawler/tasks?status=completed&limit=20
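
A sketch of the creation and status endpoints wired to the CrawlTaskManager (module path and error envelope are assumptions; the listing endpoint would follow the same pattern):

# Hypothetical backend/api/crawler.py
from flask import Blueprint, jsonify, request
from backend.crawler.task_manager import crawl_manager  # assumed import path

crawler_bp = Blueprint("crawler", __name__)

@crawler_bp.post("/api/crawler/tasks")
def create_task():
    payload = request.get_json(force=True) or {}
    urls = payload.get("urls", [])
    if not urls:
        return jsonify({"code": 1, "message": "urls is required"}), 400
    task_id = crawl_manager.create_task(
        urls=urls,
        extract_type=payload.get("extract_type", "text"),
        parallel=payload.get("parallel", 3),
    )
    return jsonify({"code": 0, "data": {"task_id": task_id, "status": "pending", "total": len(urls)}})

@crawler_bp.get("/api/crawler/tasks/<task_id>")
def get_task(task_id: str):
    status = crawl_manager.get_task_status(task_id)
    if "error" in status:
        return jsonify({"code": 1, "message": status["error"]}), 404
    return jsonify({"code": 0, "data": status})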

5. Frontend Integration

5.1 Tool Call UI Component

<!-- frontend/src/components/ToolCallDisplay.vue -->
<template>
  <div class="tool-call">
    <div class="tool-header">
      <span class="tool-icon">🔧</span>
      <span class="tool-name">{{ call.name }}</span>
      <span :class="['tool-status', status]">{{ statusText }}</span>
    </div>
    <div v-if="expanded" class="tool-details">
      <div class="tool-args">
        <strong>Arguments:</strong>
        <pre>{{ formattedArgs }}</pre>
      </div>
      <div v-if="result" class="tool-result">
        <strong>Result:</strong>
        <pre>{{ formattedResult }}</pre>
      </div>
    </div>
    <button class="expand-btn" @click="expanded = !expanded">
      {{ expanded ? 'Collapse' : 'Expand' }}
    </button>
  </div>
</template>

<script setup>
import { ref, computed } from 'vue'

const props = defineProps({
  call: Object,
  result: Object,
  status: { type: String, default: 'pending' } // pending, running, completed, error
})

const expanded = ref(false)

const statusText = computed(() => ({
  pending: 'Pending',
  running: 'Running',
  completed: 'Completed',
  error: 'Failed'
}[props.status]))

const formattedArgs = computed(() => 
  JSON.stringify(props.call.arguments, null, 2)
)

const formattedResult = computed(() => 
  JSON.stringify(props.result, null, 2)
)
</script>

5.2 SSE Event Extensions

The message API's SSE stream is extended with tool-call related events:

| Event | Description |
| --- | --- |
| tool_call | The model has issued a tool call |
| tool_result | Result of a tool execution |
| thinking | Chain-of-thought content |
| message | A chunk of reply content |
| done | Stream finished |

6. Configuration and Deployment

6.1 Configuration File Extensions

# config.yml

# ... existing configuration ...

# Crawler settings
crawler:
  max_workers: 5
  timeout: 30
  max_retries: 2
  user_agent: "Mozilla/5.0 (compatible; NanoClawBot/1.0)"

# Tool settings
tools:
  enabled:
    - web_search
    - fetch_page
    - crawl_batch
    - query_crawl_task
  max_iterations: 5  # maximum number of tool-call iterations
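
A sketch of how the tools.enabled list might be applied when building the tool list for a request (loading config.yml via PyYAML is an assumption, as is the helper name):

# Illustrative only: filter the registry by the enabled list in config.yml
import yaml
from backend.tools.registry import registry  # assumed import path

with open("config.yml", encoding="utf-8") as f:
    config = yaml.safe_load(f)

enabled = set(config.get("tools", {}).get("enabled", []))

def enabled_tools() -> list[dict]:
    # Only expose tools that are switched on in config.yml
    return [t for t in registry.list_all() if t["function"]["name"] in enabled]

# e.g. executor.build_messages_with_tools(messages, tools=enabled_tools())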

6.2 Installing Dependencies

# pyproject.toml: new dependencies

dependencies = [
    # ... existing dependencies ...
    "duckduckgo-search>=4.0.0",
    "beautifulsoup4>=4.12.0",
    "httpx>=0.25.0",
    "lxml>=4.9.0",
]

7. Usage Examples

7.1 Example Conversation

User: Search for recent important news in the AI field.

Assistant: [calls the web_search tool]
      query: "AI artificial intelligence latest news 2024"
      → search results returned

Assistant: Based on the search results, here is the recent major AI news:

1. **OpenAI releases GPT-5** - [link]
   OpenAI has officially released its next-generation model GPT-5...

2. **Google releases Gemini 2.0** - [link]
   Google has announced Gemini 2.0...

---

User: Please fetch the content of these links for me.

Assistant: [calls the crawl_batch tool]
      urls: ["https://...", "https://..."]
      → task ID returned

Assistant: A crawl task has been created; fetching 2 pages...

Assistant: [automatically calls query_crawl_task to check the results]

Assistant: Fetching complete! Here is a summary of the content:

**Article 1: OpenAI releases GPT-5**
> Key points: the new model improves reasoning ability by 50%...

**Article 2: Google releases Gemini 2.0**
> Key points: multimodal capabilities are greatly enhanced...

7.2 API Call Example

# Create a conversation with tool calling enabled
import requests

# Create a conversation
resp = requests.post("http://localhost:3000/api/conversations", json={
    "title": "AI news research",
    "model": "glm-5"
})
conv_id = resp.json()["data"]["id"]

# Send a message (tool calls are triggered automatically)
resp = requests.post(
    f"http://localhost:3000/api/conversations/{conv_id}/messages",
    json={"content": "Search for the latest AI news", "stream": True},
    stream=True
)

# Handle SSE events
for line in resp.iter_lines():
    # event: tool_call
    # event: tool_result
    # event: message
    # event: done
    pass

8. Security and Limits

8.1 Security Measures

  1. URL allowlist/blocklist: restrict which domains may be crawled (see the sketch after this list)
  2. Rate limiting: control request frequency to avoid getting blocked (also sketched below)
  3. Content filtering: filter out sensitive content
  4. User isolation: tasks are isolated per user
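
A sketch of the first two measures, a domain allowlist/blocklist check plus a simple sliding-window rate limiter (module name, domain lists, and limits are illustrative assumptions):

# Hypothetical backend/crawler/safety.py
import time
from collections import deque
from urllib.parse import urlparse

BLOCKED_DOMAINS = {"localhost", "127.0.0.1"}   # never crawl internal hosts
ALLOWED_DOMAINS: set[str] | None = None        # None means "any public domain"

def check_url(url: str) -> None:
    """Raise ValueError if the URL must not be crawled."""
    host = urlparse(url).hostname or ""
    if host in BLOCKED_DOMAINS:
        raise ValueError(f"Domain is blocked: {host}")
    if ALLOWED_DOMAINS is not None and host not in ALLOWED_DOMAINS:
        raise ValueError(f"Domain is not on the allowlist: {host}")

class RateLimiter:
    """Allow at most max_calls within a sliding window of `window` seconds."""

    def __init__(self, max_calls: int, window: float = 60.0):
        self.max_calls = max_calls
        self.window = window
        self.calls: deque[float] = deque()

    def acquire(self) -> None:
        now = time.monotonic()
        while self.calls and now - self.calls[0] > self.window:
            self.calls.popleft()
        if len(self.calls) >= self.max_calls:
            raise RuntimeError("Rate limit exceeded")
        self.calls.append(now)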

8.2 Usage Limits

# backend/tools/limits.py

TOOL_LIMITS = {
    "web_search": {
        "max_results": 10,
        "rate_limit": "10/minute"
    },
    "fetch_page": {
        "max_content_size": 1024 * 1024,  # 1MB
        "timeout": 30
    },
    "crawl_batch": {
        "max_urls": 50,
        "parallel_max": 5
    }
}
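
The limits above are declarative; one way to enforce them is a thin wrapper in front of registry.execute. The wrapper below is an assumption for illustration, not existing code:

# Illustrative enforcement wrapper around registry.execute
from backend.tools.limits import TOOL_LIMITS      # assumed import path
from backend.tools.registry import registry

def execute_with_limits(name: str, arguments: dict):
    limits = TOOL_LIMITS.get(name, {})
    # Clamp or reject arguments before dispatching the call
    if name == "web_search" and "max_results" in limits:
        arguments["max_results"] = min(arguments.get("max_results", 5), limits["max_results"])
    if name == "crawl_batch":
        if "max_urls" in limits and len(arguments.get("urls", [])) > limits["max_urls"]:
            raise ValueError(f"Too many URLs (max {limits['max_urls']})")
        if "parallel_max" in limits:
            arguments["parallel"] = min(arguments.get("parallel", 3), limits["parallel_max"])
    return registry.execute(name, arguments)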

9. Future Extensions

  1. More tool types

    • Data analysis tools (chart generation, statistics)
    • File processing tools (PDF parsing, Excel handling)
    • Code execution tools (run code in a secure sandbox)
  2. Crawler enhancements

    • JavaScript rendering (Playwright/Selenium)
    • Proxy pool support
    • Distributed crawling
  3. Smart scheduling

    • Tool recommendation based on conversation context
    • Composed tool-chain execution
    • Asynchronous task notifications

10. Summary

This design document has described the crawler pipeline and tool system architecture for NanoClaw:

  1. Tool system: OpenAI-compatible tool definitions managed through a tool registry, supporting automatic tool calling by GLM models.

  2. Crawler pipeline: a search service, a fetch service, and a task manager that support single-page fetches and batch tasks, with multiple content extraction modes.

  3. API design: the existing API is extended with SSE tool-call events, and new crawler task management endpoints are added.

  4. Frontend integration: a tool-call visualization component shows tool execution in real time.

With this architecture, NanoClaw is no longer limited by the model's knowledge cutoff: it can pull in real-time information from the web, which greatly broadens its range of applications.