Llm Evaluator — 技能 — openclaw中文资讯站

技能详情（站内镜像，无评论）

LLM-as-a-Judge evaluation system using Langfuse. Score AI outputs on relevance, accuracy, hallucination, and helpfulness. Backfill scoring on historical trac...

开发与 DevOps

许可证：MIT-0

MIT-0 ·免费使用、修改和重新分发。无需归因。

版本：v1.0.0

统计：⭐ 0 · 189 · 0 current installs · 0 all-time installs

⭐ 0

安装量（当前） 0

🛡 VirusTotal ：可疑 · OpenClaw ：可疑

Package：aiwithabidi/llm-evaluator

安全扫描（ClawHub）

VirusTotal ：可疑
OpenClaw ：可疑

OpenClaw 评估

The skill mostly fits its stated purpose (evaluating Langfuse traces via OpenRouter) but contains unexpected behaviors—hardcoded Langfuse API keys in code and an undeclared read of a user-local .env file—that don't align with the declared requirements and deserve careful review before use.

目的

Name/description (Langfuse + OpenRouter) matches the script's behavior: it evaluates traces and posts scores to Langfuse using an OpenRouter-backed judge model. However, the code embeds LANGFUSE_SECRET_KEY and LANGFUSE_PUBLIC_KEY values and a LANGFUSE_HOST that are not declared in requires.env or SKILL.md; shipping hardcoded service credentials is unexpected and disproportionate to the stated purpose.

说明范围

SKILL.md directs running the included Python script, which is expected. The script also attempts to read a user-local file (~/.openclaw/workspace/.env) to find OPENROUTER_API_KEY if the env var isn't set; this config-file access is not declared in requires.config_paths and is an additional data access surface that users should be aware of.

安装机制

No install spec (instruction-only with an included script). That keeps install risk low — nothing is downloaded or executed automatically beyond running the bundled Python script.

证书

The registry declares only OPENROUTER_API_KEY as required (which is appropriate). But the code embeds Langfuse public/secret keys and a Langfuse host URL; these are effectively credentials baked into the skill rather than requested from the environment. Also the script will make network calls to Langfuse and OpenRouter, which is expected but worth noting.

持久

The skill does not request always:true and does not request system-wide persistence. It runs network operations and writes scores to Langfuse, which is consistent with its purpose and not an unusual privilege level.

安装（复制给龙虾 AI）

将下方整段复制到龙虾中文库对话中，由龙虾按 SKILL.md 完成安装。

请把本段交给龙虾中文库（龙虾 AI）执行：为本机安装 OpenClaw 技能「Llm Evaluator」。简介：LLM-as-a-Judge evaluation system using Langfuse. Score AI outputs on relevance,…。
请 fetch 以下地址读取 SKILL.md 并按文档完成安装：https://raw.githubusercontent.com/openclaw/skills/refs/heads/main/skills/aiwithabidi/llm-evaluator/SKILL.md
（来源：yingzhi8.cn 技能库）

SKILL.md

打开原始 SKILL.md（GitHub raw）

---
name: llm-evaluator
description: LLM-as-a-Judge evaluation system using Langfuse. Score AI outputs on relevance, accuracy, hallucination, and helpfulness. Backfill scoring on historical traces. Uses GPT-5-nano for cost-efficient judging. Use when evaluating AI quality, building evals, or monitoring output accuracy.
homepage: https://www.agxntsix.ai
license: MIT
compatibility: Python 3.10+, Langfuse instance, OpenRouter API key
metadata: {"openclaw": {"emoji": "u2696ufe0f", "requires": {"env": ["OPENROUTER_API_KEY"]}, "primaryEnv": "OPENROUTER_API_KEY", "homepage": "https://www.agxntsix.ai"}}
---

# LLM Evaluator ⚖️

LLM-as-a-Judge evaluation system powered by Langfuse. Uses GPT-5-nano to score AI outputs.

## When to Use

- Evaluating quality of search results or AI responses
- Scoring traces for relevance, accuracy, hallucination detection
- Batch scoring recent unscored traces
- Quality assurance on agent outputs

## Usage

```bash
# Test with sample cases
python3 {baseDir}/scripts/evaluator.py test

# Score a specific Langfuse trace
python3 {baseDir}/scripts/evaluator.py score <trace_id>

# Score with specific evaluator only
python3 {baseDir}/scripts/evaluator.py score <trace_id> --evaluators relevance

# Backfill scores on recent unscored traces
python3 {baseDir}/scripts/evaluator.py backfill --limit 20
```

## Evaluators

| Evaluator | Measures | Scale |
|-----------|----------|-------|
| relevance | Response relevance to query | 0–1 |
| accuracy | Factual correctness | 0–1 |
| hallucination | Made-up information detection | 0–1 |
| helpfulness | Overall usefulness | 0–1 |

## Credits
Built by [M. Abidi](https://www.linkedin.com/in/mohammad-ali-abidi) | [agxntsix.ai](https://www.agxntsix.ai)
[YouTube](https://youtube.com/@aiwithabidi) | [GitHub](https://github.com/aiwithabidi)
Part of the **AgxntSix Skill Suite** for OpenClaw agents.

📅 **Need help setting up OpenClaw for your business?** [Book a free consultation](https://cal.com/agxntsix/abidi-openclaw)