LLM Evaluator Pro

LLM-as-a-Judge evaluator via Langfuse. Scores traces on relevance, accuracy, hallucination, and helpfulness using GPT-5-nano as judge. Supports single trace...

Development & DevOps

License: MIT-0

MIT-0 · Free to use, modify, and redistribute. No attribution required.

Version: v1.0.0

Stats: ⭐ 1 · 532 · 0 current installs · 0 all-time installs

🛡 VirusTotal: Suspicious · OpenClaw: Suspicious

Package: aiwithabidi/llm-evaluator-pro

Security scan (ClawHub)

  • VirusTotal: Suspicious
  • OpenClaw: Suspicious

OpenClaw Evaluation

The skill's purpose (scoring Langfuse traces using an OpenRouter judge) mostly matches what the code does, but there are multiple coherence and data-exfiltration risks — notably hardcoded Langfuse credentials, reading an undeclared .env file, and missing dependency/install guidance.

Purpose

The name and description match the code: it uses OpenRouter (a GPT judge) and Langfuse to score traces. Requesting OPENROUTER_API_KEY and the Langfuse keys is consistent with the described function. However, the code contains hardcoded Langfuse keys and host values, which undermines the declared requirement model: the skill claims to require env vars but will fall back to embedded credentials.

Instruction Scope

SKILL.md instructs running the included Python script. The script, however, attempts to read ~/.openclaw/workspace/.env for the OpenRouter key (a config path not declared in metadata) and uses hardcoded Langfuse credentials and host to call the Langfuse API. Reading an undeclared workspace .env can expose other secrets, and always posting scores to a hardcoded Langfuse endpoint (with embedded keys) could transmit data to an unexpected third-party account.
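The risky lookup order described above can be sketched as follows. This is a hypothetical reconstruction of the pattern, not the script's actual code; the function names and placeholder key values are illustrative.

```python
import os
from pathlib import Path

# Embedded defaults like these are the red flag: they win whenever the
# declared env vars are absent, so data flows to an account the user
# never configured. Values are placeholders, not the skill's real keys.
HARDCODED_PUBLIC = "pk-lf-PLACEHOLDER"
HARDCODED_SECRET = "sk-lf-PLACEHOLDER"

def load_openrouter_key():
    # 1. Prefer the declared env var.
    key = os.environ.get("OPENROUTER_API_KEY")
    if key:
        return key
    # 2. Fall back to an UNDECLARED config file -- the risky step:
    #    parsing a shared workspace .env can read unrelated secrets.
    env_file = Path.home() / ".openclaw" / "workspace" / ".env"
    if env_file.exists():
        for line in env_file.read_text().splitlines():
            if line.startswith("OPENROUTER_API_KEY="):
                return line.split("=", 1)[1].strip()
    return None

def load_langfuse_keys():
    # Env vars are consulted, but the hardcoded values act as defaults.
    return (
        os.environ.get("LANGFUSE_PUBLIC_KEY", HARDCODED_PUBLIC),
        os.environ.get("LANGFUSE_SECRET_KEY", HARDCODED_SECRET),
    )
```

A conforming skill would do neither step 2 nor the defaulting: missing credentials should be a hard error, not a silent fallback.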

Install Mechanism

There is no install spec. The skill includes a Python script but does not declare its Python package dependencies (requests, openai, langfuse), so the script may fail out of the box. The missing install step also lowers installation auditability; that is not itself malicious, but it increases risk because it is unclear which packages users will end up installing to run it.
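A declared dependency list plus a small preflight check would close this gap. A sketch, where the package names are the ones the review identifies and the helper names are mine:

```python
import importlib.util

# Packages the review says the script imports but SKILL.md never declares.
REQUIRED = ("requests", "openai", "langfuse")

def missing_packages(packages=REQUIRED):
    """Return the subset of `packages` not importable in this environment."""
    return [p for p in packages if importlib.util.find_spec(p) is None]

def install_hint(packages=REQUIRED):
    """Produce an actionable message instead of a mid-run ImportError."""
    gaps = missing_packages(packages)
    if not gaps:
        return "all dependencies present"
    return "pip install " + " ".join(gaps)
```

Running such a check at startup turns a cryptic traceback into a one-line fix, and declaring the same list in the skill metadata would make the install auditable.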

Credentials

Declared env vars (OPENROUTER_API_KEY, LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY) are appropriate for the stated purpose. However the script: (1) sets default LANGFUSE keys in code, (2) hardcodes LF_AUTH and LF_API values rather than reading the environment, and (3) attempts to parse ~/.openclaw/workspace/.env if OPENROUTER_API_KEY is not set. These behaviors mean the skill can use embedded credentials and read an undeclared local .env file, which undermines the declared credential model.
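By contrast, a script that honors the declared credential model reads only the listed variables and fails fast. A minimal sketch of that pattern (not the skill's code):

```python
import os

# Exactly the vars SKILL.md declares -- nothing embedded, no file fallback.
REQUIRED_VARS = ("OPENROUTER_API_KEY", "LANGFUSE_PUBLIC_KEY", "LANGFUSE_SECRET_KEY")

def load_credentials(env=None):
    """Read credentials strictly from the declared env vars, failing fast."""
    env = os.environ if env is None else env
    missing = [v for v in REQUIRED_VARS if not env.get(v)]
    if missing:
        raise RuntimeError("missing required env vars: " + ", ".join(missing))
    return {v: env[v] for v in REQUIRED_VARS}
```

With this shape, an unset key is an immediate, visible error rather than a silent switch to embedded credentials or an undeclared .env read.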

Persistence

The skill is not force-included (always=false) and does not request persistent platform privileges. It does not attempt to modify other skills or global agent configuration. Autonomy is enabled by default but is not an additional red flag here.

Install (copy to 龙虾 AI)

Copy the entire block below into a 龙虾中文库 conversation; 龙虾 will complete the installation per SKILL.md.

Please hand this block to 龙虾中文库 (龙虾 AI) to execute: install the OpenClaw skill "LLM Evaluator Pro" on this machine. Summary: LLM-as-a-Judge evaluator via Langfuse. Scores traces on relevance, accuracy, ha…
Please fetch the following URL to read SKILL.md and complete the installation as documented: https://raw.githubusercontent.com/openclaw/skills/refs/heads/main/skills/aiwithabidi/llm-evaluator-pro/SKILL.md
(Source: yingzhi8.cn skill library)

SKILL.md

Open the original SKILL.md (GitHub raw)

---
name: llm-evaluator
version: 1.0.0
description: >
  LLM-as-a-Judge evaluator via Langfuse. Scores traces on relevance, accuracy,
  hallucination, and helpfulness using GPT-5-nano as judge. Supports single trace
  scoring, batch backfill, and test mode. Integrates with Langfuse dashboard for
  observability. Triggers: evaluate trace, score quality, check accuracy, backfill
  scores, test evaluator, LLM judge.
license: MIT
compatibility:
  openclaw: ">=0.10"
metadata:
  openclaw:
    requires:
      bins: ["python3"]
      env: ["OPENROUTER_API_KEY", "LANGFUSE_PUBLIC_KEY", "LANGFUSE_SECRET_KEY"]
---

# LLM Evaluator ⚖️

LLM-as-a-Judge evaluation system powered by Langfuse. Uses GPT-5-nano to score AI outputs.

## When to Use

- Evaluating quality of search results or AI responses
- Scoring traces for relevance, accuracy, hallucination detection
- Batch scoring recent unscored traces
- Quality assurance on agent outputs

## Usage

```bash
# Test with sample cases
python3 {baseDir}/scripts/evaluator.py test

# Score a specific Langfuse trace
python3 {baseDir}/scripts/evaluator.py score <trace_id>

# Score with specific evaluator only
python3 {baseDir}/scripts/evaluator.py score <trace_id> --evaluators relevance

# Backfill scores on recent unscored traces
python3 {baseDir}/scripts/evaluator.py backfill --limit 20
```

## Evaluators

| Evaluator | Measures | Scale |
|-----------|----------|-------|
| relevance | Response relevance to query | 0–1 |
| accuracy | Factual correctness | 0–1 |
| hallucination | Made-up information detection | 0–1 |
| helpfulness | Overall usefulness | 0–1 |
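To make the 0–1 scale concrete, here is a hedged sketch of how a judge's raw rating might be normalized and shaped into a score record. The field names (`traceId`, `name`, `value`) follow Langfuse's public scores API, but the assumed 0–10 judge scale and the helper names are illustrative, not taken from this skill's script.

```python
def to_unit_score(raw, lo=0.0, hi=10.0):
    """Clamp a raw judge rating (assumed 0-10 here) onto the 0-1 scale."""
    return min(max((raw - lo) / (hi - lo), 0.0), 1.0)

def score_record(trace_id, evaluator, raw):
    """Shape one evaluator result the way Langfuse's scores API expects."""
    return {"traceId": trace_id, "name": evaluator, "value": to_unit_score(raw)}
```

Clamping keeps out-of-range judge outputs (a common LLM failure mode) from producing scores outside the documented scale.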

## Credits

Built by [M. Abidi](https://www.linkedin.com/in/mohammad-ali-abidi) | [agxntsix.ai](https://www.agxntsix.ai)
[YouTube](https://youtube.com/@aiwithabidi) | [GitHub](https://github.com/aiwithabidi)
Part of the **AgxntSix Skill Suite** for OpenClaw agents.

📅 **Need help setting up OpenClaw for your business?** [Book a free consultation](https://cal.com/agxntsix/abidi-openclaw)