Skill details (on-site mirror, no comments)
Author: Kevin Anderson @anderskev
License: MIT-0
MIT-0 · Free to use, modify, and redistribute. No attribution required.
Version: v1.0.0
Stats: ⭐ 0 · 38 · 1 current installs · 1 all-time installs
Package: anderskev/llm-judge
Security scan (ClawHub)
- VirusTotal: Benign
- OpenClaw: Benign
OpenClaw assessment
The skill's instructions, required resources, and outputs are consistent with its stated purpose (automated LLM-based comparison of repositories); the main risk is that it runs repository tests and shell commands, which is expected for this task but dangerous for untrusted code.
Purpose
Name/description (LLM-as-judge across repos) matches the runtime instructions: spawning repo-gathering agents, collecting structured facts, running tests, and spawning judge agents to score using rubrics. No unrelated credentials, binaries, or install steps are requested.
Instruction scope
Instructions legitimately require reading repository files, running git commands, and executing tests to collect facts. This is coherent with the stated purpose. However, running tests (pytest, npm test, go test, etc.) and executing shell commands inside each repo can execute arbitrary code from the target repository — a safety risk if the repo is untrusted. The skill does not specify sandboxing, network restrictions, or limits on what tests m…
Install mechanism
Instruction-only skill with no install spec and no external downloads. This minimizes supply-chain risk and is proportionate to the task.
Credentials
No environment variables, credentials, or config paths are requested. The skill references other internal skills (e.g., @beagle:llm-artifacts-detection) which is expected for modular analysis; nothing asks for unrelated secrets or external service keys.
Persistence
always is false and the skill does not request elevated platform privileges. It will write its report to .beagle/llm-judge-report.json in the analyzed repo (expected behavior). Because the skill can be invoked autonomously by agents (platform default), combined with its behavior of running repo tests, the operational blast radius is higher if used on untrusted repos — consider restricting autonomy or using sandboxing.
Overall conclusion
This skill appears to be what it claims: an LLM-based judge that inspects repositories, runs tests, gathers structured facts, and scores repos with rubrics. Before installing or using it, consider: 1) Running it only on repositories you trust or in an isolated/sandboxed environment (containers or VMs) because Step 1 executes tests and shell commands that can run arbitrary code. 2) Restrict network access for the test runs if you need to avoid …
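Install (copy to 龙虾 AI)
Copy the entire passage below into a 龙虾中文库 conversation, and 龙虾 will complete the installation by following SKILL.md.
Please hand this passage to 龙虾中文库 (龙虾 AI) to execute: install the OpenClaw skill "Llm Judge" on this machine. Summary: LLM-as-judge methodology for comparing code implementations across repositories….
Fetch the following address to read SKILL.md, then complete the installation as documented: https://raw.githubusercontent.com/openclaw/skills/refs/heads/main/skills/anderskev/llm-judge/SKILL.md
(Source: yingzhi8.cn skill library)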
SKILL.md
---
name: llm-judge
description: LLM-as-judge methodology for comparing code implementations across repositories. Scores implementations on functionality, security, test quality, overengineering, and dead code using weighted rubrics. Used by /beagle:llm-judge command.
---
# LLM Judge Skill
Compare code implementations across 2+ repositories using structured evaluation.
## Overview
This skill implements a two-phase LLM-as-judge evaluation:
1. **Phase 1: Fact Gathering** - Parallel agents explore each repo and extract structured facts
2. **Phase 2: Judging** - Parallel judges score each dimension using consistent rubrics
## Reference Files
| File | Purpose |
|------|---------|
| [references/fact-schema.md](references/fact-schema.md) | JSON schema for Phase 1 facts |
| [references/scoring-rubrics.md](references/scoring-rubrics.md) | Detailed rubrics for each dimension |
| [references/repo-agent.md](references/repo-agent.md) | Instructions for Phase 1 agents |
| [references/judge-agents.md](references/judge-agents.md) | Instructions for Phase 2 judges |
## Scoring Dimensions
| Dimension | Default Weight | Evaluates |
|-----------|----------------|-----------|
| Functionality | 30% | Spec compliance, test pass rate |
| Security | 25% | Vulnerabilities, security patterns |
| Test Quality | 20% | Coverage, DRY, mock boundaries |
| Overengineering | 15% | Unnecessary complexity |
| Dead Code | 10% | Unused code, TODOs |
## Scoring Scale
| Score | Meaning |
|-------|---------|
| 5 | Excellent - Exceeds expectations |
| 4 | Good - Meets requirements, minor issues |
| 3 | Average - Functional but notable gaps |
| 2 | Below Average - Significant issues |
| 1 | Poor - Fails basic requirements |
## Phase 1: Spawning Repo Agents
For each repository, spawn a Task agent with:
```
You are a Phase 1 Repo Agent for the LLM Judge evaluation.
**Your Repo:** $REPO_LABEL at $REPO_PATH
**Spec Document:**
$SPEC_CONTENT
**Instructions:** Read @beagle:llm-judge references/repo-agent.md
Gather facts and return a JSON object following the schema in references/fact-schema.md.
Load @beagle:llm-artifacts-detection for dead code and overengineering analysis.
Return ONLY valid JSON, no markdown or explanations.
```
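The `$REPO_LABEL`, `$REPO_PATH`, and `$SPEC_CONTENT` placeholders happen to match the syntax of Python's `string.Template`, so one way to fill the prompt might look like the sketch below. The helper name `render_repo_prompt`, the sample values, and the abbreviated template text are illustrative assumptions, not part of the skill.

```python
from string import Template

# Abbreviated copy of the Phase 1 prompt; the full text is shown above.
REPO_AGENT_PROMPT = Template(
    "You are a Phase 1 Repo Agent for the LLM Judge evaluation.\n"
    "**Your Repo:** $REPO_LABEL at $REPO_PATH\n"
    "**Spec Document:**\n"
    "$SPEC_CONTENT\n"
)


def render_repo_prompt(label: str, path: str, spec: str) -> str:
    """Fill the placeholders (hypothetical helper).

    substitute() raises KeyError if a placeholder is left unfilled,
    which surfaces template typos before an agent is spawned.
    """
    return REPO_AGENT_PROMPT.substitute(
        REPO_LABEL=label, REPO_PATH=path, SPEC_CONTENT=spec
    )


# Made-up label, path, and spec text for illustration only.
prompt = render_repo_prompt("repo-a", "/tmp/repo-a", "Build a CLI that sums numbers.")
```

Using `substitute()` rather than `safe_substitute()` is a deliberate choice here: a prompt with a leftover `$REPO_PATH` would silently mislead the agent.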
## Phase 2: Spawning Judge Agents
After all Phase 1 agents complete, spawn 5 judge agents (one per dimension):
```
You are the $DIMENSION Judge for the LLM Judge evaluation.
**Spec Document:**
$SPEC_CONTENT
**Facts from all repos:**
$ALL_FACTS_JSON
**Instructions:** Read @beagle:llm-judge references/judge-agents.md
Score each repo on $DIMENSION using the rubric in references/scoring-rubrics.md.
Return ONLY valid JSON following the judge output schema.
```
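Because each judge is told to return only JSON, the orchestrating code can reject anything else outright. The following is a minimal sketch, assuming the reply arrives as a string and that the top-level value is a JSON object. `parse_judge_reply` is a hypothetical helper, and the actual judge output schema (defined in references/judge-agents.md) is not validated here.

```python
import json


def parse_judge_reply(reply: str) -> dict:
    """Parse a judge reply that must be a bare JSON object (hypothetical helper).

    Raises ValueError if the reply contains markdown fences or prose,
    or if the top-level JSON value is not an object.
    """
    try:
        data = json.loads(reply)
    except json.JSONDecodeError as exc:
        raise ValueError(f"judge reply is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        # Assumption: the judge output schema describes an object.
        raise ValueError("judge reply must be a JSON object")
    return data
```

A reply wrapped in a markdown code fence fails at `json.loads`, so a retry or an error report can be triggered before aggregation runs.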
## Aggregation
After Phase 2 completes:
1. Collect scores from all 5 judges
2. For each repo, compute weighted total:
```
weighted_total = sum(score[dim] * weight[dim]) / 100
```
3. Rank repos by weighted total (descending)
4. Generate verdict explaining the ranking
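Steps 2 and 3 can be sketched as follows. The weights come from the Scoring Dimensions table. The snake_case dimension keys, the function names, the check that weights sum to 100, and the sample scores are all assumptions made for illustration.

```python
# Default weights (percent) from the Scoring Dimensions table.
# The snake_case keys are an assumption, not the skill's schema.
DEFAULT_WEIGHTS = {
    "functionality": 30,
    "security": 25,
    "test_quality": 20,
    "overengineering": 15,
    "dead_code": 10,
}


def weighted_total(scores: dict, weights: dict = DEFAULT_WEIGHTS) -> float:
    """Compute sum(score[dim] * weight[dim]) / 100.

    With weights summing to 100, the result stays on the 1-5 score scale.
    """
    if sum(weights.values()) != 100:
        raise ValueError("weights must sum to 100")
    return sum(scores[dim] * weights[dim] for dim in weights) / 100


def rank_repos(all_scores: dict) -> list:
    """Return (repo, weighted_total) pairs sorted by total, descending."""
    totals = {repo: weighted_total(s) for repo, s in all_scores.items()}
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Made-up scores for two hypothetical repos.
scores = {
    "repo-a": {"functionality": 4, "security": 3, "test_quality": 5,
               "overengineering": 4, "dead_code": 2},
    "repo-b": {"functionality": 5, "security": 4, "test_quality": 3,
               "overengineering": 3, "dead_code": 4},
}
ranking = rank_repos(scores)
# ranking == [("repo-b", 3.95), ("repo-a", 3.75)]
```

For repo-a the arithmetic is (4·30 + 3·25 + 5·20 + 4·15 + 2·10) / 100 = 375 / 100 = 3.75. For repo-b it is 395 / 100 = 3.95, so repo-b ranks first.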
## Output
Write results to `.beagle/llm-judge-report.json` and display a Markdown summary.
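Writing the report file might look like the sketch below. Only the path `.beagle/llm-judge-report.json` comes from the skill. The `write_report` helper, its `repo_root` parameter, and the shape of the `report` dictionary are assumptions, since the report schema is not described here.

```python
import json
from pathlib import Path


def write_report(repo_root: str, report: dict) -> Path:
    """Write the report to <repo_root>/.beagle/llm-judge-report.json.

    Hypothetical helper: creates the .beagle directory if it is missing
    and overwrites any earlier report.
    """
    out_path = Path(repo_root) / ".beagle" / "llm-judge-report.json"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(json.dumps(report, indent=2), encoding="utf-8")
    return out_path
```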
## Dependencies
- `@beagle:llm-artifacts-detection` - Reused by repo agents for dead code/overengineering