Skip to content

Conversation

@Cali0707
Copy link
Contributor

@Cali0707 Cali0707 commented Dec 2, 2025

This PR is a proposal for a way to improve the portability and maintainability of gevals tasks

Summary by CodeRabbit

  • Documentation
    • Added comprehensive proposal document for Portable Task Format specification, detailing task schema design, configuration and metadata, step definitions with control flow patterns, extension architecture for custom operations, variable templating, built-in operation types, cleanup semantics, and migration pathway from legacy formats.

✏️ Tip: You can customize this high-level summary in your review settings.

Signed-off-by: Calum Murray <cmurray@redhat.com>
@coderabbitai
Copy link

coderabbitai bot commented Dec 2, 2025

Walkthrough

A new proposal document is added that specifies a Portable Task Format for MCP Server Evaluation. The document defines a declarative YAML-based task schema, extension architecture with packaged binaries and JSON-based protocols, built-in and control-flow step types, variable templating, cleanup semantics, and migration guidance from legacy bash-script tasks.

Changes

Cohort / File(s) Summary
Portable Task Format Proposal
docs/proposals/portable-task-format.md
Comprehensive proposal document introducing a declarative YAML task format to replace bash-script task definitions. Details problem statements (script complexity, portability, dual-verify design), requirements, schema specification (metadata, spec imports, environment, prompts, steps), extension architecture (binaries, manifests, sandboxing, JSON I/O protocol), built-in step types (command, http, file, llm, script), control-flow steps (foreach, anyOf, group), variable templating, cleanup semantics, and migration path with examples.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

  • Schema validation: Verify the proposed YAML schema is well-defined, unambiguous, and covers stated requirements
  • Extension protocol clarity: Ensure the JSON-based I/O protocol and extension invocation mechanics are clearly and consistently documented
  • Migration feasibility: Review the migration path from legacy tasks for completeness and practical viability
  • Example consistency: Check that YAML snippets and JSON I/O examples align with described behavior and cover edge cases
  • Architectural soundness: Assess extension sandboxing approach, cleanup semantics (reverse order, continueOnError), and overall design coherence

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'proposal: portable task format' clearly and concisely summarizes the main change—a proposal document introducing a new portable task format for the MCP Server Evaluation system.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
✨ Finishing touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🧹 Nitpick comments (4)
docs/proposals/portable-task-format.md (4)

209-222: Clarify variable templating scope and precedence.

The variable templating section uses {name} syntax, which is custom to this proposal and differs from common shell/CI/CD conventions (${name}, ${{ name }}, $name). While this is defensible (avoids conflicts with nested templating), the section could benefit from:

  1. A note explaining why this syntax was chosen over alternatives.
  2. A clarification on precedence: What happens if spec.env defines NAMESPACE but the shell also has NAMESPACE set? The current wording ("if not overridden") is slightly ambiguous about direction.

535-535: Clarify the role distinction between actions, checks, and operations.

Line 535 states: "Actions are used in setup/cleanup phases. Checks are used in verify phase. Some operations (like exec) can be both." This is clear, but it would help readers to note in the extension manifest schema (lines 452–533) whether operations that can be both (like exec) are declared under both actions and checks, or under a separate operations section. The example manifest doesn't show an example of a dual-purpose operation.


314-323: LLM judge availability should be documented as a prerequisite or with a fallback.

The llm step type (lines 314–323) states: "This uses the LLM judge configured at the eval level, not the task level. If no judge is configured, this step fails." This is clear but raises a question: Should the task schema include a way to declare that an LLM judge is required? Or should tooling warn users when an llm step is used in a task but the eval does not configure a judge? This could prevent runtime surprises.


594-607: Sandboxing strategy needs security and operational guidance.

The sandboxing section (lines 594–607) mentions running extensions in containers with allowedSources for untrusted contexts, which is good. However, the proposal lacks detail on:

  1. What capabilities/permissions extensions have when sandboxed (network access, filesystem mounts, resource limits)?
  2. How to verify extension integrity (checksum validation, signing)?
  3. What happens if a sandboxed extension times out or crashes?
  4. How to roll back a malicious or broken extension without manual intervention?

These questions are likely out of scope for this proposal but should be tracked for the implementation phase.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5fae087 and bba4c72.

📒 Files selected for processing (1)
  • docs/proposals/portable-task-format.md (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: Cali0707
Repo: genmcp/gevals PR: 39
File: .claude/skills/create-eval/SKILL.md:20-20
Timestamp: 2025-11-18T20:44:43.077Z
Learning: In the .claude/skills/create-eval/SKILL.md file, the eval creation instructions reference documentation files (.md) that explain each component (tasks.md, mcpConfig.md, agent.md, eval.md), not the actual YAML configuration files. The eval.md file contains documentation describing how to create eval.yaml files.
🪛 LanguageTool
docs/proposals/portable-task-format.md

[style] ~7-~7: To elevate your writing, try using a synonym here.
Context: ..., and error messages. These scripts are hard to read, debug, and maintain. See `exam...

(HARD_TO)

🪛 markdownlint-cli2 (0.18.1)
docs/proposals/portable-task-format.md

427-427: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


438-438: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🔇 Additional comments (3)
docs/proposals/portable-task-format.md (3)

1-50: Well-structured proposal with clear problem framing and comprehensive requirements.

The problem statement (lines 3–15) clearly articulates pain points with the current bash-script approach. The requirements section (lines 17–58) is thoughtfully organized into four dimensions (Portability, Expressiveness, Extensibility, Compatibility) and avoids over-specifying. The solution overview example (lines 71–128) effectively demonstrates the value of the declarative format.


130-419: Detailed design is thorough and implementable.

The task schema definition, extension imports, variable templating, and control-flow steps are well-specified with clear examples. The extension invocation protocol (lines 537–592) is particularly strong: JSON input/output, clear success/failure semantics, and the script protocol (lines 609–677) provides a structured alternative to raw bash exit codes. The cleanup semantics (lines 679–689) thoughtfully address real-world orchestration concerns (always run, reverse order, continueOnError defaults).


726-741: Migration path is pragmatic and lowers adoption friction.

The opt-in approach via apiVersion: mcp-eval/v1 and preserving the legacy steps.setup.file format during transition is sensible. This avoids a big-bang migration and allows teams to adopt incrementally.

| `{steps.STEP_ID.outputs.NAME}` | Output from a previous step |
| `{agent.output}` | Agent's response (verify only) |

Environment variables from the shell (`$HOME`, etc.) are also available via `{env.HOMEi}` if not overridden in `spec.env`.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Fix typo in variable templating example.

Line 222 contains a typo: {env.HOMEi} should be {env.HOME}. The trailing i appears unintended.

-Environment variables from the shell (`$HOME`, etc.) are also available via `{env.HOMEi}` if not overridden in `spec.env`.
+Environment variables from the shell (`$HOME`, etc.) are also available via `{env.HOME}` if not overridden in `spec.env`.
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
Environment variables from the shell (`$HOME`, etc.) are also available via `{env.HOMEi}` if not overridden in `spec.env`.
Environment variables from the shell (`$HOME`, etc.) are also available via `{env.HOME}` if not overridden in `spec.env`.
🤖 Prompt for AI Agents
In docs/proposals/portable-task-format.md around line 222, there's a typo in the
environment variable templating example: `{env.HOMEi}` should be `{env.HOME}`;
remove the trailing "i" so the example uses `{env.HOME}` to correctly reference
the HOME environment variable.

outputs:
body: "{response.body}"
status: "{response.status}"
header: "{response.headers.X-Custom}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Fix missing closing quote in YAML example.

Line 274 has a missing closing quote in the http step output capture:

     outputs:
       body: "{response.body}"
       status: "{response.status}"
-      header: "{response.headers.X-Custom}
+      header: "{response.headers.X-Custom}"
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
header: "{response.headers.X-Custom}
outputs:
body: "{response.body}"
status: "{response.status}"
header: "{response.headers.X-Custom}"
🤖 Prompt for AI Agents
In docs/proposals/portable-task-format.md at line 274, the YAML example for the
HTTP step output capture has a missing closing quote on the header value; fix it
by adding the closing double-quote so the header line reads header:
"{response.headers.X-Custom}" ensuring proper YAML string quoting.

Comment on lines +427 to +430
```
github.com/gevals/ext-kubernetes@v1.2.0
github.com/myorg/ext-postgres@v0.1.0
```
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Add language specification to fenced code block.

Markdown linting requires language identifiers for code blocks. This block shows shell commands and should be marked accordingly.

-```
+```bash
 github.com/gevals/ext-kubernetes/releases/v1.2.0/
 ├── ext-kubernetes-darwin-amd64
 ├── ext-kubernetes-darwin-arm64
 ├── ext-kubernetes-linux-amd64
 ├── ext-kubernetes-linux-arm64
 ├── checksums.sha256
 └── extension.yaml              # Extension manifest
-```
+```
🧰 Tools
🪛 markdownlint-cli2 (0.18.1)

427-427: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🤖 Prompt for AI Agents
In docs/proposals/portable-task-format.md around lines 427 to 430, the fenced
code block lacks a language identifier; update the opening fence from ``` to
```bash to mark the block as bash (leave the block contents and closing fence
unchanged) so Markdown linting recognizes the shell commands.

Comment on lines +438 to +446
```
github.com/gevals/ext-kubernetes/releases/v1.2.0/
├── ext-kubernetes-darwin-amd64
├── ext-kubernetes-darwin-arm64
├── ext-kubernetes-linux-amd64
├── ext-kubernetes-linux-arm64
├── checksums.sha256
└── extension.yaml # Extension manifest
```
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Add language specification to fenced code block.

Markdown linting requires language identifiers for code blocks. This block displays a directory structure and should be marked as a code block with appropriate language hint.

-```
+```
 name: kubernetes
 version: 1.2.0
 description: Kubernetes resource verification

Committable suggestion skipped: line range outside the PR's diff.

🧰 Tools
🪛 markdownlint-cli2 (0.18.1)

438-438: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🤖 Prompt for AI Agents
In docs/proposals/portable-task-format.md around lines 438 to 446, the fenced
code block showing a directory tree lacks a language identifier which fails
markdown linting; update the opening fence to include an appropriate language
hint (e.g., ```text or ```bash) for the directory tree block, and for the
adjacent YAML snippet add ```yaml as the fence language so both code blocks have
correct language identifiers.


3. Tasks must not require a specific programming language runtime. A task authored by a Go developer should be runnable on a system that only has Python, and vice versa.

### Expressiveness
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this task definition language is a great idea for the reasons mentioned!

Please include "what current scripts express" as a measure of "is it expressive enough". For current scripts I'd point to the ones in: https://github.com/genmcp/gevals/pull/34/files#diff-c7596df0e00fb3e2fd5d79d3291a9f54cbe9c8807ef56a8bb917038ed4d9e544

The longest one of which does weigh in at 200+ lines 🤭

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the feedback!

To make sure I understand your idea - are you saying we should add a requirement to be able to port all of the existing eval scenarios there into the new format? Or is there something else here?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants