proposal: portable task format #55

Cali0707 · 2025-12-02T15:53:55Z

This PR proposes a way to improve the portability and maintainability of gevals tasks.

Summary by CodeRabbit

Documentation
- Added comprehensive proposal document for Portable Task Format specification, detailing task schema design, configuration and metadata, step definitions with control flow patterns, extension architecture for custom operations, variable templating, built-in operation types, cleanup semantics, and migration pathway from legacy formats.


Signed-off-by: Calum Murray <cmurray@redhat.com>

coderabbitai · 2025-12-02T15:54:08Z

Walkthrough

A new proposal document is added that specifies a Portable Task Format for MCP Server Evaluation. The document defines a declarative YAML-based task schema, extension architecture with packaged binaries and JSON-based protocols, built-in and control-flow step types, variable templating, cleanup semantics, and migration guidance from legacy bash-script tasks.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Portable Task Format Proposal `docs/proposals/portable-task-format.md` | Comprehensive proposal document introducing a declarative YAML task format to replace bash-script task definitions. Details problem statements (script complexity, portability, dual-verify design), requirements, schema specification (metadata, spec imports, environment, prompts, steps), extension architecture (binaries, manifests, sandboxing, JSON I/O protocol), built-in step types (command, http, file, llm, script), control-flow steps (foreach, anyOf, group), variable templating, cleanup semantics, and migration path with examples. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

- Schema validation: Verify the proposed YAML schema is well-defined, unambiguous, and covers stated requirements
- Extension protocol clarity: Ensure the JSON-based I/O protocol and extension invocation mechanics are clearly and consistently documented
- Migration feasibility: Review the migration path from legacy tasks for completeness and practical viability
- Example consistency: Check that YAML snippets and JSON I/O examples align with described behavior and cover edge cases
- Architectural soundness: Assess extension sandboxing approach, cleanup semantics (reverse order, continueOnError), and overall design coherence

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title Check | ✅ Passed | The title 'proposal: portable task format' clearly and concisely summarizes the main change: a proposal document introducing a new portable task format for the MCP Server Evaluation system. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |


coderabbitai

Actionable comments posted: 4

🧹 Nitpick comments (4)

docs/proposals/portable-task-format.md (4)

209-222: Clarify variable templating scope and precedence.

The variable templating section uses {name} syntax, which is custom to this proposal and differs from common shell/CI/CD conventions (${name}, ${{ name }}, $name). While this is defensible (avoids conflicts with nested templating), the section could benefit from:

A note explaining why this syntax was chosen over alternatives.

A clarification on precedence: What happens if spec.env defines NAMESPACE but the shell also has NAMESPACE set? The current wording ("if not overridden") is slightly ambiguous about direction.
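To make the precedence question concrete, here is a minimal Python sketch of one reading of "if not overridden", in which `spec.env` wins over the shell. The `render` helper and its regex are hypothetical and only illustrate the direction of the override; they are not taken from the proposal or from gevals.

```python
import re

# Hypothetical placeholder pattern for {env.NAME}; not the proposal's grammar.
_PLACEHOLDER = re.compile(r"\{env\.([A-Za-z_][A-Za-z0-9_]*)\}")


def render(template: str, spec_env: dict, shell_env: dict) -> str:
    """Expand {env.NAME}, preferring spec.env over the shell environment."""
    # Later dict wins on key collisions, so spec.env overrides the shell.
    merged = {**shell_env, **spec_env}

    def replace(match):
        name = match.group(1)
        if name not in merged:
            raise KeyError(f"undefined variable: env.{name}")
        return merged[name]

    return _PLACEHOLDER.sub(replace, template)
```

Under this reading, `render("ns={env.NAMESPACE}", {"NAMESPACE": "task"}, {"NAMESPACE": "shell"})` yields `ns=task`. If the intended direction is the opposite, swapping the merge order is the only change needed, which is why stating it explicitly in the document matters.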

535-535: Clarify the role distinction between actions, checks, and operations.

Line 535 states: "Actions are used in setup/cleanup phases. Checks are used in verify phase. Some operations (like exec) can be both." This is clear, but it would help readers to note in the extension manifest schema (lines 452–533) whether operations that can be both (like exec) are declared under both actions and checks, or under a separate operations section. The example manifest doesn't show an example of a dual-purpose operation.

314-323: LLM judge availability should be documented as a prerequisite or with a fallback.

The llm step type (lines 314–323) states: "This uses the LLM judge configured at the eval level, not the task level. If no judge is configured, this step fails." This is clear but raises a question: Should the task schema include a way to declare that an LLM judge is required? Or should tooling warn users when an llm step is used in a task but the eval does not configure a judge? This could prevent runtime surprises.
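A pre-run warning of this kind could look roughly like the Python sketch below. It assumes a parsed task is a nested structure of dicts and lists in which an `llm` step appears as a mapping key; that shape, and the names `uses_llm_step` and `lint_llm_judge`, are assumptions made for illustration rather than part of the proposed schema.

```python
def uses_llm_step(node) -> bool:
    """Return True if any nested mapping has an `llm` key (assumed step shape)."""
    if isinstance(node, dict):
        return "llm" in node or any(uses_llm_step(v) for v in node.values())
    if isinstance(node, list):
        return any(uses_llm_step(item) for item in node)
    return False


def lint_llm_judge(task: dict, judge_configured: bool) -> list:
    """Return warnings up front instead of failing when the llm step runs."""
    if uses_llm_step(task) and not judge_configured:
        return ["task uses an llm step but the eval configures no LLM judge"]
    return []
```

Walking the whole structure rather than a fixed path means steps nested inside control-flow constructs such as `group` or `anyOf` are found without the linter needing to know their exact layout.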

594-607: Sandboxing strategy needs security and operational guidance.

The sandboxing section (lines 594–607) mentions running extensions in containers with allowedSources for untrusted contexts, which is good. However, the proposal lacks detail on:

What capabilities/permissions extensions have when sandboxed (network access, filesystem mounts, resource limits)?

How to verify extension integrity (checksum validation, signing)?

What happens if a sandboxed extension times out or crashes?

How to roll back a malicious or broken extension without manual intervention?

These questions are likely out of scope for this proposal but should be tracked for the implementation phase.
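On the integrity question, the release layout in the proposal already lists a `checksums.sha256` file, so a verification step could be sketched as below. This assumes the file uses the common `sha256sum` line format (`<hex digest>  <filename>`), which the proposal does not confirm; `parse_checksums` and `verify_binary` are hypothetical names.

```python
import hashlib


def parse_checksums(text: str) -> dict:
    """Parse sha256sum-style lines: '<hex digest>  <filename>' (assumed format)."""
    entries = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            digest, name = parts
            # sha256sum prefixes the name with '*' in binary mode.
            entries[name.lstrip("*")] = digest.lower()
    return entries


def verify_binary(name: str, data: bytes, checksums_text: str) -> bool:
    """Return True only if `data` hashes to the digest listed for `name`."""
    expected = parse_checksums(checksums_text).get(name)
    if expected is None:
        return False  # unlisted artifact: refuse rather than trust
    return hashlib.sha256(data).hexdigest() == expected
```

A checksum file only protects against corruption or a tampered binary if the checksum file itself is trusted, so signing would still need to be addressed separately.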

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5fae087 and bba4c72.

📒 Files selected for processing (1)

docs/proposals/portable-task-format.md (1 hunks)

🧰 Additional context used

🧠 Learnings (1)

📓 Common learnings

Learnt from: Cali0707
Repo: genmcp/gevals PR: 39
File: .claude/skills/create-eval/SKILL.md:20-20
Timestamp: 2025-11-18T20:44:43.077Z
Learning: In the .claude/skills/create-eval/SKILL.md file, the eval creation instructions reference documentation files (.md) that explain each component (tasks.md, mcpConfig.md, agent.md, eval.md), not the actual YAML configuration files. The eval.md file contains documentation describing how to create eval.yaml files.

🪛 LanguageTool

docs/proposals/portable-task-format.md

[style] ~7-~7: To elevate your writing, try using a synonym here.
Context: ..., and error messages. These scripts are hard to read, debug, and maintain. See `exam...

(HARD_TO)

🪛 markdownlint-cli2 (0.18.1)

docs/proposals/portable-task-format.md

427-427: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

438-438: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🔇 Additional comments (3)

docs/proposals/portable-task-format.md (3)

1-50: Well-structured proposal with clear problem framing and comprehensive requirements.

The problem statement (lines 3–15) clearly articulates pain points with the current bash-script approach. The requirements section (lines 17–58) is thoughtfully organized into four dimensions (Portability, Expressiveness, Extensibility, Compatibility) and avoids over-specifying. The solution overview example (lines 71–128) effectively demonstrates the value of the declarative format.

130-419: Detailed design is thorough and implementable.

The task schema definition, extension imports, variable templating, and control-flow steps are well-specified with clear examples. The extension invocation protocol (lines 537–592) is particularly strong: JSON input/output, clear success/failure semantics, and the script protocol (lines 609–677) provides a structured alternative to raw bash exit codes. The cleanup semantics (lines 679–689) thoughtfully address real-world orchestration concerns (always run, reverse order, continueOnError defaults).
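The cleanup semantics summarized here (always run, reverse order, continue on error) can be illustrated with a short Python sketch. It is not the proposal's implementation: `run_cleanup` is a hypothetical helper, steps are modeled as `(name, callable)` pairs, and `continue_on_error=True` is an assumed default.

```python
def run_cleanup(steps, continue_on_error=True):
    """Run cleanup callables in reverse declaration order.

    `steps` is a list of (name, callable) pairs. Returns (name, error)
    pairs for the steps that raised. The default of True for
    continue_on_error is an assumption, not taken from the proposal.
    """
    failures = []
    for name, action in reversed(steps):
        try:
            action()
        except Exception as exc:
            failures.append((name, exc))
            if not continue_on_error:
                break  # stop at the first failing cleanup step
    return failures
```

Calling `run_cleanup` from a `finally` block around setup, agent run, and verify gives the "always run" behaviour, and reversing the list means resources are torn down in the opposite order to their creation.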

726-741: Migration path is pragmatic and lowers adoption friction.

The opt-in approach via apiVersion: mcp-eval/v1 and preserving the legacy steps.setup.file format during transition is sensible. This avoids a big-bang migration and allows teams to adopt incrementally.

coderabbitai · 2025-12-02T15:56:09Z

docs/proposals/portable-task-format.md

+| `{steps.STEP_ID.outputs.NAME}` | Output from a previous step |
+| `{agent.output}` | Agent's response (verify only) |
+
+Environment variables from the shell (`$HOME`, etc.) are also available via `{env.HOMEi}` if not overridden in `spec.env`.


⚠️ Potential issue | 🟡 Minor

Fix typo in variable templating example.

Line 222 contains a typo: {env.HOMEi} should be {env.HOME}. The trailing i appears unintended.

-Environment variables from the shell (`$HOME`, etc.) are also available via `{env.HOMEi}` if not overridden in `spec.env`.
+Environment variables from the shell (`$HOME`, etc.) are also available via `{env.HOME}` if not overridden in `spec.env`.


🤖 Prompt for AI Agents

In docs/proposals/portable-task-format.md around line 222, there's a typo in the environment variable templating example: `{env.HOMEi}` should be `{env.HOME}`; remove the trailing "i" so the example uses `{env.HOME}` to correctly reference the HOME environment variable.

coderabbitai · 2025-12-02T15:56:09Z

docs/proposals/portable-task-format.md

+    outputs:
+      body: "{response.body}"
+      status: "{response.status}"
+      header: "{response.headers.X-Custom}


⚠️ Potential issue | 🟡 Minor

Fix missing closing quote in YAML example.

Line 274 has a missing closing quote in the http step output capture:

     outputs:
       body: "{response.body}"
       status: "{response.status}"
-      header: "{response.headers.X-Custom}
+      header: "{response.headers.X-Custom}"


🤖 Prompt for AI Agents

In docs/proposals/portable-task-format.md at line 274, the YAML example for the HTTP step output capture has a missing closing quote on the header value; fix it by adding the closing double-quote so the header line reads header: "{response.headers.X-Custom}" ensuring proper YAML string quoting.

coderabbitai · 2025-12-02T15:56:09Z

docs/proposals/portable-task-format.md

+```
+github.com/gevals/ext-kubernetes@v1.2.0
+github.com/myorg/ext-postgres@v0.1.0
+```


⚠️ Potential issue | 🟡 Minor

Add language specification to fenced code block.

Markdown linting requires language identifiers for code blocks. This block shows shell commands and should be marked accordingly.

-```
+```bash
 github.com/gevals/ext-kubernetes/releases/v1.2.0/
 ├── ext-kubernetes-darwin-amd64
 ├── ext-kubernetes-darwin-arm64
 ├── ext-kubernetes-linux-amd64
 ├── ext-kubernetes-linux-arm64
 ├── checksums.sha256
 └── extension.yaml              # Extension manifest
-```
+```

🧰 Tools

🪛 markdownlint-cli2 (0.18.1)

427-427: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🤖 Prompt for AI Agents

In docs/proposals/portable-task-format.md around lines 427 to 430, the fenced code block lacks a language identifier; update the opening fence from ``` to ```bash to mark the block as bash (leave the block contents and closing fence unchanged) so Markdown linting recognizes the shell commands.

coderabbitai · 2025-12-02T15:56:09Z

docs/proposals/portable-task-format.md

+```
+github.com/gevals/ext-kubernetes/releases/v1.2.0/
+├── ext-kubernetes-darwin-amd64
+├── ext-kubernetes-darwin-arm64
+├── ext-kubernetes-linux-amd64
+├── ext-kubernetes-linux-arm64
+├── checksums.sha256
+└── extension.yaml              # Extension manifest
+```


⚠️ Potential issue | 🟡 Minor

Add language specification to fenced code block.

Markdown linting requires language identifiers for code blocks. This block displays a directory structure and should be marked as a code block with appropriate language hint.

-```
+```
 name: kubernetes
 version: 1.2.0
 description: Kubernetes resource verification

Committable suggestion skipped: line range outside the PR's diff.

🧰 Tools

🪛 markdownlint-cli2 (0.18.1)

438-438: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🤖 Prompt for AI Agents

In docs/proposals/portable-task-format.md around lines 438 to 446, the fenced code block showing a directory tree lacks a language identifier which fails markdown linting; update the opening fence to include an appropriate language hint (e.g., ```text or ```bash) for the directory tree block, and for the adjacent YAML snippet add ```yaml as the fence language so both code blocks have correct language identifiers.
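For anyone wanting to catch these before review, the check behind MD040 can be approximated in a few lines of Python. This is a simplified sketch, not markdownlint's implementation: it ignores indented code blocks and fences nested inside lists or block quotes, and `fences_missing_language` is a hypothetical name.

```python
import re

# Opening or closing fence: up to 3 spaces of indent, then 3+ backticks or tildes.
FENCE = re.compile(r"^(\s{0,3})(`{3,}|~{3,})(.*)$")


def fences_missing_language(markdown: str) -> list:
    """Return 1-based line numbers of opening fences with no info string."""
    missing = []
    open_fence = None  # (fence character, length) while inside a block
    for number, line in enumerate(markdown.splitlines(), start=1):
        match = FENCE.match(line)
        if not match:
            continue
        marker, info = match.group(2), match.group(3).strip()
        if open_fence is None:
            open_fence = (marker[0], len(marker))
            if not info:
                missing.append(number)
        elif marker[0] == open_fence[0] and len(marker) >= open_fence[1] and not info:
            open_fence = None  # matching closing fence
    return missing
```

Run against the proposal file, it should report the same two opening fences that markdownlint flagged, assuming no other unlabeled blocks exist.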

bentito · 2025-12-02T16:40:47Z

docs/proposals/portable-task-format.md

+
+3. Tasks must not require a specific programming language runtime. A task authored by a Go developer should be runnable on a system that only has Python, and vice versa.
+
+### Expressiveness


I think this task definition language is a great idea for the reasons mentioned!

Please include "what current scripts express" as a measure of "is it expressive enough". For current scripts I'd point to the ones in: https://github.com/genmcp/gevals/pull/34/files#diff-c7596df0e00fb3e2fd5d79d3291a9f54cbe9c8807ef56a8bb917038ed4d9e544

The longest one of which does weigh in at 200+ lines 🤭

Cali0707 · Dec 2, 2025

Thanks for the feedback!

To make sure I understand your idea - are you saying we should add a requirement to be able to port all of the existing eval scenarios there into the new format? Or is there something else here?

proposal: portable task format

bba4c72

Signed-off-by: Calum Murray <cmurray@redhat.com>

coderabbitai bot reviewed Dec 2, 2025


bentito reviewed Dec 2, 2025


This was referenced Dec 3, 2025:

- [WIP]: feat: add builtin steps #56 (Draft)
- Create integration tests #58 (Open)
