proposal: portable task format #55
Conversation
Signed-off-by: Calum Murray <cmurray@redhat.com>
Walkthrough

A new proposal document is added that specifies a Portable Task Format for MCP Server Evaluation. The document defines a declarative YAML-based task schema, an extension architecture with packaged binaries and JSON-based protocols, built-in and control-flow step types, variable templating, cleanup semantics, and migration guidance from legacy bash-script tasks.

Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Actionable comments posted: 4
🧹 Nitpick comments (4)
docs/proposals/portable-task-format.md (4)
209-222: Clarify variable templating scope and precedence.

The variable templating section uses `{name}` syntax, which is custom to this proposal and differs from common shell/CI/CD conventions (`${name}`, `${{ name }}`, `$name`). While this is defensible (it avoids conflicts with nested templating), the section would benefit from:

- A note explaining why this syntax was chosen over the alternatives.
- A clarification on precedence: what happens if `spec.env` defines `NAMESPACE` but the shell also has `NAMESPACE` set? The current wording ("if not overridden") is ambiguous about which direction overrides.
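For illustration, the precedence question could be pinned down with a concrete case like the one below. This is a sketch only: `spec.env` and `{env.NAMESPACE}` come from the proposal, while the `id`, `exec`, and `command` fields are assumed names for the example, and neither resolution order is something the proposal currently specifies.

```yaml
# Shell environment at invocation time: NAMESPACE=from-shell
spec:
  env:
    NAMESPACE: from-spec
  verify:
    - id: check-ns
      exec:
        # Does {env.NAMESPACE} resolve to "from-spec" (spec.env wins)
        # or "from-shell" (shell wins)? The proposal should state which.
        command: "kubectl get ns {env.NAMESPACE}"
```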
535-535: Clarify the role distinction between actions, checks, and operations.

Line 535 states: "Actions are used in setup/cleanup phases. Checks are used in verify phase. Some operations (like `exec`) can be both." This is clear, but it would help readers to note in the extension manifest schema (lines 452–533) whether dual-purpose operations like `exec` are declared under both `actions` and `checks`, or under a separate `operations` section. The example manifest doesn't show an example of a dual-purpose operation.
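One illustrative way a manifest could make this explicit is to list the dual-purpose operation under both sections, as sketched below. This is a hypothetical shape, not the proposal's schema; operation names other than `exec` are invented for the example.

```yaml
# Hypothetical manifest fragment: exec appears in both sections,
# marking it usable as an action (setup/cleanup) and as a check (verify).
actions:
  - name: exec
checks:
  - name: exec
  - name: resourceExists   # check only
```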
314-323: LLM judge availability should be documented as a prerequisite or with a fallback.

The `llm` step type (lines 314–323) states: "This uses the LLM judge configured at the eval level, not the task level. If no judge is configured, this step fails." This is clear but raises a question: should the task schema include a way to declare that an LLM judge is required? Or should tooling warn users when an `llm` step is used in a task but the eval does not configure a judge? This could prevent runtime surprises.
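A lightweight option would be a declared prerequisite that tooling can validate before the run starts. The sketch below is purely hypothetical: `apiVersion` and the `llm` step come from the proposal, while `kind`, `requires`, `llmJudge`, and `id` are assumed names for the illustration.

```yaml
apiVersion: mcp-eval/v1
kind: Task
spec:
  requires:
    llmJudge: true   # hypothetical: lint/runtime can fail fast if the eval configures no judge
  verify:
    - id: judge-answer
      llm:
        prompt: "Did the agent correctly list all pods?"
```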
594-607: Sandboxing strategy needs security and operational guidance.

The sandboxing section (lines 594–607) mentions running extensions in containers with `allowedSources` for untrusted contexts, which is good. However, the proposal lacks detail on:
- What capabilities/permissions extensions have when sandboxed (network access, filesystem mounts, resource limits)?
- How to verify extension integrity (checksum validation, signing)?
- What happens if a sandboxed extension times out or crashes?
- How to roll back a malicious or broken extension without manual intervention?
These questions are likely out of scope for this proposal but should be tracked for the implementation phase.
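If tracked for the implementation phase, the configuration surface might look something like the sketch below. Only `allowedSources` and `checksums.sha256` appear in the proposal; every other field name here is hypothetical.

```yaml
extensions:
  allowedSources:
    - github.com/gevals/*
  sandbox:
    network: none        # hypothetical: deny network access by default
    memoryLimit: 256Mi   # hypothetical: resource cap per extension
    timeout: 60s         # hypothetical: kill hung or crashed extensions
  integrity:
    checksums: required  # hypothetical: refuse binaries failing checksums.sha256
```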
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
docs/proposals/portable-task-format.md (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: Cali0707
Repo: genmcp/gevals PR: 39
File: .claude/skills/create-eval/SKILL.md:20-20
Timestamp: 2025-11-18T20:44:43.077Z
Learning: In the .claude/skills/create-eval/SKILL.md file, the eval creation instructions reference documentation files (.md) that explain each component (tasks.md, mcpConfig.md, agent.md, eval.md), not the actual YAML configuration files. The eval.md file contains documentation describing how to create eval.yaml files.
🪛 LanguageTool
docs/proposals/portable-task-format.md
[style] ~7-~7: To elevate your writing, try using a synonym here.
Context: ..., and error messages. These scripts are hard to read, debug, and maintain. See `exam...
(HARD_TO)
🪛 markdownlint-cli2 (0.18.1)
docs/proposals/portable-task-format.md
427-427: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
438-438: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🔇 Additional comments (3)
docs/proposals/portable-task-format.md (3)
1-50: Well-structured proposal with clear problem framing and comprehensive requirements.

The problem statement (lines 3–15) clearly articulates pain points with the current bash-script approach. The requirements section (lines 17–58) is thoughtfully organized into four dimensions (Portability, Expressiveness, Extensibility, Compatibility) and avoids over-specifying. The solution overview example (lines 71–128) effectively demonstrates the value of the declarative format.
130-419: Detailed design is thorough and implementable.

The task schema definition, extension imports, variable templating, and control-flow steps are well specified with clear examples. The extension invocation protocol (lines 537–592) is particularly strong: JSON input/output with clear success/failure semantics, and the script protocol (lines 609–677) provides a structured alternative to raw bash exit codes. The cleanup semantics (lines 679–689) thoughtfully address real-world orchestration concerns (always run, reverse order, continueOnError defaults).
726-741: Migration path is pragmatic and lowers adoption friction.

The opt-in approach via `apiVersion: mcp-eval/v1` and preserving the legacy `steps.setup.file` format during the transition is sensible. This avoids a big-bang migration and allows teams to adopt incrementally.
> | `{steps.STEP_ID.outputs.NAME}` | Output from a previous step |
> | `{agent.output}` | Agent's response (verify only) |
>
> Environment variables from the shell (`$HOME`, etc.) are also available via `{env.HOMEi}` if not overridden in `spec.env`.
Fix typo in variable templating example.

Line 222 contains a typo: `{env.HOMEi}` should be `{env.HOME}`. The trailing `i` appears unintended.

```diff
-Environment variables from the shell (`$HOME`, etc.) are also available via `{env.HOMEi}` if not overridden in `spec.env`.
+Environment variables from the shell (`$HOME`, etc.) are also available via `{env.HOME}` if not overridden in `spec.env`.
```
🤖 Prompt for AI Agents
In docs/proposals/portable-task-format.md around line 222, there's a typo in the
environment variable templating example: `{env.HOMEi}` should be `{env.HOME}`;
remove the trailing "i" so the example uses `{env.HOME}` to correctly reference
the HOME environment variable.
> ```yaml
> outputs:
>   body: "{response.body}"
>   status: "{response.status}"
>   header: "{response.headers.X-Custom}
> ```
Fix missing closing quote in YAML example.

Line 274 has a missing closing quote in the http step output capture:

```diff
 outputs:
   body: "{response.body}"
   status: "{response.status}"
-  header: "{response.headers.X-Custom}
+  header: "{response.headers.X-Custom}"
```
🤖 Prompt for AI Agents
In docs/proposals/portable-task-format.md at line 274, the YAML example for the
HTTP step output capture has a missing closing quote on the header value; fix it
by adding the closing double-quote so the header line reads header:
"{response.headers.X-Custom}" ensuring proper YAML string quoting.
> ```
> github.com/gevals/ext-kubernetes@v1.2.0
> github.com/myorg/ext-postgres@v0.1.0
> ```
Add language specification to fenced code block.

Markdown linting requires language identifiers for code blocks. This block shows shell commands and should be marked accordingly.

````diff
-```
+```bash
 github.com/gevals/ext-kubernetes/releases/v1.2.0/
 ├── ext-kubernetes-darwin-amd64
 ├── ext-kubernetes-darwin-arm64
 ├── ext-kubernetes-linux-amd64
 ├── ext-kubernetes-linux-arm64
 ├── checksums.sha256
 └── extension.yaml # Extension manifest
-```
+```
````
🧰 Tools
+```🧰 Tools
🪛 markdownlint-cli2 (0.18.1)
427-427: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🤖 Prompt for AI Agents
In docs/proposals/portable-task-format.md around lines 427 to 430, the fenced
code block lacks a language identifier; update the opening fence from ``` to
```bash to mark the block as bash (leave the block contents and closing fence
unchanged) so Markdown linting recognizes the shell commands.
> ```
> github.com/gevals/ext-kubernetes/releases/v1.2.0/
> ├── ext-kubernetes-darwin-amd64
> ├── ext-kubernetes-darwin-arm64
> ├── ext-kubernetes-linux-amd64
> ├── ext-kubernetes-linux-arm64
> ├── checksums.sha256
> └── extension.yaml # Extension manifest
> ```
Add language specification to fenced code block.

Markdown linting requires language identifiers for code blocks. This block displays a directory structure and should be marked as a code block with an appropriate language hint.

````diff
-```
+```yaml
 name: kubernetes
 version: 1.2.0
 description: Kubernetes resource verification
````

Committable suggestion skipped: line range outside the PR's diff.
🧰 Tools
🪛 markdownlint-cli2 (0.18.1)
438-438: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🤖 Prompt for AI Agents
In docs/proposals/portable-task-format.md around lines 438 to 446, the fenced
code block showing a directory tree lacks a language identifier which fails
markdown linting; update the opening fence to include an appropriate language
hint (e.g., ```text or ```bash) for the directory tree block, and for the
adjacent YAML snippet add ```yaml as the fence language so both code blocks have
correct language identifiers.
> 3. Tasks must not require a specific programming language runtime. A task authored by a Go developer should be runnable on a system that only has Python, and vice versa.
>
> ### Expressiveness
I think this task definition language is a great idea for the reasons mentioned!
Please include "what current scripts express" as a measure of "is it expressive enough". For current scripts I'd point to the ones in: https://github.com/genmcp/gevals/pull/34/files#diff-c7596df0e00fb3e2fd5d79d3291a9f54cbe9c8807ef56a8bb917038ed4d9e544
The longest one of which does weigh in at 200+ lines 🤭
Thanks for the feedback!
To make sure I understand your idea - are you saying we should add a requirement to be able to port all of the existing eval scenarios there into the new format? Or is there something else here?
This PR is a proposal for a way to improve the portability and maintainability of gevals tasks