Add comprehensive evaluation framework for measuring user story enhancement quality #5
This PR implements a comprehensive evaluation (evals) framework to systematically measure and monitor the quality of AI-enhanced user stories. The framework addresses the need for objective quality assessment of LLM outputs in this domain.
What's Added
🔍 Core Evaluation Framework
- FormatComplianceEvaluator: Validates JSON structure and required fields
- UserStoryStructureEvaluator: Ensures proper "As a... I want... So that..." format
- AcceptanceCriteriaQualityEvaluator: Assesses SMART criteria and testability
- SingleRequirementEvaluator: Validates adherence to the single-requirement principle

📊 Evaluation Suites & Test Cases
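As a sketch of how a suite might wire test cases to the evaluators above (the `EvalSuite` and `Evaluator` shapes shown here are illustrative assumptions, not the framework's actual types):

```typescript
// Hypothetical shapes for illustration; the framework's real types may differ.
interface EvalResult {
  score: number; // 0..1
  passed: boolean;
}

interface Evaluator {
  name: string;
  evaluate(output: string): EvalResult;
}

interface EvalSuite {
  name: string;
  testCases: string[]; // enhanced-story outputs to score
  evaluators: Evaluator[];
}

// A toy format-compliance check standing in for the real evaluator.
const formatCompliance: Evaluator = {
  name: "FormatCompliance",
  evaluate(output) {
    try {
      JSON.parse(output);
      return { score: 1, passed: true };
    } catch {
      return { score: 0, passed: false };
    }
  },
};

const userStorySuite: EvalSuite = {
  name: "user-story-enhancement",
  testCases: [
    '{"story":"As a shopper, I want to save items for later, so that I can buy them next visit."}',
  ],
  evaluators: [formatCompliance],
};
```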
🛠 Developer Tools & Utilities
- Example runner (`examples/runEvals.ts`)
- Shell script for local runs (`scripts/run-evals.sh`)
- `npm run eval` for quick evaluation runs

📈 Quality Metrics & Scoring
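A minimal sketch of how per-evaluator scores might roll up into one overall quality metric (the weighting scheme below is an assumption for illustration, not the framework's actual formula):

```typescript
// Aggregate evaluator scores into a single weighted metric.
// Weights are illustrative; the framework may weigh evaluators differently.
function overallScore(results: { name: string; score: number }[]): number {
  const weights: Record<string, number> = {
    FormatCompliance: 0.2,
    UserStoryStructure: 0.3,
    AcceptanceCriteriaQuality: 0.3,
    SingleRequirement: 0.2,
  };
  let total = 0;
  let weightSum = 0;
  for (const r of results) {
    const w = weights[r.name] ?? 0.25; // default weight for unknown evaluators
    total += w * r.score;
    weightSum += w;
  }
  return weightSum > 0 ? total / weightSum : 0;
}
```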
Usage Examples
Quick Start
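Assuming the package scripts land as described above:

```bash
# One-off run via the npm script
npm run eval

# Or via the shell wrapper
./scripts/run-evals.sh
```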
Programmatic Usage
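A sketch of driving an evaluator from code (the import path, constructor, and `evaluate` signature are assumptions based on the class names above; see `examples/runEvals.ts` for the real entry point):

```typescript
// Import path is an assumption; adjust to the framework's actual layout.
import { FormatComplianceEvaluator } from "../src/evals/formatCompliance";

const enhanced =
  '{"story":"As a user, I want X, so that Y","acceptanceCriteria":[]}';

const evaluator = new FormatComplianceEvaluator();
const result = evaluator.evaluate(enhanced); // assumed to return { score, passed }

console.log(`${result.passed ? "PASS" : "FAIL"} (score: ${result.score})`);
```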
CI/CD Integration
The framework is designed for easy integration into continuous integration pipelines, enabling automatic quality monitoring and regression detection.
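For example, a small gate script could run a suite and fail the build when scores regress (a sketch; the `runSuite` helper, report shape, and exit-code convention are assumptions):

```typescript
// ci-gate.ts -- hypothetical CI entry point.
// Runs the evaluation suite and exits non-zero on regression,
// which fails the pipeline step.
import { runSuite } from "../src/evals/runner"; // assumed helper

const MIN_OVERALL_SCORE = 0.8; // illustrative threshold

async function main(): Promise<void> {
  const report = await runSuite("user-story-enhancement");
  console.log(`Overall score: ${report.overallScore.toFixed(2)}`);
  if (report.overallScore < MIN_OVERALL_SCORE) {
    console.error("Quality regression detected -- failing the build.");
    process.exit(1);
  }
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});
```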
Quality Thresholds
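As an illustration, thresholds might be captured in a small config consumed by a gate like the one above (all names and values here are hypothetical, not the framework's shipped defaults):

```typescript
// Illustrative per-evaluator minimums; tune to the project's real baselines.
const qualityThresholds = {
  formatCompliance: 0.95,
  userStoryStructure: 0.9,
  acceptanceCriteriaQuality: 0.8,
  singleRequirement: 0.85,
  overall: 0.8,
} as const;
```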
Documentation
Comprehensive documentation is available in `docs/EVALUATIONS.md`.

Testing
The evaluation framework includes extensive test coverage.
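For instance, an evaluator test might look like this (a sketch assuming Jest-style assertions; the import path and actual test files may differ):

```typescript
import { describe, expect, it } from "@jest/globals";
// Import path is an assumption.
import { UserStoryStructureEvaluator } from "../src/evals/userStoryStructure";

describe("UserStoryStructureEvaluator", () => {
  it('passes a well-formed "As a... I want... So that..." story', () => {
    const evaluator = new UserStoryStructureEvaluator();
    const result = evaluator.evaluate(
      "As a commuter, I want live bus times, so that I can plan my trip."
    );
    expect(result.passed).toBe(true);
  });
});
```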
All tests pass and the framework is ready for production use.
Fixes #4.