Copilot AI commented Jun 2, 2025

This PR implements a comprehensive evaluation (evals) framework for systematically measuring and monitoring the quality of AI-enhanced user stories, giving the project an objective, repeatable way to assess the LLM's output in this domain.

What's Added

🔍 Core Evaluation Framework

  • 4 Specialized Evaluators that assess different quality dimensions:
    • FormatComplianceEvaluator: Validates JSON structure and required fields
    • UserStoryStructureEvaluator: Ensures proper "As a... I want... So that..." format
    • AcceptanceCriteriaQualityEvaluator: Assesses SMART criteria and testability
    • SingleRequirementEvaluator: Validates adherence to single-requirement principle
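
To make the evaluator list above concrete, here is a minimal sketch of running two of them directly against a generated story. The class names come from this PR; the evaluate method name, its argument shape, and the { score, passed } result are assumptions about the interface, not the final API.

import { FormatComplianceEvaluator, UserStoryStructureEvaluator } from './src/evals';

// Hypothetical enhanced-story payload; the field names are illustrative only.
const story = {
  userStory: 'As a registered user, I want to reset my password, so that I can regain access to my account.',
  acceptanceCriteria: ['A reset link is emailed within one minute of the request.'],
};

for (const evaluator of [new FormatComplianceEvaluator(), new UserStoryStructureEvaluator()]) {
  // Assumed signature: evaluate(story) resolves to { score: number; passed: boolean }.
  const result = await evaluator.evaluate(story);
  console.log(`${evaluator.constructor.name}: ${(result.score * 100).toFixed(0)}% (${result.passed ? 'pass' : 'fail'})`);
}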

📊 Evaluation Suites & Test Cases

  • Pre-built evaluation suites for different use cases:
    • Basic Suite: Fast compliance checking (Format + Structure)
    • Standard Suite: Comprehensive quality assessment (All evaluators)
  • 7 diverse sample cases covering various scenarios (login, e-commerce, mobile, etc.)
  • Extensible framework for adding custom evaluators and test cases
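
As a sketch of that extensibility point, a custom evaluator and suite might be assembled along these lines. The suite and case field names here are assumptions mirroring the built-in suites; the real types live in src/evals.

import { EvalRunner, FormatComplianceEvaluator } from './src/evals';

// Hypothetical custom evaluator following the same 0-1 score / pass contract as the built-ins.
const maxLengthEvaluator = {
  name: 'MaxLengthEvaluator',
  async evaluate(output: { userStory: string }) {
    const passed = output.userStory.length <= 300;
    return { score: passed ? 1 : 0, passed, details: `length=${output.userStory.length}` };
  },
};

// Hypothetical suite shape; the real EvalSuite type defines the actual field names.
const customSuite = {
  name: 'custom-basic-plus-length',
  cases: [
    {
      id: 'reporting-dashboard',
      input: 'Managers need a dashboard that shows weekly sales totals per region.',
    },
  ],
  evaluators: [new FormatComplianceEvaluator(), maxLengthEvaluator],
};

const runner = new EvalRunner(process.env.OPENAI_API_KEY);
const report = await runner.runEvalSuite(customSuite); // assumed to accept any suite with this shape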

🛠 Developer Tools & Utilities

  • EvalRunner: Orchestrates evaluation execution across test cases
  • Report utilities: Formatting, CSV export, failure analysis, trend tracking
  • Example script: Complete workflow demonstration (examples/runEvals.ts)
  • Shell script: Easy command-line usage (scripts/run-evals.sh)
  • NPM scripts: npm run eval for quick evaluation runs
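
A rough sketch of how the runner and report utilities might be combined in a script; the EvalReportUtils method names (formatReport, toCsv) and the createBasicEvalSuite factory are placeholders for whatever the framework actually exports.

import { writeFileSync } from 'node:fs';
import { EvalRunner, createBasicEvalSuite, EvalReportUtils } from './src/evals'; // createBasicEvalSuite is assumed to exist alongside the standard factory

const runner = new EvalRunner(process.env.OPENAI_API_KEY);
const report = await runner.runEvalSuite(createBasicEvalSuite());

console.log(EvalReportUtils.formatReport(report));          // placeholder method name
writeFileSync('eval-report.csv', EvalReportUtils.toCsv(report)); // placeholder method name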

📈 Quality Metrics & Scoring

  • Standardized 0-1 scoring scale across all evaluators
  • Configurable pass/fail thresholds for quality gates
  • Detailed reporting with per-case and per-evaluator breakdowns
  • Built-in analytics for identifying patterns and areas for improvement
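
As one illustration of the per-evaluator breakdown, a report could be mined for its weakest quality dimension roughly like this; the flat results array and its field names are assumptions about the report shape, not the actual type.

// Assumed report shape: one entry per (case, evaluator) pair.
type EvalResult = { caseId: string; evaluator: string; score: number; passed: boolean };

function weakestEvaluator(results: EvalResult[]): string {
  const totals = new Map<string, { sum: number; n: number }>();
  for (const r of results) {
    const t = totals.get(r.evaluator) ?? { sum: 0, n: 0 };
    totals.set(r.evaluator, { sum: t.sum + r.score, n: t.n + 1 });
  }
  let worst = '';
  let worstAvg = Infinity;
  for (const [name, { sum, n }] of totals) {
    const avg = sum / n;
    if (avg < worstAvg) { worstAvg = avg; worst = name; }
  }
  return `${worst} (average ${(worstAvg * 100).toFixed(1)}%)`;
}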

Usage Examples

Quick Start

export OPENAI_API_KEY=your_api_key_here
npm run eval

Programmatic Usage

import { EvalRunner, createStandardEvalSuite, EvalReportUtils } from './src/evals';

const runner = new EvalRunner(process.env.OPENAI_API_KEY);
const suite = createStandardEvalSuite();
const report = await runner.runEvalSuite(suite);

console.log(`Overall Score: ${(report.summary.overallScore * 100).toFixed(1)}%`);
console.log(`Pass Rate: ${(report.summary.passRate * 100).toFixed(1)}%`);

CI/CD Integration

The framework is designed for easy integration into continuous integration pipelines, enabling automatic quality monitoring and regression detection; a minimal quality-gate sketch follows the thresholds below.

Quality Thresholds

  • Excellent: Overall Score ≥ 80% AND Pass Rate ≥ 80%
  • Good: Overall Score ≥ 60% AND Pass Rate ≥ 60%
  • Needs Improvement: Below good thresholds
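
These thresholds translate directly into a CI quality gate. A minimal sketch, using only the report.summary fields shown in the programmatic example above (everything else is standard Node):

import { EvalRunner, createStandardEvalSuite } from './src/evals';

const runner = new EvalRunner(process.env.OPENAI_API_KEY);
const { summary } = await runner.runEvalSuite(createStandardEvalSuite());

// Mirror the documented thresholds: Excellent >= 80%/80%, Good >= 60%/60%.
const grade =
  summary.overallScore >= 0.8 && summary.passRate >= 0.8 ? 'Excellent' :
  summary.overallScore >= 0.6 && summary.passRate >= 0.6 ? 'Good' :
  'Needs Improvement';

console.log(`Quality: ${grade}`);
if (grade === 'Needs Improvement') {
  process.exit(1); // fail the pipeline so regressions are caught automatically
}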

Documentation

Comprehensive documentation is available in docs/EVALUATIONS.md, covering:

  • Detailed evaluator specifications and scoring criteria
  • Integration examples for CI/CD and monitoring
  • Best practices for custom evaluator development
  • Troubleshooting guide

Testing

The evaluation framework includes extensive test coverage:

  • Unit tests for each evaluator
  • Integration tests for the complete workflow
  • Sample case validation
  • Report utility testing
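
For example, a unit test for a single evaluator might look roughly like this; the test runner shown and the evaluate call are assumptions, and the actual tests live in the repository.

import { describe, it, expect } from 'vitest'; // or the project's test runner of choice
import { UserStoryStructureEvaluator } from './src/evals';

describe('UserStoryStructureEvaluator', () => {
  it('accepts a well-formed "As a... I want... So that..." story', async () => {
    const evaluator = new UserStoryStructureEvaluator();
    // Assumed signature and input shape, matching the sketches earlier in this description.
    const result = await evaluator.evaluate({
      userStory: 'As a shopper, I want to save items to a wishlist, so that I can purchase them later.',
      acceptanceCriteria: ['Saved items persist across sessions.'],
    });
    expect(result.passed).toBe(true);
  });
});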

All tests pass and the framework is ready for production use.

Fixes #4.



Copilot AI changed the title from "[WIP] Add evals to projects" to "Add comprehensive evaluation framework for measuring user story enhancement quality" on Jun 2, 2025
Copilot AI requested a review from nsmaassel June 2, 2025 06:45