Copilot AI commented Jun 2, 2025

This PR implements a comprehensive evaluation (evals) framework for systematically measuring and monitoring the quality of AI-enhanced user stories, giving the project an objective, repeatable way to assess the LLM's output in this domain.

What's Added

🔍 Core Evaluation Framework

  • 4 Specialized Evaluators that assess different quality dimensions:
    • FormatComplianceEvaluator: Validates JSON structure and required fields
    • UserStoryStructureEvaluator: Ensures proper "As a... I want... So that..." format
    • AcceptanceCriteriaQualityEvaluator: Assesses SMART criteria and testability
    • SingleRequirementEvaluator: Validates adherence to single-requirement principle
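
To make the evaluator list above concrete, here is a minimal sketch of running two of them directly against a generated story. The class names come from this PR; the evaluate method name, its argument shape, and the { score, passed } result are assumptions about the interface, not the final API.

import { FormatComplianceEvaluator, UserStoryStructureEvaluator } from './src/evals';

// Hypothetical enhanced-story payload; the field names are illustrative only.
const story = {
  userStory: 'As a registered user, I want to reset my password, so that I can regain access to my account.',
  acceptanceCriteria: ['A reset link is emailed within one minute of the request.'],
};

for (const evaluator of [new FormatComplianceEvaluator(), new UserStoryStructureEvaluator()]) {
  // Assumed signature: evaluate(story) resolves to { score: number; passed: boolean }.
  const result = await evaluator.evaluate(story);
  console.log(`${evaluator.constructor.name}: ${(result.score * 100).toFixed(0)}% (${result.passed ? 'pass' : 'fail'})`);
}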

📊 Evaluation Suites & Test Cases

  • Pre-built evaluation suites for different use cases:
    • Basic Suite: Fast compliance checking (Format + Structure)
    • Standard Suite: Comprehensive quality assessment (All evaluators)
  • 7 diverse sample cases covering various scenarios (login, e-commerce, mobile, etc.)
  • Extensible framework for adding custom evaluators and test cases
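
As a sketch of that extensibility point, a custom evaluator and suite might be assembled along these lines. The suite and case field names here are assumptions mirroring the built-in suites; the real types live in src/evals.

import { EvalRunner, FormatComplianceEvaluator } from './src/evals';

// Hypothetical custom evaluator following the same 0-1 score / pass contract as the built-ins.
const maxLengthEvaluator = {
  name: 'MaxLengthEvaluator',
  async evaluate(output: { userStory: string }) {
    const passed = output.userStory.length <= 300;
    return { score: passed ? 1 : 0, passed, details: `length=${output.userStory.length}` };
  },
};

// Hypothetical suite shape; the real EvalSuite type defines the actual field names.
const customSuite = {
  name: 'custom-basic-plus-length',
  cases: [
    {
      id: 'reporting-dashboard',
      input: 'Managers need a dashboard that shows weekly sales totals per region.',
    },
  ],
  evaluators: [new FormatComplianceEvaluator(), maxLengthEvaluator],
};

const runner = new EvalRunner(process.env.OPENAI_API_KEY);
const report = await runner.runEvalSuite(customSuite); // assumed to accept any suite with this shape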

🛠 Developer Tools & Utilities

  • EvalRunner: Orchestrates evaluation execution across test cases
  • Report utilities: Formatting, CSV export, failure analysis, trend tracking
  • Example script: Complete workflow demonstration (examples/runEvals.ts)
  • Shell script: Easy command-line usage (scripts/run-evals.sh)
  • NPM scripts: npm run eval for quick evaluation runs
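
A rough sketch of how the runner and report utilities might be combined in a script; the EvalReportUtils method names (formatReport, toCsv) and the createBasicEvalSuite factory are placeholders for whatever the framework actually exports.

import { writeFileSync } from 'node:fs';
import { EvalRunner, createBasicEvalSuite, EvalReportUtils } from './src/evals'; // createBasicEvalSuite is assumed to exist alongside the standard factory

const runner = new EvalRunner(process.env.OPENAI_API_KEY);
const report = await runner.runEvalSuite(createBasicEvalSuite());

console.log(EvalReportUtils.formatReport(report));          // placeholder method name
writeFileSync('eval-report.csv', EvalReportUtils.toCsv(report)); // placeholder method name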

📈 Quality Metrics & Scoring

  • Standardized 0-1 scoring scale across all evaluators
  • Configurable pass/fail thresholds for quality gates
  • Detailed reporting with per-case and per-evaluator breakdowns
  • Built-in analytics for identifying patterns and areas for improvement
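
As one illustration of the per-evaluator breakdown, a report could be mined for its weakest quality dimension roughly like this; the flat results array and its field names are assumptions about the report shape, not the actual type.

// Assumed report shape: one entry per (case, evaluator) pair.
type EvalResult = { caseId: string; evaluator: string; score: number; passed: boolean };

function weakestEvaluator(results: EvalResult[]): string {
  const totals = new Map<string, { sum: number; n: number }>();
  for (const r of results) {
    const t = totals.get(r.evaluator) ?? { sum: 0, n: 0 };
    totals.set(r.evaluator, { sum: t.sum + r.score, n: t.n + 1 });
  }
  let worst = '';
  let worstAvg = Infinity;
  for (const [name, { sum, n }] of totals) {
    const avg = sum / n;
    if (avg < worstAvg) { worstAvg = avg; worst = name; }
  }
  return `${worst} (average ${(worstAvg * 100).toFixed(1)}%)`;
}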

Usage Examples

Quick Start

export OPENAI_API_KEY=your_api_key_here
npm run eval

Programmatic Usage

import { EvalRunner, createStandardEvalSuite, EvalReportUtils } from './src/evals';

const runner = new EvalRunner(process.env.OPENAI_API_KEY);
const suite = createStandardEvalSuite();
const report = await runner.runEvalSuite(suite);

console.log(`Overall Score: ${(report.summary.overallScore * 100).toFixed(1)}%`);
console.log(`Pass Rate: ${(report.summary.passRate * 100).toFixed(1)}%`);

CI/CD Integration

The framework is designed for easy integration into continuous integration pipelines, enabling automatic quality monitoring and regression detection; a minimal quality-gate sketch follows the thresholds below.

Quality Thresholds

  • Excellent: Overall Score ≥ 80% AND Pass Rate ≥ 80%
  • Good: Overall Score ≥ 60% AND Pass Rate ≥ 60%
  • Needs Improvement: Below good thresholds
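
These thresholds translate directly into a CI quality gate. A minimal sketch, using only the report.summary fields shown in the programmatic example above (everything else is standard Node):

import { EvalRunner, createStandardEvalSuite } from './src/evals';

const runner = new EvalRunner(process.env.OPENAI_API_KEY);
const { summary } = await runner.runEvalSuite(createStandardEvalSuite());

// Mirror the documented thresholds: Excellent >= 80%/80%, Good >= 60%/60%.
const grade =
  summary.overallScore >= 0.8 && summary.passRate >= 0.8 ? 'Excellent' :
  summary.overallScore >= 0.6 && summary.passRate >= 0.6 ? 'Good' :
  'Needs Improvement';

console.log(`Quality: ${grade}`);
if (grade === 'Needs Improvement') {
  process.exit(1); // fail the pipeline so regressions are caught automatically
}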

Documentation

Comprehensive documentation is available in docs/EVALUATIONS.md, covering:

  • Detailed evaluator specifications and scoring criteria
  • Integration examples for CI/CD and monitoring
  • Best practices for custom evaluator development
  • Troubleshooting guide

Testing

The evaluation framework includes extensive test coverage:

  • Unit tests for each evaluator
  • Integration tests for the complete workflow
  • Sample case validation
  • Report utility testing
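
For example, a unit test for a single evaluator might look roughly like this; the test runner shown and the evaluate call are assumptions, and the actual tests live in the repository.

import { describe, it, expect } from 'vitest'; // or the project's test runner of choice
import { UserStoryStructureEvaluator } from './src/evals';

describe('UserStoryStructureEvaluator', () => {
  it('accepts a well-formed "As a... I want... So that..." story', async () => {
    const evaluator = new UserStoryStructureEvaluator();
    // Assumed signature and input shape, matching the sketches earlier in this description.
    const result = await evaluator.evaluate({
      userStory: 'As a shopper, I want to save items to a wishlist, so that I can purchase them later.',
      acceptanceCriteria: ['Saved items persist across sessions.'],
    });
    expect(result.passed).toBe(true);
  });
});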

All tests pass and the framework is ready for production use.

Fixes #4.



Copilot AI changed the title from "[WIP] Add evals to projects" to "Add comprehensive evaluation framework for measuring user story enhancement quality" on Jun 2, 2025
Copilot AI requested a review from nsmaassel June 2, 2025 06:45