doublemover/PairOfCleats

PairOfCleats

Give your coding agents a pair of cleats, so they can sprint through your codebase.

What is PairOfCleats?

PairOfCleats builds a hybrid semantic index for a repo (code + docs) and exposes a CLI/MCP server for fast, filterable search. It is designed for agent workflows, with artifacts stored outside the repo by default so they can be shared across runs, containers, and CI while keeping working trees clean.

The index captures rich structure and metadata: language-aware chunking across code, configs, and docs; docstrings/signatures/annotations; call/import/usage relations; control-flow and dataflow summaries; type inference (intra-file with optional cross-file); git-aware churn metadata; and embeddings for semantic search. Search combines BM25 token/phrase scoring, MinHash similarity, dense vectors, and optional SQLite backends (including FTS5 and ANN via sqlite-vec) with filters and human/JSON output. The tooling also includes incremental indexing, cache management, dictionary bootstrapping, CI artifact restore/build, optional language tooling detection/installation, and triage workflows for ingesting vulnerability records plus generating context packs.
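
The sparse side of the ranking described above can be illustrated with a textbook BM25 scorer. This is a minimal sketch, not PairOfCleats' implementation: bm25Scores is a hypothetical helper, documents are assumed to be pre-tokenized arrays, and k1 = 1.2 and b = 0.75 are the conventional defaults rather than values taken from the project.

```javascript
// Textbook BM25 over pre-tokenized documents (arrays of strings).
// k1 and b are conventional defaults, assumed here for illustration.
function bm25Scores(docs, queryTokens, k1 = 1.2, b = 0.75) {
  const N = docs.length;
  const avgLen = docs.reduce((sum, d) => sum + d.length, 0) / N;

  // Document frequency: how many documents contain each query term.
  const df = new Map();
  for (const term of new Set(queryTokens)) {
    df.set(term, docs.filter((d) => d.includes(term)).length);
  }

  return docs.map((doc) => {
    let score = 0;
    for (const [term, n] of df) {
      const tf = doc.filter((t) => t === term).length;
      if (tf === 0) continue;
      // Smoothed IDF that never goes negative.
      const idf = Math.log(1 + (N - n + 0.5) / (n + 0.5));
      // Length normalization: longer-than-average documents are penalized.
      const norm = tf + k1 * (1 - b + (b * doc.length) / avgLen);
      score += idf * ((tf * (k1 + 1)) / norm);
    }
    return score;
  });
}
```

With equal term frequency, a shorter chunk scores higher than a longer one, which is the length normalization that b controls.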

Status

Active development. Current execution status lives in COMPLETE_PLAN.md; ROADMAP.md is historical.

Requirements

  • Node.js 18+
  • Optional: Python 3 for AST-based metadata on .py files (falls back to heuristics; worker pool via indexing.pythonAst.*)
  • Optional: SQLite backend (via better-sqlite3)
  • Optional: SQLite vector extension (sqlite-vec) for ANN acceleration

Quick start

  • npm run setup
    • Guided prompts for install, dictionaries, models, extensions, tooling, and indexes.
    • Add --non-interactive for CI or automated runs.
    • Add --with-sqlite to build SQLite indexes.
    • Add --incremental to reuse per-file cache bundles.
  • npm run bootstrap (fast, no prompts)
    • Add --with-sqlite to build SQLite indexes.
    • Add --incremental to reuse per-file cache bundles.
  • npm run watch-index (FS events by default; add --watch-poll to enable polling)
  • npm run api-server (local HTTP JSON API for status/search)
  • npm run indexer-service (multi-repo sync + queue; see docs/service-mode.md)
  • Cache is outside the repo by default; set cache.root in .pairofcleats.json to override.
  • CLI commands auto-detect repo roots; use --repo <path> to override.
  • Local CLI entrypoint: node bin/pairofcleats.js <command> (mirrors npm run scripts).

Index features

  • Languages: JavaScript/TypeScript, Python, Swift, Rust, C/C++/ObjC, Go, Java, C#, Kotlin, Ruby, PHP, Lua, SQL (dialects), Perl, Shell
  • LSP enrichment (clangd/sourcekit-lsp) is best-effort; clangd uses compile_commands.json when available and can be required via tooling.clangd.requireCompilationDatabase
  • Config formats: JSON, TOML, INI/CFG/CONF, XML, YAML, Dockerfile, Makefile, GitHub Actions YAML
  • Docs: Markdown, RST, AsciiDoc
  • Chunking:
    • Code declarations (functions, classes, methods, types)
    • Config sections (keys/blocks)
    • Doc headings/sections
  • Ignore files: .pairofcleatsignore (gitignore-style) and .gitignore
  • Large file guardrails: indexing.maxFileBytes (default 5 MB; set to 0 to disable)
  • Metadata per chunk:
    • docstrings, signatures, params, decorators/annotations
    • modifiers + visibility + inheritance
    • code relations (calls/imports/exports/usages)
    • interprocedural call summaries (args + return hints)
    • dataflow (reads/writes/mutations/aliases) + control-flow summaries
    • risk signals (sources/sinks/flows + tags, with cross-file call correlation)
    • type inference (intra-file, optional cross-file)
    • git metadata (last author/date, churn = added+deleted lines), JS complexity/lint, headline + neighbor context
  • Triage records (findings + decisions) indexed outside the repo
  • Index artifacts:
    • token postings (always)
    • phrase/chargram postings (configurable via indexing.postings.*)
    • MinHash signatures
    • dense vectors (merged + doc/code variants; MiniLM)
    • repo map (symbols + signatures + file paths)
    • incremental per-file cache bundles
    • optional ctags ingest: npm run ctags-ingest (see docs/ctags.md)
    • optional SCIP ingest: npm run scip-ingest (see docs/scip.md)
    • optional LSIF ingest: npm run lsif-ingest (see docs/lsif.md)
    • optional GNU Global ingest: npm run gtags-ingest (see docs/gtags.md)
  • Symbol source precedence: docs/symbol-sources.md
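
The churn metric listed above (added + deleted lines from git numstat) can be sketched as follows. This is an illustration under stated assumptions, not the project's code: churnByFile is a hypothetical helper, and the input is assumed to be the text produced by a command such as git log --numstat --format=, where each change line is added, deleted, and path separated by tabs.

```javascript
// Sum added + deleted lines per file from `git log --numstat` style text.
function churnByFile(numstatText) {
  const churn = new Map();
  for (const line of numstatText.split("\n")) {
    const parts = line.split("\t");
    if (parts.length < 3) continue; // blank lines or commit headers
    const [added, deleted] = parts;
    const file = parts.slice(2).join("\t");
    // Git reports binary files as "-\t-\tpath"; count them as zero churn.
    const a = added === "-" ? 0 : Number.parseInt(added, 10);
    const d = deleted === "-" ? 0 : Number.parseInt(deleted, 10);
    if (Number.isNaN(a) || Number.isNaN(d)) continue;
    churn.set(file, (churn.get(file) || 0) + a + d);
  }
  return churn;
}
```

A --churn [min] style filter would then keep only files whose total meets the minimum.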

Search features

  • BM25 token/phrase search + n-grams/chargrams
  • MinHash similarity fallback
  • Dense vectors (optional, ANN-aware when enabled)
  • Query syntax: -term excludes tokens, "exact phrase" boosts phrase matches, -"phrase" excludes phrases
  • File/path regex and substring filters use a chargram prefilter before exact matching.
  • Symbol-aware ranking boosts for declarations/exports (configurable via search.symbolBoost.*, default def=1.2, export=1.1).
  • Modes: code, prose, both, records, all
  • Backends:
    • memory (file-backed JSON)
    • sqlite (same scoring, shared artifacts)
    • sqlite-fts (SQLite-only FTS5 scoring)
  • Structural search CLI for rule packs (Semgrep/ast-grep/Comby): docs/structural-search.md
  • Common filters (ext/kind/author/visibility) use precomputed indexes for speed.
  • Filters (high-signal subset):
    • --type, --signature, --param, --decorator, --inferred-type, --return-type
    • --throws, --reads, --writes, --mutates, --awaits
    • --alias
    • --risk, --risk-tag, --risk-source, --risk-sink, --risk-category, --risk-flow
    • --branches, --loops, --breaks, --continues
    • --async, --generator, --returns
    • --author, --chunk-author, --modified-after, --modified-since, --churn [min] (git numstat added+deleted), --lint, --calls, --import, --uses, --extends
    • --path/--file (substring or /regex/), --ext, --lang, --branch
    • --case, --case-file, --case-tokens (case-sensitive matching)
    • --meta, --meta-json (records metadata filters)
  • Output:
    • human-readable (color), --json (full), or --json-compact (lean tooling payload)
    • full JSON includes score (selected), scoreType, sparseScore, annScore, and scoreBreakdown (sparse/ann/phrase/symbol/selected)
    • --explain / --why prints a score breakdown in human output (selected/sparse/ANN/phrase)
  • Optional query cache (search.queryCache.* in .pairofcleats.json)
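
The query syntax above (-term, "exact phrase", -"phrase") can be parsed with a single regular expression. The sketch below is an assumption about one reasonable way to do it; parseQuery and the shape of its result are hypothetical and not the project's API.

```javascript
// Split a query into plain terms, phrases, and their excluded counterparts.
function parseQuery(query) {
  const parsed = { terms: [], phrases: [], excludedTerms: [], excludedPhrases: [] };
  // An optional leading "-", then either a quoted phrase or a bare token.
  const pattern = /(-?)(?:"([^"]*)"|(\S+))/g;
  for (const [, minus, phrase, term] of query.matchAll(pattern)) {
    if (phrase !== undefined) {
      if (phrase.trim() === "") continue; // ignore empty quotes
      (minus ? parsed.excludedPhrases : parsed.phrases).push(phrase);
    } else if (term !== undefined) {
      (minus ? parsed.excludedTerms : parsed.terms).push(term);
    }
  }
  return parsed;
}
```

For example, parseQuery('auth -legacy "token refresh" -"basic auth"') yields one term, one excluded term, one phrase, and one excluded phrase. A hyphen inside a token (foo-bar) is kept as part of the term because only a leading hyphen is treated as an exclusion.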

Triage records + context packs

  • Ingest findings into cache-backed records:
    • node tools/triage/ingest.js --source dependabot --in dependabot.json --meta service=api --meta env=prod
    • node tools/triage/ingest.js --source aws_inspector --in inspector.json --meta service=api --meta env=prod
    • node tools/triage/ingest.js --source generic --in record.json --meta service=api --meta env=prod
  • Build the records index: node build_index.js --mode records --incremental
  • Search records with metadata filters:
    • node search.js "CVE-2024-0001" --mode records --meta service=api --meta env=prod --json
  • Create decision records:
    • node tools/triage/decision.js --finding <recordId> --status accept --justification "..."
  • Generate a context pack:
    • node tools/triage/context-pack.js --record <recordId> --out context.json
  • Docs: docs/triage-records.md
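
Repeated --meta key=value flags, as used in the ingest and search commands above, can be modeled as an exact-match filter. This is a sketch only: parseMetaFilters and matchesMeta are hypothetical names, and exact string equality on every key is an assumption about the matching semantics (docs/triage-records.md is the authority).

```javascript
// Turn ["service=api", "env=prod"] into { service: "api", env: "prod" }.
function parseMetaFilters(pairs) {
  const filters = {};
  for (const pair of pairs) {
    const eq = pair.indexOf("=");
    if (eq <= 0) throw new Error(`Expected key=value, got "${pair}"`);
    // Split on the first "=" only, so values may contain "=".
    filters[pair.slice(0, eq)] = pair.slice(eq + 1);
  }
  return filters;
}

// A record matches when every filter key is present with an equal value.
function matchesMeta(recordMeta, filters) {
  return Object.entries(filters).every(
    ([key, value]) =>
      Object.prototype.hasOwnProperty.call(recordMeta, key) &&
      String(recordMeta[key]) === value
  );
}
```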

Dictionaries

  • Default English wordlist: npm run download-dicts -- --lang en (setup/bootstrap runs this)
  • Cache dir: <cache>/dictionaries (override with dictionary.dir or PAIROFCLEATS_DICT_DIR)
  • Update dictionaries with ETag/Last-Modified: npm run download-dicts -- --update
  • Add custom lists: npm run download-dicts -- --url mylist=https://example.com/words.txt
  • Slang support: drop .txt files into the slang/ folder in the dictionary cache
  • Repo-specific dictionary (opt-in):
    • npm run generate-repo-dict -- --min-count 3
    • enable via { "dictionary": { "enableRepoDictionary": true } }
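
One plausible reading of --min-count 3 is that a word enters the repo-specific dictionary only if it occurs at least three times across the repo. The sketch below illustrates that idea; buildRepoDictionary is a hypothetical helper, and both the tokenization (lowercased runs of two or more ASCII letters) and the counting rule are assumptions rather than the behavior of generate-repo-dict.

```javascript
// Collect words that appear at least minCount times across all texts.
function buildRepoDictionary(texts, minCount = 3) {
  const counts = new Map();
  for (const text of texts) {
    // Assumed tokenization: lowercase runs of 2+ ASCII letters.
    for (const word of text.toLowerCase().match(/[a-z]{2,}/g) || []) {
      counts.set(word, (counts.get(word) || 0) + 1);
    }
  }
  return [...counts]
    .filter(([, n]) => n >= minCount)
    .map(([word]) => word)
    .sort();
}
```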

Model cache

  • Models live under <cache>/models by default
  • Download: npm run download-models
  • Override in .pairofcleats.json:
    { "models": { "id": "Xenova/all-MiniLM-L12-v2", "dir": "C:/cache/pairofcleats/models" } }
  • Env overrides: PAIROFCLEATS_MODELS_DIR, PAIROFCLEATS_MODEL

SQLite backend

  • Build: npm run build-sqlite-index
  • Uses split DBs (index-code.db + index-prose.db) for concurrency
  • search.js auto-uses SQLite when sqlite.use is not disabled and the DBs exist. Set search.sqliteAutoChunkThreshold above its default of 0 to keep small repos on file-backed indexes.
  • FTS5 scoring (optional): set sqlite.scoreMode to fts
  • ANN extension (optional): set sqlite.annMode = "extension" and install sqlite-vec
    • ANN is on by default when search.annDefault is true; use --no-ann or set search.annDefault: false to disable
    • Install: npm run download-extensions
    • Verify: npm run verify-extensions

Installation

  • Guided setup: npm run setup (prompts)
  • CI/automation: npm run setup -- --non-interactive --json (summary JSON on stdout)
  • Manual steps:
    • Install dependencies: npm install
    • Optional extras:
      • Dictionaries: npm run download-dicts -- --lang en
      • Models: npm run download-models
      • SQLite ANN extension: npm run download-extensions
      • Verify extension: npm run verify-extensions
      • Detect tooling: npm run tooling-detect
      • Install tooling: npm run tooling-install -- --scope cache
      • Tooling targets: tsserver, typescript-language-server, clangd, sourcekit-lsp, rust-analyzer, gopls, jdtls, kotlin-language-server, kotlin-lsp, omnisharp, csharp-ls, ruby-lsp, solargraph, phpactor, intelephense, lua-language-server, bash-language-server, sqls
      • Git hooks: npm run git-hooks -- --install
      • Validate config: npm run config-validate -- --config .pairofcleats.json
    • Build indexes:
      • File-backed + SQLite (default): node build_index.js (add --incremental if desired; add --no-sqlite to skip SQLite)
      • SQLite only: npm run build-sqlite-index
      • Validate: npm run index-validate

API server

Run: npm run api-server or node bin/pairofcleats.js server

Endpoints:

  • GET /health
  • GET /status?repo=<path>
  • POST /search (JSON payload mirrors CLI filters)
  • GET /status/stream (SSE)
  • POST /search/stream (SSE)
  • Docs: docs/api-server.md
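
The two stream endpoints use Server-Sent Events, whose wire format is standardized: events are separated by a blank line, data: lines carry the payload, event: names the event, and lines starting with a colon are comments. The sketch below parses a complete SSE text; parseSseEvents is a hypothetical helper, and the event names and payloads the server actually emits are documented in docs/api-server.md, not assumed here.

```javascript
// Parse a complete Server-Sent Events text into { event, data } objects.
function parseSseEvents(text) {
  const events = [];
  for (const block of text.split(/\r?\n\r?\n/)) {
    let event = "message"; // default event type per the SSE format
    const data = [];
    for (const line of block.split(/\r?\n/)) {
      if (line === "" || line.startsWith(":")) continue; // blank or comment
      const colon = line.indexOf(":");
      const field = colon === -1 ? line : line.slice(0, colon);
      let value = colon === -1 ? "" : line.slice(colon + 1);
      if (value.startsWith(" ")) value = value.slice(1); // strip one leading space
      if (field === "event") event = value;
      else if (field === "data") data.push(value);
    }
    // Multiple data lines in one event are joined with newlines.
    if (data.length > 0) events.push({ event, data: data.join("\n") });
  }
  return events;
}
```

A real client would feed network chunks into a buffer and parse only up to the last blank line, since an event can be split across chunks.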

Editor integration

  • VS Code extension (CLI shell-out) under extensions/vscode
  • Command: PairOfCleats: Search
  • Uses pairofcleats search --json-compact with file/line hints
  • Docs: docs/editor-integration.md

MCP server

Run: npm run mcp-server

Tools:

  • index_status
  • config_status
  • build_index
  • search
  • triage_ingest
  • triage_decision
  • triage_context_pack
  • download_models
  • download_dictionaries
  • download_extensions
  • verify_extensions
  • build_sqlite_index
  • compact_sqlite_index
  • cache_gc
  • clean_artifacts
  • bootstrap
  • report_artifacts
  • search defaults to compact JSON payloads (set output: "full" for full JSON).
  • Progress: long-running tools emit notifications/progress with { id, tool, message, stream, phase }.
  • Errors: tools/call responses set isError=true and return a JSON payload with message plus optional code, stdout, stderr, hint.
  • Docs: docs/mcp-server.md
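
A client can unpack the error shape described above roughly as follows. This is a sketch, not the server's contract: describeToolResult is a hypothetical helper, and it assumes the JSON payload arrives as the first text item in the result's content array, which is the usual MCP result layout. The code and hint values in any sample input are made up.

```javascript
// Summarize an MCP tools/call result, unpacking the JSON error payload.
function describeToolResult(result) {
  if (!result.isError) return { ok: true };
  // Assumption: the JSON payload is the first text item in `content`.
  const text = (result.content || []).find((c) => c.type === "text")?.text ?? "";
  let payload;
  try {
    payload = JSON.parse(text);
  } catch {
    payload = null;
  }
  // Fall back to the raw text when the payload is not a JSON object.
  if (typeof payload !== "object" || payload === null) payload = { message: text };
  const { message, code, stdout, stderr, hint } = payload;
  return { ok: false, message, code, stdout, stderr, hint };
}
```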

Tests

All-in-one (runs everything it can):

  • npm run test-all
  • npm run test-all-no-bench (skips the benchmark run)
  • npm run test-all -- --skip-bench (same as above)

Core:

  • npm run verify
  • npm run fixture-smoke
  • npm run fixture-parity
  • npm run fixture-eval
  • npm run search-explain-test

Fidelity:

  • npm run language-fidelity-test
  • npm run format-fidelity-test
  • npm run type-inference-crossfile-test

SQLite + extensions:

  • npm run sqlite-incremental-test
  • npm run sqlite-compact-test
  • npm run sqlite-ann-extension-test
  • npm run download-extensions-test

Tooling + caches:

  • npm run download-dicts-test
  • npm run setup-test
  • npm run tooling-detect-test
  • npm run tooling-install-test
  • npm run query-cache-test
  • npm run index-validate-test
  • npm run clean-artifacts-test
  • npm run uninstall-test
  • npm run cache-gc-test
  • npm run git-hooks-test

Triage:

  • npm run triage-test

Reports + MCP:

  • npm run repometrics-dashboard-test
  • npm run summary-report-test
  • npm run mcp-server-test
  • npm run api-server-test
  • npm run api-server-stream-test
  • npm run vscode-extension-test

Meta:

  • npm run script-coverage-test
  • npm run docs-consistency-test
  • npm run bench / npm run bench-ann / npm run bench-language

Maintenance

  • Report cache sizes: npm run report-artifacts (add -- --all for all repos)
  • Validate index artifacts: npm run index-validate
  • Cache GC (age/size): npm run cache-gc -- --max-gb 10 or --max-age-days 30
  • Clean repo artifacts: npm run clean-artifacts (add -- --all to clear repo caches; keeps models/dictionaries/extensions)
  • Uninstall caches + models + extensions: npm run uninstall
  • Compact SQLite indexes: npm run compact-sqlite-index
  • Dependency policy: versions are pinned in package.json; update via npm install and commit package-lock.json.
  • Repometrics dashboard: npm run repometrics-dashboard
  • Model comparison: npm run compare-models
  • Combined summary report: npm run summary-report (add -- --json for JSON output)
  • Tooling detect/install: npm run tooling-detect, npm run tooling-install
  • Git hooks (post-commit/post-merge): npm run git-hooks -- --install
  • CI artifacts: node tools/ci-build-artifacts.js --out ci-artifacts, node tools/ci-restore-artifacts.js --from ci-artifacts
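
The age and size limits accepted by cache-gc suggest an eviction policy along these lines. This is a sketch of one plausible policy, not the tool's implementation: selectForEviction, the entry fields (id, bytes, lastUsedMs), and the oldest-first order for the size limit are all assumptions.

```javascript
const GB = 1024 ** 3;
const DAY_MS = 24 * 60 * 60 * 1000;

// Return the ids of cache entries to delete under the given limits.
function selectForEviction(entries, { maxGb, maxAgeDays, now = Date.now() } = {}) {
  const evict = new Set();

  // Age limit: drop anything not used within maxAgeDays.
  if (maxAgeDays !== undefined) {
    for (const e of entries) {
      if (now - e.lastUsedMs > maxAgeDays * DAY_MS) evict.add(e.id);
    }
  }

  // Size limit: drop least recently used entries until the total fits.
  if (maxGb !== undefined) {
    const kept = entries
      .filter((e) => !evict.has(e.id))
      .sort((a, b) => a.lastUsedMs - b.lastUsedMs);
    let total = kept.reduce((sum, e) => sum + e.bytes, 0);
    for (const e of kept) {
      if (total <= maxGb * GB) break;
      evict.add(e.id);
      total -= e.bytes;
    }
  }
  return [...evict];
}
```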

Design docs

Cache layout

  • <cache>/repos/<repoId>/index-code
  • <cache>/repos/<repoId>/index-prose
  • <cache>/repos/<repoId>/index-records
  • <cache>/repos/<repoId>/incremental/<mode>
  • <cache>/repos/<repoId>/repometrics
  • <cache>/repos/<repoId>/triage/records
  • <cache>/repos/<repoId>/triage/context-packs
  • <cache>/repos/<repoId>/index-sqlite/index-code.db
  • <cache>/repos/<repoId>/index-sqlite/index-prose.db
  • <cache>/dictionaries
  • <cache>/models
  • <cache>/extensions
  • <cache>/tooling

Default cache root:

  • Windows: %LOCALAPPDATA%\PairOfCleats
  • Linux/macOS: $XDG_CACHE_HOME/pairofcleats or ~/.cache/pairofcleats
  • Override with cache.root, PAIROFCLEATS_CACHE_ROOT, or PAIROFCLEATS_HOME
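
The default cache root rules above can be expressed as a pure function. This is a sketch: resolveCacheRoot is a hypothetical helper, and the precedence among the three overrides (cache.root, then PAIROFCLEATS_CACHE_ROOT, then PAIROFCLEATS_HOME) is an assumption, since the README lists them without an order.

```javascript
// Resolve the cache root from explicit inputs so the logic is easy to test.
function resolveCacheRoot({ platform, env, homeDir, config = {} }) {
  // Assumed precedence: config file, then the two environment variables.
  const override =
    config.cache?.root || env.PAIROFCLEATS_CACHE_ROOT || env.PAIROFCLEATS_HOME;
  if (override) return override;

  if (platform === "win32") return `${env.LOCALAPPDATA}\\PairOfCleats`;
  if (env.XDG_CACHE_HOME) return `${env.XDG_CACHE_HOME}/pairofcleats`;
  return `${homeDir}/.cache/pairofcleats`;
}
```

In a real process the inputs would come from process.platform, process.env, and os.homedir().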
