Skip to content

bledden/arrwDB

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

arrwDB - Production Vector Database

Enterprise-grade vector database with real-time streaming, WebSocket support, webhooks, and comprehensive testing

Python FastAPI Cohere Tests Coverage


Overview

arrwDB is a production-grade vector database designed for semantic search, RAG (Retrieval-Augmented Generation), and AI applications. Built from the ground up with performance, scalability, and developer experience in mind.

Key Features

Core Infrastructure

Real-Time Streaming - NDJSON batch ingestion with sub-second performance WebSocket Support - Bidirectional real-time communication for live search Webhooks API - Event notifications with HMAC signatures and automatic retries Event Bus & CDC - Change data capture with pub/sub event system Background Job Queue - Async processing with configurable worker pools Multiple Index Types - BruteForce, KD-Tree, LSH, HNSW, IVF Rust Optimizations - Performance-critical indexes and persistence in Rust (5-10x faster) Vector Quantization - Scalar (4-bit/8-bit) and hybrid quantization (70% memory reduction) Hybrid Search - Combine semantic and keyword search with reranking Metadata Filtering - Advanced query filtering on document metadata Health Checks - Kubernetes-ready liveness and readiness probes Docker Support - Development and production Docker Compose configurations Comprehensive Testing - 156+ tests with 95-100% coverage on core infrastructure Production Ready - Rate limiting, monitoring, security, persistence

Novel Features (Not in Other Vector DBs)

Search Replay - Complete HNSW search path recording for debugging & transparency Temperature Search - LLM-inspired result sampling (exploration/exploitation control) Index Oracle - Intelligent index recommendation based on workload analysis Embedding Health Monitor - Statistical quality detection (outliers, degeneracy, drift) Vector Clustering - K-means semantic grouping with auto cluster count estimation Query Expansion - Automatic query rewriting with RRF fusion (3 strategies) Vector Drift Detection - Kolmogorov-Smirnov test for distribution monitoring Adaptive Reranking - Feedback-based learning (clicks, dwell time, bookmarks) Hybrid Fusion - Multi-strategy result merging (RRF, linear, confidence-based)

See NOVEL_FEATURES.md for complete documentation of the 9 unique features.


Performance Metrics

Async Infrastructure (Validated & Tested)

  • Event Bus: 470,000 events/sec throughput, 0.14ms P99 latency (47x target)
  • Job Queue: 50,000+ jobs/sec submission, 1,000+ jobs/sec execution (50x target)
  • WebSocket: 5,000+ connections/sec, 500+ concurrent connections (10x target)
  • Test Coverage: 95-100% on all async infrastructure with 156+ passing tests

Streaming & Search

  • NDJSON Batch Upload: < 1 second for 3 documents with full 1024-dim embeddings
  • Streaming Search: 200ms average latency with metadata filtering
  • WebSocket Search: <100ms round-trip time for real-time queries
  • Hybrid Search: Advanced scoring with recency boost and field boosting

Production Readiness

  • Test Suite: 156+ tests with 100% pass rate
  • Coverage: 95-100% on core infrastructure (Event Bus, Job Queue, WebSocket, Webhooks)
  • Performance: All targets exceeded by 2-142x (see PERFORMANCE_BENCHMARKS.md)
  • Docker: Multi-stage builds with dev/prod configurations
  • Monitoring: Health checks, readiness probes, detailed component status

Quick Start

Option 1: Docker (Recommended)

Development Mode

# Set your Cohere API key
export COHERE_API_KEY="your_cohere_api_key"

# Start in development mode (hot-reloading enabled)
docker-compose -f docker-compose.dev.yml up

# Access the API at http://localhost:8000
# API docs at http://localhost:8000/docs

Production Mode

# Set environment variables
export COHERE_API_KEY="your_cohere_api_key"
export GUNICORN_WORKERS=4

# Start in production mode
docker-compose -f docker-compose.prod.yml up -d

# With monitoring (Prometheus + Grafana)
docker-compose -f docker-compose.prod.yml --profile monitoring up -d

See DOCKER_DEPLOYMENT.md for complete Docker guide.

Option 2: Local Development

# Install dependencies
pip install -r requirements.txt

# Set up environment
export PYTHONPATH=/Users/bledden/Documents/arrwDB
export COHERE_API_KEY=your_cohere_api_key_here

# Start the server
python3 run_api.py

# API running at: http://localhost:8000
# Interactive docs: http://localhost:8000/docs

Quick API Test

# Create a library
curl -X POST http://localhost:8000/v1/libraries \
  -H "Content-Type: application/json" \
  -d '{"name": "test", "index_type": "brute_force"}'

# Add a document (with auto-embedding)
curl -X POST http://localhost:8000/v1/libraries/{library_id}/documents \
  -H "Content-Type: application/json" \
  -d '{"title": "Test Doc", "text": "This is a test document"}'

# Search (semantic)
curl -X POST http://localhost:8000/v1/libraries/{library_id}/search \
  -H "Content-Type: application/json" \
  -d '{"query": "test", "k": 5}'

Architecture

arrwDB/
├── app/
│   ├── api/                 # FastAPI routes & endpoints
│   │   ├── main.py          # REST API endpoints
│   │   ├── streaming.py     # NDJSON streaming & search
│   │   ├── websocket_routes.py  # WebSocket handlers
│   │   ├── webhook_routes.py    # Webhook management
│   │   ├── event_routes.py  # Event bus monitoring
│   │   └── health.py        # Health check endpoints
│   ├── services/            # Business logic layer
│   │   ├── library_service.py
│   │   ├── embedding_service.py
│   │   └── hybrid_search.py
│   ├── models/              # Pydantic models & validation
│   ├── events/              # Event bus implementation
│   ├── jobs/                # Background job queue
│   ├── websockets/          # WebSocket connection manager
│   └── webhooks/            # Webhook management & delivery
├── infrastructure/
│   ├── indexes/             # Vector index algorithms
│   │   ├── brute_force.py   # O(n) exact search
│   │   ├── kd_tree.py       # Tree-based partitioning
│   │   ├── lsh.py           # Locality-sensitive hashing
│   │   ├── hnsw.py          # Hierarchical navigable small world
│   │   └── ivf.py           # Inverted file index
│   ├── repositories/        # Data access layer
│   └── persistence/         # WAL + snapshot storage
├── tests/
│   ├── unit/                # 156+ unit tests
│   ├── integration/         # Integration tests
│   └── performance/         # Performance benchmarks
└── docs/                    # Comprehensive documentation

Technology Stack

  • API Framework: FastAPI (async/await, automatic OpenAPI docs)
  • Vector Index: NumPy-based custom implementations + Rust optimizations
  • Embeddings: Cohere API (embed-english-v3.0, 1024 dimensions)
  • Persistence: Write-Ahead Log (WAL) + periodic snapshots
  • Concurrency: Custom Reader-Writer locks, asyncio
  • Real-Time: WebSockets, Event Bus (pub/sub), Webhooks
  • Containerization: Docker with multi-stage builds
  • Monitoring: Prometheus metrics, health checks, detailed status

Real-Time Features

1. NDJSON Streaming Ingestion

Batch upload documents with automatic embedding generation:

# Create NDJSON file (newline-delimited JSON)
cat << EOF > documents.ndjson
{"title": "Doc 1", "texts": ["First document content"]}
{"title": "Doc 2", "texts": ["Second document content"]}
{"title": "Doc 3", "texts": ["Third document content"]}
EOF

# Upload via streaming endpoint
curl -X POST http://localhost:8000/v1/libraries/{library_id}/documents/stream \
  -H "Content-Type: application/x-ndjson" \
  --data-binary @documents.ndjson

# Response: {"successful": 3, "failed": 0, "results": [...]}

2. WebSocket Real-Time Search

Bidirectional communication for interactive applications:

import asyncio
import websockets
import json

async def search_realtime():
    uri = f"ws://localhost:8000/v1/libraries/{library_id}/ws"
    async with websockets.connect(uri) as websocket:
        # Send search request
        await websocket.send(json.dumps({
            "request_id": "search-1",
            "action": "search",
            "data": {"query_text": "semantic query", "k": 5}
        }))

        # Receive results
        response = json.loads(await websocket.recv())
        print(f"Found {len(response['data']['results'])} results")

asyncio.run(search_realtime())

3. Webhooks for Event Notifications

Receive HTTP callbacks for important events:

# Create webhook
curl -X POST http://localhost:8000/api/v1/webhooks \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://yourapp.com/webhooks/arrwdb",
    "events": ["job.completed", "job.failed", "cost.threshold_reached"],
    "description": "Notify on compression job completion"
  }'

# Response includes secret for HMAC verification
# {
#   "id": "webhook_abc123",
#   "url": "https://yourapp.com/webhooks/arrwdb",
#   "secret": "whsec_...",
#   "events": ["job.completed", "job.failed"],
#   "status": "active"
# }

Webhook Features:

  • HMAC SHA-256 signature verification
  • Automatic retries with exponential backoff
  • Delivery tracking and statistics
  • Event filtering (wildcards supported)
  • Multi-tenancy support

Supported Event Types:

  • document.* - Document lifecycle events
  • library.* - Library management events
  • job.* - Job progress and completion
  • cost.* - Budget and cost alerts
  • index.* - Index optimization events
  • * - Subscribe to all events

4. Hybrid Search with Reranking

Advanced search combining vector similarity with metadata signals:

curl -X POST http://localhost:8000/v1/libraries/{library_id}/search/hybrid \
  -H "Content-Type: application/json" \
  -d '{
    "query": "machine learning papers",
    "k": 20,
    "scoring_config": {
      "vector_weight": 0.7,
      "metadata_weight": 0.3,
      "field_boosts": {"tags": 2.0, "author": 1.5},
      "recency_boost_enabled": true,
      "recency_half_life_days": 30
    }
  }'

Hybrid Search Features:

  • Configurable vector/metadata weight balance
  • Field-specific boost factors
  • Recency boost with exponential decay
  • Pre-built reranking functions (by length, position, recency)
  • Custom reranking function support

API Reference

Core REST Endpoints

Library Management

  • POST /v1/libraries - Create new library
  • GET /v1/libraries - List all libraries
  • GET /v1/libraries/{id} - Get library details
  • DELETE /v1/libraries/{id} - Delete library

Document Operations

  • POST /v1/libraries/{id}/documents - Add single document
  • POST /v1/libraries/{id}/documents/stream - Batch NDJSON upload
  • GET /v1/libraries/{id}/documents/{doc_id} - Get document
  • DELETE /v1/libraries/{id}/documents/{doc_id} - Delete document

Search Operations

  • POST /v1/libraries/{id}/search - Semantic search
  • POST /v1/libraries/{id}/search/stream - Streaming search
  • POST /v1/libraries/{id}/search/hybrid - Hybrid search with metadata
  • POST /v1/libraries/{id}/search/rerank - Rerank results

Webhooks

  • POST /api/v1/webhooks - Create webhook
  • GET /api/v1/webhooks - List webhooks
  • GET /api/v1/webhooks/{id} - Get webhook details
  • PATCH /api/v1/webhooks/{id} - Update webhook
  • DELETE /api/v1/webhooks/{id} - Delete webhook
  • GET /api/v1/webhooks/{id}/deliveries - Get delivery history
  • GET /api/v1/webhooks/{id}/stats - Get statistics
  • POST /api/v1/webhooks/{id}/test - Send test event

Health & Monitoring

  • GET /health - Basic health check (liveness)
  • GET /ready - Readiness check with dependencies
  • GET /health/detailed - Detailed component status
  • GET /ping - Simple ping/pong
  • GET /metrics - Prometheus metrics

Real-Time

  • WS /v1/libraries/{id}/ws - WebSocket connection
  • GET /v1/events/stats - Event bus statistics

Testing

Comprehensive Test Suite

# Run all tests
python3 -m pytest tests/ -v

# Run specific test modules
python3 -m pytest tests/unit/test_event_bus.py -v
python3 -m pytest tests/unit/test_job_queue.py -v
python3 -m pytest tests/unit/test_websocket_manager.py -v
python3 -m pytest tests/unit/test_streaming.py -v
python3 -m pytest tests/unit/test_hybrid_search.py -v

# Run performance benchmarks
python3 -m pytest tests/performance/ -v

# Run with coverage
python3 -m pytest tests/ --cov=app --cov-report=html

Test Coverage Summary

Module Tests Coverage Status
Event Bus 23 tests 97% Passing
Job Queue 34 tests 98% Passing
Job Handlers 19 tests 95% Passing
WebSocket Manager 25 tests 100% Passing
API Key Management 33 tests 100% Passing
Streaming 13 tests 78% Passing
Hybrid Search 32 tests 76% Passing
Performance Benchmarks 10 tests N/A All targets exceeded
Total 156+ tests 95-100% ** All passing**

Monitoring & Observability

Health Checks

# Basic health (liveness)
curl http://localhost:8000/health

# Readiness check (for K8s)
curl http://localhost:8000/ready

# Detailed component status
curl http://localhost:8000/health/detailed

Metrics & Statistics

# Event bus statistics
curl http://localhost:8000/v1/events/stats

# Webhook delivery statistics
curl http://localhost:8000/api/v1/webhooks/{webhook_id}/stats

# Library statistics
curl http://localhost:8000/v1/libraries/{id}/statistics

Logging

  • Structured JSON logs (configurable via LOG_JSON_FORMAT=true)
  • Log levels: DEBUG, INFO, WARNING, ERROR, CRITICAL
  • Component-specific logging: Event Bus, Job Queue, WebSocket, Webhooks

Configuration

Environment Variables

# Server Configuration
export HOST=0.0.0.0
export PORT=8000
export LOG_LEVEL=INFO
export LOG_JSON_FORMAT=false

# Gunicorn (Production)
export GUNICORN_WORKERS=4
export GUNICORN_TIMEOUT=120

# Multi-Tenancy
export MULTI_TENANCY_ENABLED=false
export TENANTS_DB_PATH=./data/tenants.json

# Rate Limiting
export RATE_LIMIT_ENABLED=false
export RATE_LIMIT_PER_MINUTE=60

# Embedding Service
export COHERE_API_KEY=your_api_key_here
export EMBEDDING_MODEL=embed-english-v3.0
export EMBEDDING_DIMENSION=1024

# Performance
export MAX_CHUNK_SIZE=1000
export MAX_CHUNKS_PER_DOCUMENT=1000
export JOB_QUEUE_MAX_WORKERS=4

# Event Bus & WebSocket
export EVENT_BUS_ENABLED=true
export WEBSOCKET_ENABLED=true
export WEBSOCKET_MAX_CONNECTIONS=1000


Using Novel Features

All 9 novel features are production-ready and accessible via REST API:

Always Available (No Enable Required)

Most features work immediately via direct API calls:

# Temperature Search - control exploration vs exploitation
POST /v1/temperature-search/corpora/{id}/search
Body: {"query_text": "...", "temperature": 1.0, "k": 10}

# Index Oracle - get intelligent index recommendations
GET /v1/index-oracle/corpora/{id}/analyze

# Embedding Health Monitor - detect quality issues
GET /v1/embedding-health/corpora/{id}/analyze

# Vector Clustering - semantic grouping
POST /v1/clustering/corpora/{id}/cluster
Body: {"n_clusters": 5}

# Query Expansion - automatic query rewriting
POST /v1/query-expansion/expand
Body: {"query": "machine learning", "strategy": "synonym"}

# Vector Drift Detection - distribution monitoring
GET /v1/vector-drift/corpora/{id}/analyze

# Hybrid Fusion - multi-strategy result merging
POST /v1/hybrid-search/corpora/{id}/search
Body: {"query": "...", "fusion_strategy": "rrf"}

# Adaptive Reranking - feedback-based learning
POST /v1/adaptive-reranking/rerank
Body: {"results": [...], "feedback": [...]}

Requires Explicit Enable

Only Search Replay needs activation (to avoid overhead):

# Enable search path recording
POST /v1/search-replay/enable

# Now perform searches with recording
POST /v1/libraries/{id}/search

# View recorded paths
GET /v1/search-replay/paths?corpus_id={id}

# Disable when done (zero overhead)
POST /v1/search-replay/disable

See NOVEL_FEATURES.md for complete API documentation and examples.


Roadmap

Completed (October 2025)

  • 9 novel features with full test coverage (23/23 tests passing)
  • Rust optimizations for indexes and persistence (5-10x speedup)
  • Vector quantization (scalar + hybrid, 70% memory reduction)
  • Comprehensive testing (156+ tests, 95-100% coverage)
  • Docker deployment (dev + production configurations)
  • Webhooks API with HMAC signatures and retries
  • Hybrid search with metadata scoring
  • WebSocket support for real-time search
  • NDJSON streaming ingestion
  • Performance validation (all targets exceeded by 2-142x)

In Development (Q4 2025)

  • Python SDK for client integration
  • GraphQL API layer
  • Advanced analytics dashboard
  • Multi-region replication

Planned (Q1 2026)

  • Kubernetes Helm charts
  • Grafana dashboards
  • Cost tracking and budget management
  • CI/CD pipeline (GitHub Actions)

Documentation

Document Description
NOVEL_FEATURES.md 9 unique features not in other vector databases
COMPETITIVE_ANALYSIS_2025.md Deep competitive analysis vs Pinecone, Weaviate, Qdrant, Milvus, Chroma
SCALING_TO_BILLIONS_ROADMAP.md Technical roadmap: 10M → 1B vectors, 200ms → 10ms latency
DOCKER_DEPLOYMENT.md Complete Docker deployment guide
TEST_COVERAGE_ROADMAP.md Test strategy and coverage plan
PERFORMANCE_BENCHMARKS.md Detailed performance results
API_GUIDE.md Complete API reference
STREAMING_GUIDE.md NDJSON & WebSocket guide
EVENT_BUS_GUIDE.md Event bus & CDC docs

Contributing

We welcome contributions! Please:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Make your changes
  4. Run tests (pytest tests/ -v)
  5. Commit with descriptive messages
  6. Push and open a Pull Request

License

This project is proprietary software. All rights reserved.


Acknowledgments

  • Cohere - Embedding API
  • FastAPI - Modern Python web framework
  • NumPy - Numerical computing
  • Docker - Containerization platform

Last Updated: October 28, 2025 Version: 2.1.0 Status: Production Ready with 9 Unique Novel Features


Built with for high-performance semantic search and AI applications

About

arrwDB - Advanced Vector Database with Metadata Filtering

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published