arrwDB - Production Vector Database

Enterprise-grade vector database with real-time streaming, WebSocket support, webhooks, and comprehensive testing

Overview

arrwDB is a production-grade vector database designed for semantic search, RAG (Retrieval-Augmented Generation), and AI applications. Built from the ground up with performance, scalability, and developer experience in mind.

Key Features

Core Infrastructure

Real-Time Streaming - NDJSON batch ingestion with sub-second performance WebSocket Support - Bidirectional real-time communication for live search Webhooks API - Event notifications with HMAC signatures and automatic retries Event Bus & CDC - Change data capture with pub/sub event system Background Job Queue - Async processing with configurable worker pools Multiple Index Types - BruteForce, KD-Tree, LSH, HNSW, IVF Rust Optimizations - Performance-critical indexes and persistence in Rust (5-10x faster) Vector Quantization - Scalar (4-bit/8-bit) and hybrid quantization (70% memory reduction) Hybrid Search - Combine semantic and keyword search with reranking Metadata Filtering - Advanced query filtering on document metadata Health Checks - Kubernetes-ready liveness and readiness probes Docker Support - Development and production Docker Compose configurations Comprehensive Testing - 156+ tests with 95-100% coverage on core infrastructure Production Ready - Rate limiting, monitoring, security, persistence

Novel Features (Not in Other Vector DBs)

Search Replay - Complete HNSW search path recording for debugging & transparency Temperature Search - LLM-inspired result sampling (exploration/exploitation control) Index Oracle - Intelligent index recommendation based on workload analysis Embedding Health Monitor - Statistical quality detection (outliers, degeneracy, drift) Vector Clustering - K-means semantic grouping with auto cluster count estimation Query Expansion - Automatic query rewriting with RRF fusion (3 strategies) Vector Drift Detection - Kolmogorov-Smirnov test for distribution monitoring Adaptive Reranking - Feedback-based learning (clicks, dwell time, bookmarks) Hybrid Fusion - Multi-strategy result merging (RRF, linear, confidence-based)

See NOVEL_FEATURES.md for complete documentation of the 9 unique features.

Performance Metrics

Async Infrastructure (Validated & Tested)

Event Bus: 470,000 events/sec throughput, 0.14ms P99 latency (47x target)
Job Queue: 50,000+ jobs/sec submission, 1,000+ jobs/sec execution (50x target)
WebSocket: 5,000+ connections/sec, 500+ concurrent connections (10x target)
Test Coverage: 95-100% on all async infrastructure with 156+ passing tests

Streaming & Search

NDJSON Batch Upload: < 1 second for 3 documents with full 1024-dim embeddings
Streaming Search: 200ms average latency with metadata filtering
WebSocket Search: <100ms round-trip time for real-time queries
Hybrid Search: Advanced scoring with recency boost and field boosting

Production Readiness

Test Suite: 156+ tests with 100% pass rate
Coverage: 95-100% on core infrastructure (Event Bus, Job Queue, WebSocket, Webhooks)
Performance: All targets exceeded by 2-142x (see PERFORMANCE_BENCHMARKS.md)
Docker: Multi-stage builds with dev/prod configurations
Monitoring: Health checks, readiness probes, detailed component status

Quick Start

Option 1: Docker (Recommended)

Development Mode

# Set your Cohere API key
export COHERE_API_KEY="your_cohere_api_key"

# Start in development mode (hot-reloading enabled)
docker-compose -f docker-compose.dev.yml up

# Access the API at http://localhost:8000
# API docs at http://localhost:8000/docs

Production Mode

# Set environment variables
export COHERE_API_KEY="your_cohere_api_key"
export GUNICORN_WORKERS=4

# Start in production mode
docker-compose -f docker-compose.prod.yml up -d

# With monitoring (Prometheus + Grafana)
docker-compose -f docker-compose.prod.yml --profile monitoring up -d

See DOCKER_DEPLOYMENT.md for complete Docker guide.

Option 2: Local Development

# Install dependencies
pip install -r requirements.txt

# Set up environment
export PYTHONPATH=/Users/bledden/Documents/arrwDB
export COHERE_API_KEY=your_cohere_api_key_here

# Start the server
python3 run_api.py

# API running at: http://localhost:8000
# Interactive docs: http://localhost:8000/docs

Quick API Test

# Create a library
curl -X POST http://localhost:8000/v1/libraries \
  -H "Content-Type: application/json" \
  -d '{"name": "test", "index_type": "brute_force"}'

# Add a document (with auto-embedding)
curl -X POST http://localhost:8000/v1/libraries/{library_id}/documents \
  -H "Content-Type: application/json" \
  -d '{"title": "Test Doc", "text": "This is a test document"}'

# Search (semantic)
curl -X POST http://localhost:8000/v1/libraries/{library_id}/search \
  -H "Content-Type: application/json" \
  -d '{"query": "test", "k": 5}'

Architecture

arrwDB/
├── app/
│   ├── api/                 # FastAPI routes & endpoints
│   │   ├── main.py          # REST API endpoints
│   │   ├── streaming.py     # NDJSON streaming & search
│   │   ├── websocket_routes.py  # WebSocket handlers
│   │   ├── webhook_routes.py    # Webhook management
│   │   ├── event_routes.py  # Event bus monitoring
│   │   └── health.py        # Health check endpoints
│   ├── services/            # Business logic layer
│   │   ├── library_service.py
│   │   ├── embedding_service.py
│   │   └── hybrid_search.py
│   ├── models/              # Pydantic models & validation
│   ├── events/              # Event bus implementation
│   ├── jobs/                # Background job queue
│   ├── websockets/          # WebSocket connection manager
│   └── webhooks/            # Webhook management & delivery
├── infrastructure/
│   ├── indexes/             # Vector index algorithms
│   │   ├── brute_force.py   # O(n) exact search
│   │   ├── kd_tree.py       # Tree-based partitioning
│   │   ├── lsh.py           # Locality-sensitive hashing
│   │   ├── hnsw.py          # Hierarchical navigable small world
│   │   └── ivf.py           # Inverted file index
│   ├── repositories/        # Data access layer
│   └── persistence/         # WAL + snapshot storage
├── tests/
│   ├── unit/                # 156+ unit tests
│   ├── integration/         # Integration tests
│   └── performance/         # Performance benchmarks
└── docs/                    # Comprehensive documentation

Technology Stack

API Framework: FastAPI (async/await, automatic OpenAPI docs)
Vector Index: NumPy-based custom implementations + Rust optimizations
Embeddings: Cohere API (embed-english-v3.0, 1024 dimensions)
Persistence: Write-Ahead Log (WAL) + periodic snapshots
Concurrency: Custom Reader-Writer locks, asyncio
Real-Time: WebSockets, Event Bus (pub/sub), Webhooks
Containerization: Docker with multi-stage builds
Monitoring: Prometheus metrics, health checks, detailed status

Real-Time Features

1. NDJSON Streaming Ingestion

Batch upload documents with automatic embedding generation:

# Create NDJSON file (newline-delimited JSON)
cat << EOF > documents.ndjson
{"title": "Doc 1", "texts": ["First document content"]}
{"title": "Doc 2", "texts": ["Second document content"]}
{"title": "Doc 3", "texts": ["Third document content"]}
EOF

# Upload via streaming endpoint
curl -X POST http://localhost:8000/v1/libraries/{library_id}/documents/stream \
  -H "Content-Type: application/x-ndjson" \
  --data-binary @documents.ndjson

# Response: {"successful": 3, "failed": 0, "results": [...]}

2. WebSocket Real-Time Search

Bidirectional communication for interactive applications:

import asyncio
import websockets
import json

async def search_realtime():
    uri = f"ws://localhost:8000/v1/libraries/{library_id}/ws"
    async with websockets.connect(uri) as websocket:
        # Send search request
        await websocket.send(json.dumps({
            "request_id": "search-1",
            "action": "search",
            "data": {"query_text": "semantic query", "k": 5}
        }))

        # Receive results
        response = json.loads(await websocket.recv())
        print(f"Found {len(response['data']['results'])} results")

asyncio.run(search_realtime())

3. Webhooks for Event Notifications

Receive HTTP callbacks for important events:

# Create webhook
curl -X POST http://localhost:8000/api/v1/webhooks \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://yourapp.com/webhooks/arrwdb",
    "events": ["job.completed", "job.failed", "cost.threshold_reached"],
    "description": "Notify on compression job completion"
  }'

# Response includes secret for HMAC verification
# {
#   "id": "webhook_abc123",
#   "url": "https://yourapp.com/webhooks/arrwdb",
#   "secret": "whsec_...",
#   "events": ["job.completed", "job.failed"],
#   "status": "active"
# }

Webhook Features:

HMAC SHA-256 signature verification
Automatic retries with exponential backoff
Delivery tracking and statistics
Event filtering (wildcards supported)
Multi-tenancy support

Supported Event Types:

document.* - Document lifecycle events
library.* - Library management events
job.* - Job progress and completion
cost.* - Budget and cost alerts
index.* - Index optimization events
* - Subscribe to all events

4. Hybrid Search with Reranking

Advanced search combining vector similarity with metadata signals:

curl -X POST http://localhost:8000/v1/libraries/{library_id}/search/hybrid \
  -H "Content-Type: application/json" \
  -d '{
    "query": "machine learning papers",
    "k": 20,
    "scoring_config": {
      "vector_weight": 0.7,
      "metadata_weight": 0.3,
      "field_boosts": {"tags": 2.0, "author": 1.5},
      "recency_boost_enabled": true,
      "recency_half_life_days": 30
    }
  }'

Hybrid Search Features:

Configurable vector/metadata weight balance
Field-specific boost factors
Recency boost with exponential decay
Pre-built reranking functions (by length, position, recency)
Custom reranking function support

API Reference

Core REST Endpoints

Library Management

POST /v1/libraries - Create new library
GET /v1/libraries - List all libraries
GET /v1/libraries/{id} - Get library details
DELETE /v1/libraries/{id} - Delete library

Document Operations

POST /v1/libraries/{id}/documents - Add single document
POST /v1/libraries/{id}/documents/stream - Batch NDJSON upload
GET /v1/libraries/{id}/documents/{doc_id} - Get document
DELETE /v1/libraries/{id}/documents/{doc_id} - Delete document

Search Operations

POST /v1/libraries/{id}/search - Semantic search
POST /v1/libraries/{id}/search/stream - Streaming search
POST /v1/libraries/{id}/search/hybrid - Hybrid search with metadata
POST /v1/libraries/{id}/search/rerank - Rerank results

Webhooks

POST /api/v1/webhooks - Create webhook
GET /api/v1/webhooks - List webhooks
GET /api/v1/webhooks/{id} - Get webhook details
PATCH /api/v1/webhooks/{id} - Update webhook
DELETE /api/v1/webhooks/{id} - Delete webhook
GET /api/v1/webhooks/{id}/deliveries - Get delivery history
GET /api/v1/webhooks/{id}/stats - Get statistics
POST /api/v1/webhooks/{id}/test - Send test event

Health & Monitoring

GET /health - Basic health check (liveness)
GET /ready - Readiness check with dependencies
GET /health/detailed - Detailed component status
GET /ping - Simple ping/pong
GET /metrics - Prometheus metrics

Real-Time

WS /v1/libraries/{id}/ws - WebSocket connection
GET /v1/events/stats - Event bus statistics

Testing

Comprehensive Test Suite

# Run all tests
python3 -m pytest tests/ -v

# Run specific test modules
python3 -m pytest tests/unit/test_event_bus.py -v
python3 -m pytest tests/unit/test_job_queue.py -v
python3 -m pytest tests/unit/test_websocket_manager.py -v
python3 -m pytest tests/unit/test_streaming.py -v
python3 -m pytest tests/unit/test_hybrid_search.py -v

# Run performance benchmarks
python3 -m pytest tests/performance/ -v

# Run with coverage
python3 -m pytest tests/ --cov=app --cov-report=html

Test Coverage Summary

Module	Tests	Coverage	Status
Event Bus	23 tests	97%	Passing
Job Queue	34 tests	98%	Passing
Job Handlers	19 tests	95%	Passing
WebSocket Manager	25 tests	100%	Passing
API Key Management	33 tests	100%	Passing
Streaming	13 tests	78%	Passing
Hybrid Search	32 tests	76%	Passing
Performance Benchmarks	10 tests	N/A	All targets exceeded
Total	156+ tests	95-100%	All passing

Monitoring & Observability

Health Checks

# Basic health (liveness)
curl http://localhost:8000/health

# Readiness check (for K8s)
curl http://localhost:8000/ready

# Detailed component status
curl http://localhost:8000/health/detailed

Metrics & Statistics

# Event bus statistics
curl http://localhost:8000/v1/events/stats

# Webhook delivery statistics
curl http://localhost:8000/api/v1/webhooks/{webhook_id}/stats

# Library statistics
curl http://localhost:8000/v1/libraries/{id}/statistics

Logging

Structured JSON logs (configurable via LOG_JSON_FORMAT=true)
Log levels: DEBUG, INFO, WARNING, ERROR, CRITICAL
Component-specific logging: Event Bus, Job Queue, WebSocket, Webhooks

Configuration

Environment Variables

# Server Configuration
export HOST=0.0.0.0
export PORT=8000
export LOG_LEVEL=INFO
export LOG_JSON_FORMAT=false

# Gunicorn (Production)
export GUNICORN_WORKERS=4
export GUNICORN_TIMEOUT=120

# Multi-Tenancy
export MULTI_TENANCY_ENABLED=false
export TENANTS_DB_PATH=./data/tenants.json

# Rate Limiting
export RATE_LIMIT_ENABLED=false
export RATE_LIMIT_PER_MINUTE=60

# Embedding Service
export COHERE_API_KEY=your_api_key_here
export EMBEDDING_MODEL=embed-english-v3.0
export EMBEDDING_DIMENSION=1024

# Performance
export MAX_CHUNK_SIZE=1000
export MAX_CHUNKS_PER_DOCUMENT=1000
export JOB_QUEUE_MAX_WORKERS=4

# Event Bus & WebSocket
export EVENT_BUS_ENABLED=true
export WEBSOCKET_ENABLED=true
export WEBSOCKET_MAX_CONNECTIONS=1000

Using Novel Features

All 9 novel features are production-ready and accessible via REST API:

Always Available (No Enable Required)

Most features work immediately via direct API calls:

# Temperature Search - control exploration vs exploitation
POST /v1/temperature-search/corpora/{id}/search
Body: {"query_text": "...", "temperature": 1.0, "k": 10}

# Index Oracle - get intelligent index recommendations
GET /v1/index-oracle/corpora/{id}/analyze

# Embedding Health Monitor - detect quality issues
GET /v1/embedding-health/corpora/{id}/analyze

# Vector Clustering - semantic grouping
POST /v1/clustering/corpora/{id}/cluster
Body: {"n_clusters": 5}

# Query Expansion - automatic query rewriting
POST /v1/query-expansion/expand
Body: {"query": "machine learning", "strategy": "synonym"}

# Vector Drift Detection - distribution monitoring
GET /v1/vector-drift/corpora/{id}/analyze

# Hybrid Fusion - multi-strategy result merging
POST /v1/hybrid-search/corpora/{id}/search
Body: {"query": "...", "fusion_strategy": "rrf"}

# Adaptive Reranking - feedback-based learning
POST /v1/adaptive-reranking/rerank
Body: {"results": [...], "feedback": [...]}

Requires Explicit Enable

Only Search Replay needs activation (to avoid overhead):

# Enable search path recording
POST /v1/search-replay/enable

# Now perform searches with recording
POST /v1/libraries/{id}/search

# View recorded paths
GET /v1/search-replay/paths?corpus_id={id}

# Disable when done (zero overhead)
POST /v1/search-replay/disable

See NOVEL_FEATURES.md for complete API documentation and examples.

Roadmap

Completed (October 2025)

9 novel features with full test coverage (23/23 tests passing)
Rust optimizations for indexes and persistence (5-10x speedup)
Vector quantization (scalar + hybrid, 70% memory reduction)
Comprehensive testing (156+ tests, 95-100% coverage)
Docker deployment (dev + production configurations)
Webhooks API with HMAC signatures and retries
Hybrid search with metadata scoring
WebSocket support for real-time search
NDJSON streaming ingestion
Performance validation (all targets exceeded by 2-142x)

In Development (Q4 2025)

Python SDK for client integration
GraphQL API layer
Advanced analytics dashboard
Multi-region replication

Planned (Q1 2026)

Kubernetes Helm charts
Grafana dashboards
Cost tracking and budget management
CI/CD pipeline (GitHub Actions)

Documentation

Document	Description
NOVEL_FEATURES.md	9 unique features not in other vector databases
COMPETITIVE_ANALYSIS_2025.md	Deep competitive analysis vs Pinecone, Weaviate, Qdrant, Milvus, Chroma
SCALING_TO_BILLIONS_ROADMAP.md	Technical roadmap: 10M → 1B vectors, 200ms → 10ms latency
DOCKER_DEPLOYMENT.md	Complete Docker deployment guide
TEST_COVERAGE_ROADMAP.md	Test strategy and coverage plan
PERFORMANCE_BENCHMARKS.md	Detailed performance results
API_GUIDE.md	Complete API reference
STREAMING_GUIDE.md	NDJSON & WebSocket guide
EVENT_BUS_GUIDE.md	Event bus & CDC docs

Contributing

We welcome contributions! Please:

Fork the repository
Create a feature branch (git checkout -b feature/amazing-feature)
Make your changes
Run tests (pytest tests/ -v)
Commit with descriptive messages
Push and open a Pull Request

License

Acknowledgments

Cohere - Embedding API
FastAPI - Modern Python web framework
NumPy - Numerical computing
Docker - Containerization platform

Last Updated: October 28, 2025 Version: 2.1.0 Status: Production Ready with 9 Unique Novel Features

Built with for high-performance semantic search and AI applications

Name		Name	Last commit message	Last commit date
Latest commit History 152 Commits
app		app
benchmarks		benchmarks
core		core
deploy/helm/arrwdb		deploy/helm/arrwdb
docs		docs
infrastructure		infrastructure
mcp-server		mcp-server
packages		packages
rust		rust
scripts		scripts
sdk		sdk
target/rust-analyzer/flycheck0		target/rust-analyzer/flycheck0
temporal		temporal
tests		tests
.dockerignore		.dockerignore
.env.example		.env.example
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
Dockerfile		Dockerfile
Dockerfile.singlestage		Dockerfile.singlestage
Procfile		Procfile
README.md		README.md
docker-compose.dev.yml		docker-compose.dev.yml
docker-compose.prod.yml		docker-compose.prod.yml
docker-compose.yml		docker-compose.yml
gunicorn_conf.py		gunicorn_conf.py
pyproject.toml		pyproject.toml
pytest.ini		pytest.ini
requirements-dev.txt		requirements-dev.txt
requirements.txt		requirements.txt
run_api.py		run_api.py

bledden/arrwDB

Folders and files

Latest commit

History

Repository files navigation