Zell

Fast, Memory-Efficient LLM Inference Engine with Automatic Long-Term Memory

Zell is a high-performance local inference engine for running large language models (LLMs) on your machine. Unlike traditional inference tools, Zell automatically learns from every conversation through an event-sourced memory system and LoRA fine-tuning - no manual intervention required.

Key Features

🚀 Performance

  • Pure Zig MoE inference - No llama.cpp dependency for Mixtral
  • Runs 26GB models on 16GB RAM - Two-tier architecture (GPU resident + SSD streaming)
  • 43.1% faster inference with system prompt caching (1.76x speedup)
  • 3-4x server throughput with continuous batching (4 parallel users)
  • Adaptive context sizing (512 → 131K tokens) based on workload
  • Optimized Metal GPU acceleration for macOS

💾 Adaptive Memory Optimization - Run Oversized Models

Zell automatically profiles your system and model to choose the optimal inference strategy:

zell build <model>
       │
       ▼
┌─────────────────────────────────────────────────────────────┐
│  AUTOMATIC PROFILER                                         │
│  ─────────────────                                          │
│  • Detects system RAM (e.g., 16GB, 64GB, 128GB)            │
│  • Measures model size from GGUF                            │
│  • Identifies architecture (Dense vs MoE)                   │
│  • Saves optimal config to ~/.zell/profiles/                │
└─────────────────────────────────────────────────────────────┘
       │
       ▼
┌─────────────────────────────────────────────────────────────┐
│  DECISION MATRIX                                            │
│  ───────────────                                            │
│                                                             │
│  Model fits in RAM?                                         │
│    YES → Standard inference (fastest)                       │
│    NO  → Oversized model optimization ↓                     │
│                                                             │
│  Architecture?                                              │
│    DENSE (Llama, Qwen) → Layer-Split Speculative Decoding  │
│    MoE (Mixtral)       → Batched Expert Processing          │
└─────────────────────────────────────────────────────────────┘
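
In code, the decision above boils down to a few comparisons. This is a minimal sketch of the idea, with illustrative names and thresholds rather than Zell's internal API:

const Strategy = enum {
    standard,         // model fits in RAM: load everything, fastest path
    layer_split,      // oversized dense model: GPU draft layers + SSD target layers
    batched_experts,  // oversized MoE model: resident attention + streamed experts
};

fn chooseStrategy(system_ram_bytes: u64, model_bytes: u64, is_moe: bool) Strategy {
    // Keep headroom for the OS, KV cache, and activations (threshold illustrative).
    const budget = (system_ram_bytes / 100) * 80;
    if (model_bytes <= budget) return .standard;
    return if (is_moe) .batched_experts else .layer_split;
}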

For Dense Models (Llama 70B, Qwen 72B)

Layer-Split Speculative Decoding - a single model acts as both draft and target (sketched in code after this list):

  • First 20-25% layers stay on GPU (fast draft generation)
  • Remaining 75-80% layers stream from SSD (batched verification)
  • Draft 8 tokens → Verify all 8 in one SSD stream → Accept/reject
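
A minimal sketch of that loop, assuming a hypothetical Engine type whose two stub methods stand in for the GPU draft pass and the batched SSD verification pass:

const Engine = struct {
    // Placeholder state; the real engine holds Metal buffers and an mmap'd GGUF.
    fn draftTokens(self: *Engine, buf: []u32) []const u32 {
        _ = self;
        // Stub: pretend the GPU-resident front layers proposed buf.len tokens.
        for (buf, 0..) |*t, i| t.* = @intCast(i);
        return buf;
    }
    fn verifyBatch(self: *Engine, draft: []const u32) usize {
        _ = self;
        // Stub: pretend one batched SSD sweep accepted every draft token.
        return draft.len;
    }
};

fn generate(engine: *Engine, max_tokens: usize) void {
    var draft_buf: [8]u32 = undefined;
    var produced: usize = 0;
    while (produced < max_tokens) {
        const draft = engine.draftTokens(&draft_buf); // fast: GPU layers only
        produced += engine.verifyBatch(draft);        // one SSD stream verifies all 8
    }
}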

For MoE Models (Mixtral 8x7B, Qwen2-MoE)

Native Zig MoE Engine - Runs 26GB Mixtral on 16GB Macs with two-tier architecture:

Component             Location                Size
Attention (Q/K/V/O)   GPU VRAM (resident)     ~850 MB
Token Embeddings      RAM (lazy loaded)       ~500 MB
Expert Cache          RAM (LRU eviction)      8 GB budget
Expert Weights        SSD (mmap zero-copy)    ~24 GB

Current Performance (Mixtral 8x7B Q4_K_M on M2 Pro 16GB):

  • 0.7 tok/s measured with SSD streaming
  • 3.1 tok/s SOL ceiling (pure GPU, cache warm)
  • ~50% VIP cache hit rate after warmup

Optimization           What It Does                          Status
VIP Expert Pinning     Pins frequently-used experts in RAM   ✅ Implemented
Speculative Prefetch   Predicts experts 1 layer ahead        ✅ Implemented
Zero-copy mmap         OS-managed SSD streaming              ✅ Implemented
Batched GPU Compute    Single command buffer per token       ✅ Implemented

# Download Mixtral model (25GB Q4_K_M quantization)
# Models are stored in ~/.zell/models/blobs/

# Just use the same commands - MoE is auto-detected
zell build ~/.zell/models/blobs/Mixtral-8x7B-Instruct-v0.1-Q4_K_M.gguf
zell serve ~/.zell/models/blobs/Mixtral-8x7B-Instruct-v0.1-Q4_K_M.gguf  # All optimizations auto-enabled

Additional Optimizations

  • Zero-copy SSD streaming - OS-managed mmap, no explicit loading overhead
  • KV cache quantization - Q8_0 (50% savings) or Q4_0 (75% savings); see the arithmetic below
  • System prompt deduplication - Share KV cache across parallel requests
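
The quantization savings follow directly from the element width, assuming the usual f16 KV-cache baseline (per-block scales add a small overhead on top of these headline figures):

$$ \text{Q8\_0}: 1 - \tfrac{8}{16} = 50\% \text{ saved}, \qquad \text{Q4\_0}: 1 - \tfrac{4}{16} = 75\% \text{ saved} $$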

vs Ollama: Zell can run models roughly 3x larger than your RAM (a 46GB model on a 16GB Mac), while Ollama requires the whole model to fit in memory.

Note: On high-RAM systems (64GB+), oversized model optimizations auto-disable because weights fit in VRAM. This is intentional - Zell adapts to give you the fastest inference for YOUR hardware.

🧠 Automatic Long-Term Memory ("Sleep" Cycle)

  • Event-sourced conversation logging - Every interaction is recorded with monotonic event IDs
  • Automatic LoRA fine-tuning - After 100 conversations, trains in background while you work
  • Hippocampus architecture - Unconsolidated memories (recent) vs consolidated (trained)
  • Zero-config learning - No manual dataset preparation or training steps

🔧 Advanced Features

  • Layer-Split Inference - Run large models efficiently with automatic GPU/SSD split
  • System Prompt Caching - Instant startup with cached KV states
  • Hardware Safety Gates - Thermal and power monitoring for Mac hardware
  • Adaptive Context Sizing - Dynamic context window based on workload

⚡ Instant Startup

  • System prompt caching - Cache KV state during build for instant restoration
  • Profile-based optimization - Pre-analyzes models to prevent OOM crashes
  • Shader pre-compilation - Metal shaders compiled ahead of time

⚠️ Early Development

  • Experimental features under active development
  • Safety-first design with automatic OOM prevention
  • 30-day cache retention with automatic cleanup
  • Per-model memory isolation
  • Full observability via zell memory command

Quick Start

Installation

# Clone the repository
git clone https://github.com/teamchong/zell.git
cd zell

# Build
zig build

# Add to PATH (optional)
export PATH="$PATH:$(pwd)/zig-out/bin"

Note: Models are not included in the repository due to their large size. See the Models section below for download instructions.

Basic Usage

# 1. Build model profile (one-time, ~60s)
zell build ~/.zell/models/blobs/Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf

# 2. Cache system prompt for instant startup (optional but recommended)
zell build ~/.zell/models/blobs/Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf \
  --system-prompt "You are a helpful Zig programming assistant."

# 3. Run inference
zell run ~/.zell/models/blobs/Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf \
  "Write a hello world program in Zig"

# 4. Check learning progress
zell memory status ~/.zell/models/blobs/Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf

Server Mode (Recommended for Multiple Requests)

For interactive use or integrating with tools like Continue/aider, use server mode to keep the model loaded:

# Start server (model loads once, ~40s)
zell serve ~/.zell/models/blobs/Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf
# Server running on http://localhost:11434

# In another terminal, send requests (instant, <100ms each)
zell ask "What is comptime in Zig?"

# Or use curl with OpenAI-compatible API
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Hello"}]}'

Benefits:

  • First request: ~40s (model load)
  • Subsequent requests: <100ms (model already in memory)
  • OpenAI-compatible API for tool integration
  • Same port as Ollama (11434) for drop-in replacement

Server Optimizations (Automatic):

  • Continuous Batching - Process 4 users in parallel (3-4x throughput on M-series Macs)
  • System Prompt Caching - Share KV cache when users have same system prompt
  • JIT mmap Streaming - Let OS manage memory for 32B+ models on 16GB Macs
  • KV Cache Quantization - Q8_0/Q4_0 based on memory pressure

Using with Claude Code

Zell supports the Anthropic Messages API (/v1/messages), making it compatible with Claude Code:

# Start Zell server
zell serve ~/.zell/models/blobs/Qwen2.5-Coder-32B-Instruct-Q8_0.gguf

# In another terminal, configure Claude Code to use Zell
export ANTHROPIC_BASE_URL="http://localhost:11434"
export ANTHROPIC_AUTH_TOKEN="dummy"  # Zell ignores auth

# Start Claude Code - it will use Zell as the backend
claude

Or add to ~/.claude/settings.json:

{
  "env": {
    "ANTHROPIC_BASE_URL": "http://localhost:11434",
    "ANTHROPIC_AUTH_TOKEN": "dummy"
  }
}

API Endpoints:

  • POST /v1/messages - Anthropic Messages API (Claude Code)
  • POST /v1/chat/completions - OpenAI Chat API (Ollama-compatible)
  • GET /v1/models - List available models
  • GET /health - Health check with queue status

Benchmarks

Layer-Split Architecture: Run Models 3x Larger Than Your RAM

The Problem with Ollama:

  • Ollama requires models to fit entirely in memory
  • 16GB Mac → max ~14GB model (Qwen-14B)
  • Can't run Mixtral 8x7B (46GB) or Qwen-32B (34GB)

Zell's Solution: Automatic Layer Splitting

When you load a model > 80% of RAM, Zell automatically:

  1. Calculates optimal GPU/SSD split based on available memory (rough sketch after this list)
  2. Loads draft layers (20-25%) in GPU/RAM for instant inference
  3. Streams remaining target layers from SSD on demand
  4. Zero configuration - works out of the box
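
A rough sketch of step 1, assuming per-layer sizes are roughly uniform; the function name and constants are illustrative, since the real profiler also budgets for the KV cache and Metal working buffers:

fn residentLayerCount(total_layers: u32, model_bytes: u64, available_bytes: u64) u32 {
    const per_layer = model_bytes / total_layers;
    // Spend roughly half of the available memory on resident "draft" layers,
    // leaving the rest for the KV cache and streamed SSD pages.
    const resident_budget = available_bytes / 2;
    const gpu_layers: u32 = @intCast(resident_budget / per_layer);
    return @min(gpu_layers, total_layers);
}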

Real-World Examples:

# Dense model: 34GB Qwen-32B on 16GB Mac
zell build ~/.zell/models/blobs/Qwen2.5-Coder-32B-Instruct-Q8_0.gguf
zell run ~/.zell/models/blobs/Qwen2.5-Coder-32B-Instruct-Q8_0.gguf "Explain closures"

# Zell automatically:
# - Detects: 34GB model, 14.4GB available
# - Splits: 16 layers on GPU (instant), 48 layers on SSD
# - Preprocesses: Dense FFN (gate/up/down) per layer

# MoE model: 46GB Mixtral 8x7B on 16GB Mac
zell build ~/.zell/models/blobs/mixtral-8x7b-instruct-v0.1.Q8_0.gguf
zell run ~/.zell/models/blobs/mixtral-8x7b-instruct-v0.1.Q8_0.gguf "Write a Zig function"

# Zell automatically:
# - Detects: 46GB model, 14.4GB available, 8 experts
# - Splits: 8 layers on GPU (instant), 24 layers on SSD
# - Preprocesses: Router + 8 experts per layer
# - Runs: Native Zig MoE with expert routing

Architecture support:

  • Dense models: Qwen, Llama, Codestral, etc. (standard FFN)
  • MoE models: Mixtral, DeepSeek-MoE, etc. (native Zig 8-expert routing)

Key Advantage:

Platform   16GB Mac Can Run
Ollama     ~14GB models max
Zell       46GB+ models

System Prompt Caching Performance

Tested on Apple M2 Pro with Qwen2.5-Coder-7B-Instruct-Q4_K_M (4.4GB):

Metric             Cold Start   Warm Start   Improvement
Inference Time     11,269ms     6,413ms      4,856ms saved
Speedup            1.00x        1.76x        43.1% faster
Tokens Generated   5,875        5,875        Same quality

  • One-time cost: 13 seconds to cache the system prompt during build
  • Break-even: after just 2 requests
  • Every request after: nearly 5 seconds faster

See BENCHMARKS.md for detailed methodology and results.


Architecture: Event Sourcing + "Sleep" Cycle

Zell implements a novel approach to LLM memory inspired by biological sleep cycles and event sourcing:

The Event Log (Conversation Logging)

Every conversation is appended to an immutable JSONL log with monotonic event IDs:

{"event_id":1,"timestamp":1702934400,"model":"6d71222ef4099f00","prompt":"Hello","response":"Hi there!","tokens":12,"consolidated":false}
{"event_id":2,"timestamp":1702934460,"model":"6d71222ef4099f00","prompt":"What is Zig?","response":"Zig is...","tokens":156,"consolidated":false}

Storage: ~/.cache/zell/conversations/{model_hash}/{date}.jsonl

The "Reduce" Function (LoRA Training)

After 100 conversations, Zell automatically enters a "sleep" cycle (sketched in code after the steps below):

  1. Fetch unconsolidated memories - Recent conversations not yet trained on
  2. Sample consolidated memories - Random replay buffer to prevent forgetting
  3. Train LoRA adapter - Gradient descent on mixed batch (80% new, 20% old)
  4. Mark as consolidated - Update consolidated: true in event log
  5. Save projection - Write LoRA weights to ~/.cache/zell/lora/{model_hash}_auto.lora
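
In terms of the event-sourcing API shown under "Event Sourcing API" below, the cycle is roughly the following; trainLora and the event_id field access are placeholders for the real trainer and event type:

const std = @import("std");

fn sleepCycle(allocator: std.mem.Allocator, logger: *ConversationLogger) !void {
    // 1. Fetch unconsolidated memories (recent, untrained conversations).
    const new_memories = try logger.fetchUnconsolidated(100);
    defer allocator.free(new_memories);

    // 2. Sample consolidated memories as a replay buffer against forgetting.
    const old_memories = try logger.sampleConsolidated(20);
    defer allocator.free(old_memories);

    // 3. Train the LoRA adapter on the mixed batch (80% new, 20% old).
    try trainLora(new_memories, old_memories);

    // 4. Mark the new events as consolidated in the event log.
    const ids = try allocator.alloc(u64, new_memories.len);
    defer allocator.free(ids);
    for (new_memories, ids) |event, *id| id.* = event.event_id;
    try logger.markConsolidated(ids);

    // 5. The trainer then writes the adapter to {model_hash}_auto.lora.
}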

The Math: Instead of retraining the entire base model $W$ (10GB), we learn a low-rank adaptation:

$$ \text{Output} = \text{Input} \times (W_{\text{base}} + A \times B) $$

Where $A \times B$ is a tiny 10MB "sticky note" on top of the frozen base model.

Auto-Detection and Loading

The next inference run automatically detects and loads the trained adapter, checking in priority order:

  1. Check for {model_hash}_auto.lora (auto-trained)
  2. Fall back to manual .lora files in current directory
  3. Continue without adapter if none found

How LoRA Integration Works (Big Model + Tiny Model)

A common misconception is that LoRA creates "two models working together." In reality, LoRA modifies the base model's weights directly - it's more like a brain implant than a separate advisor.

The Math: Weight Matrix Fusion

During inference, LoRA doesn't "consult" a separate model. Instead, it merges into the base model:

Original:    Output = Input × W_base           (7B parameters)
With LoRA:   Output = Input × (W_base + ΔW)    (7B + 4MB parameters)
                              └────────────┘
                               Your "memory"

Where ΔW = A × B is your tiny adapter (two small matrices multiplied together).
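
The size gap comes from the low rank: a rank-$r$ adapter for one $d_{\text{out}} \times d_{\text{in}}$ projection stores $r(d_{\text{out}} + d_{\text{in}})$ values instead of $d_{\text{out}} \cdot d_{\text{in}}$. With illustrative dimensions $d_{\text{out}} = d_{\text{in}} = 4096$ and $r = 8$:

$$ \underbrace{4096 \times 4096}_{\text{full } \Delta W \,\approx\, 16.8\text{M values}} \quad\text{vs.}\quad \underbrace{8 \times (4096 + 4096)}_{A \text{ and } B \,=\, 65{,}536 \text{ values}} $$

That is about 256x smaller per projection before any quantization.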

Visual: The Fusion Process

┌─────────────────────────────────────────────────────┐
│                    INFERENCE (Single Pass)          │
├─────────────────────────────────────────────────────┤
│                                                     │
│   Input Token: "Write a function"                   │
│        │                                            │
│        ▼                                            │
│   ┌─────────────────────────────────────────────┐   │
│   │           TRANSFORMER LAYER                 │   │
│   │                                             │   │
│   │   W_base (frozen)    +    ΔW (your adapter) │   │
│   │   ┌────────────┐         ┌────────────┐     │   │
│   │   │ 7 Billion  │    +    │  4 MB      │     │   │
│   │   │ Parameters │         │ A × B      │     │   │
│   │   │ (Generic)  │         │ (Personal) │     │   │
│   │   └────────────┘         └────────────┘     │   │
│   │         │                      │            │   │
│   │         └──────────┬───────────┘            │   │
│   │                    │                        │   │
│   │                    ▼                        │   │
│   │            W_effective                      │   │
│   │     (Model that "knows" your style)         │   │
│   └─────────────────────────────────────────────┘   │
│        │                                            │
│        ▼                                            │
│   Output: Uses YOUR coding conventions              │
│                                                     │
└─────────────────────────────────────────────────────┘

Key Insight: Not Two Models, One Modified Model

What You Might Think                 What Actually Happens
"Two models talking to each other"   One model with modified weights
"LoRA looks up your memories"        Memories are baked into the neurons
"Slower because of extra model"      Same speed (single forward pass)
"7B + 7B = 14B parameters"           7B + 4MB ≈ 7.004B parameters

Why This Matters: Model Compatibility

Because LoRA physically modifies weight matrices, an adapter trained on Llama-3 will NOT work on Mistral:

Llama-3 adapter:  Trained to "correct" specific Llama-3 neurons
Mistral model:    Different neuron values, different architecture
Result:           Math produces garbage (or crash)

Zell handles this automatically (a minimal sketch of the hash check follows the list):

  1. Adapters are stored per-model: ~/.cache/zell/lora/{model_hash}/
  2. Validation on load: Checks model hash before applying adapter
  3. Auto-bootstrap: When you switch models, Zell re-trains from universal logs
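
A minimal sketch of check 2, assuming the adapter records the hash of the base model it was trained against; the LoraAdapter type and its field are hypothetical:

const std = @import("std");

const LoraAdapter = struct {
    base_model_hash: []const u8, // hash of the model this adapter was trained on
    // A/B weight tensors omitted for brevity
};

fn adapterMatchesModel(adapter: LoraAdapter, model_hash: []const u8) bool {
    // Refuse to apply an adapter trained against a different base model:
    // its deltas would be added to the wrong neurons and produce garbage.
    return std.mem.eql(u8, adapter.base_model_hash, model_hash);
}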

Commands

Core Operations

# Build model profile
zell build <model.gguf> [--force] [--system-prompt "..."]

# Run inference
zell run <model.gguf> [--image <path>]... <prompt>
zell run <model.gguf> --max-tokens 500 "Your prompt"

# Batch inference
zell batch <model.gguf> "Prompt 1" "Prompt 2" "Prompt 3"

Memory Management

# Check learning progress
zell memory status [model]

# Clear conversation logs
zell memory clear [model]

# List conversations (coming soon)
zell memory list [model]

Cache Management

# Clean old caches (>30 days) - DEFAULT
zell clear-cache

# Explicit old cleanup
zell clear-cache --old

# Clear ALL caches (caution!)
zell clear-cache --all

# Clear specific model
zell clear-cache <model.gguf>

Model Information

# Show model details
zell info <model.gguf>

# Show optimization profile
zell profile <model.gguf>

# LoRA adapter management
zell lora <model.gguf> <load|list|unload|info>

Storage and Caching

Storage Locations

~/.zell/
├── models/
│   └── blobs/              # GGUF model files storage
│       └── {sha256}.gguf   # Content-addressed model files
├── system-prompts/         # KV caches for instant startup
│   └── {hash}.kv          # Content-addressed by prompt hash
├── conversations/          # Event-sourced conversation logs
│   └── {model_hash}/
│       ├── 19723.jsonl    # Daily logs (days since epoch)
│       └── 19724.jsonl
├── lora/                   # Auto-trained adapters
│   └── {model_hash}_auto.lora
├── profiles/               # Model optimization profiles
│   └── {model}.gguf.json
└── shaders/                # Pre-compiled Metal shaders
    └── {model}.gguf.metallib

Cache Retention Policy

  • System prompts: 30 days (content-addressed, auto-invalidates on change)
  • Conversations: 30 days (event-sourced, immutable)
  • Profiles: Forever (small, model-specific)
  • Shaders: Forever (small, model-specific)

  • Automatic cleanup: runs during zell build
  • Manual cleanup: zell clear-cache --old

See CACHE.md for detailed cache management documentation.


Event Sourcing API

The conversation logger provides an event sourcing API for the Hippocampus:

const logger = try ConversationLogger.init(allocator, model_hash);
defer logger.deinit();

// Log new event (auto-increments event_id)
try logger.log(prompt, response, tokens_used);

// Get latest event ID
const latest_id = try logger.getLatestEventId();

// Fetch unconsolidated events (new memories)
const new_memories = try logger.fetchUnconsolidated(100);
defer allocator.free(new_memories);

// Sample consolidated events (replay buffer)
const old_memories = try logger.sampleConsolidated(20);
defer allocator.free(old_memories);

// Mark events as consolidated (after training)
try logger.markConsolidated(&[_]u64{ 1, 2, 3, 4, 5 });

Models

Zell supports any GGUF-format model. Due to their large size, models are not included in the repository.

Recommended Models

  • Mixtral-8x7B-Instruct-v0.1-Q4_K_M (26GB) - MoE model, runs on 16GB Mac with native Zig engine
  • Qwen2.5-Coder-7B-Instruct-Q4_K_M (4.4GB) - Recommended for coding tasks
  • Codestral-22B-v0.1-Q4_K_M (12GB) - Larger, more capable for complex tasks
  • Qwen2.5-Coder-32B-Instruct-Q8_0 (34GB) - Largest, runs on 16GB Mac with layer-split

Downloading Models

Option 1: Download from Hugging Face

# Create models directory
mkdir -p ~/.zell/models/blobs

# Download Mixtral 8x7B (26GB - runs on 16GB Mac with native Zig MoE engine)
curl -L -C - -o ~/.zell/models/blobs/Mixtral-8x7B-Instruct-v0.1-Q4_K_M.gguf \
  "https://huggingface.co/TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF/resolve/main/mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf"

# Download Qwen2.5-Coder-7B (smaller, faster)
wget https://huggingface.co/Qwen/Qwen2.5-Coder-7B-Instruct-GGUF/resolve/main/qwen2.5-coder-7b-instruct-q4_k_m.gguf \
  -O ~/.zell/models/blobs/Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf

# Or download Codestral-22B (larger)
wget https://huggingface.co/bartowski/Codestral-22B-v0.1-GGUF/resolve/main/Codestral-22B-v0.1-Q4_K_M.gguf \
  -O ~/.zell/models/blobs/Codestral-22B-v0.1-Q4_K_M.gguf

Option 2: Use any GGUF model

Zell works with any GGUF-format model. Download from Hugging Face or convert your own using llama.cpp.

# Use any model file
zell build /path/to/your/model.gguf
zell run /path/to/your/model.gguf "Your prompt here"

System Requirements

  • OS: macOS (Metal GPU required), Linux (CPU only)
  • Memory: 8GB RAM minimum (16GB+ recommended)
  • Disk: 20-150MB for caches + model size
  • Zig: 0.15.2 or later

Development

Project Structure

zell/
├── src/
│   ├── main.zig              # CLI and command routing
│   ├── mixtral_lazy.zig      # Two-tier Mixtral engine (attention resident, experts streaming)
│   ├── moe_metal.zig         # Metal GPU MoE kernels and expert cache
│   ├── llama.zig             # llama.cpp bindings (legacy)
│   ├── scheduler.zig         # Multi-cell parallel inference + prefix caching
│   ├── server.zig            # HTTP server with continuous batching
│   ├── optimizer.zig         # Model profiling and optimization
│   ├── conversation_logger.zig   # Event sourcing for conversations
│   ├── auto_trainer.zig      # Background LoRA training spawner
│   ├── lora_trainer.zig      # LoRA fine-tuning implementation
│   ├── snapshot.zig          # KV cache serialization
│   ├── budget_manager.zig    # Adaptive context sizing
│   ├── hippocampus.zig       # Sleep cycle state machine
│   ├── hardware_monitor.zig  # Power, thermal, rate limiting
│   ├── memory_monitor.zig    # Runtime memory pressure detection
│   ├── dashboard.zig         # TUI brain scan display
│   ├── test_paris.zig        # Paris test - inference correctness validation
│   └── inference/
│       ├── gguf_loader.zig   # GGUF file parsing and tensor loading
│       ├── gguf_tokenizer.zig # Pure Zig BPE tokenizer from GGUF vocab
│       └── metal/
│           └── moe_kernels.metal  # GPU kernels (attention, RoPE, MoE, softmax)
├── models/                   # GGUF models (stored in ~/.zell/models/blobs/)
├── vendor/llama.cpp/         # Upstream llama.cpp
└── build.zig                 # Build configuration

Building from Source

# Debug build
zig build

# Release build (optimized)
zig build -Doptimize=ReleaseFast

# Run tests
zig build test

# Clean build artifacts
rm -rf .zig-cache zig-out

Testing Native Zig MoE Inference

# The "Paris Test" - validates inference correctness
# Expected: "Paris" (token 5465) in top predictions
./zig-out/bin/test-paris ~/.zell/models/blobs/Mixtral-8x7B-Instruct-v0.1-Q4_K_M.gguf

# Performance benchmark (measures tok/s)
./zig-out/bin/bench-mixtral ~/.zell/models/blobs/Mixtral-8x7B-Instruct-v0.1-Q4_K_M.gguf [num_tokens]

Paris Test Results:

  • Model correctly predicts "Paris" in top-5 for "What is the capital of France?"
  • Instruction-tuned models prefer sentence responses ("The capital of France is Paris.") over direct answers
  • Token 415 ("The") as top prediction is expected behavior for INSTRUCT format

Native Zig MoE Architecture

The native Zig MoE engine uses a two-tier architecture optimized for 16GB Macs:

┌─────────────────────────────────────────────────────────────┐
│  TIER 1: GPU RESIDENT (~850 MB)                            │
│  ─────────────────────────────────────────────────────────  │
│  • Attention weights (Q/K/V/O) for all 32 layers           │
│  • Uploaded once at startup, stays in VRAM                 │
│  • RoPE, GQA attention, softmax run on GPU                 │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│  TIER 2: EXPERT CACHE (8 GB budget)                        │
│  ─────────────────────────────────────────────────────────  │
│  • LRU cache with VIP pinning for hot experts              │
│  • 256 total experts (32 layers × 8 experts)               │
│  • ~50% cache hit rate after warmup                        │
│  • Speculative prefetch predicts next layer's experts      │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│  SSD STORAGE (~24 GB mmap)                                 │
│  ─────────────────────────────────────────────────────────  │
│  • Zero-copy mmap for expert weights                       │
│  • OS-managed page cache                                   │
│  • ~266ms per expert load from SSD                         │
└─────────────────────────────────────────────────────────────┘
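
A simplified sketch of the Tier-2 expert cache; the key layout, linear scan, and field names are illustrative, while the real cache tracks Metal buffers, VIP statistics, and the speculative prefetch queue:

const std = @import("std");

const ExpertKey = struct { layer: u8, expert: u8 };

const ExpertCache = struct {
    const Entry = struct {
        key: ExpertKey,
        weights: []const u8, // expert weights resident in RAM
        last_used: u64,      // LRU timestamp
        vip: bool,           // pinned hot expert: never evicted
    };

    entries: std.ArrayList(Entry),
    bytes_used: usize = 0,
    budget_bytes: usize, // e.g. 8 GiB on a 16 GB Mac
    clock: u64 = 0,

    fn get(self: *ExpertCache, key: ExpertKey) ?[]const u8 {
        self.clock += 1;
        for (self.entries.items) |*e| {
            if (e.key.layer == key.layer and e.key.expert == key.expert) {
                e.last_used = self.clock; // cache hit: refresh recency
                return e.weights;
            }
        }
        return null; // miss: caller streams the expert from the mmap'd GGUF
    }

    fn evictUntil(self: *ExpertCache, needed: usize) void {
        while (self.bytes_used + needed > self.budget_bytes) {
            // Evict the least recently used non-VIP expert.
            var victim: ?usize = null;
            for (self.entries.items, 0..) |e, i| {
                if (e.vip) continue;
                if (victim == null or e.last_used < self.entries.items[victim.?].last_used)
                    victim = i;
            }
            const v = victim orelse return; // everything left is pinned
            self.bytes_used -= self.entries.items[v].weights.len;
            _ = self.entries.swapRemove(v);
        }
    }
};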

Performance on M2 Pro 16GB with Mixtral 8x7B Q4_K_M:

  • Cold start: ~10 min to fill 8GB expert cache
  • Measured: 0.7 tok/s (with SSD streaming)
  • SOL ceiling: 3.1 tok/s (pure GPU, cache warm)

Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

Code Style

  • Follow Zig standard library conventions
  • Use 4-space indentation
  • Add comments for complex logic
  • Write tests for new features

License

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

See LICENSE for the full license text.


Acknowledgments

  • llama.cpp - Core inference engine by Georgi Gerganov
  • LoRA - Low-Rank Adaptation by Microsoft Research
  • Event Sourcing - Architecture pattern by Martin Fowler
  • Biological Sleep Cycles - Inspiration for memory consolidation

Citation

If you use Zell in your research, please cite:

@software{zell2024,
  title = {Zell: Fast LLM Inference with Automatic Long-Term Memory},
  author = {Steven Chong},
  year = {2024},
  url = {https://github.com/teamchong/zell}
}
