Fast, Memory-Efficient LLM Inference Engine with Automatic Long-Term Memory
Zell is a high-performance local inference engine for running large language models (LLMs) on your machine. Unlike traditional inference tools, Zell automatically learns from every conversation through an event-sourced memory system and LoRA fine-tuning - no manual intervention required.
- Pure Zig MoE inference - No llama.cpp dependency for Mixtral
- Runs 26GB models on 16GB RAM - Two-tier architecture (GPU resident + SSD streaming)
- 43.1% faster inference with system prompt caching (1.76x speedup)
- 3-4x server throughput with continuous batching (4 parallel users)
- Adaptive context sizing (512 → 131K tokens) based on workload
- Optimized Metal GPU acceleration for macOS
Zell automatically profiles your system and model to choose the optimal inference strategy:
zell build <model>
│
▼
┌─────────────────────────────────────────────────────────────┐
│ AUTOMATIC PROFILER │
│ ───────────────── │
│ • Detects system RAM (e.g., 16GB, 64GB, 128GB) │
│ • Measures model size from GGUF │
│ • Identifies architecture (Dense vs MoE) │
│ • Saves optimal config to ~/.zell/profiles/ │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ DECISION MATRIX │
│ ─────────────── │
│ │
│ Model fits in RAM? │
│ YES → Standard inference (fastest) │
│ NO → Oversized model optimization ↓ │
│ │
│ Architecture? │
│ DENSE (Llama, Qwen) → Layer-Split Speculative Decoding │
│ MoE (Mixtral) → Batched Expert Processing │
└─────────────────────────────────────────────────────────────┘
Layer-Split Speculative Decoding - Single model acts as both draft and target:
- First 20-25% layers stay on GPU (fast draft generation)
- Remaining 75-80% layers stream from SSD (batched verification)
- Draft 8 tokens → Verify all 8 in one SSD stream → Accept/reject
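As a rough sketch of the loop described above (names such as `Engine`, `draftNextToken`, and `verifyOnTarget` are illustrative assumptions, not the engine's actual API):

```zig
const DRAFT_WINDOW = 8;

// Illustrative only: `Engine` stands in for the real inference state.
fn generateSpeculative(engine: *Engine, prompt: []const u32, max_tokens: usize) !void {
    try engine.prefill(prompt);
    var generated: usize = 0;

    while (generated < max_tokens) {
        // 1) Draft: run only the GPU-resident layers (first ~20-25%), one token at a time.
        var draft: [DRAFT_WINDOW]u32 = undefined;
        for (&draft) |*tok| tok.* = try engine.draftNextToken();

        // 2) Verify: stream the remaining ~75-80% of layers from SSD once,
        //    scoring all drafted tokens in a single batched pass.
        const accepted = try engine.verifyOnTarget(&draft);

        // 3) Keep the accepted prefix; resume drafting from the first rejection.
        try engine.commit(draft[0..accepted]);
        if (accepted < DRAFT_WINDOW) try engine.rollbackDraft(accepted);
        generated += accepted;
    }
}
```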
Native Zig MoE Engine - Runs 26GB Mixtral on 16GB Macs with two-tier architecture:
| Component | Location | Size |
|---|---|---|
| Attention (Q/K/V/O) | GPU VRAM (resident) | ~850 MB |
| Token Embeddings | RAM (lazy loaded) | ~500 MB |
| Expert Cache | RAM (LRU eviction) | 8 GB budget |
| Expert Weights | SSD (mmap zero-copy) | ~24 GB |
Current Performance (Mixtral 8x7B Q4_K_M on M2 Pro 16GB):
- 0.7 tok/s measured with SSD streaming
- 3.1 tok/s SOL ceiling (pure GPU, cache warm)
- ~50% VIP cache hit rate after warmup
| Optimization | What It Does | Status |
|---|---|---|
| VIP Expert Pinning | Pins frequently-used experts in RAM | ✅ Implemented |
| Speculative Prefetch | Predicts experts 1 layer ahead | ✅ Implemented |
| Zero-copy mmap | OS-managed SSD streaming | ✅ Implemented |
| Batched GPU Compute | Single command buffer per token | ✅ Implemented |
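Conceptually, these pieces combine into a single lookup path, sketched below under assumed names (`ExpertCache`, `vip`, `lru`, `ssd_mmap`); the real implementation lives in src/moe_metal.zig and src/mixtral_lazy.zig:

```zig
const ExpertId = struct { layer: u8, expert: u8 };

// Sketch only: field and method names are assumptions.
fn getExpert(cache: *ExpertCache, id: ExpertId) ![]const u8 {
    // VIP-pinned experts stay resident and are never evicted.
    if (cache.vip.get(id)) |weights| return weights;

    // LRU hit: bump recency and return from RAM.
    if (cache.lru.get(id)) |weights| {
        cache.lru.touch(id);
        return weights;
    }

    // Miss: read through the zero-copy mmap of the GGUF file; the OS page
    // cache performs the actual SSD I/O (~266 ms per cold expert).
    const weights = cache.ssd_mmap[cache.offsetOf(id)..][0..cache.expert_bytes];

    // Insert into the LRU tier, evicting cold experts to stay within the 8 GB budget.
    try cache.lru.insertEvicting(id, weights, cache.budget_bytes);
    return weights;
}
```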
# Download Mixtral model (25GB Q4_K_M quantization)
# Models are stored in ~/.zell/models/blobs/
# Just use the same commands - MoE is auto-detected
zell build ~/.zell/models/blobs/Mixtral-8x7B-Instruct-v0.1-Q4_K_M.gguf
zell serve ~/.zell/models/blobs/Mixtral-8x7B-Instruct-v0.1-Q4_K_M.gguf  # All optimizations auto-enabled
- Zero-copy SSD streaming - OS-managed mmap, no explicit loading overhead
- KV cache quantization - Q8_0 (50% savings) or Q4_0 (75% savings)
- System prompt deduplication - Share KV cache across parallel requests
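For reference, Q8_0 stores each block of 32 values as one f16 scale plus 32 signed bytes (the standard GGML-style layout), which is roughly half the size of an f16 KV cache. The sketch below shows the general quantization scheme, not Zell's internal code:

```zig
const BlockQ8_0 = struct {
    d: f16,      // per-block scale
    qs: [32]i8,  // quantized values
};

fn quantizeBlockQ8_0(x: *const [32]f32) BlockQ8_0 {
    var amax: f32 = 0;
    for (x) |v| amax = @max(amax, @abs(v));

    const d: f32 = amax / 127.0;
    const inv_d: f32 = if (d != 0) 1.0 / d else 0;

    var out = BlockQ8_0{ .d = @floatCast(d), .qs = undefined };
    for (x, &out.qs) |v, *q| {
        q.* = @intFromFloat(@round(v * inv_d));
    }
    return out;
}
```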
vs Ollama: Zell can run models roughly 3x larger than your RAM (a 46GB model on a 16GB Mac). Ollama requires the model to fit in memory.
Note: On high-RAM systems (64GB+), oversized model optimizations auto-disable because weights fit in VRAM. This is intentional - Zell adapts to give you the fastest inference for YOUR hardware.
- Event-sourced conversation logging - Every interaction is recorded with monotonic event IDs
- Automatic LoRA fine-tuning - After 100 conversations, trains in background while you work
- Hippocampus architecture - Unconsolidated memories (recent) vs consolidated (trained)
- Zero-config learning - No manual dataset preparation or training steps
- Layer-Split Inference - Run large models efficiently with automatic GPU/SSD split
- System Prompt Caching - Instant startup with cached KV states
- Hardware Safety Gates - Thermal and power monitoring for Mac hardware
- Adaptive Context Sizing - Dynamic context window based on workload
- System prompt caching - Cache KV state during build for instant restoration
- Profile-based optimization - Pre-analyzes models to prevent OOM crashes
- Shader pre-compilation - Metal shaders compiled ahead of time
- Experimental features under active development
- Safety-first design with automatic OOM prevention
- 30-day cache retention with automatic cleanup
- Per-model memory isolation
- Full observability via the `zell memory` command
# Clone the repository
git clone https://github.com/teamchong/zell.git
cd zell
# Build
zig build
# Add to PATH (optional)
export PATH="$PATH:$(pwd)/zig-out/bin"
Note: Models are not included in the repository due to their large size. See the Models section below for download instructions.
# 1. Build model profile (one-time, ~60s)
zell build ~/.zell/models/blobs/Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf
# 2. Cache system prompt for instant startup (optional but recommended)
zell build ~/.zell/models/blobs/Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf \
--system-prompt "You are a helpful Zig programming assistant."
# 3. Run inference
zell run ~/.zell/models/blobs/Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf \
"Write a hello world program in Zig"
# 4. Check learning progress
zell memory status ~/.zell/models/blobs/Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf
For interactive use, or to integrate with tools like Continue/aider, use server mode to keep the model loaded:
# Start server (model loads once, ~40s)
zell serve ~/.zell/models/blobs/Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf
# Server running on http://localhost:11434
# In another terminal, send requests (instant, <100ms each)
zell ask "What is comptime in Zig?"
# Or use curl with OpenAI-compatible API
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"messages":[{"role":"user","content":"Hello"}]}'Benefits:
- First request: ~40s (model load)
- Subsequent requests: <100ms (model already in memory)
- OpenAI-compatible API for tool integration
- Same port as Ollama (11434) for drop-in replacement
Server Optimizations (Automatic):
- Continuous Batching - Process 4 users in parallel (3-4x throughput on M-series Macs)
- System Prompt Caching - Share KV cache when users have same system prompt
- JIT mmap Streaming - Let OS manage memory for 32B+ models on 16GB Macs
- KV Cache Quantization - Q8_0/Q4_0 based on memory pressure
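A simplified view of how continuous batching and system-prompt sharing fit together; the `Server` type and its methods here are illustrative assumptions, not the actual internals of src/server.zig:

```zig
const MAX_SLOTS = 4; // parallel users, per the continuous-batching design above

// Sketch only.
fn serveLoop(server: *Server) !void {
    while (server.running) {
        // Admit queued requests into free slots. A request whose system prompt
        // hashes to an existing cache entry reuses that KV prefix instead of
        // re-prefilling it.
        while (server.activeCount() < MAX_SLOTS) {
            const req = server.queue.tryPop() orelse break;
            try server.admit(req);
        }

        // One decode step advances every active request by a single token, so
        // short and long generations interleave rather than queueing serially.
        try server.decodeStepAll();

        // Finished requests release their slot and KV cache immediately,
        // letting the next queued request start without waiting for the batch.
        server.reapFinished();
    }
}
```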
Zell supports the Anthropic Messages API (/v1/messages), making it compatible with Claude Code:
# Start Zell server
zell serve ~/.zell/models/blobs/Qwen2.5-Coder-32B-Instruct-Q8_0.gguf
# In another terminal, configure Claude Code to use Zell
export ANTHROPIC_BASE_URL="http://localhost:11434"
export ANTHROPIC_AUTH_TOKEN="dummy" # Zell ignores auth
# Start Claude Code - it will use Zell as the backend
claude
Or add to ~/.claude/settings.json:
{
"env": {
"ANTHROPIC_BASE_URL": "http://localhost:11434",
"ANTHROPIC_AUTH_TOKEN": "dummy"
}
}
API Endpoints:
- `POST /v1/messages` - Anthropic Messages API (Claude Code)
- `POST /v1/chat/completions` - OpenAI Chat API (Ollama-compatible)
- `GET /v1/models` - List available models
- `GET /health` - Health check with queue status
The Problem with Ollama:
- Ollama requires models to fit entirely in memory
- 16GB Mac → max ~14GB model (Qwen-14B)
- Can't run Mixtral 8x7B (46GB) or Qwen-32B (34GB)
Zell's Solution: Automatic Layer Splitting
When you load a model > 80% of RAM, Zell automatically:
- Calculates optimal GPU/SSD split based on available memory
- Loads draft layers (20-25%) in GPU/RAM for instant inference
- Streams remaining target layers from SSD on demand
- Zero configuration - works out of the box
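A back-of-the-envelope version of step 1, the split calculation (the 60% headroom fraction and the field names are illustrative assumptions, not the profiler's real heuristics):

```zig
fn planSplit(model_bytes: u64, n_layers: u32, available_ram_bytes: u64) struct { gpu_layers: u32, ssd_layers: u32 } {
    const per_layer = model_bytes / n_layers;

    // Leave headroom for the KV cache, activations, and the OS (illustrative 60%).
    const budget = available_ram_bytes * 6 / 10;

    const gpu_layers: u32 = @intCast(@min(n_layers, budget / per_layer));
    return .{ .gpu_layers = gpu_layers, .ssd_layers = n_layers - gpu_layers };
}
```

With the Qwen-32B example below (34GB over 64 layers, 14.4GB available), this style of calculation lands near the 16-GPU/48-SSD split Zell reports.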
Real-World Examples:
# Dense model: 34GB Qwen-32B on 16GB Mac
zell build ~/.zell/models/blobs/Qwen2.5-Coder-32B-Instruct-Q8_0.gguf
zell run ~/.zell/models/blobs/Qwen2.5-Coder-32B-Instruct-Q8_0.gguf "Explain closures"
# Zell automatically:
# - Detects: 34GB model, 14.4GB available
# - Splits: 16 layers on GPU (instant), 48 layers on SSD
# - Preprocesses: Dense FFN (gate/up/down) per layer
# MoE model: 46GB Mixtral 8x7B on 16GB Mac
zell build ~/.zell/models/blobs/mixtral-8x7b-instruct-v0.1.Q8_0.gguf
zell run ~/.zell/models/blobs/mixtral-8x7b-instruct-v0.1.Q8_0.gguf "Write a Zig function"
# Zell automatically:
# - Detects: 46GB model, 14.4GB available, 8 experts
# - Splits: 8 layers on GPU (instant), 24 layers on SSD
# - Preprocesses: Router + 8 experts per layer
# - Runs: Native Zig MoE with expert routing
Architecture support:
- Dense models: Qwen, Llama, Codestral, etc. (standard FFN)
- MoE models: Mixtral, DeepSeek-MoE, etc. (native Zig 8-expert routing)
Key Advantage:
| Platform | 16GB Mac Can Run |
|---|---|
| Ollama | ~14GB models max |
| Zell | 46GB+ models |
Tested on Apple M2 Pro with Qwen2.5-Coder-7B-Instruct-Q4_K_M (4.4GB):
| Metric | Cold Start | Warm Start | Improvement |
|---|---|---|---|
| Inference Time | 11,269ms | 6,413ms | 4,856ms saved |
| Speedup | 1.00x | 1.76x | 43.1% faster |
| Tokens Generated | 5,875 | 5,875 | Same quality |
- One-time cost: 13 seconds to cache the system prompt during build
- Break-even: After just 2 requests
- Every request after: Nearly 5 seconds faster
See BENCHMARKS.md for detailed methodology and results.
Zell implements a novel approach to LLM memory inspired by biological sleep cycles and event sourcing:
Every conversation is appended to an immutable JSONL log with monotonic event IDs:
{"event_id":1,"timestamp":1702934400,"model":"6d71222ef4099f00","prompt":"Hello","response":"Hi there!","tokens":12,"consolidated":false}
{"event_id":2,"timestamp":1702934460,"model":"6d71222ef4099f00","prompt":"What is Zig?","response":"Zig is...","tokens":156,"consolidated":false}Storage: ~/.cache/zell/conversations/{model_hash}/{date}.jsonl
After 100 conversations, Zell automatically enters a "sleep" cycle:
- Fetch unconsolidated memories - Recent conversations not yet trained on
- Sample consolidated memories - Random replay buffer to prevent forgetting
- Train LoRA adapter - Gradient descent on mixed batch (80% new, 20% old)
- Mark as consolidated - Update `consolidated: true` in the event log
- Save projection - Write LoRA weights to `~/.cache/zell/lora/{model_hash}_auto.lora`
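Tied together, one sleep cycle looks roughly like this, using the ConversationLogger API documented in the Developer API section below; `trainLora` and `saveAdapter` are hypothetical stand-ins for the background trainer:

```zig
const std = @import("std");

fn consolidate(allocator: std.mem.Allocator, logger: *ConversationLogger) !void {
    // 1) Recent, untrained memories (~80% of the training batch).
    const new_memories = try logger.fetchUnconsolidated(100);
    defer allocator.free(new_memories);

    // 2) Replay buffer sampled from already-consolidated events (~20%),
    //    which guards against catastrophic forgetting.
    const old_memories = try logger.sampleConsolidated(20);
    defer allocator.free(old_memories);

    // 3) Train the LoRA adapter on the mixed batch and save the projection
    //    to ~/.cache/zell/lora/{model_hash}_auto.lora.
    const adapter = try trainLora(allocator, new_memories, old_memories);
    try saveAdapter(adapter);

    // 4) Mark the new events as consolidated so they are not retrained.
    const ids = try allocator.alloc(u64, new_memories.len);
    defer allocator.free(ids);
    for (new_memories, ids) |event, *id| id.* = event.event_id;
    try logger.markConsolidated(ids);
}
```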
The Math: Instead of retraining the entire base model, Zell trains a low-rank update: W_effective = W_base + ΔW, where ΔW = A × B. W_base stays frozen, while A and B are two small matrices (a few MB) that encode what was learned from your conversations.
The next inference automatically detects and loads the trained adapter, in this priority order:
- Check for `{model_hash}_auto.lora` (auto-trained)
- Fall back to manual `.lora` files in the current directory
- Continue without an adapter if none is found
A common misconception is that LoRA creates "two models working together." In reality, LoRA modifies the base model's weights directly - it's more like a brain implant than a separate advisor.
During inference, LoRA doesn't "consult" a separate model. Instead, it merges into the base model:
Original: Output = Input × W_base (7B parameters)
With LoRA: Output = Input × (W_base + ΔW) (7B + 4MB parameters)
└────────────┘
Your "memory"
Where ΔW = A × B is your tiny adapter (two small matrices multiplied together).
┌─────────────────────────────────────────────────────┐
│ INFERENCE (Single Pass) │
├─────────────────────────────────────────────────────┤
│ │
│ Input Token: "Write a function" │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────┐ │
│ │ TRANSFORMER LAYER │ │
│ │ │ │
│ │ W_base (frozen) + ΔW (your adapter) │ │
│ │ ┌────────────┐ ┌────────────┐ │ │
│ │ │ 7 Billion │ + │ 4 MB │ │ │
│ │ │ Parameters │ │ A × B │ │ │
│ │ │ (Generic) │ │ (Personal) │ │ │
│ │ └────────────┘ └────────────┘ │ │
│ │ │ │ │ │
│ │ └──────────┬───────────┘ │ │
│ │ │ │ │
│ │ ▼ │ │
│ │ W_effective │ │
│ │ (Model that "knows" your style) │ │
│ └─────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ Output: Uses YOUR coding conventions │
│ │
└─────────────────────────────────────────────────────┘
| What You Might Think | What Actually Happens |
|---|---|
| "Two models talking to each other" | One model with modified weights |
| "LoRA looks up your memories" | Memories are baked into the neurons |
| "Slower because of extra model" | Same speed (single forward pass) |
| "7B + 7B = 14B parameters" | 7B + 4MB ≈ 7.004B parameters |
Because LoRA physically modifies weight matrices, an adapter trained on Llama-3 will NOT work on Mistral:
Llama-3 adapter: Trained to "correct" specific Llama-3 neurons
Mistral model: Different neuron values, different architecture
Result: The math produces garbage (or a crash)
Zell handles this automatically:
- Adapters are stored per-model: `~/.cache/zell/lora/{model_hash}/`
- Validation on load: Checks the model hash before applying an adapter
- Auto-bootstrap: When you switch models, Zell re-trains from universal logs
# Build model profile
zell build <model.gguf> [--force] [--system-prompt "..."]
# Run inference
zell run <model.gguf> [--image <path>]... <prompt>
zell run <model.gguf> --max-tokens 500 "Your prompt"
# Batch inference
zell batch <model.gguf> "Prompt 1" "Prompt 2" "Prompt 3"
# Check learning progress
zell memory status [model]
# Clear conversation logs
zell memory clear [model]
# List conversations (coming soon)
zell memory list [model]
# Clean old caches (>30 days) - DEFAULT
zell clear-cache
# Explicit old cleanup
zell clear-cache --old
# Clear ALL caches (caution!)
zell clear-cache --all
# Clear specific model
zell clear-cache <model.gguf>
# Show model details
zell info <model.gguf>
# Show optimization profile
zell profile <model.gguf>
# LoRA adapter management
zell lora <model.gguf> <load|list|unload|info>
~/.zell/
├── models/
│ └── blobs/ # GGUF model files storage
│ └── {sha256}.gguf # Content-addressed model files
├── system-prompts/ # KV caches for instant startup
│ └── {hash}.kv # Content-addressed by prompt hash
├── conversations/ # Event-sourced conversation logs
│ └── {model_hash}/
│ ├── 19723.jsonl # Daily logs (days since epoch)
│ └── 19724.jsonl
├── lora/ # Auto-trained adapters
│ └── {model_hash}_auto.lora
├── profiles/ # Model optimization profiles
│ └── {model}.gguf.json
└── shaders/ # Pre-compiled Metal shaders
└── {model}.gguf.metallib
- System prompts: 30 days (content-addressed, auto-invalidates on change)
- Conversations: 30 days (event-sourced, immutable)
- Profiles: Forever (small, model-specific)
- Shaders: Forever (small, model-specific)
Automatic cleanup: Runs during zell build
Manual cleanup: zell clear-cache --old
See CACHE.md for detailed cache management documentation.
The conversation logger provides an event sourcing API for the Hippocampus:
const logger = try ConversationLogger.init(allocator, model_hash);
defer logger.deinit();
// Log new event (auto-increments event_id)
try logger.log(prompt, response, tokens_used);
// Get latest event ID
const latest_id = try logger.getLatestEventId();
// Fetch unconsolidated events (new memories)
const new_memories = try logger.fetchUnconsolidated(100);
defer allocator.free(new_memories);
// Sample consolidated events (replay buffer)
const old_memories = try logger.sampleConsolidated(20);
defer allocator.free(old_memories);
// Mark events as consolidated (after training)
try logger.markConsolidated(&[_]u64{ 1, 2, 3, 4, 5 });
Zell supports any GGUF-format model. Due to their large size, models are not included in the repository.
- Mixtral-8x7B-Instruct-v0.1-Q4_K_M (26GB) - MoE model, runs on 16GB Mac with native Zig engine
- Qwen2.5-Coder-7B-Instruct-Q4_K_M (4.4GB) - Recommended for coding tasks
- Codestral-22B-v0.1-Q4_K_M (12GB) - Larger, more capable for complex tasks
- Qwen2.5-Coder-32B-Instruct-Q8_0 (34GB) - Largest, runs on 16GB Mac with layer-split
Option 1: Download from Hugging Face
# Create models directory
mkdir -p ~/.zell/models/blobs
# Download Mixtral 8x7B (26GB - runs on 16GB Mac with native Zig MoE engine)
curl -L -C - -o ~/.zell/models/blobs/Mixtral-8x7B-Instruct-v0.1-Q4_K_M.gguf \
"https://huggingface.co/TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF/resolve/main/mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf"
# Download Qwen2.5-Coder-7B (smaller, faster)
wget https://huggingface.co/Qwen/Qwen2.5-Coder-7B-Instruct-GGUF/resolve/main/qwen2.5-coder-7b-instruct-q4_k_m.gguf \
-O ~/.zell/models/blobs/Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf
# Or download Codestral-22B (larger)
wget https://huggingface.co/bartowski/Codestral-22B-v0.1-GGUF/resolve/main/Codestral-22B-v0.1-Q4_K_M.gguf \
  -O ~/.zell/models/blobs/Codestral-22B-v0.1-Q4_K_M.gguf
Option 2: Use any GGUF model
Zell works with any GGUF-format model. Download from Hugging Face or convert your own using llama.cpp.
# Use any model file
zell build /path/to/your/model.gguf
zell run /path/to/your/model.gguf "Your prompt here"
- OS: macOS (Metal GPU required), Linux (CPU only)
- Memory: 8GB RAM minimum (16GB+ recommended)
- Disk: 20-150MB for caches + model size
- Zig: 0.15.2 or later
zell/
├── src/
│ ├── main.zig # CLI and command routing
│ ├── mixtral_lazy.zig # Two-tier Mixtral engine (attention resident, experts streaming)
│ ├── moe_metal.zig # Metal GPU MoE kernels and expert cache
│ ├── llama.zig # llama.cpp bindings (legacy)
│ ├── scheduler.zig # Multi-cell parallel inference + prefix caching
│ ├── server.zig # HTTP server with continuous batching
│ ├── optimizer.zig # Model profiling and optimization
│ ├── conversation_logger.zig # Event sourcing for conversations
│ ├── auto_trainer.zig # Background LoRA training spawner
│ ├── lora_trainer.zig # LoRA fine-tuning implementation
│ ├── snapshot.zig # KV cache serialization
│ ├── budget_manager.zig # Adaptive context sizing
│ ├── hippocampus.zig # Sleep cycle state machine
│ ├── hardware_monitor.zig # Power, thermal, rate limiting
│ ├── memory_monitor.zig # Runtime memory pressure detection
│ ├── dashboard.zig # TUI brain scan display
│ ├── test_paris.zig # Paris test - inference correctness validation
│ └── inference/
│ ├── gguf_loader.zig # GGUF file parsing and tensor loading
│ ├── gguf_tokenizer.zig # Pure Zig BPE tokenizer from GGUF vocab
│ └── metal/
│ └── moe_kernels.metal # GPU kernels (attention, RoPE, MoE, softmax)
├── models/ # GGUF models (stored in ~/.zell/models/blobs/)
├── vendor/llama.cpp/ # Upstream llama.cpp
└── build.zig # Build configuration
# Debug build
zig build
# Release build (optimized)
zig build -Doptimize=ReleaseFast
# Run tests
zig build test
# Clean build artifacts
rm -rf zig-cache zig-out
# The "Paris Test" - validates inference correctness
# Expected: "Paris" (token 5465) in top predictions
./zig-out/bin/test-paris ~/.zell/models/blobs/Mixtral-8x7B-Instruct-v0.1-Q4_K_M.gguf
# Performance benchmark (measures tok/s)
./zig-out/bin/bench-mixtral ~/.zell/models/blobs/Mixtral-8x7B-Instruct-v0.1-Q4_K_M.gguf [num_tokens]
Paris Test Results:
- Model correctly predicts "Paris" in top-5 for "What is the capital of France?"
- Instruction-tuned models prefer sentence responses ("The capital of France is Paris.") over direct answers
- Token 415 ("The") as top prediction is expected behavior for INSTRUCT format
The native Zig MoE engine uses a two-tier architecture optimized for 16GB Macs:
┌─────────────────────────────────────────────────────────────┐
│ TIER 1: GPU RESIDENT (~850 MB) │
│ ───────────────────────────────────────────────────────── │
│ • Attention weights (Q/K/V/O) for all 32 layers │
│ • Uploaded once at startup, stays in VRAM │
│ • RoPE, GQA attention, softmax run on GPU │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ TIER 2: EXPERT CACHE (8 GB budget) │
│ ───────────────────────────────────────────────────────── │
│ • LRU cache with VIP pinning for hot experts │
│ • 256 total experts (32 layers × 8 experts) │
│ • ~50% cache hit rate after warmup │
│ • Speculative prefetch predicts next layer's experts │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ SSD STORAGE (~24 GB mmap) │
│ ───────────────────────────────────────────────────────── │
│ • Zero-copy mmap for expert weights │
│ • OS-managed page cache │
│ • ~266ms per expert load from SSD │
└─────────────────────────────────────────────────────────────┘
Performance on M2 Pro 16GB with Mixtral 8x7B Q4_K_M:
- Cold start: ~10 min to fill 8GB expert cache
- Measured: 0.7 tok/s (with SSD streaming)
- SOL ceiling: 3.1 tok/s (pure GPU, cache warm)
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
- Follow Zig standard library conventions
- Use 4-space indentation
- Add comments for complex logic
- Write tests for new features
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
See LICENSE for the full license text.
- llama.cpp - Core inference engine by Georgi Gerganov
- LoRA - Low-Rank Adaptation by Microsoft Research
- Event Sourcing - Architecture pattern by Martin Fowler
- Biological Sleep Cycles - Inspiration for memory consolidation
If you use Zell in your research, please cite:
@software{zell2024,
title = {Zell: Fast LLM Inference with Automatic Long-Term Memory},
author = {Steven Chong},
year = {2024},
url = {https://github.com/teamchong/zell}
}
- Issues: GitHub Issues
- Discussions: GitHub Discussions