Fast, Memory-Efficient LLM Inference Engine with Automatic Long-Term Memory
Zell is a high-performance local inference engine for running large language models (LLMs) on your machine. Unlike traditional inference tools, Zell automatically learns from every conversation through an event-sourced memory system and LoRA fine-tuning - no manual intervention required.
- Pure Zig MoE inference - No llama.cpp dependency for Mixtral
- Runs 26GB models on 16GB RAM - Two-tier architecture (GPU resident + SSD streaming)
- 43.1% faster inference with system prompt caching (1.76x speedup)
- 3-4x server throughput with continuous batching (4 parallel users)
- Adaptive context sizing (512 → 131K tokens) based on workload
- Optimized Metal GPU acceleration for macOS
Zell automatically profiles your system and model to choose the optimal inference strategy:
zell build <model>
│
▼
┌─────────────────────────────────────────────────────────────┐
│ AUTOMATIC PROFILER │
│ ───────────────── │
│ • Detects system RAM (e.g., 16GB, 64GB, 128GB) │
│ • Measures model size from GGUF │
│ • Identifies architecture (Dense vs MoE) │
│ • Saves optimal config to ~/.zell/profiles/ │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ DECISION MATRIX │
│ ─────────────── │
│ │
│ Model fits in RAM? │
│ YES → Standard inference (fastest) │
│ NO → Oversized model optimization ↓ │
│ │
│ Architecture? │
│ DENSE (Llama, Qwen) → Layer-Split Speculative Decoding │
│ MoE (Mixtral) → Batched Expert Processing │
└─────────────────────────────────────────────────────────────┘
Layer-Split Speculative Decoding - Single model acts as both draft and target:
- First 20-25% layers stay on GPU (fast draft generation)
- Remaining 75-80% layers stream from SSD (batched verification)
- Draft 8 tokens → Verify all 8 in one SSD stream → Accept/reject
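As a rough sketch of the loop described above (names such as `Engine`, `draftNextToken`, and `verifyOnTarget` are illustrative assumptions, not the engine's actual API):

```zig
const DRAFT_WINDOW = 8;

// Illustrative only: `Engine` stands in for the real inference state.
fn generateSpeculative(engine: *Engine, prompt: []const u32, max_tokens: usize) !void {
    try engine.prefill(prompt);
    var generated: usize = 0;

    while (generated < max_tokens) {
        // 1) Draft: run only the GPU-resident layers (first ~20-25%), one token at a time.
        var draft: [DRAFT_WINDOW]u32 = undefined;
        for (&draft) |*tok| tok.* = try engine.draftNextToken();

        // 2) Verify: stream the remaining ~75-80% of layers from SSD once,
        //    scoring all drafted tokens in a single batched pass.
        const accepted = try engine.verifyOnTarget(&draft);

        // 3) Keep the accepted prefix; resume drafting from the first rejection.
        try engine.commit(draft[0..accepted]);
        if (accepted < DRAFT_WINDOW) try engine.rollbackDraft(accepted);
        generated += accepted;
    }
}
```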
Native Zig MoE Engine - Runs 26GB Mixtral on 16GB Macs with two-tier architecture:
| Component | Location | Size |
|---|---|---|
| Attention (Q/K/V/O) | GPU VRAM (resident) | ~850 MB |
| Token Embeddings | RAM (lazy loaded) | ~500 MB |
| Expert Cache | RAM (LRU eviction) | 8 GB budget |
| Expert Weights | SSD (mmap zero-copy) | ~24 GB |
Current Performance (Mixtral 8x7B Q4_K_M on M2 Pro 16GB):
- 0.7 tok/s measured with SSD streaming
- 3.1 tok/s SOL ceiling (pure GPU, cache warm)
- ~50% VIP cache hit rate after warmup
| Optimization | What It Does | Status |
|---|---|---|
| VIP Expert Pinning | Pins frequently-used experts in RAM | ✅ Implemented |
| Speculative Prefetch | Predicts experts 1 layer ahead | ✅ Implemented |
| Zero-copy mmap | OS-managed SSD streaming | ✅ Implemented |
| Batched GPU Compute | Single command buffer per token | ✅ Implemented |
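Conceptually, these pieces combine into a single lookup path, sketched below under assumed names (`ExpertCache`, `vip`, `lru`, `ssd_mmap`); the real implementation lives in src/moe_metal.zig and src/mixtral_lazy.zig:

```zig
const ExpertId = struct { layer: u8, expert: u8 };

// Sketch only: field and method names are assumptions.
fn getExpert(cache: *ExpertCache, id: ExpertId) ![]const u8 {
    // VIP-pinned experts stay resident and are never evicted.
    if (cache.vip.get(id)) |weights| return weights;

    // LRU hit: bump recency and return from RAM.
    if (cache.lru.get(id)) |weights| {
        cache.lru.touch(id);
        return weights;
    }

    // Miss: read through the zero-copy mmap of the GGUF file; the OS page
    // cache performs the actual SSD I/O (~266 ms per cold expert).
    const weights = cache.ssd_mmap[cache.offsetOf(id)..][0..cache.expert_bytes];

    // Insert into the LRU tier, evicting cold experts to stay within the 8 GB budget.
    try cache.lru.insertEvicting(id, weights, cache.budget_bytes);
    return weights;
}
```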
# Download Mixtral model (25GB Q4_K_M quantization)
# Models are stored in ~/.zell/models/blobs/
# Just use the same commands - MoE is auto-detected
zell build ~/.zell/models/blobs/Mixtral-8x7B-Instruct-v0.1-Q4_K_M.gguf
zell serve ~/.zell/models/blobs/Mixtral-8x7B-Instruct-v0.1-Q4_K_M.gguf  # All optimizations auto-enabled
- Zero-copy SSD streaming - OS-managed mmap, no explicit loading overhead
- KV cache quantization - Q8_0 (50% savings) or Q4_0 (75% savings)
- System prompt deduplication - Share KV cache across parallel requests
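For reference, Q8_0 stores each block of 32 values as one f16 scale plus 32 signed bytes (the standard GGML-style layout), which is roughly half the size of an f16 KV cache. The sketch below shows the general quantization scheme, not Zell's internal code:

```zig
const BlockQ8_0 = struct {
    d: f16,      // per-block scale
    qs: [32]i8,  // quantized values
};

fn quantizeBlockQ8_0(x: *const [32]f32) BlockQ8_0 {
    var amax: f32 = 0;
    for (x) |v| amax = @max(amax, @abs(v));

    const d: f32 = amax / 127.0;
    const inv_d: f32 = if (d != 0) 1.0 / d else 0;

    var out = BlockQ8_0{ .d = @floatCast(d), .qs = undefined };
    for (x, &out.qs) |v, *q| {
        q.* = @intFromFloat(@round(v * inv_d));
    }
    return out;
}
```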
vs Ollama: Zell can run models roughly 3x larger than your RAM (a 46GB model on a 16GB Mac). Ollama requires the model to fit in memory.
Note: On high-RAM systems (64GB+), oversized model optimizations auto-disable because weights fit in VRAM. This is intentional - Zell adapts to give you the fastest inference for YOUR hardware.
- Event-sourced conversation logging - Every interaction is recorded with monotonic event IDs
- Automatic LoRA fine-tuning - After 100 conversations, trains in background while you work
- Hippocampus architecture - Unconsolidated memories (recent) vs consolidated (trained)
- Zero-config learning - No manual dataset preparation or training steps
- Layer-Split Inference - Run large models efficiently with automatic GPU/SSD split
- System Prompt Caching - Instant startup with cached KV states
- Hardware Safety Gates - Thermal and power monitoring for Mac hardware
- Adaptive Context Sizing - Dynamic context window based on workload
- System prompt caching - Cache KV state during build for instant restoration
- Profile-based optimization - Pre-analyzes models to prevent OOM crashes
- Shader pre-compilation - Metal shaders compiled ahead of time
- Experimental features under active development
- Safety-first design with automatic OOM prevention
- 30-day cache retention with automatic cleanup
- Per-model memory isolation
- Full observability via the `zell memory` command
# Clone the repository
git clone https://github.com/teamchong/zell.git
cd zell
# Build
zig build
# Add to PATH (optional)
export PATH="$PATH:$(pwd)/zig-out/bin"
Note: Models are not included in the repository due to their large size. See the Models section below for download instructions.
# 1. Build model profile (one-time, ~60s)
zell build ~/.zell/models/blobs/Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf
# 2. Cache system prompt for instant startup (optional but recommended)
zell build ~/.zell/models/blobs/Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf \
--system-prompt "You are a helpful Zig programming assistant."
# 3. Run inference
zell run ~/.zell/models/blobs/Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf \
"Write a hello world program in Zig"
# 4. Check learning progress
zell memory status ~/.zell/models/blobs/Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf
For interactive use, or to integrate with tools like Continue/aider, use server mode to keep the model loaded:
# Start server (model loads once, ~40s)
zell serve ~/.zell/models/blobs/Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf
# Server running on http://localhost:11434
# In another terminal, send requests (instant, <100ms each)
zell ask "What is comptime in Zig?"
# Or use curl with OpenAI-compatible API
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"messages":[{"role":"user","content":"Hello"}]}'Benefits:
- First request: ~40s (model load)
- Subsequent requests: <100ms (model already in memory)
- OpenAI-compatible API for tool integration
- Same port as Ollama (11434) for drop-in replacement
Server Optimizations (Automatic):
- Continuous Batching - Process 4 users in parallel (3-4x throughput on M-series Macs)
- System Prompt Caching - Share KV cache when users have same system prompt
- JIT mmap Streaming - Let OS manage memory for 32B+ models on 16GB Macs
- KV Cache Quantization - Q8_0/Q4_0 based on memory pressure
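A simplified view of how continuous batching and system-prompt sharing fit together; the `Server` type and its methods here are illustrative assumptions, not the actual internals of src/server.zig:

```zig
const MAX_SLOTS = 4; // parallel users, per the continuous-batching design above

// Sketch only.
fn serveLoop(server: *Server) !void {
    while (server.running) {
        // Admit queued requests into free slots. A request whose system prompt
        // hashes to an existing cache entry reuses that KV prefix instead of
        // re-prefilling it.
        while (server.activeCount() < MAX_SLOTS) {
            const req = server.queue.tryPop() orelse break;
            try server.admit(req);
        }

        // One decode step advances every active request by a single token, so
        // short and long generations interleave rather than queueing serially.
        try server.decodeStepAll();

        // Finished requests release their slot and KV cache immediately,
        // letting the next queued request start without waiting for the batch.
        server.reapFinished();
    }
}
```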
Zell supports the Anthropic Messages API (/v1/messages), making it compatible with Claude Code:
# Start Zell server
zell serve ~/.zell/models/blobs/Qwen2.5-Coder-32B-Instruct-Q8_0.gguf
# In another terminal, configure Claude Code to use Zell
export ANTHROPIC_BASE_URL="http://localhost:11434"
export ANTHROPIC_AUTH_TOKEN="dummy" # Zell ignores auth
# Start Claude Code - it will use Zell as the backend
claude
Or add to ~/.claude/settings.json:
{
"env": {
"ANTHROPIC_BASE_URL": "http://localhost:11434",
"ANTHROPIC_AUTH_TOKEN": "dummy"
}
}
API Endpoints:
- `POST /v1/messages` - Anthropic Messages API (Claude Code)
- `POST /v1/chat/completions` - OpenAI Chat API (Ollama-compatible)
- `GET /v1/models` - List available models
- `GET /health` - Health check with queue status
The Problem with Ollama:
- Ollama requires models to fit entirely in memory
- 16GB Mac → max ~14GB model (Qwen-14B)
- Can't run Mixtral 8x7B (46GB) or Qwen-32B (34GB)
Zell's Solution: Automatic Layer Splitting
When you load a model > 80% of RAM, Zell automatically:
- Calculates optimal GPU/SSD split based on available memory
- Loads draft layers (20-25%) in GPU/RAM for instant inference
- Streams remaining target layers from SSD on demand
- Zero configuration - works out of the box
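A back-of-the-envelope version of step 1, the split calculation (the 60% headroom fraction and the field names are illustrative assumptions, not the profiler's real heuristics):

```zig
fn planSplit(model_bytes: u64, n_layers: u32, available_ram_bytes: u64) struct { gpu_layers: u32, ssd_layers: u32 } {
    const per_layer = model_bytes / n_layers;

    // Leave headroom for the KV cache, activations, and the OS (illustrative 60%).
    const budget = available_ram_bytes * 6 / 10;

    const gpu_layers: u32 = @intCast(@min(n_layers, budget / per_layer));
    return .{ .gpu_layers = gpu_layers, .ssd_layers = n_layers - gpu_layers };
}
```

With the Qwen-32B example below (34GB over 64 layers, 14.4GB available), this style of calculation lands near the 16-GPU/48-SSD split Zell reports.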
Real-World Examples:
# Dense model: 34GB Qwen-32B on 16GB Mac
zell build ~/.zell/models/blobs/Qwen2.5-Coder-32B-Instruct-Q8_0.gguf
zell run ~/.zell/models/blobs/Qwen2.5-Coder-32B-Instruct-Q8_0.gguf "Explain closures"
# Zell automatically:
# - Detects: 34GB model, 14.4GB available
# - Splits: 16 layers on GPU (instant), 48 layers on SSD
# - Preprocesses: Dense FFN (gate/up/down) per layer
# MoE model: 46GB Mixtral 8x7B on 16GB Mac
zell build ~/.zell/models/blobs/mixtral-8x7b-instruct-v0.1.Q8_0.gguf
zell run ~/.zell/models/blobs/mixtral-8x7b-instruct-v0.1.Q8_0.gguf "Write a Zig function"
# Zell automatically:
# - Detects: 46GB model, 14.4GB available, 8 experts
# - Splits: 8 layers on GPU (instant), 24 layers on SSD
# - Preprocesses: Router + 8 experts per layer
# - Runs: Native Zig MoE with expert routing
Architecture support:
- Dense models: Qwen, Llama, Codestral, etc. (standard FFN)
- MoE models: Mixtral, DeepSeek-MoE, etc. (native Zig 8-expert routing)
Key Advantage:
| Platform | 16GB Mac Can Run |
|---|---|
| Ollama | ~14GB models max |
| Zell | 46GB+ models |
Tested on Apple M2 Pro with Qwen2.5-Coder-7B-Instruct-Q4_K_M (4.4GB):
| Metric | Cold Start | Warm Start | Improvement |
|---|---|---|---|
| Inference Time | 11,269ms | 6,413ms | 4,856ms saved |
| Speedup | 1.00x | 1.76x | 43.1% faster |
| Tokens Generated | 5,875 | 5,875 | Same quality |
- One-time cost: 13 seconds to cache the system prompt during build
- Break-even: After just 2 requests
- Every request after: Nearly 5 seconds faster
See BENCHMARKS.md for detailed methodology and results.
Zell implements a novel approach to LLM memory inspired by biological sleep cycles and event sourcing:
Every conversation is appended to an immutable JSONL log with monotonic event IDs:
{"event_id":1,"timestamp":1702934400,"model":"6d71222ef4099f00","prompt":"Hello","response":"Hi there!","tokens":12,"consolidated":false}
{"event_id":2,"timestamp":1702934460,"model":"6d71222ef4099f00","prompt":"What is Zig?","response":"Zig is...","tokens":156,"consolidated":false}Storage: ~/.cache/zell/conversations/{model_hash}/{date}.jsonl
After 100 conversations, Zell automatically enters a "sleep" cycle:
- Fetch unconsolidated memories - Recent conversations not yet trained on
- Sample consolidated memories - Random replay buffer to prevent forgetting
- Train LoRA adapter - Gradient descent on mixed batch (80% new, 20% old)
- Mark as consolidated - Update `consolidated: true` in the event log
- Save projection - Write LoRA weights to `~/.cache/zell/lora/{model_hash}_auto.lora`
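Tied together, one sleep cycle looks roughly like this, using the ConversationLogger API documented in the Developer API section below; `trainLora` and `saveAdapter` are hypothetical stand-ins for the background trainer:

```zig
const std = @import("std");

fn consolidate(allocator: std.mem.Allocator, logger: *ConversationLogger) !void {
    // 1) Recent, untrained memories (~80% of the training batch).
    const new_memories = try logger.fetchUnconsolidated(100);
    defer allocator.free(new_memories);

    // 2) Replay buffer sampled from already-consolidated events (~20%),
    //    which guards against catastrophic forgetting.
    const old_memories = try logger.sampleConsolidated(20);
    defer allocator.free(old_memories);

    // 3) Train the LoRA adapter on the mixed batch and save the projection
    //    to ~/.cache/zell/lora/{model_hash}_auto.lora.
    const adapter = try trainLora(allocator, new_memories, old_memories);
    try saveAdapter(adapter);

    // 4) Mark the new events as consolidated so they are not retrained.
    const ids = try allocator.alloc(u64, new_memories.len);
    defer allocator.free(ids);
    for (new_memories, ids) |event, *id| id.* = event.event_id;
    try logger.markConsolidated(ids);
}
```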
The Math: Instead of retraining the entire base model, Zell trains a low-rank update: W_effective = W_base + ΔW, where ΔW = A × B. W_base stays frozen, while A and B are two small matrices (a few MB) that encode what was learned from your conversations.
The next inference automatically detects and loads the trained adapter, in this priority order:
- Check for `{model_hash}_auto.lora` (auto-trained)
- Fall back to manual `.lora` files in the current directory
- Continue without an adapter if none is found
A common misconception is that LoRA creates "two models working together." In reality, LoRA modifies the base model's weights directly - it's more like a brain implant than a separate advisor.
During inference, LoRA doesn't "consult" a separate model. Instead, it merges into the base model:
Original: Output = Input × W_base (7B parameters)
With LoRA: Output = Input × (W_base + ΔW) (7B + 4MB parameters)
└────────────┘
Your "memory"
Where ΔW = A × B is your tiny adapter (two small matrices multiplied together).
┌─────────────────────────────────────────────────────┐
│ INFERENCE (Single Pass) │
├─────────────────────────────────────────────────────┤
│ │
│ Input Token: "Write a function" │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────┐ │
│ │ TRANSFORMER LAYER │ │
│ │ │ │
│ │ W_base (frozen) + ΔW (your adapter) │ │
│ │ ┌────────────┐ ┌────────────┐ │ │
│ │ │ 7 Billion │ + │ 4 MB │ │ │
│ │ │ Parameters │ │ A × B │ │ │
│ │ │ (Generic) │ │ (Personal) │ │ │
│ │ └────────────┘ └────────────┘ │ │
│ │ │ │ │ │
│ │ └──────────┬───────────┘ │ │
│ │ │ │ │
│ │ ▼ │ │
│ │ W_effective │ │
│ │ (Model that "knows" your style) │ │
│ └─────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ Output: Uses YOUR coding conventions │
│ │
└─────────────────────────────────────────────────────┘
| What You Might Think | What Actually Happens |
|---|---|
| "Two models talking to each other" | One model with modified weights |
| "LoRA looks up your memories" | Memories are baked into the neurons |
| "Slower because of extra model" | Same speed (single forward pass) |
| "7B + 7B = 14B parameters" | 7B + 4MB ≈ 7.004B parameters |
Because LoRA physically modifies weight matrices, an adapter trained on Llama-3 will NOT work on Mistral:
Llama-3 adapter: Trained to "correct" specific Llama-3 neurons
Mistral model: Different neuron values, different architecture
Result: The math produces garbage (or a crash)
Zell handles this automatically:
- Adapters are stored per-model: `~/.cache/zell/lora/{model_hash}/`
- Validation on load: Checks the model hash before applying an adapter
- Auto-bootstrap: When you switch models, Zell re-trains from universal logs
# Build model profile
zell build <model.gguf> [--force] [--system-prompt "..."]
# Run inference
zell run <model.gguf> [--image <path>]... <prompt>
zell run <model.gguf> --max-tokens 500 "Your prompt"
# Batch inference
zell batch <model.gguf> "Prompt 1" "Prompt 2" "Prompt 3"
# Check learning progress
zell memory status [model]
# Clear conversation logs
zell memory clear [model]
# List conversations (coming soon)
zell memory list [model]
# Clean old caches (>30 days) - DEFAULT
zell clear-cache
# Explicit old cleanup
zell clear-cache --old
# Clear ALL caches (caution!)
zell clear-cache --all
# Clear specific model
zell clear-cache <model.gguf>
# Show model details
zell info <model.gguf>
# Show optimization profile
zell profile <model.gguf>
# LoRA adapter management
zell lora <model.gguf> <load|list|unload|info>
~/.zell/
├── models/
│ └── blobs/ # GGUF model files storage
│ └── {sha256}.gguf # Content-addressed model files
├── system-prompts/ # KV caches for instant startup
│ └── {hash}.kv # Content-addressed by prompt hash
├── conversations/ # Event-sourced conversation logs
│ └── {model_hash}/
│ ├── 19723.jsonl # Daily logs (days since epoch)
│ └── 19724.jsonl
├── lora/ # Auto-trained adapters
│ └── {model_hash}_auto.lora
├── profiles/ # Model optimization profiles
│ └── {model}.gguf.json
└── shaders/ # Pre-compiled Metal shaders
└── {model}.gguf.metallib
- System prompts: 30 days (content-addressed, auto-invalidates on change)
- Conversations: 30 days (event-sourced, immutable)
- Profiles: Forever (small, model-specific)
- Shaders: Forever (small, model-specific)
Automatic cleanup: Runs during zell build
Manual cleanup: zell clear-cache --old
See CACHE.md for detailed cache management documentation.
The conversation logger provides an event sourcing API for the Hippocampus:
const logger = try ConversationLogger.init(allocator, model_hash);
defer logger.deinit();
// Log new event (auto-increments event_id)
try logger.log(prompt, response, tokens_used);
// Get latest event ID
const latest_id = try logger.getLatestEventId();
// Fetch unconsolidated events (new memories)
const new_memories = try logger.fetchUnconsolidated(100);
defer allocator.free(new_memories);
// Sample consolidated events (replay buffer)
const old_memories = try logger.sampleConsolidated(20);
defer allocator.free(old_memories);
// Mark events as consolidated (after training)
try logger.markConsolidated(&[_]u64{ 1, 2, 3, 4, 5 });
Zell supports any GGUF-format model. Due to their large size, models are not included in the repository.
- Mixtral-8x7B-Instruct-v0.1-Q4_K_M (26GB) - MoE model, runs on 16GB Mac with native Zig engine
- Qwen2.5-Coder-7B-Instruct-Q4_K_M (4.4GB) - Recommended for coding tasks
- Codestral-22B-v0.1-Q4_K_M (12GB) - Larger, more capable for complex tasks
- Qwen2.5-Coder-32B-Instruct-Q8_0 (34GB) - Largest, runs on 16GB Mac with layer-split
Option 1: Download from Hugging Face
# Create models directory
mkdir -p ~/.zell/models/blobs
# Download Mixtral 8x7B (26GB - runs on 16GB Mac with native Zig MoE engine)
curl -L -C - -o ~/.zell/models/blobs/Mixtral-8x7B-Instruct-v0.1-Q4_K_M.gguf \
"https://huggingface.co/TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF/resolve/main/mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf"
# Download Qwen2.5-Coder-7B (smaller, faster)
wget https://huggingface.co/Qwen/Qwen2.5-Coder-7B-Instruct-GGUF/resolve/main/qwen2.5-coder-7b-instruct-q4_k_m.gguf \
-O ~/.zell/models/blobs/Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf
# Or download Codestral-22B (larger)
wget https://huggingface.co/bartowski/Codestral-22B-v0.1-GGUF/resolve/main/Codestral-22B-v0.1-Q4_K_M.gguf \
  -O ~/.zell/models/blobs/Codestral-22B-v0.1-Q4_K_M.gguf
Option 2: Use any GGUF model
Zell works with any GGUF-format model. Download from Hugging Face or convert your own using llama.cpp.
# Use any model file
zell build /path/to/your/model.gguf
zell run /path/to/your/model.gguf "Your prompt here"
- OS: macOS (Metal GPU required), Linux (CPU only)
- Memory: 8GB RAM minimum (16GB+ recommended)
- Disk: 20-150MB for caches + model size
- Zig: 0.15.2 or later
zell/
├── src/
│ ├── main.zig # CLI and command routing
│ ├── mixtral_lazy.zig # Two-tier Mixtral engine (attention resident, experts streaming)
│ ├── moe_metal.zig # Metal GPU MoE kernels and expert cache
│ ├── llama.zig # llama.cpp bindings (legacy)
│ ├── scheduler.zig # Multi-cell parallel inference + prefix caching
│ ├── server.zig # HTTP server with continuous batching
│ ├── optimizer.zig # Model profiling and optimization
│ ├── conversation_logger.zig # Event sourcing for conversations
│ ├── auto_trainer.zig # Background LoRA training spawner
│ ├── lora_trainer.zig # LoRA fine-tuning implementation
│ ├── snapshot.zig # KV cache serialization
│ ├── budget_manager.zig # Adaptive context sizing
│ ├── hippocampus.zig # Sleep cycle state machine
│ ├── hardware_monitor.zig # Power, thermal, rate limiting
│ ├── memory_monitor.zig # Runtime memory pressure detection
│ ├── dashboard.zig # TUI brain scan display
│ ├── test_paris.zig # Paris test - inference correctness validation
│ └── inference/
│ ├── gguf_loader.zig # GGUF file parsing and tensor loading
│ ├── gguf_tokenizer.zig # Pure Zig BPE tokenizer from GGUF vocab
│ └── metal/
│ └── moe_kernels.metal # GPU kernels (attention, RoPE, MoE, softmax)
├── models/ # GGUF models (stored in ~/.zell/models/blobs/)
├── vendor/llama.cpp/ # Upstream llama.cpp
└── build.zig # Build configuration
# Debug build
zig build
# Release build (optimized)
zig build -Doptimize=ReleaseFast
# Run tests
zig build test
# Clean build artifacts
rm -rf zig-cache zig-out
# The "Paris Test" - validates inference correctness
# Expected: "Paris" (token 5465) in top predictions
./zig-out/bin/test-paris ~/.zell/models/blobs/Mixtral-8x7B-Instruct-v0.1-Q4_K_M.gguf
# Performance benchmark (measures tok/s)
./zig-out/bin/bench-mixtral ~/.zell/models/blobs/Mixtral-8x7B-Instruct-v0.1-Q4_K_M.gguf [num_tokens]
Paris Test Results:
- Model correctly predicts "Paris" in top-5 for "What is the capital of France?"
- Instruction-tuned models prefer sentence responses ("The capital of France is Paris.") over direct answers
- Token 415 ("The") as top prediction is expected behavior for INSTRUCT format
The native Zig MoE engine uses a two-tier architecture optimized for 16GB Macs:
┌─────────────────────────────────────────────────────────────┐
│ TIER 1: GPU RESIDENT (~850 MB) │
│ ───────────────────────────────────────────────────────── │
│ • Attention weights (Q/K/V/O) for all 32 layers │
│ • Uploaded once at startup, stays in VRAM │
│ • RoPE, GQA attention, softmax run on GPU │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ TIER 2: EXPERT CACHE (8 GB budget) │
│ ───────────────────────────────────────────────────────── │
│ • LRU cache with VIP pinning for hot experts │
│ • 256 total experts (32 layers × 8 experts) │
│ • ~50% cache hit rate after warmup │
│ • Speculative prefetch predicts next layer's experts │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ SSD STORAGE (~24 GB mmap) │
│ ───────────────────────────────────────────────────────── │
│ • Zero-copy mmap for expert weights │
│ • OS-managed page cache │
│ • ~266ms per expert load from SSD │
└─────────────────────────────────────────────────────────────┘
Performance on M2 Pro 16GB with Mixtral 8x7B Q4_K_M:
- Cold start: ~10 min to fill 8GB expert cache
- Measured: 0.7 tok/s (with SSD streaming)
- SOL ceiling: 3.1 tok/s (pure GPU, cache warm)
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
- Follow Zig standard library conventions
- Use 4-space indentation
- Add comments for complex logic
- Write tests for new features
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
See LICENSE for the full license text.
- llama.cpp - Core inference engine by Georgi Gerganov
- LoRA - Low-Rank Adaptation by Microsoft Research
- Event Sourcing - Architecture pattern by Martin Fowler
- Biological Sleep Cycles - Inspiration for memory consolidation
If you use Zell in your research, please cite:
@software{zell2024,
title = {Zell: Fast LLM Inference with Automatic Long-Term Memory},
author = {Steven Chong},
year = {2024},
url = {https://github.com/teamchong/zell}
}
- Issues: GitHub Issues
- Discussions: GitHub Discussions