MLXR

High-performance, macOS-native LLM inference engine for Apple Silicon.

Overview

MLXR is a local LLM runner built specifically for Apple Silicon (M4, M3, M2) that combines:

  • MLX framework for tensor/graph management
  • Custom Metal kernels for performance-critical operations
  • OpenAI and Ollama-compatible APIs for seamless integration
  • React-based GUI with real-time streaming

Key Features

  • Native Performance: Custom Metal kernels optimized for Apple's unified memory architecture
  • Memory Efficient: Paged KV cache with smart eviction policies
  • High Throughput: Continuous batching and speculative decoding
  • Model Support: GGUF, HF safetensors, and native MLX formats
  • Quantization: Full support for Q2_K through Q8_K, FP8, and NF4
  • Developer Friendly: OpenAI and Ollama-compatible REST APIs (see the example below)
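
Because the API follows the OpenAI wire format, standard clients can point at the local daemon. Below is a minimal sketch using the official openai Python package; the base URL, port, and model name are assumptions and should be checked against your daemon configuration and model registry.

# pip install openai
from openai import OpenAI

# Hypothetical listen address; MLXR's actual port/base URL may differ.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="llama-3-8b-instruct",  # hypothetical model name from the local registry
    messages=[{"role": "user", "content": "Hello from MLXR!"}],
)
print(response.choices[0].message.content)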

Architecture

┌─────────────────────────┐
│   React WebView GUI     │
│   (Tray/Dock App)       │
└───────────┬─────────────┘
            │ Unix Domain Socket
┌───────────▼─────────────┐
│   Daemon (REST/gRPC)    │
│   - OpenAI API          │
│   - Ollama API          │
│   - Model Registry      │
└───────────┬─────────────┘
            │
┌───────────▼─────────────┐
│   Inference Core        │
│   - MLX Graph           │
│   - Metal Kernels       │
│   - Paged KV Cache      │
│   - Continuous Batching │
└─────────────────────────┘
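
The GUI reaches the daemon over a Unix domain socket rather than a TCP port. As a transport illustration only, the Python sketch below sends a raw HTTP request across a UDS using the standard library; the socket path and the /health route are assumptions, not confirmed values.

import socket

SOCKET_PATH = "/tmp/mlxr.sock"  # hypothetical path; the daemon's real socket may live elsewhere

request = (
    "GET /health HTTP/1.1\r\n"
    "Host: localhost\r\n"
    "Connection: close\r\n"
    "\r\n"
)

with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
    sock.connect(SOCKET_PATH)
    sock.sendall(request.encode("utf-8"))
    chunks = []
    while True:  # read until the server closes the connection
        data = sock.recv(4096)
        if not data:
            break
        chunks.append(data)

print(b"".join(chunks).decode("utf-8", errors="replace"))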

Performance Targets (M4)

  • First Token: < 1s for 7B-8B models at 4-bit
  • Decode: < 80ms/token steady-state
  • Embeddings: < 20ms/sample
  • Occupancy: ≥ 60% GPU utilization on attention kernels
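
For reference, the decode target translates directly into a steady-state throughput floor; the conversion below is pure arithmetic, not a measurement.

ms_per_token = 80.0
print(f"{1000.0 / ms_per_token:.1f} tokens/sec at the decode target")  # 12.5 tok/s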

Requirements

  • macOS 14.0 (Sonoma) or later
  • Apple Silicon (M2, M3, or M4)
  • Xcode 15+ (for building)
  • Homebrew package manager
  • CMake 3.20+
  • Python 3.11+ (with MLX installed)
  • Node.js 18+ (for frontend development)

System Dependencies

The following Homebrew packages are required:

# Install all dependencies at once
brew install cmake ninja mlx sentencepiece nlohmann-json cpp-httplib googletest

# Or use the Makefile convenience target
make install-deps

Note: CMake and Ninja must be installed via Homebrew, not Conda; the build expects the Homebrew toolchain and may not work with the conda-forge packages.

Project Status

Core Infrastructure Complete - Integration Work Remaining

Codebase Size: ~50,000 LOC across core, daemon, app, tests, and SDKs

Completed Features

Phase 1: Minimal Inference ✅ 100%

  • Complete Llama model with safetensors loading (737 lines)
  • SentencePiece tokenizer (252 lines)
  • Sampling strategies (greedy, temperature, top-k, top-p) - 534 lines (see the sketch after this list)
  • Working text generation pipeline
  • Example: simple_generation.cpp - WORKS
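
As a reference for what the sampling stage does, here is a toy Python version of the temperature / top-k / top-p chain. It is purely illustrative: the engine's C++ implementation may order or combine the filters differently.

import numpy as np

def sample_next_token(logits, temperature=0.8, top_k=40, top_p=0.95, rng=None):
    """Toy sampler: temperature scaling, then top-k, then nucleus (top-p) filtering."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64) / max(temperature, 1e-6)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    # top-k: zero out everything outside the k most probable tokens
    if top_k > 0:
        kth_largest = np.sort(probs)[-top_k]
        probs = np.where(probs >= kth_largest, probs, 0.0)

    # top-p: keep the smallest prefix (by probability) whose mass reaches p
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, top_p * probs.sum())) + 1
    mask = np.zeros_like(probs)
    mask[order[:cutoff]] = 1.0
    probs *= mask

    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

# Greedy decoding is the temperature -> 0 limit: simply take argmax(logits).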

Phase 2: Optimization ✅ 95%

  • KV Cache System - Complete paged arena (2,373 lines) with LRU eviction and GQA support (concept sketch after this list)
  • Scheduler - Continuous batching (439 lines) with prefill/decode separation
  • Metal Kernels - All 6 kernels implemented (~5,200 LOC total):
    • RMSNorm: 217 lines shader + 362 lines primitive - INTEGRATED & TESTED (81/81 tests) ✅
    • Attention Decode: 295 + 574 lines - Ready for integration
    • Attention Prefill: 370 + 633 lines - Ready for integration
    • RoPE: 434 + 478 lines - Ready for integration
    • SwiGLU MLP: 432 + 321 lines - Ready for integration
    • Q-Gemm Dequant: 486 + 525 lines - Ready for integration
  • Test Daemon - Working HTTP server (test_daemon) with health/models endpoints
  • ⚠️ Integration Gap: Metal kernels need wiring in CachedAttention (8-16 hours)
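
To make the paged-cache idea concrete, here is a deliberately small Python model of it: each sequence owns a list of fixed-size pages, and an LRU order over sequences decides whose pages are reclaimed under memory pressure. This is a concept sketch only and does not mirror the actual C++ arena layout, GQA handling, or eviction-policy details.

from collections import OrderedDict

PAGE_SIZE = 16  # tokens per page; illustrative only

class ToyPagedKVCache:
    def __init__(self, max_pages: int):
        self.free_pages = list(range(max_pages))
        self.pages = OrderedDict()   # seq_id -> list of page ids, kept in LRU order
        self.lengths = {}            # seq_id -> tokens written so far

    def append_token(self, seq_id: int) -> int:
        """Return the page id that should hold the next token of seq_id."""
        seq_pages = self.pages.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        self.pages.move_to_end(seq_id)       # mark as most recently used
        if length % PAGE_SIZE == 0:          # current page is full (or none allocated yet)
            if not self.free_pages:
                self._evict_lru(protect=seq_id)
            seq_pages.append(self.free_pages.pop())
        self.lengths[seq_id] = length + 1
        return seq_pages[-1]

    def _evict_lru(self, protect: int):
        """Reclaim every page of the least recently used sequence (never the caller's)."""
        for victim in list(self.pages):
            if victim != protect:
                self.free_pages.extend(self.pages.pop(victim))
                self.lengths.pop(victim, None)
                return
        raise RuntimeError("no evictable sequence")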

Phase 3: Service Layer ✅ 70%

  • REST API - OpenAI & Ollama-compatible endpoints (1,758 lines)
  • gRPC Server - FULLY IMPLEMENTED (1,101 lines) with streaming
  • SSE Streaming - Real-time token generation (621 lines) - consumer sketch after this list
  • Model Registry - SQLite catalog (1,137 lines) with GGUF parser (891 lines)
  • Telemetry - Metrics collection (769 lines, 15/15 tests passing)
  • Test Suite - 14 C++ unit test files, 299 total tests
  • ⚠️ Integration Gap: Model loading → Engine → Worker wiring (4-8 hours)
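
The streaming endpoint can be consumed like any OpenAI-style SSE stream. The sketch below assumes that wire format (lines prefixed with "data: ", terminated by "data: [DONE]"); the URL, port, and model name are assumptions.

import json
import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",  # hypothetical listen address
    json={
        "model": "llama-3-8b-instruct",           # hypothetical registered model
        "messages": [{"role": "user", "content": "Stream a haiku."}],
        "stream": True,
    },
    stream=True,
)

for raw in resp.iter_lines():
    if not raw or not raw.startswith(b"data: "):
        continue
    payload = raw[len(b"data: "):]
    if payload == b"[DONE]":
        break
    chunk = json.loads(payload)
    print(chunk["choices"][0]["delta"].get("content", ""), end="", flush=True)
print()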

Frontend ✅ 90%

  • React UI - 78 components fully implemented
  • Chat Interface - With streaming and tool calls
  • Model Management - Pull, import, convert, quantize
  • Metrics Dashboard - Real-time performance visualization
  • All Pages - Chat, Models, Playgrounds, Metrics, Settings, Logs

macOS App ✅ 90%

  • Swift Components - 20 files implementing app host
  • JavaScript Bridge - Complete UDS communication
  • Xcode Project - Exists and configured
  • Daemon Management - launchd integration
  • ⚠️ Missing: .app bundle build, code signing, .dmg creation

SDKs ✅ 95%

  • Python SDK - Complete client with async support
  • TypeScript SDK - Full type definitions and clients
  • Swift SDK - SwiftPM package with examples

Current Performance (TinyLlama 1.1B)

  • Prefill: 198-459 ms (5-10 tokens)
  • Decode: 53-220 ms/token
  • Throughput: 4.5-18.9 tokens/sec
  • Memory: 87.5% KV-cache reduction with GQA (308 MB saved) - see the note below
  • Expected after Metal integration: 2-5x improvement
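
The 87.5% figure is consistent with grouped-query attention sharing each KV head across several query heads; the check below assumes TinyLlama-style head counts (32 query heads, 4 KV heads), which should be verified against the actual model config.

n_heads, n_kv_heads = 32, 4
print(f"KV cache reduction from GQA: {1 - n_kv_heads / n_heads:.1%}")  # 87.5%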

Next Steps (Critical Path)

P0 - Get it Working (14-28 hours):

  1. ⚠️ Metal Kernel Integration (8-16h) - Wire attention kernels in CachedAttention
  2. ⚠️ Daemon Model Loading (4-8h) - Complete load_model() in REST server
  3. Server Config (2-4h) - Create configs/server.yaml

P1 - Make it Fast (20-32 hours):

  4. Quantization (8-12h) - GGUF loading + Q-gemm integration
  5. RoPE/SwiGLU Kernels (6-10h) - Wire remaining kernels
  6. Speculative Decoding (6-10h) - Connect draft model

P2 - Ship It (36-60 hours):

  7. App Bundle (8-16h) - .dmg, code signing, auto-update
  8. CPU Fallbacks (16-24h) - Neon SIMD implementations
  9. Conversion Tools (12-20h) - GGUF→MLX, quantizers

See CLAUDE.md for comprehensive implementation roadmap and accurate metrics.

Repository Structure

MLXR/
  app/           # macOS app bundle & React UI
  daemon/        # Background server (REST/gRPC)
  core/          # Inference engine (C++/MLX/Metal)
  tools/         # Model converters and utilities
  sdks/          # Client libraries (Python, TS, Swift)
  configs/       # Configuration files
  scripts/       # Build and development scripts
  tests/         # Test suites
  plan/          # Architecture specifications

Getting Started

1. Install Dependencies

# Install Homebrew (if not already installed)
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

# Clone the repository
git clone <repository-url>
cd MLXR

# Install system dependencies
make install-deps

# Setup Python environment (recommended)
make setup
conda activate mlxr

# Check installation
make status

2. Build

# Full build (Metal shaders + C++ core)
make build

# Or quick development build (Metal only)
make dev

# Run tests
make test-cpp

3. Run

# Run daemon
./build/cmake/bin/mlxrunnerd

# Develop frontend (separate terminal)
cd app/ui
yarn install
yarn dev
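
Once the daemon is up, a quick smoke test can confirm it is serving. The port and routes below are assumptions based on the health/models endpoints mentioned earlier; adjust them to your configuration.

import requests

BASE = "http://localhost:8080"  # hypothetical listen address

print(requests.get(f"{BASE}/health").json())     # daemon liveness
print(requests.get(f"{BASE}/v1/models").json())  # registered models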

For detailed build instructions, see CLAUDE.md.

Development Phases

✅ Phase 0: Foundation (COMPLETE)

  • Repository structure and build system
  • Metal shader compilation pipeline
  • Toolchain validation (Homebrew, CMake, Ninja)

✅ Phase 1: Minimal Inference Core (COMPLETE)

  • MLX integration and model loading (safetensors)
  • SentencePiece tokenizer
  • Single-request inference with FP16
  • Working examples in examples/

⏳ Phase 2: Optimization (85% COMPLETE)

  • Paged KV cache with eviction policies
  • Metal kernel implementations (RMSNorm tested, others implemented)
  • Continuous batching scheduler
  • GQA support for memory efficiency
  • Next: CachedLlamaModel integration with Engine

⏳ Phase 3: Service Layer (60% COMPLETE)

  • REST API daemon (OpenAI & Ollama compatible)
  • Model registry with SQLite backend
  • SSE streaming for real-time generation
  • Telemetry and metrics
  • Next: Full API endpoint integration

🔜 Phase 4: Frontend & Distribution (Planned)

  • macOS app bundle with React WebView
  • Unix domain socket communication
  • Auto-updates via Sparkle
  • Code signing and notarization

See plan/SPEC01.md for complete roadmap and docs/IMPLEMENTATION_STATUS.md for current status.

Documentation

Developer Guides: CLAUDE.md (development guidelines and build instructions)

Architecture & Planning: plan/SPEC01.md (roadmap and architecture specifications)

Implementation Details: docs/IMPLEMENTATION_STATUS.md (current status and metrics), docs/SECURITY_FIXES.md (security practices)

Contributing

This project is actively developed and welcomes contributions!

Current Focus Areas:

  • CachedLlamaModel integration with Engine
  • Metal kernel optimization and testing
  • OpenAI API endpoint completion
  • Performance benchmarking and profiling

Before Contributing:

  1. Read CLAUDE.md for development guidelines
  2. Check docs/IMPLEMENTATION_STATUS.md for current status
  3. Review docs/SECURITY_FIXES.md for security best practices

Development Standards:

  • ✅ All C++ code passes unit tests
  • ✅ Security: No system() calls, proper input validation, ReDoS-safe regex
  • ✅ Cross-platform: Use std::filesystem for paths
  • ✅ Documentation: Update docs for significant changes

License

Apache License 2.0 - See LICENSE for details.

Acknowledgments

Built with:

  • MLX - Apple's machine learning framework
  • Metal - Apple's GPU compute API
  • Inspired by vLLM, llama.cpp, and Ollama

Status: Active Development (Phase 2 Complete, Phase 3 In Progress)
Target: Q1 2025 MVP release
Latest: See docs/IMPLEMENTATION_STATUS.md for current metrics and progress
