MLXR

High-performance, macOS-native LLM inference engine for Apple Silicon.

Overview

MLXR is a local LLM runner built specifically for Apple Silicon (M4, M3, M2) that combines:

  • MLX framework for tensor/graph management
  • Custom Metal kernels for performance-critical operations
  • OpenAI and Ollama-compatible APIs for seamless integration
  • React-based GUI with real-time streaming

Key Features

  • Native Performance: Custom Metal kernels optimized for Apple's unified memory architecture
  • Memory Efficient: Paged KV cache with smart eviction policies
  • High Throughput: Continuous batching and speculative decoding
  • Model Support: GGUF, HF safetensors, and native MLX formats
  • Quantization: Full support for Q2_K through Q8_K, FP8, and NF4
  • Developer Friendly: OpenAI and Ollama-compatible REST APIs (see the example below)
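
Because the API follows the OpenAI wire format, standard clients can point at the local daemon. Below is a minimal sketch using the official openai Python package; the base URL, port, and model name are assumptions and should be checked against your daemon configuration and model registry.

# pip install openai
from openai import OpenAI

# Hypothetical listen address; MLXR's actual port/base URL may differ.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="llama-3-8b-instruct",  # hypothetical model name from the local registry
    messages=[{"role": "user", "content": "Hello from MLXR!"}],
)
print(response.choices[0].message.content)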

Architecture

┌─────────────────────────┐
│   React WebView GUI     │
│   (Tray/Dock App)       │
└───────────┬─────────────┘
            │ Unix Domain Socket
┌───────────▼─────────────┐
│   Daemon (REST/gRPC)    │
│   - OpenAI API          │
│   - Ollama API          │
│   - Model Registry      │
└───────────┬─────────────┘
            │
┌───────────▼─────────────┐
│   Inference Core        │
│   - MLX Graph           │
│   - Metal Kernels       │
│   - Paged KV Cache      │
│   - Continuous Batching │
└─────────────────────────┘
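
The GUI reaches the daemon over a Unix domain socket rather than a TCP port. As a transport illustration only, the Python sketch below sends a raw HTTP request across a UDS using the standard library; the socket path and the /health route are assumptions, not confirmed values.

import socket

SOCKET_PATH = "/tmp/mlxr.sock"  # hypothetical path; the daemon's real socket may live elsewhere

request = (
    "GET /health HTTP/1.1\r\n"
    "Host: localhost\r\n"
    "Connection: close\r\n"
    "\r\n"
)

with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
    sock.connect(SOCKET_PATH)
    sock.sendall(request.encode("utf-8"))
    chunks = []
    while True:  # read until the server closes the connection
        data = sock.recv(4096)
        if not data:
            break
        chunks.append(data)

print(b"".join(chunks).decode("utf-8", errors="replace"))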

Performance Targets (M4)

  • First Token: < 1s for 7B-8B models at 4-bit
  • Decode: < 80ms/token steady-state
  • Embeddings: < 20ms/sample
  • Occupancy: ≥ 60% GPU utilization on attention kernels
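
For reference, the decode target translates directly into a steady-state throughput floor; the conversion below is pure arithmetic, not a measurement.

ms_per_token = 80.0
print(f"{1000.0 / ms_per_token:.1f} tokens/sec at the decode target")  # 12.5 tok/s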

Requirements

  • macOS 14.0 (Sonoma) or later
  • Apple Silicon (M2, M3, or M4)
  • Xcode 15+ (for building)
  • Homebrew package manager
  • CMake 3.20+
  • Python 3.11+ (with MLX installed)
  • Node.js 18+ (for frontend development)

System Dependencies

The following Homebrew packages are required:

# Install all dependencies at once
brew install cmake ninja mlx sentencepiece nlohmann-json cpp-httplib googletest

# Or use the Makefile convenience target
make install-deps

Note: CMake and Ninja must be installed via Homebrew, not Conda; the build expects the Homebrew toolchain and may not work with the conda-forge packages.

Project Status

Core Infrastructure Complete - Integration Work Remaining

Codebase Size: ~50,000 LOC across core, daemon, app, tests, and SDKs

Completed Features

Phase 1: Minimal Inference ✅ 100%

  • Complete Llama model with safetensors loading (737 lines)
  • SentencePiece tokenizer (252 lines)
  • Sampling strategies (greedy, temperature, top-k, top-p) - 534 lines (see the sketch after this list)
  • Working text generation pipeline
  • Example: simple_generation.cpp - WORKS
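
As a reference for what the sampling stage does, here is a toy Python version of the temperature / top-k / top-p chain. It is purely illustrative: the engine's C++ implementation may order or combine the filters differently.

import numpy as np

def sample_next_token(logits, temperature=0.8, top_k=40, top_p=0.95, rng=None):
    """Toy sampler: temperature scaling, then top-k, then nucleus (top-p) filtering."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64) / max(temperature, 1e-6)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    # top-k: zero out everything outside the k most probable tokens
    if top_k > 0:
        kth_largest = np.sort(probs)[-top_k]
        probs = np.where(probs >= kth_largest, probs, 0.0)

    # top-p: keep the smallest prefix (by probability) whose mass reaches p
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, top_p * probs.sum())) + 1
    mask = np.zeros_like(probs)
    mask[order[:cutoff]] = 1.0
    probs *= mask

    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

# Greedy decoding is the temperature -> 0 limit: simply take argmax(logits).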

Phase 2: Optimization ✅ 95%

  • KV Cache System - Complete paged arena (2,373 lines) with LRU eviction and GQA support (concept sketch after this list)
  • Scheduler - Continuous batching (439 lines) with prefill/decode separation
  • Metal Kernels - All 6 kernels implemented (~5,200 LOC total):
    • RMSNorm: 217 lines shader + 362 lines primitive - INTEGRATED & TESTED (81/81 tests) ✅
    • Attention Decode: 295 + 574 lines - Ready for integration
    • Attention Prefill: 370 + 633 lines - Ready for integration
    • RoPE: 434 + 478 lines - Ready for integration
    • SwiGLU MLP: 432 + 321 lines - Ready for integration
    • Q-Gemm Dequant: 486 + 525 lines - Ready for integration
  • Test Daemon - Working HTTP server (test_daemon) with health/models endpoints
  • ⚠️ Integration Gap: Metal kernels need wiring in CachedAttention (8-16 hours)
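
To make the paged-cache idea concrete, here is a deliberately small Python model of it: each sequence owns a list of fixed-size pages, and an LRU order over sequences decides whose pages are reclaimed under memory pressure. This is a concept sketch only and does not mirror the actual C++ arena layout, GQA handling, or eviction-policy details.

from collections import OrderedDict

PAGE_SIZE = 16  # tokens per page; illustrative only

class ToyPagedKVCache:
    def __init__(self, max_pages: int):
        self.free_pages = list(range(max_pages))
        self.pages = OrderedDict()   # seq_id -> list of page ids, kept in LRU order
        self.lengths = {}            # seq_id -> tokens written so far

    def append_token(self, seq_id: int) -> int:
        """Return the page id that should hold the next token of seq_id."""
        seq_pages = self.pages.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        self.pages.move_to_end(seq_id)       # mark as most recently used
        if length % PAGE_SIZE == 0:          # current page is full (or none allocated yet)
            if not self.free_pages:
                self._evict_lru(protect=seq_id)
            seq_pages.append(self.free_pages.pop())
        self.lengths[seq_id] = length + 1
        return seq_pages[-1]

    def _evict_lru(self, protect: int):
        """Reclaim every page of the least recently used sequence (never the caller's)."""
        for victim in list(self.pages):
            if victim != protect:
                self.free_pages.extend(self.pages.pop(victim))
                self.lengths.pop(victim, None)
                return
        raise RuntimeError("no evictable sequence")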

Phase 3: Service Layer ✅ 70%

  • REST API - OpenAI & Ollama-compatible endpoints (1,758 lines)
  • gRPC Server - FULLY IMPLEMENTED (1,101 lines) with streaming
  • SSE Streaming - Real-time token generation (621 lines) - consumer sketch after this list
  • Model Registry - SQLite catalog (1,137 lines) with GGUF parser (891 lines)
  • Telemetry - Metrics collection (769 lines, 15/15 tests passing)
  • Test Suite - 14 C++ unit test files, 299 total tests
  • ⚠️ Integration Gap: Model loading → Engine → Worker wiring (4-8 hours)
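
The streaming endpoint can be consumed like any OpenAI-style SSE stream. The sketch below assumes that wire format (lines prefixed with "data: ", terminated by "data: [DONE]"); the URL, port, and model name are assumptions.

import json
import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",  # hypothetical listen address
    json={
        "model": "llama-3-8b-instruct",           # hypothetical registered model
        "messages": [{"role": "user", "content": "Stream a haiku."}],
        "stream": True,
    },
    stream=True,
)

for raw in resp.iter_lines():
    if not raw or not raw.startswith(b"data: "):
        continue
    payload = raw[len(b"data: "):]
    if payload == b"[DONE]":
        break
    chunk = json.loads(payload)
    print(chunk["choices"][0]["delta"].get("content", ""), end="", flush=True)
print()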

Frontend ✅ 90%

  • React UI - 78 components fully implemented
  • Chat Interface - With streaming and tool calls
  • Model Management - Pull, import, convert, quantize
  • Metrics Dashboard - Real-time performance visualization
  • All Pages - Chat, Models, Playgrounds, Metrics, Settings, Logs

macOS App ✅ 90%

  • Swift Components - 20 files implementing app host
  • JavaScript Bridge - Complete UDS communication
  • Xcode Project - Exists and configured
  • Daemon Management - launchd integration
  • ⚠️ Missing: .app bundle build, code signing, .dmg creation

SDKs ✅ 95%

  • Python SDK - Complete client with async support
  • TypeScript SDK - Full type definitions and clients
  • Swift SDK - SwiftPM package with examples

Current Performance (TinyLlama 1.1B)

  • Prefill: 198-459 ms (5-10 tokens)
  • Decode: 53-220 ms/token
  • Throughput: 4.5-18.9 tokens/sec
  • Memory: 87.5% KV-cache reduction with GQA (308 MB saved) - see the note below
  • Expected after Metal integration: 2-5x improvement
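
The 87.5% figure is consistent with grouped-query attention sharing each KV head across several query heads; the check below assumes TinyLlama-style head counts (32 query heads, 4 KV heads), which should be verified against the actual model config.

n_heads, n_kv_heads = 32, 4
print(f"KV cache reduction from GQA: {1 - n_kv_heads / n_heads:.1%}")  # 87.5%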

Next Steps (Critical Path)

P0 - Get it Working (14-28 hours):

  1. ⚠️ Metal Kernel Integration (8-16h) - Wire attention kernels in CachedAttention
  2. ⚠️ Daemon Model Loading (4-8h) - Complete load_model() in REST server
  3. Server Config (2-4h) - Create configs/server.yaml

P1 - Make it Fast (20-32 hours):

  4. Quantization (8-12h) - GGUF loading + Q-gemm integration
  5. RoPE/SwiGLU Kernels (6-10h) - Wire remaining kernels
  6. Speculative Decoding (6-10h) - Connect draft model

P2 - Ship It (36-60 hours):

  7. App Bundle (8-16h) - .dmg, code signing, auto-update
  8. CPU Fallbacks (16-24h) - Neon SIMD implementations
  9. Conversion Tools (12-20h) - GGUF→MLX, quantizers

See CLAUDE.md for comprehensive implementation roadmap and accurate metrics.

Repository Structure

MLXR/
  app/           # macOS app bundle & React UI
  daemon/        # Background server (REST/gRPC)
  core/          # Inference engine (C++/MLX/Metal)
  tools/         # Model converters and utilities
  sdks/          # Client libraries (Python, TS, Swift)
  configs/       # Configuration files
  scripts/       # Build and development scripts
  tests/         # Test suites
  plan/          # Architecture specifications

Getting Started

1. Install Dependencies

# Install Homebrew (if not already installed)
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

# Clone the repository
git clone <repository-url>
cd MLXR

# Install system dependencies
make install-deps

# Setup Python environment (recommended)
make setup
conda activate mlxr

# Check installation
make status

2. Build

# Full build (Metal shaders + C++ core)
make build

# Or quick development build (Metal only)
make dev

# Run tests
make test-cpp

3. Run

# Run daemon
./build/cmake/bin/mlxrunnerd

# Develop frontend (separate terminal)
cd app/ui
yarn install
yarn dev
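
Once the daemon is up, a quick smoke test can confirm it is serving. The port and routes below are assumptions based on the health/models endpoints mentioned earlier; adjust them to your configuration.

import requests

BASE = "http://localhost:8080"  # hypothetical listen address

print(requests.get(f"{BASE}/health").json())     # daemon liveness
print(requests.get(f"{BASE}/v1/models").json())  # registered models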

For detailed build instructions, see CLAUDE.md.

Development Phases

✅ Phase 0: Foundation (COMPLETE)

  • Repository structure and build system
  • Metal shader compilation pipeline
  • Toolchain validation (Homebrew, CMake, Ninja)

✅ Phase 1: Minimal Inference Core (COMPLETE)

  • MLX integration and model loading (safetensors)
  • SentencePiece tokenizer
  • Single-request inference with FP16
  • Working examples in examples/

⏳ Phase 2: Optimization (85% COMPLETE)

  • Paged KV cache with eviction policies
  • Metal kernel implementations (RMSNorm tested, others implemented)
  • Continuous batching scheduler
  • GQA support for memory efficiency
  • Next: CachedLlamaModel integration with Engine

⏳ Phase 3: Service Layer (60% COMPLETE)

  • REST API daemon (OpenAI & Ollama compatible)
  • Model registry with SQLite backend
  • SSE streaming for real-time generation
  • Telemetry and metrics
  • Next: Full API endpoint integration

🔜 Phase 4: Frontend & Distribution (Planned)

  • macOS app bundle with React WebView
  • Unix domain socket communication
  • Auto-updates via Sparkle
  • Code signing and notarization

See plan/SPEC01.md for complete roadmap and docs/IMPLEMENTATION_STATUS.md for current status.

Documentation

Developer Guides: CLAUDE.md (development guidelines and build instructions)

Architecture & Planning: plan/SPEC01.md (roadmap and architecture specifications)

Implementation Details: docs/IMPLEMENTATION_STATUS.md (current status and metrics), docs/SECURITY_FIXES.md (security practices)

Contributing

This project is actively developed and welcomes contributions!

Current Focus Areas:

  • CachedLlamaModel integration with Engine
  • Metal kernel optimization and testing
  • OpenAI API endpoint completion
  • Performance benchmarking and profiling

Before Contributing:

  1. Read CLAUDE.md for development guidelines
  2. Check docs/IMPLEMENTATION_STATUS.md for current status
  3. Review docs/SECURITY_FIXES.md for security best practices

Development Standards:

  • ✅ All C++ code passes unit tests
  • ✅ Security: No system() calls, proper input validation, ReDoS-safe regex
  • ✅ Cross-platform: Use std::filesystem for paths
  • ✅ Documentation: Update docs for significant changes

License

Apache License 2.0 - See LICENSE for details.

Acknowledgments

Built with:

  • MLX - Apple's machine learning framework
  • Metal - Apple's GPU compute API
  • Inspired by vLLM, llama.cpp, and Ollama

Status: Active Development (Phase 2 Complete, Phase 3 In Progress)
Target: Q1 2025 MVP release
Latest: See docs/IMPLEMENTATION_STATUS.md for current metrics and progress
