metal0

Early alpha - API unstable, not production-ready

Python Syntax. Bare Metal Speed. Zero Friction.

git clone https://github.com/teamchong/metal0 && cd metal0 && make install
metal0 app.py        # compile + run
metal0 build app.py  # compile only

Why

Python is slow. Packaging is painful. metal0 fixes both:

| | Python | metal0 |
|---|---|---|
| Speed | 1x | 30x |
| Binary | pip + venv + deps | single 50KB file |
| Docker | 900MB | <1MB |
| Startup | 50ms | 1ms |

How it Works

metal0 uses two-tier compilation for maximum speed while maintaining full Python compatibility:

┌──────────────────────────────────────────────────────────────┐
│  def calculate(x, y):                                        │
│      a = x + y          # ← Tier 1: AOT → Native Zig (30x)  │
│      b = a * 2          # ← Tier 1: AOT → Native Zig        │
│      c = eval("a + b")  # ← Tier 2: Bytecode VM (~1x)       │
│      return c + 1       # ← Tier 1: AOT → Native Zig        │
└──────────────────────────────────────────────────────────────┘
| Tier | What | Speed | When Used |
|---|---|---|---|
| Tier 1: AOT | Python → Zig → Native | 30x CPython | Static code (99% of code) |
| Tier 2: VM | Bytecode interpreter | ~1x CPython | eval(), exec(), dynamic features |

Key insight: Only the specific eval() expression uses the VM - surrounding code stays native. One dynamic call doesn't slow down your entire program.
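A minimal sketch of what this tiering means in practice (the code itself is hypothetical, not from the test suite):

# The loop compiles ahead-of-time to native code (Tier 1);
# only the eval() expression runs on the bytecode VM (Tier 2),
# so the hot path stays at full speed.
total = 0
for i in range(1_000_000):
    total += i * 2            # Tier 1: native Zig
result = eval("total + 1")    # Tier 2: VM, only this expression
print(result)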

CPython C API Compatible: All Python types use extern struct with exact CPython memory layout (ob_refcnt, ob_type, etc.), enabling C extension compatibility.

📖 Full Architecture Documentation | C API Layout

Benchmarks

All benchmarks on Apple M2.

Async/Concurrency

metal0 compiles Python's asyncio to optimized native code:

  • I/O-bound: State machine coroutines with kqueue netpoller (single thread, high concurrency)
  • CPU-bound: Thread pool with M:N scheduling (parallel execution across cores)
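A minimal sketch of the two workload shapes, assuming standard asyncio semantics (the helper names are hypothetical):

import asyncio
import hashlib

async def sleeper(i: int) -> int:
    await asyncio.sleep(0.1)  # I/O-bound: parked on the kqueue netpoller
    return i

def hash_block(data: bytes) -> str:
    # CPU-bound: work like this is a candidate for the M:N thread pool
    return hashlib.sha256(data).hexdigest()

async def main() -> None:
    done = await asyncio.gather(*(sleeper(i) for i in range(10_000)))
    print(len(done), hash_block(b"metal0")[:8])

asyncio.run(main())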

Parallel Scaling: SHA256 Hashing (8 workers × 50K hashes each)

| Runtime | Speedup | Efficiency | Notes |
|---|---|---|---|
| metal0 | 6.05x | 76% | Thread pool + stack alloc, no GC |
| Go (goroutines) | 3.72x | 47% | M:N scheduler, GC overhead |
| Rust (rayon) | 1.04x | 13% | Work-stealing overhead |
| CPython | 1.07x | 13% | GIL blocks parallelism |
| PyPy | 0.98x | 12% | GIL + JIT overhead |

Speedup = Sequential / Parallel. Ideal: 8x for 8 cores. metal0 achieves 1.6x better parallel efficiency than Go.

I/O-Bound: Concurrent Sleep (10,000 tasks × 100ms each)

| Runtime | Time | Concurrency vs Sequential | Notes |
|---|---|---|---|
| metal0 | 103.5ms | 9,662x | Best event loop |
| Rust (tokio) | 111.7ms | 8,952x | Great async runtime |
| Go | 126.9ms | 7,880x | Great for network |
| CPython | 194.3ms | 5,147x | Good for I/O |
| PyPy | 258.8ms | 3,864x | Slower I/O |

Sequential would take 1,000,000ms (16.7 min). metal0 achieves 9662× concurrency via state machine + kqueue netpoller.

Recursive Computation

Fibonacci(45) - Recursive:

| Language | Time | vs Python |
|---|---|---|
| metal0 | 3.22s | 30.1x faster |
| Rust | 3.23s | 30.0x faster |
| Go | 3.60s | 26.9x faster |
| PyPy | 11.75s | 8.3x faster |
| Python | 96.94s | baseline |

Tail-Recursive Fibonacci (10K × fib(10000)) - TCO Test:

| Language | Time | vs metal0 |
|---|---|---|
| metal0 | 31.9ms | 1.00x |
| Rust | 32.2ms | 1.01x |
| Go | 286.7ms | 8.99x slower |
| Python/PyPy | N/A | RecursionError |

metal0 uses @call(.always_tail) for guaranteed TCO.
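For reference, a hedged sketch of the tail-recursive shape being measured (the exact benchmark code is an assumption):

# Tail-recursive Fibonacci: the recursive call is in tail position,
# so metal0 can lower it to a jump via @call(.always_tail).
def fib_tail(n: int, a: int = 0, b: int = 1) -> int:
    if n == 0:
        return a
    return fib_tail(n - 1, b, a + b)  # tail call, no stack growth

print(fib_tail(10_000))  # RecursionError on CPython (default limit 1000)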

Startup Time - Hello World (100 runs):

| Language | Time | vs CPython |
|---|---|---|
| metal0 | 1.6ms | 14x faster |
| Rust | 1.8ms | 12x faster |
| Go | 2.4ms | 9x faster |
| CPython | 22.4ms | baseline |

JSON Benchmark (50K iterations × 38KB realistic JSON)

JSON Parse (50K × 38KB = 1.9GB processed):

| Implementation | Time | vs metal0 |
|---|---|---|
| metal0 | 2.68s | 1.00x |
| PyPy | 3.16s | 1.18x slower |
| Rust (serde_json) | 4.70s | 1.76x slower |
| Python | 8.40s | 3.14x slower |
| Go | 14.0s | 5.23x slower |

JSON Stringify (50K × 38KB = 1.9GB processed):

| Implementation | Time | vs metal0 |
|---|---|---|
| metal0 | 2.68s | 1.00x |
| Rust (serde_json) | 3.01s | 1.12x slower |
| Python | 12.3s | 4.60x slower |
| PyPy | 12.4s | 4.61x slower |
| Go | 15.6s | 5.81x slower |

Key optimizations:

  • Arena allocator - bump-pointer (~2 CPU cycles per alloc vs ~100+ for malloc)
  • SWAR string scanning - 8 bytes at a time (PyPy's technique)
  • Small integer cache - pre-allocated for -10 to 256
  • SIMD whitespace skipping (AVX2/NEON) - 32 bytes per iteration
  • SIMD string escaping - 4.3x speedup on ARM64 NEON

Dict/String Benchmarks

Dict Benchmark (10M lookups, 8 keys):

| Language | Time | vs Python |
|---|---|---|
| metal0 | 329ms | 4.3x faster |
| PyPy | 570ms | 2.5x faster |
| Python | 1.42s | baseline |

String Benchmark (100M iterations, comparison + length):

| Language | Time | vs Python |
|---|---|---|
| metal0 | 1.6ms | 5000x faster |
| PyPy | 154ms | 53x faster |
| Python | 8.1s | baseline |

metal0 string operations are computed at comptime where possible.
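A hypothetical illustration of that comptime folding, assuming both operands are known at compile time:

s = "hello" + " " + "world"    # concatenation folds at compile time
print(len(s))                  # constant 11, no runtime work
print(s == "hello world")      # constant True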

NumPy Matrix Multiplication (BLAS)

500×500 matrix multiplication using BLAS cblas_dgemm.

| Runtime | Time | vs metal0 |
|---|---|---|
| metal0 (BLAS) | 3.2ms | 1.00x |
| Python (NumPy) | 66ms | 21x slower |
| PyPy (NumPy) | 129ms | 40x slower |

All use the same BLAS library - metal0 eliminates interpreter overhead.
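The benchmark kernel is presumably along these lines (a sketch, not the exact harness):

import numpy as np

a = np.random.rand(500, 500)
b = np.random.rand(500, 500)
c = a @ b  # both metal0 and CPython dispatch this to cblas_dgemm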

Tokenizer Benchmark

100% Correctness - Verified against tiktoken cl100k_base (3459/3459 tests pass).

BPE Encoding (59,200 encodes - 592 texts × 100 iterations):

| Implementation | Time | vs metal0 | Correctness |
|---|---|---|---|
| metal0 (Zig) | 81ms | 1.00x | 100% |
| rs-bpe (Rust) | 420ms | 5.2x slower | 100% |
| tiktoken (Rust) | 1110ms | 13.7x slower | 100% |
| HuggingFace (Python) | 5439ms | 67x slower | 100% |

Tested on Apple M2 with json.load() data.

Web/WASM Encoding (583 texts × 200 iterations):

| Library | Time | vs metal0 | Size |
|---|---|---|---|
| metal0 (WASM) | 93ms | 1.00x | 46KB + 773B runtime |
| gpt-tokenizer (JS) | 713ms | 7.7x slower | 1.1MB |
| @anthropic-ai/tokenizer (JS) | 8560ms | 92x slower | 8.6MB |

Runtime uses Immer-style Proxy pattern - 773 bytes shared across all modules.

BPE Training (vocab_size=32000, 300 iterations):

| Library | Time | vs metal0 | Correctness |
|---|---|---|---|
| metal0 (Zig) | 68.7ms | 1.00x | 100% |
| HuggingFace (Rust) | 1707.9ms | 25x slower | 100% |

Training produces identical vocabularies - verified with comparison test.

Unigram Training (vocab_size=32000, 100 iterations):

| Library | Time | vs HuggingFace |
|---|---|---|
| HuggingFace (Rust) | 2.15s | 1.00x |
| metal0 (Zig) | 5.70s | 2.65x slower |

BPE training is 25x faster; Unigram training has improved from 11.95x slower to 2.65x slower.

Regex Benchmark

Regex Pattern Matching (5 common patterns):

| Implementation | Total Time | vs metal0 |
|---|---|---|
| metal0 (Lazy DFA) | 1.324s | 1.00x |
| Rust (regex) | 4.639s | 3.50x slower |
| Python (re) | ~43s | ~32x slower |
| Go (regexp) | ~58s | ~44x slower |

Pattern breakdown (1M iterations each):

| Pattern | metal0 | Rust | Speedup |
|---|---|---|---|
| Email | 93ms | 95ms | 1.02x |
| URL | 81ms | 252ms | 3.12x |
| Digits | 692ms | 3,079ms | 4.45x |
| Word Boundary | 116ms | 385ms | 3.32x |
| Date ISO | 346ms | 636ms | 1.84x |
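A hedged sketch of the per-pattern loop; the exact regexes used in the benchmark are assumptions:

import re

# One of the five patterns (Email); metal0 compiles the same calls
# to its lazy-DFA engine instead of a backtracking matcher.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
text = "contact alice@example.com or bob@example.org"
for _ in range(1_000_000):
    EMAIL.findall(text)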

HTTP Client Benchmark

HTTP/1.1 + TLS + Gzip (100 requests to https://www.google.com):

| Client | Time | vs metal0 |
|---|---|---|
| metal0 | 39.8s | 1.00x |
| Go (net/http) | 41.9s | 1.05x slower |
| Rust (ureq) | 44.2s | 1.11x slower |
| Python (requests) | 51.8s | 1.30x slower |
| PyPy (requests) | 52.5s | 1.32x slower |

HTTP/2 + TLS + Gzip (100 requests to https://www.google.com):

| Client | Time | vs metal0 |
|---|---|---|
| metal0 | 39.4s | 1.00x |
| Python (httpx) | 39.4s | 1.00x |
| Go (net/http) | 41.9s | 1.06x slower |
| Rust (reqwest) | 42.2s | 1.07x slower |

WebSocket (100 messages × 1KB, local echo server):

| Client | Time | vs metal0 |
|---|---|---|
| metal0 | 7.3ms | 1.00x |
| Go (gorilla) | 9.2ms | 1.27x slower |
| Rust (tungstenite) | 13.4ms | 1.84x slower |
| Python (websocket-client) | 72.6ms | 9.95x slower |
| PyPy (websocket-client) | 109.4ms | 15.0x slower |

metal0 uses a custom TLS 1.3 implementation with AES-NI acceleration and HTTP/2 multiplexing. Connection pooling is used for all protocols.
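A sketch of the HTTP/1.1 benchmark loop using the stdlib-style http module listed under Features; the precise client API metal0 exposes is an assumption here:

import http.client

for _ in range(100):
    conn = http.client.HTTPSConnection("www.google.com")
    conn.request("GET", "/", headers={"Accept-Encoding": "gzip"})
    conn.getresponse().read()  # TLS handshake + gzip-encoded body
    conn.close()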

Web Server Benchmark

HTTP throughput (Hello World JSON, wrk -t4 -c100 -d10s):

| Server | Requests/sec | Latency (avg) | vs Python |
|---|---|---|---|
| Rust (actix-web) | 140,530 | 683us | 96x faster |
| Go (net/http) | 128,766 | 930us | 88x faster |
| Python (Flask) | 1,457 | 28.2ms | baseline |
| PyPy (Flask) | 165 | 588ms | 9x slower |

Tested on Apple M2. Flask dev server. PyPy slower due to JIT warmup overhead on short-lived requests.

Running Benchmarks

make benchmark-fib         # Fibonacci
make benchmark-json-full   # JSON parse + stringify
make benchmark-dict        # Dict lookups
make benchmark-string      # String operations
make benchmark-regex       # Regex patterns
make benchmark-asyncio     # CPU-bound async
make benchmark-asyncio-io  # I/O-bound async
make benchmark-numpy       # NumPy BLAS
make benchmark-http1       # HTTP/1.1 client (TLS + Gzip)
make benchmark-http2       # HTTP/2 client (TLS + Gzip)
make benchmark-websocket   # WebSocket client
make benchmark-http        # All HTTP benchmarks
make benchmark-webserver   # Web server throughput (wrk)

# Tokenizer benchmarks (run from packages/tokenizer/)
cd packages/tokenizer && zig build -Doptimize=ReleaseFast && ./zig-out/bin/bench_train

Install

git clone https://github.com/teamchong/metal0
cd metal0 && make install

Requires: Zig 0.15.2+

Usage

metal0 app.py                       # compile and run
metal0 build app.py                 # compile only
metal0 build --binary app.py        # standalone executable
metal0 --force app.py               # ignore cache
metal0 --target wasm-browser app.py # browser WASM (freestanding)
metal0 --target wasm-edge app.py    # WasmEdge/WASI WASM
metal0 server                       # start eval server

WASM Targets

| Target | Platform | Allocator | Features |
|---|---|---|---|
| wasm-browser | Browser (freestanding) | FixedBuffer 64KB | No threads, smallest size |
| wasm-edge | WasmEdge/WASI | GPA | fd_write, WASI sockets |

metal0 --target wasm-browser app.py  # Browser WASM
metal0 --target wasm-edge app.py     # WasmEdge/WASI
# Outputs: app.wasm + app.d.ts

Usage:

import { load } from '@metal0/wasm-runtime';  // 773 bytes, Immer-style runtime
import type { Tokenizer } from './tokenizer';   // generated .d.ts

const mod = await load<Tokenizer>('./tokenizer.wasm');
mod.encode("hello");  // fully typed

Immer-Style Runtime (@metal0/wasm-runtime - 773 bytes):

Like Immer, our runtime uses a Proxy pattern for minimal code that works with ANY module:

// Generic Proxy-based loader - same for ALL modules (shown minified)
const E=new TextEncoder();let w,m,p,M=1<<20;  // exports w, memory m, scratch ptr p, scratch size M (1 MiB)
const g=()=>new Uint8Array(m.buffer,p,M);     // typed view over the scratch region of WASM memory
const x=a=>{                                  // marshal one arg: strings -> [ptr, len], others pass through
  if(typeof a!=='string')return[a];
  const b=E.encode(a);
  if(b.length>M){M=b.length+1024;p=w.alloc(M)} // grow scratch if the encoded string doesn't fit
  g().set(b);return[p,b.length];              // copy bytes in, hand WASM the pointer + length
};
export async function load(s){
  const b=typeof s==='string'?await fetch(s).then(r=>r.arrayBuffer()):s;  // accept URL or ArrayBuffer
  w=(await WebAssembly.instantiate(await WebAssembly.compile(b),{})).exports;
  m=w.memory;
  if(w.alloc){p=w.alloc(M)}                   // reserve scratch space when the module exports alloc()
  // each property access is proxied; the 'batch' helper is part of the full runtime, not shown here
  return new Proxy({},{get:(_,n)=>n==='batch'?batch:typeof w[n]==='function'?(...a)=>w[n](...a.flatMap(x)):w[n]});
}

Generated TypeScript definitions (tokenizer.d.ts):

// Auto-generated - provides full IntelliSense
export interface Tokenizer {
  encode(text: string): number;
  decode(tokens: number[]): string;
}

Why Immer-Style?

  • 773 bytes - Tiny, works with ANY WASM module
  • Proxy pattern - Zero per-function wrapper code
  • Auto string marshalling - Handles JS↔WASM conversion
  • Module-specific .d.ts - Full TypeScript support

Features

  • Functions, classes, inheritance, decorators
  • int, float, str, bool, list, dict, tuple, set
  • List/dict/set comprehensions, f-strings, generators
  • Imports, type inference (no annotations needed)
  • json, re, math, os, sys, http, asyncio
  • eval(), exec() via bytecode VM
  • DWARF debug symbols, PGO, source maps
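A small hypothetical program touching several of these features at once:

class Greeter:
    def __init__(self, name: str):
        self.name = name

    def greet(self) -> str:
        return f"hello, {self.name}"        # f-string

def evens(n):
    for i in range(n):                      # generator
        if i % 2 == 0:
            yield i

names = {n: len(n) for n in ("ada", "grace")}  # dict comprehension
print(Greeter("world").greet(), list(evens(6)), names)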

C Extension Support

metal0 supports any CPython C extension (NumPy, Pandas, TensorFlow, etc.) via a complete CPython C API implementation in pure Zig.

How It Works

Python script imports numpy → metal0 detects C extension → dlopen() at runtime
                                                              ↓
              NumPy calls PyList_New(), PyFloat_AsDouble() → metal0's exported C API
                                                              ↓
                                               PyObject* with CPython 3.12-compatible memory layout

metal0 exports 997 CPython C API functions with 100% binary compatibility for Python 3.10, 3.11, 3.12, and 3.13:

| Category | Functions | Examples |
|---|---|---|
| Type Objects | 45+ | PyType_Type, PyLong_Type, PyList_Type |
| Object Creation | 100+ | PyObject_New, PyList_New, PyDict_New |
| Object Protocol | 80+ | PyObject_GetAttr, PySequence_GetItem |
| Memory Management | 20+ | Py_INCREF, Py_DECREF, PyMem_Malloc |
| Error Handling | 40+ | PyErr_SetString, PyErr_Occurred, PyErr_Clear |
| Module/Import | 30+ | PyModule_Create, PyImport_ImportModule |
| Buffer Protocol | 15+ | PyBuffer_GetPointer, PyMemoryView_FromBuffer |
| Iterator Protocol | 10+ | PyIter_Next, PyObject_GetIter |
| Codec APIs | 50+ | PyCodec_Encode, PyUnicode_DecodeUTF8 |
| Type Creation | 33 | PyType_Ready, PyType_FromSpec, PyType_GenericAlloc |

Key Features:

  • Pure Zig - No CPython linking, all functions implemented natively
  • Multi-version - Supports Python 3.10, 3.11, 3.12, 3.13 struct layouts
  • PEP 384 - Full stable ABI support with PyType_FromSpec heap types
  • Thread-safe - Thread-local exception state, atomic interrupt flags
  • Small int cache - Pre-allocated integers -5 to 256 (like CPython)

Example: NumPy

# examples/c_extensions/numpy_example.py
import numpy as np

arr = np.array([1, 2, 3, 4, 5])
print(f"Array: {arr}")
print(f"Sum: {arr.sum()}")
print(f"Mean: {arr.mean()}")

matrix = np.array([[1, 2], [3, 4]])
print(f"Matrix dot Matrix:\n{np.dot(matrix, matrix)}")
metal0 examples/c_extensions/numpy_example.py --force
# Info: C extension module 'numpy' will be loaded at runtime via c_interop

Why This Works

  1. No CPython linking - metal0 implements the C API in pure Zig
  2. Compatible memory layout - PyObject structs match CPython's layout exactly
  3. Runtime loading - C extensions loaded via dlopen(), call exported functions
  4. Zero changes needed - Existing C extensions work unmodified

Implementation

// packages/c_interop/src/cpython_api.zig - 997 exported functions
export fn PyList_New(size: isize) ?*cpython.PyObject { ... }
export fn PyDict_SetItem(dict: *cpython.PyObject, key: *cpython.PyObject, value: *cpython.PyObject) c_int { ... }
export fn Py_INCREF(obj: *cpython.PyObject) void { ... }

// Type creation (PEP 384 stable ABI)
export fn PyType_FromSpec(spec: *cpython.PyType_Spec) ?*cpython.PyObject { ... }
export fn PyType_Ready(type_obj: *cpython.PyTypeObject) c_int { ... }

// Type object getters (can't export var in Zig)
export fn _metal0_get_PyType_Type() *cpython.PyTypeObject { ... }
export fn _metal0_get_PyLong_Type() *cpython.PyTypeObject { ... }

eval()/exec() Architecture

                    ┌─────────────────────────────────────┐
                    │         eval()/exec() entry         │
                    └─────────────────┬───────────────────┘
                                      │
                    ┌─────────────────▼───────────────────┐
                    │    metal0 Parser + Type Inferrer    │
                    │    (REUSE existing src/parser/)     │
                    └─────────────────┬───────────────────┘
                                      │
                    ┌─────────────────▼───────────────────┐
                    │         Bytecode Compiler           │
                    │      src/bytecode/compiler.zig      │
                    └─────────────────┬───────────────────┘
                                      │
              ┌───────────────────────┼───────────────────────┐
              │                       │                       │
    ┌─────────▼─────────┐   ┌─────────▼─────────┐   ┌─────────▼─────────┐
    │   Native Binary   │   │   Browser WASM    │   │   WasmEdge WASI   │
    │   (stack-based)   │   │   (Web Worker)    │   │   (WASI sockets)  │
    │     vm.zig        │   │  wasm_worker.zig  │   │  wasi_socket.zig  │
    └───────────────────┘   └───────────────────┘   └───────────────────┘

Comptime Target Selection:

pub const target: Target = comptime blk: {
    if (builtin.target.isWasm()) {
        if (builtin.os.tag == .wasi) break :blk .wasm_edge;
        break :blk .wasm_browser;
    }
    break :blk .native;
};

Browser WASM: Immer-Style Runtime

For browser targets, eval() uses the same 773-byte Immer-style runtime with Web Worker isolation:

import { load, registerHandlers } from '@metal0/wasm-runtime';

// Register handlers for @wasm_import decorators
registerHandlers('js', {
  fetch: async (urlPtr, urlLen) => { /* ... */ },
  localStorage_get: (keyPtr, keyLen) => { /* ... */ }
});

// Eval spawns isolated Web Workers using cached WASM module
const mod = await load('./module.wasm');
const result = await mod.eval("1 + 2");  // Returns 3

Web Worker Isolation:

  • Simple expressions run inline
  • Complex code spawns Web Worker for security
  • Cached WASM module enables "viral spawning" - workers share compiled module

WasmEdge WASI: Server-Side Eval

# Just use eval() like normal Python
result = eval("1 + 2 * 3")  # Returns 7

# Or exec() for statements
exec("x = 42")
print(x)  # 42

# Server runs automatically when eval()/exec() is used
# Or start manually for persistent connections:
metal0 server --vm-module metal0_vm.wasm

Architecture:

  • Fresh WASM instance per eval() call (security isolation)
  • Bytecode compiled from Python source
  • Executed in WasmEdge sandbox

User-Declared Bindings

Instead of hardcoding JS/WASI functions, declare what you need:

from metal0 import wasm_import, wasm_export

@wasm_import("js")
def fetch(url: str) -> str: ...

@wasm_export
def process(data: str) -> list[int]:
    result = fetch("/api/data")
    return [ord(c) for c in result]

metal0 generates optimized Zig externs and a minimal JS loader - only the declared functions are included.

How It Works

Python → Lexer → Parser → Type Inference → Zig codegen → Native binary

No Python runtime. Types inferred at compile time. Zig handles memory.
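A hypothetical example of the inference step: no annotations, yet the types are resolved at compile time:

def scale(xs, k):
    return [x * k for x in xs]   # xs: list[int], k: int inferred from the call site

print(scale([1, 2, 3], 10))      # codegen emits a typed Zig function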

Debugging

metal0 app.py --debug           # DWARF + source maps
lldb ./build/lib.../app         # Python line numbers in debugger
metal0 profile run app.py       # Profile collection
metal0 profile show app.py      # View hotspots

License

Apache 2.0
