MetaXuda is an experimental CUDA-compatible runtime shim for Apple Silicon, written in Rust, that allows Numba CUDA kernels to run unmodified by transparently mapping CUDA runtime calls to Apple Metal.
It is designed as a drop-in replacement for core CUDA runtime libraries, enabling GPU-accelerated Python workflows on macOS without requiring the NVIDIA CUDA Toolkit or NVIDIA hardware.
✨ Features
- Drop-in replacement for libcudart.dylib and libcuda.dylib
- Run Numba CUDA kernels (@cuda.jit) directly on Apple Metal
- Metal-backed implementations of core CUDA APIs (see the ctypes sketch after this list):
  - cudaMalloc / cudaFree
  - cudaMemcpy / cudaMemcpyAsync
  - cudaLaunchKernel
- Asynchronous execution with stream-style overlap (copy / compute / copy)
- Precompiled Metal .metallib shaders for fused math operations
- cuda_pipeline.so, exposing a low-level execution API that lets Numba and other callers bypass the CUDA runtime shim and dispatch operations directly
- No CUDA Toolkit, NVIDIA drivers, or NVIDIA GPU required
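Because the shim exports the standard CUDA runtime entry points, the core APIs can be exercised directly from Python. A minimal sketch via ctypes, assuming the in-package library path shown under Package Layout and the standard CUDA runtime signatures (the kind value 1 is cudaMemcpyHostToDevice):

import os
import ctypes
import metaxuda

# Load the shim's CUDA runtime (path assumes the default package layout)
libdir = os.path.join(os.path.dirname(metaxuda.__file__), "native")
cudart = ctypes.CDLL(os.path.join(libdir, "libcudart.dylib"))

nbytes = 1024
dev_ptr = ctypes.c_void_p()
# cudaError_t cudaMalloc(void** devPtr, size_t size); 0 == cudaSuccess
assert cudart.cudaMalloc(ctypes.byref(dev_ptr), ctypes.c_size_t(nbytes)) == 0

host = (ctypes.c_byte * nbytes)()
# cudaError_t cudaMemcpy(dst, src, count, kind); kind 1 == host-to-device
assert cudart.cudaMemcpy(dev_ptr, host, ctypes.c_size_t(nbytes), 1) == 0

assert cudart.cudaFree(dev_ptr) == 0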
⚠️ Project Status
Alpha / Research Prototype
MetaXuda is under active development and currently targets:
- Numba CUDA kernels
- Single-GPU execution on Apple Silicon
Not all CUDA APIs are implemented, and behavior may differ from NVIDIA CUDA in edge cases.
⚙️ Installation
Requirements
- macOS 13+
- Python >= 3.10
- NumPy >= 1.23
- Numba >= 0.59
Install (Editable / Dev)
# Clone the repository
git clone https://github.com/perinban/MetaXuda.git
cd MetaXuda
# Install in editable mode
pip install -e .
The installation places the required shim libraries (libcudart.dylib, libcuda.dylib, and libdevice.bc) inside the package so they can be discovered by Numba at runtime.
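Discovery is handled automatically by env.py and patch.py, but when debugging it can help to point Numba at the shim explicitly. A sketch using Numba's standard NUMBA_CUDA_DRIVER override; the shim path is an assumption derived from the package layout below:

import os
import metaxuda

# NUMBA_CUDA_DRIVER is Numba's documented override for the driver library
# path; the path below assumes the default in-package location of the shim.
shim = os.path.join(os.path.dirname(metaxuda.__file__), "native", "libcuda.dylib")
os.environ["NUMBA_CUDA_DRIVER"] = shim

from numba import cuda
print(cuda.is_available())  # should report True once the shim is picked up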
📁 Package Layout
MetaXuda ships demos and helper modules inside the Python package so they are available in editable and installed modes:
metaxuda/
├── buffers/       # GPU, managed, and tiered buffer abstractions
├── execution/     # Direct and pooled execution backends
├── streams/       # Stream and async execution helpers (Numba-compatible)
├── demos/         # End-to-end demos and debug examples
├── native/        # Native shims and pipelines
│   ├── libcudart.dylib
│   ├── libcuda.dylib
│   ├── libnvvm.dylib
│   ├── libdevice.bc
│   └── cuda_pipeline.so
├── env.py         # Environment detection and setup
├── patch.py       # Numba / runtime patching hooks
└── __init__.py
The demos/ directory contains runnable examples covering kernel execution, buffers, streams, disk tiering, and the direct math pipeline; they can be run directly once the package is installed.

🚀 Usage

With MetaXuda installed, existing Numba CUDA code should run without modification:
from numba import cuda
import numpy as np

@cuda.jit
def add(a, b, out):
    i = cuda.grid(1)
    if i < out.size:
        out[i] = a[i] + b[i]

n = 1024
a = np.arange(n, dtype=np.float32)
b = np.arange(n, dtype=np.float32)
out = np.zeros_like(a)

add[32, 32](a, b, out)  # 32 blocks x 32 threads = 1024 threads
print(out[:5])
Execution is transparently dispatched to Metal via the MetaXuda runtime.
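The stream-style overlap listed under Features can be driven through Numba's standard stream API, which the streams/ helpers are described as compatible with. A sketch of the copy / compute / copy pattern, assuming the shim honors Numba's stream semantics:

from numba import cuda
import numpy as np

@cuda.jit
def scale(x, out):
    i = cuda.grid(1)
    if i < out.size:
        out[i] = x[i] * 2.0

n = 1 << 20
a = np.arange(n, dtype=np.float32)
out = np.zeros_like(a)

stream = cuda.stream()
d_a = cuda.to_device(a, stream=stream)                # async host-to-device copy
d_out = cuda.device_array_like(out, stream=stream)
scale[(n + 255) // 256, 256, stream](d_a, d_out)      # compute on the same stream
d_out.copy_to_host(out, stream=stream)                # async device-to-host copy
stream.synchronize()                                  # wait for copy / compute / copy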
🗜️ Quantization, Compression, and Disk Tiering
MetaXuda supports quantized and compressed data storage for non-resident buffers and intermediate results. These behaviors are controlled via environment variables and handled by the runtime initialization logic in env.py.
This is primarily used for Tier-3 (disk-backed) storage, allowing large workloads to exceed GPU memory limits while minimizing I/O and storage overhead.
Environment Configuration
The shim reads the following environment variables at startup:
- MX_ENABLE_DATASTORE_COMPRESSION (default: 1)
  Enable or disable compression for spilled data blocks.
- MX_DATASTORE_COMPRESSION_TYPE (default: lz4)
  Compression algorithm to use (e.g. lz4).
- MX_DATASTORE_COMPRESSION_LEVEL (default: 3)
  Compression level passed to the backend compressor.
- MX_DISK_PARALLELISM_LEVEL (default: auto)
  Controls parallel read/write behavior for disk operations.
- MX_DISK_SPILL_ENABLED (default: 0)
  Enable spilling GPU buffers to disk when memory pressure occurs.
- MX_TIER3_STRATEGY (default: prefer_external)
  Strategy for selecting Tier-3 storage locations.
- MX_TIER3_INTERNAL_PATH (default: block_store)
  Directory used for internal Tier-3 storage.
- MX_TIER3_EXTERNAL_DEVICES (format: id:path,id:path)
  Comma-separated list of external devices or paths for Tier-3 storage.
- MX_DEBUG (options: memory)
  Enable debug logging for specific subsystems.
These settings allow fine-grained control over compression, quantization, disk spill behavior, and debugging without changing application code.
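Since env.py reads these variables at startup, they should be set before the runtime initializes. A sketch enabling disk spill with lz4 compression; the external device value is a hypothetical example of the documented id:path format:

import os

# Configure Tier-3 tiering before the runtime reads its environment.
os.environ["MX_DISK_SPILL_ENABLED"] = "1"
os.environ["MX_ENABLE_DATASTORE_COMPRESSION"] = "1"
os.environ["MX_DATASTORE_COMPRESSION_TYPE"] = "lz4"
os.environ["MX_DATASTORE_COMPRESSION_LEVEL"] = "3"
os.environ["MX_TIER3_STRATEGY"] = "prefer_external"
os.environ["MX_TIER3_EXTERNAL_DEVICES"] = "ssd0:/Volumes/Scratch/mx_tier3"  # hypothetical device/path
os.environ["MX_DEBUG"] = "memory"

import metaxuda  # env.py picks up the configuration at startup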
🧮 Operation Coverage
MetaXuda includes a precompiled Metal math pipeline (cuda_pipeline.so) implementing a broad set of scalar and elementwise operations that can be invoked directly by Numba or higher-level tooling.
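The exported symbols of cuda_pipeline.so are not listed here, so the following is purely illustrative: mx_elementwise_add is a hypothetical export name standing in for whatever the real low-level API exposes, and the load path assumes the default package layout.

import os
import ctypes
import numpy as np
import metaxuda

# Hypothetical sketch: load the math pipeline and dispatch an elementwise op
# directly, bypassing the CUDA runtime shim. The export name is a placeholder.
libdir = os.path.join(os.path.dirname(metaxuda.__file__), "native")
pipeline = ctypes.CDLL(os.path.join(libdir, "cuda_pipeline.so"))

n = 1024
a = np.arange(n, dtype=np.float32)
b = np.ones(n, dtype=np.float32)
out = np.empty(n, dtype=np.float32)

pipeline.mx_elementwise_add(                # hypothetical export name
    a.ctypes.data_as(ctypes.c_void_p),
    b.ctypes.data_as(ctypes.c_void_p),
    out.ctypes.data_as(ctypes.c_void_p),
    ctypes.c_size_t(n),
)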
MetaXuda is not affiliated with NVIDIA. CUDA is a trademark of NVIDIA Corporation. This project is an independent compatibility layer intended for research and development purposes.