<p align="center">
  <picture>
    <source media="(prefers-color-scheme: dark)" srcset="docs/images/icon-rounded-dark.svg" width="140">
    <source media="(prefers-color-scheme: light)" srcset="docs/images/icon-rounded-light.svg" width="140">
    <img alt="oMLX" src="docs/images/icon-rounded-light.svg" width="140">
  </picture>
</p>
<h1 align="center">oMLX</h1>
<p align="center"><b>LLM inference, optimized for your Mac</b><br>Continuous batching and tiered KV caching, managed directly from your menu bar.</p>
<p align="center">
  <a href="https://www.buymeacoffee.com/jundot"><img src="https://cdn.buymeacoffee.com/buttons/v2/default-yellow.png" alt="Buy Me A Coffee" height="40"></a>
</p>
<p align="center">
  <img src="https://img.shields.io/badge/license-Apache%202.0-blue" alt="License">
  <img src="https://img.shields.io/badge/python-3.10+-green" alt="Python 3.10+">
  <img src="https://img.shields.io/badge/platform-Apple%20Silicon-black?logo=apple" alt="Apple Silicon">
</p>
<p align="center">
  <a href="mailto:junkim.dot@gmail.com">junkim.dot@gmail.com</a> ·
  <a href="https://omlx.ai/me">https://omlx.ai/me</a>
</p>
<p align="center">
  <a href="#install">Install</a> ·
  <a href="#quickstart">Quickstart</a> ·
  <a href="#features">Features</a> ·
  <a href="#models">Models</a> ·
  <a href="#cli-configuration">CLI Configuration</a> ·
  <a href="https://omlx.ai/benchmarks">Benchmarks</a> ·
  <a href="https://omlx.ai">oMLX.ai</a>
</p>
<p align="center">
  <b>English</b> ·
  <a href="README.zh.md">中文</a> ·
  <a href="README.ko.md">한국어</a> ·
  <a href="README.ja.md">日本語</a>
</p>
<p align="center"> <img src="docs/images/omlx_dashboard.png" alt="oMLX Admin Dashboard" width="800"> </p>

Every LLM server I tried made me choose between convenience and control. I wanted to pin everyday models in memory, auto-swap heavier ones on demand, set context limits - and manage it all from a menu bar.

oMLX persists the KV cache across a hot in-memory tier and a cold SSD tier: even when the context changes mid-conversation, all past context stays cached and reusable across requests, making local LLMs practical for real coding work with tools like Claude Code. That's why I built it.

Install

macOS App

Download the .dmg from Releases, drag to Applications, done. The app includes in-app auto-update, so future upgrades are just one click.

Homebrew

brew tap jundot/omlx https://github.com/jundot/omlx
brew install omlx

# Upgrade to the latest version
brew update && brew upgrade omlx

# Run as a background service (auto-restarts on crash)
brew services start omlx

# Optional: MCP (Model Context Protocol) support
/opt/homebrew/opt/omlx/libexec/bin/pip install mcp

From Source

git clone https://github.com/jundot/omlx.git
cd omlx
pip install -e .          # Core only
pip install -e ".[mcp]"   # With MCP (Model Context Protocol) support

Requires macOS 15.0+ (Sequoia), Python 3.10+, and Apple Silicon (M1/M2/M3/M4).

Quickstart

macOS App

Launch oMLX from your Applications folder. The Welcome screen guides you through three steps - model directory, server start, and first model download. That's it. To connect OpenClaw, OpenCode, or Codex, see Integrations.

<p align="center"> <img src="docs/images/Screenshot 2026-02-10 at 00.36.32.png" alt="oMLX Welcome Screen" width="360"> <img src="docs/images/Screenshot 2026-02-10 at 00.34.30.png" alt="oMLX Menubar" width="240"> </p>

CLI

omlx serve --model-dir ~/models

The server discovers LLMs, VLMs, embedding models, and rerankers from subdirectories automatically. Any OpenAI-compatible client can connect to http://localhost:8000/v1. A built-in chat UI is also available at http://localhost:8000/admin/chat.
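For example, a minimal stdlib-only client sketch (the helper names and the model name are illustrative placeholders, not part of oMLX):

```python
import json
import urllib.request

OMLX_BASE = "http://localhost:8000/v1"  # default oMLX address

def build_chat_request(model: str, prompt: str, stream: bool = False):
    """Build (url, body) for an OpenAI-style chat completion call."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": stream,
    }
    return f"{OMLX_BASE}/chat/completions", json.dumps(payload).encode()

def chat(model: str, prompt: str) -> str:
    """POST to a running oMLX server and return the assistant's reply text."""
    url, body = build_chat_request(model, prompt)
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    return data["choices"][0]["message"]["content"]

# usage, against a running server (model name is an example):
#   print(chat("Qwen3-Coder-Next-8bit", "Hello!"))
```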

Homebrew Service

If you installed via Homebrew, you can run oMLX as a managed background service:

brew services start omlx    # Start (auto-restarts on crash)
brew services stop omlx     # Stop
brew services restart omlx  # Restart
brew services info omlx     # Check status

The service runs omlx serve with zero-config defaults (~/.omlx/models, port 8000). To customize, either set environment variables (OMLX_MODEL_DIR, OMLX_PORT, etc.) or run omlx serve --model-dir /your/path once to persist settings to ~/.omlx/settings.json.
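A sketch of how such environment overrides typically resolve against the zero-config defaults (the variable names come from the text above; the exact resolution logic inside oMLX isn't shown here):

```python
import os

def resolve_service_config(env=None):
    """Zero-config defaults, overridable via OMLX_* environment variables."""
    env = os.environ if env is None else env
    return {
        "model_dir": env.get("OMLX_MODEL_DIR", os.path.expanduser("~/.omlx/models")),
        "port": int(env.get("OMLX_PORT", "8000")),
    }
```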

Logs are written to two locations:

  • Service log: $(brew --prefix)/var/log/omlx.log (stdout/stderr)
  • Server log: ~/.omlx/logs/server.log (structured application log)

Features

Supports text LLMs, vision-language models (VLM), OCR models, embeddings, and rerankers on Apple Silicon.

Admin Dashboard

Web UI at /admin for real-time monitoring, model management, chat, benchmark, and per-model settings. Supports English, Korean, Japanese, and Chinese. All CDN dependencies are vendored for fully offline operation.

<p align="center"> <img src="docs/images/Screenshot 2026-02-10 at 00.45.34.png" alt="oMLX Admin Dashboard" width="720"> </p>

Vision-Language Models

Run VLMs with the same continuous batching and tiered KV cache stack as text LLMs. Supports multi-image chat, base64/URL/file image inputs, and tool calling with vision context. OCR models (DeepSeek-OCR, DOTS-OCR, GLM-OCR) are auto-detected with optimized prompts.

Tiered KV Cache (Hot + Cold)

Block-based KV cache management inspired by vLLM, with prefix sharing and Copy-on-Write. The cache operates across two tiers:

  • Hot tier (RAM): Frequently accessed blocks stay in memory for fast access.
  • Cold tier (SSD): When the hot cache fills up, blocks are offloaded to SSD in safetensors format. On the next request with a matching prefix, they're restored from disk instead of recomputed from scratch - even after a server restart.
<p align="center"> <img src="docs/images/omlx_hot_cold_cache.png" alt="oMLX Hot & Cold Cache" width="720"> </p>
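The hot/cold flow can be sketched as a toy two-tier cache (`TieredBlockCache` is a hypothetical name; real oMLX blocks hold KV tensors keyed by prefix, with CoW sharing and safetensors files on disk, all omitted here):

```python
from collections import OrderedDict

class TieredBlockCache:
    """Toy two-tier KV-block cache: an LRU hot tier spills to a cold store.

    The cold tier is a dict standing in for SSD-backed safetensors files.
    """

    def __init__(self, hot_capacity: int):
        self.hot = OrderedDict()   # block_hash -> block (RAM)
        self.cold = {}             # block_hash -> block ("SSD")
        self.hot_capacity = hot_capacity

    def put(self, block_hash, block):
        self.hot[block_hash] = block
        self.hot.move_to_end(block_hash)
        while len(self.hot) > self.hot_capacity:
            # hot tier is full: offload the least-recently-used block
            evicted_hash, evicted = self.hot.popitem(last=False)
            self.cold[evicted_hash] = evicted

    def get(self, block_hash):
        if block_hash in self.hot:
            self.hot.move_to_end(block_hash)   # refresh LRU position
            return self.hot[block_hash]
        if block_hash in self.cold:
            block = self.cold.pop(block_hash)  # restore from "disk"
            self.put(block_hash, block)
            return block
        return None                            # miss: caller must recompute
```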

Continuous Batching

Handles concurrent requests through mlx-lm's BatchGenerator. Prefill and completion batch sizes are configurable.
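The scheduling idea can be illustrated with a toy loop (not the mlx-lm BatchGenerator itself): finished sequences free their slot immediately, and queued requests join the batch mid-flight rather than waiting for the whole batch to drain.

```python
def continuous_batching(requests, max_batch: int):
    """Toy scheduler. Each request is (id, n_tokens_to_generate);
    returns the order in which requests complete."""
    queue = list(requests)
    active = {}        # request id -> remaining tokens
    finished = []
    while queue or active:
        # admit queued requests into any free batch slots
        while queue and len(active) < max_batch:
            rid, n = queue.pop(0)
            active[rid] = n
        # one decode step: every active sequence emits one token
        for rid in list(active):
            active[rid] -= 1
            if active[rid] == 0:
                del active[rid]        # slot freed immediately
                finished.append(rid)
    return finished
```

With `max_batch=2`, a short request queued behind a long one finishes without waiting for the long one, which is the point of continuous batching.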

Claude Code Optimization

Context scaling support for running smaller-context models with Claude Code. Reported token counts are scaled so that auto-compact triggers at the right time, and SSE keep-alive prevents read timeouts during long prefill.
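As a rough sketch of the scaling idea (my reading of the mechanism, not oMLX's actual code): reported counts are multiplied by the ratio of the context window the client assumes to the model's real window, so the client's auto-compact fires before the real window overflows.

```python
def scale_reported_tokens(actual_tokens: int, real_ctx: int, assumed_ctx: int) -> int:
    """Scale usage so a client budgeting against `assumed_ctx` compacts
    when the model's `real_ctx` is actually near full. Illustrative only."""
    return int(actual_tokens * (assumed_ctx / real_ctx))
```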

Multi-Model Serving

Load LLMs, VLMs, embedding models, and rerankers within the same server. Models are managed through a combination of automatic and manual controls:

  • LRU eviction: Least-recently-used models are evicted automatically when memory runs low.
  • Manual load/unload: Interactive status badges in the admin panel let you load or unload models on demand.
  • Model pinning: Pin frequently used models to keep them always loaded.
  • Per-model TTL: Set an idle timeout per model to auto-unload after a period of inactivity.
  • Process memory enforcement: Total memory limit (default: system RAM - 8GB) prevents system-wide OOM.
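The controls above can be combined in a toy pool like this (`ModelPool` is a hypothetical stand-in for oMLX's EnginePool, illustrative only):

```python
import time
from collections import OrderedDict

class ModelPool:
    """Toy pool: LRU eviction skips pinned models; idle models past their
    per-model TTL are swept."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.models = OrderedDict()  # name -> {"pinned", "ttl", "last_used"}

    def load(self, name, pinned=False, ttl=None, now=None):
        now = time.monotonic() if now is None else now
        self.models[name] = {"pinned": pinned, "ttl": ttl, "last_used": now}
        self.models.move_to_end(name)
        self._evict_lru()

    def _evict_lru(self):
        # evict least-recently-used unpinned models while over capacity
        for name in list(self.models):
            if len(self.models) <= self.capacity:
                break
            if not self.models[name]["pinned"]:
                del self.models[name]

    def sweep_ttl(self, now=None):
        # unload idle models whose per-model TTL has expired
        now = time.monotonic() if now is None else now
        for name, m in list(self.models.items()):
            if m["ttl"] is not None and now - m["last_used"] > m["ttl"]:
                del self.models[name]
```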

Per-Model Settings

Configure sampling parameters, chat template kwargs, TTL, model alias, model type override, and more per model directly from the admin panel. Changes apply immediately without server restart.

  • Model alias: set a custom API-visible name. /v1/models returns the alias, and requests accept both the alias and directory name.
  • Model type override: manually set a model as LLM or VLM regardless of auto-detection.
<p align="center"> <img src="docs/images/omlx_ChatTemplateKwargs.png" alt="oMLX Chat Template Kwargs" width="480"> </p>

Built-in Chat

Chat directly with any loaded model from the admin panel. Supports conversation history, model switching, dark mode, reasoning model output, and image upload for VLM/OCR models.

<p align="center"> <img src="docs/images/ScreenShot_2026-03-14_104350_610.png" alt="oMLX Chat" width="720"> </p>

Model Downloader

Search and download MLX models from HuggingFace directly in the admin dashboard. Browse model cards, check file sizes, and download with one click.

<p align="center"> <img src="docs/images/downloader_omlx.png" alt="oMLX Model Downloader" width="720"> </p>

Integrations

Set up OpenClaw, OpenCode, and Codex directly from the admin dashboard with a single click. No manual config editing required.

<p align="center"> <img src="docs/images/omlx_integrations.png" alt="oMLX Integrations" width="720"> </p>

Performance Benchmark

One-click benchmarking from the admin panel. Measures prefill (PP) and text generation (TG) tokens per second, with partial prefix cache hit testing for realistic performance numbers.
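PP and TG numbers reduce to tokens over wall-clock time per phase; the figures below are illustrative, not measured:

```python
def throughput(n_tokens: int, seconds: float) -> float:
    """Tokens per second for one benchmark phase."""
    return n_tokens / seconds

# Example (made-up numbers): 4096 prompt tokens prefilled in 8s,
# 256 tokens generated in 16s
pp_tps = throughput(4096, 8.0)   # prefill (PP): 512.0 tok/s
tg_tps = throughput(256, 16.0)   # generation (TG): 16.0 tok/s
```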

<p align="center"> <img src="docs/images/benchmark_omlx.png" alt="oMLX Benchmark Tool" width="720"> </p>

macOS Menubar App

Native PyObjC menubar app (not Electron). Start, stop, and monitor the server without opening a terminal. Includes persistent serving stats (survives restarts), auto-restart on crash, and in-app auto-update.

<p align="center"> <img src="docs/images/Screenshot 2026-02-10 at 00.51.54.png" alt="oMLX Menubar Stats" width="400"> </p>

API Compatibility

Drop-in replacement for OpenAI and Anthropic APIs. Supports streaming usage stats (stream_options.include_usage), Anthropic adaptive thinking, and vision inputs (base64, URL).

| Endpoint | Description |
| --- | --- |
| `POST /v1/chat/completions` | Chat completions (streaming) |
| `POST /v1/completions` | Text completions (streaming) |
| `POST /v1/messages` | Anthropic Messages API |
| `POST /v1/embeddings` | Text embeddings |
| `POST /v1/rerank` | Document reranking |
| `GET /v1/models` | List available models |

Tool Calling & Structured Output

Supports all function calling formats available in mlx-lm, JSON schema validation, and MCP tool integration. Tool calling requires the model's chat template to support the tools parameter. The following model families are auto-detected via mlx-lm's built-in tool parsers:

| Model Family | Format |
| --- | --- |
| Llama, Qwen, DeepSeek, etc. | JSON `<tool_call>` |
| Qwen3.5 Series | XML `<function=...>` |
| Gemma | `<start_function_call>` |
| GLM (4.7, 5) | `<arg_key>`/`<arg_value>` XML |
| MiniMax | Namespaced `<minimax:tool_call>` |
| Mistral | `[TOOL_CALLS]` |
| Kimi K2 | `<\|tool_calls_section_begin\|>` |
| Longcat | `<longcat_tool_call>` |

Models not listed above may still work if their chat template accepts tools and their output uses a recognized <tool_call> XML format. For tool-enabled streaming, assistant text is emitted incrementally while known tool-call control markup is suppressed from visible content; structured tool calls are emitted after parsing the completed turn.
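A simplified sketch of that post-turn parsing, assuming the plain `<tool_call>` wrapper format (the real per-family parsers in mlx-lm differ):

```python
import re

TOOL_RE = re.compile(r"<tool_call>.*?</tool_call>", re.DOTALL)

def split_visible_and_tool_calls(completed_turn: str):
    """After a turn completes, separate the visible assistant text from
    structured tool-call payloads wrapped in <tool_call> markup."""
    tool_calls = TOOL_RE.findall(completed_turn)
    visible = TOOL_RE.sub("", completed_turn).strip()
    return visible, tool_calls
```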

Models

Point --model-dir at a directory containing MLX-format model subdirectories. Two-level organization folders (e.g., mlx-community/model-name/) are also supported.

~/models/
├── Step-3.5-Flash-8bit/
├── Qwen3-Coder-Next-8bit/
├── gpt-oss-120b-MXFP4-Q8/
├── Qwen3.5-122B-A10B-4bit/
└── bge-m3/
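Discovery along these lines can be sketched as follows (assuming, as is common for MLX models, that a model directory contains a `config.json`; the actual detection logic is oMLX's own):

```python
from pathlib import Path

def discover_model_dirs(model_dir: str):
    """Find model directories one or two levels below model_dir
    (e.g. ~/models/bge-m3/ or ~/models/mlx-community/model-name/)."""
    root = Path(model_dir).expanduser()
    found = []
    for pattern in ("*", "*/*"):
        for path in sorted(root.glob(pattern)):
            if path.is_dir() and (path / "config.json").is_file():
                found.append(path)
    return found
```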

Models are auto-detected by type. You can also download models directly from the admin dashboard.

| Type | Models |
| --- | --- |
| LLM | Any model supported by mlx-lm |
| VLM | Qwen3.5 Series, GLM-4V, Pixtral, and other mlx-vlm models |
| OCR | DeepSeek-OCR, DOTS-OCR, GLM-OCR |
| Embedding | BERT, BGE-M3, ModernBERT |
| Reranker | ModernBERT, XLM-RoBERTa |

CLI Configuration

# Memory limit for loaded models
omlx serve --model-dir ~/models --max-model-memory 32GB

# Process-level memory limit (default: auto = RAM - 8GB)
omlx serve --model-dir ~/models --max-process-memory 80%

# Enable SSD cache for KV blocks
omlx serve --model-dir ~/models --paged-ssd-cache-dir ~/.omlx/cache

# Set in-memory hot cache size
omlx serve --model-dir ~/models --hot-cache-max-size 20%

# Adjust batch sizes
omlx serve --model-dir ~/models --prefill-batch-size 8 --completion-batch-size 32

# With MCP tools
omlx serve --model-dir ~/models --mcp-config mcp.json

# HuggingFace mirror endpoint (for restricted regions)
omlx serve --model-dir ~/models --hf-endpoint https://hf-mirror.com

# API key authentication
omlx serve --model-dir ~/models --api-key your-secret-key
# Localhost-only: skip verification via admin panel global settings

All settings can also be configured from the web admin panel at /admin. Settings are persisted to ~/.omlx/settings.json, and CLI flags take precedence.
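That precedence rule amounts to something like the following (hypothetical helper, illustration only):

```python
def effective_settings(file_settings: dict, cli_flags: dict) -> dict:
    """Merge persisted settings with CLI flags: flags that were actually
    passed (non-None) win over values from ~/.omlx/settings.json."""
    merged = dict(file_settings)
    merged.update({k: v for k, v in cli_flags.items() if v is not None})
    return merged
```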

<details> <summary>Architecture</summary>
FastAPI Server (OpenAI / Anthropic API)
    │
    ├── EnginePool (multi-model, LRU eviction, TTL, manual load/unload)
    │   ├── BatchedEngine (LLMs, continuous batching)
    │   ├── VLMEngine (vision-language models)
    │   ├── EmbeddingEngine
    │   └── RerankerEngine
    │
    ├── ProcessMemoryEnforcer (total memory limit, TTL checks)
    │
    ├── Scheduler (FCFS, configurable batch sizes)
    │   └── mlx-lm BatchGenerator
    │
    └── Cache Stack
        ├── PagedCacheManager (GPU, block-based, CoW, prefix sharing)
        ├── Hot Cache (in-memory tier, write-back)
        └── PagedSSDCacheManager (SSD cold tier, safetensors format)
</details>

Development

CLI Server

git clone https://github.com/jundot/omlx.git
cd omlx
pip install -e ".[dev]"
pytest -m "not slow"

macOS App

Requires Python 3.11+ and venvstacks (pip install venvstacks).

cd packaging

# Full build (venvstacks + app bundle + DMG)
python build.py

# Skip venvstacks (code changes only)
python build.py --skip-venv

# DMG only
python build.py --dmg-only

See packaging/README.md for details on the app bundle structure and layer configuration.

Contributing

Contributions are welcome! See Contributing Guide for details.

  • Bug fixes and improvements
  • Performance optimizations
  • Documentation improvements

License

Apache 2.0

Acknowledgments

  • MLX and mlx-lm by Apple
  • mlx-vlm - Vision-language model inference on Apple Silicon
  • vllm-mlx - oMLX started from vllm-mlx v0.1.0 and evolved significantly with multi-model serving, tiered KV caching, VLM with full paged cache support, an admin panel, and a macOS menu bar app
  • venvstacks - Portable Python environment layering for the macOS app bundle
  • mlx-embeddings - Embedding model support for Apple Silicon
  • llm-compressor - Reference AWQ implementation for MoE models, used as design reference for oQ weight equalization
