Kreuzberg

Extract text, metadata, and code intelligence from 91+ file formats and 306 programming languages at native speeds without needing a GPU.

Key Features

Code intelligence – Extract functions, classes, imports, symbols, and docstrings from 306 programming languages via tree-sitter. Results are returned in ExtractionResult.code_intelligence, with semantic chunking
Extensible architecture – Plugin system for custom OCR backends, validators, post-processors, document extractors, and renderers
Polyglot – Native bindings for Rust, Python, TypeScript/Node.js, Ruby, Go, Java, Kotlin, C#, PHP, Elixir, R, Dart, Swift, Zig, and C
91+ file formats – PDF, Office documents, images, HTML, XML, emails, archives, and academic formats across 8 categories
LLM intelligence – VLM OCR (GPT-4o, Claude, Gemini, Ollama), structured JSON extraction with schema constraints, and provider-hosted embeddings via 143 LLM providers (including local engines: Ollama, LM Studio, vLLM, llama.cpp) through liter-llm
OCR support – Tesseract (all bindings, including Tesseract-WASM for browsers), PaddleOCR (all native bindings), EasyOCR (Python), VLM OCR (143 vision model providers including local engines), extensible via plugin API
High performance – Rust core with pure-Rust PDF, SIMD optimizations and full parallelism
Flexible deployment – Use as library, CLI tool, REST API server, or MCP server
TOON wire format – Token-efficient serialization for LLM/RAG pipelines, ~30-50% fewer tokens than JSON
GFM-quality output – Comrak-based rendering with proper fenced code blocks, table nodes, bracket escaping, and cross-format parity (Markdown, HTML, Djot, Plain)
HTML passthrough – HTML-to-Markdown conversion uses html-to-markdown output directly, bypassing lossy intermediate round-trips
Memory efficient – Streaming parsers for multi-GB files
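
The token savings listed for the TOON wire format come largely from stating field names once rather than repeating them for every record. The sketch below is not Kreuzberg's serializer and does not implement the full TOON specification. It is a simplified, hypothetical encoder for uniform lists of flat records, and the `chunks` data is invented for the example. It compares character counts, which are only a rough proxy for tokens.

```python
import json


def encode_tabular(name, rows):
    """Encode a uniform list of flat dicts in a TOON-like tabular layout.

    Simplified illustration only: no quoting, escaping, or nesting.
    """
    fields = list(rows[0].keys())
    # Field names appear once in the header instead of once per record.
    header = f"{name}[{len(rows)}]{{{','.join(fields)}}}:"
    lines = ["  " + ",".join(str(row[f]) for f in fields) for row in rows]
    return "\n".join([header, *lines])


# Hypothetical chunk metadata, invented for this example.
chunks = [
    {"id": 1, "page": 1, "chars": 812},
    {"id": 2, "page": 1, "chars": 790},
    {"id": 3, "page": 2, "chars": 845},
]

as_json = json.dumps({"chunks": chunks})
as_tabular = encode_tabular("chunks", chunks)

print(as_tabular)
print(len(as_json), len(as_tabular))
```

The gap widens as the number of records grows, because JSON repeats every key per record while the tabular form adds only the values.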

Complete Documentation | Live Demo | Installation Guides

Installation

Each language binding provides comprehensive documentation with examples and best practices. Choose your platform to get started:

Scripting Languages:

Python – PyPI package, async/sync APIs, OCR backends (Tesseract, PaddleOCR, EasyOCR)
Ruby – RubyGems package, idiomatic Ruby API, native bindings
PHP – Composer package, modern PHP 8.2+ support, type-safe API, async extraction
Elixir – Hex package, OTP integration, concurrent processing
R – r-universe package, idiomatic R API, extendr bindings
Dart / Flutter – pub.dev package, flutter_rust_bridge runtime, native bindings for macOS/iOS/Android/Linux/Windows

JavaScript/TypeScript:

@kreuzberg/node – Native NAPI-RS bindings for Node.js/Bun, fastest performance
@kreuzberg/wasm – WebAssembly for browsers/Deno/Cloudflare Workers, comprehensive format and OCR support (PDF, Excel, archives, all office formats, real Tesseract via the WASI build) — only ORT-dependent features (paddle-ocr, layout detection, embeddings, auto-rotate) and server modes (api/mcp/cli) are excluded

Compiled Languages:

Go – Go module with FFI bindings, context-aware async
Java – Maven Central, Foreign Function & Memory API
Kotlin – Maven Central, Kotlin/JVM with idiomatic data classes, sealed enums, and coroutine-based async
C# – NuGet package, .NET 6.0+, full async/await support
Swift – Swift Package Manager, macOS 13+/iOS 16+, native Swift types and async/await

Native:

Rust – Core library, flexible feature flags, zero-copy APIs
Zig – zig fetch + build.zig.zon, idiomatic error sets, optional types, slice-based memory
C (FFI) – C header + shared library, pkg-config/CMake support, cross-platform

Containers:

Docker – Official images with API, CLI, and MCP server modes (Core: ~1.0-1.3GB, Full: ~1.0-1.3GB with OCR + legacy format support)

Command-Line:

CLI – Cross-platform binary, batch processing, MCP server mode

Most language bindings include precompiled binaries for x86_64 and aarch64 on Linux and for Apple Silicon on macOS; the Platform Support table lists the exceptions.

Platform Support

Architecture coverage by language binding:

Language	Linux x86_64	Linux aarch64	macOS ARM64	Windows x64
Python	✅	✅	✅	✅
Node.js	✅	✅	✅	✅
WASM	✅	✅	✅	✅
Ruby	✅	✅	✅	-
R	✅	✅	✅	✅
Elixir	✅	✅	✅	✅
Go	✅	✅	✅	✅
Java	✅	✅	✅	✅
Kotlin	✅	✅	✅	✅
C#	✅	✅	✅	✅
PHP	✅	✅	✅	✅
Swift	-	-	✅	-
Dart	✅	✅	✅	✅
Zig	✅	✅	✅	✅
Rust	✅	✅	✅	✅
C (FFI)	✅	✅	✅	✅
CLI	✅	✅	✅	✅
Docker	✅	✅	✅	-

Note: ✅ = Precompiled binaries available with instant installation; - = no precompiled binary for that platform. WASM runs in any environment with WebAssembly support (browsers, Deno, Bun, Cloudflare Workers). All platforms are tested in CI. macOS support is Apple Silicon only.

Mobile (iOS, Android)

Target	ORT-dependent features*
iOS (`aarch64-apple-ios`, `aarch64-apple-ios-sim`)	✅
Android arm64 (`aarch64-linux-android`)	✅
Android x86_64 emulator (`x86_64-linux-android`)	❌

*ORT-dependent features: PaddleOCR, layout detection, embeddings, auto-rotate. All non-ORT capabilities (Tesseract OCR, every document format, chunking, language detection, keywords, tree-sitter code intelligence, API/MCP, LLM) are available on all four mobile targets.

The x86_64-linux-android emulator triple has no upstream ORT prebuilt, so the kreuzberg crate exposes an android-target aggregate feature that selects the same no-ORT feature set as WASM. The kreuzberg-ffi and kreuzberg-dart crates auto-select that aggregate for the emulator via target-conditional dependencies, so host and arm64 phone builds get the full feature set automatically.

Browsers / Edge (WebAssembly)

WASM excludes the same ORT-dependent feature set as the Android x86_64 emulator. The shared no-ORT base lives behind the no-ort-target feature in the core crate; both wasm-target and android-target compose it.

Embeddings Support (Optional)

To use embeddings functionality:

1. Install ONNX Runtime 1.24+:
   - Linux: Download from ONNX Runtime releases (Debian packages may have older versions)
   - macOS: brew install onnxruntime
   - Windows: Download from ONNX Runtime releases
2. Use embeddings in your code; see the Embeddings Guide

Note: Kreuzberg requires ONNX Runtime version 1.24+ for embeddings. All other Kreuzberg features work without ONNX Runtime.
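
When scripting an environment check for the 1.24+ requirement, compare version components as numbers: as plain strings, "1.9" would sort after "1.24". The helper below is a minimal sketch, and `meets_minimum` is a name made up for this example, not part of Kreuzberg. How the version string is obtained depends on how ONNX Runtime was installed and is outside this sketch.

```python
import re


def meets_minimum(version, minimum=(1, 24)):
    """Return True if a dotted version string is at least `minimum`.

    Only the major and minor components are compared.
    """
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return (int(match.group(1)), int(match.group(2))) >= minimum


for candidate in ("1.9.2", "1.24.0", "1.30.1"):
    print(candidate, meets_minimum(candidate))
```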

Supported Formats

91+ file formats across 8 major categories with intelligent format detection and comprehensive metadata extraction.

Office Documents

Category	Formats	Capabilities
Word Processing	`.docx`, `.docm`, `.dotx`, `.dotm`, `.dot`, `.odt`, `.pages`	Full text, tables, lists, images, metadata, styles
Spreadsheets	`.xlsx`, `.xlsm`, `.xlsb`, `.xls`, `.xla`, `.xlam`, `.xltm`, `.xltx`, `.xlt`, `.ods`, `.numbers`	Sheet data, formulas, cell metadata, charts
Presentations	`.pptx`, `.pptm`, `.ppsx`, `.potx`, `.potm`, `.pot`, `.key`	Slides, speaker notes, images, metadata
PDF	`.pdf`	Text, tables, images, metadata, OCR support
eBooks	`.epub`, `.fb2`	Chapters, metadata, embedded resources
Database	`.dbf`	Table data extraction, field type support
Hangul	`.hwp`, `.hwpx`	Korean document format, text extraction

Images (OCR-Enabled)

Category	Formats	Features
Raster	`.png`, `.jpg`, `.jpeg`, `.gif`, `.webp`, `.bmp`, `.tiff`, `.tif`	OCR, table detection, EXIF metadata, dimensions, color space
Advanced	`.jp2`, `.jpx`, `.jpm`, `.mj2`, `.jbig2`, `.jb2`, `.pnm`, `.pbm`, `.pgm`, `.ppm`	Pure Rust decoders (JPEG 2000, JBIG2), OCR, table detection
Vector	`.svg`	DOM parsing, embedded text, graphics metadata

Web & Data

Category	Formats	Features
Markup	`.html`, `.htm`, `.xhtml`, `.xml`, `.svg`	DOM parsing, metadata (Open Graph, Twitter Card), link extraction
Structured Data	`.json`, `.yaml`, `.yml`, `.toml`, `.csv`, `.tsv`	Schema detection, nested structures, validation
Text & Markdown	`.txt`, `.md`, `.markdown`, `.djot`, `.mdx`, `.rst`, `.org`, `.rtf`	CommonMark, GFM, Djot, MDX, reStructuredText, Org Mode, Rich Text

Email & Archives

Category	Formats	Features
Email	`.eml`, `.msg`	Headers, body (HTML/plain), attachments, UTF-16 support
Archives	`.zip`, `.tar`, `.tgz`, `.gz`, `.7z`	Recursive extraction, nested archives, metadata

Academic & Scientific

Category	Formats	Features
Citations	`.bib`, `.ris`, `.nbib`, `.enw`, `.csl`	BibTeX/BibLaTeX, RIS, PubMed/MEDLINE, EndNote XML, CSL JSON
Scientific	`.tex`, `.latex`, `.typ`, `.typst`, `.jats`, `.ipynb`	LaTeX, Typst, JATS journal articles, Jupyter notebooks
Publishing	`.fb2`, `.docbook`, `.dbk`, `.opml`	FictionBook, DocBook XML, OPML outlines
Documentation	`.pod`, `.mdoc`, `.troff`	Perl POD, man pages, troff

Complete Format Reference →

Code Intelligence (306 Languages)

Feature	Description
Structure Extraction	Functions, classes, methods, structs, interfaces, enums
Import/Export Analysis	Module dependencies, re-exports, wildcard imports
Symbol Extraction	Variables, constants, type aliases, properties
Docstring Parsing	Google, NumPy, Sphinx, JSDoc, RustDoc, and 10+ formats
Diagnostics	Parse errors with line/column positions
Syntax-Aware Chunking	Split code by semantic boundaries, not arbitrary byte offsets
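
To illustrate what splitting on semantic boundaries means, here is a minimal sketch that uses Python's standard-library ast module in place of tree-sitter. It is not Kreuzberg's implementation, it handles only Python source, and `chunk_by_definition` is a name invented for the example.

```python
import ast


def chunk_by_definition(source):
    """Return one (name, text) chunk per top-level function or class."""
    lines = source.splitlines()
    chunks = []
    for node in ast.parse(source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # lineno is 1-based; end_lineno is inclusive (Python 3.8+).
            text = "\n".join(lines[node.lineno - 1 : node.end_lineno])
            chunks.append((node.name, text))
    return chunks


SOURCE = '''\
def load(path):
    with open(path) as fh:
        return fh.read()


class Parser:
    def parse(self, text):
        return text.split()
'''

for name, text in chunk_by_definition(SOURCE):
    print(f"--- {name} ({len(text.splitlines())} lines)")
```

Each chunk is a complete definition, so a function body is never cut in half the way a fixed byte window might cut it.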

Key Features

<details> <summary><strong>OCR with Table Extraction</strong></summary>

Multiple OCR backends (Tesseract, EasyOCR, PaddleOCR) with intelligent table detection and reconstruction. Extract structured data from scanned documents and images with configurable accuracy thresholds.

OCR Backend Documentation →

</details> <details> <summary><strong>Batch Processing</strong></summary>

Process multiple documents concurrently with configurable parallelism. Optimize throughput for large-scale document processing workloads with automatic resource management.

Batch Processing Guide →

</details> <details> <summary><strong>Password-Protected PDFs</strong></summary>

Handle encrypted PDFs with single or multiple password attempts. Supports both RC4 and AES encryption with automatic fallback strategies.

PDF Configuration →

</details> <details> <summary><strong>Language Detection</strong></summary>

Automatic language detection in extracted text using fast-langdetect. Configure confidence thresholds and access per-language statistics.

Language Detection Guide →

</details> <details> <summary><strong>Metadata Extraction</strong></summary>

Extract comprehensive metadata from all supported formats: authors, titles, creation dates, page counts, EXIF data, and format-specific properties.

Metadata Guide →

</details>
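
The Batch Processing feature above describes concurrent extraction with configurable parallelism. The general pattern can be sketched with Python's standard library. This is an illustration only and does not use Kreuzberg's batch API: `extract_one` is a stand-in for a real extraction call, and `extract_batch` and `max_workers` are names chosen for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def extract_one(path):
    """Stand-in for a real extraction call; returns a fake result."""
    return {"path": path, "chars": len(path)}


def extract_batch(paths, max_workers=4):
    """Process paths concurrently, preserving input order in the results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the order of the input paths.
        return list(pool.map(extract_one, paths))


results = extract_batch(["a.pdf", "report.docx", "scan.png"], max_workers=2)
for item in results:
    print(item["path"], item["chars"])
```

Capping `max_workers` bounds how many documents are in flight at once, which is the knob a configurable-parallelism setting typically exposes.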

AI Coding Assistants

Kreuzberg ships with an Agent Skill that teaches AI coding assistants how to use the library correctly. It works with Claude Code, Codex, Gemini CLI, Cursor, VS Code, Amp, Goose, Roo Code, and any tool supporting the Agent Skills standard.

Install the skill into any project using the Vercel Skills CLI:

npx skills add kreuzberg-dev/kreuzberg

The skill is located at skills/kreuzberg/SKILL.md and is automatically discovered by supported AI coding tools once installed.

Documentation

Installation Guide – Setup and dependencies
User Guide – Comprehensive usage guide
API Reference – Complete API documentation
Format Support – Supported file formats
OCR Backends – OCR engine setup
CLI Guide – Command-line usage
Migration Guides – Upgrading from other libraries

Contributing

Contributions are welcome! See CONTRIBUTING.md for guidelines.

Part of Kreuzberg.dev

Kreuzberg Cloud — managed extraction API with SDKs, dashboards, and observability.
kreuzcrawl — web crawling and scraping with HTML→Markdown and headless-Chrome fallback.
html-to-markdown — fast, lossless HTML→Markdown engine.
liter-llm — universal LLM API client with native bindings for 14 languages and 143 providers.
tree-sitter-language-pack — tree-sitter grammars and code-intelligence primitives.
alef — the polyglot binding generator that produces all per-language bindings.
Discord — community, roadmap, announcements.

License

Elastic License 2.0 (ELv2) - see LICENSE for details. See https://www.elastic.co/licensing/elastic-license for the full license text.
