Open source Apache-2.0 licensed

Turn documents into structured JSON — 100% locally.

ParseHawk extracts validated JSON from PDFs, scans, images, and text — without sending sensitive files to a third-party API. Drive it from the REST API, the CLI, or the web UI.

$ git clone https://github.com/parsehawk/parsehawk.git

Runs on macOS Apple Silicon or Linux with an NVIDIA GPU

From unstructured input to validated JSON

Point ParseHawk at a document, describe the shape you want back, and get clean, schema-checked data — without training a model.

Any document in

PDFs, scans, images, plain text, and Markdown all become structured JSON.

Your own schemas

Describe exactly the fields you want back with JSON Schema Draft 2020-12.

Zero-shot or few-shot

Start with instructions and a schema, then add examples when a type needs guidance.

Validated output

Every result is checked against your schema and stored as the canonical output in job.result.data.

Runs on your hardware

Runs vLLM on Linux with an NVIDIA GPU and vLLM Metal on Apple Silicon, so a server or your MacBook both work.

Private by default

Files, jobs, extractors, and results stay local. Nothing leaves your machine.
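
To make "Your own schemas" and "Validated output" concrete, here is a minimal sketch of what a schema and a schema check might look like. Everything in it is an assumption for illustration: the invoice field names, the INVOICE_SCHEMA constant, and the check_flat_object helper are hypothetical and are not part of ParseHawk. The helper covers only a tiny subset of JSON Schema (required, type, and additionalProperties on a flat object); a complete Draft 2020-12 validator, such as the one in the Python jsonschema package, does far more.

```python
# Hypothetical invoice schema in JSON Schema Draft 2020-12.
# Field names are illustrative only, not taken from ParseHawk.
INVOICE_SCHEMA = {
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "total": {"type": "number"},
        "currency": {"type": "string"},
    },
    "required": ["invoice_number", "total"],
    "additionalProperties": False,
}

# Maps the JSON Schema type names used above to Python types.
_PY_TYPES = {"string": str, "number": (int, float)}


def check_flat_object(data, schema):
    """Return a list of problems found in `data`.

    Illustrative helper: handles only `required`, `type`, and
    `additionalProperties` for a flat object, not the full specification.
    """
    if not isinstance(data, dict):
        return ["result is not an object"]
    problems = []
    props = schema.get("properties", {})
    for name in schema.get("required", []):
        if name not in data:
            problems.append(f"missing required field: {name}")
    for name, value in data.items():
        if name not in props:
            if schema.get("additionalProperties", True) is False:
                problems.append(f"unexpected field: {name}")
            continue
        expected = _PY_TYPES[props[name]["type"]]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            problems.append(f"wrong type for {name}")
    return problems


if __name__ == "__main__":
    good = {"invoice_number": "INV-001", "total": 129.5, "currency": "EUR"}
    bad = {"total": "129.50", "notes": "paid"}
    print(check_flat_object(good, INVOICE_SCHEMA))
    print(check_flat_object(bad, INVOICE_SCHEMA))
```

The first call reports no problems. The second reports a missing invoice_number, a total that is a string rather than a number, and an unexpected notes field, which is the kind of output a schema check is meant to catch before a result is stored.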

For teams working with documents that must stay private

ParseHawk is built for developers and teams handling sensitive files — the kind of data that simply should not leave your own infrastructure.

Developers

Wire extraction into apps, services, and agents through one local REST API.

Teams & ops

Stand up shared local extraction for invoices, receipts, and back-office files.

Regulated work

Keep medical, legal, and financial documents on infrastructure you fully control.

Common inputs

Invoices

Receipts

Contracts

Internal docs

Customer files

Medical records

Financial records

One local API, three ways to drive it

ParseHawk exposes a single local REST API. The CLI and web UI are clients of that same API — reach for whichever fits the job.

01

REST API

Programmatic extraction for apps, services, and agents.

02

CLI

Run the local stack and one-shot extractions from your shell.

03

Web UI

Upload, pick an extractor, run, and inspect the result.

Run document AI on your own hardware.

Clone the repo, install the CLI, and start extracting locally in minutes.

$ git clone https://github.com/parsehawk/parsehawk.git

$ cd parsehawk && uv tool install --editable .

$ parsehawk start

Built with open source