Extract

Extract content from documents in multiple formats.

POST /extract

Extract content from a document. Returns plain text, chunked output for RAG pipelines, or fully structured JSON with sections, tables, and metadata.

Request Body

{
  "file_id": "clx7abc123def",
  "format": "text"
}

or:

{
  "file_url": "https://example.com/report.pdf",
  "format": "structured"
}

Parameters

Field	Type	Required	Description
`file_url`	string	Yes*	Public URL of the source file
`file_id`	string	Yes*	File ID from `/upload`
`format`	string	No	One of: `text`, `chunks`, `structured` (default: `text`)

*Provide exactly one of `file_url` or `file_id`, not both.
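The "one source, not both" rule can be checked on the client before a request is sent. The sketch below is one possible way to do that; `build_extract_payload` and `ALLOWED_FORMATS` are hypothetical names used for illustration and are not part of the API.

```python
from typing import Optional

# Format values listed in the Parameters table above.
ALLOWED_FORMATS = {"text", "chunks", "structured"}


def build_extract_payload(
    file_id: Optional[str] = None,
    file_url: Optional[str] = None,
    format: str = "text",
) -> dict:
    """Hypothetical helper: build the JSON body for POST /extract."""
    # Exactly one of file_id / file_url must be set.
    if (file_id is None) == (file_url is None):
        raise ValueError("Provide exactly one of file_id or file_url")
    if format not in ALLOWED_FORMATS:
        raise ValueError(f"format must be one of {sorted(ALLOWED_FORMATS)}")
    if file_id is not None:
        return {"file_id": file_id, "format": format}
    return {"file_url": file_url, "format": format}
```

Calling `build_extract_payload(file_id="clx7abc123def")` produces the first request body shown above, with `format` falling back to its documented default of `text`.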

Supported File Types for Extraction

File Type	Extensions	Notes
PDF	`.pdf`	Text-layer extraction with OCR fallback for scanned documents
Word	`.docx`, `.doc`	Full text extraction from Word documents
HTML	`.html`, `.htm`	Tag stripping and text extraction
Markdown	`.md`	Markdown-to-text extraction
Plain Text	`.txt`	Pass-through (returned as-is)
Images	`.png`, `.jpg`, `.jpeg`, `.gif`, `.webp`, `.tiff`	OCR-based text extraction using Tesseract
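A client may want to reject unsupported files before submitting a job. The following is a minimal sketch that assumes the file type can be judged from the extension alone; the page does not say whether the service inspects the extension or the file contents, and `is_supported` is a hypothetical helper.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Extensions copied from the table above.
SUPPORTED_EXTENSIONS = {
    ".pdf", ".docx", ".doc", ".html", ".htm", ".md", ".txt",
    ".png", ".jpg", ".jpeg", ".gif", ".webp", ".tiff",
}


def is_supported(name_or_url: str) -> bool:
    """Hypothetical check: is the file extension in the supported list?"""
    # urlparse drops any query string or fragment; plain file names
    # pass through unchanged as the path component.
    path = urlparse(name_or_url).path
    return PurePosixPath(path).suffix.lower() in SUPPORTED_EXTENSIONS
```

The comparison is case-insensitive, so `REPORT.PDF` and `report.pdf` are treated the same.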

Format Options

`text` — Plain text extraction

Returns the full document as a single text string.

`chunks` — Chunked output for RAG

Returns an array of text chunks optimized for vector database ingestion.

`structured` — Full structured extraction

Returns sections, tables, metadata, and document structure. See Structured Extraction for the full schema.

Examples

# Extract plain text from an uploaded file
curl -X POST https://api.parsekit.dev/extract \
  -H "Authorization: Bearer YOUR_ACCESS_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"file_id": "clx7abc123def", "format": "text"}'

# Extract chunks from a URL
curl -X POST https://api.parsekit.dev/extract \
  -H "Authorization: Bearer YOUR_ACCESS_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"file_url": "https://example.com/report.pdf", "format": "chunks"}'
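For callers working in Python, the curl commands above might translate roughly as follows, using only the standard library. `build_extract_request` and `submit` are hypothetical helper names; the URL, headers, and body mirror the curl examples.

```python
import json
import urllib.request

API_BASE = "https://api.parsekit.dev"  # base URL taken from the curl examples


def build_extract_request(payload: dict, token: str) -> urllib.request.Request:
    """Hypothetical helper: build the same POST /extract call as the curl examples."""
    return urllib.request.Request(
        f"{API_BASE}/extract",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def submit(payload: dict, token: str) -> dict:
    """Send the request and return the parsed JSON response body."""
    with urllib.request.urlopen(build_extract_request(payload, token)) as resp:
        return json.load(resp)
```

Building the request separately from sending it keeps the construction step easy to inspect or log without touching the network.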

Response (202 Accepted)

{
  "job_id": "clx7job789jkl",
  "status": "queued"
}

Poll `GET /job/:id` with the returned `job_id` until `status` is `complete`, then download the result from `output_url`.
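A polling loop for this flow could look like the sketch below. Several details are assumptions rather than documented behavior: that `GET /job/:id` returns JSON containing `status` and, once complete, `output_url`; that `failed` and `error` are terminal failure statuses; and the two-second interval and five-minute timeout. `fetch_job` and `wait_for_job` are hypothetical names.

```python
import json
import time
import urllib.request
from typing import Callable

API_BASE = "https://api.parsekit.dev"  # base URL taken from the curl examples


def fetch_job(job_id: str, token: str) -> dict:
    """Hypothetical helper: GET /job/:id and parse the JSON body."""
    req = urllib.request.Request(
        f"{API_BASE}/job/{job_id}",
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def wait_for_job(
    job_id: str,
    get_job: Callable[[str], dict],
    interval: float = 2.0,   # assumed polling interval, in seconds
    timeout: float = 300.0,  # assumed overall limit, in seconds
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll until the job reports status 'complete' and return its record."""
    deadline = time.monotonic() + timeout
    while True:
        job = get_job(job_id)
        status = job.get("status")
        if status == "complete":
            return job
        # Assumption: these failure status names are not documented on this page.
        if status in ("failed", "error"):
            raise RuntimeError(f"Job {job_id} ended with status {status!r}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Job {job_id} still {status!r} after {timeout}s")
        sleep(interval)
```

A typical call would be `wait_for_job(job_id, lambda jid: fetch_job(jid, token))`, after which the caller reads `output_url` from the returned record. Passing the fetch function as a parameter keeps the loop testable without network access.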
