Extract

Extract content from documents in multiple formats.

POST /extract

Extract content from a document. Returns plain text, chunked output for RAG pipelines, or fully structured JSON with sections, tables, and metadata.

Request Body

{
  "file_id": "clx7abc123def",
  "format": "text"
}

or:

{
  "file_url": "https://example.com/report.pdf",
  "format": "structured"
}

Parameters

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| file_url | string | Yes* | Public URL of the source file |
| file_id | string | Yes* | File ID from /upload |
| format | string | No | One of: text, chunks, structured (default: text) |

*Provide either file_url or file_id, not both.
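
The either-or rule above can be checked on the client before a request is sent. The following is a minimal sketch, not part of any official SDK: `build_extract_body` and `ALLOWED_FORMATS` are hypothetical names, and how the server responds to an invalid body is not documented here.

```python
# Hypothetical client-side helper; the names here are illustrative only.
ALLOWED_FORMATS = {"text", "chunks", "structured"}


def build_extract_body(file_id=None, file_url=None, format="text"):
    """Return a JSON-serializable body for POST /extract."""
    # Exactly one source must be given: file_id or file_url, not both.
    if (file_id is None) == (file_url is None):
        raise ValueError("provide exactly one of file_id or file_url")
    if format not in ALLOWED_FORMATS:
        raise ValueError(f"format must be one of {sorted(ALLOWED_FORMATS)}")
    body = {"file_id": file_id} if file_id is not None else {"file_url": file_url}
    body["format"] = format
    return body
```

Calling `build_extract_body(file_id="clx7abc123def")` yields the first request body shown above, with `format` filled in from its default.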

Supported File Types for Extraction

| File Type | Extensions | Notes |
| --- | --- | --- |
| PDF | .pdf | Text-layer extraction with OCR fallback for scanned documents |
| Word | .docx, .doc | Full text extraction from Word documents |
| HTML | .html, .htm | Tag stripping and text extraction |
| Markdown | .md | Markdown-to-text extraction |
| Plain Text | .txt | Pass-through (returned as-is) |
| Images | .png, .jpg, .jpeg, .gif, .webp, .tiff | OCR-based text extraction using Tesseract |
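
A client may want to screen files by extension before submitting them. The sketch below copies the extensions from the table above; `SUPPORTED_EXTENSIONS` and `file_type_for` are hypothetical names, and case-insensitive matching is an assumption. The API may detect file types by content rather than by name.

```python
from pathlib import PurePosixPath

# Extensions copied from the table above; labels are the table's file types.
SUPPORTED_EXTENSIONS = {
    "PDF": {".pdf"},
    "Word": {".docx", ".doc"},
    "HTML": {".html", ".htm"},
    "Markdown": {".md"},
    "Plain Text": {".txt"},
    "Images": {".png", ".jpg", ".jpeg", ".gif", ".webp", ".tiff"},
}


def file_type_for(name):
    """Return the file type label for a file name or URL path, or None."""
    # Assumption: extensions are matched case-insensitively.
    # A URL with a query string should be stripped to its path first.
    ext = PurePosixPath(name).suffix.lower()
    for label, extensions in SUPPORTED_EXTENSIONS.items():
        if ext in extensions:
            return label
    return None
```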

Format Options

`text` — Plain text extraction

Returns the full document as a single text string.

`chunks` — Chunked output for RAG

Returns an array of text chunks optimized for vector database ingestion.

`structured` — Full structured extraction

Returns sections, tables, metadata, and document structure. See Structured Extraction for the full schema.

Examples

# Extract plain text from an uploaded file
curl -X POST https://api.parsekit.dev/extract \
  -H "Authorization: Bearer YOUR_ACCESS_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"file_id": "clx7abc123def", "format": "text"}'

# Extract chunks from a URL
curl -X POST https://api.parsekit.dev/extract \
  -H "Authorization: Bearer YOUR_ACCESS_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"file_url": "https://example.com/report.pdf", "format": "chunks"}'

Response (202 Accepted)

{
  "job_id": "clx7job789jkl",
  "status": "queued"
}

Extraction runs asynchronously. Poll GET /job/:id with the returned job_id until status is complete, then download the result from output_url.
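
The polling step can be sketched as follows. This is an illustration under stated assumptions, not a reference client: `fetch_job` and `wait_for_job` are hypothetical names, the job response is assumed to be a JSON object carrying `status` and `output_url`, and the `failed` status, the polling interval, and the timeout are assumptions that do not appear above.

```python
import json
import time
import urllib.request

API_BASE = "https://api.parsekit.dev"  # base URL taken from the curl examples


def fetch_job(job_id, token):
    """GET /job/:id and return the decoded JSON body (assumed to be an object)."""
    req = urllib.request.Request(
        f"{API_BASE}/job/{job_id}",
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def wait_for_job(job_id, get_job, interval=2.0, timeout=300.0, sleep=time.sleep):
    """Poll get_job(job_id) until status is "complete"; return output_url."""
    deadline = time.monotonic() + timeout
    while True:
        job = get_job(job_id)
        status = job.get("status")
        if status == "complete":
            return job["output_url"]
        if status == "failed":  # assumed failure status; not documented above
            raise RuntimeError(f"job {job_id} failed")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} still {status!r} after {timeout}s")
        sleep(interval)
```

A call might look like `wait_for_job("clx7job789jkl", lambda job_id: fetch_job(job_id, "YOUR_ACCESS_TOKEN"))`. Passing the fetcher in as a parameter keeps the loop testable without network access.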