Extract
Extract content from documents in multiple formats.
POST /extract
Extract content from a document. Returns plain text, chunked output for RAG pipelines, or fully structured JSON with sections, tables, and metadata.
Request Body
{
"file_id": "clx7abc123def",
"format": "text"
}
or:
{
"file_url": "https://example.com/report.pdf",
"format": "structured"
}
Parameters
| Field | Type | Required | Description |
|---|---|---|---|
| file_url | string | Yes* | Public URL of the source file |
| file_id | string | Yes* | File ID from /upload |
| format | string | No | One of: text, chunks, structured (default: text) |
*Provide either file_url or file_id, not both.
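The "either file_url or file_id, not both" rule and the allowed format values can be checked on the client before a request is sent. The following is a minimal sketch, not part of any official SDK: `build_extract_payload`, its `output_format` argument, and `ALLOWED_FORMATS` are hypothetical names introduced here for illustration.

```python
import json

# Format values listed in the parameter table above.
ALLOWED_FORMATS = ("text", "chunks", "structured")


def build_extract_payload(file_id=None, file_url=None, output_format="text"):
    """Return the JSON body for POST /extract as a dict.

    Hypothetical helper: enforces "exactly one of file_id / file_url"
    locally so that a malformed request is never sent.
    """
    # True when both are missing or both are present.
    if (file_id is None) == (file_url is None):
        raise ValueError("provide exactly one of file_id or file_url")
    if output_format not in ALLOWED_FORMATS:
        raise ValueError(f"format must be one of {ALLOWED_FORMATS}")
    source = {"file_id": file_id} if file_id is not None else {"file_url": file_url}
    return {**source, "format": output_format}


if __name__ == "__main__":
    print(json.dumps(build_extract_payload(file_id="clx7abc123def")))
```

The returned dict can be serialized with `json.dumps` and used as the request body. How the server itself responds to a body that breaks the rule is not documented here, so the local check is only a convenience.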
Supported File Types for Extraction
| File Type | Extensions | Notes |
|---|---|---|
| PDF | .pdf | Text-layer extraction with OCR fallback for scanned documents |
| Word | .docx, .doc | Full text extraction from Word documents |
| HTML | .html, .htm | Tag stripping and text extraction |
| Markdown | .md | Markdown-to-text extraction |
| Plain Text | .txt | Pass-through (returned as-is) |
| Images | .png, .jpg, .jpeg, .gif, .webp, .tiff | OCR-based text extraction using Tesseract |
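A client may want to reject unsupported files before uploading or submitting them. The sketch below mirrors the extension list in the table above; `SUPPORTED_EXTENSIONS` and `file_type_for` are hypothetical names, and whether the service itself decides by extension or by inspecting file content is not stated here, so treat this purely as a local pre-check.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Extension -> file type label, copied from the table above.
SUPPORTED_EXTENSIONS = {
    ".pdf": "PDF",
    ".docx": "Word",
    ".doc": "Word",
    ".html": "HTML",
    ".htm": "HTML",
    ".md": "Markdown",
    ".txt": "Plain Text",
    ".png": "Images",
    ".jpg": "Images",
    ".jpeg": "Images",
    ".gif": "Images",
    ".webp": "Images",
    ".tiff": "Images",
}


def file_type_for(name_or_url):
    """Return the file type label for a filename or URL, or None if unsupported."""
    # urlparse drops any query string or fragment; a bare filename passes through.
    path = urlparse(name_or_url).path
    suffix = PurePosixPath(path).suffix.lower()
    return SUPPORTED_EXTENSIONS.get(suffix)


if __name__ == "__main__":
    for candidate in ("https://example.com/report.pdf", "notes.md", "archive.zip"):
        print(candidate, "->", file_type_for(candidate))
```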
Format Options
`text` — Plain text extraction
Returns the full document as a single text string.
`chunks` — Chunked output for RAG
Returns an array of text chunks optimized for vector database ingestion.
`structured` — Full structured extraction
Returns sections, tables, metadata, and document structure. See Structured Extraction for the full schema.
Examples
# Extract text from an uploaded file
curl -X POST https://api.parsekit.dev/extract \
-H "Authorization: Bearer YOUR_ACCESS_TOKEN" \
-H "Content-Type: application/json" \
-d '{"file_id": "clx7abc123def", "format": "text"}'
# Extract chunks from a URL
curl -X POST https://api.parsekit.dev/extract \
-H "Authorization: Bearer YOUR_ACCESS_TOKEN" \
-H "Content-Type: application/json" \
-d '{"file_url": "https://example.com/report.pdf", "format": "chunks"}'
Response (202 Accepted)
{
"job_id": "clx7job789jkl",
"status": "queued"
}
Poll GET /job/:id until status is complete, then download the result from output_url.
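The submit-then-poll flow can be sketched with the Python standard library. This is an illustration under stated assumptions rather than a reference client: the helper names (`request_json`, `submit_extract`, `wait_for_job`) are hypothetical, the job response is assumed to be a JSON object carrying `status` and `output_url` at the top level, and the `"failed"` status is a guess at how errors are reported.

```python
import json
import time
import urllib.request

API_BASE = "https://api.parsekit.dev"  # base URL taken from the curl examples


def request_json(method, path, token, body=None):
    """Send an authenticated request and return the decoded JSON response."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    req = urllib.request.Request(
        f"{API_BASE}{path}",
        data=data,
        method=method,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)


def submit_extract(payload, token):
    """POST /extract and return the job_id from the 202 response."""
    return request_json("POST", "/extract", token, payload)["job_id"]


def wait_for_job(fetch_job, job_id, interval=2.0, timeout=300.0, sleep=time.sleep):
    """Poll fetch_job(job_id) until status is "complete"; return output_url."""
    deadline = time.monotonic() + timeout
    while True:
        job = fetch_job(job_id)
        status = job.get("status")
        if status == "complete":
            return job["output_url"]
        if status == "failed":  # assumed failure status, not documented above
            raise RuntimeError(f"job {job_id} failed: {job}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} still {status!r} after {timeout}s")
        sleep(interval)


def extract_and_wait(payload, token):
    """Submit a job, then block until its output_url is available."""
    job_id = submit_extract(payload, token)
    return wait_for_job(lambda jid: request_json("GET", f"/job/{jid}", token), job_id)
```

Calling `extract_and_wait({"file_id": "clx7abc123def", "format": "text"}, token)` would return the `output_url` once the job completes. Passing `fetch_job` and `sleep` as arguments keeps the polling loop testable without network access; the two-second interval and five-minute timeout are arbitrary defaults, not documented limits.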