Extract document data
The Extract Document Data step uses AI to pull structured fields from a PDF or image and return them as JSON that your automation can use.
This is useful when you need to turn unstructured files, such as invoices, forms, or receipts, into machine-readable data.
You need to enable AI to use this feature.
What this step does
At runtime, this step:
- Loads a document from either a URL or an attachment binding.
- Sends the content to your configured LLM.
- Constrains the LLM output to the shape of the schema you provide.
- Returns the extracted data as a JSON array in data.
The extraction output returns at most one object per run.
Typical use cases
- Extracting invoice fields such as invoiceNumber, total, and dueDate.
- Pulling ID or application form details into a table row.
- Reading values from uploaded receipts, purchase orders, or contracts.
- Converting incoming document attachments from email triggers into structured records.
Inputs
- Document (required) - The source document to parse.
- Source (optional, default URL) - Where the document comes from.
- File Type (URL only, optional, default pdf) - File type for URL documents.
- Data schema (required) - Expected output shape.
Source options
- URL - Use a document link in the Document field.
- Attachment - Use an attachment object in the Document field.
If source and input format do not match, extraction fails.
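A pre-check can catch a source/input mismatch before the step runs. The sketch below is illustrative TypeScript, not Budibase's own validation: checkDocumentInput is a hypothetical helper, it assumes document links use http or https, and it assumes only that an attachment resolves to a non-null object, since the exact attachment shape is not described here.

```typescript
type Source = "URL" | "Attachment";

// Hypothetical helper: returns an error message, or null when the input matches the source.
function checkDocumentInput(source: Source, document: unknown): string | null {
  if (source === "URL") {
    // Assumption: document links are absolute http(s) URLs.
    if (typeof document !== "string" || !/^https?:\/\/\S+$/i.test(document)) {
      return "URL source must receive a URL string";
    }
    return null;
  }
  // Assumption: an attachment binding resolves to a plain object.
  if (typeof document !== "object" || document === null || Array.isArray(document)) {
    return "Attachment source must receive an attachment object";
  }
  return null;
}
```

A null result means the input looks consistent with the selected source; it does not guarantee that the extraction itself will succeed.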
Supported file types
- pdf
- jpg
- jpeg
- png
For URL sources, File Type determines how the document is processed.
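If document links carry a file extension, the File Type value could be derived from the link rather than hard-coded. This is a minimal sketch under that assumption; guessFileType is a hypothetical helper, and it falls back to pdf because that is the documented default.

```typescript
const SUPPORTED_FILE_TYPES = ["pdf", "jpg", "jpeg", "png"] as const;
type FileType = (typeof SUPPORTED_FILE_TYPES)[number];

// Hypothetical helper: guess the File Type from the URL, falling back to pdf.
function guessFileType(url: string): FileType {
  // Drop the query string and fragment, then take the text after the last dot.
  const path = url.split(/[?#]/)[0] ?? "";
  const ext = path.slice(path.lastIndexOf(".") + 1).toLowerCase();
  for (const fileType of SUPPORTED_FILE_TYPES) {
    if (fileType === ext) {
      return fileType;
    }
  }
  return "pdf";
}
```

Links without a recognisable extension resolve to pdf here, so set File Type explicitly when a URL does not reveal the format.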
Designing your data schema
Data schema defines the fields and expected types. Keep it simple and explicit.
Supported value hints:
- string
- number
- boolean
Example schema:
{
"invoiceNumber": "string",
"invoiceDate": "string",
"totalAmount": "number",
"isPaid": "boolean"
}
Tips:
- Use stable field names you can map directly into table columns.
- Prefer simple primitive types.
- If a value is ambiguous (for example dates), initially extract as string and normalise in a later step.
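The last tip can be sketched as a small normalisation function run in a later step. This is an illustration only: normaliseDate is a hypothetical helper, and it assumes the source documents write dates either as YYYY-MM-DD or as DD/MM/YYYY. Adjust the pattern if your documents use another convention.

```typescript
// Hypothetical helper: convert an extracted date string to YYYY-MM-DD.
// Assumption: inputs are either YYYY-MM-DD or DD/MM/YYYY.
function normaliseDate(raw: string): string | null {
  const value = raw.trim();
  if (/^\d{4}-\d{2}-\d{2}$/.test(value)) {
    return value; // already in the target format
  }
  const match = /^(\d{1,2})\/(\d{1,2})\/(\d{4})$/.exec(value);
  if (match === null) {
    return null; // unknown format: leave it for manual review
  }
  const [, day = "", month = "", year = ""] = match;
  const d = Number(day);
  const m = Number(month);
  if (d < 1 || d > 31 || m < 1 || m > 12) {
    return null; // out-of-range day or month
  }
  // Left-pad day and month to two digits.
  return year + "-" + ("0" + month).slice(-2) + "-" + ("0" + day).slice(-2);
}
```

Returning null for unrecognised formats keeps the raw string available for review instead of storing a guessed date.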
Outputs
- success (boolean) - Whether extraction succeeded.
- data (json) - Extracted structured records (array, max 1 object).
- response (string) - Error details when extraction fails.
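The outputs above can be modelled as a type, with a guard that reads the single record. This is a sketch, not Budibase's internal definition: the ExtractOutputs interface and the firstRecord helper are hypothetical, and treating data and response as optional is an assumption.

```typescript
// Shape of the step outputs as listed above (optionality is an assumption).
interface ExtractOutputs {
  success: boolean;
  data?: Array<Record<string, unknown>>; // at most one object
  response?: string; // error details when extraction fails
}

// Hypothetical helper: return the single extracted record, or throw with the error details.
function firstRecord(outputs: ExtractOutputs): Record<string, unknown> {
  if (!outputs.success) {
    throw new Error(outputs.response ?? "Extraction failed");
  }
  const record = outputs.data !== undefined ? outputs.data[0] : undefined;
  if (record === undefined) {
    // Defensive check: do not assume a successful run always carries a record.
    throw new Error("Extraction succeeded but returned no record");
  }
  return record;
}
```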
Example flow
- Trigger: Row Created on a table with an attachment column.
- Step: Extract Document Data
- Input:
  - Source: Attachment
  - Document: attachment binding from the trigger row
  - Data schema: expected fields (for example invoice fields)
- Step: Update Row to write stepsByName.ExtractStep.data.0.<field> values into columns.
- Optional: add a Condition step to branch when ExtractStep.success is false.
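The Update Row mapping in this flow amounts to copying fields from the first element of data into named columns. The sketch below shows that logic in TypeScript for illustration; the column names in FIELD_TO_COLUMN and the toRowUpdate helper are hypothetical, and in Budibase itself the mapping is configured with bindings rather than code like this.

```typescript
type ExtractedFields = Record<string, unknown>;

// Hypothetical mapping from schema field names to table column names.
const FIELD_TO_COLUMN: Record<string, string> = {
  invoiceNumber: "Invoice Number",
  totalAmount: "Total",
  isPaid: "Paid",
};

// Build the column values for an Update Row step from data.0.
function toRowUpdate(data: ExtractedFields[]): Record<string, unknown> {
  const row: Record<string, unknown> = {};
  const record = data[0];
  if (record === undefined) {
    return row; // nothing extracted, nothing to write
  }
  for (const field of Object.keys(FIELD_TO_COLUMN)) {
    const column = FIELD_TO_COLUMN[field];
    // Copy only fields that are mapped and present in the record.
    if (column !== undefined && field in record) {
      row[column] = record[field];
    }
  }
  return row;
}
```

Fields that the model did not return are skipped rather than written as empty values, so existing column data is not overwritten.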
Troubleshooting
If extraction fails, check the following:
- Missing required inputs:
  - Ensure both Document and Data schema are set.
- Source/input mismatch:
  - URL source must receive a URL string.
  - Attachment source must receive an attachment object.
- URL fetch failures:
  - Confirm the URL is reachable from your Budibase environment.
  - Confirm authentication or firewall rules are not blocking access.
- Schema/output parsing failures:
  - Simplify your schema and retry.
  - Start with a few fields, then add more incrementally.
- No data found:
  - The step can fail if the model cannot find matching values for your schema.
  - Try clearer field names and cleaner source documents.
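When adding fields incrementally, it can help to compare each extracted record against the schema to see which fields are missing or mistyped. The following is a minimal sketch of such a check; findSchemaProblems is a hypothetical helper and is not part of the step itself.

```typescript
type Hint = "string" | "number" | "boolean";

// Hypothetical helper: list fields that are missing or do not match the schema hint.
function findSchemaProblems(
  schema: Record<string, Hint>,
  record: Record<string, unknown>
): string[] {
  const problems: string[] = [];
  for (const field of Object.keys(schema)) {
    const expected = schema[field];
    const value = record[field];
    if (value === undefined || value === null) {
      problems.push(field + ": missing");
    } else if (typeof value !== expected) {
      problems.push(field + ": expected " + expected + ", got " + typeof value);
    }
  }
  return problems;
}
```

An empty result means every schema field is present with the hinted primitive type. A non-empty result points at the fields to rename, simplify, or extract as string first.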