
How to Use Gemini 2.5 Pro’s Native PDF API: Extract, Classify, and Mine Documents at Scale

promptyze · Editor, Promptowy
05.03.2026 · 11 min read

Native PDF parsing meets structured data extraction.

Most document processing pipelines are a mess of PDFMiner, Tesseract, chunking hacks, and silent failures on page 47 of a scanned lease agreement. Then Google shipped native PDF support in the Gemini API, and the whole stack collapses into a single multimodal request. You send the file, the model reads it — text, tables, charts, handwriting, the lot — and returns structured output you can actually use.

Gemini 2.5 Pro is currently the top-tier model in Google’s lineup, available through Google AI Studio and the Gemini API. Its multimodal context window handles mixed content — including PDFs supplied either as inline base64 or via Google Cloud Storage URIs — without you needing to pre-process anything. This tutorial walks through every layer: authentication, single-document extraction, batch classification, table mining, and a few patterns that save real time in production.

What You’ll Build

By the end of this guide you’ll have a working Python pipeline that authenticates against the Gemini API, uploads PDFs using the File API, extracts structured metadata from each one, classifies documents by type, pulls tables into clean JSON, and handles errors without losing your mind. The code is copy-paste ready and tested against the official Google AI Python SDK.

Requirements

You need a Google AI Studio API key (free tier works for prototyping; for production volumes, look at Google Cloud Vertex AI). Install the SDK with pip install google-generativeai — you want version 0.7 or later, which includes the File API. PDFs can be up to 1,000 pages per file per the current File API documentation, and each page counts against your context window. Gemini 2.5 Pro’s context window is 1 million tokens, which in practice means very long documents process fine. Have your PDFs on disk or in GCS before you start.

Note 💡

The “10,000 pages in 60 seconds” framing from some marketing materials is not a verified benchmark. Actual throughput depends on document complexity, your rate limits (requests per minute vary by tier), and network latency. Plan your batch sizes around your actual API quota, not headline numbers.
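To plan batch sizes around your quota rather than headline numbers, a back-of-the-envelope helper is enough. This is a sketch with my own naming (`batch_schedule` is not part of any SDK), assuming one or more model calls per document:

```python
def batch_schedule(n_docs: int, rpm_limit: int, calls_per_doc: int = 1) -> dict:
    """Rough planning helper: total API calls a batch needs and how many
    minutes it takes at a given requests-per-minute quota."""
    total_calls = n_docs * calls_per_doc
    minutes = -(-total_calls // rpm_limit)  # ceiling division
    return {"total_calls": total_calls, "estimated_minutes": minutes}
```

For example, 100 documents with three extraction passes each at 15 RPM is 300 calls, or roughly 20 minutes of wall-clock time before network latency.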

Step 1 — Authentication and SDK Setup

Start with the basics. Set your API key as an environment variable rather than hardcoding it.

import google.generativeai as genai
import os

genai.configure(api_key=os.environ["GEMINI_API_KEY"])

# Confirm the model is reachable
model = genai.GenerativeModel("gemini-2.5-pro")
print(model.model_name)

Use gemini-2.5-pro as the model string to hit the current stable 2.5 Pro release. If you need reproducibility for an audit trail, pin to a dated preview version string instead, since the stable alias can move as Google ships updates.

Step 2 — Upload a PDF with the File API

The File API is how you get PDFs into the model without stuffing raw base64 into every request. Files are stored on Google’s side for 48 hours and referenced by URI, which matters for batching: upload once, query many times.

import google.generativeai as genai
import os

genai.configure(api_key=os.environ["GEMINI_API_KEY"])

def upload_pdf(file_path: str):
    """Upload a PDF to the Gemini File API and return the file object."""
    # upload_file reads the file itself; no need to open it manually
    uploaded = genai.upload_file(
        path=file_path,
        mime_type="application/pdf",
        display_name=os.path.basename(file_path)
    )
    print(f"Uploaded: {uploaded.display_name} -> {uploaded.uri}")
    return uploaded

file_obj = upload_pdf("contract_2025.pdf")

The returned uploaded object has a .uri you pass directly into model calls. No parsing, no page splitting. If the upload fails (network blip, file too large), the exception tells you exactly what went wrong — wrap it in a retry loop for production.

Pro tip ✅

Upload your entire PDF batch first, collect all file URIs into a list, then fire the extraction calls. This decouples upload latency from inference latency and makes it easier to retry individual failures without re-uploading.
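The retry loop mentioned above can be as small as the sketch below. `with_retry` and `backoff_delays` are my own helper names, not SDK API; the usage line assumes `genai` is configured as in Step 1:

```python
import time

def backoff_delays(max_attempts: int, base_delay: float = 1.0) -> list:
    """Exponential backoff schedule: base, 2*base, 4*base, ..."""
    return [base_delay * 2 ** i for i in range(max_attempts - 1)]

def with_retry(fn, max_attempts: int = 3, base_delay: float = 1.0):
    """Call fn(), retrying on any exception with exponential backoff."""
    for delay in backoff_delays(max_attempts, base_delay) + [None]:
        try:
            return fn()
        except Exception:
            if delay is None:  # last attempt also failed, give up
                raise
            time.sleep(delay)

# Usage:
# file_obj = with_retry(lambda: genai.upload_file(
#     path="contract_2025.pdf", mime_type="application/pdf"))
```

Keeping the retry policy in one wrapper means the same logic covers uploads and generation calls alike.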

Step 3 — Extract Metadata from a Single Document

Here’s where the magic is. A single prompt handles what used to require a five-library pipeline.

import google.generativeai as genai
import json
import os

genai.configure(api_key=os.environ["GEMINI_API_KEY"])
model = genai.GenerativeModel("gemini-2.5-pro")

def extract_metadata(file_obj) -> dict:
    prompt = """
    Analyze this PDF document and return a JSON object with the following fields:
    - document_type: one of [contract, invoice, report, form, correspondence, other]
    - title: document title or best guess from content
    - date: document date in ISO 8601 format, or null if not found
    - parties: list of organizations or individuals named as primary parties
    - summary: two-sentence summary of the document's purpose
    - page_count: total number of pages
    - language: ISO 639-1 language code
    Return only valid JSON, no markdown fences.
    """
    response = model.generate_content([prompt, file_obj])
    return json.loads(response.text)

file_obj = genai.get_file("files/your-file-id")  # or pass directly from upload
metadata = extract_metadata(file_obj)
print(json.dumps(metadata, indent=2))

The prompt instructs the model to return pure JSON: no backticks, no explanation. Wrap json.loads() in try/except json.JSONDecodeError and, on failure, re-prompt the model to repair its own output. Parse failures are rare with 2.5 Pro, but they happen.
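The most common failure mode is the model wrapping otherwise valid JSON in markdown fences. A defensive parser (my own helper, not SDK API) handles that before you resort to a repair re-prompt:

```python
import json
import re

def parse_model_json(text: str):
    """Parse model output as JSON, stripping leading/trailing markdown
    code fences if the model added them despite instructions."""
    cleaned = text.strip()
    cleaned = re.sub(r"^```(?:json)?\s*|\s*```$", "", cleaned)
    return json.loads(cleaned)
```

The SDK can also be asked for JSON directly via a generation config with `response_mime_type` set to `application/json`, which reduces (though does not eliminate) the need for this kind of cleanup.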

Single document, multiple structured outputs.

Step 4 — Pull Tables into Structured JSON

Table extraction is where native PDF support really earns its keep. Scanned tables that would break any regex-based extractor come through cleanly because the model is reading layout, not just character streams.

def extract_tables(file_obj) -> list:
    prompt = """
    Find every table in this document. For each table return a JSON object with:
    - table_index: integer starting from 1
    - page_number: page where the table appears
    - headers: list of column header strings
    - rows: list of lists, each inner list representing one data row
    - caption: table caption or title if present, otherwise null
    Return a JSON array of these objects. No markdown, no explanation.
    """
    response = model.generate_content([prompt, file_obj])
    return json.loads(response.text)

For financial documents with complex merged cells, add this line to the prompt: "If a cell spans multiple columns, repeat its value in each logical column position." That keeps your row arrays uniform and prevents downstream pandas explosions.
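Even with that instruction, ragged rows slip through occasionally. A small normalizer (my own naming, not part of any library) pads them before they reach pandas:

```python
def table_to_records(table: dict) -> list:
    """Turn one extracted table ({"headers": [...], "rows": [[...], ...]})
    into a list of dicts, padding short rows with None and truncating long
    ones so every record carries exactly the header columns."""
    headers = table["headers"]
    records = []
    for row in table["rows"]:
        padded = list(row)[:len(headers)] + [None] * (len(headers) - len(row))
        records.append(dict(zip(headers, padded)))
    return records
```

The output feeds straight into `pandas.DataFrame(records)` without column-count surprises.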

def extract_specific_table(file_obj, description: str) -> dict:
    prompt = f"""
    Find the table in this document that contains {description}.
    Return it as a JSON object with keys: headers (list), rows (list of lists).
    If no matching table exists, return null.
    No markdown fences.
    """
    response = model.generate_content([prompt, file_obj])
    result = response.text.strip()
    if result.lower() == "null":
        return None
    return json.loads(result)

# Example usage
revenue_table = extract_specific_table(file_obj, "annual revenue figures by quarter")

Pro tip ✅

When you only need one specific table out of a 200-page report, describe it precisely in natural language. “The table containing employee headcount by department” is faster and cheaper than extracting every table and filtering in code.

Step 5 — Build a Batch Document Classifier

This is the production-ready piece. Classify a folder of PDFs concurrently, staying inside your rate limits with a simple semaphore.

import asyncio
import google.generativeai as genai
from pathlib import Path
import json
import os

genai.configure(api_key=os.environ["GEMINI_API_KEY"])
model = genai.GenerativeModel("gemini-2.5-pro")

CONCURRENCY_LIMIT = 5  # Adjust to your API tier's RPM

async def classify_document(semaphore, file_path: Path) -> dict:
    async with semaphore:
        # upload_file and delete_file are blocking; run them off the event
        # loop so the semaphore actually buys you concurrency
        file_obj = await asyncio.to_thread(
            genai.upload_file,
            path=str(file_path),
            mime_type="application/pdf",
            display_name=file_path.name
        )
        prompt = """
        Classify this document. Return JSON with:
        - file_name: the document's display name
        - category: [contract, invoice, legal_filing, financial_report, hr_document, technical_spec, correspondence, other]
        - confidence: float 0.0 to 1.0
        - key_entities: up to 5 named entities (people, orgs, dates) critical to the document
        - action_required: boolean, true if the document requires a human decision or signature
        Only JSON, no markdown.
        """
        response = await model.generate_content_async([prompt, file_obj])
        result = json.loads(response.text)
        result["local_path"] = str(file_path)
        await asyncio.to_thread(genai.delete_file, file_obj.name)  # Clean up immediately
        return result

async def batch_classify(folder: str) -> list:
    pdf_files = list(Path(folder).glob("*.pdf"))
    semaphore = asyncio.Semaphore(CONCURRENCY_LIMIT)
    tasks = [classify_document(semaphore, p) for p in pdf_files]
    results = await asyncio.gather(*tasks, return_exceptions=True)
    clean = [r for r in results if not isinstance(r, Exception)]
    errors = [r for r in results if isinstance(r, Exception)]
    print(f"Classified: {len(clean)} | Errors: {len(errors)}")
    return clean

# Run it
results = asyncio.run(batch_classify("./documents"))
with open("classification_results.json", "w") as f:
    json.dump(results, f, indent=2)

The CONCURRENCY_LIMIT = 5 is conservative. On a paid tier you can push higher, but start there and watch your quota dashboard. The return_exceptions=True flag means one bad PDF doesn’t kill the whole batch — you get the error in the list and process the rest.

Batch classification pipeline in motion.

Warning ⚠️

Gemini’s File API stores uploaded files for 48 hours and counts them against your storage quota. Call genai.delete_file(file_obj.name) after each extraction if you’re running high volumes. The code above does this automatically, but if you’re adapting snippets, don’t forget it.

Step 6 — Chain Extraction Calls on the Same File

Since uploaded files persist for 48 hours by URI, you can run multiple extraction passes without re-uploading. Upload once, extract metadata, extract tables, run a compliance check — all from the same file object.

def full_document_pipeline(file_path: str) -> dict:
    # Upload once
    file_obj = genai.upload_file(
        path=file_path,
        mime_type="application/pdf",
        display_name=os.path.basename(file_path)
    )
    
    results = {}
    
    # Pass 1: Metadata
    meta_prompt = """
    Return JSON: {document_type, title, date (ISO 8601 or null), 
    parties (list), page_count, language}. No markdown.
    """
    r1 = model.generate_content([meta_prompt, file_obj])
    results["metadata"] = json.loads(r1.text)
    
    # Pass 2: Tables
    table_prompt = """
    Return a JSON array of all tables. Each item: 
    {table_index, page_number, headers, rows}. 
    Empty array if no tables. No markdown.
    """
    r2 = model.generate_content([table_prompt, file_obj])
    results["tables"] = json.loads(r2.text)
    
    # Pass 3: Risk flags (useful for contracts/legal docs)
    risk_prompt = """
    Identify clauses or statements that represent legal or financial risk.
    Return JSON array: [{risk_type, description, page_number, severity: low|medium|high}].
    Empty array if none. No markdown.
    """
    r3 = model.generate_content([risk_prompt, file_obj])
    results["risks"] = json.loads(r3.text)
    
    # Clean up
    genai.delete_file(file_obj.name)
    
    return results

Pro tip ✅

Split your extraction into focused single-purpose prompts rather than one giant prompt trying to do everything at once. You get cleaner JSON, easier error handling, and you can re-run just the failing pass without repeating the others.
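One way to organize those single-purpose passes is a prompt registry plus a runner that keeps successes and failures separate, so only the failed passes get re-run. The structure and names below are my own, not a library API, and the prompts are abbreviated stand-ins for the full ones shown earlier:

```python
import json

PASS_PROMPTS = {
    "metadata": "Return JSON: {document_type, title, date}. No markdown.",
    "tables": "Return a JSON array of all tables. No markdown.",
}

def run_passes(model, file_obj, passes):
    """Run each named pass against one uploaded file; collect parsed
    results and failures separately so failed passes can be retried alone."""
    results, failures = {}, {}
    for name in passes:
        try:
            response = model.generate_content([PASS_PROMPTS[name], file_obj])
            results[name] = json.loads(response.text)
        except Exception as exc:
            failures[name] = str(exc)
    return results, failures
```

A second call with `passes=list(failures)` retries exactly the passes that broke, against the same 48-hour file URI.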

Step 7 — Prompt Patterns That Actually Work

The difference between useful output and a JSON parsing headache is almost always in how you phrase the instruction. These patterns are tested and reliable with Gemini 2.5 Pro.

For strict JSON output every time:

Extract all invoice line items from this document.
Return ONLY a valid JSON array. Each element must have:
- description: string
- quantity: number
- unit_price: number
- total: number
- tax_rate: number or null
Do not include any text before or after the JSON array.
Do not use markdown code fences.

For handling documents where a field might not exist:

From this contract, extract the termination clause verbatim.
If no termination clause exists, return the JSON object: {"found": false, "text": null}
If it exists, return: {"found": true, "text": "<clause text>", "page": <page number>}
Only JSON. No other text.

For comparing two documents — pass both file objects to the same call:

Compare these two contracts. Return JSON:
{
  "matching_clauses": [list of clause types present in both],
  "document_1_only": [clause types only in first document],
  "document_2_only": [clause types only in second document],
  "key_differences": [concise description of material differences]
}
No markdown.
Two-document comparison via multimodal context.

Avoid 🚫

Don’t ask Gemini to return JSON and then embed instructions like “explain your reasoning” in the same prompt. The model will mix prose and JSON and your parser will fail. Keep reasoning prompts separate from structured output prompts.

Real-World Use Cases Worth Your Time

Three scenarios where this pipeline pays for itself fast. First: contract lifecycle management. A legal team with thousands of executed contracts uses the classifier to tag document types, the metadata extractor to pull party names and dates into a database, and the risk pass to flag non-standard clauses for attorney review. The model catches things keyword search misses because it understands context, not just strings.

Second: invoice processing. Finance teams running AP automation upload invoice batches nightly. The table extractor pulls line items, the metadata pass grabs vendor names and totals, and the output feeds directly into ERP systems. The edge-case win here is scanned invoices from vendors who haven’t heard of PDFs — Gemini handles the OCR implicitly.

Third: research document intake. Academic or consulting teams ingesting large volumes of reports use the pipeline to generate structured summaries, tag topics, extract cited statistics, and identify documents needing deeper human review. What used to take a team of research assistants a week runs overnight.

Pro tip ✅

For high-confidence production pipelines, always include a validation pass: after extraction, ask the model to verify its own output against specific fields. Something like “Review this JSON and confirm all dates are valid ISO 8601 and all numeric fields are numbers, not strings. Return the corrected JSON.” Adds one API call, saves hours of downstream debugging.
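Some of that validation is cheap to do locally before spending the extra API call. The helper below is a sketch with my own naming; it checks the two failure modes the re-prompt above targets, ISO 8601 dates and stringly-typed numbers:

```python
from datetime import date

def validate_extraction(record: dict, date_fields=(), numeric_fields=()) -> list:
    """Return a list of problems in one extracted record: dates that are
    not valid ISO 8601, numeric fields that came back as strings.
    Null values pass, matching the 'or null' prompt convention."""
    problems = []
    for field in date_fields:
        value = record.get(field)
        if value is not None:
            try:
                date.fromisoformat(value)
            except (TypeError, ValueError):
                problems.append(f"{field}: not ISO 8601 ({value!r})")
    for field in numeric_fields:
        value = record.get(field)
        if value is not None and not isinstance(value, (int, float)):
            problems.append(f"{field}: expected number, got {type(value).__name__}")
    return problems
```

Route records with an empty problem list straight through, and send only the rest to the model's self-correction pass.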

Where to Go From Here

The patterns in this guide scale. Upload PDFs in parallel, chain passes on the same file object, feed outputs into Postgres or BigQuery, and you have a document intelligence pipeline that would have taken a small team months to build with traditional tooling. The Gemini API’s native PDF support removes the fragile pre-processing layer that breaks on every unusual document format — and in document-heavy workflows, unusual formats are the majority.

The next logical step is combining this with NotebookLM for interactive document Q&A, or routing classified documents into different Gemini prompts based on type — contracts go to the risk extraction chain, invoices go to the line-item chain. The File API URI makes that routing trivial: classify first, then decide what extraction logic fires next. That’s a proper document intelligence stack, built in an afternoon.

promptyze
Founder · Editor · Promptowy

I've been writing about AI and automation for 3 years. I run promptowy.com.