How to Extract Structured JSON from Unstructured Text Using Gemini 2.5 Pro

Turn messy emails, support tickets, and survey responses into clean JSON using Gemini 2.5 Pro’s structured output — with Python and Node.js code you can copy right now.

Unstructured text is the cockroach of enterprise data — it’s everywhere, it’s messy, and nobody wants to deal with it manually. Customer emails arrive in 47 different formats. Support tickets bury the actual problem in three paragraphs of apology. Survey responses are basically stream-of-consciousness therapy sessions. If you’ve been copy-pasting this stuff into spreadsheets by hand, stop. Gemini 2.5 Pro’s structured output support — using the response_schema parameter in the Gemini API — will do it for you, reliably, at scale.

The feature isn’t marketed under some snappy name. Google calls it structured output, and it works by letting you pass a JSON schema directly to the model. The model’s output is then constrained to that schema: no invented keys, no free text where you wanted an integer, and every field you mark as required is present. It’s available in both Python and Node.js via the official Google Generative AI SDKs, and it works with Gemini 2.5 Pro’s one-million-token context window, which means you can feed it an entire inbox thread and get structured data back in one shot.

This tutorial walks through exactly how to set it up, with concrete schemas and prompts for three real-world use cases: customer emails, support tickets, and survey responses.

What You’ll Actually Build

By the end of this, you’ll have working Python and Node.js code that takes raw, unstructured text and outputs clean JSON conforming to a schema you define. You’ll be able to extract sentiment, urgency, named entities, action items, and categorical labels — all in a single API call, with type validation baked in. This is the kind of pipeline that normally requires a custom NLP stack. With Gemini 2.5 Pro and a well-written schema, you can prototype it in under an hour.

Requirements

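You need a Google AI Studio API key (free tier works for testing, paid for production volume). Install the SDK with pip install google-generativeai for Python or npm install @google/generative-ai for Node.js. These are the packages the code in this tutorial imports; Google also publishes a newer unified SDK (google-genai for Python, @google/genai for Node.js) with a different client API, so don’t mix the two. You’ll also want Python 3.9+ or Node.js 18+. That’s genuinely it: no extra dependencies, no vector databases, no orchestration framework required for what we’re covering here.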

Note 💡

Get your API key from Google AI Studio at aistudio.google.com. The free tier gives you generous rate limits for development, but for production pipelines processing thousands of documents, you’ll want a Google Cloud project with billing enabled.

Understanding the response_schema Parameter

The core mechanic is simple: instead of letting Gemini return whatever it wants, you hand it a JSON Schema definition and tell it to fill in the blanks. The model reads your input text, reasons about it, and outputs a JSON object that matches your schema exactly. If a field is a string, it’s a string. If it’s an enum with three allowed values, the model picks one of those three — it doesn’t invent a fourth.

You configure this through the generation_config parameter when calling the model. You set response_mime_type to "application/json" and pass your schema object to response_schema. The schema looks like JSON Schema but is a subset of the OpenAPI 3.0 schema object, which is why fields such as nullable work while some JSON Schema keywords are not supported. Here’s the minimal setup in Python:

import google.generativeai as genai
import json

genai.configure(api_key="YOUR_API_KEY")

# Schema the response must follow; all four fields are required.
schema = {
    "type": "object",
    "properties": {
        "sentiment": {
            "type": "string",
            "enum": ["positive", "neutral", "negative"]
        },
        "urgency": {
            "type": "string",
            "enum": ["low", "medium", "high", "critical"]
        },
        "summary": {"type": "string"},
        "action_required": {"type": "boolean"}
    },
    "required": ["sentiment", "urgency", "summary", "action_required"]
}

# Ask for JSON output and attach the schema to every call made with this model.
model = genai.GenerativeModel(
    model_name="gemini-2.5-pro",
    generation_config={
        "response_mime_type": "application/json",
        "response_schema": schema
    }
)

# Placeholder input; replace with the raw email you want to analyze.
email_text = "I was charged twice for my order and nobody has replied to my emails."

response = model.generate_content("Analyze this customer email: " + email_text)
parsed = json.loads(response.text)  # response.text holds the JSON string
print(parsed)

The required array is important: fields listed there are expected in every response. Fields you leave out of it are optional, and the model may omit them when the text doesn’t supply a value. Get used to being explicit about what’s required versus optional in your schema design.
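
To make the required-versus-optional distinction concrete, here is a minimal sketch that checks a parsed response against a schema dict using only the standard library. The helper name check_against_schema is hypothetical, and it only looks at required, enum, and undeclared keys; it is an illustration, not a full validator.

```python
import json

def check_against_schema(record: dict, schema: dict) -> list:
    """Return a list of problems; an empty list means the record passed."""
    problems = []
    properties = schema.get("properties", {})
    # Every key named in "required" must be present.
    for key in schema.get("required", []):
        if key not in record:
            problems.append(f"missing required field: {key}")
    # Present fields must be declared, and enum fields must use an allowed value.
    for key, value in record.items():
        if key not in properties:
            problems.append(f"unexpected field: {key}")
            continue
        allowed = properties[key].get("enum")
        if allowed is not None and value not in allowed:
            problems.append(f"{key}={value!r} not in {allowed}")
    return problems

schema = {
    "type": "object",
    "properties": {
        "sentiment": {"type": "string", "enum": ["positive", "neutral", "negative"]},
        "summary": {"type": "string"},
        "order_id": {"type": "string"},  # optional: not listed in "required"
    },
    "required": ["sentiment", "summary"],
}

# Stand-in for response.text; a real API call would supply this string.
raw = '{"sentiment": "negative", "summary": "Charged twice for one order."}'
record = json.loads(raw)
print(check_against_schema(record, schema))  # order_id is absent but optional
```

A check like this is cheap to run on every response and gives you a clear error message when a schema change and a prompt change drift apart.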

Use Case 1: Parsing Customer Emails

Customer email parsing is where this feature earns its keep immediately. The schema needs to capture the practical stuff: what the customer wants, how upset they are, what product they’re talking about, and whether a human needs to look at this before it hits the auto-responder queue.

Analyze the following customer email and extract structured information.

Email:
"Hi, I've been a customer for 6 years and I'm genuinely furious. I ordered the Pro subscription on November 3rd, was charged twice ($149 each time), and despite three support chats, nobody has fixed this. I need a refund for the duplicate charge AND a response from someone senior, not a bot. If this isn't resolved by Friday, I'm canceling and disputing both charges with my bank."

Extract:
- customer_sentiment (positive/neutral/negative/furious)
- primary_issue (billing/technical/shipping/account/other)
- monetary_amount_mentioned (number or null)
- deadline_mentioned (string or null)
- escalation_required (boolean)
- key_action_items (array of strings)
- churn_risk (low/medium/high/critical)

Pair that prompt with this schema:

{
  "type": "object",
  "properties": {
    "customer_sentiment": {
      "type": "string",
      "enum": ["positive", "neutral", "negative", "furious"]
    },
    "primary_issue": {
      "type": "string",
      "enum": ["billing", "technical", "shipping", "account", "other"]
    },
    "monetary_amount_mentioned": {"type": "number", "nullable": true},
    "deadline_mentioned": {"type": "string", "nullable": true},
    "escalation_required": {"type": "boolean"},
    "key_action_items": {
      "type": "array",
      "items": {"type": "string"}
    },
    "churn_risk": {
      "type": "string",
      "enum": ["low", "medium", "high", "critical"]
    }
  },
  "required": ["customer_sentiment", "primary_issue", "escalation_required", "key_action_items", "churn_risk"]
}

Pro tip ✅

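Add a confidence_score field (number between 0 and 1) to your schema. Gemini will self-report how confident it is in its extraction. The number is the model’s own estimate rather than a calibrated probability, so check it against a hand-labeled sample before you trust a cutoff. As a starting point, anything below 0.7 can be flagged for human review automatically, which gives you a built-in quality gate without extra model calls.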
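
The quality gate from the tip above could look roughly like this. It is a sketch under assumptions: route_by_confidence is a hypothetical helper, the records are made-up parsed outputs, and the 0.7 cutoff is simply the number from the tip.

```python
REVIEW_THRESHOLD = 0.7  # cutoff from the tip above; tune it on your own data

def route_by_confidence(records, threshold=REVIEW_THRESHOLD):
    """Split parsed records into auto-accepted and human-review lists."""
    accepted, needs_review = [], []
    for record in records:
        score = record.get("confidence_score")
        is_number = isinstance(score, (int, float)) and not isinstance(score, bool)
        # A missing or non-numeric score is treated as low confidence.
        if is_number and score >= threshold:
            accepted.append(record)
        else:
            needs_review.append(record)
    return accepted, needs_review

# Hypothetical parsed outputs, trimmed to the fields used here.
parsed_records = [
    {"primary_issue": "billing", "confidence_score": 0.93},
    {"primary_issue": "other", "confidence_score": 0.41},
    {"primary_issue": "shipping"},  # score missing
]
accepted, needs_review = route_by_confidence(parsed_records)
print(len(accepted), len(needs_review))
```

Sending records with a missing score to review is the conservative choice; it means a schema mistake shows up as extra review work rather than as unreviewed data.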

Use Case 2: Support Ticket Triage

Support tickets need different fields than customer emails. You care about reproducibility, affected components, and whether the user already tried the obvious fixes — so you don’t have tier-one support asking someone to restart their browser when they already said they restarted their browser three times.

You are a support ticket parser. Extract structured triage data from the following support ticket.

Ticket:
"Title: App crashes on export
Description: Every time I try to export a report to PDF using the Chrome extension (v2.4.1), the app freezes for about 30 seconds then crashes. I've tried on two different computers (both Windows 11), cleared cache, disabled other extensions, and reinstalled the extension twice. Started happening after your update on Feb 20th. This is blocking our entire team from closing month-end reports."

Extract all fields as specified in the schema.

Pair that prompt with this schema:

{
  "type": "object",
  "properties": {
    "issue_category": {
      "type": "string",
      "enum": ["crash", "performance", "data_loss", "ui_bug", "integration", "permissions", "other"]
    },
    "affected_component": {"type": "string"},
    "affected_platform": {"type": "string"},
    "version_mentioned": {"type": "string", "nullable": true},
    "steps_already_tried": {
      "type": "array",
      "items": {"type": "string"}
    },
    "business_impact": {
      "type": "string",
      "enum": ["individual", "team", "department", "company-wide"]
    },
    "regression_suspected": {"type": "boolean"},
    "regression_date": {"type": "string", "nullable": true},
    "priority": {
      "type": "string",
      "enum": ["P1", "P2", "P3", "P4"]
    },
    "suggested_assignee_team": {
      "type": "string",
      "enum": ["frontend", "backend", "integrations", "infrastructure", "unknown"]
    }
  },
  "required": ["issue_category", "affected_component", "steps_already_tried", "business_impact", "regression_suspected", "priority", "suggested_assignee_team"]
}

Warning ⚠️

Don’t make your enums too narrow. If you define affected_platform as an enum with only five OS options and a ticket mentions a Chromebook, the model has to pick the closest match or hallucinate — neither is great. Use open strings for fields where variety is genuinely unbounded, and enums only where you want to force categorization.

Use Case 3: Survey Response Analysis

Survey responses are the chaotic neutral of unstructured data. People answer “What do you like about our product?” with a complaint about shipping. They rate satisfaction 2/10 and then say “everything is great.” The schema below copes with that kind of contradiction by giving positive themes, negative themes, and feature requests their own fields, so mixed feedback doesn’t get flattened into a single label.

Parse this open-ended survey response and extract structured insights.

Survey question: "How has our product impacted your workflow? What would you change?"

Response:
"Honestly the dashboard saves me like an hour a day which is huge. But the mobile app is basically unusable — I've given up on it. Also your pricing went up 30% and I got zero notice, that was pretty frustrating. The integrations with Slack and Jira are the best thing you've added in years. Would kill for a dark mode though."

Extract structured feedback data according to the schema.

Pair that prompt with this schema:

{
  "type": "object",
  "properties": {
    "overall_sentiment": {
      "type": "string",
      "enum": ["very_positive", "positive", "mixed", "negative", "very_negative"]
    },
    "positive_themes": {
      "type": "array",
      "items": {"type": "string"}
    },
    "negative_themes": {
      "type": "array",
      "items": {"type": "string"}
    },
    "feature_requests": {
      "type": "array",
      "items": {"type": "string"}
    },
    "products_mentioned": {
      "type": "array",
      "items": {"type": "string"}
    },
    "pricing_sentiment": {
      "type": "string",
      "enum": ["positive", "neutral", "negative", "not_mentioned"]
    },
    "quantified_value_mentioned": {"type": "boolean"},
    "quantified_value_detail": {"type": "string", "nullable": true},
    "churn_signal": {"type": "boolean"}
  },
  "required": ["overall_sentiment", "positive_themes", "negative_themes", "feature_requests", "pricing_sentiment", "quantified_value_mentioned", "churn_signal"]
}
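
Once many responses have been parsed into this shape, rolling them up is ordinary Python. The sketch below is illustrative only: summarize_surveys is a hypothetical helper, and the sample records are made up and trimmed to the fields it reads. Free-text items are only lowercased and stripped, so near-duplicates such as "dark mode" and "dark theme" will still count separately.

```python
from collections import Counter

def summarize_surveys(records):
    """Roll parsed survey records up into counts for reporting."""
    sentiments = Counter(r["overall_sentiment"] for r in records)
    requests = Counter(
        item.strip().lower()
        for r in records
        for item in r.get("feature_requests", [])
    )
    churn_flags = sum(1 for r in records if r["churn_signal"])
    return {
        "sentiments": dict(sentiments),
        "top_feature_requests": requests.most_common(3),
        "churn_flags": churn_flags,
    }

# Hypothetical parsed outputs, shaped like the schema above.
parsed_surveys = [
    {"overall_sentiment": "mixed", "feature_requests": ["Dark mode"], "churn_signal": False},
    {"overall_sentiment": "negative", "feature_requests": ["dark mode", "Offline export"], "churn_signal": True},
    {"overall_sentiment": "mixed", "feature_requests": [], "churn_signal": False},
]
report = summarize_surveys(parsed_surveys)
print(report)
```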

Node.js Implementation

The Python and Node.js SDKs are structurally similar, but if your backend runs on Node, here’s the equivalent setup using a trimmed-down version of the support ticket schema:

import { GoogleGenerativeAI, SchemaType } from "@google/generative-ai";

const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY);

const schema = {
  type: SchemaType.OBJECT,
  properties: {
    issue_category: {
      type: SchemaType.STRING,
      enum: ["crash", "performance", "data_loss", "ui_bug", "integration", "permissions", "other"]
    },
    affected_component: { type: SchemaType.STRING },
    steps_already_tried: {
      type: SchemaType.ARRAY,
      items: { type: SchemaType.STRING }
    },
    priority: {
      type: SchemaType.STRING,
      enum: ["P1", "P2", "P3", "P4"]
    },
    regression_suspected: { type: SchemaType.BOOLEAN }
  },
  required: ["issue_category", "affected_component", "steps_already_tried", "priority", "regression_suspected"]
};

const model = genAI.getGenerativeModel({
  model: "gemini-2.5-pro",
  generationConfig: {
    responseMimeType: "application/json",
    responseSchema: schema
  }
});

async function parseTicket(ticketText) {
  const prompt = `Parse this support ticket and extract triage data:\n\n${ticketText}`;
  const result = await model.generateContent(prompt);
  return JSON.parse(result.response.text());
}

parseTicket(yourTicketText).then(console.log);

Pro tip ✅

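Wrap your json.loads() or JSON.parse() call in a try/except (Python) or try/catch (Node.js). Gemini’s structured output is reliable, but a response that hits the output token limit can be cut off mid-object, and an empty or blocked response may not contain parseable text at all. Network errors surface as exceptions from the API call itself, so catch those too. Routing parse failures to a fallback (or a retry with a simpler schema) keeps your pipeline from silently dropping records.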
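
On the Python side, the guarded parse can be as small as the sketch below. safe_parse is a hypothetical helper name, and the two strings stand in for a complete and a truncated model response.

```python
import json

def safe_parse(raw_text, fallback=None):
    """Parse model output as JSON; return `fallback` instead of raising."""
    try:
        return json.loads(raw_text)
    except (json.JSONDecodeError, TypeError):
        # Truncated, empty, or missing output ends up here.
        return fallback

complete = '{"priority": "P1", "regression_suspected": true}'
truncated = '{"priority": "P1", "regression_susp'  # cut off mid-object

print(safe_parse(complete))
print(safe_parse(truncated))
```

Returning a sentinel such as None lets the caller decide whether to retry, queue the record for review, or log and move on.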

Batch Processing and Context Window Strategy

Here’s where Gemini 2.5 Pro’s one-million-token context window becomes genuinely useful rather than just a spec sheet flex. You can send multiple documents in a single prompt and ask for an array of parsed objects back. A batch of 20 emails becomes one API call instead of 20.

You will receive multiple customer emails separated by "---EMAIL BOUNDARY---". 
Parse each email and return a JSON array where each element corresponds to one email, in order.

---EMAIL BOUNDARY---
[Email 1 text here]
---EMAIL BOUNDARY---
[Email 2 text here]
---EMAIL BOUNDARY---
[Email 3 text here]
---EMAIL BOUNDARY---

Return a JSON array of objects. Each object must conform to the schema.

Your schema for batch mode wraps the single-document schema in an array type:

{
  "type": "array",
  "items": {
    "type": "object",
    "properties": {
      "customer_sentiment": {"type": "string", "enum": ["positive", "neutral", "negative", "furious"]},
      "primary_issue": {"type": "string"},
      "escalation_required": {"type": "boolean"},
      "churn_risk": {"type": "string", "enum": ["low", "medium", "high", "critical"]}
    },
    "required": ["customer_sentiment", "primary_issue", "escalation_required", "churn_risk"]
  }
}
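
Assembling the batch prompt and the array wrapper by hand gets tedious, so here is a small sketch that builds both. The function names build_batch_prompt and wrap_in_array are hypothetical; the boundary marker and instruction text are copied from the prompt above. It assumes no email contains the boundary string itself.

```python
BOUNDARY = "---EMAIL BOUNDARY---"

def build_batch_prompt(emails):
    """Join emails into one prompt, delimited by the boundary marker."""
    parts = [
        f'You will receive multiple customer emails separated by "{BOUNDARY}".',
        "Parse each email and return a JSON array where each element "
        "corresponds to one email, in order.",
        "",
    ]
    for email in emails:
        parts.append(BOUNDARY)
        parts.append(email.strip())
    parts.append(BOUNDARY)  # closing marker after the last email
    parts.append("")
    parts.append("Return a JSON array of objects. Each object must conform to the schema.")
    return "\n".join(parts)

def wrap_in_array(item_schema):
    """Turn a single-document schema into its batch (array) form."""
    return {"type": "array", "items": item_schema}

single_schema = {
    "type": "object",
    "properties": {"escalation_required": {"type": "boolean"}},
    "required": ["escalation_required"],
}
prompt = build_batch_prompt(["First email text", "  Second email text  "])
batch_schema = wrap_in_array(single_schema)
print(prompt)
```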

Pro tip ✅

For batch processing, keep individual documents under 10,000 tokens each and batch up to 20-30 at a time. Larger batches can cause the model to mix up which extracted data belongs to which document — especially if the documents are similar in content. If accuracy matters more than throughput, process in smaller batches and verify the output count matches your input count.
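
The batching and count check from the tip above can be sketched in a few lines. chunked and pair_with_inputs are hypothetical helper names, and the batch size of 20 is just the low end of the range suggested in the tip.

```python
def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]

def pair_with_inputs(batch, parsed):
    """Pair each input document with its parsed object, or fail loudly."""
    if not isinstance(parsed, list) or len(parsed) != len(batch):
        got = len(parsed) if isinstance(parsed, list) else "a non-list"
        raise ValueError(f"expected {len(batch)} results, got {got}")
    return list(zip(batch, parsed))

emails = [f"email {i}" for i in range(45)]
batches = list(chunked(emails, 20))
print([len(b) for b in batches])
```

A count match does not prove the objects are in the right order, but a mismatch is a reliable sign that the batch should be retried at a smaller size.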

Avoid 🚫

Don’t use nested schemas more than three levels deep. Deeply nested JSON Schema definitions confuse the model and increase the rate of schema violations. If you need complex nested data, flatten it where possible or extract in two passes — one for top-level categorization, one for detail extraction.
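
One way to keep an eye on nesting is to measure it before sending the schema. The sketch below uses a hypothetical schema_depth helper and one possible counting convention, which is an assumption: each object or array container adds a level, and scalar fields add none.

```python
def schema_depth(schema):
    """Count nested object/array levels in a schema dict (scalars count as 0)."""
    kind = str(schema.get("type", "")).lower()
    if kind == "object":
        children = schema.get("properties", {}).values()
        return 1 + max((schema_depth(child) for child in children), default=0)
    if kind == "array":
        return 1 + schema_depth(schema.get("items", {}))
    return 0

flat = {"type": "object", "properties": {"priority": {"type": "string"}}}
nested = {
    "type": "object",
    "properties": {
        "customer": {
            "type": "object",
            "properties": {
                "orders": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {"total": {"type": "number"}},
                    },
                }
            },
        }
    },
}
print(schema_depth(flat), schema_depth(nested))
```

Under this convention the batch schema shown earlier (an array of flat objects) sits at two levels, and the nested example here would be a candidate for flattening or a two-pass extraction.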

Why This Belongs in Your Stack

The honest case for this approach over regex pipelines or fine-tuned classifiers is flexibility. When your product adds a new issue category, you update a string in your enum. When your survey question changes, you update the prompt. There’s no retraining cycle, no labeled dataset to curate, no model deployment. For organizations processing hundreds to low thousands of documents per day, Gemini 2.5 Pro’s structured output hits a sweet spot that’s hard to replicate cheaply with traditional NLP. Schema enforcement means the JSON you get back already has the shape your database expects, so the validation layer in between can stay thin: a guarded parse and a few sanity checks on values rather than a full cleanup pass. If you’ve ever debugged a production pipeline that choked on an unexpected null where a string should be, you know exactly how much that’s worth.

promptyze
