LLM Output Parsing & Structured Data: JSON, Function Calling, Validation

Extract structured data reliably from LLMs. This guide covers JSON parsing, function calling, Pydantic validation, and error handling for production applications.

Introduction: The Structured Data Challenge

Large Language Models generate impressive natural language responses, but production applications rarely need prose—they need structured data. Your customer service chatbot doesn’t just want a friendly paragraph about order status; it needs an order ID, status code, and estimated delivery date in a format your database understands. Your document processor doesn’t want a text summary; it needs extracted fields validated and ready for insertion into your CRM.

The gap between LLM text generation and structured data requirements represents one of the biggest friction points in production AI systems. Models are non-deterministic, prone to adding explanation around requested JSON, inconsistent in field naming, and creative in inventing schemas. A prompt asking for JSON might return markdown-wrapped JSON, or JSON with comments, or prose followed by JSON, or malformed JSON that crashes your parser.
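To make these failure modes concrete, here is a minimal sketch that feeds a few response strings to Python's standard `json.loads` and records which ones parse unchanged. The sample responses are invented for illustration and do not come from a real model.

```python
import json

fence = "`" * 3  # markdown code fence, built here to keep the sample readable

# Hypothetical raw responses, one per failure mode described above
samples = {
    "clean": '{"order_id": "A123", "status": "shipped"}',
    "markdown_wrapped": fence + 'json\n{"order_id": "A123"}\n' + fence,
    "with_comment": '{"order_id": "A123" // the order\n}',
    "prose_then_json": 'Sure! Here is the data: {"order_id": "A123"}',
    "trailing_comma": '{"order_id": "A123",}',
}


def parses(text: str) -> bool:
    """Return True if json.loads accepts the text exactly as given."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


results = {name: parses(text) for name, text in samples.items()}
print(results)
```

Only the clean response parses. The other four need either preprocessing or a more constrained generation method, which is what the rest of this guide covers.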

Yet structured data extraction is critical for AI applications. Every form processor, data pipeline, API integration, workflow automation, and analytics system requires reliable parsing of LLM outputs into typed, validated data structures. This guide reveals production-grade techniques for extracting structured data from LLMs with reliability approaching traditional APIs.

We’ll cover JSON extraction, OpenAI function calling, Pydantic validation, error handling, retry strategies, and schema enforcement—everything needed to transform unreliable text into production-ready structured data.

The Structured Output Spectrum

Different approaches offer varying levels of structure and reliability.

Reliability Hierarchy

Level 1: Prompt-Based JSON (60-80% reliability)

  • Ask model to return JSON
  • Parse with try/except
  • Handle failures manually

Level 2: Constrained Decoding (85-95% reliability)

  • JSON mode (OpenAI)
  • Structured outputs (Anthropic)
  • Grammar-based generation

Level 3: Function Calling (90-98% reliability)

  • Native function call interfaces
  • Typed parameters
  • Built-in validation

Level 4: Typed Extraction with Validation (95-99% reliability)

  • Function calling + Pydantic
  • Multi-level validation
  • Automatic retry on failure

Approach Comparison

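| Method | Reliability (approx.) | Flexibility | Setup Complexity | Best For |
|---|---|---|---|---|
| Prompt JSON | 70% | High | Low | Prototypes |
| JSON Mode | 90% | Medium | Low | Simple structures |
| Function Calling | 95% | Medium | Medium | API integrations |
| Typed + Validation | 98% | Lower | High | Production systems |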
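Reliability figures like these vary with the model, the prompt, and the schema, so it is worth measuring the rate on your own data. A minimal sketch, assuming you have logged raw model outputs as strings (the function name and sample list are hypothetical):

```python
import json


def parse_success_rate(raw_outputs: list[str]) -> float:
    """Fraction of raw outputs that json.loads accepts without any cleanup."""
    if not raw_outputs:
        return 0.0
    ok = 0
    for text in raw_outputs:
        try:
            json.loads(text)
            ok += 1
        except json.JSONDecodeError:
            pass
    return ok / len(raw_outputs)


# Made-up log of four responses: two parse, two do not
logged = ['{"a": 1}', "not json", "[1, 2]", '{"a": 1,}']
rate = parse_success_rate(logged)
print(f"{rate:.0%}")
```

Running the same measurement before and after switching methods gives you a like-for-like comparison on your own workload.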

Strategy 1: Prompt-Based JSON Extraction

The simplest approach: ask nicely and parse carefully.

Basic JSON Extraction

import json
import re
from typing import Any, Optional

class JSONExtractor:
    def extract_json(self, text: str) -> Optional[Any]:
        """Extract a JSON object or array from an LLM response."""
        # Remove markdown code blocks
        text = re.sub(r'```json\s*', '', text)
        text = re.sub(r'```\s*', '', text)
        
        # Find the outermost object and array candidates. Try whichever starts
        # first, so an array of objects is not cut down to an invalid fragment.
        candidates = [
            m for m in (
                re.search(r'\{.*\}', text, re.DOTALL),
                re.search(r'\[.*\]', text, re.DOTALL),
            ) if m
        ]
        
        for match in sorted(candidates, key=lambda m: m.start()):
            try:
                return json.loads(match.group())
            except json.JSONDecodeError:
                repaired = self.repair_json(match.group())
                if repaired is not None:
                    return repaired
        
        return None
    
    def repair_json(self, malformed_json: str) -> Optional[Any]:
        """Attempt to repair common JSON issues (best effort)."""
        # Remove comments; the lookbehind leaves "://" in URLs alone
        repaired = re.sub(r'(?<!:)//[^\n]*', '', malformed_json)
        repaired = re.sub(r'/\*.*?\*/', '', repaired, flags=re.DOTALL)
        
        # Remove trailing commas
        repaired = re.sub(r',(\s*[}\]])', r'\1', repaired)
        
        try:
            return json.loads(repaired)
        except json.JSONDecodeError:
            pass
        
        # Last resort: swap single quotes for double quotes. This corrupts
        # apostrophes inside values, so it runs only after the safer repairs fail.
        try:
            return json.loads(repaired.replace("'", '"'))
        except json.JSONDecodeError:
            return None

# Usage
extractor = JSONExtractor()

llm_response = """Here's the data you requested:
```json
{
  "name": "John Doe",
  "age": 30,
  "email": "john@example.com",
}
```"""

data = extractor.extract_json(llm_response)
print(data)  # {'name': 'John Doe', 'age': 30, 'email': 'john@example.com'}
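Regex matching is greedy and can grab too much when a response contains stray braces or brackets. One alternative, shown here as a sketch and not as part of the extractor above, is to scan for the first position where the standard library's `json.JSONDecoder.raw_decode` succeeds. The helper name `first_json_value` is hypothetical.

```python
import json
from typing import Any, Optional


def first_json_value(text: str) -> Optional[Any]:
    """Return the first decodable JSON object or array embedded in text."""
    decoder = json.JSONDecoder()
    for i, ch in enumerate(text):
        if ch not in "{[":
            continue
        try:
            # raw_decode parses one JSON value starting at index i and
            # ignores whatever follows it
            value, _end = decoder.raw_decode(text, i)
            return value
        except json.JSONDecodeError:
            continue  # not valid JSON from this position; keep scanning
    return None


print(first_json_value('Here you go: {"a": [1, 2]} hope that helps'))
```

This approach handles prose before and after the JSON, but it does not repair malformed JSON, so it complements the repair step instead of replacing it.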

Robust Prompting for JSON

class StructuredPromptBuilder:
    def build_json_prompt(
        self,
        task: str,
        schema: dict,
        examples: Optional[list[dict]] = None
    ) -> str:
        """Build prompt optimized for JSON output."""
        prompt_parts = [
            f"Task: {task}",
            "",
            "CRITICAL: Return ONLY valid JSON. No explanations, no markdown, no preamble.",
            "",
            "Expected schema:",
            json.dumps(schema, indent=2)
        ]
        
        if examples:
            prompt_parts.append("\nExamples:")
            for i, example in enumerate(examples, 1):
                prompt_parts.append(f"\nExample {i}:")
                prompt_parts.append(json.dumps(example, indent=2))
        
        prompt_parts.extend([
            "",
            "Return your response as a JSON object matching the schema above.",
            "Do not wrap in markdown code blocks.",
            "Do not include any text before or after the JSON."
        ])
        
        return "\n".join(prompt_parts)

# Usage
builder = StructuredPromptBuilder()

schema = {
    "product_name": "string",
    "price": "number",
    "in_stock": "boolean",
    "categories": ["string"]
}

prompt = builder.build_json_prompt(
    task="Extract product information from the description",
    schema=schema,
    examples=[
        {
            "product_name": "Wireless Mouse",
            "price": 29.99,
            "in_stock": True,
            "categories": ["Electronics", "Accessories"]
        }
    ]
)

# `llm` stands in for whatever client wrapper you use; generate() is a placeholder
response = llm.generate(prompt)
data = extractor.extract_json(response)

JSON Mode (OpenAI)

import json
from openai import OpenAI

class OpenAIJSONExtractor:
    def __init__(self, api_key: str):
        self.client = OpenAI(api_key=api_key)
    
    def extract_structured_data(
        self,
        prompt: str,
        model: str = "gpt-4-turbo-preview"
    ) -> dict:
        """Use JSON mode so the response is syntactically valid JSON."""
        response = self.client.chat.completions.create(
            model=model,
            messages=[
                {
                    "role": "system",
                    "content": "You are a data extraction assistant. Always respond with valid JSON."
                },
                {
                    "role": "user",
                    "content": prompt
                }
            ],
            response_format={"type": "json_object"}  # Enforces JSON
        )
        
        return json.loads(response.choices[0].message.content)

# Usage
extractor = OpenAIJSONExtractor(api_key="your-key")

result = extractor.extract_structured_data(
    """Extract information from: "Premium wireless mouse, $29.99, in stock"
    
    Return JSON with: product_name, price, in_stock"""
)

print(result)
# Syntactically valid JSON, but JSON mode does not enforce field names or types
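JSON mode constrains syntax, not content: the output can still be cut off at the token limit or omit fields you asked for. A minimal sketch of a post-check, assuming you pass in the message content and the `finish_reason` from the response yourself (the helper name is hypothetical):

```python
import json


def parse_json_mode_output(content: str, finish_reason: str, required: set[str]) -> dict:
    """Parse JSON-mode output, rejecting truncated or incomplete results."""
    if finish_reason == "length":
        # The model hit its token limit, so the JSON may be cut off
        raise ValueError("Response truncated before the JSON was complete")
    data = json.loads(content)
    if not isinstance(data, dict):
        raise ValueError("Expected a JSON object")
    missing = required - set(data)
    if missing:
        raise ValueError(f"Missing keys: {sorted(missing)}")
    return data


checked = parse_json_mode_output(
    '{"product_name": "Mouse", "price": 29.99, "in_stock": true}',
    finish_reason="stop",
    required={"product_name", "price", "in_stock"},
)
print(checked["price"])
```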

Strategy 2: Function Calling

Function calling is more reliable than prompt-only extraction because the model fills in typed parameters that you define up front.

OpenAI Function Calling

import json
from openai import OpenAI

class FunctionCallingExtractor:
    def __init__(self, api_key: str):
        self.client = OpenAI(api_key=api_key)
    
    def extract_with_function(
        self,
        prompt: str,
        function_name: str,
        function_description: str,
        parameters: dict
    ) -> dict:
        """Extract data using function calling."""
        response = self.client.chat.completions.create(
            model="gpt-4-turbo-preview",
            messages=[{"role": "user", "content": prompt}],
            tools=[
                {
                    "type": "function",
                    "function": {
                        "name": function_name,
                        "description": function_description,
                        "parameters": parameters
                    }
                }
            ],
            tool_choice={"type": "function", "function": {"name": function_name}}
        )
        
        # Extract function arguments (delivered as a JSON string on the tool call)
        tool_calls = response.choices[0].message.tool_calls
        if not tool_calls:
            raise ValueError("Model returned no tool call")
        
        return json.loads(tool_calls[0].function.arguments)

# Usage - Product Extraction
extractor = FunctionCallingExtractor(api_key="your-key")

product_data = extractor.extract_with_function(
    prompt="Extract: 'Premium Wireless Mouse - $29.99, in stock, Electronics category'",
    function_name="extract_product",
    function_description="Extract product information",
    parameters={
        "type": "object",
        "properties": {
            "product_name": {
                "type": "string",
                "description": "Product name"
            },
            "price": {
                "type": "number",
                "description": "Product price in USD"
            },
            "in_stock": {
                "type": "boolean",
                "description": "Whether product is in stock"
            },
            "category": {
                "type": "string",
                "description": "Product category"
            }
        },
        "required": ["product_name", "price", "in_stock"]
    }
)

print(product_data)
# {'product_name': 'Premium Wireless Mouse', 'price': 29.99, 'in_stock': True, 'category': 'Electronics'}
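Even with a forced tool choice, it helps to parse the tool call defensively. The sketch below checks that a tool call exists, that it names the expected function, and that its arguments are valid JSON. It uses a hand-built `SimpleNamespace` object as a stand-in for `response.choices[0].message`, so it runs without an API key. The helper name is hypothetical.

```python
import json
from types import SimpleNamespace


def parse_tool_arguments(message, expected_name: str) -> dict:
    """Return parsed arguments from the first tool call on a chat message."""
    tool_calls = getattr(message, "tool_calls", None)
    if not tool_calls:
        raise ValueError("Model answered without calling a tool")
    call = tool_calls[0]
    if call.function.name != expected_name:
        raise ValueError(f"Unexpected tool: {call.function.name}")
    try:
        return json.loads(call.function.arguments)
    except json.JSONDecodeError as exc:
        raise ValueError(f"Tool arguments were not valid JSON: {exc}") from exc


# Stand-in for response.choices[0].message, built by hand for illustration
fake_message = SimpleNamespace(
    tool_calls=[
        SimpleNamespace(
            function=SimpleNamespace(
                name="extract_product",
                arguments='{"product_name": "Premium Wireless Mouse", "price": 29.99, "in_stock": true}',
            )
        )
    ]
)

args = parse_tool_arguments(fake_message, "extract_product")
print(args["product_name"], args["price"])
```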

Complex Schema with Nested Objects

def extract_invoice_data(invoice_text: str) -> dict:
    """Extract structured invoice data."""
    extractor = FunctionCallingExtractor(api_key="your-key")
    
    return extractor.extract_with_function(
        prompt=f"Extract all information from this invoice:\n\n{invoice_text}",
        function_name="extract_invoice",
        function_description="Extract structured invoice information",
        parameters={
            "type": "object",
            "properties": {
                "invoice_number": {"type": "string"},
                "date": {"type": "string", "description": "ISO format date"},
                "vendor": {
                    "type": "object",
                    "properties": {
                        "name": {"type": "string"},
                        "address": {"type": "string"},
                        "tax_id": {"type": "string"}
                    },
                    "required": ["name"]
                },
                "line_items": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "description": {"type": "string"},
                            "quantity": {"type": "number"},
                            "unit_price": {"type": "number"},
                            "total": {"type": "number"}
                        },
                        "required": ["description", "quantity", "unit_price", "total"]
                    }
                },
                "subtotal": {"type": "number"},
                "tax": {"type": "number"},
                "total": {"type": "number"}
            },
            "required": ["invoice_number", "date", "vendor", "line_items", "total"]
        }
    )

# Usage
invoice_text = """
INVOICE #INV-2024-001
Date: 2024-01-15

Vendor: Acme Corp
Address: 123 Main St, City, State 12345
Tax ID: 12-3456789

Line Items:
1. Premium Widgets (10 × $5.00) = $50.00
2. Standard Gadgets (5 × $10.00) = $50.00

Subtotal: $100.00
Tax (8%): $8.00
Total: $108.00
"""

data = extract_invoice_data(invoice_text)
print(json.dumps(data, indent=2))
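A JSON schema checks types and required fields, but it cannot check arithmetic. For invoices, a quick consistency pass over the extracted dict catches numbers the model misread. This is a sketch under the assumption that the dict follows the schema above; the function name and the 0.01 tolerance are illustrative choices.

```python
from decimal import Decimal


def invoice_total_problems(data: dict, tolerance: str = "0.01") -> list[str]:
    """Return a list of arithmetic inconsistencies found in an extracted invoice."""
    tol = Decimal(tolerance)
    problems = []
    line_sum = Decimal("0")

    for i, item in enumerate(data.get("line_items", []), 1):
        # Convert through str so float inputs such as 29.99 stay exact
        qty = Decimal(str(item["quantity"]))
        unit = Decimal(str(item["unit_price"]))
        total = Decimal(str(item["total"]))
        if abs(qty * unit - total) > tol:
            problems.append(f"line {i}: {qty} x {unit} != {total}")
        line_sum += total

    if "subtotal" in data:
        subtotal = Decimal(str(data["subtotal"]))
        if abs(line_sum - subtotal) > tol:
            problems.append(f"subtotal {subtotal} != sum of lines {line_sum}")
        if "tax" in data:
            expected = subtotal + Decimal(str(data["tax"]))
            if abs(expected - Decimal(str(data["total"]))) > tol:
                problems.append(f"total {data['total']} != subtotal + tax {expected}")

    return problems


sample = {
    "line_items": [
        {"description": "Premium Widgets", "quantity": 10, "unit_price": 5.00, "total": 50.00},
        {"description": "Standard Gadgets", "quantity": 5, "unit_price": 10.00, "total": 50.00},
    ],
    "subtotal": 100.00,
    "tax": 8.00,
    "total": 108.00,
}
print(invoice_total_problems(sample))
```

The sample mirrors the invoice text above, so the returned list is empty. Changing any single figure produces a message naming the inconsistent line or total.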

Anthropic Tool Use

from anthropic import Anthropic

class ClaudeToolExtractor:
    def __init__(self, api_key: str):
        self.client = Anthropic(api_key=api_key)
    
    def extract_with_tool(
        self,
        prompt: str,
        tool_name: str,
        tool_description: str,
        input_schema: dict
    ) -> dict:
        """Extract data using Claude's tool use."""
        message = self.client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=4096,
            tools=[
                {
                    "name": tool_name,
                    "description": tool_description,
                    "input_schema": input_schema
                }
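            ],
            # Force the model to call this tool instead of answering in prose
            tool_choice={"type": "tool", "name": tool_name},
            messages=[{"role": "user", "content": prompt}]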
        )
        
        # Extract tool use
        for content in message.content:
            if content.type == "tool_use":
                return content.input
        
        raise ValueError("No tool use found in response")

# Usage
claude_extractor = ClaudeToolExtractor(api_key="your-key")

contact_data = claude_extractor.extract_with_tool(
    prompt="Extract contact info: 'John Doe, john@example.com, +1-555-0123, San Francisco, CA'",
    tool_name="extract_contact",
    tool_description="Extract contact information",
    input_schema={
        "type": "object",
        "properties": {
            "name": {"type": "string"},
            "email": {"type": "string"},
            "phone": {"type": "string"},
            "city": {"type": "string"},
            "state": {"type": "string"}
        },
        "required": ["name", "email"]
    }
)

print(contact_data)
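The two providers describe tools with the same JSON Schema but wrap it differently: the OpenAI examples above nest the schema under `function.parameters`, while the Anthropic example uses a flat dict with `input_schema`. If you maintain one set of definitions, a small converter keeps them in sync. A sketch (the helper name is hypothetical):

```python
def openai_tool_to_anthropic(tool: dict) -> dict:
    """Convert an OpenAI-style tool definition to the Anthropic tool shape."""
    fn = tool["function"]
    return {
        "name": fn["name"],
        "description": fn.get("description", ""),
        "input_schema": fn["parameters"],
    }


openai_tool = {
    "type": "function",
    "function": {
        "name": "extract_contact",
        "description": "Extract contact information",
        "parameters": {
            "type": "object",
            "properties": {"name": {"type": "string"}, "email": {"type": "string"}},
            "required": ["name", "email"],
        },
    },
}

anthropic_tool = openai_tool_to_anthropic(openai_tool)
print(anthropic_tool["name"], list(anthropic_tool["input_schema"]["properties"]))
```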

Strategy 3: Pydantic Validation

Combine function calling with Pydantic for type-safe extraction.

Basic Pydantic Model

from pydantic import BaseModel, ConfigDict, Field, ValidationInfo, field_validator
from typing import List, Optional
from datetime import date

class Product(BaseModel):
    product_name: str = Field(..., min_length=1, max_length=200)
    price: float = Field(..., gt=0)
    in_stock: bool
    categories: List[str] = Field(default_factory=list)
    sku: Optional[str] = None
    
    @field_validator('price')
    @classmethod
    def validate_price(cls, v):
        if v > 100000:
            raise ValueError('Price seems unreasonably high')
        return round(v, 2)
    
    @field_validator('product_name')
    @classmethod
    def clean_name(cls, v):
        return v.strip().title()

class PydanticExtractor:
    def __init__(self, llm_extractor):
        self.llm = llm_extractor
    
    def extract_and_validate(
        self,
        prompt: str,
        model_class: type[BaseModel]
    ) -> BaseModel:
        """Extract data and validate with Pydantic."""
        # Convert Pydantic model to function schema
        schema = model_class.model_json_schema()
        
        # Extract using function calling
        raw_data = self.llm.extract_with_function(
            prompt=prompt,
            function_name=f"extract_{model_class.__name__.lower()}",
            function_description=f"Extract {model_class.__name__} data",
            parameters=schema
        )
        
        # Validate and parse with Pydantic
        try:
            return model_class.model_validate(raw_data)
        except Exception as e:
            raise ValueError(f"Validation failed: {e}") from e

# Usage
llm_extractor = FunctionCallingExtractor(api_key="your-key")
pydantic_extractor = PydanticExtractor(llm_extractor)

product = pydantic_extractor.extract_and_validate(
    prompt="Extract: 'premium WIRELESS mouse - $29.99, in stock, electronics'",
    model_class=Product
)

print(product.product_name)  # "Premium Wireless Mouse" (cleaned and formatted)
print(product.price)  # 29.99 (validated and rounded)
print(product.in_stock)  # True
print(product.model_dump())  # Get as dictionary

Complex Nested Models

from decimal import Decimal
from enum import Enum

class Currency(str, Enum):
    USD = "USD"
    EUR = "EUR"
    GBP = "GBP"

class Address(BaseModel):
    street: str
    city: str
    state: Optional[str] = None
    postal_code: str
    country: str = "USA"

class Vendor(BaseModel):
    name: str
    address: Address
    tax_id: Optional[str] = None
    email: Optional[str] = None
    
    @field_validator('email')
    @classmethod
    def validate_email(cls, v):
        if v and '@' not in v:
            raise ValueError('Invalid email format')
        return v

class LineItem(BaseModel):
    description: str
    quantity: int = Field(..., gt=0)
    unit_price: Decimal = Field(..., gt=0)
    total: Decimal = Field(..., gt=0)
    
    @field_validator('total')
    @classmethod
    def validate_total(cls, v, info: ValidationInfo):
        # info.data holds the fields that were validated before this one
        values = info.data
        if 'quantity' in values and 'unit_price' in values:
            expected = values['quantity'] * values['unit_price']
            if abs(v - expected) > Decimal('0.01'):
                raise ValueError(f'Total {v} does not match quantity × price')
        return v

class Invoice(BaseModel):
    model_config = ConfigDict(use_enum_values=True)
    
    invoice_number: str
    date: date
    vendor: Vendor
    line_items: List[LineItem] = Field(..., min_length=1)
    subtotal: Decimal
    tax: Decimal = Field(..., ge=0)
    total: Decimal
    currency: Currency = Currency.USD
    notes: Optional[str] = None
    
    @field_validator('total')
    @classmethod
    def validate_total(cls, v, info: ValidationInfo):
        values = info.data
        if 'subtotal' in values and 'tax' in values:
            expected = values['subtotal'] + values['tax']
            if abs(v - expected) > Decimal('0.01'):
                raise ValueError('Total does not match subtotal + tax')
        return v

# Usage - Extract and validate complex invoice
invoice = pydantic_extractor.extract_and_validate(
    prompt=invoice_text,
    model_class=Invoice
)

# All fields are now typed and validated
print(f"Invoice: {invoice.invoice_number}")
print(f"Total: {invoice.currency} {invoice.total}")
print(f"Vendor: {invoice.vendor.name}")
for item in invoice.line_items:
    print(f"  {item.description}: {item.quantity} × ${item.unit_price} = ${item.total}")

Auto-Retry on Validation Failure

class RobustPydanticExtractor:
    def __init__(self, llm_extractor, max_retries: int = 3):
        self.llm = llm_extractor
        self.max_retries = max_retries
    
    def extract_with_retry(
        self,
        prompt: str,
        model_class: type[BaseModel]
    ) -> BaseModel:
        """Extract with automatic retry on validation failure."""
        last_error = None
        
        for attempt in range(self.max_retries):
            try:
                # Add validation error context to prompt if retrying
                if last_error:
                    enhanced_prompt = f"""{prompt}
                    
                    PREVIOUS ATTEMPT FAILED with error:
                    {last_error}
                    
                    Please correct the issue and try again."""
                else:
                    enhanced_prompt = prompt
                
                # Extract
                schema = model_class.model_json_schema()
                raw_data = self.llm.extract_with_function(
                    prompt=enhanced_prompt,
                    function_name=f"extract_{model_class.__name__.lower()}",
                    function_description=f"Extract {model_class.__name__} data",
                    parameters=schema
                )
                
                # Validate
                return model_class(**raw_data)
                
            except Exception as e:
                last_error = str(e)
                if attempt == self.max_retries - 1:
                    raise ValueError(f"Failed after {self.max_retries} attempts: {last_error}")
        
        raise ValueError("Unexpected error in retry logic")

# Usage
robust_extractor = RobustPydanticExtractor(llm_extractor, max_retries=3)

# Will automatically retry if validation fails
invoice = robust_extractor.extract_with_retry(
    prompt=invoice_text,
    model_class=Invoice
)
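The retry loop above passes the raw exception text back to the model. A shorter, field-by-field summary is often easier for the model to act on. The sketch below assumes error entries shaped like the dicts returned by Pydantic v2's `ValidationError.errors()`, each with a `loc` tuple and a `msg` string. The helper name and the sample errors are made up for illustration.

```python
def format_validation_feedback(errors: list[dict]) -> str:
    """Turn a list of validation error dicts into a short retry instruction."""
    lines = []
    for err in errors:
        # loc is a path such as ("line_items", 0, "total")
        path = ".".join(str(part) for part in err.get("loc", ())) or "(root)"
        lines.append(f"- {path}: {err.get('msg', 'invalid value')}")
    return "Fix these fields and call the function again:\n" + "\n".join(lines)


sample_errors = [
    {"loc": ("price",), "msg": "Input should be greater than 0"},
    {"loc": ("line_items", 0, "total"), "msg": "Value error, Total does not match"},
]
feedback = format_validation_feedback(sample_errors)
print(feedback)
```

In the retry loop, this string would take the place of `last_error` when the caught exception is a Pydantic `ValidationError`.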

Strategy 4: Schema Evolution and Versioning

Handle schema changes gracefully.

Version-Aware Models

from typing import ClassVar

class ProductV1(BaseModel):
    # ClassVar keeps the version out of the model's fields and schema
    schema_version: ClassVar[int] = 1
    name: str
    price: float

class ProductV2(BaseModel):
    schema_version: ClassVar[int] = 2
    name: str
    price: float
    in_stock: bool
    categories: List[str] = []

class ProductV3(BaseModel):
    schema_version: ClassVar[int] = 3
    name: str
    price: float
    in_stock: bool
    categories: List[str] = []
    vendor: str

class VersionedExtractor:
    def __init__(self, llm_extractor):
        # llm_extractor must provide extract_and_validate (e.g. PydanticExtractor)
        self.llm = llm_extractor
        self.versions = {
            1: ProductV1,
            2: ProductV2,
            3: ProductV3
        }
    
    def extract_latest(self, prompt: str) -> BaseModel:
        """Extract using latest schema version."""
        latest_version = max(self.versions.keys())
        model_class = self.versions[latest_version]
        
        return self.llm.extract_and_validate(prompt, model_class)
    
    def extract_with_fallback(self, prompt: str) -> BaseModel:
        """Try latest version, fall back to older on failure."""
        last_error = None
        for version in sorted(self.versions.keys(), reverse=True):
            try:
                model_class = self.versions[version]
                return self.llm.extract_and_validate(prompt, model_class)
            except Exception as e:
                last_error = e
        
        raise ValueError("All schema versions failed") from last_error
    
    def migrate_version(
        self,
        data: BaseModel,
        target_version: int
    ) -> BaseModel:
        """Migrate data between schema versions."""
        current_version = getattr(type(data), 'schema_version', 1)
        target_class = self.versions[target_version]
        
        # Convert to dict
        data_dict = data.model_dump()
        
        # Apply migrations. migrate_up and migrate_down are not defined in this
        # snippet; each should add or drop the fields that differ between versions.
        if current_version < target_version:
            data_dict = self.migrate_up(data_dict, current_version, target_version)
        elif current_version > target_version:
            data_dict = self.migrate_down(data_dict, current_version, target_version)
        
        return target_class(**data_dict)
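The `migrate_up` and `migrate_down` helpers called above are not defined in the snippet. One way to write them, sketched here as plain functions over dicts, is a table of single-step upgrades plus a record of which fields each version introduced. The default values chosen for new fields (`in_stock=True`, `vendor="unknown"`) are assumptions for illustration, not part of the original schemas.

```python
# Hypothetical single-step upgrades, keyed by the version they upgrade FROM
UPGRADES = {
    1: lambda d: {**d, "in_stock": True, "categories": []},  # v1 -> v2
    2: lambda d: {**d, "vendor": "unknown"},                 # v2 -> v3
}

# Fields introduced by each version; dropped when downgrading past it
ADDED_FIELDS = {2: ["in_stock", "categories"], 3: ["vendor"]}


def migrate_up(data: dict, current: int, target: int) -> dict:
    """Apply each upgrade step in order from current to target."""
    for version in range(current, target):
        data = UPGRADES[version](data)
    return data


def migrate_down(data: dict, current: int, target: int) -> dict:
    """Drop the fields added by every version above target."""
    data = dict(data)  # copy so the caller's dict is left untouched
    for version in range(current, target, -1):
        for field in ADDED_FIELDS[version]:
            data.pop(field, None)
    return data


v1 = {"name": "Mouse", "price": 29.99}
v3 = migrate_up(v1, 1, 3)
print(v3)
print(migrate_down(v3, 3, 1))
```

Keeping each step small means a migration from v1 to v3 is just the v1 to v2 step followed by the v2 to v3 step, so adding a fourth version requires only one new entry in each table.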

Optional Fields for Flexibility

from pydantic import ConfigDict

class FlexibleProduct(BaseModel):
    # Allow extra fields ('ignore' would silently drop them instead)
    model_config = ConfigDict(extra='allow')
    
    # Core required fields
    name: str
    price: float
    
    # Optional fields with defaults
    description: Optional[str] = None
    in_stock: bool = True
    categories: List[str] = Field(default_factory=list)
    sku: Optional[str] = None
    vendor: Optional[str] = None
    dimensions: Optional[dict] = None
    
    @classmethod
    def from_partial(cls, data: dict) -> 'FlexibleProduct':
        """Create from potentially incomplete data."""
        # Fill in placeholders for missing required fields
        # without mutating the caller's dict
        defaults = {'name': 'Unknown Product', 'price': 0.0}
        
        return cls(**{**defaults, **data})

Strategy 5: Batch Extraction

Process multiple items efficiently.

Batch Processing

import logging

logger = logging.getLogger(__name__)

class BatchExtractor:
    def __init__(self, llm_extractor):
        self.llm = llm_extractor
    
    def extract_batch(
        self,
        items: List[str],
        model_class: type[BaseModel],
        batch_size: int = 10
    ) -> List[BaseModel]:
        """Extract multiple items in batches."""
        results = []
        
        for i in range(0, len(items), batch_size):
            batch = items[i:i + batch_size]
            
            # Create batch prompt
            batch_prompt = self.create_batch_prompt(batch, model_class)
            
            # Extract as array
            schema = {
                "type": "object",
                "properties": {
                    "items": {
                        "type": "array",
                        "items": model_class.model_json_schema()
                    }
                },
                "required": ["items"]
            }
            
            raw_data = self.llm.extract_with_function(
                prompt=batch_prompt,
                function_name="extract_batch",
                function_description=f"Extract batch of {model_class.__name__}",
                parameters=schema
            )
            
            # Validate each item
            for item_data in raw_data['items']:
                try:
                    results.append(model_class(**item_data))
                except Exception as e:
                    # Log error but continue
                    logger.error(f"Validation failed for item: {e}")
        
        return results
    
    def create_batch_prompt(self, items: List[str], model_class: type[BaseModel]) -> str:
        """Create prompt for batch extraction."""
        items_text = "\n".join(f"{i+1}. {item}" for i, item in enumerate(items))
        
        return f"""Extract information from these items:

{items_text}

Return as an array of {model_class.__name__} objects."""

# Usage
extractor = BatchExtractor(llm_extractor)

product_texts = [
    "Premium Wireless Mouse - $29.99, in stock",
    "Mechanical Keyboard - $89.99, out of stock",
    "USB-C Hub - $49.99, in stock"
]

products = extractor.extract_batch(product_texts, Product, batch_size=10)

for product in products:
    print(f"{product.product_name}: ${product.price}")
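One side effect of the batch code above is that items failing validation are dropped, so the result list no longer lines up with the inputs. A possible remedy, sketched here under the assumption that you add a `source_index` integer field to the batch schema (a hypothetical name) and ask the model to echo the 1-based item number from the prompt, is to place each record back into the slot of its input:

```python
from typing import Optional


def align_batch(items: list[str], extracted: list[dict]) -> list[Optional[dict]]:
    """Place each extracted record at the position of the input it came from."""
    aligned: list[Optional[dict]] = [None] * len(items)
    for record in extracted:
        idx = record.get("source_index")
        in_range = isinstance(idx, int) and 1 <= idx <= len(items)
        if in_range and aligned[idx - 1] is None:
            # Strip the bookkeeping field before handing the record on
            aligned[idx - 1] = {k: v for k, v in record.items() if k != "source_index"}
    return aligned


inputs = ["Mouse - $29.99", "Keyboard - $89.99", "Hub - $49.99"]
# Made-up model output in which the second item is missing
model_output = [
    {"source_index": 1, "product_name": "Mouse", "price": 29.99},
    {"source_index": 3, "product_name": "Hub", "price": 49.99},
]
aligned = align_batch(inputs, model_output)
print(aligned)
```

A `None` entry marks an input that needs a retry or manual review, which is easier to act on than a shorter list with no indication of what went missing.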

Parallel Processing

import asyncio
import logging
from typing import List, Optional

logger = logging.getLogger(__name__)

class AsyncBatchExtractor:
    def __init__(self, llm_extractor, max_concurrent: int = 5):
        self.llm = llm_extractor
        self.semaphore = asyncio.Semaphore(max_concurrent)
    
    async def extract_one_async(
        self,
        item: str,
        model_class: type[BaseModel]
    ) -> Optional[BaseModel]:
        """Extract single item asynchronously."""
        async with self.semaphore:
            try:
                # extract_with_function_async is a placeholder name;
                # swap in your client's actual async call
                raw_data = await self.llm.extract_with_function_async(
                    prompt=f"Extract: {item}",
                    function_name=f"extract_{model_class.__name__.lower()}",
                    function_description=f"Extract {model_class.__name__}",
                    parameters=model_class.model_json_schema()
                )
                
                return model_class(**raw_data)
            except Exception as e:
                logger.error(f"Extraction failed: {e}")
                return None
    
    async def extract_batch_async(
        self,
        items: List[str],
        model_class: type[BaseModel]
    ) -> List[BaseModel]:
        """Extract multiple items in parallel."""
        tasks = [
            self.extract_one_async(item, model_class)
            for item in items
        ]
        
        results = await asyncio.gather(*tasks)
        
        # Filter out None results
        return [r for r in results if r is not None]

# Usage
async def main():
    async_extractor = AsyncBatchExtractor(llm_extractor, max_concurrent=5)
    
    products = await async_extractor.extract_batch_async(
        product_texts,
        Product
    )
    
    print(f"Extracted {len(products)} products")

# Run
asyncio.run(main())
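The concurrency pattern above can be tried without an API by swapping in a fake extractor. The sketch below uses only `asyncio` from the standard library. `fake_extract` is a stand-in that sleeps briefly and records how many calls run at once, and `bounded_gather` is a hypothetical helper name.

```python
import asyncio


async def bounded_gather(worker, items, max_concurrent: int = 5):
    """Run worker(item) for every item with at most max_concurrent in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def run(item):
        async with semaphore:
            return await worker(item)

    # return_exceptions=True returns failures as values in the result list
    # instead of raising the first one
    return await asyncio.gather(*(run(i) for i in items), return_exceptions=True)


in_flight = 0
peak = 0


async def fake_extract(item: str) -> dict:
    """Stand-in for an async LLM call; tracks concurrent executions."""
    global in_flight, peak
    in_flight += 1
    peak = max(peak, in_flight)
    await asyncio.sleep(0.01)
    in_flight -= 1
    if not item:
        raise ValueError("empty input")
    return {"text": item.upper()}


results = asyncio.run(
    bounded_gather(fake_extract, ["a", "b", "", "c", "d", "e"], max_concurrent=2)
)
print(f"peak concurrency: {peak}")
print(results)
```

Results come back in input order, with the failed item represented by its exception object, so callers can decide per item whether to retry.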

Production Error Handling

Comprehensive Error Handling

import json
from typing import Optional, Tuple

class ExtractionError(Exception):
    """Base exception for extraction errors."""
    pass

class ValidationError(ExtractionError):
    """Data failed validation."""
    pass

class ParseError(ExtractionError):
    """Failed to parse LLM output."""
    pass

class ExtractionResult(BaseModel):
    success: bool
    data: Optional[BaseModel] = None
    error: Optional[str] = None
    attempts: int = 1
    
class ProductionExtractor:
    def __init__(self, llm_extractor, max_retries: int = 3):
        self.llm = llm_extractor
        self.max_retries = max_retries
    
    def extract_safe(
        self,
        prompt: str,
        model_class: type[BaseModel]
    ) -> ExtractionResult:
        """Extract with comprehensive error handling."""
        for attempt in range(self.max_retries):
            try:
                # Extract
                result = self.llm.extract_and_validate(prompt, model_class)
                
                return ExtractionResult(
                    success=True,
                    data=result,
                    attempts=attempt + 1
                )
                
            except json.JSONDecodeError as e:
                error = f"Parse error: {e}"
                if attempt == self.max_retries - 1:
                    return ExtractionResult(
                        success=False,
                        error=error,
                        attempts=attempt + 1
                    )
            
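            except (ValidationError, ValueError) as e:
                # extract_and_validate re-raises Pydantic failures as ValueError,
                # so catch it here to allow a retry
                error = f"Validation error: {e}"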
                if attempt == self.max_retries - 1:
                    return ExtractionResult(
                        success=False,
                        error=error,
                        attempts=attempt + 1
                    )
            
            except Exception as e:
                error = f"Unexpected error: {e}"
                return ExtractionResult(
                    success=False,
                    error=error,
                    attempts=attempt + 1
                )
        
        return ExtractionResult(
            success=False,
            error="Max retries exceeded",
            attempts=self.max_retries
        )

# Usage
# Pass an object that provides extract_and_validate, such as PydanticExtractor
extractor = ProductionExtractor(pydantic_extractor, max_retries=3)

result = extractor.extract_safe(
    prompt="Extract: 'Wireless Mouse - $29.99'",
    model_class=Product
)

if result.success:
    print(f"Success after {result.attempts} attempts")
    print(result.data)
else:
    print(f"Failed: {result.error}")
    # Log error, use fallback, alert team, etc.
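`extract_safe` retries immediately, which can make things worse when the failure comes from a rate limit or a transient API error. A common pattern, not part of the code above, is exponential backoff with jitter between attempts. A minimal sketch, where the base delay, cap, and function name are illustrative choices:

```python
import random
from typing import Optional


def backoff_delays(
    max_retries: int,
    base: float = 0.5,
    cap: float = 8.0,
    jitter: bool = True,
    rng: Optional[random.Random] = None,
) -> list[float]:
    """Return the wait time in seconds before each retry attempt."""
    rng = rng or random.Random()
    delays = []
    for attempt in range(max_retries):
        # Double the delay on each attempt, up to the cap
        delay = min(cap, base * (2 ** attempt))
        if jitter:
            # Pick a random point in [0, delay] so concurrent clients spread out
            delay = rng.uniform(0, delay)
        delays.append(delay)
    return delays


print(backoff_delays(5, jitter=False))
```

Inside the retry loop, you would call `time.sleep(delay)` with the matching entry before each new attempt.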

Fallback Strategies

class FallbackExtractor:
    def __init__(self, primary_llm, fallback_llm):
        self.primary = primary_llm
        self.fallback = fallback_llm
    
    def extract_with_fallback(
        self,
        prompt: str,
        model_class: type[BaseModel]
    ) -> Tuple[BaseModel, str]:
        """Try primary LLM, fall back to secondary on failure."""
        # Try primary
        try:
            result = self.primary.extract_and_validate(prompt, model_class)
            return result, "primary"
        except Exception as primary_error:
            logger.warning(f"Primary extraction failed: {primary_error}")
        
        # Try fallback
        try:
            result = self.fallback.extract_and_validate(prompt, model_class)
            return result, "fallback"
        except Exception as fallback_error:
            logger.error(f"Fallback extraction failed: {fallback_error}")
            raise ExtractionError("Both primary and fallback failed")
    
    def extract_with_template_fallback(
        self,
        prompt: str,
        model_class: type[BaseModel],
        template_data: Optional[dict] = None
    ) -> BaseModel:
        """Fall back to template if extraction fails."""
        try:
            return self.primary.extract_and_validate(prompt, model_class)
        except Exception:
            if template_data:
                # Use template with partial data
                return model_class(**template_data)
            else:
                # Use empty/default template. model_construct() skips validation,
                # so required fields without defaults will be unset.
                return model_class.model_construct()

Monitoring and Quality Metrics

from dataclasses import dataclass
from datetime import datetime

@dataclass
class ExtractionMetrics:
    total_attempts: int = 0
    successful: int = 0
    failed: int = 0
    retries: int = 0
    validation_errors: int = 0
    parse_errors: int = 0
    avg_attempts: float = 0.0
    
    def success_rate(self) -> float:
        return self.successful / self.total_attempts if self.total_attempts > 0 else 0.0

class MonitoredExtractor:
    def __init__(self, llm_extractor):
        self.llm = llm_extractor
        self.metrics = ExtractionMetrics()
        self.extraction_history = []
    
    def extract_monitored(
        self,
        prompt: str,
        model_class: type[BaseModel]
    ) -> ExtractionResult:
        """Extract with metrics tracking."""
        start_time = datetime.now()
        self.metrics.total_attempts += 1
        
        result = self.llm.extract_safe(prompt, model_class)
        
        # Update metrics
        if result.success:
            self.metrics.successful += 1
        else:
            self.metrics.failed += 1
            
            if "validation" in result.error.lower():
                self.metrics.validation_errors += 1
            elif "parse" in result.error.lower():
                self.metrics.parse_errors += 1
        
        if result.attempts > 1:
            self.metrics.retries += result.attempts - 1
        
        # Track history
        self.extraction_history.append({
            'timestamp': start_time,
            'duration': (datetime.now() - start_time).total_seconds(),
            'success': result.success,
            'attempts': result.attempts,
            'model': model_class.__name__
        })
        
        return result
    
    def get_report(self) -> dict:
        """Generate metrics report."""
        return {
            'success_rate': f"{self.metrics.success_rate():.1%}",
            'total_attempts': self.metrics.total_attempts,
            'successful': self.metrics.successful,
            'failed': self.metrics.failed,
            'retries': self.metrics.retries,
            'validation_errors': self.metrics.validation_errors,
            'parse_errors': self.metrics.parse_errors,
            'avg_attempts': sum(h['attempts'] for h in self.extraction_history) / len(self.extraction_history) if self.extraction_history else 0
        }
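The `extraction_history` list also supports breakdowns that the aggregate report hides, such as which schema fails most often. A sketch that groups entries by the `model` key recorded above (the function name and sample history are made up):

```python
from collections import defaultdict


def success_rate_by_model(history: list[dict]) -> dict[str, float]:
    """Compute the success rate for each model class name in the history."""
    totals: dict[str, int] = defaultdict(int)
    wins: dict[str, int] = defaultdict(int)
    for entry in history:
        totals[entry["model"]] += 1
        if entry["success"]:
            wins[entry["model"]] += 1
    return {name: wins[name] / totals[name] for name in totals}


# Made-up history entries using the same keys as extraction_history
sample_history = [
    {"model": "Product", "success": True, "attempts": 1},
    {"model": "Product", "success": False, "attempts": 3},
    {"model": "Invoice", "success": True, "attempts": 2},
]
rates = success_rate_by_model(sample_history)
print(rates)
```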

Conclusion: Reliable Structured Extraction

Structured data extraction from LLMs requires systematic approaches beyond simple prompting. Function calling, Pydantic validation, retry logic, and comprehensive error handling transform unreliable text generation into production-grade data extraction.

Key implementation principles:

  1. Use function calling: Native APIs provide highest reliability
  2. Validate with Pydantic: Type safety and validation catch errors
  3. Implement retry logic: Models often correct themselves on a second attempt when given the validation error as context
  4. Handle failures gracefully: Not every extraction will succeed
  5. Monitor quality: Track success rates and common failure modes

The techniques in this guide enable building production systems that reliably extract structured data from LLM outputs—transforming creative text generators into trustworthy data processors.


Last Updated: December 2024

promptyze
