
Multimodal AI Mastery: GPT-4V, Claude 3, and Gemini Vision Guide

promptyze
Editor · Promptowy
06.12.2025 · 17 min read

Unlock vision capabilities in modern LLMs. Process images, PDFs, charts, and screenshots with GPT-4 Vision, Claude 3 Opus, and Gemini 1.5 Pro.

Introduction: Beyond Text

For years, Large Language Models operated in a purely textual world. You could ask them to write code, analyze documents, or answer questions—but only if you could describe what you needed in words. Want to analyze a chart? You’d have to manually transcribe the data. Need to understand a screenshot? Describe it verbally. Have a handwritten note? Type it out first.

Multimodal AI shatters this limitation. Modern models—GPT-4 Vision (GPT-4V), Claude 3 Opus/Sonnet, and Gemini 1.5 Pro—natively process images alongside text. This isn’t OCR or simple object detection. These models understand images semantically: they read handwritten notes, analyze complex charts, explain memes, debug UI screenshots, extract structured data from documents, and even understand spatial relationships in diagrams.

The practical implications are transformative. Customer support can now handle screenshot-based troubleshooting. Document processing extracts data from invoices and forms without templates. Code review tools analyze UI mockups. Educational applications explain visual concepts. Healthcare systems process medical images. The barrier between visual and textual information disappears.

This comprehensive guide reveals how to harness vision capabilities in production applications. From basic image understanding to advanced multi-document analysis, chart extraction, and UI automation, you’ll learn to build applications that see and understand the visual world.

Understanding Multimodal Capabilities

Different models offer varying vision capabilities.

Model Comparison Matrix

| Capability | GPT-4V | Claude 3 Opus | Claude 3.5 Sonnet | Gemini 1.5 Pro |
|---|---|---|---|---|
| Image Understanding | ★★★★★ | ★★★★★ | ★★★★★ | ★★★★★ |
| OCR/Text Extraction | ★★★★☆ | ★★★★★ | ★★★★★ | ★★★★★ |
| Chart/Graph Reading | ★★★★☆ | ★★★★☆ | ★★★★☆ | ★★★★★ |
| Document Analysis | ★★★★☆ | ★★★★★ | ★★★★★ | ★★★★★ |
| Multiple Images | ★★★☆☆ | ★★★★★ | ★★★★★ | ★★★★★ |
| Video Frames | — | ★★★★☆ | ★★★★☆ | ★★★★★ |
| PDF Native Support | — | ★★★★★ | ★★★★★ | ★★★★★ |
| Max Images per Request | ~10 | ~20 | ~20 | ~3,000 |
| Cost (per image)* | ~$0.00765 | ~$0.012 | ~$0.012 | ~$0.0026 |

*Approximate, for a typical high-resolution image at each provider's per-token input pricing

Image Input Formats

GPT-4V:

  • Supported: PNG, JPEG, WEBP, GIF (non-animated)
  • Max size: 20MB
  • Format: Base64 or URL

Claude 3/3.5:

  • Supported: PNG, JPEG, WEBP, GIF
  • Also: PDF (native support)
  • Max size: 5MB per image (PDF up to 32MB)
  • Format: Base64 only

Gemini 1.5 Pro:

  • Supported: PNG, JPEG, WEBP, GIF
  • Also: PDF, video (as frames)
  • Max size: 20MB
  • Format: Base64 or URL
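A small pre-flight check against the limits above can catch oversized or unsupported files before an API round trip. This is a sketch: the thresholds mirror the lists (base64 upload paths), and the provider keys are illustrative.

```python
from pathlib import Path

# Constraints from the lists above (base64 upload paths; URL limits may differ)
LIMITS = {
    "gpt-4v": {"formats": {".png", ".jpg", ".jpeg", ".webp", ".gif"}, "max_mb": 20},
    "claude": {"formats": {".png", ".jpg", ".jpeg", ".webp", ".gif"}, "max_mb": 5},
    "gemini": {"formats": {".png", ".jpg", ".jpeg", ".webp", ".gif"}, "max_mb": 20},
}

def check_image(path: str, provider: str) -> list[str]:
    """Return a list of constraint violations (empty list = OK to send)."""
    limits = LIMITS[provider]
    problems = []
    p = Path(path)
    if p.suffix.lower() not in limits["formats"]:
        problems.append(f"unsupported format: {p.suffix or '(none)'}")
    size_mb = p.stat().st_size / (1024 * 1024)
    if size_mb > limits["max_mb"]:
        problems.append(f"{size_mb:.1f}MB exceeds {limits['max_mb']}MB limit")
    return problems
```

Running this before encoding avoids paying for a request that the API will reject anyway.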

Vision Tokenization

Images consume significant tokens:

Token Calculation (GPT-4V):

  • Low detail: 85 tokens (fixed)
  • High detail: Based on 512px tiles
  • Example: 2048×1536 image = ~765 tokens
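The high-detail calculation can be sketched from OpenAI's published tiling rules: fit the image within 2048×2048, scale the shortest side down to 768 px if larger, then charge 170 tokens per 512-px tile plus an 85-token base.

```python
import math

def gpt4v_image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Estimate GPT-4V token cost per OpenAI's documented tiling rules."""
    if detail == "low":
        return 85  # flat rate regardless of size
    # Fit within 2048x2048
    scale = min(1.0, 2048 / max(width, height))
    width, height = width * scale, height * scale
    # Scale shortest side down to 768 px (never up)
    scale = 768 / min(width, height)
    if scale < 1.0:
        width, height = width * scale, height * scale
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return 85 + 170 * tiles

print(gpt4v_image_tokens(2048, 1536))  # -> 765
```

This reproduces the ~765-token figure used in the example above: 2048×1536 scales to 1024×768, which covers four 512-px tiles.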

Claude 3/3.5:

  • Approximately 1,600 tokens per image regardless of size
  • PDF pages: ~1,600 tokens per page

Gemini 1.5 Pro:

  • Images: ~258 tokens per image
  • Video: ~258 tokens per second

Cost Implications:

# GPT-4V cost for analyzing a high-res image
image_tokens = 765  # typical high-res
text_tokens = 200   # prompt
output_tokens = 500 # detailed analysis

cost = (image_tokens + text_tokens) * 0.00001 + output_tokens * 0.00003
# = $0.0097 + $0.015 = $0.0247 per analysis

# Gemini cost for same task
cost = (258 + 200) * 0.00000125 + 500 * 0.000005
# = $0.00057 + $0.0025 = $0.00307 per analysis
# 87% cheaper!

Implementation: Basic Vision

GPT-4V Implementation

from openai import OpenAI
import base64
from pathlib import Path

class GPT4VisionClient:
    def __init__(self, api_key: str):
        self.client = OpenAI(api_key=api_key)
    
    def analyze_image_from_file(
        self,
        image_path: str,
        prompt: str,
        detail: str = "high"
    ) -> str:
        """Analyze image file with GPT-4V."""
        # Read and encode image
        with open(image_path, "rb") as image_file:
            base64_image = base64.b64encode(image_file.read()).decode('utf-8')
        
        # Determine MIME type
        suffix = Path(image_path).suffix.lower()
        mime_types = {
            '.png': 'image/png',
            '.jpg': 'image/jpeg',
            '.jpeg': 'image/jpeg',
            '.webp': 'image/webp',
            '.gif': 'image/gif'
        }
        mime_type = mime_types.get(suffix, 'image/jpeg')
        
        # Create request
        response = self.client.chat.completions.create(
            model="gpt-4-vision-preview",
            messages=[
                {
                    "role": "user",
                    "content": [
                        {"type": "text", "text": prompt},
                        {
                            "type": "image_url",
                            "image_url": {
                                "url": f"data:{mime_type};base64,{base64_image}",
                                "detail": detail  # "low" or "high"
                            }
                        }
                    ]
                }
            ],
            max_tokens=1000
        )
        
        return response.choices[0].message.content
    
    def analyze_image_from_url(self, image_url: str, prompt: str) -> str:
        """Analyze image from URL."""
        response = self.client.chat.completions.create(
            model="gpt-4-vision-preview",
            messages=[
                {
                    "role": "user",
                    "content": [
                        {"type": "text", "text": prompt},
                        {
                            "type": "image_url",
                            "image_url": {"url": image_url}
                        }
                    ]
                }
            ]
        )
        
        return response.choices[0].message.content
    
    def analyze_multiple_images(
        self,
        images: list[str],
        prompt: str
    ) -> str:
        """Analyze multiple images together."""
        content = [{"type": "text", "text": prompt}]
        
        for image_path in images:
            with open(image_path, "rb") as image_file:
                base64_image = base64.b64encode(image_file.read()).decode('utf-8')
            
            content.append({
                "type": "image_url",
                "image_url": {
                    "url": f"data:image/jpeg;base64,{base64_image}"
                }
            })
        
        response = self.client.chat.completions.create(
            model="gpt-4-vision-preview",
            messages=[{"role": "user", "content": content}]
        )
        
        return response.choices[0].message.content

# Usage examples
vision_client = GPT4VisionClient(api_key="your-key")

# Basic image analysis
result = vision_client.analyze_image_from_file(
    "screenshot.png",
    "Describe what you see in this screenshot and identify any UI issues."
)

# Chart analysis
chart_data = vision_client.analyze_image_from_file(
    "sales_chart.png",
    "Extract the data from this chart as a structured table."
)

# Multiple images
comparison = vision_client.analyze_multiple_images(
    ["design_v1.png", "design_v2.png"],
    "Compare these two UI designs and highlight the differences."
)

Claude 3.5 Implementation

from anthropic import Anthropic
import base64

class Claude35VisionClient:
    def __init__(self, api_key: str):
        self.client = Anthropic(api_key=api_key)
    
    def analyze_image(
        self,
        image_path: str,
        prompt: str,
        model: str = "claude-3-5-sonnet-20241022"
    ) -> str:
        """Analyze image with Claude 3.5."""
        # Read and encode image
        with open(image_path, "rb") as image_file:
            image_data = base64.standard_b64encode(image_file.read()).decode("utf-8")
        
        # Determine media type
        if image_path.endswith('.png'):
            media_type = "image/png"
        elif image_path.endswith('.jpg') or image_path.endswith('.jpeg'):
            media_type = "image/jpeg"
        elif image_path.endswith('.webp'):
            media_type = "image/webp"
        elif image_path.endswith('.gif'):
            media_type = "image/gif"
        else:
            media_type = "image/jpeg"
        
        # Create message
        message = self.client.messages.create(
            model=model,
            max_tokens=4096,
            messages=[
                {
                    "role": "user",
                    "content": [
                        {
                            "type": "image",
                            "source": {
                                "type": "base64",
                                "media_type": media_type,
                                "data": image_data,
                            },
                        },
                        {
                            "type": "text",
                            "text": prompt
                        }
                    ],
                }
            ],
        )
        
        return message.content[0].text
    
    def analyze_pdf(self, pdf_path: str, prompt: str) -> str:
        """Analyze PDF document with Claude."""
        with open(pdf_path, "rb") as pdf_file:
            pdf_data = base64.standard_b64encode(pdf_file.read()).decode("utf-8")
        
        message = self.client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=4096,
            messages=[
                {
                    "role": "user",
                    "content": [
                        {
                            "type": "document",
                            "source": {
                                "type": "base64",
                                "media_type": "application/pdf",
                                "data": pdf_data,
                            },
                        },
                        {
                            "type": "text",
                            "text": prompt
                        }
                    ],
                }
            ],
        )
        
        return message.content[0].text
    
    def analyze_multiple_images(
        self,
        images: list[tuple[str, str]],  # [(path, media_type)]
        prompt: str
    ) -> str:
        """Analyze up to 20 images together."""
        content = []
        
        # Add images
        for image_path, media_type in images:
            with open(image_path, "rb") as image_file:
                image_data = base64.standard_b64encode(image_file.read()).decode("utf-8")
            
            content.append({
                "type": "image",
                "source": {
                    "type": "base64",
                    "media_type": media_type,
                    "data": image_data,
                },
            })
        
        # Add prompt
        content.append({
            "type": "text",
            "text": prompt
        })
        
        message = self.client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=4096,
            messages=[{"role": "user", "content": content}],
        )
        
        return message.content[0].text

# Usage
claude_vision = Claude35VisionClient(api_key="your-key")

# Image analysis
result = claude_vision.analyze_image(
    "invoice.png",
    "Extract all line items, amounts, and totals from this invoice."
)

# PDF analysis (Claude's unique strength)
pdf_result = claude_vision.analyze_pdf(
    "contract.pdf",
    "Summarize the key terms and highlight any unusual clauses."
)

# Multiple images
batch_result = claude_vision.analyze_multiple_images(
    [
        ("page1.png", "image/png"),
        ("page2.png", "image/png"),
        ("page3.png", "image/png"),
    ],
    "These are pages from a presentation. Create a summary of the key points."
)

Gemini 1.5 Pro Implementation

import time

import google.generativeai as genai
from PIL import Image

class GeminiVisionClient:
    def __init__(self, api_key: str):
        genai.configure(api_key=api_key)
        self.model = genai.GenerativeModel('gemini-1.5-pro')
    
    def analyze_image(self, image_path: str, prompt: str) -> str:
        """Analyze image with Gemini."""
        image = Image.open(image_path)
        
        response = self.model.generate_content([prompt, image])
        return response.text
    
    def analyze_multiple_images(
        self,
        image_paths: list[str],
        prompt: str
    ) -> str:
        """Analyze multiple images (up to 3,000!)."""
        content = [prompt]
        
        for image_path in image_paths:
            content.append(Image.open(image_path))
        
        response = self.model.generate_content(content)
        return response.text
    
    def analyze_video(
        self,
        video_path: str,
        prompt: str
    ) -> str:
        """Analyze video via the Gemini File API (Gemini samples frames itself)."""
        video_file = genai.upload_file(path=video_path)
        video_file = genai.upload_file(path=video_path)
        
        # Wait for processing
        while video_file.state.name == "PROCESSING":
            time.sleep(1)
            video_file = genai.get_file(video_file.name)
        
        response = self.model.generate_content([
            prompt,
            video_file
        ])
        
        return response.text
    
    def analyze_with_context(
        self,
        images: list[str],
        prompt: str,
        context: str
    ) -> str:
        """Analyze images with additional textual context."""
        content = [
            f"Context: {context}\n\nTask: {prompt}"
        ]
        
        for image_path in images:
            content.append(Image.open(image_path))
        
        response = self.model.generate_content(content)
        return response.text

# Usage
gemini_vision = GeminiVisionClient(api_key="your-key")

# Simple analysis
result = gemini_vision.analyze_image(
    "diagram.png",
    "Explain this architecture diagram in detail."
)

# Batch processing (Gemini's strength - up to 3,000 images!)
batch_result = gemini_vision.analyze_multiple_images(
    ["product1.jpg", "product2.jpg", ..., "product500.jpg"],
    "Categorize these products and identify any quality issues."
)

# Video analysis (Gemini exclusive feature)
video_summary = gemini_vision.analyze_video(
    "tutorial.mp4",
    "Summarize this tutorial video and list the main steps."
)

Advanced Use Cases

Document Processing and Data Extraction

class DocumentProcessor:
    def __init__(self, vision_client):
        self.vision = vision_client
    
    def extract_invoice_data(self, invoice_path: str) -> dict:
        """Extract structured data from invoice."""
        prompt = """Extract the following information from this invoice:
        
        1. Invoice number
        2. Date
        3. Vendor name and address
        4. Line items (description, quantity, unit price, total)
        5. Subtotal
        6. Tax amount
        7. Total amount
        
        Return as JSON format."""
        
        result = self.vision.analyze_image(invoice_path, prompt)
        
        # Parse JSON from response
        import json
        import re
        
        json_match = re.search(r'\{.*\}', result, re.DOTALL)
        if json_match:
            return json.loads(json_match.group())
        
        return {"raw_response": result}
    
    def extract_table_from_image(self, image_path: str) -> list[list]:
        """Extract table data from image."""
        prompt = """Extract the table from this image.
        
        Return as a markdown table, preserving all rows and columns exactly.
        Make sure to align columns properly."""
        
        result = self.vision.analyze_image(image_path, prompt)
        
        # Parse markdown table
        table = self.parse_markdown_table(result)
        return table
    
    def parse_markdown_table(self, markdown: str) -> list[list]:
        """Convert markdown table to 2D list."""
        table = []
        
        for line in markdown.strip().split('\n'):
            if '|' not in line:
                continue
            # Skip separator rows such as |---| or | :---: |
            if set(line.strip()) <= set('|-: '):
                continue
            cells = [cell.strip() for cell in line.split('|')[1:-1]]
            table.append(cells)
        
        return table
    
    def process_form(self, form_path: str, fields: list[str]) -> dict:
        """Extract specific fields from form."""
        fields_list = "\n".join(f"{i+1}. {field}" for i, field in enumerate(fields))
        
        prompt = f"""Extract these specific fields from the form:
        
        {fields_list}
        
        Return as JSON with field names as keys."""
        
        result = self.vision.analyze_image(form_path, prompt)
        
        import json
        import re
        json_match = re.search(r'\{.*\}', result, re.DOTALL)
        if json_match:
            return json.loads(json_match.group())
        
        return {}

# Usage
processor = DocumentProcessor(claude_vision)

# Process invoice
invoice_data = processor.extract_invoice_data("invoice.png")
print(f"Total: ${invoice_data.get('total_amount')}")

# Extract table
table_data = processor.extract_table_from_image("report_table.png")
for row in table_data:
    print(row)

# Process form
form_data = processor.process_form(
    "application_form.png",
    ["Full Name", "Email", "Phone", "Address", "Date of Birth"]
)

Chart and Graph Analysis

class ChartAnalyzer:
    def __init__(self, vision_client):
        self.vision = vision_client
    
    def extract_chart_data(self, chart_path: str, chart_type: str) -> dict:
        """Extract data from chart."""
        prompts = {
            'bar': """Extract data from this bar chart.
            
            Return JSON with:
            - categories: list of x-axis labels
            - values: list of corresponding values
            - title: chart title
            - units: measurement units if shown""",
            
            'line': """Extract data from this line chart.
            
            Return JSON with:
            - x_values: list of x-axis values
            - y_values: list of corresponding y values
            - series_name: name of the data series
            - title: chart title""",
            
            'pie': """Extract data from this pie chart.
            
            Return JSON with:
            - segments: list of {label, value, percentage}
            - title: chart title"""
        }
        
        prompt = prompts.get(chart_type, prompts['bar'])
        result = self.vision.analyze_image(chart_path, prompt)
        
        import json
        import re
        json_match = re.search(r'\{.*\}', result, re.DOTALL)
        if json_match:
            return json.loads(json_match.group())
        
        return {"raw_response": result}
    
    def analyze_trend(self, chart_path: str) -> dict:
        """Analyze trends in chart."""
        prompt = """Analyze this chart and identify:
        
        1. Overall trend (increasing, decreasing, stable, cyclical)
        2. Key inflection points
        3. Outliers or anomalies
        4. Rate of change
        5. Notable patterns
        
        Return as JSON."""
        
        result = self.vision.analyze_image(chart_path, prompt)
        
        import json
        import re
        json_match = re.search(r'\{.*\}', result, re.DOTALL)
        if json_match:
            return json.loads(json_match.group())
        
        return {"analysis": result}
    
    def compare_charts(self, chart_paths: list[str]) -> str:
        """Compare multiple charts."""
        prompt = """Compare these charts and identify:
        
        1. Similar trends or patterns
        2. Divergences
        3. Correlations
        4. Key differences in scale or timeframe
        5. Overall insights from the comparison"""
        
        return self.vision.analyze_multiple_images(chart_paths, prompt)

# Usage
analyzer = ChartAnalyzer(gemini_vision)

# Extract bar chart data
chart_data = analyzer.extract_chart_data("sales_chart.png", "bar")
print(f"Categories: {chart_data['categories']}")
print(f"Values: {chart_data['values']}")

# Analyze trend
trend = analyzer.analyze_trend("stock_price.png")
print(f"Trend: {trend['overall_trend']}")

# Compare multiple charts
comparison = analyzer.compare_charts([
    "q1_sales.png",
    "q2_sales.png",
    "q3_sales.png",
    "q4_sales.png"
])

UI/UX Analysis

class UIAnalyzer:
    def __init__(self, vision_client):
        self.vision = vision_client
    
    def analyze_accessibility(self, screenshot_path: str) -> str:
        """Analyze UI accessibility."""
        prompt = """Analyze this UI for accessibility issues:
        
        1. Color contrast problems
        2. Text readability
        3. Button/clickable element sizes
        4. Visual hierarchy
        5. Alternative text needs
        6. Keyboard navigation concerns
        
        Rate each area 1-10 and provide specific recommendations."""
        
        return self.vision.analyze_image(screenshot_path, prompt)
    
    def detect_ui_bugs(self, screenshot_path: str) -> list[dict]:
        """Detect UI bugs and issues."""
        prompt = """Identify any UI bugs or issues in this screenshot:
        
        Look for:
        - Layout problems (overlapping, misalignment)
        - Broken images or icons
        - Truncated text
        - Inconsistent styling
        - Missing elements
        - Poor responsive design
        
        For each issue, provide:
        - description
        - severity (low/medium/high)
        - location
        - suggested_fix
        
        Return as a JSON array."""
        
        result = self.vision.analyze_image(screenshot_path, prompt)
        
        # Parse the JSON array from the response
        import json
        import re
        json_match = re.search(r'\[.*\]', result, re.DOTALL)
        if json_match:
            return json.loads(json_match.group())
        
        return []
    
    def compare_design_mockup(
        self,
        mockup_path: str,
        implementation_path: str
    ) -> str:
        """Compare design mockup to implementation."""
        prompt = """Compare the design mockup (first image) to the implementation (second image).
        
        Identify:
        1. Design deviations
        2. Color differences
        3. Spacing/alignment issues
        4. Typography discrepancies
        5. Missing or extra elements
        6. Overall fidelity score (0-100)
        
        Return as JSON."""
        
        return self.vision.analyze_multiple_images(
            [mockup_path, implementation_path],
            prompt
        )
    
    def extract_component_specs(self, design_path: str) -> dict:
        """Extract component specifications from design."""
        prompt = """Analyze this UI design and extract specifications:
        
        For each major component, provide:
        - Type (button, input, card, etc.)
        - Dimensions (approximate)
        - Colors (primary, text, background)
        - Spacing (padding, margins)
        - Typography (font size, weight)
        - States (default, hover, active, disabled)
        
        Return as JSON."""
        
        result = self.vision.analyze_image(design_path, prompt)
        
        import json
        import re
        json_match = re.search(r'\{.*\}', result, re.DOTALL)
        if json_match:
            return json.loads(json_match.group())
        
        return {}

# Usage (any client exposing analyze_image / analyze_multiple_images)
ui_analyzer = UIAnalyzer(gemini_vision)

# Check accessibility
accessibility = ui_analyzer.analyze_accessibility("dashboard.png")

# Find bugs
bugs = ui_analyzer.detect_ui_bugs("mobile_app.png")
for bug in bugs:
    print(f"{bug['severity']}: {bug['description']}")

# Compare mockup vs implementation
comparison = ui_analyzer.compare_design_mockup(
    "design.png",
    "screenshot.png"
)
print(comparison)

Code from Screenshots

class CodeExtractor:
    def __init__(self, vision_client):
        self.vision = vision_client
    
    def extract_code(self, screenshot_path: str, language: str = None) -> str:
        """Extract code from screenshot."""
        lang_hint = f" (it's {language} code)" if language else ""
        
        prompt = f"""Extract the code from this screenshot{lang_hint}.
        
        Return only the code, properly formatted, without any markdown backticks or explanations.
        Preserve indentation and structure exactly."""
        
        return self.vision.analyze_image(screenshot_path, prompt)
    
    def debug_screenshot(self, error_screenshot_path: str) -> dict:
        """Debug code error from screenshot."""
        prompt = """Analyze this error screenshot and provide:
        
        1. Error type and message
        2. Root cause analysis
        3. Affected code section
        4. Step-by-step fix
        5. Prevention strategies
        
        Return as JSON."""
        
        result = self.vision.analyze_image(error_screenshot_path, prompt)
        
        import json
        import re
        json_match = re.search(r'\{.*\}', result, re.DOTALL)
        if json_match:
            return json.loads(json_match.group())
        
        return {"analysis": result}
    
    def review_ui_code(
        self,
        ui_screenshot_path: str,
        code_screenshot_path: str
    ) -> str:
        """Review code against UI screenshot."""
        prompt = """Compare the UI (first image) with the code (second image).
        
        Analyze:
        1. Does the code match the UI structure?
        2. Are styles correctly implemented?
        3. Missing or incorrect elements
        4. Code quality and best practices
        5. Recommendations for improvement"""
        
        return self.vision.analyze_multiple_images(
            [ui_screenshot_path, code_screenshot_path],
            prompt
        )

# Usage
extractor = CodeExtractor(claude_vision)

# Extract code from screenshot
code = extractor.extract_code("code_snippet.png", language="python")
print(code)

# Debug error
debug_info = extractor.debug_screenshot("error_screen.png")
print(f"Error: {debug_info['error_type']}")
print(f"Fix: {debug_info['fix']}")

Best Practices and Optimization

Image Quality Optimization

from PIL import Image
import io

class ImageOptimizer:
    def optimize_for_vision_api(
        self,
        image_path: str,
        max_size_kb: int = 4000,
        max_dimension: int = 2048
    ) -> bytes:
        """Optimize image for vision API."""
        image = Image.open(image_path)
        
        # Resize if too large
        if max(image.size) > max_dimension:
            ratio = max_dimension / max(image.size)
            new_size = tuple(int(dim * ratio) for dim in image.size)
            image = image.resize(new_size, Image.Resampling.LANCZOS)
        
        # Convert to RGB if needed
        if image.mode not in ('RGB', 'L'):
            image = image.convert('RGB')
        
        # Compress to target size
        quality = 95
        while quality > 20:
            buffer = io.BytesIO()
            image.save(buffer, format='JPEG', quality=quality, optimize=True)
            
            if buffer.tell() <= max_size_kb * 1024:
                return buffer.getvalue()
            
            quality -= 5
        
        # If still too large, reduce dimensions
        scale = 0.9
        while buffer.tell() > max_size_kb * 1024 and scale > 0.3:
            new_size = tuple(int(dim * scale) for dim in image.size)
            resized = image.resize(new_size, Image.Resampling.LANCZOS)
            
            buffer = io.BytesIO()
            resized.save(buffer, format='JPEG', quality=85, optimize=True)
            scale -= 0.1
        
        return buffer.getvalue()

# Usage
optimizer = ImageOptimizer()
optimized = optimizer.optimize_for_vision_api("large_image.png")

# Save optimized image
with open("optimized.jpg", "wb") as f:
    f.write(optimized)

Prompt Engineering for Vision

class VisionPromptBuilder:
    def build_analysis_prompt(
        self,
        task: str,
        output_format: str = "text",
        focus_areas: list[str] = None
    ) -> str:
        """Build effective vision analysis prompt."""
        prompt_parts = [f"Task: {task}"]
        
        if focus_areas:
            prompt_parts.append(
                "Focus specifically on:\n" + 
                "\n".join(f"- {area}" for area in focus_areas)
            )
        
        if output_format == "json":
            prompt_parts.append(
                "Return the result as valid JSON without any markdown formatting."
            )
        elif output_format == "table":
            prompt_parts.append(
                "Return the result as a markdown table."
            )
        elif output_format == "list":
            prompt_parts.append(
                "Return the result as a numbered list."
            )
        
        # Add specificity guidelines
        prompt_parts.append(
            "Be specific and detailed. Include exact measurements, colors, and positions when relevant."
        )
        
        return "\n\n".join(prompt_parts)
    
    def build_extraction_prompt(
        self,
        fields: list[str],
        context: str = None
    ) -> str:
        """Build data extraction prompt."""
        prompt = "Extract the following information from this image:\n\n"
        prompt += "\n".join(f"{i+1}. {field}" for i, field in enumerate(fields))
        
        if context:
            prompt = f"Context: {context}\n\n" + prompt
        
        prompt += "\n\nReturn as JSON with field names as keys."
        return prompt

# Usage
prompt_builder = VisionPromptBuilder()

analysis_prompt = prompt_builder.build_analysis_prompt(
    task="Analyze this product image",
    output_format="json",
    focus_areas=["product condition", "visible defects", "packaging quality"]
)

extraction_prompt = prompt_builder.build_extraction_prompt(
    fields=["product name", "price", "SKU", "dimensions"],
    context="This is a product listing image from an e-commerce site"
)

Cost Optimization for Vision

import hashlib

class VisionCostOptimizer:
    def __init__(self):
        self.cache = {}
    
    def hash_image(self, image_path: str) -> str:
        """Content hash used for cache keys."""
        with open(image_path, "rb") as f:
            return hashlib.sha256(f.read()).hexdigest()
    
    def should_use_vision(self, query: str, image_path: str) -> bool:
        """Determine if a vision API call is necessary."""
        # is_text_only() is application-specific, e.g. a cheap local OCR pass
        if self.is_text_only(image_path):
            return False
        
        # Skip the call if a cached result already exists
        cache_key = f"{query}:{self.hash_image(image_path)}"
        if cache_key in self.cache:
            return False
        
        return True
    
    def select_cheapest_model(
        self,
        task_complexity: str,
        num_images: int
    ) -> str:
        """Select most cost-effective model."""
        if num_images > 50:
            return "gemini-1.5-pro"  # Cheapest per image
        elif task_complexity == "simple":
            return "gemini-1.5-flash"
        elif num_images <= 5:
            return "gpt-4-vision-preview"
        else:
            return "claude-3-5-sonnet"
    
    def batch_process_efficiently(
        self,
        vision_client,
        images: list[str],
        prompt: str
    ) -> list:
        """Process images in cost-effective batches."""
        # group_similar_images() and benefits_from_batch() are
        # application-specific heuristics, left undefined in this sketch
        batches = self.group_similar_images(images)
        
        results = []
        for batch in batches:
            # Use multi-image analysis when beneficial
            if len(batch) > 1 and self.benefits_from_batch(prompt):
                result = vision_client.analyze_multiple_images(batch, prompt)
                results.append(result)
            else:
                for img in batch:
                    result = vision_client.analyze_image(img, prompt)
                    results.append(result)
        
        return results
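The cache above is consulted but never filled. A minimal memoizing wrapper completes the loop; this is a sketch assuming any client that exposes `analyze_image(path, prompt)`:

```python
import hashlib

class CachedVision:
    """Memoize vision calls on (prompt, image content)."""
    
    def __init__(self, vision_client):
        self.vision = vision_client
        self.cache: dict[str, str] = {}
    
    def analyze_image(self, image_path: str, prompt: str) -> str:
        # Hash the file contents so a renamed copy still hits the cache
        with open(image_path, "rb") as f:
            digest = hashlib.sha256(f.read()).hexdigest()
        key = f"{prompt}:{digest}"
        if key not in self.cache:
            self.cache[key] = self.vision.analyze_image(image_path, prompt)
        return self.cache[key]
```

Because it mirrors the client's interface, the wrapper can be dropped in front of any of the clients defined earlier without changing calling code.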

Production Considerations

Error Handling

import asyncio

class RobustVisionClient:
    def __init__(self, vision_client):
        self.vision = vision_client
        self.max_retries = 3
    
    async def analyze_with_retry(
        self,
        image_path: str,
        prompt: str
    ) -> dict:
        """Analyze with retry logic."""
        for attempt in range(self.max_retries):
            try:
                result = self.vision.analyze_image(image_path, prompt)
                return {
                    'success': True,
                    'result': result,
                    'attempts': attempt + 1
                }
            except Exception as e:
                if attempt == self.max_retries - 1:
                    return {
                        'success': False,
                        'error': str(e),
                        'attempts': attempt + 1
                    }
                
                # Exponential backoff
                await asyncio.sleep(2 ** attempt)
        
        return {'success': False, 'error': 'Max retries exceeded'}

Quality Validation

class VisionQualityValidator:
    def validate_extraction(
        self,
        extracted_data: dict,
        required_fields: list[str]
    ) -> tuple[bool, list[str]]:
        """Validate extracted data completeness."""
        missing_fields = [
            field for field in required_fields
            if field not in extracted_data or not extracted_data[field]
        ]
        
        return len(missing_fields) == 0, missing_fields
    
    def validate_response_quality(self, response: str, min_length: int = 50) -> bool:
        """Check if response meets quality threshold."""
        if len(response) < min_length:
            return False
        
        # Check for common failure patterns
        failure_patterns = [
            "I cannot",
            "I'm unable to",
            "I don't see",
            "The image appears to be",
            "I cannot access"
        ]
        
        return not any(pattern in response for pattern in failure_patterns)

Conclusion: Seeing is Believing

Multimodal AI capabilities transform what’s possible with LLMs. From document processing to UI analysis, chart extraction to code debugging, vision-enabled models handle tasks that previously required specialized OCR systems, computer vision models, or human review.

Key implementation principles:

  1. Choose the right model: Gemini for scale/cost, Claude for documents, GPT-4V for complex analysis
  2. Optimize images: Balance quality with token costs
  3. Engineer prompts carefully: Specificity matters even more with images
  4. Handle errors gracefully: Vision APIs can fail in unique ways
  5. Validate outputs: Check extraction completeness and quality

The multimodal revolution is here. Applications that can see and understand visual information are no longer science fiction—they’re production reality.


Last Updated: December 2024
