How to Fine-Tune Claude on Your Company’s Docs: A Practical Enterprise Guide

Step-by-step tutorial on fine-tuning Claude on proprietary company documents, with copy-paste prompts, cost estimation, and batch processing for legal and consulting teams.

Generic AI assistants hallucinate on your internal docs, misread your legal templates, and give advice that sounds plausible but would make your compliance team cry. Fine-tuning fixes that. Instead of wrestling Claude into shape with increasingly elaborate system prompts, you train a version of the model that actually knows your procedures, your terminology, and your standards — from day one of a conversation.

Anthropic’s fine-tuning API, available for Claude models including Sonnet and Haiku, lets enterprise teams build specialized assistants on proprietary datasets. The Batch API sits alongside it, handling large-scale document processing at half the cost of real-time inference. For legal teams annotating contracts, consultants synthesizing industry reports, or researchers working through dense literature, this combination is genuinely useful — once you know how to set it up properly.

This guide walks through the full workflow: preparing your dataset, structuring training examples, running a fine-tuning job, estimating costs, and testing the result. Every step includes copy-paste prompts and the parameters that actually matter.

What You’ll Actually Achieve

By the end of this tutorial, you’ll have a fine-tuned Claude model that responds in your organization’s tone, understands your domain-specific vocabulary, and handles your document formats without needing a paragraph of context in every prompt. You’ll also have a cost estimate before you spend a dollar, and a testing framework to verify the model improved before you deploy it to anyone important.

Requirements Before You Start

You need an Anthropic API account with fine-tuning access enabled. This is available on paid plans, so confirm that your tier includes fine-tuning before you build your dataset. You’ll also need Python 3.8 or later, the Anthropic Python SDK, and a dataset of at least 10 examples for testing (100 or more for anything you’d actually use in production). Plan for a training run that takes several hours, and budget accordingly: fine-tuning is not cheap, but a trained model can cost less per query because each prompt no longer has to carry a paragraph of instructions and context.

Note 💡

Fine-tuning access on the Anthropic API may require contacting their sales or enterprise team depending on your current plan. Check your dashboard under API settings before assembling your dataset — nothing worse than 200 curated examples and no way to submit them.

Step 1 — Build a Dataset That’s Actually Worth Training On

The quality of your fine-tuned model is largely determined by the quality of your training data. A hundred excellent examples will usually outperform a thousand mediocre ones. For enterprise knowledge work, this means selecting documents that represent your real use cases: contracts your team reviews weekly, client briefings with standard structures, research summaries in your house style, compliance memos that follow specific formatting rules.

The Anthropic fine-tuning API uses a conversational format. Each training example is a JSON object with a list of messages — a user turn and an assistant turn, at minimum. The assistant turn is what you’re teaching the model to produce. Here’s the base structure:

{
  "messages": [
    {
      "role": "user",
      "content": "Summarize the key obligations in this NDA clause: 'The Receiving Party shall not disclose, reproduce, or distribute Confidential Information to any third party without prior written consent from the Disclosing Party, for a period of five (5) years following the termination of this Agreement.'"
    },
    {
      "role": "assistant",
      "content": "Key obligations: (1) No disclosure, reproduction, or distribution of Confidential Information to third parties. (2) Any permitted disclosure requires prior written consent from the Disclosing Party. (3) Obligations survive for five years post-termination. Risk flag: The clause does not define 'Confidential Information' — check Section 1 for the definition, or flag as ambiguous."
    }
  ]
}

That assistant response is doing several things at once: it summarizes in a numbered format, it flags a legal risk, and it directs the user to cross-reference another section. All three behaviors are what you want the fine-tuned model to replicate on new clauses it’s never seen. The format you put in is the format you get out.

For a consulting context, the structure shifts toward structured insight delivery:

{
  "messages": [
    {
      "role": "user",
      "content": "Based on this market data excerpt, what's the key strategic implication for a mid-market manufacturer entering the Southeast Asian EV supply chain?"
    },
    {
      "role": "assistant",
      "content": "Strategic implication: The window for supplier qualification is narrowing. OEMs in the region are finalizing Tier-2 vendor lists through Q3, meaning a manufacturer without an existing relationship or local JV partner faces a 12-18 month delay before meaningful volume. Recommended action: prioritize distributor partnership over direct market entry. Supporting data point needed: verify current localization requirements under Indonesia's TKDN policy before committing to a manufacturing footprint."
    }
  ]
}
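If your source material already lives as prompt/response pairs (in a spreadsheet export, for instance), records in this shape can be generated rather than typed by hand. The following is a minimal sketch under that assumption; `make_example` and `to_jsonl` are illustrative helper names, not part of any SDK, and the sample pair is placeholder text.

```python
import json

def make_example(user_text, assistant_text):
    """Wrap one prompt/response pair in the messages structure shown above."""
    return {
        "messages": [
            {"role": "user", "content": user_text},
            {"role": "assistant", "content": assistant_text},
        ]
    }

def to_jsonl(examples):
    """Serialize examples as JSONL text: one JSON object per line."""
    # ensure_ascii=False keeps characters such as § or é readable in the output
    return "".join(json.dumps(ex, ensure_ascii=False) + "\n" for ex in examples)

# Placeholder pair for illustration only
pairs = [
    ("Summarize the key obligations in this NDA clause: [clause text]",
     "Key obligations: (1) [first obligation]. (2) [second obligation]. Risk flag: [issue]."),
]
examples = [make_example(user, assistant) for user, assistant in pairs]
jsonl_text = to_jsonl(examples)
print(jsonl_text)
```

Writing `jsonl_text` to disk with `open(path, "w", encoding="utf-8")` gives a file with one example per line.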

Pro tip ✅

Include a system prompt in your training examples if your production deployment uses one. The model learns to respond appropriately given that system context — omitting it from training but including it at inference creates a distribution mismatch that degrades performance.

Step 2 — Prepare and Validate Your JSONL File

Training data goes into a JSONL file — one JSON object per line, no trailing commas, UTF-8 encoding. Validation before upload saves you from discovering format errors after a failed job that still bills your account. Here’s a quick Python validator you can run locally:

import json

def validate_training_file(filepath):
    """Check that every non-blank line is a JSON object with a usable 'messages' list."""
    errors = []
    count = 0  # non-blank lines seen, so blank lines do not inflate the total
    with open(filepath, 'r', encoding='utf-8') as f:
        for i, line in enumerate(f, 1):
            line = line.strip()
            if not line:
                continue
            count += 1
            try:
                example = json.loads(line)
            except json.JSONDecodeError as e:
                errors.append(f"Line {i}: JSON parse error: {e}")
                continue
            if not isinstance(example, dict) or 'messages' not in example:
                errors.append(f"Line {i}: Missing 'messages' key")
                continue
            messages = example['messages']
            # .get() avoids a KeyError when a message has no 'role' field
            if not any(isinstance(m, dict) and m.get('role') == 'assistant' for m in messages):
                errors.append(f"Line {i}: No assistant turn found")
    if errors:
        for err in errors:
            print(err)
    else:
        print(f"Validation passed. {count} examples ready.")
    return errors

validate_training_file('training_data.jsonl')

Run this before uploading anything. The Anthropic API will reject malformed files, and debugging a 500-example dataset line by line is not how you want to spend an afternoon.
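The validator only confirms that an assistant turn exists somewhere in each example. A stricter check on turn order can catch more subtle mistakes. The sketch below assumes a conversation should start with a user turn, alternate roles, and end with an assistant turn. It also assumes a system message, if present inside `messages`, can be ignored for ordering purposes. Both rules are assumptions about the training format, so confirm them against the format documentation. `check_turn_order` is a hypothetical helper name.

```python
def check_turn_order(messages):
    """Return a list of problems with the role sequence (an empty list means OK)."""
    problems = []
    # Assumption: an optional system message does not count toward alternation
    turns = [m for m in messages if m.get("role") != "system"]
    if not turns:
        return ["no user or assistant turns"]
    if turns[0].get("role") != "user":
        problems.append("first turn is not 'user'")
    if turns[-1].get("role") != "assistant":
        problems.append("last turn is not 'assistant'")
    # Compare each turn with the one before it to detect repeated roles
    for prev, cur in zip(turns, turns[1:]):
        if prev.get("role") == cur.get("role"):
            problems.append(f"two consecutive '{cur.get('role')}' turns")
    for m in turns:
        content = m.get("content")
        if not isinstance(content, str) or not content.strip():
            problems.append(f"empty content in a '{m.get('role')}' turn")
    return problems

good = [{"role": "user", "content": "Summarize this clause."},
        {"role": "assistant", "content": "Key obligations: (1) ..."}]
print(check_turn_order(good))
```

Calling this on `example['messages']` inside the validation loop and appending any problems to `errors` would fold it into the same report.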

Step 3 — Upload Your Dataset and Start a Fine-Tuning Job

With a validated JSONL file ready, the upload and job creation process is straightforward through the SDK:

import anthropic

# With no api_key argument, the SDK reads ANTHROPIC_API_KEY from the environment,
# which keeps the key out of source control
client = anthropic.Anthropic()

# Upload training file
with open('training_data.jsonl', 'rb') as f:
    training_file = client.beta.files.upload(
        file=('training_data.jsonl', f, 'application/jsonl')
    )

print(f"File uploaded: {training_file.id}")

Once uploaded, you create the fine-tuning job by specifying the base model and your file ID. For enterprise document work, starting with a capable base model gives you the strongest foundation to specialize from:

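# Create fine-tuning job
# Note: the fine_tuning endpoint and its parameters are shown as this guide describes them;
# confirm they exist in your SDK version before relying on them
job = client.beta.fine_tuning.jobs.create(
    model="claude-haiku-4-5",  # confirm available models in your API dashboard
    training_file=training_file.id,
    hyperparameters={
        "n_epochs": 3  # number of passes over the training data
    }
)

print(f"Job created: {job.id}")
print(f"Status: {job.status}")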

Warning ⚠️

Always verify which specific model versions support fine-tuning in your current API tier by checking docs.anthropic.com directly before writing your pipeline. Model availability for fine-tuning can change, and building against an unavailable model wastes time you could spend on dataset curation.

Step 4 — Monitor the Job and Retrieve Your Model

Fine-tuning jobs run asynchronously. Poll for status rather than assuming a fixed completion time — job duration varies with dataset size and current API load:

import time

def wait_for_job(client, job_id, poll_interval=60):
    while True:
        job = client.beta.fine_tuning.jobs.retrieve(job_id)
        print(f"Status: {job.status}")
        if job.status in ['succeeded', 'failed', 'cancelled']:
            return job
        time.sleep(poll_interval)

completed_job = wait_for_job(client, job.id)

if completed_job.status == 'succeeded':
    fine_tuned_model_id = completed_job.fine_tuned_model
    print(f"Model ready: {fine_tuned_model_id}")
else:
    print(f"Job did not succeed: {completed_job.status}")
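The polling loop above runs forever if a job never reaches a terminal state. One possible variation adds a timeout, sketched here as a generic helper. `poll_until_done` and its parameters are hypothetical names, and the status source is passed in as a function so the helper does not depend on any particular SDK call. The terminal status names are the ones used in the loop above.

```python
import time

def poll_until_done(fetch_status, terminal=("succeeded", "failed", "cancelled"),
                    poll_interval=60, timeout=6 * 60 * 60,
                    sleep=time.sleep, clock=time.monotonic):
    """Call fetch_status() until it returns a terminal status or the timeout passes."""
    deadline = clock() + timeout
    while True:
        status = fetch_status()
        if status in terminal:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"job still '{status}' after {timeout} seconds")
        sleep(poll_interval)

# Demo with a fake status source and a no-op sleep so it finishes instantly.
# The non-terminal status names here are made up for the demo.
fake_statuses = iter(["queued", "running", "succeeded"])
final_status = poll_until_done(lambda: next(fake_statuses), sleep=lambda seconds: None)
print(final_status)
```

In real use, `fetch_status` would wrap whatever job retrieval call your SDK version provides and return its status string.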

Save that fine_tuned_model_id somewhere persistent. You’ll use it for all inference calls, and it’s not intuitive to retrieve later if you lose it.
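A small JSON file is one simple way to persist the ID. This is a sketch: the file name, the record fields, and the IDs in the demo are all placeholders. A secrets manager or config service may suit a production setup better.

```python
import json
import tempfile
from pathlib import Path

def save_model_record(path, job_id, model_id):
    """Store the job and model IDs as JSON so later scripts can look them up."""
    record = {"job_id": job_id, "fine_tuned_model": model_id}
    Path(path).write_text(json.dumps(record, indent=2), encoding="utf-8")
    return record

def load_model_id(path):
    """Read the fine-tuned model ID back from the JSON record."""
    return json.loads(Path(path).read_text(encoding="utf-8"))["fine_tuned_model"]

# Round-trip demo in a temporary directory, using placeholder IDs
with tempfile.TemporaryDirectory() as tmp:
    record_path = Path(tmp) / "fine_tuned_model.json"
    save_model_record(record_path, "job_placeholder", "model_placeholder")
    restored = load_model_id(record_path)
print(restored)
```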

Step 5 — Use the Batch API for Large Document Processing

Once your fine-tuned model exists, the Batch API is how you process large document volumes without paying real-time inference prices. According to Anthropic’s documentation, batch processing cuts costs by roughly 50% compared to synchronous API calls — the tradeoff is that results come back asynchronously, typically within 24 hours, with a maximum of 100,000 requests per batch.

For a legal team reviewing hundreds of contracts, or a consulting firm processing a stack of due diligence documents, this is the economics that makes fine-tuned models practical at scale. Here’s how to structure a batch of document analysis requests against your fine-tuned model:

import json

def create_batch_requests(documents, fine_tuned_model_id):
    requests = []
    for i, doc in enumerate(documents):
        request = {
            "custom_id": f"doc_{i:04d}",
            "params": {
                "model": fine_tuned_model_id,
                "max_tokens": 1024,
                "messages": [
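                    {
                        "role": "user",
                        # Adjacent string literals are joined into one string;
                        # \n\n leaves a blank line before the clause
                        "content": (
                            "Review the following contract clause and identify: "
                            "(1) key obligations, (2) risk flags, (3) missing definitions.\n\n"
                            f"Clause:\n{doc}"
                        )
                    }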
                ]
            }
        }
        requests.append(request)
    return requests

# Submit batch
batch = client.beta.messages.batches.create(
    requests=create_batch_requests(your_documents, fine_tuned_model_id)
)

print(f"Batch submitted: {batch.id}")
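Because each batch is capped at 100,000 requests, larger workloads need to be split before submission. Here is a minimal sketch of that split. `chunk_requests` is an illustrative helper name, and it checks only the request count; if a payload size cap also applies to your account, that would need a separate check.

```python
def chunk_requests(requests, max_per_batch=100_000):
    """Split a list of batch requests into lists no longer than max_per_batch."""
    if max_per_batch < 1:
        raise ValueError("max_per_batch must be at least 1")
    return [requests[i:i + max_per_batch]
            for i in range(0, len(requests), max_per_batch)]

# Demo with small numbers: 250 stand-in requests, at most 100 per batch
chunks = chunk_requests(list(range(250)), max_per_batch=100)
print([len(chunk) for chunk in chunks])  # [100, 100, 50]
```

Each chunk would then be submitted as its own batch, with the returned batch IDs recorded so results can be collected later.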

Pro tip ✅

Use the custom_id field to match batch results back to your source documents. A descriptive ID like contract_2024_087 is far more useful than a bare numeric index when you’re reconciling 500 outputs two days later.
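One way to get descriptive IDs is to derive them from the document reference itself and keep a lookup table for reconciliation. The sketch below assumes IDs should contain only letters, digits, underscores, and hyphens, and should stay within 64 characters. Treat both limits as assumptions to check against the Batch API documentation. The helper names and sample references are hypothetical.

```python
import re

def make_custom_id(source_name, max_len=64):
    """Turn a contract reference into a readable ID with a restricted character set."""
    # Replace every run of other characters with a single underscore
    safe = re.sub(r"[^A-Za-z0-9_-]+", "_", source_name)
    return safe.strip("_")[:max_len]

def build_id_map(source_names):
    """Map each custom_id to its source reference; fail loudly on collisions."""
    id_map = {}
    for name in source_names:
        cid = make_custom_id(name)
        if not cid or cid in id_map:
            raise ValueError(f"empty or duplicate custom_id for {name!r}")
        id_map[cid] = name
    return id_map

id_map = build_id_map(["contract 2024/087", "contract 2024/088"])
print(id_map)
```

When results come back, looking up each result's `custom_id` in `id_map` recovers the original document reference.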

Step 6 — Calculate Your Costs Before You Commit

Fine-tuning is not something you want to discover the price of after the job runs. Here’s a straightforward cost estimator based on Anthropic’s published pricing. Note that pricing can change — always verify current rates at anthropic.com/pricing before running production jobs.

def estimate_finetuning_cost(
    training_examples,
    avg_tokens_per_example,
    n_epochs=3
):
    """
    Estimate fine-tuning cost based on token consumption.
    Verify current pricing at anthropic.com/pricing before use.
    """
    # Approximate: input + output tokens per example
    total_tokens = training_examples * avg_tokens_per_example * n_epochs
    
    # Placeholder rates for illustration; replace with the current rates from Anthropic
    input_cost_per_million = 15.0   # USD per million input tokens (placeholder)
    output_cost_per_million = 75.0  # USD per million output tokens (placeholder)
    
    # Rough split: ~70% input, ~30% output tokens
    input_tokens = total_tokens * 0.7
    output_tokens = total_tokens * 0.3
    
    input_cost = (input_tokens / 1_000_000) * input_cost_per_million
    output_cost = (output_tokens / 1_000_000) * output_cost_per_million
    
    return {
        'total_tokens': total_tokens,
        'estimated_cost_usd': round(input_cost + output_cost, 2)
    }

# Example: 500 examples, ~800 tokens each, 3 epochs
estimate = estimate_finetuning_cost(500, 800, n_epochs=3)
print(f"Estimated training cost: ${estimate['estimated_cost_usd']:.2f}")
# → $39.60 at the rates assumed above (1.2M tokens: $12.60 input + $27.00 output)
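The estimator needs an average token count per example, which can be approximated from the JSONL file itself. The sketch below uses a rough rule of thumb of about four characters per token for English text. That ratio is an assumption and real tokenization will differ, so treat the result as a ballpark. A token-counting endpoint, if your SDK version provides one, would give exact numbers. The function names are illustrative.

```python
import json

def rough_token_count(text, chars_per_token=4):
    """Very rough token estimate: character count divided by chars_per_token."""
    return max(1, round(len(text) / chars_per_token))

def avg_tokens_per_example(jsonl_lines):
    """Average estimated tokens per example, given an iterable of JSONL lines."""
    totals = []
    for line in jsonl_lines:
        if not line.strip():
            continue  # skip blank lines
        messages = json.loads(line)["messages"]
        totals.append(sum(rough_token_count(m["content"]) for m in messages))
    return sum(totals) / len(totals) if totals else 0

sample_lines = [
    '{"messages": [{"role": "user", "content": "abcdefgh"}, '
    '{"role": "assistant", "content": "abcd"}]}',
]
print(avg_tokens_per_example(sample_lines))  # 3.0 (2 tokens + 1 token)
```

Passing an open file handle for `training_data.jsonl` works as well, since iterating a file yields its lines.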

For inference after training, batch processing at scale becomes significantly more economical than real-time calls. A legal team running 10,000 contract clause analyses per month through batch API spends a fraction of what they’d pay for synchronous inference — and the fine-tuned model means fewer follow-up queries to correct the output.
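To make that comparison concrete, the arithmetic can be sketched as follows. Every number here is hypothetical: the per-million-token rates, the token counts per request, and the 50% batch discount are stand-ins to be replaced with current figures.

```python
def monthly_inference_cost(requests_per_month, input_tokens, output_tokens,
                           input_rate, output_rate, batch_discount=0.0):
    """Monthly cost in USD; rates are USD per million tokens."""
    per_request = (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000
    return round(requests_per_month * per_request * (1 - batch_discount), 2)

# Hypothetical workload: 10,000 clause analyses, 1,500 input and 400 output tokens each
sync_cost = monthly_inference_cost(10_000, 1_500, 400, input_rate=1.0, output_rate=5.0)
batch_cost = monthly_inference_cost(10_000, 1_500, 400, input_rate=1.0, output_rate=5.0,
                                    batch_discount=0.5)
print(sync_cost, batch_cost)  # 35.0 17.5
```

The saving scales linearly with volume, so the comparison matters most for teams with steady, high monthly throughput.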

Pro tip ✅

Run your first fine-tuning job with a small dataset — 50 to 100 high-quality examples — before committing to a full production dataset. Verify the model’s behavior on your test set first. If the outputs aren’t meaningfully better than the base model, the problem is almost always in the training data quality, not the number of examples.

Step 7 — Test Against a Held-Out Evaluation Set

Never deploy a fine-tuned model without a structured evaluation. Before you start training, reserve 10-15% of your dataset as a held-out test set — examples the model never saw during training. After the job completes, run your fine-tuned model and the base model against the same test prompts and compare outputs systematically.
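Reserving the held-out set can be done with a seeded shuffle so the split is reproducible across runs. This is a minimal sketch; `split_dataset` is an illustrative name, and the demo data is a stand-in for parsed training examples.

```python
import random

def split_dataset(examples, test_fraction=0.1, seed=42):
    """Shuffle a copy of the examples and split it into (train, test) lists."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Stand-in data: 100 numbered items in place of real examples
train_set, test_set = split_dataset(list(range(100)), test_fraction=0.15)
print(len(train_set), len(test_set))  # 85 15
```

If several examples come from the same source document, splitting by document rather than by example helps keep near-duplicates out of the test set.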

def evaluate_model(client, model_id, test_examples):
    results = []
    for example in test_examples:
        user_message = next(
            m['content'] for m in example['messages'] 
            if m['role'] == 'user'
        )
        expected = next(
            m['content'] for m in example['messages'] 
            if m['role'] == 'assistant'
        )
        
        response = client.messages.create(
            model=model_id,
            max_tokens=1024,
            messages=[{"role": "user", "content": user_message}]
        )
        
        results.append({
            'expected': expected,
            'actual': response.content[0].text,
            'prompt': user_message
        })
    
    return results

# Run on both base and fine-tuned model
base_results = evaluate_model(client, "claude-haiku-4-5", test_set)
ft_results = evaluate_model(client, fine_tuned_model_id, test_set)

For qualitative domains like legal analysis, automated evaluation is limited, so you’ll need a human reviewer checking a sample of outputs. But you can automate format compliance checks: does the output use your numbered structure? Does it include the risk flag section? Does it cross-reference other sections when appropriate? These structural checks give you a fast first-pass signal on whether training worked.
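Those format checks can be expressed as a few regular expressions. The patterns below are assumptions modeled on the NDA example earlier in this guide (numbered points like "(1)", a "Risk flag:" label, and references like "Section 1"), so adapt them to your own house format. The function names are illustrative.

```python
import re

def check_format(output):
    """Report which structural markers appear in a model output."""
    return {
        # at least two numbered points such as "(1) ..." and "(2) ..."
        "numbered_points": len(re.findall(r"\(\d+\)", output)) >= 2,
        # a "Risk flag:" or "Risk flags:" label, matched case-insensitively
        "risk_flag": re.search(r"risk flags?\s*:", output, re.IGNORECASE) is not None,
        # a pointer to another section, e.g. "Section 1"
        "cross_reference": re.search(r"\bsection\s+\d+", output, re.IGNORECASE) is not None,
    }

def compliance_rate(outputs):
    """Fraction of outputs that pass every structural check."""
    if not outputs:
        return 0.0
    passed = sum(all(check_format(o).values()) for o in outputs)
    return passed / len(outputs)

sample = ("Key obligations: (1) No disclosure. (2) Consent required. "
          "Risk flag: term undefined, check Section 1.")
print(check_format(sample))
print(compliance_rate([sample, "Looks fine."]))  # 0.5
```

Running `compliance_rate` over the `actual` field of both result sets gives a quick side-by-side number for the base and fine-tuned models.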

Avoid 🚫

Don’t include sensitive client data or personally identifiable information in your training dataset without a clear data processing agreement with Anthropic. Enterprise teams handling legal or financial documents should review Anthropic’s data privacy policies and, where required, use an API tier that offers data isolation before uploading proprietary materials.

Here are three production-ready prompts structured for fine-tuning training data. Each follows the format that produces consistent, structured outputs your team can actually use.

For contract obligation extraction:

Extract all party obligations from the following contract clause. Format your response as:
OBLIGATIONS — [Party Name]: [obligation 1]; [obligation 2]
TIMELINE: [any deadlines or duration]
RISK FLAGS: [ambiguous terms, missing definitions, or unusual provisions]
CROSS-REFERENCE: [sections that should be read alongside this clause]

Clause: [insert clause text]

For research synthesis in consulting contexts:

You are a senior strategy consultant reviewing source material for a client briefing.
Synthesize the following excerpt into: (1) the single most important insight, (2) the supporting evidence, (3) a recommended action, and (4) what additional data would strengthen or challenge this conclusion.

Source: [insert excerpt]

For compliance memo generation:

Draft a compliance memo based on the following regulatory update. Use our standard format:
— SUMMARY (2 sentences max)
— AFFECTED TEAMS: [list]
— REQUIRED ACTIONS BY DATE: [specific, numbered]
— RISK IF NOT ADDRESSED: [one paragraph]
— PREPARED BY: [leave blank]

Regulatory update: [insert update text]
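To turn a template like these into training data, each filled-in prompt is paired with an answer your team has reviewed. The sketch below does this for the research synthesis prompt using Python's `string.Template`. The `$excerpt` placeholder and the helper name are illustrative choices, and the sample inputs are placeholder text.

```python
from string import Template

SYNTHESIS_PROMPT = Template(
    "You are a senior strategy consultant reviewing source material for a client briefing.\n"
    "Synthesize the following excerpt into: (1) the single most important insight, "
    "(2) the supporting evidence, (3) a recommended action, and (4) what additional data "
    "would strengthen or challenge this conclusion.\n\n"
    "Source: $excerpt"
)

def build_training_example(excerpt, reviewed_answer):
    """Pair a filled-in prompt with a human-reviewed answer as one training example."""
    return {
        "messages": [
            {"role": "user", "content": SYNTHESIS_PROMPT.substitute(excerpt=excerpt)},
            {"role": "assistant", "content": reviewed_answer},
        ]
    }

example = build_training_example("[excerpt text]", "[reviewed answer in house format]")
print(example["messages"][0]["content"])
```

Using the same template at training time and at inference time keeps the prompt wording consistent between the two.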

What This Means for Your Team

Fine-tuning Claude on your organization’s documents is not a weekend project, but it’s also not a six-month initiative. A well-curated dataset of 200 to 500 examples, a properly structured JSONL file, and one fine-tuning job gives you a model that speaks your organization’s language from prompt one — no system prompt essays, no constant correction, no explaining what an NDA risk flag means every single time.

The economics work for teams with high query volumes and consistent document types. Legal teams reviewing contracts daily, consultants synthesizing research at scale, and researchers processing literature are the obvious fits. The Batch API amplifies this further: process a week’s worth of documents overnight at half the inference cost, wake up to structured outputs, and let your team spend their time on the 20% of cases the model correctly flags as needing human judgment.

The upfront investment is real — dataset curation takes longer than the actual training job, and the first run is rarely perfect. But a fine-tuned model that consistently produces structured, accurate outputs in your format isn’t a nice-to-have for enterprise knowledge work. It’s the difference between AI that assists and AI that integrates.

promptyze
