
    Fine-Tuning a 32B Legal LLM That Outperformed a Frontier Model at 4× Lower Serving Cost

    April 21, 2026 · 5 min read

    Everyone assumes bigger models produce better results. LegML set out to prove otherwise.

    They fine-tuned a 32B-parameter legal LLM — internally called "Hugo" — that outperformed a leading frontier model on their domain benchmarks, including +10% higher factual precision, with half the parameters. The model was trained in 14 days on distributed H100 GPUs for approximately €22,500 and serves on a single H200 at $3.15/hour.

    This post breaks down the training approach, infrastructure decisions, serving economics, and what made a smaller, domain-specific model viable against a general-purpose frontier model in a precision-critical domain.

    LegML's requirements were shaped by the legal domain's constraints. Client data cannot leave the platform — no black-box API calls, no data routed through third-party inference providers. Models must run inside a company's own infrastructure (on-prem, private cloud, or sovereign cloud), and outputs must be auditable for citation accuracy and legal consistency.

    These constraints rule out hosted frontier model APIs for production legal workflows. The alternative is fine-tuning an open-weight model on domain-specific data, which introduces its own set of infrastructure and methodology challenges.

    Training Methodology: Supervised Fine-Tuning + GRPO

    LegML combined two approaches to push Hugo's performance on legal tasks:

    Supervised fine-tuning (SFT) on legal workflows — contract drafting, regulatory compliance, and legal Q&A — using curated corpora of domain-specific examples. This established the model's baseline capability on legal language, structure, and citation conventions.

    Group Relative Policy Optimization (GRPO) to improve reasoning quality beyond what SFT alone could achieve. GRPO is a reinforcement learning technique that optimizes model outputs relative to grouped comparisons rather than absolute scores, making it particularly effective for tasks where quality is contextual — like legal reasoning, where a correct answer depends on jurisdiction, precedent, and statutory interpretation.
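The core idea behind GRPO's "grouped comparisons" can be sketched in a few lines. This is an illustrative reconstruction of the group-relative advantage step, not LegML's actual training code; the reward values are made up.

```python
# Sketch of GRPO's group-relative advantage computation. For each prompt,
# several completions are sampled and scored; each completion's advantage
# is its reward normalized against the others in the same group.
from statistics import mean, stdev

def group_relative_advantages(rewards, eps=1e-6):
    """rewards: scores for one prompt's group of sampled completions."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: four candidate answers to one legal question, scored by a
# hypothetical verifier for citation accuracy.
advs = group_relative_advantages([0.9, 0.4, 0.4, 0.1])
# Completions above the group mean get positive advantages and are
# reinforced; those below are penalized -- no absolute score is needed.
```

Because the baseline is the group's own mean, this removes the need for a separate learned value model, which is part of what makes GRPO attractive for fine-tuning runs on a fixed GPU budget.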

    The combination pushed Hugo to outperform the frontier model baseline on LegML's benchmark suite across factual precision, citation accuracy, and legal consistency. Two independent legal experts reviewed outputs and confirmed Hugo produced more accurate and complete answers on legal reasoning tasks.

    Infrastructure Challenges Before Scaling Fine-Tuning

    Before settling on their final training setup, LegML hit the infrastructure problems that most teams encounter when fine-tuning models above 20B parameters:

    • Capacity quotas on European neoclouds. Strict GPU quotas and studio environments capped at ≤20B models made it impossible to run full-parameter fine-tuning of a 32B model.

    • Cost-performance gaps. Alternative providers offered capacity but at higher cost and lower throughput relative to H100/H200 clusters.

    • Ops overhead. Manual cluster setup, checkpoint management, and ad-hoc monitoring dragged out each training run. Engineering time went to infrastructure instead of model quality.

    The net effect was unreliable schedules, unpredictable spend, and iteration cycles that were infrastructure-bound rather than experiment-bound.

    Training Infrastructure: Distributed H100s with Reproducible Environments

    LegML trained Hugo on FlexAI, which handled the infrastructure layer so the team could focus on data and training signals. The key infrastructure properties that mattered for this workload:

    Reproducible training environments. The full stack — CUDA, cuDNN, NCCL, drivers, container image, Python dependencies — was pinned per workload and versioned as an immutable spec. Artifacts (datasets, checkpoints, logs) were tracked in a versioned object store with lineage. Each run launched from a single job spec with GPUs allocated on demand and topology-aware placement (NVLink/PCIe domains, InfiniBand RDMA) enforced for stable throughput.
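The "single job spec" pattern described above can be illustrated roughly as follows. Field names and values here are hypothetical, not FlexAI's actual schema; the point is that everything that affects the run is pinned in one immutable, hashable record.

```python
# Illustrative shape of a pinned, versioned training job spec (field
# names are hypothetical). Every run launches from one immutable record,
# so two runs with the same spec ID start from byte-identical setups.
import hashlib
import json

job_spec = {
    "image": "trainer@sha256:<digest>",      # container pinned by digest
    "cuda": "12.4",                          # driver/stack versions pinned
    "nccl": "2.21.5",
    "python_lockfile": "requirements.lock",  # exact dependency versions
    "dataset": "legal-sft-corpus@v3",        # versioned artifact reference
    "gpus": {"type": "H100", "count": 16, "topology": "infiniband"},
    "seed": 1234,
}

# Canonical-JSON hash gives the run a stable, content-derived identity.
spec_id = hashlib.sha256(
    json.dumps(job_spec, sort_keys=True).encode()
).hexdigest()[:12]
```

Storing checkpoints and logs keyed by this spec ID is one simple way to get the lineage tracking the post describes.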

    Preemption-aware distributed training. Jobs used DDP/FSDP/ZeRO with mixed precision (bf16/fp8 where supported). Checkpointing was sharded and incremental; resume was idempotent after spot preemption or node drain. Data loaders were elastic with epoch/step accounting preserved. Observability covered tokens/sec, step latency, GPU/SM/KV-cache utilization, network bandwidth, and memory pressure — exported to Prometheus/Grafana with structured logs for failure triage.
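A minimal sketch of the "idempotent resume" property, with framework details (FSDP sharding, optimizer state) omitted: checkpoints are written atomically via a temp-file-then-rename pattern, so a preemption mid-write never corrupts state, and resume always restarts from the last complete step.

```python
# Preemption-safe, idempotent checkpointing sketch (simplified: real
# runs would shard model/optimizer state across ranks).
import json
import os
import tempfile

def save_checkpoint(ckpt_dir, step, state):
    os.makedirs(ckpt_dir, exist_ok=True)
    path = os.path.join(ckpt_dir, f"step-{step:08d}.json")
    fd, tmp = tempfile.mkstemp(dir=ckpt_dir)
    with os.fdopen(fd, "w") as f:
        json.dump({"step": step, "state": state}, f)
    os.replace(tmp, path)  # atomic rename: readers never see partial files

def resume(ckpt_dir):
    ckpts = sorted(f for f in os.listdir(ckpt_dir) if f.startswith("step-"))
    if not ckpts:
        return 0, None  # fresh start
    with open(os.path.join(ckpt_dir, ckpts[-1])) as f:
        ckpt = json.load(f)
    return ckpt["step"], ckpt["state"]  # step accounting preserved
```

Calling resume() after a spot preemption or node drain returns the last fully written step, so re-running the same job spec is safe no matter when the previous attempt died.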

    Hardware-aware placement. The scheduler matched the job profile (dense model, parameter/shard size, KV footprint, interconnect needs) to the right accelerator class and memory tier. Training ran on premium NVIDIA clusters for interconnect maturity and kernel support; serving shifted to cost-efficient accelerators with runtime optimizations (vLLM, quantization to INT8/FP8) without changing application code.

    Serving Economics: 32B on H200 vs. 70B on B200

    The economic case for a smaller, domain-tuned model is stark when you look at serving costs over time:

    Configuration              GPU       Hourly Cost   6-Month Continuous Cost
    Hugo (32B)                 1× H200   $3.15/hr      ~$9,072
    Comparable 70B baseline    2× B200   $6.25/hr      ~$36,000

    Hugo delivers higher accuracy on legal benchmarks at roughly 4× lower operating cost. Payback on the €22,500 training investment is approximately 5.3 months.
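The payback figure can be roughly reproduced from the table. The EUR/USD rate below is an assumption (~1.05), not a figure from the post.

```python
# Back-of-envelope payback check using the 6-month serving costs above.
hugo_6mo, baseline_6mo = 9_072, 36_000            # USD, from the table
monthly_savings = (baseline_6mo - hugo_6mo) / 6   # ~= $4,488/month
training_cost_usd = 22_500 * 1.05                 # EUR 22,500 at assumed rate
payback_months = training_cost_usd / monthly_savings
# payback_months lands a little above 5 months, consistent with the
# ~5.3-month figure quoted above.
```

After payback, the gap between the two configurations is pure ongoing savings, which is why the comparison favors the smaller model more the longer it runs.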

    The economics compound further when you factor in that legal workloads are typically bursty — peak demand during business hours, minimal traffic overnight — so serving costs can be reduced further with autoscaling. And because the model is sovereign (runs inside the customer's infrastructure), there are no per-token API fees scaling with usage.

    What Made the Smaller Model Win

    Three factors worked in Hugo's favor over the frontier model baseline:

    Domain-specific training data. Legal language is highly structured and convention-heavy. A model trained on curated legal corpora learns domain conventions that a general-purpose model treats as edge cases. This is especially true for citation format, jurisdictional nuance, and regulatory cross-references.

    GRPO for reasoning quality. Reinforcement learning on grouped legal comparisons pushed the model beyond the SFT ceiling on tasks that require multi-step reasoning — contract clause analysis, compliance gap identification, and precedent-based Q&A.

    Right-sized architecture. At 32B parameters, Hugo fits on a single H200 for inference. This eliminates the multi-GPU coordination overhead, tensor parallelism complexity, and inter-node communication latency that a 70B+ model requires. Simpler serving infrastructure means fewer failure modes in production.
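A back-of-envelope memory check shows why 32B is the right-sized cutoff for single-GPU serving. These are rough weight-only estimates under standard assumptions (2 bytes/param for bf16, 1 byte/param quantized), not measured numbers.

```python
# Does a 32B model fit on one H200 (141 GB HBM3e)? Weights dominate
# memory at rest; KV cache and activations need the remaining headroom.
params = 32e9
bf16_weights_gb = params * 2 / 1e9   # 64 GB at 2 bytes/param
fp8_weights_gb = params * 1 / 1e9    # 32 GB quantized to FP8/INT8
h200_hbm_gb = 141

# Even unquantized bf16 weights fit with ~77 GB left for KV cache;
# FP8 quantization roughly doubles that headroom. A 70B model in bf16
# (~140 GB of weights) would leave essentially nothing, forcing the
# multi-GPU tensor parallelism the post describes.
assert bf16_weights_gb < h200_hbm_gb
```

This is the arithmetic behind "fits on a single H200": the serving simplicity is a direct consequence of parameter count and memory tier, not a tuning trick.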

    A Reproducible Blueprint for Vertical LLMs

    LegML and FlexAI are turning this into a repeatable pattern: full-parameter fine-tuning on curated sector corpora, combined with GRPO for reasoning quality and continuous learning pipelines with hybrid human + LLM evaluation.

    The approach is already extending from law into finance, insurance, and public administration — domains where precision, governance, and data sovereignty are non-negotiable. The core tradeoff holds across verticals: if your domain has structured conventions, high accuracy requirements, and data residency constraints, a smaller fine-tuned model will likely outperform and out-economize a frontier model API.

    If you're exploring a domain-specific LLM and want to understand the cost/performance tradeoffs for your use case, reach out to LegML at legml.com or FlexAI at hello@flex.ai.

    Get Started Today

    Start building with €100 in free credits for first-time users.