AI Security SOC & SIEM

Selecting the Right LLM for On-Premise Security Operations

A practical guide to choosing and deploying language models inside your security perimeter — without compromising data sovereignty or operational performance.

20 December 2024 · 7 min read · PrivacySolid

The decision to run a large language model inside your own security perimeter — rather than calling an external API — is not primarily a technical one. It is a data governance decision with technical implications.

For government departments, defence contractors, financial regulators, and any organisation handling sensitive personal data, the question is not whether to keep AI inference on-premise. The answer to that is obvious. The question is which model to run, on what hardware, for which specific tasks.

This is a guide to thinking through that decision systematically.

Why On-Premise Matters for Security Operations

Before getting into model selection, it is worth being explicit about why this matters in a security context specifically.

Your SOC telemetry contains some of the most sensitive data in your organisation: user behaviour patterns, vulnerability exposure, incident details, and — critically — the specific gaps in your defences. Sending this data to an external AI provider, even a reputable one, creates risks that most security policies explicitly prohibit:

  • Data residency violations — for any EU public sector organisation, processing this data outside EU borders may breach AVG/GDPR obligations
  • Counterintelligence exposure — threat intelligence sharing with an external system, even encrypted in transit, expands your attack surface
  • Legal and contractual risk — many government contracts explicitly prohibit sending operational data to third-party cloud services

The on-premise requirement is a constraint, not a preference. The goal is to find the best model that fits within it.

The Performance-Size Trade-off

There is a persistent misconception that bigger models are always better. For most security operations tasks, this is not true — and the operational costs of larger models (hardware, latency, power) are significant.

We have evaluated a range of open-weight models for security-specific tasks. The relevant task categories are:

Alert triage and classification — Reading a structured alert with context and producing a risk rating and recommended action. This is largely pattern recognition on known schemas. Models in the 7B–14B range perform very well on this task when fine-tuned or given good prompting. Latency matters here; an 8B model on a capable GPU processes a triage request in under 2 seconds.
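Because triage output feeds downstream automation, it pays to demand structured replies and validate them before acting. The sketch below is illustrative only: the alert fields, prompt wording, and JSON schema are assumptions, not a real product's format, and the model call itself is stubbed with a canned reply.

```python
import json

# Hypothetical triage schema -- adapt field names to your own pipeline.
TRIAGE_FIELDS = ("risk", "action", "rationale")

def build_triage_prompt(alert: dict) -> str:
    """Render a structured alert into a triage prompt for a local model."""
    return (
        "You are a SOC triage assistant. Classify the alert below.\n"
        'Respond with JSON only: {"risk": "low|medium|high", '
        '"action": "...", "rationale": "..."}\n\n'
        f"Alert:\n{json.dumps(alert, indent=2)}"
    )

def parse_triage_reply(raw: str) -> dict:
    """Validate the model's JSON reply before anything downstream acts on it."""
    reply = json.loads(raw)
    for field in TRIAGE_FIELDS:
        if not isinstance(reply.get(field), str):
            raise ValueError(f"missing or malformed field: {field}")
    if reply["risk"] not in {"low", "medium", "high"}:
        raise ValueError(f"unexpected risk rating: {reply['risk']}")
    return reply

# A canned reply standing in for a local 8B model's output.
sample = '{"risk": "high", "action": "isolate host", "rationale": "known C2 IP"}'
print(parse_triage_reply(sample)["risk"])
```

In production the canned string would be replaced by a call to whatever local inference server you run; the point is that schema validation sits between the model and any automated response.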

Log summarisation — Condensing hundreds of log lines into a coherent narrative of what happened. Again, 14B-class models handle this well. The key capability is maintaining context across long inputs, where sliding-window attention becomes important.
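When inputs exceed the window, a common workaround is to batch log lines to a token budget and summarise hierarchically. A minimal chunker is sketched below; the four-characters-per-token heuristic and the window sizes are assumptions, so measure with your model's actual tokenizer before relying on them.

```python
# Coarse heuristic for English log text; verify against your tokenizer.
CHARS_PER_TOKEN = 4

def chunk_log_lines(lines, context_tokens=8192, reserve_tokens=1024):
    """Group log lines into batches that fit the model's context window,
    reserving room for the prompt template and the generated summary."""
    budget_chars = (context_tokens - reserve_tokens) * CHARS_PER_TOKEN
    batches, current, used = [], [], 0
    for line in lines:
        cost = len(line) + 1  # +1 for the newline
        if current and used + cost > budget_chars:
            batches.append(current)
            current, used = [], 0
        current.append(line)
        used += cost
    if current:
        batches.append(current)
    return batches

logs = [f"2024-12-20T10:00:{i:02d}Z auth failure user=svc{i}" for i in range(60)]
print(len(chunk_log_lines(logs, context_tokens=64, reserve_tokens=16)))
```

Each batch gets its own summary pass, and a final pass condenses the per-batch summaries into one narrative.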

Threat intelligence correlation — Connecting an observed indicator to known threat actor TTPs, historical incidents, and your specific environment. This benefits from larger models (34B+) with stronger reasoning, particularly for novel or ambiguous indicators where pattern matching alone is insufficient.

Incident report drafting — Generating structured incident reports from raw analyst notes and log summaries. This is a language task where quality matters more than latency, making it a good candidate for a larger, slower model running asynchronously.

Hardware Considerations for Government Environments

The hardware landscape for on-premise AI inference has improved dramatically. A few practical observations:

Consumer GPU clusters are viable for smaller models — Two or three high-end consumer GPUs can comfortably run a 14B model at production inference speeds. For air-gapped or highly restricted environments, this is often more practical than enterprise GPU servers.

AMD and Intel GPU options are maturing — NVIDIA retains a significant software ecosystem advantage, but AMD’s ROCm support for major inference frameworks has improved enough to be viable for organisations that cannot use NVIDIA hardware for procurement or policy reasons.

CPU inference for async tasks — For tasks where latency is not critical (report drafting, end-of-shift summaries), modern CPU-based inference with quantised models (4-bit, 8-bit) can be surprisingly capable and requires no GPU infrastructure at all.

Memory bandwidth matters more than raw compute — For inference (as opposed to training), the constraint is usually how fast you can move model weights to the compute units. A GPU with high memory bandwidth often outperforms a higher-FLOP GPU with slower memory.
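The bandwidth constraint lends itself to a back-of-the-envelope calculation: each generated token requires streaming roughly the full set of weights through the compute units, so single-stream decode speed is bounded by bandwidth divided by model size in bytes. The figures below are illustrative assumptions, not vendor benchmarks.

```python
def decode_ceiling_tok_s(params_billion: float, bits_per_weight: int,
                         bandwidth_gb_s: float) -> float:
    """Upper bound on single-stream decode speed. Ignores KV-cache and
    activation traffic, which only lower the real figure further."""
    model_gb = params_billion * bits_per_weight / 8  # weight bytes in GB
    return bandwidth_gb_s / model_gb

# An 8B model quantised to 4-bit is ~4 GB of weights.
# On a GPU with ~900 GB/s memory bandwidth:
print(round(decode_ceiling_tok_s(8, 4, 900)))    # ~225 tok/s ceiling
# The same model at 16-bit precision (16 GB) on the same card:
print(round(decode_ceiling_tok_s(8, 16, 900)))   # ~56 tok/s ceiling
```

The same arithmetic shows why quantisation helps so much: halving the bits per weight roughly doubles the decode ceiling on the same hardware.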

A Practical Selection Framework

When selecting a model for a specific security operations task, evaluate against these criteria in order:

  1. Task fit — Does the model perform well on the specific task type? Test with representative examples from your actual environment, not generic benchmarks.

  2. Context window — Security tasks often involve long inputs (many log lines, large alert payloads). Ensure the model’s context window is large enough for realistic inputs without truncation.

  3. Latency at your hardware tier — Benchmark inference latency on the hardware you will actually deploy. A model that performs beautifully on a high-end lab machine may be unusable on your production hardware.

  4. Quantisation compatibility — Can the model be quantised (reduced to 4-bit or 8-bit precision) without significant quality degradation on your task? Quantisation dramatically reduces hardware requirements.

  5. Fine-tuning viability — If you plan to adapt the model to your specific environment (recommended), check that fine-tuning tooling is available and that your hardware can support it.

  6. Licence — Verify the model’s licence permits commercial use and any modifications you plan to make. Several popular open-weight models have non-commercial restrictions.
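For criterion 3, the benchmark need not be elaborate: wrap whatever inference call you deploy (an HTTP client against a local server, a llama.cpp binding) in a callable and measure wall-clock latency on the production box. A minimal harness, with the model call stubbed out, might look like this:

```python
import statistics
import time

def benchmark(infer, prompts, warmup=2):
    """Return p50/p95 wall-clock latency in milliseconds over the prompts."""
    for p in prompts[:warmup]:          # warm caches and lazy initialisation
        infer(p)
    samples = []
    for p in prompts:
        start = time.perf_counter()
        infer(p)
        samples.append((time.perf_counter() - start) * 1000)
    samples.sort()
    return {
        "p50_ms": statistics.median(samples),
        "p95_ms": samples[int(len(samples) * 0.95) - 1],
        "n": len(samples),
    }

def stub_infer(prompt):
    """Placeholder: replace with a real call to your local model."""
    time.sleep(0.001)
    return "ok"

stats = benchmark(stub_infer, [f"alert {i}" for i in range(20)])
print(stats["n"])
```

Run it with representative prompts from your own environment (criterion 1) at realistic input lengths (criterion 2), so one harness exercises three of the six criteria at once.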

The Multimodal Question

Should your security AI be multimodal — capable of processing images as well as text? For most SOC applications, the answer is no, or not yet. Text-only models are smaller, faster, and better understood for security tasks. The exceptions are:

  • Visual analysis of phishing pages or malicious documents (screenshots of content that is hard to parse as text)
  • Dashboard monitoring where visual anomaly detection has value

For these niche cases, a separate, purpose-specific multimodal model is preferable to a large general multimodal model running in your main SOC pipeline.

Getting Started Without a Dedicated AI Team

If your organisation lacks AI engineering capability in-house, the path to on-premise AI for security operations typically involves:

  1. Starting with a managed service that deploys and operates the AI infrastructure within your perimeter, rather than trying to build the capability from scratch
  2. Investing in upskilling two or three existing security staff who have an interest in AI — the combination of security domain knowledge and AI literacy is genuinely rare and valuable
  3. Treating the first deployment as a learning exercise: pick a bounded, well-defined task (alert triage for one specific alert type), deploy conservatively, and measure carefully before expanding scope

The organisations that get the most from on-premise AI in security operations are those that start small, measure rigorously, and expand deliberately — not those that attempt a comprehensive AI transformation in a single project.
