AI

Inference

Running a trained model to generate output. Think of activating the Sorting Hat after all its training: it just decides.

Inference is the operational phase of AI: running a trained model on new, unseen input to produce output. In the machine learning lifecycle, training is where the model learns from large datasets (compute-intensive, done infrequently), and inference is where the learned model is applied to produce predictions or generated content (done continuously, on demand, at scale). For LLMs, inference takes a prompt as input and generates tokens one at a time until the response is complete: each new token is produced by running the prompt plus all previously generated tokens through the neural network. Every time you send a message to an AI assistant, you trigger an inference call; the model runs, produces tokens, and the output appears.
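The token-by-token loop described above can be sketched in a few lines. This is a minimal illustration, not how any particular serving stack works: `toy_model`, the `EOS` token id, and the `max_new_tokens` cap are all hypothetical stand-ins, and a real LLM would return a probability distribution over tokens rather than a single integer.

```python
from typing import Callable, List

EOS = 0  # hypothetical end-of-sequence token id (an assumption for this sketch)


def generate(
    next_token: Callable[[List[int]], int],
    prompt: List[int],
    max_new_tokens: int = 16,
) -> List[int]:
    """Autoregressive decoding: feed prompt + generated tokens back in each step."""
    context = list(prompt)       # everything the model sees on this step
    output: List[int] = []       # only the newly generated tokens
    for _ in range(max_new_tokens):
        token = next_token(context)
        if token == EOS:         # the model signals that the response is complete
            break
        context.append(token)    # the new token becomes part of the next input
        output.append(token)
    return output


def toy_model(context: List[int]) -> int:
    """Stand-in for a neural network: counts down from the last token to EOS."""
    return max(context[-1] - 1, 0)


print(generate(toy_model, [5, 3]))
```

Each pass through the loop is one forward run of the model, which is why longer responses take longer and cost more to produce.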

Inference differs from training in both technical and economic terms. Training requires clusters of specialized GPUs running for days, weeks, or months; inference runs on demand and can be distributed across many servers to handle concurrent requests. Hosted inference is typically priced per token (input tokens processed plus output tokens generated), and at scale, inference often becomes the largest item in an AI budget. Inference latency, the time it takes to generate a response, matters enormously for user experience: a model that produces excellent output in 30 seconds is unsuitable for real-time conversational applications where users expect sub-2-second responses. Techniques like quantization (storing model weights at lower numerical precision to cut memory and compute requirements) and speculative decoding (a small draft model proposes several tokens that the main model then verifies in a single pass) improve inference speed and cost without retraining.
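To make per-token pricing concrete, here is a rough cost estimate. The prices, token counts, and request volume are invented for illustration and do not reflect any real provider; `TokenPricing`, `request_cost`, and `monthly_cost` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenPricing:
    """Hypothetical per-token prices, in dollars per million tokens."""
    input_per_million: float
    output_per_million: float


def request_cost(pricing: TokenPricing, input_tokens: int, output_tokens: int) -> float:
    """Cost of one inference call: input tokens plus output tokens, each at its own rate."""
    total = (
        input_tokens * pricing.input_per_million
        + output_tokens * pricing.output_per_million
    )
    return total / 1_000_000


def monthly_cost(
    pricing: TokenPricing,
    input_tokens: int,
    output_tokens: int,
    requests_per_day: int,
    days: int = 30,
) -> float:
    """Scale the per-request cost up to a month of traffic."""
    return request_cost(pricing, input_tokens, output_tokens) * requests_per_day * days


# Made-up numbers: $1 per million input tokens, $4 per million output tokens.
pricing = TokenPricing(input_per_million=1.00, output_per_million=4.00)
per_call = request_cost(pricing, input_tokens=1_500, output_tokens=500)
per_month = monthly_cost(pricing, 1_500, 500, requests_per_day=100_000)
print(f"per call: ${per_call:.4f}, per month: ${per_month:,.0f}")
```

Under these assumed numbers a single call costs a fraction of a cent, yet the same call repeated 100,000 times a day adds up to roughly ten thousand dollars a month, which is why inference spend grows with usage rather than with model development.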

For B2B teams evaluating or building AI applications, inference considerations shape technology choices. Model selection trades capability (larger models, higher inference cost) against speed and cost (smaller models, lower latency, lower cost). Latency requirements differ by use case: a real-time copilot feature requires fast inference; a batch document processing job can tolerate slower but cheaper inference; an overnight analysis run can use the largest, most capable models without user-facing latency concerns. Understanding inference as a distinct cost center (tracked separately from training, billed per token, optimized independently) is essential for managing AI application economics at any meaningful scale.
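One way to reason about the capability-versus-latency tradeoff is to pick the most capable model that still fits a latency budget. The sketch below assumes a simple latency model (time to first token plus output tokens divided by generation speed) and a made-up catalog; the model names, quality scores, and speeds are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ModelOption:
    name: str
    quality: int                  # relative capability score, higher is better (made up)
    time_to_first_token_s: float  # assumed delay before the first token appears
    tokens_per_second: float      # assumed generation speed


def estimated_latency(model: ModelOption, output_tokens: int) -> float:
    """Rough response time: startup delay plus time to generate every output token."""
    return model.time_to_first_token_s + output_tokens / model.tokens_per_second


def pick_model(
    options: List[ModelOption], output_tokens: int, latency_budget_s: float
) -> Optional[ModelOption]:
    """Return the most capable model within the latency budget, or None if none fits."""
    feasible = [m for m in options if estimated_latency(m, output_tokens) <= latency_budget_s]
    return max(feasible, key=lambda m: m.quality, default=None)


# Hypothetical catalog; none of these numbers describe a real model.
CATALOG = [
    ModelOption("small", quality=1, time_to_first_token_s=0.2, tokens_per_second=200.0),
    ModelOption("medium", quality=2, time_to_first_token_s=0.5, tokens_per_second=80.0),
    ModelOption("large", quality=3, time_to_first_token_s=1.0, tokens_per_second=30.0),
]

copilot = pick_model(CATALOG, output_tokens=150, latency_budget_s=2.0)
overnight = pick_model(CATALOG, output_tokens=150, latency_budget_s=60.0)
print(copilot.name if copilot else None, overnight.name if overnight else None)
```

With these assumed figures, the real-time copilot budget only admits the smallest model, while the overnight job can afford the largest one. A fuller version would also weigh per-token cost, which this sketch leaves out.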

inference, LLM, AI deployment, model serving, machine learning, production AI

Related terms