Optimizing LLM Inference: High-Throughput Serving with vLLM and Quantization

Deploying Large Language Models (LLMs) in production is notoriously resource-intensive. Monolithic models like Llama 3 or Mistral require immense GPU memory, leading to bottlenecked response rates and soaring cloud computing costs. For organizations looking to serve AI capabilities at scale, optimization is key.

Two major techniques stand out to unlock high-throughput LLM deployment: **PagedAttention** (via the open-source library vLLM) and **Quantization** (such as AWQ, GPTQ, and GGUF). Combined, these systems allow you to serve models up to 30x faster while slicing hardware requirements in half.

1. The Memory Bottleneck: KV Cache Inefficiency

During LLM generation, the model processes tokens sequentially. To predict the next word, it needs to evaluate all preceding tokens in the prompt and response history. To avoid redundant matrix calculations, the system stores key-value (KV) activations of past tokens in a memory cache (the KV Cache).

In standard transformer libraries (like Hugging Face Transformers), this KV cache is allocated statically. The system reserves a large, contiguous memory block for the maximum generation length (e.g., 4096 tokens). Because actual inputs vary in size, this approach results in three types of memory waste:

Internal Fragmentation: Reserving memory for tokens that are never actually generated.
External Fragmentation: Scattered free slots that cannot be consolidated due to non-contiguous layout constraints.
Oversubscription: Reserving memory for future tokens that the system cannot utilize during concurrent requests.

"By treating GPU memory similarly to operating system virtual memory page tables, PagedAttention splits the KV Cache into small blocks, eliminating up to 96% of memory waste."

2. Implementing PagedAttention with vLLM

The vLLM library implements PagedAttention natively. Rather than storing key-value tensors in contiguous GPU RAM blocks, vLLM divides them into fixed-size blocks (pages). A lookup table maps logical token blocks to physical GPU memory pages, allowing vLLM to dynamically allocate memory, stream inputs, and manage batches concurrently.

Let's look at how to initialize and serve an LLM endpoint locally with vLLM in Python:

# Import vLLM components
from vllm import LLM, SamplingParams

# Initialize the model (automatically utilizes PagedAttention)
model_name = "mistralai/Mistral-7B-Instruct-v0.2"
llm = LLM(model=model_name, trust_remote_code=True, tensor_parallel_size=1)

# Define generation parameters
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=256
)

# Batch inputs for concurrent processing
prompts = [
    "Explain quantum computing in three sentences.",
    "Write a Python function to check for prime numbers.",
    "What are the main benefits of using Docker containers?"
]

# Run high-throughput parallel inference
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(f"Prompt: {output.prompt}")
    print(f"Generated text: {output.outputs[0].text}\n")

3. Model Compression: The Role of Quantization

While PagedAttention optimizes memory footprint, **Quantization** reduces model file sizes directly. Deep learning weights are typically stored as 16-bit floating-point numbers (FP16). Quantization maps these weights to lower precision formats, such as 8-bit integers (INT8) or 4-bit integers (INT4), with minimal loss of accuracy.

The three dominant quantization frameworks in 2026 are:

AWQ (Activation-aware Weight Quantization): Keeps the most critical 1% of weights (salient weights) at FP16 while mapping the remaining 99% to 4-bit representation. AWQ is highly optimized for GPU-bound real-time serving.
GPTQ (Generalized Post-Training Quantization): Compresses weights by solving second-order optimization equations, yielding high performance for bulk batch workloads.
GGUF (GPT-Generated Unified Format): The format of choice for CPU and local-Mac (Metal API) inference, permitting models to run natively on consumer desktop GPUs and laptops.

Serving quantized AWQ models with vLLM

Using vLLM, you can serve quantized models directly, yielding massive speedups and reducing GPU memory requirements by up to 70%:

# Launch a vLLM server with a 4-bit AWQ quantized model
python -m vllm.entrypoints.openai.api_server \
    --model TheBloke/Mistral-7B-Instruct-v0.2-AWQ \
    --quantization awq \
    --port 8000

This command starts an OpenAI-compatible web server on port 8000, allowing you to route client queries from your applications directly into a highly compressed, optimized model sandbox.