Deploying vLLM on Google Cloud: A Guide to Scalable Open LLM Inference

Gianni Crivello
Dec 2, 2024

Large Language Models (LLMs) have become central to many modern applications, but deploying them efficiently at scale presents unique challenges. In this guide, we’ll explore how to deploy a production-ready LLM inference service on Google Cloud Platform (GCP) using vLLM, an open-source library that dramatically improves inference performance through memory management techniques inspired by OS virtual memory.

Understanding the Memory Challenge in LLM Serving

KV Cache is bloat…

Before diving into the deployment, it’s worth a brief overview of the problem vLLM solves, which starts with understanding why memory management is crucial for efficient LLM serving. According to the paper Efficient Memory Management for Large Language Model Serving with PagedAttention (Kwon et al., 2023), close to 30% of the GPU memory used to serve an LLM goes to storing the model’s dynamic state (i.e., the KV cache). Because deep learning frameworks (e.g., PyTorch) require tensors to live in contiguous memory, and because the KV cache grows and shrinks dynamically, existing serving systems suffer from severe memory fragmentation: only ~20–40% of the KV cache memory actually stores token states, which means the effective memory utilization of those systems can be as low as ~20%.

Figure 3 from the PagedAttention paper: KV cache memory management in existing systems. Three types of memory waste (reserved slots, internal fragmentation, and external fragmentation) prevent other requests from fitting into memory. The token in each memory slot represents its KV cache; note that the same token can have a different KV cache at different positions.

There are three types of memory waste:

  1. Internal Fragmentation: memory over-provisioned for a request’s maximum possible sequence length
  2. Reservation: slots reserved for future tokens that aren’t in use yet
  3. External Fragmentation: gaps left by the memory allocator because requests reserve differently sized chunks

On top of the fragmentation problem, the sequential, token-by-token generation during inference is memory-bound. The bottleneck in serving requests does not come from the highly optimized matrix-matrix or matrix-vector multiplications (compute) in the model’s forward pass; it comes from memory access.

PagedAttention: A Virtual Memory System for LLMs

PagedAttention is all you need

vLLM introduces PagedAttention, an algorithm inspired by virtual memory and paging techniques in operating systems. The key insight, as described in the paper, is dividing the KV cache into fixed-size blocks rather than storing it as one contiguous region. Each block contains the key and value vectors for a fixed number of tokens.

vLLM distributes these blocks across non-contiguous memory, resulting in near-zero waste of GPU memory. This significantly increases the overall throughput of serving an LLM.

This approach brings several benefits:

  • Near-zero waste in KV cache memory (compared to 60–80% waste in traditional systems)
  • Flexible sharing of the KV cache within and across requests, which enables complex decoding schemes like parallel sampling and beam search
  • Support for variable sequence lengths without pre-allocation
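
To get a feel for how these memory controls surface in practice, here is a minimal local launch of vLLM’s OpenAI-compatible server. The flags are standard vLLM engine arguments, but the values are illustrative assumptions rather than tuned recommendations (older vLLM releases use python -m vllm.entrypoints.openai.api_server instead of the vllm serve CLI):

# A minimal sketch: serve Llama 3.2 1B locally with explicit KV cache settings.
# --gpu-memory-utilization : fraction of GPU memory vLLM may claim (weights + KV cache)
# --block-size             : tokens per PagedAttention block
# --max-model-len          : cap on sequence length, which bounds per-request KV cache
vllm serve meta-llama/Llama-3.2-1B-Instruct \
    --gpu-memory-utilization 0.90 \
    --block-size 16 \
    --max-model-len 4096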

Deployment Guide

Let’s deploy some services!

This deployment strategy follows the one described in Open Sauced’s article, How we Saved Thousands of Dollars Deploying Low Cost Open Source AI Technologies. I’d highly recommend reading it for an in-depth overview of this deployment and the problem it solved for their use case.

A simple LLM serving deployment using vLLM as the model serving engine.

Now that we’re up to speed on vLLM’s benefits, let’s deploy a vLLM service! We’re going to deploy vLLM as a DaemonSet into a GKE (Google Kubernetes Engine) cluster. Let’s start by listing the steps of our deployment.

  1. Provision a GKE cluster with a GPU node pool (in this example, we’ll use NVIDIA L4)
  2. Taint the GPU nodes to avoid CPU workloads being scheduled on GPU nodes
  3. Create a DaemonSet that will run vLLM with a model (in this example, we’ll use Llama 3.2)
  4. Give the DaemonSet a toleration and a GPU resource request so it gets scheduled on the GPU nodes
  5. Create a Kubernetes Service that load-balances traffic across our vLLM pod(s)
  6. Optionally: create an Ingress resource and configure a load balancer to accept external traffic
  7. Configure kubectl (see the command after this list)
  8. Apply the DaemonSet and Service Kubernetes resources to your cluster
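
For step 7, pointing kubectl at the new cluster is a one-liner. A sketch using the variables from config.sh:

# Fetch cluster credentials and set the current kubectl context
gcloud container clusters get-credentials ${CLUSTER_NAME} \
    --zone ${ZONE} \
    --project ${PROJECT_ID}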

This article comes with scripts that will help you get up to speed quickly. For a deeper dive into the following code snippets, check out the full repo here.

Hugging Face Secret

Safety first, kids

Before continuing, you’ll want to grab your Hugging Face token and store it in a secrets file. The setup script will automatically configure this secret inside your Kubernetes cluster. I store my secret in .env.secrets, and the script expects this file to be at the same level as setup.sh. If you change the name of your secrets file, make sure to update .gitignore.

create_secrets() {
    log "Creating secrets..."
    kubectl create secret generic hf-token \
        --from-env-file=.env.secrets \
        --dry-run=client -o yaml | kubectl apply -f -
}
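
The DaemonSet can then consume this secret as an environment variable so vLLM can pull gated models from the Hub. A sketch of the relevant container fragment, assuming the key inside .env.secrets is named HUGGING_FACE_HUB_TOKEN (adjust it to match your file):

# Container spec fragment: expose the hf-token secret to vLLM
env:
  - name: HUGGING_FACE_HUB_TOKEN
    valueFrom:
      secretKeyRef:
        name: hf-token                # the secret created by create_secrets()
        key: HUGGING_FACE_HUB_TOKEN   # key name must match the entry in .env.secrets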

Setting Up Your Environment

This is MY environment!

First, examine our configuration file config.sh. It defines important parameters for our deployment:

export PROJECT_ID="<your_project_id>"
export REGION="<your_region>"
export ZONE="<your_zone>"
export CLUSTER_NAME="<your_cluster_name>"
export GPU_MACHINE_TYPE="<your_machine_type>"

The configuration uses NVIDIA L4 GPUs, which offer an excellent balance of performance and cost for LLM inference. The g2-standard-16 machine type provides adequate CPU and memory resources to support the GPU operations for this model.
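
For reference, a filled-in config.sh might look like the following. The project ID, region, zone, and cluster name are placeholders; the machine and accelerator types match the L4 / g2-standard-16 setup described above:

# Example values only; substitute your own project and location
export PROJECT_ID="my-llm-project"
export REGION="us-central1"
export ZONE="us-central1-a"
export CLUSTER_NAME="vllm-cluster"
export GPU_MACHINE_TYPE="g2-standard-16"
export GPU_ACCELERATOR_TYPE="nvidia-l4"   # referenced later by the node pool command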

Enable APIs and Provision GKE

You’re an enabler!

Our setup.sh script automates the cluster creation process. Let’s examine its key components:

The script first ensures all necessary Google Cloud APIs are enabled:

apis=(
    "container.googleapis.com"
    "containerregistry.googleapis.com"
    "cloudbuild.googleapis.com"
    "secretmanager.googleapis.com"
)
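
The enable step itself is a single gcloud call over that array, roughly:

# Enable every required API for the configured project
gcloud services enable "${apis[@]}" --project "${PROJECT_ID}"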

The cluster creation command is particularly important:

gcloud container clusters create ${CLUSTER_NAME} \
    --zone ${ZONE} \
    --num-nodes ${CPU_NUM_NODES} \
    --enable-autoscaling \
    --machine-type ${CPU_MACHINE_TYPE} \
    --enable-vertical-pod-autoscaling

This creates a cluster with autoscaling capabilities, which is crucial for handling variable workloads efficiently.

GPU Node Pool Configuration

Give me those sweet sweet GPUs

The GPU node pool is configured specifically for our inference workload:

gcloud container node-pools create ${GPU_NODE_POOL_NAME} \
    --accelerator type=${GPU_ACCELERATOR_TYPE},count=${GPU_ACCELERATOR_COUNT} \
    --node-taints="nvidia.com/gpu=present:NoSchedule"

The node taint ensures that only pods specifically tolerating GPU workloads will be scheduled on these expensive nodes.
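
On the workload side, the matching toleration is what lets the vLLM pods land on these nodes. A sketch of the pod spec fragment that pairs with the taint above:

# Pod spec fragment: tolerate the GPU taint applied to the node pool
tolerations:
  - key: "nvidia.com/gpu"
    operator: "Equal"
    value: "present"
    effect: "NoSchedule"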

Creating K8s Resources

Getting to the good stuff

Our vLLM deployment uses a DaemonSet configuration vllm-daemon.yaml:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: vllm-daemonset
spec:
  template:
    spec:
      containers:
        - args:
            - --model
            - meta-llama/Llama-3.2-1B-Instruct
            - --tensor-parallel-size
            - "1"

This configuration:

  • Deploys one vLLM instance per GPU node
  • Uses the Llama 3.2 1B Instruct model
  • Configures tensor parallelism for efficient model loading
  • Includes proper GPU resource requests and limits (sketched just below)
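
The GPU request/limit portion of the container spec looks roughly like the fragment below. The image tag is an assumption for illustration; check the repo for the exact image it pins:

# Container spec fragment: request exactly one GPU per pod
image: vllm/vllm-openai:latest   # assumed image; the repo pins its own
resources:
  requests:
    nvidia.com/gpu: 1
  limits:
    nvidia.com/gpu: 1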

The accompanying service vllm-service.yaml creates a stable endpoint:

apiVersion: v1
kind: Service
metadata:
  name: vllm-service
spec:
  selector:
    app: vllm
  ports:
    - port: 8000

Deploying the Solution

Let there be light!

To deploy the complete solution, run (source):

./setup.sh --all

This will:

  1. Configure networking and security
  2. Create the GKE cluster
  3. Set up the GPU node pool
  4. Configure kubectl
  5. Deploy vLLM and its Service

Testing and Verification

It’s alive….it’s alive!

After deployment, you can test the service using port forwarding:

kubectl port-forward svc/vllm-service 8000

Then use curl:

curl -X POST "http://localhost:8000/v1/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.2-1B-Instruct",
    "prompt": "Tell me a story",
    "max_tokens": 100
  }'
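
Because vLLM exposes an OpenAI-compatible API, the chat endpoint works as well. For example:

curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.2-1B-Instruct",
    "messages": [{"role": "user", "content": "Tell me a story"}],
    "max_tokens": 100
  }'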

Production Considerations

Hold your horses partner

When deploying vLLM in production, several key technical and infrastructure considerations need to be addressed:

Memory Management Configuration

For an in-depth overview of these topics and how to configure them in vLLM, please visit the vLLM Docs.

1. Block Size Selection (an args sketch follows this list):

  • The researchers found that a block size of 16 tokens provides optimal performance for most workloads
  • For ShareGPT-like workloads with longer sequences, block sizes between 16–128 showed the best performance
  • For Alpaca-like workloads with shorter sequences, block sizes of 16–32 are optimal
  • Larger block sizes can significantly degrade performance when sequences are shorter than the block size
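
In our deployment, the block size maps to an extra pair of container args in the DaemonSet. A sketch (16 is simply the paper’s default, not a tuned value):

# DaemonSet args fragment: set the PagedAttention block size explicitly
- args:
    - --model
    - meta-llama/Llama-3.2-1B-Instruct
    - --block-size
    - "16"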

2. Recovery Strategy: When GPU memory is exhausted, vLLM supports two approaches (an args sketch follows these lists):

Swapping:

  • Copies evicted blocks to CPU memory
  • More efficient for larger block sizes (64+ tokens)
  • Limited by PCIe bandwidth for small block sizes due to numerous small transfers
  • Total swap space on CPU RAM is bounded by GPU memory allocated for KV cache

Re-computation:

  • Regenerates KV cache by rerunning the model on the prompt
  • More efficient for smaller block sizes
  • Maintains consistent overhead regardless of block size
  • Never exceeds 20% of swapping latency in benchmarks
  • For block sizes 16–64, both methods show comparable end-to-end performance
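
Both strategies are configurable at launch: --swap-space sizes the CPU swap area in GiB per GPU, and recent vLLM releases expose a preemption-mode setting to force recomputation instead of swapping. Treat the exact flag names as assumptions and confirm them against the docs for your vLLM version. A sketch as DaemonSet args:

# DaemonSet args fragment: size the CPU swap area and choose the preemption strategy
- args:
    - --swap-space
    - "4"              # GiB of CPU RAM per GPU reserved for swapped-out KV blocks
    - --preemption-mode
    - recompute        # or "swap"; flag assumed, verify against your vLLM version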

Request Handling and Scheduling

  1. Scheduling Policy:
  • First-come-first-serve (FCFS) with preemption ensures fairness
  • All-or-nothing preemption policy: either evict all or none of a sequence’s blocks
  • Gang scheduling for sequence groups (e.g., beam search candidates)
  • Stop accepting new requests when preemption occurs until preempted sequences complete

2. Batch Processing (an args sketch follows this list):

  • Dynamic batching based on available GPU memory
  • Sequences within one request (e.g., beam candidates) are always scheduled together
  • Support for mixed decoding methods in the same batch
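
Batching behavior can be bounded explicitly with --max-num-seqs and --max-num-batched-tokens; the values below are placeholders to tune against your GPU and traffic:

# DaemonSet args fragment: cap concurrent sequences and tokens per scheduled batch
- args:
    - --max-num-seqs
    - "64"
    - --max-num-batched-tokens
    - "8192"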

Kubernetes Infrastructure Setup

If you’re planning a production deployment, you’ll also want to consider some additional K8s resources and configuration.

  1. Ingress Configuration:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: vllm-ingress
  annotations:
    kubernetes.io/ingress.class: "gce"
    kubernetes.io/ingress.global-static-ip-name: "vllm-ip"
    networking.gke.io/v1beta1.FrontendConfig: "vllm-frontend-config"
spec:
  rules:
    - http:
        paths:
          - path: /*
            pathType: ImplementationSpecific
            backend:
              service:
                name: vllm-service
                port:
                  number: 8000

2. Load Balancer Setup:

Note: if you’re using GKE on GCP, you don’t need to worry about setting up a separate LoadBalancer for your ingress. GKE takes care of this for you.

apiVersion: v1
kind: Service
metadata:
  name: vllm-service
  annotations:
    cloud.google.com/neg: '{"ingress": true}'
    cloud.google.com/backend-config: '{"default": "vllm-backend-config"}'
spec:
  type: LoadBalancer
  ports:
    - port: 8000
      targetPort: 8000
      protocol: TCP
  selector:
    app: vllm
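
The cloud.google.com/backend-config annotation above expects a matching BackendConfig object. A minimal sketch, assuming vLLM’s /health endpoint for health checks and a longer timeout to accommodate slow generations:

apiVersion: cloud.google.com/v1
kind: BackendConfig
metadata:
  name: vllm-backend-config
spec:
  timeoutSec: 300          # long generations can exceed the 30s default
  healthCheck:
    type: HTTP
    requestPath: /health   # vLLM's OpenAI-compatible server exposes /health
    port: 8000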

3. Resource Management (a PodDisruptionBudget sketch follows this list):

  • Set appropriate resource requests and limits for GPU nodes
  • Configure node auto-scaling based on GPU utilization
  • Implement pod disruption budgets for high availability
  • Set up monitoring for GPU memory usage and KV cache utilization
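
For the pod disruption budget item above, a minimal sketch that keeps at least one vLLM pod available during voluntary disruptions (note that DaemonSet pods interact with node drains differently than Deployment pods, so test this against your upgrade strategy):

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: vllm-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: vllm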

4. Security Considerations (a NetworkPolicy sketch follows this list):

  • Configure network policies to restrict access to the vLLM service
  • Set up workload identity for accessing GCP resources
  • Implement rate limiting at the ingress level
  • Use secrets management for model credentials and API keys
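
And for restricting access, a sketch of a NetworkPolicy that only admits traffic to the vLLM pods on port 8000 from pods carrying an assumed gateway label (requires network policy enforcement, e.g. GKE Dataplane V2, to be enabled on the cluster):

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: vllm-allow-gateway
spec:
  podSelector:
    matchLabels:
      app: vllm
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              role: api-gateway   # assumed label for whatever fronts the service
      ports:
        - protocol: TCP
          port: 8000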

Conclusion

You made it! Thank you ❤

If you’re still reading at this point, thank you! I hope you found this article and the accompanying code helpful. By leveraging vLLM, we can serve LLMs behind an OpenAI-compatible API at scale with state-of-the-art performance. If you liked this article, please like and share :) Happy Hacking!

Written by Gianni Crivello

Solutions Engineer, AI/ML @ Techolution { designing AI/ML systems for enterprise customers } https://github.com/gkteco
