Best Practices for Local LLM Deployment Using vLLM on AWS

VerticalServe Blogs · Jan 9, 2025

Introduction

Deploying local large language models (LLMs) such as Llama with vLLM can significantly improve inference efficiency and reduce costs. This blog explores best practices for deploying LLMs on AWS, including an overview of vLLM’s origin, licensing, and capabilities. We will cover setting up an inference layer on EC2 or EKS (Elastic Kubernetes Service), recommend cost-effective instances, and provide Python examples that perform streaming inference using LangChain or the API directly.

What is vLLM?

vLLM is an optimized serving engine for transformer-based language models, designed to achieve high throughput and efficient memory usage.

Origin and Development

  • Created by: vLLM was developed by researchers from UC Berkeley and contributors in the open-source AI community.
  • Purpose: vLLM’s primary goal is to overcome inefficiencies in LLM serving through memory-efficient KV-cache management and continuous batching of incoming requests, combined with token streaming.

Licensing and Availability

  • Open Source: vLLM is open source.
  • License: vLLM is released under the Apache 2.0 License, making it permissive for both commercial and non-commercial use.

Key Features

  • Optimized Batching: Supports continuous (dynamic) batching of incoming requests for higher throughput (a short batching sketch follows this list).
  • Token Streaming: Enables streaming token outputs for faster perceived response times.
  • Memory Efficiency: Uses PagedAttention to reduce KV-cache memory overhead compared to traditional inference engines.
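
As a quick illustration of the batching behavior, vLLM’s offline LLM API accepts a list of prompts and schedules them together. A minimal sketch (the model path is a placeholder, matching the examples later in this post):

from vllm import LLM, SamplingParams

# Placeholder path; point this at your local Llama weights
llm = LLM(model="path/to/llama-model")

prompts = [
    "Explain the theory of relativity.",
    "Summarize the causes of the Industrial Revolution.",
]
sampling_params = SamplingParams(temperature=0.7, max_tokens=128)

# vLLM batches these requests internally and returns one RequestOutput per prompt
for output in llm.generate(prompts, sampling_params):
    print(output.outputs[0].text)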

Architecture Overview for Local LLM Deployment

When deploying LLMs on AWS, the architecture typically includes:

  1. Compute Layer: AWS EC2 or EKS instances for hosting the LLM service.
  2. Storage Layer: EBS (Elastic Block Store) volumes for persistent model storage, with model weights optionally pulled from Amazon S3.
  3. Networking: Load balancers and autoscaling configurations for high availability.
  4. Inference Layer: Python APIs or LangChain framework for querying the LLM.

Best Practices for LLM Deployment Using vLLM

1. Choosing Between EC2 and EKS

  • EC2 Instances: Suitable for simple deployments and when minimal orchestration is needed.
  • EKS (Elastic Kubernetes Service): Ideal for large-scale deployments, containerized workflows, and multi-replica deployments.

2. Cost-Effective Instance Recommendations

For GPU inference, G5 instances (NVIDIA A10G GPUs) such as g5.xlarge or g5.2xlarge offer a strong price/performance balance for Llama-class models; for CPU-only inference, Graviton-based instances such as c7g.4xlarge are a lower-cost option. Spot Instances can reduce cost further for fault-tolerant workloads.

Deploying Llama with vLLM on AWS EC2

Step 1: Launch EC2 Instance

  1. Choose a g5.xlarge instance (for GPU-based inference) or c7g.4xlarge (for CPU-based inference).
  2. Install dependencies:
sudo apt update && sudo apt install -y python3-pip
pip3 install vllm  # installs a compatible PyTorch build as a dependency
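
Step 2 below loads the model from a local path. If the weights are not already on the instance, one common option is to download them from the Hugging Face Hub with the huggingface_hub package (installed alongside vLLM). A minimal sketch; the repository id is illustrative, and gated Llama models require an accepted license and an access token:

from huggingface_hub import snapshot_download

# Illustrative repository id; gated Llama repos need an accepted license and an HF token
local_path = snapshot_download(
    repo_id="meta-llama/Llama-3.1-8B-Instruct",
    local_dir="/opt/models/llama",
    token="hf_...",  # or set the HF_TOKEN environment variable instead
)
print(local_path)  # pass this path as model=... to vLLM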

Step 2: Load Llama Model Using vLLM

from vllm import LLM, SamplingParams

# Load the Llama model from local weights (e.g., a path synced from S3)
llm = LLM(model="path/to/llama-model", tensor_parallel_size=1)

# Example prompt and sampling configuration
prompt = "Explain the theory of relativity."
sampling_params = SamplingParams(temperature=0.7, max_tokens=200)

# generate() returns one RequestOutput per prompt once generation is complete
for output in llm.generate([prompt], sampling_params):
    print(output.outputs[0].text)
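
Note that the offline LLM.generate API above returns completed outputs rather than a token stream. For token-level streaming, the usual pattern is to serve the model through vLLM’s OpenAI-compatible server (for example, python -m vllm.entrypoints.openai.api_server --model path/to/llama-model --port 8000) and consume the stream with an OpenAI-style client. A minimal sketch, assuming that server is running locally:

from openai import OpenAI

# vLLM exposes an OpenAI-compatible API; the key is unused but required by the client
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

stream = client.completions.create(
    model="path/to/llama-model",  # must match the --model value passed to the server
    prompt="Explain the theory of relativity.",
    max_tokens=200,
    temperature=0.7,
    stream=True,  # stream tokens as they are generated
)

for chunk in stream:
    print(chunk.choices[0].text, end="", flush=True)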

Deploying Llama with vLLM on AWS EKS

Step 1: Set Up EKS Cluster

Create a Kubernetes cluster using eksctl:

eksctl create cluster --name llm-cluster --region us-east-1 --nodes 3 --node-type g5.2xlarge

Step 2: Create Deployment and Service

Create deployment.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-deployment
spec:
  replicas: 2
  selector:
    matchLabels:
      app: llm-app
  template:
    metadata:
      labels:
        app: llm-app
    spec:
      containers:
      - name: llm-container
        image: my-llm-image:v1
        ports:
        - containerPort: 8080
        env:
        - name: VLLM_ENABLE_STREAMING
          value: "true"
        resources:
          limits:
            nvidia.com/gpu: 1  # schedule onto a GPU node (requires the NVIDIA device plugin)

Apply the deployment:

kubectl apply -f deployment.yaml

Expose the service via a load balancer:

kubectl expose deployment llm-deployment --type=LoadBalancer --name=llm-service --port=8080 --target-port=8080
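
Once the load balancer is provisioned, the service can be queried over HTTP. A minimal sketch, assuming the my-llm-image container exposes vLLM’s OpenAI-compatible API on port 8080 (replace the placeholder hostname with the EXTERNAL-IP shown by kubectl get service llm-service):

import requests

# Replace with the load balancer hostname from `kubectl get service llm-service`
ENDPOINT = "http://<load-balancer-hostname>:8080/v1/completions"

payload = {
    "model": "path/to/llama-model",  # must match the model served inside the container
    "prompt": "Explain the theory of relativity.",
    "max_tokens": 200,
    "temperature": 0.7,
}

response = requests.post(ENDPOINT, json=payload, timeout=120)
response.raise_for_status()
print(response.json()["choices"][0]["text"])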

Building an Inference Layer Using LangChain

LangChain provides a seamless way to build inference pipelines around LLMs with support for streaming outputs.

from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate
from langchain.llms import OpenAI
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler

# Define the prompt template
prompt = PromptTemplate(
    template="Summarize the following text: {text}",
    input_variables=["text"],
)

# Point the OpenAI-compatible client at the local vLLM endpoint (the key is unused but required)
llm = OpenAI(
    model_name="path/to/llama-model",  # must match the model name served by vLLM
    temperature=0.7,
    openai_api_base="http://localhost:8080/v1",
    openai_api_key="EMPTY",
    streaming=True,
    callbacks=[StreamingStdOutCallbackHandler()],  # prints tokens to stdout as they arrive
)

chain = LLMChain(llm=llm, prompt=prompt)

# The callback streams tokens while the chain runs; run() also returns the full completion
response = chain.run("The Industrial Revolution was a period of major industrialization...")

Conclusion

By following these best practices, you can deploy Llama models effectively using vLLM on AWS infrastructure. Whether you choose EC2 for simplicity or EKS for scalable, containerized solutions, using the right configurations and tools like LangChain or custom APIs ensures robust, high-performance inference. Enable streaming for a responsive user experience and consider using Spot Instances for cost optimization.
