Advanced RAG Using AWS OpenSearch
Retrieval-Augmented Generation (RAG) has emerged as a powerful architecture for turning unstructured data into accurate, contextually aware responses. While dedicated vector databases like Pinecone and Chroma are common choices for RAG, AWS OpenSearch offers a compelling alternative by combining traditional keyword (syntactic) search and semantic vector search in a single engine. This blog will walk you through the benefits of using AWS OpenSearch for RAG, the process of ingesting documents such as PDFs, DOCX, and text files, and how to build a retriever using LangChain.
Why Choose AWS OpenSearch for RAG Over Vector Databases?
1. Dual Search Capabilities
OpenSearch allows you to leverage both semantic search (via dense embeddings and k-NN) and traditional syntactic search (via its Lucene-based full-text capabilities, inherited from Elasticsearch). This duality provides:
- Better Query Understanding: Semantic search handles natural language queries, capturing intent and meaning.
- Precision Matching: Syntactic search ensures keyword-based retrieval for cases where exact matches matter, as the sketch below illustrates.
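To make this concrete, here is a minimal sketch of a hybrid query in the OpenSearch query DSL, scoring documents with both a full-text match clause and a k-NN vector clause. It assumes the content and embedding fields defined later in this post, a precomputed query_vector, and OpenSearch 2.x with a hybrid search pipeline configured on the domain (an example pipeline appears near the end of this post):
# Hybrid query: full-text "match" plus k-NN vector search over the same index.
# `query_vector` is assumed to be the embedding of the user's question.
hybrid_query = {
    "query": {
        "hybrid": {
            "queries": [
                {"match": {"content": {"query": "climate change impacts"}}},
                {"knn": {"embedding": {"vector": query_vector, "k": 5}}}
            ]
        }
    }
}
# Executed with the opensearch-py client introduced below:
# client.search(index="documents", body=hybrid_query, params={"search_pipeline": "hybrid-pipeline"})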
2. Integrated Ecosystem
Unlike standalone vector databases, OpenSearch integrates with the broader AWS ecosystem, enabling seamless connections to services like S3, Lambda, and SageMaker.
3. Cost-Effectiveness
Consolidating keyword and vector retrieval in a single managed OpenSearch domain eliminates the need for separate vector database infrastructure, reducing operational complexity and cost.
Document Ingestion Process for OpenSearch
AWS OpenSearch can ingest documents from various sources, including PDFs, DOCX, and plain text files. Here’s how you can achieve this:
Step 1: Preprocessing Documents
Documents need to be loaded, split into chunks, converted into embeddings (for semantic search), and indexed.
from langchain.document_loaders import PyPDFLoader, Docx2txtLoader, TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
# Load documents (note: LangChain's Word loader is Docx2txtLoader, not DocxLoader)
pdf_loader = PyPDFLoader("example.pdf")
docx_loader = Docx2txtLoader("example.docx")
text_loader = TextLoader("example.txt")
documents = pdf_loader.load() + docx_loader.load() + text_loader.load()
# Split text into overlapping chunks so context survives chunk boundaries
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
split_docs = text_splitter.split_documents(documents)
# Generate embeddings (all-MiniLM-L6-v2 produces 384-dimensional vectors)
embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vectors = embedding_model.embed_documents([doc.page_content for doc in split_docs])
embedded_docs = [
    {"id": i, "embedding": vec, "content": doc.page_content}
    for i, (doc, vec) in enumerate(zip(split_docs, vectors))
]
Step 2: Indexing in OpenSearch
The embeddings and text chunks are indexed in OpenSearch for hybrid search capabilities.
from opensearchpy import OpenSearch
# Initialize OpenSearch client (for an AWS-managed domain, point this at the
# domain endpoint over HTTPS and use your own credentials or IAM auth)
client = OpenSearch(
    hosts=[{"host": "localhost", "port": 9200}],
    http_auth=("admin", "admin")  # Replace with actual credentials
)
# Define index mapping for semantic and syntactic fields.
# OpenSearch uses the knn_vector type (not Elasticsearch's dense_vector),
# and k-NN must be enabled in the index settings.
index_name = "documents"
mapping = {
    "settings": {"index": {"knn": True}},
    "mappings": {
        "properties": {
            "content": {"type": "text"},
            "embedding": {"type": "knn_vector", "dimension": 384}
        }
    }
}
# Create index
client.indices.create(index=index_name, body=mapping)
# Index documents
for doc in embedded_docs:
    client.index(
        index=index_name,
        id=doc["id"],
        body={"content": doc["content"], "embedding": doc["embedding"]}
    )
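For anything beyond a handful of chunks, one request per document is slow. As a minimal sketch, the bulk helper that ships with opensearch-py can index all chunks in a single round trip (same client, index, and embedded_docs as above):
from opensearchpy import helpers
# Build one bulk action per chunk and send them together
actions = [
    {
        "_index": index_name,
        "_id": doc["id"],
        "_source": {"content": doc["content"], "embedding": doc["embedding"]}
    }
    for doc in embedded_docs
]
helpers.bulk(client, actions)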
Building the RAG Retriever with LangChain
Hybrid Search Retrieval
LangChain integrates with OpenSearch through its OpenSearchVectorSearch vector store, which plugs directly into its retrieval chains.
from langchain.vectorstores import OpenSearchVectorSearch
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI
# Connect to OpenSearch (the vector store class is OpenSearchVectorSearch,
# and it takes a URL and an embedding function rather than a raw client)
vectorstore = OpenSearchVectorSearch(
    opensearch_url="http://localhost:9200",
    index_name=index_name,
    embedding_function=embedding_model,
    http_auth=("admin", "admin")
)
# Create a retriever; k controls how many chunks reach the LLM. Hybrid
# (semantic + syntactic) ranking is applied on the OpenSearch side via a
# search pipeline, as shown later in this post.
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
# Build RAG chain (RetrievalQA is constructed via from_chain_type)
llm = OpenAI(model_name="gpt-3.5-turbo-instruct")  # text-davinci-003 has been retired
qa_chain = RetrievalQA.from_chain_type(llm=llm, retriever=retriever)
# Example query
query = "Explain the key points in the document about climate change."
response = qa_chain.run(query)
print(response)
Advantages of OpenSearch in RAG
- Granular Relevance Tuning: By adjusting the weights given to semantic vs. syntactic scores, you can fine-tune relevance for different applications (see the pipeline sketch after this list).
- Scalability: OpenSearch supports massive datasets and complex queries, making it ideal for enterprise-scale RAG solutions.
- Compliance and Security: With AWS-native features, you gain robust security controls and compliance certifications.
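To illustrate that tuning knob, here is a minimal sketch of an OpenSearch 2.x search pipeline that min-max normalizes the keyword and vector scores and combines them with explicit weights; the 0.3/0.7 split and the pipeline name hybrid-pipeline are illustrative assumptions, not recommendations:
# Weights follow the order of clauses in the hybrid query:
# 0.3 for the syntactic match clause, 0.7 for the semantic knn clause.
pipeline = {
    "phase_results_processors": [
        {
            "normalization-processor": {
                "normalization": {"technique": "min_max"},
                "combination": {
                    "technique": "arithmetic_mean",
                    "parameters": {"weights": [0.3, 0.7]}
                }
            }
        }
    ]
}
client.transport.perform_request("PUT", "/_search/pipeline/hybrid-pipeline", body=pipeline)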
Conclusion
AWS OpenSearch is a powerful tool for building advanced RAG solutions, offering the best of both semantic and syntactic search. By integrating document ingestion, embedding indexing, and hybrid retrieval into one cohesive workflow, businesses can create robust systems that deliver precise and meaningful results.
Using LangChain, you can accelerate this process and implement an OpenSearch-backed RAG retriever with little glue code. This hybrid approach often outperforms purely vector-based retrieval on queries that mix natural-language intent with exact terms such as names, codes, or identifiers, making it a strategic choice for enterprises aiming for precision, scalability, and efficiency.