GenAI — Managing Context History Best Practices

VerticalServe Blogs
4 min read · Aug 16, 2024


When interacting with Large Language Models (LLMs), maintaining context across chat history is crucial for generating accurate responses. However, as the conversation history grows, token limits, response quality, and cost efficiency become key concerns. In this blog post, we’ll dive into strategies to manage chat history effectively, with Python code examples to guide you. We’ll explore summarization, token checking, thresholding, and frameworks like LangChain and LlamaIndex.

Why Chat History Management is Important

  1. Context Preservation: The LLM relies on prior messages to generate relevant answers. Losing context can lead to incorrect or vague responses.
  2. Token Management: LLMs have token limits, and maintaining too much history can exceed these limits, leading to truncation of important context.
  3. Cost Efficiency: Many LLM APIs, like OpenAI’s GPT, charge based on token usage. Minimizing unnecessary tokens can save costs.

Strategies for Managing Chat History

1. Summarization

Summarization can be used to condense chat history while retaining the essential information. Summaries take fewer tokens, making it easier to manage long conversations.

Example with OpenAI's Chat Completions API (the legacy Completions endpoint and the text-davinci-003 model have since been retired):

from openai import OpenAI

client = OpenAI()

def summarize_chat_history(chat_history):
    # Ask the model to compress the conversation into a short summary
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "user", "content": f"Summarize the following conversation:\n{chat_history}"}
        ],
        max_tokens=150,
    )
    return response.choices[0].message.content.strip()

# Sample usage
chat_history = "User: Hello! Can you help me with Python?\nAI: Sure! What do you need help with?\nUser: I need help with managing chat history..."
summary = summarize_chat_history(chat_history)
print("Summary:", summary)

2. Token Checking and Thresholding

Before sending the chat history to the LLM, you can check the number of tokens. If it exceeds a certain threshold, you can apply techniques like summarization, truncation, or selective context inclusion.

Example with the tiktoken Library:

import tiktoken

def count_tokens(text, model_name="gpt-3.5-turbo"):
    # Use the tokenizer that matches the target model
    encoding = tiktoken.encoding_for_model(model_name)
    tokens = encoding.encode(text)
    return len(tokens)

def manage_chat_history(chat_history, token_limit=4096, summary_length=150):
    total_tokens = count_tokens(chat_history)

    if total_tokens > token_limit:
        summary = summarize_chat_history(chat_history)
        if count_tokens(summary) > summary_length:
            # Further reduce the summary if necessary
            summary = summarize_chat_history(summary)
        return summary
    return chat_history

# Example usage
chat_history = "..."  # Chat history string
managed_history = manage_chat_history(chat_history)
print("Managed Chat History:", managed_history)

3. Dynamic Context Management

Instead of always sending the entire history, you can dynamically choose what parts of the conversation to include. For instance, you can send the last few exchanges and a summary of earlier conversations.

Example of Combining Context with New Questions:

def get_context_with_question(chat_history, new_question, token_limit=4096):
    # Shrink the history first, then append the new user question
    managed_history = manage_chat_history(chat_history, token_limit)
    full_context = f"{managed_history}\nUser: {new_question}"

    if count_tokens(full_context) > token_limit:
        raise ValueError("Context with question exceeds token limit")

    return full_context

# Example usage
new_question = "Can you explain how recursion works in Python?"
context_with_question = get_context_with_question(chat_history, new_question)
print("Final Context to Send:", context_with_question)

Leveraging Frameworks: LangChain & LlamaIndex

LangChain

LangChain helps manage prompts and chain multiple LLM calls together. It can be particularly useful when you need to store conversation history and selectively retrieve parts of it.

Example of Using LangChain for Memory Management:

from langchain.memory import ConversationBufferMemory

# Initialize conversation memory
memory = ConversationBufferMemory()

# Add messages to the underlying chat message history
memory.chat_memory.add_user_message("Hello! Can you help me with Python?")
memory.chat_memory.add_ai_message("Sure! What do you need help with?")
memory.chat_memory.add_user_message("I need help with managing chat history...")

# Retrieve context
context = memory.load_memory_variables({})["history"]
print("Chat History from Memory:", context)

LangChain makes it easier to keep track of what was discussed without manually managing the chat history, and you can combine this with the summarization approach we discussed earlier.
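
If you want the memory itself to apply that summarization step, LangChain also provides a summary-based memory class. Here is a minimal sketch, assuming the langchain and langchain-openai packages are installed and OPENAI_API_KEY is set:

from langchain.memory import ConversationSummaryMemory
from langchain_openai import ChatOpenAI

# The memory keeps a running LLM-generated summary instead of the raw transcript
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
summary_memory = ConversationSummaryMemory(llm=llm)

summary_memory.save_context(
    {"input": "Hello! Can you help me with Python?"},
    {"output": "Sure! What do you need help with?"},
)
summary_memory.save_context(
    {"input": "I need help with managing chat history..."},
    {"output": "Summarization and token thresholds are good starting points."},
)

print("Summarized History:", summary_memory.load_memory_variables({})["history"])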

LlamaIndex (formerly GPT Index)

LlamaIndex (GPT Index) can be used to organize long documents or conversation history into a structure that allows easy retrieval. This is helpful for managing very long conversations, where you might want to index older parts and refer back to them without keeping everything in memory.

Example of Using LlamaIndex for Indexing Chat History:

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# Assume chat history is stored as text files in a folder
documents = SimpleDirectoryReader("chat_history_folder").load_data()
# VectorStoreIndex is the current replacement for the older GPTSimpleVectorIndex
index = VectorStoreIndex.from_documents(documents)

# Retrieve relevant context for a new question
query_engine = index.as_query_engine()
query = "How can I manage chat history effectively?"
response = query_engine.query(query)
print("Indexed Response:", response)

LlamaIndex is particularly powerful for longer-term context management, where you might want to refer back to parts of a conversation or document that aren’t directly in the latest context window.
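
If the index should outlive a single session, you can persist it to disk and reload it later instead of re-embedding the whole history each time. A minimal sketch, assuming an arbitrary ./chat_index directory and the index built above:

from llama_index.core import StorageContext, load_index_from_storage

# Persist the index so older conversation turns survive restarts
index.storage_context.persist(persist_dir="./chat_index")

# Later (e.g., in a new session), reload the index and query it again
storage_context = StorageContext.from_defaults(persist_dir="./chat_index")
restored_index = load_index_from_storage(storage_context)
print(restored_index.as_query_engine().query("What did we discuss about token limits?"))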

Best Practices

  1. Thresholding: Set a reasonable token threshold (e.g., 4096 for GPT-3.5-Turbo) and apply summarization or context reduction when needed.
  2. Selective Context: Only include the most relevant recent exchanges along with a summary of older parts of the conversation.
  3. Frameworks: Use LangChain for automatic memory management and LlamaIndex for indexing longer conversations.

Conclusion

Effective chat history management is crucial for optimizing LLM interactions, especially when dealing with token limits and cost considerations. By leveraging summarization, token checking, and frameworks like LangChain and LlamaIndex, you can maintain context effectively without overwhelming the model or exceeding token limits.

By implementing these strategies in Python, you can ensure that your applications remain both contextually rich and cost-effective.

About:

VerticalServe Inc — Niche Cloud, Data & AI/ML Premier Consulting Company, partnered with Google Cloud, Confluent, AWS, Azure… 60+ customers and many success stories.

Website: http://www.VerticalServe.com

Contact: contact@verticalserve.com

Successful Case Studies: http://verticalserve.com/success-stories.html

InsightLake Solutions: Our pre-built solutions — http://www.InsightLake.com
