GenAI — Transforming Natural Language Questions into SQL Queries for CSV/Excel Data

VerticalServe Blogs
3 min readAug 12, 2024

--

In the era of data-driven decision-making, the ability to query datasets efficiently is crucial. However, not everyone is proficient in SQL or database querying languages. This blog post will guide you through loading CSV/Excel files into a Pandas DataFrame, transforming natural language questions into SQL queries, and executing these queries against the loaded data. We’ll leverage Python libraries to achieve this, making it accessible even for those without extensive technical expertise.

Prerequisites

  1. Python Environment: Ensure you have Python installed.
  2. Install Required Libraries: Install pandas, sqlite3, and openai (for natural language processing).
pip install pandas openai

Step-by-Step Guide

1. Loading CSV/Excel Data

First, we’ll load a CSV or Excel file into a Pandas DataFrame. Pandas is a powerful data manipulation library that provides easy-to-use data structures and data analysis tools.

import pandas as pd

def load_data(file_path, file_type='csv'):
"""Load CSV or Excel data into a Pandas DataFrame."""
if file_type == 'csv':
df = pd.read_csv(file_path)
elif file_type == 'excel':
df = pd.read_excel(file_path)
else:
raise ValueError("Unsupported file type. Use 'csv' or 'excel'.")
return df

# Example usage
file_path = 'data.csv' # Replace with your file path
df = load_data(file_path, 'csv')
print(df.head())

2. Setting Up an In-Memory SQLite Database

Once the data is loaded into a DataFrame, we can use an in-memory SQLite database to execute SQL queries. This approach allows us to treat the DataFrame as a SQL table.

import sqlite3

def create_sqlite_db(df):
"""Create an in-memory SQLite database and load the DataFrame into it."""
conn = sqlite3.connect(':memory:')
df.to_sql('data', conn, index=False, if_exists='replace')
return conn

# Create SQLite database
conn = create_sqlite_db(df)

3. Generating SQL Queries from Natural Language

To convert natural language questions into SQL queries, we can use OpenAI’s GPT model. This requires setting up an OpenAI API key and using the openai library to interact with the model.

import openai

openai.api_key = 'your-openai-api-key' # Replace with your OpenAI API key

def generate_sql_query(natural_language_question, column_mapping):
"""Generate an SQL query from a natural language question."""
prompt = f"Convert the following natural language question into an SQL query: '{natural_language_question}'.\n"
prompt += f"Column mapping: {column_mapping}\n"

response = openai.Completion.create(
engine="text-davinci-003",
prompt=prompt,
max_tokens=150
)

sql_query = response.choices[0].text.strip()
return sql_query

# Example natural language question
question = "What is the average sales for each region?"
column_mapping = {
'average sales': 'avg(sales)',
'region': 'region'
}

sql_query = generate_sql_query(question, column_mapping)
print(sql_query)

4. Executing the SQL Query

With the generated SQL query, we can execute it against the SQLite database and retrieve the results.

def execute_sql_query(conn, sql_query):
"""Execute an SQL query against the SQLite database."""
result = pd.read_sql_query(sql_query, conn)
return result

# Execute the SQL query
result = execute_sql_query(conn, sql_query)
print(result)

Conclusion

This approach allows users to interact with CSV/Excel data using natural language, making data querying more accessible. By leveraging the power of Python, Pandas, SQLite, and OpenAI’s language models, we can transform natural language questions into actionable SQL queries, enabling users to extract insights from their data without needing to write complex SQL code.

Considerations

  • OpenAI API Costs: Using OpenAI’s API may incur costs, so ensure you understand the pricing model.
  • Accuracy of SQL Generation: The accuracy of the generated SQL queries depends on the quality of the natural language input and the model’s understanding.
  • Security: Be cautious when executing dynamically generated SQL queries to prevent SQL injection attacks, especially in a production environment.

This framework can be extended to handle more complex queries and datasets, providing a robust solution for natural language data interaction.

About — The GenAI POD — GenAI Experts

GenAIPOD is a specialized consulting team of VerticalServe, helping clients with GenAI Architecture, Implementations etc.

VerticalServe Inc — Niche Cloud, Data & AI/ML Premier Consulting Company, Partnered with Google Cloud, Confluent, AWS, Azure…50+ Customers and many success stories..

Website: http://www.VerticalServe.com

Contact: contact@verticalserve.com

--

--