Building a Robust Python Classifier for Detecting Sensitive and Confidential Information in Text
In this blog post, we will explore how to build a production-grade classifier that detects whether a given text contains sensitive company information such as PII (Personally Identifiable Information), PCI (Payment Card Information), financial data, and internal product names.
We’ll cover key libraries, provide recommendations for scalable and accurate solutions, and include Python examples to classify sensitive content effectively.
1. Key Libraries and Recommendations
To detect sensitive information, we will leverage popular Python libraries designed for Natural Language Processing (NLP) and text analysis.
Recommended Libraries
- spaCy: Efficient NLP library for entity extraction.
- Hugging Face Transformers: Pretrained models for text classification.
- Presidio: Open-source PII detection library.
- Pandas: For data handling.
- scikit-learn: For creating classification pipelines.
- FastAPI: For creating production-grade APIs (optional).
Recommendations
- Use pretrained named entity recognition (NER) models to detect entities such as names, addresses, and organizations.
- Fine-tune transformer-based models for detecting company-specific sensitive text.
- Combine custom rules with statistical models to improve classification performance.
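To make the "custom rules" half of that combination concrete, here is a minimal stdlib-only sketch of rule-based detection. The patterns and labels are illustrative only; production rules need stricter validation (for example, Luhn checksums for card numbers) and contextual filtering to avoid false positives.

```python
import re

# Illustrative patterns for common PII/PCI formats; these are deliberately
# simple and will over- and under-match in real-world text
RULE_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "CREDIT_CARD": re.compile(r"\b\d(?:[ -]?\d){12,15}\b"),
}

def rule_based_entities(text):
    """Return (label, matched_text) pairs for every rule that fires."""
    hits = []
    for label, pattern in RULE_PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((label, match.group()))
    return hits

print(rule_based_entities("Contact jane@example.com, SSN 123-45-6789."))
```

The output of a function like this can later be merged with statistical model predictions, so that either signal is enough to flag a text as sensitive.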
2. Dataset Preparation
Before building a classifier, you’ll need a dataset with labeled examples of sensitive and non-sensitive content. Ideally, your dataset should include:
- PII-related texts: Emails, phone numbers, addresses.
- PCI data: Credit card numbers, financial records.
- Product-related texts: Internal product names or code names.
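With labeled examples in hand, a quick scikit-learn baseline is a useful sanity check before reaching for transformers. The tiny dataset below is fabricated purely for illustration; a real dataset needs thousands of varied, representative samples.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy labeled dataset (fabricated examples, for illustration only)
data = pd.DataFrame({
    "text": [
        "Customer SSN is 123-45-6789",
        "Card number 4111 1111 1111 1111 was charged",
        "Internal codename Project Falcon ships in Q3",
        "The weather is nice today",
        "Lunch menu: pasta and salad",
        "Meeting moved to 3pm in room B",
    ],
    "label": ["sensitive", "sensitive", "sensitive",
              "non_sensitive", "non_sensitive", "non_sensitive"],
})

# TF-IDF features + logistic regression: a fast, interpretable baseline
baseline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("clf", LogisticRegression(max_iter=1000)),
])
baseline.fit(data["text"], data["label"])

print(baseline.predict(["Employee SSN 987-65-4321 on file"]))
```

If a baseline like this already performs well on your data, the heavier transformer pipeline described next may only be needed for harder, more contextual cases.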
3. Building the Classifier
We’ll create a multi-step classification pipeline using spaCy, Presidio, and transformers to detect entities and classify text.
Step 1: Install Required Libraries
pip install spacy transformers presidio-analyzer scikit-learn pandas
python -m spacy download en_core_web_sm
Step 2: Load Pretrained NLP Models
import spacy
from transformers import pipeline
# Load spaCy model for NER
nlp_spacy = spacy.load("en_core_web_sm")
# Load a transformer-based classification model. The SST-2 sentiment model
# below is only a convenient stand-in; for real use, fine-tune a model on
# sensitive vs. non-sensitive examples from your own domain.
classifier = pipeline("text-classification", model="distilbert-base-uncased-finetuned-sst-2-english")
4. Entity Detection and Classification
Step 3: Detect Sensitive Entities
Use spaCy for detecting PII-like entities (e.g., names, addresses) and Presidio for rule-based detection.
from presidio_analyzer import AnalyzerEngine
# Initialize the Presidio analyzer. Its default registry already includes the
# predefined recognizers (among them CreditCardRecognizer for PCI detection);
# custom recognizers can be added via analyzer.registry.add_recognizer(...).
analyzer = AnalyzerEngine()
def detect_entities(text):
    # spaCy NER for names, locations, organizations, etc.
    spacy_doc = nlp_spacy(text)
    detected_entities = [(ent.text, ent.label_) for ent in spacy_doc.ents]
    # Presidio rule-based detection, restricted here to credit card numbers
    presidio_results = analyzer.analyze(text=text, entities=["CREDIT_CARD"], language="en")
    pci_entities = [(res.entity_type, res.score) for res in presidio_results]
    return detected_entities, pci_entities
5. Full Production Pipeline Example
Now, let’s build a classification function that integrates entity detection and transformers-based classification.
Full Classifier Pipeline
def classify_sensitive_text(text):
    # Step 1: Detect entities
    entities, pci_entities = detect_entities(text)
    # Step 2: Text classification for sensitive context. The label name must
    # match your model: a fine-tuned model might emit "SENSITIVE", while the
    # stock SST-2 stand-in emits "POSITIVE"/"NEGATIVE", so this check only
    # becomes meaningful once a sensitivity-tuned model is plugged in.
    classification_result = classifier(text)[0]
    model_flags_sensitive = classification_result["label"] == "SENSITIVE"
    # Step 3: Final classification - sensitive if any entity was found
    # or the model flags the text
    if entities or pci_entities or model_flags_sensitive:
        return {
            "predicted_class": "Sensitive",
            "detected_entities": entities + pci_entities,
        }
    return {
        "predicted_class": "Non-Sensitive",
        "detected_entities": [],
    }
Example Usage
text_input = "Customer John Smith's card number is 4111 1111 1111 1111."
result = classify_sensitive_text(text_input)
print(result)
Output (illustrative; the exact entities and scores depend on the spaCy model and Presidio recognizers in use):
{
    "predicted_class": "Sensitive",
    "detected_entities": [
        ("John Smith", "PERSON"),
        ("CREDIT_CARD", 1.0)
    ]
}
Conclusion
In this post, we built a text classifier that detects sensitive company information such as PII, PCI data, and product names using a combination of:
- Pretrained NLP models from spaCy and Hugging Face.
- Rule-based detection using Presidio.
- A classification pipeline to combine entity detection with transformers for overall sensitivity classification.
This pipeline can be extended further by adding custom recognizers, improving datasets, and deploying APIs to integrate into enterprise applications for automated sensitive data detection.
Next Steps:
- Fine-tune the transformer model using domain-specific datasets for better accuracy.
- Add more entity recognizers for company-specific keywords (e.g., internal project codes).
- Integrate the classifier with logging and monitoring for production-grade usage.
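As a concrete example of the first two next steps, a custom PCI recognizer can cut false positives by validating candidate card numbers with the standard Luhn checksum. The stdlib-only sketch below is ours, not a Presidio API; in practice you would wrap a check like this in a Presidio custom recognizer.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digit string passes the Luhn checksum
    (the check card networks use to catch mistyped numbers)."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 13:  # payment card numbers are at least 13 digits
        return False
    checksum = 0
    # Double every second digit from the right; subtract 9 if it exceeds 9
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        checksum += d
    return checksum % 10 == 0

print(luhn_valid("4111 1111 1111 1111"))  # a well-known Luhn-valid test number
print(luhn_valid("4111 1111 1111 1112"))
```

Gating regex matches through a validator like this keeps random 16-digit strings (order IDs, tracking numbers) from being flagged as payment cards.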