Building a Robust Python Classifier for Detecting Sensitive and Confidential Information in Text
In this blog post, we will explore how to build a production-grade classifier that detects whether a given text contains sensitive company information such as PII (Personally Identifiable Information), PCI (Payment Card Information), financial data, and internal product names.
We’ll cover key libraries, provide recommendations for scalable and accurate solutions, and include Python examples to classify sensitive content effectively.
1. Key Libraries and Recommendations
To detect sensitive information, we will leverage popular Python libraries designed for Natural Language Processing (NLP) and text analysis.
Recommended Libraries
- spaCy: Efficient NLP library for entity extraction.
- Hugging Face Transformers: Pretrained models for text classification.
- Presidio: Open-source PII detection library.
- Pandas: For data handling.
- scikit-learn: For creating classification pipelines.
- FastAPI: For creating production-grade APIs (optional).
Recommendations
- Use pretrained named entity recognition (NER) models to detect entities such as names, addresses, and organizations.
- Fine-tune transformer-based models for detecting company-specific sensitive text.
- Combine custom rules with statistical models to improve classification performance.
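To make the "custom rules" half of that combination concrete, here is a minimal stdlib-only sketch of rule-based detection. The patterns and labels are illustrative only; production rules need stricter validation (for example, Luhn checksums for card numbers) and contextual filtering to avoid false positives.

```python
import re

# Illustrative patterns for common PII/PCI formats; these are deliberately
# simple and will over- and under-match in real-world text
RULE_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "CREDIT_CARD": re.compile(r"\b\d(?:[ -]?\d){12,15}\b"),
}

def rule_based_entities(text):
    """Return (label, matched_text) pairs for every rule that fires."""
    hits = []
    for label, pattern in RULE_PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((label, match.group()))
    return hits

print(rule_based_entities("Contact jane@example.com, SSN 123-45-6789."))
```

The output of a function like this can later be merged with statistical model predictions, so that either signal is enough to flag a text as sensitive.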
2. Dataset Preparation
Before building a classifier, you’ll need a dataset with labeled examples of sensitive and non-sensitive content. Ideally, your dataset should include:
- PII-related texts: Emails, phone numbers, addresses.
- PCI data: Credit card numbers, financial records.
- Product-related texts: Internal product names or code names.
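With labeled examples in hand, a quick scikit-learn baseline is a useful sanity check before reaching for transformers. The tiny dataset below is fabricated purely for illustration; a real dataset needs thousands of varied, representative samples.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy labeled dataset (fabricated examples, for illustration only)
data = pd.DataFrame({
    "text": [
        "Customer SSN is 123-45-6789",
        "Card number 4111 1111 1111 1111 was charged",
        "Internal codename Project Falcon ships in Q3",
        "The weather is nice today",
        "Lunch menu: pasta and salad",
        "Meeting moved to 3pm in room B",
    ],
    "label": ["sensitive", "sensitive", "sensitive",
              "non_sensitive", "non_sensitive", "non_sensitive"],
})

# TF-IDF features + logistic regression: a fast, interpretable baseline
baseline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("clf", LogisticRegression(max_iter=1000)),
])
baseline.fit(data["text"], data["label"])

print(baseline.predict(["Employee SSN 987-65-4321 on file"]))
```

If a baseline like this already performs well on your data, the heavier transformer pipeline described next may only be needed for harder, more contextual cases.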
3. Building the Classifier
We’ll create a multi-step classification pipeline using spaCy, Presidio, and transformers to detect entities and classify text.
Step 1: Install Required Libraries
pip install spacy transformers presidio-analyzer scikit-learn pandas
python -m spacy download en_core_web_sm
Step 2: Load Pretrained NLP Models
import spacy
from transformers import pipeline
# Load spaCy model for NER
nlp_spacy = spacy.load("en_core_web_sm")
# Load a transformer-based classification model. The SST-2 sentiment model
# below is only a convenient stand-in; for real use, fine-tune a model on
# sensitive vs. non-sensitive examples from your own domain.
classifier = pipeline("text-classification", model="distilbert-base-uncased-finetuned-sst-2-english")
4. Entity Detection and Classification
Step 3: Detect Sensitive Entities
Use spaCy for detecting PII-like entities (e.g., names, addresses) and Presidio for rule-based detection.
from presidio_analyzer import AnalyzerEngine
# Initialize the Presidio analyzer. Its default registry already includes the
# predefined recognizers (among them CreditCardRecognizer for PCI detection);
# custom recognizers can be added via analyzer.registry.add_recognizer(...).
analyzer = AnalyzerEngine()
def detect_entities(text):
    # spaCy NER for names, locations, organizations, etc.
    spacy_doc = nlp_spacy(text)
    detected_entities = [(ent.text, ent.label_) for ent in spacy_doc.ents]
    # Presidio rule-based detection, restricted here to credit card numbers
    presidio_results = analyzer.analyze(text=text, entities=["CREDIT_CARD"], language="en")
    pci_entities = [(res.entity_type, res.score) for res in presidio_results]
    return detected_entities, pci_entities
5. Full Production Pipeline Example
Now, let’s build a classification function that integrates entity detection and transformers-based classification.
Full Classifier Pipeline
def classify_sensitive_text(text):
    # Step 1: Detect entities
    entities, pci_entities = detect_entities(text)
    # Step 2: Text classification for sensitive context. The label name must
    # match your model: a fine-tuned model might emit "SENSITIVE", while the
    # stock SST-2 stand-in emits "POSITIVE"/"NEGATIVE", so this check only
    # becomes meaningful once a sensitivity-tuned model is plugged in.
    classification_result = classifier(text)[0]
    model_flags_sensitive = classification_result["label"] == "SENSITIVE"
    # Step 3: Final classification - sensitive if any entity was found
    # or the model flags the text
    if entities or pci_entities or model_flags_sensitive:
        return {
            "predicted_class": "Sensitive",
            "detected_entities": entities + pci_entities,
        }
    return {
        "predicted_class": "Non-Sensitive",
        "detected_entities": [],
    }
Example Usage
text_input = "Customer John Smith's card number is 4111 1111 1111 1111."
result = classify_sensitive_text(text_input)
print(result)
Output (illustrative; the exact entities and scores depend on the spaCy model and Presidio recognizers in use):
{
    "predicted_class": "Sensitive",
    "detected_entities": [
        ("John Smith", "PERSON"),
        ("CREDIT_CARD", 1.0)
    ]
}
Conclusion
In this post, we built a text classifier that detects sensitive company information such as PII, PCI data, and product names using a combination of:
- Pretrained NLP models from spaCy and Hugging Face.
- Rule-based detection using Presidio.
- A classification pipeline to combine entity detection with transformers for overall sensitivity classification.
This pipeline can be extended further by adding custom recognizers, improving datasets, and deploying APIs to integrate into enterprise applications for automated sensitive data detection.
Next Steps:
- Fine-tune the transformer model using domain-specific datasets for better accuracy.
- Add more entity recognizers for company-specific keywords (e.g., internal project codes).
- Integrate the classifier with logging and monitoring for production-grade usage.
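As a concrete example of the first two next steps, a custom PCI recognizer can cut false positives by validating candidate card numbers with the standard Luhn checksum. The stdlib-only sketch below is ours, not a Presidio API; in practice you would wrap a check like this in a Presidio custom recognizer.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digit string passes the Luhn checksum
    (the check card networks use to catch mistyped numbers)."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 13:  # payment card numbers are at least 13 digits
        return False
    checksum = 0
    # Double every second digit from the right; subtract 9 if it exceeds 9
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        checksum += d
    return checksum % 10 == 0

print(luhn_valid("4111 1111 1111 1111"))  # a well-known Luhn-valid test number
print(luhn_valid("4111 1111 1111 1112"))
```

Gating regex matches through a validator like this keeps random 16-digit strings (order IDs, tracking numbers) from being flagged as payment cards.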