InsightPipelines — Building a Governed & Secure Databricks Lakehouse for Insurance

VerticalServe Blogs
Dec 8, 2024


Introduction

In the ever-evolving insurance industry, the volume, velocity, and variety of data generated daily have skyrocketed. Companies must handle data from multiple sources, such as policy management systems, claims processing systems, broker platforms, reinsurance contracts, finance systems, marketing campaigns, and regulatory reports. To stay competitive, insurers need to modernize their data architecture to facilitate secure, governed, and scalable data access for business intelligence, data science, and artificial intelligence/machine learning (AI/ML) initiatives.

This is where InsightPipelines comes in — a comprehensive, low-code/no-code ETL solution that enables insurance companies to create a unified, secure, and governed Lakehouse Medallion Architecture on AWS, powered by Databricks. Leveraging S3 as the Bronze (raw) layer and Databricks for the Silver (clean) and Gold (curated) layers, InsightPipelines transforms and moves both structured and unstructured data into actionable insights. This post dives into how InsightPipelines addresses key pain points in the insurance industry and provides a scalable, future-proof data architecture.

Key Challenges Faced by Insurance Companies

  1. Data Silos Across Business Units: Policies, claims, broker data, finance, marketing, and legal datasets are stored in different systems, creating data silos.
  2. Unstructured Data Complexity: Insurance claims often include documents, images, and sticky notes that are difficult to process and analyze.
  3. Data Governance and Security: Strict regulatory standards require data masking, encryption, and access control.
  4. Scaling Analytics and AI/ML: Traditional ETL systems cannot keep pace with the growing demands of predictive models and Generative AI.
  5. Operational Inefficiencies: Manual processes hinder data availability for downstream applications like analytics, underwriting, and audit reporting.

What is InsightPipelines?

InsightPipelines is a low-code/no-code ETL platform that enables companies to bring data from multiple sources into a unified Lakehouse Medallion architecture. It supports the complete data lifecycle from ingestion to analytics and data product creation. Insurance companies can seamlessly collect, clean, transform, and serve data to different business units through APIs, SFTP, and other interfaces.

Core Capabilities of InsightPipelines

  • Multi-Source Data Ingestion: Connectors for Policy, Claims, Operations, Broker, Reinsurance, Finance, Marketing, Legal, and third-party sources.
  • Data Transformation: Supports dbt, Spark, and Python for data cleansing, normalization, and deduplication.
  • Data Governance & Security: Includes data masking, tokenization, and access control for secure data access.
  • Lineage & Auditability: Complete lineage tracking for regulatory compliance.
  • Data Products & API Creation: Exposes curated data as APIs for consumption by applications and downstream systems.
  • Support for Unstructured Data: Enables image, document, and PDF data ingestion to handle claims notes, binder letters, and ACORD forms.

How InsightPipelines Builds a Lakehouse Medallion Architecture

The Lakehouse Medallion Architecture divides the data journey into three logical layers — Bronze (raw), Silver (cleaned), and Gold (curated) — for better data quality, security, and governance. This architecture is crucial for insurance companies where data consistency, accuracy, and security are paramount.

1. Bronze Layer (Raw Data)

  • Location: AWS S3
  • Data Type: Structured, semi-structured, and unstructured data (JSON, CSV, PDF, images)
  • Purpose: Captures raw, immutable data from multiple sources (Policy, Claims, Finance, Marketing, Reinsurance, etc.) with minimal transformations.
  • Usage: Used for audit trails, lineage tracking, and incident investigations.

Example: Ingesting ACORD forms, policy binders, claim notes, emails, and other unstructured documents from brokers and underwriters into an S3 bucket.
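
As a minimal sketch of this landing step, raw documents might be pushed to S3 with boto3; the bucket name and key layout here are illustrative placeholders, not InsightPipelines' actual conventions.

```python
# Hypothetical Bronze landing step: store the raw file in S3 unchanged.
import boto3

s3 = boto3.client("s3")

def land_raw_document(local_path: str, source_system: str, file_name: str) -> None:
    # Bronze keeps data immutable and minimally transformed, so we only
    # namespace the object by its source system and upload it as-is.
    key = f"bronze/{source_system}/{file_name}"
    s3.upload_file(local_path, "insurer-lakehouse-bronze", key)  # bucket name assumed

# e.g. an ACORD form received through a broker portal
land_raw_document("/tmp/acord_125.pdf", "broker_portal", "acord_125.pdf")
```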

2. Silver Layer (Clean Data)

  • Location: Databricks (Delta Lake)
  • Data Type: Structured datasets, normalized and deduplicated.
  • Purpose: Applies transformations, deduplication, and schema standardization using Spark, dbt, and Python.
  • Usage: Data is ready for analytics, underwriting, and operational dashboards.

Example: Cleaning claims sticky notes to extract essential claim details and merging broker data with policyholder details to create unified customer views.
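
A minimal PySpark sketch of such a Bronze-to-Silver step, assuming hypothetical paths and column names: parse raw claims JSON, standardize key fields, and deduplicate on the claim number before writing a Delta table.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("claims_silver").getOrCreate()

# Read raw claim records landed in the Bronze zone (path assumed).
raw_claims = spark.read.json("s3://insurer-lakehouse-bronze/bronze/claims/")

silver_claims = (
    raw_claims
    .withColumn("claim_number", F.upper(F.trim("claim_number")))    # normalize keys
    .withColumn("loss_date", F.to_date("loss_date", "yyyy-MM-dd"))  # standardize types
    .dropDuplicates(["claim_number"])                               # dedupe on business key
)

# Persist the cleaned dataset as a Delta table in the Silver layer.
silver_claims.write.format("delta").mode("overwrite").saveAsTable("silver.claims")
```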

3. Gold Layer (Curated Data)

  • Location: Databricks (Delta Lake)
  • Data Type: Business-ready, curated data used for analytics, reporting, and AI models.
  • Purpose: Provides clean, governed data to internal and external stakeholders.
  • Usage: AI/ML models, dashboards, and business intelligence (BI) tools like Tableau or Power BI consume Gold-layer datasets.

Example: Creating a curated dataset for Predictive Claims Analysis that feeds dashboards and underwriting risk models.
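
As one hypothetical Gold-layer job, that curated settlement dataset could be derived by joining Silver claims to policies and aggregating per policy; the table and column names continue the assumptions from the Silver sketch.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("claims_gold").getOrCreate()

claims = spark.table("silver.claims")
policies = spark.table("silver.policies")  # assumed Silver table

gold = (
    claims.join(policies, "policy_number")
    .groupBy("policy_number", "line_of_business")
    .agg(
        F.count("claim_number").alias("claim_count"),
        F.avg(F.datediff("settled_date", "loss_date")).alias("avg_days_to_settle"),
    )
)

# Business-ready table consumed by dashboards and risk models.
gold.write.format("delta").mode("overwrite").saveAsTable("gold.claims_settlement_features")
```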

Core Data Operations in InsightPipelines

Here’s how InsightPipelines powers data transformation, security, and lineage in a modern insurance company.

1. Data Ingestion & Integration

  • Source Systems: Policy admin systems, Claims management, Broker platforms, Reinsurance agreements, CRM, and external data.
  • ETL Tools: InsightPipelines has ready connectors for ACORD forms, broker uploads, and policy documents.
  • Data Formats: JSON, CSV, Parquet, PDF, and images (OCR capabilities; see the sketch below).
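
To illustrate the OCR path on scanned documents, a claim-note image could be converted to text with an off-the-shelf library such as pytesseract; InsightPipelines' own OCR connector may work differently.

```python
# Assumes Pillow and pytesseract (plus the Tesseract binary) are installed.
from PIL import Image
import pytesseract

def extract_text(image_path: str) -> str:
    # OCR the scanned form so its text can be stored alongside the raw file.
    return pytesseract.image_to_string(Image.open(image_path))

print(extract_text("/tmp/claim_note_scan.png"))  # hypothetical scanned claim note
```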

2. Data Transformation & Cleansing

  • dbt Integration: Supports data modeling, deduplication, and normalization.
  • Spark and Python: For large-scale transformations, especially with large datasets from policy and claims management systems.
  • Data Enrichment: Joins claims, policy, and reinsurance data to enrich records for AI models, as sketched below.
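
A sketch of that enrichment join in PySpark, with assumed Silver table and key names, producing one wide record per claim for downstream models:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("claims_enrichment").getOrCreate()

claims = spark.table("silver.claims")
policies = spark.table("silver.policies")
reinsurance = spark.table("silver.reinsurance_treaties")  # assumed table; treaty_id comes via policies

# Left joins keep every claim even when policy or treaty data is missing.
enriched = (
    claims
    .join(policies, "policy_number", "left")
    .join(reinsurance, "treaty_id", "left")
)

enriched.write.format("delta").mode("overwrite").saveAsTable("silver.claims_enriched")
```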

3. Security & Governance

  • Data Masking and Tokenization: Masks or tokenizes sensitive data, like policyholder names and addresses, to comply with regulations (see the sketch after this list).
  • Role-Based Access Control (RBAC): Ensures only authorized users access specific datasets.
  • Auditability & Lineage: Tracks data lineage and changes to meet compliance (SOX, GDPR).
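
Here is a minimal sketch of column-level protection in PySpark: hash policyholder names into stable tokens and mask address digits before exposing a view. Production-grade tokenization would typically go through a dedicated vault or key-management service.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("pii_protection").getOrCreate()

claims = spark.table("silver.claims_enriched")

protected = (
    claims
    # One-way hash gives a joinable token without revealing the name.
    .withColumn("policyholder_token", F.sha2(F.col("policyholder_name"), 256))
    # Replace digits so house numbers and ZIP codes are not exposed.
    .withColumn("address_masked", F.regexp_replace("address", r"\d", "*"))
    .drop("policyholder_name", "address")
)

# Analysts query the protected view; RBAC decides who sees which tables.
protected.createOrReplaceTempView("claims_protected")
```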

4. Data Product & API Creation

  • API-First Approach: Curated data is exposed through REST APIs (see the sketch after this list).
  • Data Sharing: Shares curated data with reinsurers, brokers, and auditors via SFTP or API calls.
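
As a rough illustration of the API-first idea, curated rows could be served by a small FastAPI app that queries a Gold table through the Databricks SQL connector; the endpoint, table, and connection details below are assumptions, not InsightPipelines' actual interface.

```python
from databricks import sql  # pip install databricks-sql-connector
from fastapi import FastAPI, HTTPException

app = FastAPI()

@app.get("/claims-features/{policy_number}")
def get_claim_features(policy_number: str) -> dict:
    # Connection details are placeholders; point these at a SQL warehouse.
    with sql.connect(
        server_hostname="<workspace-host>",
        http_path="<sql-warehouse-http-path>",
        access_token="<token>",
    ) as conn:
        with conn.cursor() as cur:
            cur.execute(
                "SELECT * FROM gold.claims_settlement_features WHERE policy_number = :pn",
                {"pn": policy_number},
            )
            row = cur.fetchone()
            if row is None:
                raise HTTPException(status_code=404, detail="policy not found")
            columns = [c[0] for c in cur.description]
    return dict(zip(columns, row))
```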

AI and Data Science Use Cases

With InsightPipelines, insurance companies can scale AI, ML, and Generative AI (GenAI) capabilities. Here are some key use cases:

  • Predictive Claims Analytics: Use curated claims and broker data to predict claim settlement times (a training sketch follows this list).
  • AI-Driven Underwriting: Use Gold-layer data to improve risk models and generate underwriting recommendations.
  • Chatbots & Generative AI: Ingest data from claim notes and documents to create AI-powered virtual assistants for customer service.
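
To make the first use case concrete, here is a hypothetical training sketch: fit a regressor on Gold-layer features to predict settlement time. The feature columns and the scikit-learn model choice are assumptions for illustration, not InsightPipelines outputs.

```python
from pyspark.sql import SparkSession
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

spark = SparkSession.builder.getOrCreate()

# Pull the (assumed) curated feature table into pandas for model fitting.
df = spark.table("gold.claims_settlement_features").toPandas()

# Feature columns beyond claim_count are assumed to exist in the Gold table.
features = ["claim_count", "policy_tenure_years", "reported_delay_days"]
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["avg_days_to_settle"], test_size=0.2, random_state=42
)

model = GradientBoostingRegressor().fit(X_train, y_train)
print(f"R^2 on held-out claims: {model.score(X_test, y_test):.2f}")
```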

Architectural Overview

Here is the high-level architecture for InsightPipelines’ ETL Solution and Lakehouse Medallion Architecture.

Sources: Policy, Claims, Brokers, Reinsurance, Finance, Marketing, Legal
|
v
Ingestion Layer (ETL) --> Bronze (AWS S3)
|
v
Transformation Layer (Spark, dbt, Python) --> Silver (Databricks Delta Lake)
|
v
Curation Layer (Data Products) --> Gold (Databricks Delta Lake)
|
v
Consumption: APIs, SFTP, Tableau, Power BI, AI Models, Dashboards

Benefits of InsightPipelines

  1. Faster Time to Insights: Business units can access clean, curated data in days, not weeks.
  2. Unified Data Access: Breaks down data silos across brokers, reinsurance, claims, and underwriting.
  3. Data Security: Tokenization, masking, and role-based access ensure compliance.
  4. Operational Efficiency: Automates previously manual data pipelines, lineage capture, and auditing.
  5. AI/ML Enablement: Powers predictive underwriting and claims analysis.

Closing Thoughts

InsightPipelines delivers a transformative ETL platform for insurance companies to build a Lakehouse Medallion architecture on AWS and Databricks. By connecting disparate business units (Policy, Claims, Brokers, etc.), automating ETL processes, supporting dbt, Spark, and Python, and facilitating secure data sharing, InsightPipelines establishes a unified, governed data platform. This platform empowers insurance firms to leverage BI, Data Science, and Generative AI capabilities.

If you’re looking to modernize your insurance company’s data architecture, it’s time to unlock the power of InsightPipelines and the Lakehouse Medallion architecture. Drive growth, accelerate AI, and power decision-making like never before.
