Building production-ready data pipelines in Microsoft Fabric: A complete data quality framework with dlthub
Rakesh Gupta
Director and Principal Consultant (SketchMyView)
1. Introduction
Why data quality matters
Data quality isn't just a technical concern; it's a business imperative. In today's data-driven landscape, organizations rely on accurate, timely, and trustworthy data to power everything from daily operational decisions to complex machine learning models. Yet, the reality is that most data teams spend an inordinate amount of time firefighting data quality issues rather than delivering value.
The impact of poor data on analytics, ML, and business decisions
Poor data quality creates a cascade of problems throughout an organization. Analytics dashboards displaying incorrect metrics erode trust and lead to misguided strategic decisions.
Machine learning models trained on flawed data produce unreliable predictions, potentially causing costly business errors. Customer-facing applications might expose sensitive information or deliver poor user experiences due to incomplete or invalid data.
The financial impact is staggering. Studies estimate that poor data quality costs organizations an average of $12.9 million annually. Beyond direct costs, there's the opportunity cost of delayed insights, the reputational damage from data breaches, and the productivity drain on teams constantly debugging data issues.
For small data teams, often just one or two engineers, the challenge is even more acute. Without the resources for extensive data quality tooling or dedicated governance teams, these engineers must find efficient, scalable solutions that prevent problems before they cascade downstream.
2. The challenges of data quality in Microsoft Fabric
Tool fragmentation: Multiple services but no unified DQ engine
Microsoft Fabric offers a comprehensive suite of data services, from data ingestion and storage to transformation and analytics. It provides data pipelines for orchestration, OneLake for storage, Spark for processing, and Power BI for visualization.
However, what Fabric doesn't provide is a unified, built-in data quality engine that operates across these services.
This fragmentation means data quality checks are often implemented ad hoc, scattered across different pipeline stages, or, worse, implemented only after data has already landed in production tables.
Each team might develop their own validation logic, leading to inconsistent quality standards and duplicated effort.
Lack of pre-load (write-audit-publish) validation
One of the most critical gaps in many Fabric implementations is the absence of a structured Write-Audit-Publish (WAP) pattern.
In this pattern, data is written to a staging area, thoroughly audited against quality rules, and only published to production tables if it passes all checks.
Without WAP validation, bad data flows directly into trusted tables. By the time quality issues are discovered, often by downstream consumers or business users, the damage is done.
Correcting these issues requires costly data remediation, re-running of downstream processes, and explanations to stakeholders about why their reports suddenly changed.
Schema drift and semi-structured API data
Modern data pipelines increasingly consume data from APIs that return semi-structured JSON or XML. GitHub pull requests, Salesforce records, and countless other API sources don't come with strict schema guarantees. Fields appear and disappear, nested structures change, and data types aren't always consistent.
Microsoft Fabric's native tools handle schema evolution, but they don't enforce it. A new field appearing in your API response will simply be added to your table, regardless of whether it should be there, whether it contains sensitive information, or whether downstream processes expect it.
Downstream errors from unvalidated data
When validation happens only in Spark transformations, after data has already been ingested, failures cascade downstream. A transformation fails because a required field is missing. The failure isn't discovered until the scheduled job runs. Downstream dashboards go stale. Alerts fire. Engineers scramble to diagnose the issue, fix the data, and rerun the pipeline.
This reactive approach to data quality is exhausting and unsustainable, especially for small teams who cannot afford to be on-call for data issues around the clock.
Limited monitoring and observability for small teams
Enterprise data platforms often include sophisticated monitoring solutions: DataOps platforms, data observability tools, and custom alerting systems. Small teams rarely have the budget or bandwidth for these tools. They need lightweight, integrated solutions that provide visibility into data quality without requiring complex setup or ongoing maintenance.
3. The dltHub solution

What dlt from dltHub is
dlt is an open-source Python library designed to simplify the creation of robust, production-ready data pipelines. Think of it as a framework that handles the "plumbing" of data engineering (schema management, incremental loading, data typing, and quality validation) so engineers can focus on business logic rather than boilerplate code.
Unlike traditional ETL tools that require learning proprietary interfaces or drag-and-drop workflows, dlt is pure Python. It's lightweight, flexible, and designed for engineers who want control without complexity. You define resources (data sources), configure validation rules, and dlt handles the rest: tracking state, managing schemas, enforcing types, and detecting anomalies.
How it integrates with Microsoft Fabric
dlt integrates seamlessly with Microsoft Fabric through its filesystem destination support. dlt pipelines can write directly to Fabric Lakehouses using the filesystem connector, storing data in Parquet or Delta format. From there, Fabric's native Spark pools can read and transform the validated data, confident that it meets defined quality standards.
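As a concrete (illustrative) starting point, the filesystem destination can be pointed at OneLake through dlt's TOML configuration. The workspace and lakehouse names below are placeholders, and the exact OneLake URL format and credential keys should be verified against the current dlt and Fabric documentation:

```toml
# .dlt/secrets.toml -- illustrative sketch, not a verified configuration
[destination.filesystem]
# OneLake exposes an abfss-style path; <workspace> and <lakehouse> are placeholders
bucket_url = "abfss://<workspace>@onelake.dfs.fabric.microsoft.com/<lakehouse>.Lakehouse/Files/dlt_data"

[destination.filesystem.credentials]
# Service-principal credentials (values are placeholders)
azure_tenant_id = "<tenant-id>"
azure_client_id = "<client-id>"
azure_client_secret = "<client-secret>"
```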
Benefits for small teams (1–2 engineers)
For small teams, dlt is a force multiplier. Instead of building custom validation logic for every pipeline, teams define quality rules once using dlt's declarative schema system.
Instead of manually tracking schema changes across dozens of tables, dlt automatically detects and manages evolution. Instead of writing complex deduplication logic, teams simply declare primary keys and let dlt handle merge operations.
Most importantly, dlt's approach to data quality is proactive rather than reactive. Issues are caught at the source, before bad data pollutes downstream systems.
This shift from firefighting to prevention dramatically reduces the operational burden on small teams, freeing them to deliver value rather than debug failures.
Data quality gates for Microsoft Fabric
Explore the end-to-end implementation of the dlt quality lifecycle. This collection of notebooks covers the five essential pillars of data integrity and production-ready transformations.
- Pillar 1: Structural Integrity
- Pillar 2: Semantic Validity
- Pillar 3: Uniqueness & Relations
- Pillar 4: Privacy & Governance
- Pillar 5: Operational Health
4. Mapping the DQ lifecycle to dlthub

Stage 1: Source profiling
Before enforcing quality rules, you must understand your data. Source profiling involves examining raw API responses or file structures to identify patterns: Which fields are always present? Which contain nulls? What data types appear? Are there nested structures or arrays?
dlt facilitates profiling by allowing teams to run initial ingestions with minimal schema constraints. During these exploratory runs, dlt captures metadata about column distributions, null frequencies, and type variations. This intelligence informs the design of explicit schemas and validation rules for production pipelines.
For example, when first connecting to the GitHub API, a profiling run might reveal that the closed_at field is frequently null for open pull requests, that user.email is sometimes missing even for authenticated users, and that PR numbers are always positive integers. These insights become the foundation for targeted validation rules.
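The kind of profiling signal described above can be sketched in plain Python. This is not dlt's profiling API, just an illustration of the per-field metadata an exploratory run surfaces:

```python
from collections import Counter, defaultdict

def profile_records(records):
    """Summarize null frequency and observed types per field,
    mimicking the metadata an exploratory ingestion run surfaces."""
    null_counts = Counter()
    type_counts = defaultdict(Counter)
    for rec in records:
        for field, value in rec.items():
            if value is None:
                null_counts[field] += 1
            type_counts[field][type(value).__name__] += 1
    total = len(records)
    return {
        field: {
            "null_pct": 100.0 * null_counts[field] / total,
            "types": dict(type_counts[field]),
        }
        for field in type_counts
    }

# Two open PRs with null closed_at, one closed PR with a timestamp
sample = [
    {"number": 1, "state": "open", "closed_at": None},
    {"number": 2, "state": "open", "closed_at": None},
    {"number": 3, "state": "closed", "closed_at": "2024-01-05T10:00:00Z"},
]
report = profile_records(sample)
```

Running this over a sample immediately reveals patterns like "closed_at is null for two-thirds of records," exactly the insight that informs the validation rules below.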
Stage 2: Data contract and schema enforcement
Once you understand your source data, you can define a data contract, an explicit agreement about what valid data looks like. In dlt, this contract is expressed through schema definitions that declare:
- Required fields (nullable: false) that must always be present
- Data types (bigint, text, timestamp) that enforce structural integrity
- Primary keys that ensure uniqueness
- Business rules encoded as validation functions
For GitHub pull requests, a data contract might specify that every PR must have an id (unique integer), a number (positive integer), a state (either "open" or "closed"), and a created_at timestamp (not in the future). Users must have a login that follows GitHub's username conventions.
These contracts serve as executable documentation. They prevent schema drift by rejecting unexpected columns in frozen mode, or they track evolution in evolve mode so teams understand exactly when and how schemas change.
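As a stand-alone illustration of what such a contract enforces, the required-field and type checks for a GitHub PR might look like the following. This is plain Python, not dlt's schema API; the Pattern B example later in this article assumes a helper with this shape:

```python
def validate_structural_integrity(pr: dict) -> dict:
    """Enforce the contract described above: required fields present,
    values of the declared types. Raises ValueError on violation."""
    required = {"id": int, "number": int, "state": str, "created_at": str}
    for field, expected_type in required.items():
        if pr.get(field) is None:
            raise ValueError(f"missing required field: {field}")
        if not isinstance(pr[field], expected_type):
            raise ValueError(
                f"{field} should be {expected_type.__name__}, "
                f"got {type(pr[field]).__name__}"
            )
    return pr
```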
Stage 3: Pre-load / WAP validation
The heart of dlt's data quality approach is pre-load validation, the Write-Audit-Publish pattern that gates data before it enters trusted tables. This validation operates at multiple levels:
Schema Checks ensure that incoming data matches expected structure. Are all required fields present? Do values conform to declared types? Has the source introduced new columns that might contain sensitive data or break downstream processes?
Business Rules validate that data makes logical sense. Pull request states must be valid (not "pending" or "banana"). Creation dates cannot be in the future or before GitHub existed. PR numbers must be positive. Closed pull requests must have a closed_at timestamp.
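The business rules just described can be sketched as a stand-alone check. This is illustrative only: the GITHUB_LAUNCH cutoff is an approximation, and the field names follow the GitHub API:

```python
from datetime import datetime, timezone

VALID_STATES = {"open", "closed"}
GITHUB_LAUNCH = datetime(2008, 4, 1, tzinfo=timezone.utc)  # approximate

def validate_semantics(pr: dict) -> dict:
    """Business-rule checks: valid state, plausible created_at,
    positive PR number, closed PRs carry a closed_at timestamp."""
    if pr["state"] not in VALID_STATES:
        raise ValueError(f"invalid state: {pr['state']!r}")
    created = datetime.fromisoformat(pr["created_at"].replace("Z", "+00:00"))
    if created > datetime.now(timezone.utc) or created < GITHUB_LAUNCH:
        raise ValueError(f"implausible created_at: {pr['created_at']}")
    if pr["number"] <= 0:
        raise ValueError(f"PR number must be positive: {pr['number']}")
    if pr["state"] == "closed" and not pr.get("closed_at"):
        raise ValueError("closed PR missing closed_at")
    return pr
```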
Uniqueness and Referential Integrity prevent duplicates and orphaned records. Primary keys must be unique across loads. Child records (like comments or reviews) must reference valid parent records (pull requests). Merge operations correctly update existing records rather than creating duplicates.
PII Detection and Masking identify sensitive fields like emails, phone numbers, or IP addresses, then automatically mask, hash, or redact them according to governance policies. This happens before data reaches the lakehouse, ensuring compliance from the start.
When validation fails, dlt can be configured to reject entire batches, discard invalid rows, or null out problematic values while preserving the rest of the record. The choice depends on your tolerance for data loss versus your tolerance for invalid data.
Stage 4: Controlled load into lakehouse
Only after data passes all pre-load validations does dlt write it to the Fabric Lakehouse. At this point, the data is clean, validated, and safe for downstream consumption. Spark transformations can proceed with confidence, knowing that structural integrity, business rules, uniqueness, and privacy protections have already been enforced.
This gated approach prevents the "garbage in, garbage out" problem. Spark jobs no longer fail due to unexpected nulls, invalid states, or schema changes. Analysts no longer discover duplicate records in their dashboards. Compliance teams no longer worry about exposed PII in analytics tables.
The load process itself is optimized. dlt uses efficient Parquet or Delta formats, partitions data appropriately, and tracks incremental state so subsequent loads only process new or changed data. For small teams, this means pipelines run faster and cost less, without sacrificing quality.
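The incremental-state idea can be illustrated with a minimal plain-Python sketch (not dlt's actual incremental API): a cursor value is stored between runs, and only records newer than the cursor pass through on the next run.

```python
def incremental_extract(records, state, cursor_field="updated_at"):
    """Return only records newer than the stored cursor, then advance it,
    mirroring how incremental state is tracked between pipeline runs."""
    last_value = state.get(cursor_field, "")
    new_last = last_value
    fresh = []
    for rec in records:
        if rec[cursor_field] > last_value:  # ISO-8601 timestamps sort lexically
            fresh.append(rec)
            new_last = max(new_last, rec[cursor_field])
    state[cursor_field] = new_last
    return fresh
```

Re-running the same batch yields nothing new, which is exactly why subsequent loads only process new or changed data.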
Stage 5: Logging and monitoring
Every dlt pipeline run generates rich metadata about the quality and health of your data operations. How many rows were ingested? How many failed validation? Which business rules triggered rejections? Did the schema evolve? Were any PII fields detected and masked?
This metadata is captured automatically and can be persisted to monitoring tables or logging systems.
For small teams without dedicated observability platforms, dlt provides a lightweight built-in solution that surfaces critical information without requiring additional infrastructure.
Monitoring includes operational health metrics: pipeline duration, API response times, load success rates, and freshness indicators (how old is the data?). It also includes data quality metrics: completeness scores (what percentage of fields are null?), validity rates (what percentage passed business rules?), and uniqueness violations.
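Two of these quality metrics, completeness and validity, reduce to simple arithmetic over a batch. A minimal sketch (the metric names mirror the text, not any particular dlt output schema):

```python
def quality_metrics(records, rule):
    """Compute completeness (non-null field percentage) and validity
    (percentage of records passing `rule`) for a batch."""
    total_fields = sum(len(r) for r in records)
    null_fields = sum(1 for r in records for v in r.values() if v is None)
    valid = sum(1 for r in records if rule(r))
    return {
        "completeness_pct": 100.0 * (total_fields - null_fields) / total_fields,
        "validity_pct": 100.0 * valid / len(records),
    }
```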
Stage 6: Feedback loop and iterative improvement
Data quality isn't static. Sources evolve, business requirements change, and new edge cases emerge. The final stage of the DQ lifecycle is continuous improvement, adjusting validation rules, refining schemas, and enhancing monitoring based on operational experience.
dlthub’s declarative approach makes iteration straightforward. Validation logic lives in code, versioned alongside your pipeline definitions. When you discover that a new field has appeared in your API responses, you update your schema definition and redeploy. When analysts request additional validation rules, you add them to your resource decorators and test them in development before promoting to production.
This feedback loop transforms data quality from a one-time project into an ongoing practice. Small teams build institutional knowledge about their data, encoded in schemas and validation rules that grow more sophisticated over time.
5. Protecting sensitive data (PII)
Common PII fields in APIs
APIs frequently return personally identifiable information, often without clear indication that sensitive data is present. GitHub API responses include user emails (when available), usernames that might reveal real identities, and IP addresses in certain contexts.
Salesforce APIs return customer names, addresses, and phone numbers. Social media APIs expose location data and private messages.
Even fields that seem innocuous can become PII in combination. A username plus a creation timestamp plus a repository name might uniquely identify an individual. Aggregate data about small groups can enable re-identification attacks.
For small teams without dedicated privacy officers, identifying PII is challenging. Field names don't always indicate sensitivity (is user_id identifying? What about login?). API documentation may not highlight privacy concerns. And regulations like GDPR and CCPA define PII broadly, capturing fields you might not expect.
How dltHub detects and masks PII before loading
dlt addresses PII protection through multiple mechanisms built into the pipeline ingestion process. First, it allows teams to define PII policies at the resource level, declaring which fields contain email addresses, phone numbers, or other sensitive data.
Second, dlt provides transformation functions that automatically detect patterns matching common PII formats: email addresses (regex patterns), phone numbers (international formats), Social Security numbers, credit card numbers, and IP addresses. These detections happen before data is written to the lakehouse.
Third, dlt applies appropriate protections based on the type of PII and your governance requirements:
- Masking obscures values while preserving some information (showing the last 4 digits of a phone number)
- Hashing enables privacy-preserving joins (matching records across systems without exposing actual values)
- Redaction completely removes sensitive data, replacing it with [REDACTED]
- Dropping eliminates entire fields that shouldn't be stored
These transformations preserve data utility for analytics while minimizing privacy risk and ensuring regulatory compliance.
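These protections are easy to picture as per-field policies applied before data is written. The sketch below is a stand-alone illustration, not dlt's configuration syntax; `protect_pii` and `looks_like_email` are hypothetical helpers:

```python
import hashlib
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def looks_like_email(value):
    """Regex-based detection of one common PII format."""
    return isinstance(value, str) and bool(EMAIL_RE.fullmatch(value))

def protect_pii(record, policy):
    """Apply one of the four protections per field:
    'mask', 'hash', 'redact', or 'drop'. Unlisted fields pass through."""
    out = {}
    for field, value in record.items():
        action = policy.get(field)
        if action is None:
            out[field] = value
        elif action == "mask" and isinstance(value, str):
            # Preserve only the last 4 characters
            out[field] = "*" * max(len(value) - 4, 0) + value[-4:]
        elif action == "hash":
            # Stable hash enables privacy-preserving joins
            out[field] = hashlib.sha256(str(value).encode()).hexdigest()
        elif action == "redact":
            out[field] = "[REDACTED]"
        # action == "drop": omit the field entirely
    return out
```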
6. Integrating into a Microsoft Fabric pipeline
A production-ready Microsoft Fabric pipeline with dlt-powered data quality follows a clear flow from ingestion through transformation to trusted analytics:
Step-by-step pipeline architecture
Step 1: API Ingestion with dlt begins with a Fabric notebook or scheduled job that uses dlt to connect to external APIs (GitHub, Salesforce, etc.) or other data sources. The dlt resource defines the extraction logic: how to fetch data, handle pagination, and manage authentication.
Step 2: Pre-Load Validation happens immediately within the dlt pipeline. As each record is extracted, it passes through validation gates: schema checks ensure structural integrity, business rules verify logical validity, uniqueness checks prevent duplicates, and PII detection identifies sensitive fields for masking.
Step 3: Controlled Write to Raw Layer occurs only for data that passes all validations. dlt writes clean, validated data to the Fabric Lakehouse's raw or staging area in efficient Parquet format. Failed records are logged separately for investigation, ensuring they don't pollute production tables.
Step 4: Spark Transformations proceed with confidence. Fabric Spark notebooks or Spark jobs read the validated data, knowing it's structurally sound and logically valid. Transformations focus on business logic (aggregations, joins, feature engineering) rather than defensive null checking and data cleansing.
Step 5: Load to Trusted Layer promotes transformed data to gold or trusted tables used by analysts and BI tools. Because data quality was enforced upstream, these tables are reliable, complete, and safe for self-service analytics.
Step 6: Monitoring and Alerting continuously tracks pipeline health. Generated metadata feeds into monitoring dashboards or alerting systems, surfacing quality issues, freshness violations, and operational anomalies before they impact business users.
How dltHub acts as a gatekeeper for DQ
In this architecture, dlt functions as a quality gatekeeper, the critical control point where data must prove its worthiness before entering the lakehouse. This pattern prevents the common antipattern of "load everything and hope for the best," where pipelines ingest raw API responses directly into staging tables and only discover problems when downstream jobs fail.
The gatekeeper pattern has profound operational benefits. Failed validations stop at the ingestion layer, generating clear error messages about exactly what went wrong (missing required field, invalid state, future timestamp).
Engineers don't waste time debugging Spark failures caused by malformed inputs. Business users don't see empty dashboards or incorrect metrics because bad data never reaches production.
For small teams, this prevention-focused approach is sustainable. Instead of constant firefighting, engineers invest effort upfront in defining comprehensive validation rules. The return on this investment comes daily, as the pipeline automatically rejects problematic data without manual intervention.
Monitoring metrics and alerting
Comprehensive monitoring built into dlt-powered pipelines provides visibility into both data quality and operational health. Critical metrics include:
Volume metrics track records ingested per run, detecting unexpected drops (empty API responses) or spikes (backfilling) that might indicate problems.
Quality metrics measure validation pass rates, null frequencies, and business rule violations, highlighting degrading data quality trends before they become crises.
Freshness metrics monitor the age of data, alerting when SLAs are breached (data is more than 24 hours old) or approaching breach thresholds.
Schema metrics track column additions, removals, and type changes, surfacing potentially breaking changes in source APIs.
PII metrics count detected sensitive fields and applied protections, ensuring governance policies are consistently enforced.
These metrics feed into alerting systems that notify teams proactively. A critical alert fires when a pipeline fails all validations and produces zero output. A warning fires when freshness approaches SLA thresholds or when unexpected schema changes are detected. For small teams, these alerts prevent surprises and enable proactive management rather than reactive firefighting.
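The alerting thresholds described here reduce to a few comparisons over run metadata. A hedged sketch, where the `run` dictionary keys are illustrative rather than dlt's actual trace schema:

```python
from datetime import datetime, timedelta, timezone

def check_alerts(run, sla_hours=24, warn_ratio=0.8):
    """Derive (severity, message) alerts from run metadata:
    zero-output runs and freshness SLA breaches/approaches."""
    alerts = []
    if run["rows_loaded"] == 0:
        alerts.append(("critical", "pipeline produced zero output"))
    age = datetime.now(timezone.utc) - run["last_loaded_at"]
    if age > timedelta(hours=sla_hours):
        alerts.append(("critical", "freshness SLA breached"))
    elif age > timedelta(hours=sla_hours * warn_ratio):
        alerts.append(("warning", "freshness approaching SLA threshold"))
    return alerts
```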
6.5 Alternative pattern: dlt quality gates between medallion layers
Two deployment strategies
While the previous section described using dlt as a quality gate at the point of ingestion, there's an equally valid and often more practical pattern: loading raw API data directly to Bronze, then using dlt’s quality framework as a gatekeeper between Bronze and Silver layers.
This alternative approach recognizes a key reality: sometimes you want to preserve the raw API response exactly as received, without any transformation or validation that might obscure what the source system actually sent.
Pattern A: dlt at ingestion (Bronze as Validated Data)
In the first pattern, dlt validates data before it ever touches the lakehouse:
API → dlt Quality Gate → Bronze (clean) → Spark → Silver (enriched)
This approach ensures Bronze contains only validated, quality checked data. The advantages are immediate protection and simplified downstream processing. Spark transformations never encounter invalid data because it was rejected before reaching Bronze.
However, this pattern has a tradeoff: you lose visibility into what the API actually returned. If a source starts sending malformed data, you only see rejections in logs; you can't query the raw responses to understand patterns of failure.
Pattern B: dlt between Layers (Bronze as True Raw)
The alternative pattern treats Bronze as a true landing zone for completely raw, unvalidated data:
API → Direct Load → Bronze (raw) → dlt Quality Gate → Silver (validated) → Spark → Gold (refined)
Here, the initial ingestion is fast and simple: just dump whatever the API returns into Bronze tables. No validation, no transformation, no filtering. Bronze becomes an immutable audit trail of exactly what external systems sent.
Then, dlt operates as a quality gate between Bronze and Silver. A scheduled notebook reads from Bronze tables, applies the full DQ lifecycle (schema validation, business rules, PII masking, and uniqueness checks), and writes only passing records to Silver. Failed records are logged to a quarantine table for investigation.
When to Use Each Pattern
Use Pattern A (dlt at Ingestion) when:
- The source API is known to be unreliable or frequently sends bad data
- Storage costs are a concern (you don't want to store invalid data)
- PII must be masked immediately for compliance reasons
- You need fast detection of source degradation
- Your team is comfortable with in-flight validation
Use Pattern B (dlt Between Layers) when:
- You want an immutable record of what sources actually sent
- Debugging requires examining raw API responses
- Multiple downstream processes need different quality standards
- You're migrating from an existing "dump everything to Bronze" pattern
- Compliance requires preserving original data for audit trails
- Your ingestion pipeline is already fast and reliable
Implementation Example: Bronze to Silver with dlt
The code structure for Pattern B is straightforward. Your Bronze ingestion becomes trivial:
```python
# Simple Bronze ingestion - no validation
df_raw = spark.read.json(api_response)
df_raw.write.mode("append").saveAsTable("bronze.github_prs")
```
Then, your dlt pipeline reads from Bronze and applies quality gates:
```python
import dlt
from pyspark.sql import SparkSession

@dlt.resource(
    name="silver_pull_requests",
    primary_key="id",
    write_disposition="merge"
)
def validate_bronze_to_silver():
    """Read from Bronze, validate, write to Silver"""
    # Read raw Bronze data
    spark = SparkSession.builder.getOrCreate()
    df_bronze = spark.table("bronze.github_prs")
    for row in df_bronze.collect():
        pr = row.asDict()
        try:
            # Apply full DQ lifecycle
            validated = validate_structural_integrity(pr)
            validated = validate_semantics(validated)
            validated = filter_bad_rows(validated)
            validated = apply_pii_governance(validated)
            yield validated
        except ValueError as e:
            # Log to quarantine instead of rejecting
            quarantine_record(pr, error=str(e))
            continue
```
This pattern gives you the best of both worlds: complete raw data preservation in Bronze, and rigorously validated data in Silver ready for analytics.
The Quarantine Table Pattern
A critical component of Pattern B is the quarantine table, a dedicated location for records that failed validation. Rather than silently dropping bad records, you preserve them for investigation:
```python
import json
from datetime import datetime, timezone

def quarantine_record(record, error, layer="bronze_to_silver"):
    """Store failed records for investigation"""
    quarantine_entry = {
        "original_record": json.dumps(record),
        "validation_error": error,
        "layer_transition": layer,
        "quarantine_timestamp": datetime.now(timezone.utc).isoformat(),
        "record_id": record.get("id")
    }
    # Write to quarantine table
    spark.createDataFrame([quarantine_entry]) \
        .write.mode("append") \
        .saveAsTable("quarantine.failed_validations")
```
Data engineers can then query the quarantine table to understand patterns:
- Is one particular API endpoint consistently sending bad data?
- Did a recent API change break our validation assumptions?
- Are certain fields frequently null when we expect them to be populated?
These insights feed back into improving both validation rules and source system integrations.
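A minimal way to surface those patterns is to aggregate quarantine entries by error message (stdlib sketch; in practice this could equally be a Spark SQL GROUP BY over the quarantine table):

```python
from collections import Counter

def summarize_quarantine(entries):
    """Rank validation errors by frequency to surface recurring
    failure patterns in the quarantine table."""
    return Counter(e["validation_error"] for e in entries).most_common()
```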
Hybrid approach: Both patterns together
Some organizations use both patterns simultaneously for different data sources:
- Trusted, stable APIs (internal systems, well-documented partners): Pattern A with dlt at ingestion
- Unreliable or exploratory sources (new APIs, third-party services, web scraping): Pattern B with raw Bronze preservation
This hybrid approach maximizes efficiency where possible while maintaining flexibility where needed.
Benefits for small teams
Pattern B is particularly valuable for small teams because it provides safety nets:
- Debugging is easier: When Silver data looks wrong, you can query Bronze to see exactly what the API sent
- Validation rules can evolve: You can reprocess Bronze data with updated rules without re-calling APIs
- No data loss: Even if validation rules are too strict initially, raw data is preserved for later recovery
- Incremental migration: Teams can start with simple Bronze loading, then add dlt validation incrementally
The tradeoff is additional storage cost (storing both raw and validated data) and slightly more complex pipelines. For most teams, these costs are minor compared to the operational benefits of having an audit trail and safety net.
7. Visual representation
DAG diagram: Full pipeline with validation gates

Imagine a directed acyclic graph (DAG) that visualizes the complete pipeline flow:
At the top sits the External API (GitHub, Salesforce, etc.), the source of raw data. An arrow flows down to the dlt Ingestion Layer, where resources fetch data via HTTP requests.
Immediately below, the flow splits at the Validation Gates. Here, incoming records face a gauntlet of checks arranged in parallel: Structural Integrity (schema and types), Semantic Validity (business rules), Uniqueness & Relations (primary keys and referential integrity), and Privacy & Governance (PII detection and masking).
Records that pass all gates flow to the Raw/Staging Lakehouse, represented as a safe zone where validated data accumulates. Records that fail any gate are diverted to a Failed Records Log, preventing contamination of clean data.
From the staging area, an arrow leads to Spark Transformations, the business logic that aggregates, joins, and enriches validated data. These transformations produce outputs that flow to the Trusted Lakehouse, the final destination consumed by Power BI dashboards, ML models, and analytics applications.
Throughout the diagram, monitoring feeds flow upward to an Observability Dashboard, capturing metrics from each stage: ingestion volumes, validation pass rates, transformation durations, and freshness indicators.
Flow Highlighting Failure Points
The visual representation emphasizes failure handling. When validation gates reject data, the pipeline doesn't proceed; there's no arrow leading to staging. This hard stop prevents bad data from contaminating downstream systems.
Similarly, if Spark transformations encounter unexpected issues (though they shouldn't, given upstream validation), failures are contained. Monitoring detects the problem, alerts fire, and engineers investigate using rich metadata from the failed run rather than debugging cryptic Spark errors.
This failure-aware design makes the pipeline resilient. Small teams can deploy with confidence, knowing that quality gates will catch issues automatically, that monitoring will surface problems quickly, and that troubleshooting will be efficient thanks to comprehensive logging.
8. Benefits for small teams
Reduces operational firefighting
The most immediate benefit for small teams is dramatically reduced firefighting. Without pre-load validation, engineers spend significant time investigating why dashboards broke, why ML models produced weird results, or why reports show unexpected duplicates.
These investigations are exhausting and time-consuming, often requiring diving into production data, checking API logs, and tracing data lineage manually.
With dlt-powered validation gates, most of these issues simply don't occur. Invalid data is rejected at ingestion.
Clear error messages explain exactly what went wrong. Engineers fix issues once (update the validation rule or fix the API integration) rather than repeatedly debugging symptoms downstream.
This shift from reactive firefighting to proactive prevention is transformative for one or two-person data teams. It means weekends without on-call alerts, days without surprise outages, and time to focus on delivering value rather than maintaining fragile pipelines.
Increases trust in analytics
When data quality is inconsistent, trust erodes. Business users question dashboard numbers. Analysts spend time validating results rather than generating insights. Executives hesitate to make decisions based on data they suspect might be wrong.
Reliable pre-load validation rebuilds this trust. When analysts know that data passes comprehensive quality checks before reaching their queries, they use it with confidence. When executives see metrics that reconcile across reports, they trust data-driven decisions. When ML engineers train models on validated features, they trust predictions.
For small teams, building this trust is critical. Without the reputation capital of large, established data teams, every quality issue damages credibility. Investing in quality upfront through dlt establishes a foundation of reliability that compounds over time.
Simplifies end-to-end pipeline management
Traditional data pipelines involve juggling multiple tools: something for ingestion, something for transformation, something for quality checks, something for monitoring. Each tool has its own configuration, deployment process, and failure modes. For small teams, this complexity is overwhelming.
dlthub consolidates quality management into the pipeline itself. Schema definitions, validation rules, PII policies, and monitoring all live in the same codebase as extraction logic. Deployment is straightforward (commit code, run notebook). Monitoring is built-in (metadata tables automatically populated). Troubleshooting is efficient (comprehensive logs for every run).
This consolidation dramatically reduces cognitive load. Engineers don't context-switch between tools. They don't maintain separate configurations for quality rules and ingestion logic. Everything lives together, versioned in git, tested in development, and promoted to production as a cohesive unit.
9. Conclusion
Microsoft Fabric provides compute and storage but not built-in DQ
Microsoft Fabric is a powerful platform that unifies data engineering, data science, and business intelligence in a single environment. It provides scalable compute through Spark and SQL engines, efficient storage in OneLake, and seamless integration with Power BI for analytics. These capabilities are essential foundations for modern data platforms.
However, Fabric deliberately does not include a prescriptive data quality framework. It provides the infrastructure but leaves quality enforcement to the teams building pipelines. This flexibility is valuable—different organizations have different quality requirements—but it also creates a gap that must be filled.
dltHub completes the DQ lifecycle
dltHub fills this gap by providing a comprehensive, code-first approach to data quality that integrates naturally with Fabric's ecosystem.
From source profiling through schema enforcement, pre-load validation, controlled loading, and continuous monitoring, dlt implements the complete data quality lifecycle within your Python pipelines.
This integration is powerful because it doesn't require adopting a separate quality tool or platform. Quality enforcement lives in the same notebooks, same git repositories, and same deployment pipelines as your data logic. For small teams especially, this cohesive approach reduces complexity while enhancing capability.
Pre-load validation with PII protection is critical
The most impactful element of dlt's quality approach is pre-load validation combined with automatic PII protection. This combination addresses two of the most common data platform failures: bad data reaching production tables, and sensitive data leaking into analytics systems.
By validating data before it enters the lakehouse, pipelines prevent quality issues from cascading downstream. Invalid records are rejected immediately with clear error messages. Valid records proceed with confidence.
Downstream consumers (Spark jobs, Power BI reports, ML models) work with data they can trust.
By detecting and masking PII before storage, pipelines reduce compliance risk and prevent accidental exposure. Even if downstream access controls are misconfigured or exports are mishandled, the sensitive data simply isn't there to leak. Privacy is enforced by default, not as an afterthought.