Exploring schema evolution with ontology-driven propagation
Write your access policy as a plain-English ontology. Schema evolves; the LLM reads the rules and decides.
Aman Gupta,
Data Engineer
Schema evolution is the part of data engineering that looks rudimentary until it isn't. A developer adds a column. Your analytics view doesn't know about it. If the new column contains PII, you've silently expanded your exposure surface. Either way, someone has to notice and fix it manually.
The design: describe the access policy in plain language, encode it as a natural-language ontology, and use that ontology as the runtime policy applied column by column. Code produces what only code can easily produce - row counts, cardinality ratios, value samples. The LLM reads the ontology and applies it to those measurements. The ontology is the source of truth for both behavior and audit. Edit it in English, and the next rebuild adapts.
The ontology encodes the policy in plain English
Four steps:
- Write a prompt: describe the access policy in plain English
- Generate an ontology: pass the prompt to an LLM to encode the policy as structured rules
- Apply the policy at runtime: for each column, an LLM reads the ontology and decides based on the column's name, dtype, cardinality, and a sample of values
- Analytics-safe data: only columns that pass reach the output

Step 1 - prompt:
I'm building a DuckDB analytics view over a fintech dataset. The schema evolves at any time — new columns can arrive without warning. I need an ontology that encodes the access policy as plain-English subject — predicate — object triples. No CamelCase, no identifier syntax — write it the way you'd write rules for a colleague who hasn't seen the codebase. Group claims under a shared subject with bullets when the subject repeats. Collapse parallel claims into a single list when the same predicate applies to several subjects. Keep the file short and scannable.
The ontology will be applied two ways: a deterministic interpreter handles crisp cases (name patterns, dtypes, cardinality), and an LLM handles ambiguous cases by sampling values. Write rules that work for both readers.
Keep it concise, essential rules only, no exhaustive type enumerations.
Define:
- The view's goal: expose analytics-safe columns, exclude PII and high-cardinality text unless the content is verifiably non-personal, adapt automatically on schema change
- Which column classes pass and which are rejected
- PII fragments — representative examples across name, contact, financial, and identity categories
- A cardinality threshold of 3% (a text column is high-cardinality if more than 3 in 100 values are unique)
- A runtime escalation rule: for high-cardinality text whose name doesn't match a PII fragment, inspect a sample of values; pass if values are identifiers, codes, or non-personal descriptions; reject if values contain personal information
- That new columns are evaluated automatically by the same rules with no code change needed

The ontology it produced:
The view:
- exposes analytics-safe columns
- excludes PII columns
- excludes high-cardinality text columns unless their values are verifiably non-personal
- adapts automatically when the schema changes
Columns that pass: numeric, boolean, date, low-cardinality text.
Columns that are rejected: PII, high-cardinality text containing personal information.
A column is PII if its name contains a fragment like:
- name, full_name, first_name, last_name, username
- email, email_address, phone, phone_number, mobile, address, city, zip, postal_code
- card_number, iban, account_number, routing_number, cvv, pan
- ssn, passport, national_id, tax_id, dob, date_of_birth, ip_address, device_id
A text column is high-cardinality if more than 3% of its values are unique.
For high-cardinality text columns whose name does not match a PII fragment:
- inspect a sample of values
- the column passes if its values are identifiers, codes, machine-generated strings, or non-personal descriptions
- the column is rejected if its values contain personal information: names, contact details, financial identifiers, free-form text describing individuals, or anything resembling the PII categories above
New columns:
- are evaluated by the same rules
- require no code change to be classified
- inherit the policy of their column class

Two layers - what things are (taxonomy) and what to do with them (relationships). The taxonomy defines what makes a column PII, what name fragments flag it, what threshold marks text as high-cardinality. The relationships sit on top: what the view exposes, what it excludes, when to escalate to value inspection. Together, that's the ontology. Readable by a human auditor and by an LLM making runtime decisions.
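The 3% cardinality rule is the kind of claim both readers interpret identically. A minimal sketch of that check (the helper name is hypothetical; the sample values reuse the kyc_status labels from the runs below):

```python
def is_high_cardinality(values) -> bool:
    """More than 3 in 100 values unique -> high-cardinality text."""
    values = list(values)
    return len(set(values)) / len(values) > 0.03

# Four status labels over 200 rows: 2% unique, low-cardinality.
print(is_high_cardinality(["expired", "verified", "failed", "pending"] * 50))  # False
# 200 distinct strings: 100% unique, high-cardinality.
print(is_high_cardinality([f"ref-{i}" for i in range(200)]))  # True
```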
The policy holds when the schema changes
I added the ontology to Claude Code, which already had the pipeline scripts in context, and asked it to write the view logic.
The code:
```python
import json

import anthropic
import dlt

# api_key and ontology_text are defined earlier in the session
client = anthropic.Anthropic(api_key=api_key.value)
MODEL = "claude-sonnet-4-6"


def llm_decide(name, dtype, cardinality, sample):
    # Retry loop for model stability
    for _ in range(3):
        try:
            response = client.messages.create(
                model=MODEL,
                max_tokens=200,
                system=(
                    f"You enforce this data access policy:\n\n{ontology_text}\n\n"
                    "Decide if a column is analytics-safe. Reply with JSON only, no markdown: "
                    '{"decision": "pass" or "reject", "reason": "<one sentence>"}'
                ),
                messages=[{
                    "role": "user",
                    "content": (
                        f"Column name: {name}\n"
                        f"Dtype: {dtype}\n"
                        f"Cardinality ratio: {cardinality:.1%}\n"
                        f"Sample values: {sample}"
                    ),
                }],
            )
            out = json.loads(response.content[0].text)
            # Ensure the model sticks to the allowed ontology decisions
            if out.get("decision") in ["pass", "reject"]:
                return out["decision"], out["reason"]
        except (json.JSONDecodeError, KeyError, anthropic.APIError):
            print(f"  Retrying {name} due to invalid LLM output...")
            continue
    # Fail-closed posture for safety
    return "reject", "classification_failed_after_retries"


def build_safe_view(pipeline_name: str, raw_table: str, safe_view: str):
    con = dlt.attach(pipeline_name).dataset().ibis()
    tbl = con.table(raw_table)
    total = tbl.count().execute()
    safe_columns = []
    for col_name, dtype in zip(tbl.schema().names, tbl.schema().types):
        # Skip dlt's internal bookkeeping columns
        if col_name.startswith("_dlt_") or col_name == "_id":
            continue
        cardinality = tbl[col_name].nunique().execute() / total
        sample = tbl.select(col_name).limit(20).execute()[col_name].tolist()
        decision, reason = llm_decide(col_name, dtype, cardinality, sample)
        marker = "✓ PASS  " if decision == "pass" else "✗ REJECT"
        print(f"  {marker} {col_name} ← {reason}")
        if decision == "pass":
            safe_columns.append(col_name)
    con.create_view(safe_view, tbl.select(safe_columns), overwrite=True)
    print(f"\nVIEW '{safe_view}' created — {len(safe_columns)} columns: {', '.join(safe_columns)}")
```

Two runs: 5 columns first, then 5 more. The question was whether the policy would handle new columns without code changes - including the case where column names don't tell the full story.

It held. Numeric columns passed, PII fragments were rejected. The case that shows why the LLM is doing the evaluation: user_ref clears the name-fragment check and shows 100% cardinality - it even sounds like it could reference a person. Pure cardinality-based rejection would discard it. The LLM sampled the values, saw UUIDs, and passed it as non-personal.
Run 1: pipeline loaded, view built from policy:
| Column | Result | Reason (as inferred) |
|---|---|---|
| id | ✓ Pass | Numeric surrogate key with no personal information |
| transaction_volume_30d | ✓ Pass | Numeric — transaction count aggregates with no PII characteristics |
| kyc_status | ✓ Pass | Low-cardinality text — non-personal status values (expired, verified, failed, pending) |
| full_name | ✗ Reject | PII — name fragment confirmed by sample values |
| email | ✗ Reject | PII — email fragment, values are personal email addresses |
Run 2: same policy, 5 new columns:
Four resolved immediately from name and dtype. One required value inspection.
| Column | Result | Reason (as inferred) |
|---|---|---|
| fraud_risk_score | ✓ Pass | Numeric — machine-generated risk scores with no PII characteristics |
| outstanding_balance_usd | ✓ Pass | Numeric — aggregate balance amounts, not a PII identifier |
| risk_tier | ✓ Pass | Low-cardinality text — non-personal risk tier labels (low, medium, high, critical) |
| user_ref | ✓ Pass | High-cardinality text — all sampled values are machine-generated UUIDs, non-personal |
| phone_number | ✗ Reject | PII — phone fragment confirmed by sample values |
v1 columns unchanged → 3 pass, 2 rejected as before. Cumulative: 7 pass | 3 rejected.
phone_number caught by the fragment rule immediately.
user_ref went to the LLM: 100% cardinality, no PII keyword - values resolved it. No view code changes between runs.
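When the sampled values are literally UUIDs, even the escalation step can be checked mechanically. A hedged sketch using Python's standard uuid module (the helper and the 90% tolerance are assumptions, not part of the pipeline):

```python
import uuid

def looks_like_uuids(sample, tolerance: float = 0.9) -> bool:
    """Pass if at least `tolerance` of sampled values parse as UUIDs."""
    if not sample:
        return False
    hits = 0
    for value in sample:
        try:
            uuid.UUID(str(value))
            hits += 1
        except ValueError:
            pass
    return hits / len(sample) >= tolerance

print(looks_like_uuids(["123e4567-e89b-12d3-a456-426614174000"] * 5))  # True
print(looks_like_uuids(["alice smith", "bob@example.com", "n/a"]))     # False
```

The LLM path earns its keep on samples that are less mechanically recognizable than UUIDs - free-form notes, mixed codes, addresses hiding in description fields.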
What the ontology actually bought
The policy is the contract, not the column list. A SELECT allowlist tells you what's currently safe. An ontology tells you why it's safe. That distinction matters when the schema changes: the ontology stays valid, the allowlist goes stale.
The code produces row counts, cardinality ratios, value samples - things code can produce easily. The LLM reads the natural-language ontology and applies it to those measurements per column. Neither is doing the other's job.
Changing the policy is a one-line edit. Add driver_license to the PII fragment list. The next run catches it.
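A sketch of that edit's effect, assuming the fragments are held in a plain list (names are hypothetical; the real policy lives in the ontology file, not in code):

```python
pii_fragments = ["name", "email", "phone", "ssn"]  # abridged

def flags_pii(column_name: str) -> bool:
    """True if the column name contains any flagged PII fragment."""
    return any(f in column_name.lower() for f in pii_fragments)

print(flags_pii("driver_license_no"))   # False: not yet in the policy
pii_fragments.append("driver_license")  # the one-line policy edit
print(flags_pii("driver_license_no"))   # True: the next run catches it
```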
One practical knob: the value sample comes from limit(20) - increase it if row order in your table is non-random.

The limits worth knowing
This works for what it claims to work for. It doesn't solve everything.
Name-fragment matching catches PII by column name. A non-PII-looking name clears that check - value inspection covers the high-cardinality text case, but a numeric column encoding a sensitive inference (a risk score, a derived identifier) passes regardless: the ontology treats numeric as safe. The cardinality rule catches high-entropy strings, but a small-vocabulary column could still contain sensitive values. Cross-column re-identification isn't addressed at all.
When to use this
If your policy decomposes to pure pattern matches and counts - substring on names, dtype allowlist, a threshold - skip the LLM and run the same ontology through a small deterministic interpreter. We didn't, because high-cardinality text is the case where names lie and only inspecting the values resolves it. Pick the executor that matches your policy's edge.
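For that crisp subset, a deterministic interpreter of the same ontology fits in one function. A minimal sketch, assuming dtype arrives as a simple string and that ambiguous columns are flagged for escalation rather than decided (the names and abridged lists are hypothetical):

```python
PII_FRAGMENTS = ("name", "email", "phone", "address", "ssn", "iban", "dob")  # abridged
SAFE_DTYPES = ("int", "float", "bool", "date", "timestamp")

def classify(name: str, dtype: str, cardinality_ratio: float) -> str:
    """Return 'pass', 'reject', or 'escalate' per the ontology's crisp rules."""
    if any(frag in name.lower() for frag in PII_FRAGMENTS):
        return "reject"      # PII name fragment
    if any(dtype.lower().startswith(t) for t in SAFE_DTYPES):
        return "pass"        # numeric / boolean / date
    if cardinality_ratio <= 0.03:
        return "pass"        # low-cardinality text
    return "escalate"        # high-cardinality text -> inspect values

print(classify("fraud_risk_score", "float64", 0.8))  # pass
print(classify("phone_number", "string", 1.0))       # reject
print(classify("user_ref", "string", 1.0))           # escalate
```

The escalate branch is exactly where this post hands off to the LLM; a pure-pattern policy never reaches it.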
The ontology tells you what safety means. A column list tells you what's currently safe. One expires when the schema changes, the other doesn't. Because the policy is separate from the code, improving it doesn't touch the pipeline.
Try it
The Agentic Data Engineering course covers the full pipeline from raw API to production using the dltHub AI Workbench. Ontology comes in during the transformation section, where you use it to guide modeling decisions.
The full workbench includes toolkits for REST API ingestion, data exploration, and production deployment, so you can go end-to-end without leaving your editor.
The ontology-driven data modelling toolkit is part of the dltHub AI Workbench, available in dltHub Pro, due for release in Q2 and currently in the design partnership stage. It's already being leveraged by commercial data engineering agencies that benefit from standardisation and acceleration. If you're interested, apply for the design partnership!
For more background on the toolkit: Ontology Toolkit Preview