The rise of the Semantic engineer
Agents now write the pipelines, models, and dashboards. What they can't write is what your data means. Meet the data role that's emerging: the semantic engineer.
Adrian Brudaru,
Co-Founder & CDO

Data roles have never been stable. They re-form every time the tooling economics flip, and they're flipping now.
The arc so far: in the MIS era, reporting lived inside IT and a business question took weeks to answer. The 90s professionalized it into the first specialist team — warehouse architect, ETL developer, DBA, report developer. Then the 2000s suites re-bundled the loop so tightly that one person could span it: the one-man BI team, on SQL Server or Cognos, talking to the CFO in the morning and shipping the dashboard in the afternoon. The modern data stack broke that chair apart again — cloud warehouse, ingestion tool, transformation framework, orchestrator, BI layer, each with its own operator — and we got the five-role team: data engineer, analytics engineer, BI developer, analyst, and eventually a data product person to coordinate the other four.

Every split was locally rational. The cost was global: every handoff drops business meaning, every question queues at every station, and the people who understand the business end up furthest from the data, connected to it by tickets. The pattern across all of it: roles fragment when the stack fragments, and re-converge when the tooling re-bundles. The tooling is re-bundling.
The middle of the stack is what gets automated
Agents now write the plumbing. In January 2026, 91% of the 81,000 new dlt pipelines shipped by the community were built by agents.

We evaled this: generating a pipeline costs $2–3 of agent time, and with an engineering toolkit loaded, the generated pipelines passed 100% of our behavioral checks (current docs read, no credentials leaked, data sampled before loading, persistent state initialized) versus 68% without it.
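To make "behavioral checks" concrete, here is a minimal sketch of how checks like these could be scored. It is not our eval harness: the `AgentRun` record, the tool-call names (`read_docs`, `sample_data`, `load_data`, `init_state`), and the secret-matching pattern are all assumptions made up for illustration.

```python
import re
from dataclasses import dataclass, field


# Hypothetical record of one agent run; the real harness is not shown in this post.
@dataclass
class AgentRun:
    tool_calls: list[str] = field(default_factory=list)  # ordered tool names the agent invoked
    generated_code: str = ""


# Rough pattern for a hard-coded secret. An assumption for illustration, not a real scanner.
SECRET_PATTERN = re.compile(
    r"(api_key|password|token)\s*=\s*['\"][^'\"]+['\"]", re.IGNORECASE
)


def docs_read(run: AgentRun) -> bool:
    return "read_docs" in run.tool_calls


def no_credentials_leaked(run: AgentRun) -> bool:
    return SECRET_PATTERN.search(run.generated_code) is None


def sampled_before_loading(run: AgentRun) -> bool:
    calls = run.tool_calls
    if "sample_data" not in calls or "load_data" not in calls:
        return False
    return calls.index("sample_data") < calls.index("load_data")


def state_initialized(run: AgentRun) -> bool:
    return "init_state" in run.tool_calls


CHECKS = [docs_read, no_credentials_leaked, sampled_before_loading, state_initialized]


def pass_rate(runs: list[AgentRun]) -> float:
    """Share of (run, check) pairs that passed."""
    results = [check(run) for run in runs for check in CHECKS]
    return sum(results) / len(results)


careful = AgentRun(
    tool_calls=["read_docs", "init_state", "sample_data", "load_data"],
    generated_code='api_key = os.environ["API_KEY"]',
)
careless = AgentRun(tool_calls=["load_data"], generated_code='api_key = "sk-123"')
```

Each check is a plain predicate over what the agent did, not over whether the pipeline happens to run, which is why the same run can load data successfully and still fail the eval.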
The same automation is moving up the stack, because transformations, semantic models, and dashboards share the property that made pipelines automatable: each is a translation of business meaning into a technical artifact, and generation is good at translation. What agents cannot generate is meaning: what your data means, which records count, where the historical breaks in your systems are, why a definition excludes what it excludes. Those decisions exist, mostly undocumented, in people's heads.
The middle of the stack gets automated: pipeline coding, modeling, dashboard building, and every job and tool whose function is translating meaning from one specialist to the next. What stays human sits at the two ends.
- Downstack: the infrastructure that runs agents safely (permissions, contracts, verification).
- Upstack: the meaning agents need (definitions, context, the knowledge layer).
Work at either end is growing. Work in the middle is being generated.
The teams aren't shrinking, they're recomposing.
The bottom line: AI isn't cutting data jobs so much as splitting the work in two. Automatable work is shrinking. Augmentable work is growing. That single split explains every number below.
It looks contradictory at first. Between 2022 and 2025, the number of computer science graduates looking for work rose about 40% while entry-level software engineering postings fell about 65%: more people, far fewer doors. A Stanford payroll study confirms it: employment of 22-to-25-year-old developers is down nearly 20% from its 2022 peak, while employment of experienced developers held steady. Yet over the same period, a separate study found that most data teams grew. Cuts at the bottom, growth overall, both true at once.

The split is why. The same Stanford study says it directly: the losses land where AI replaces people, not where it helps them. Inside almost every team, the automatable half of the work is deleted and the augmentable half grows.
Entry-level work took the first hit because it was the most codified and therefore the easiest to automate. That's not the death of the entry level. It's the death of the automatable version of it, and the augmentable version of that same job is exactly where the semantic engineer path leads.
Who does what
The big central team gives way to smaller teams that go end to end. The org fragments while the workflow de-fragments — one team with fragmented work becomes small teams with unified work. This is the data mesh outcome arriving by economics rather than manifesto: domain ownership failed in 2021 because every domain needed the five-specialist stack. But at $3 a pipeline and maintenance as a conversation, a 1-3 person pod can own its data.

What survives centrally is exactly two functions, and that's where the roles land:
Data engineers move from making pipelines to making the factory: the agent harness, permissions, data contracts, CI for generated code, cost, blast radius. Fewer of them, far more leveraged. The ones still hand-typing connectors are competing with a $3 process that doesn't sleep.
Analytics engineers hold the rarest combination in the building, domain knowledge plus rigor, and their daily output is the layer being compiled away. They've made this move before: analyst to AE, when dbt made engineering practices reachable. The same upskill again, one level up: from writing the models to owning the meaning the models are generated from, and the rigor that holds every generated artifact to it.
Analysts split. The half of the job that pulls numbers on request — write the query, build the chart, close the ticket — is gone. The investigative half gets more valuable: why did the number move, what's causal, what should we do. Fewer analysts, embedded in the business, owning decisions instead of dashboards.
Across all of them, the verb changes more than the title: from build to author and review. Humans write meaning, agents write code, and humans approve the implementation with their name on it.
On a small team, it's one person
At a startup or on a mid-sized team, the roles collapse into one job with three parts.
You collect the meaning. Sit with the person who owns each system. Write down what they know. The agent can read the structure on its own, but only a human can say what the data actually means.
You manage the meaning and the systems. Keep the definitions in version control. Review changes like code. The pipelines and models are generated from those definitions, and contracts guard the edges.
You check precision, both ways. The definitions have to say exactly what the business means. And what the agent generates has to match the definitions. In practice that means writing evals — the same skill now central to AI engineering, applied to data.
Collect, manage, check. The agent does the rest.
That's a job one person can hold again — the first time since the old BI suites. And it's a better version. The 2010 one-person team kept everything in their head, so everything left when they did. This one keeps it in a repo: the meaning written down, the models generated from it, the evals proving they match. Same chair, but the knowledge stays.
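A minimal sketch of what "the evals proving they match" can look like, using only the Python standard library. The definition format, the metric, the fixture rows, and the SQL are all hypothetical stand-ins: the point is the shape of the check, not a specific tool.

```python
import sqlite3

# Hypothetical versioned definition, written by a human. Field names are illustrative.
DEFINITION = {
    "metric": "active_customers",
    "meaning": "Distinct customers with at least one paid, non-test order.",
    "include": lambda order: order["status"] == "paid" and not order["is_test"],
}

# Stand-in for agent-generated SQL that is supposed to implement the definition.
GENERATED_SQL = """
SELECT COUNT(DISTINCT customer_id)
FROM orders
WHERE status = 'paid' AND is_test = 0
"""

# Small hand-labeled fixture covering the edge cases the definition names.
FIXTURE = [
    {"customer_id": 1, "status": "paid", "is_test": False},
    {"customer_id": 1, "status": "paid", "is_test": False},  # same customer twice
    {"customer_id": 2, "status": "refunded", "is_test": False},
    {"customer_id": 3, "status": "paid", "is_test": True},  # test order, excluded
    {"customer_id": 4, "status": "paid", "is_test": False},
]


def expected_from_definition(orders):
    """Compute the metric directly from the human-written rule."""
    return len({o["customer_id"] for o in orders if DEFINITION["include"](o)})


def actual_from_generated_sql(orders, sql):
    """Run the generated SQL against the same fixture in an in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (customer_id INTEGER, status TEXT, is_test INTEGER)")
    conn.executemany(
        "INSERT INTO orders VALUES (:customer_id, :status, :is_test)", orders
    )
    (value,) = conn.execute(sql).fetchone()
    conn.close()
    return value


def eval_metric(orders, sql):
    """True when the generated SQL agrees with the definition on the fixture."""
    return expected_from_definition(orders) == actual_from_generated_sql(orders, sql)
```

Running `eval_metric(FIXTURE, GENERATED_SQL)` passes, while a plausible-looking but wrong query such as `SELECT COUNT(*) FROM orders WHERE status = 'paid'` fails, because it double-counts the repeat customer and includes the test order. The fixture is where the collected meaning becomes checkable.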
This isn't theory; we've done it. In an early proof of concept, we took a ~20-person company off exactly the first-generation stack described above: dlt ingestion, an Airflow setup, a hand-maintained transformation layer, and a semantic model that lived in one contributor's head. We ran the canonical modeling toolkit against the existing pipeline: we reverse-engineered the SQL into a semantic model, consolidated everything into a handful of clean concepts, and generated the transformation layer from that model.
A generalist maintains the stack now, making scoped changes instead of architecting from scratch. Reliability went from ~85% to 99%+, and time-to-new-metric went from days to hours. The institutional knowledge moved out of one person's head into a versioned artifact the company owns. That last part is the whole point: the semantic engineer's job is to make the knowledge outlive the person who held it.
How we're building for this
This is the world we're building dltHub Pro for, and here's where it is today.
The whole loop runs from chat. You build pipelines, ingest from a source, transform the data, deploy to production, and manage the deployment — all by talking to an agent, with the generated code in your repo where you can read it, version it, and review it. The plumbing that used to be five tools and several roles becomes a conversation, and what you keep at the end is code you own, not a platform you're locked into.

The transformation layer is where the semantic-engineer pattern shows up most directly. Our canonical modeling toolkit is spec-first: guided by an LLM, you author the meaning (the definitions, the concepts, what counts as what). dlt infers the schema and types from the raw data underneath. The agent then generates the model for your spec from the inferred structure. You confirm or edit the code and sync the changes back to the spec, so the spec can be reused for agentic retrieval.
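The spec-first flow can be pictured with a toy example. This is not the toolkit's actual spec format or generator, neither of which is shown in this post: `ConceptSpec`, `generate_model_sql`, and the schema dictionary are hypothetical, and a string template stands in for the agent.

```python
from dataclasses import dataclass


# Hypothetical spec: the human-authored meaning of one concept.
@dataclass(frozen=True)
class ConceptSpec:
    name: str
    source_table: str
    columns: dict[str, str]  # output column -> source column
    filter_sql: str          # what counts as a member of this concept
    meaning: str


def generate_model_sql(spec: ConceptSpec, inferred_schema: dict[str, dict[str, str]]) -> str:
    """Toy stand-in for the agent: render a view from the spec, validated against the schema."""
    table_columns = inferred_schema.get(spec.source_table)
    if table_columns is None:
        raise ValueError(f"table {spec.source_table!r} not found in inferred schema")
    missing = [src for src in spec.columns.values() if src not in table_columns]
    if missing:
        raise ValueError(f"spec references unknown columns: {missing}")
    select_list = ",\n    ".join(f"{src} AS {out}" for out, src in spec.columns.items())
    return (
        f"-- {spec.name}: {spec.meaning}\n"
        f"CREATE VIEW {spec.name} AS\n"
        f"SELECT\n    {select_list}\n"
        f"FROM {spec.source_table}\n"
        f"WHERE {spec.filter_sql}"
    )


# Stand-in for a schema inferred from raw data (table -> column -> type).
INFERRED_SCHEMA = {"orders": {"id": "bigint", "amount": "double", "status": "text"}}

SPEC = ConceptSpec(
    name="paid_orders",
    source_table="orders",
    columns={"order_id": "id", "revenue": "amount"},
    filter_sql="status = 'paid'",
    meaning="Orders that count toward revenue.",
)

print(generate_model_sql(SPEC, INFERRED_SCHEMA))
```

The design point is the direction of dependency: the spec carries the meaning, the inferred schema carries the structure, and the generated SQL is derived from both, so a spec that references a column the data does not have fails before anything is deployed.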
dltHub Pro includes a trial you can start any time: 2 weeks with 30 runtime hours included, no credit card required. Try it today!