We moved from HubSpot to Attio in 2 weeks using dltHub's agentic transformations. Here's how
One working student, Claude Code, one stakeholder call, and 2 weeks. The migration worked but the workflow we used is the actual point of this post. AI alone wouldn’t have gotten us there.
Nikolas Jack Altran,
Working Student
Sooner or later a stakeholder walks in and says it’s time to migrate the CRM. Whatever you planned for the next month is gone. Custom logic, reporting wiring, sync schedules. All of it back on the bench.
Or that’s how it used to be.
We just moved from HubSpot to Attio in 2 weeks. One working student, Claude Code, one stakeholder in the kickoff, one senior engineer on PRs. The migration worked but the workflow we used is the actual point of this post. AI alone wouldn’t have gotten us there.
This isn’t a “look, AI is amazing” piece. CRM migrations aren’t a “copy the rows” problem. They’re where your GTM team lives every day. Sales reps have notes attached to every contact, meetings they’ve attended logged against accounts, and pipelines they’ve been grooming for months. From the way a deal is shaped, to how a contact relates to a company, to what “owner” means in your org, all of that matters. Get it wrong and you haven't migrated anything; you've deleted institutional knowledge.
Here’s the step-by-step workflow we used, and the three pieces of context AI needs before it can pull weight on something like this.
Why pointing an agent at the spec fails
If you point an AI agent at the spec and tell it to migrate, you’ll get plausible nonsense.
Here’s the easy one to picture: deleted contacts. The textbook answer for tracking history across systems is Slowly Changing Dimensions Type 2: you keep the old row, you version it, and mark it deprecated. An AI agent reading the schema will reach for this. It looks correct. It is correct in a vacuum.
GDPR says no. When a contact is deleted in the source, the deletion has to propagate. No deleted_at flag, no soft delete, no retention anywhere. The AI didn't know that because nothing in the schema or the HubSpot API told it. It's a legal constraint on our business in our jurisdiction, and it lives in the ontology, but we’ll get to that.
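To make the difference concrete, here is a minimal sketch of hard-delete propagation against an SCD2-style table. It is not our production code: the table layout, the propagate_deletions helper, and the use of SQLite (to keep the example dependency-free) are all assumptions for illustration.

```python
import sqlite3


def propagate_deletions(conn, live_source_ids):
    """Hard-delete every version of contacts that no longer exist in the source.

    Hypothetical helper: no soft-delete flag is set and no history is kept.
    """
    known = {row[0] for row in conn.execute("SELECT DISTINCT source_id FROM dim_contact")}
    gone = known - set(live_source_ids)
    conn.executemany(
        "DELETE FROM dim_contact WHERE source_id = ?", [(sid,) for sid in gone]
    )
    conn.commit()
    return gone


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dim_contact "
    "(source_id TEXT, email TEXT, valid_from TEXT, valid_to TEXT, is_current INTEGER)"
)
conn.executemany(
    "INSERT INTO dim_contact VALUES (?, ?, ?, ?, ?)",
    [
        ("1", "a@example.com", "2024-01-01", "2024-06-01", 0),  # old version
        ("1", "a.new@example.com", "2024-06-01", None, 1),      # current version
        ("2", "b@example.com", "2024-01-01", None, 1),
    ],
)

# Contact "1" was deleted in the source, so only "2" is still live.
gone = propagate_deletions(conn, live_source_ids={"2"})
remaining = [row[0] for row in conn.execute("SELECT source_id FROM dim_contact")]
```

The point of the sketch is that both versions of contact "1" disappear, including the historical row that textbook SCD2 would keep.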
That’s one example. The AI gets things plausibly wrong in ways that may pass PR review but break in production, or worse, six months later in an audit. The reason isn’t that the model is not intelligent enough. It’s that it’s missing the right context.

What agentic transformations need before they can pull weight
1. Best practice. How a data engineer with twelve years of experience would structure this for maintainability. Where to put the CDM. How to slice PRs. When to break a transformation out vs. inline it. The AI knows a lot of patterns. It does not know which ones survive contact with a team.
2. Actual data. Your schema’s edge cases. The specific quirks of the APIs you’re working with. Attio’s relationship labels don’t behave like HubSpot’s. The mock data the AI imagines is cleaner than what you’ll get from a real export. Until the AI can see and understand your schema, it’s guessing.
3. Business reality. What your data means. The compliance constraints you’re under. Which fields are load-bearing for which downstream process. Why “former employee” is a relationship label you can’t lose. Even if the AI has the schema, it does not know what that schema represents in the real world.
Strip those three out and you’re back where you started. Put them in and the same model becomes a force multiplier. Below is how we put them in.
The 2-week playbook: Setup, Build, Iterate
Setup
1. Stakeholder meeting — recorded and transcribed.
I sat down with the business stakeholder. Three things came out of that meeting: a scope, an architecture, and a set of entity-level decisions.
Scope: Replicate the HubSpot status quo in Attio. Two things explicitly deferred, in writing:
- Product telemetry from PostHog
- Complex dashboards
Architecture. All work routes through a canonical data model (CDM):
- HubSpot → CDM for the objects not yet synced (Deals, Notes, Tasks, Emails).
- CDM → Attio for everything downstream.
Entity decisions. Logged in the same meeting: the Contact↔Company labels that matter — Primary Company, Former Employee, Partner Manager.
These aren't really calls AI is in a position to make on its own. They require knowing the team, the calendar, the budget for risk. The transcript becomes input to the next step.
A short aside: what a canonical data model (CDM) is, and why it’s the backbone
A canonical data model is a system-neutral common language where every source speaks the same tongue. It’s the technical realization of your organization’s ontology — contacts, companies, deals, owners are defined once, in the names your business uses internally, independent of any software vendor.

Every tool — present or future — wires into the CDM. The CDM stays. The tools change around it.
Inside the CDM, the sales-ops entities look like this (simplified, the real graph has more entities and edges, but this is the core):
Every entity is one row per business concept, defined once. Relationships live in a separate fact_association table so swapping a destination doesn’t ripple through entity definitions. The shape stays stable across migrations; the column lists evolve as the business does.
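A rough sketch of that shape, assuming hypothetical column names and using SQLite only so the example runs anywhere. The real CDM has more entities and columns; the thing to notice is that relationship labels sit in fact_association, not on the entity tables.

```python
import sqlite3

# Hypothetical, heavily simplified CDM: two entities plus one association table.
CDM_DDL = """
CREATE TABLE dim_company (company_sk INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE dim_contact (contact_sk INTEGER PRIMARY KEY, email TEXT NOT NULL);
CREATE TABLE fact_association (
    contact_sk INTEGER NOT NULL REFERENCES dim_contact(contact_sk),
    company_sk INTEGER NOT NULL REFERENCES dim_company(company_sk),
    label      TEXT NOT NULL
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(CDM_DDL)
conn.execute("INSERT INTO dim_company VALUES (1, 'Acme'), (2, 'Globex')")
conn.execute("INSERT INTO dim_contact VALUES (10, 'ada@example.com')")
conn.executemany(
    "INSERT INTO fact_association VALUES (?, ?, ?)",
    [(10, 1, "Former Employee"), (10, 2, "Primary Company")],
)


def companies_for(conn, contact_sk):
    """Return (company name, relationship label) pairs for one contact."""
    rows = conn.execute(
        "SELECT c.name, a.label FROM fact_association a "
        "JOIN dim_company c ON c.company_sk = a.company_sk "
        "WHERE a.contact_sk = ? ORDER BY c.name",
        (contact_sk,),
    )
    return rows.fetchall()


links = companies_for(conn, 10)
```

Adding a new label or a new destination touches fact_association or a mapping, never the definition of a contact or a company.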
Why it matters for a migration like ours:
- Mapping work is bounded. HubSpot → CDM and CDM → Attio. Not HubSpot → Attio with every bespoke transformation that implies.
- This isn’t a one-shot. The next time we migrate — to whatever comes after Attio — half the work is already done. The CDM is the part that doesn’t get thrown away.
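The two-hop idea can be sketched as two small functions that only meet at the CDM. The HubSpot record shape (an id plus a properties dict) and the Attio-side field names are assumptions for illustration, not the exact payloads we send.

```python
def hubspot_to_cdm(record):
    """Source-specific hop: translate a HubSpot-shaped record into CDM names."""
    props = record["properties"]
    return {
        "source_id": record["id"],
        "email": props["email"],
        "firstname": props.get("firstname"),
        "lastname": props.get("lastname"),
    }


def cdm_to_attio(contact):
    """Destination-specific hop: translate a CDM contact into a hypothetical Attio payload."""
    return {
        "email_addresses": [contact["email"]],
        "first_name": contact["firstname"],
        "last_name": contact["lastname"],
    }


hubspot_record = {
    "id": "42",
    "properties": {"email": "ada@example.com", "firstname": "Ada", "lastname": "Lovelace"},
}
payload = cdm_to_attio(hubspot_to_cdm(hubspot_record))
```

Swapping the destination later means writing a new second hop. The first hop, and everything that already reads from the CDM, stays untouched.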
2. Build the ontology from the transcript.
Feed the transcript, the CDM, and the Attio API schema to dltHub Pro’s transformation toolkit (in the AI workbench). Out comes a first-draft ontology — the machine-readable description of the CDM plus the decisions from the transcript. The workbench builds a taxonomy first — canonical concepts stripped of source-specific vocabulary, so “Person” / “User” / “Guest” collapse into one entity before any relationships get drawn — and then constructs the ontology on top. This is what the AI reads at every step from here on.
Then review it, prune it. Fix the parts that don’t match reality.
Then prune more. Context windows aren’t infinite — they’re scarce. Every line in the ontology earns its place. We call this minimum viable context: the smallest representation that still answers the questions the AI will ask.
What the ontology looks like for one entity — abbreviated:
entity: Contact
cdm_table: dim_contact (SCD2, conformed)
natural_key: email (lowercased, trimmed)
sources:
  pre_migration:
    - hubspot_crm_data.contacts
    - salesops_entities.contacts  # existing CDM rows
  post_migration:
    - attio_raw.people
  enrichment:
    - github_data           # developer signals, contributor activity
    - slack_data            # community presence, engagement
    - apollo_people_match   # employment, firmographics
    - progai_waterfall      # waterfall enrichment
    - active_campaigns      # email marketing + consent
    - luma_events_data      # event registrations
    - form_submissions      # landing-page + signup form leads
    - manual_uploads        # sales-team CSVs
attributes:
  identity: source_id, email, firstname, lastname, linkedin
  relationships: company_sk → dim_company, owner_sk → dim_owner
  consent_gdpr: email_master_opt_out, email_marketing_confirmed
  scd2_machinery: valid_from, valid_to, is_current, is_active
write_rules_to_attio:
  - match_key: external_id = "hubspot_crm_data:{source_id}"
  - email multi-value → Attio email_addresses[]
  - phone normalized to E.164 before write
Structured prose per entity. The AI reads this every time it touches Contact-shaped code — it knows the natural key, the relationships, what’s GDPR-sensitive, and what the destination expects. Nothing is implicit.
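As an illustration of how two of those lines translate into code, here is a small sketch of the natural key and the match key. The function names are hypothetical; the rules themselves (lowercased, trimmed email and the hubspot_crm_data:{source_id} format) come straight from the ontology excerpt.

```python
def natural_key(email: str) -> str:
    """natural_key: email (lowercased, trimmed)."""
    return email.strip().lower()


def match_key(source_id: str, source: str = "hubspot_crm_data") -> str:
    """match_key: external_id = "hubspot_crm_data:{source_id}"."""
    return f"{source}:{source_id}"


def dedupe_by_natural_key(rows):
    """Keep the last row seen per natural key (a deliberately simple policy)."""
    by_key = {}
    for row in rows:
        by_key[natural_key(row["email"])] = row
    return by_key


rows = [
    {"source_id": "1", "email": "Ada@Example.com "},
    {"source_id": "2", "email": "ada@example.com"},
]
unique = dedupe_by_natural_key(rows)
```

Which row wins on a key collision is a business decision. "Last one wins" is only a placeholder here, and that kind of rule belongs in the ontology too.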
3. PRs stacked in a deliberate order. The classical way to slice migration PRs is by ETL stage: extract, transform, sync. At this volume one PR ends up touching every entity, and nobody can review it. The rules we used instead:
- Stack the PRs. Each PR sets up what the next one depends on. Top-to-bottom review never overloads.
- Keep PRs small. The reviewer verifies one thing at a time without holding the whole system in working memory.
- Plan the stack before the first commit. This doesn’t work as an afterthought. The slicing is decided when you scope the work, not when you open the PR.
Build
4. Extract missing data with the REST API toolkit. We needed HubSpot’s Deals, Notes, Tasks, and Emails loaded into the CDM. From a single prompt, the REST API toolkit pulls API context via an MCP server (10,000+ configs at dlthub.com/context), reuses an existing pipeline configuration if one exists or reads the docs and builds one, scaffolds auth + pagination + schema + incremental loading, and runs the pipeline so you can inspect the loaded data without leaving your editor. The feedback loop drops from days to minutes. The ontology tells the toolkit what these entities are; the toolkit handles the extraction shape.
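For readers who have not written one of these by hand, this is roughly what "pagination + incremental loading" means in plain Python. It is not the toolkit's output and does not use dlt. The response shape (results plus paging.next.after), the updatedAt field, and the fake_fetch stand-in for an HTTP call are assumptions for illustration.

```python
from typing import Callable, Iterator, Optional


def paginate(fetch_page: Callable[[Optional[str]], dict]) -> Iterator[dict]:
    """Follow a cursor until the API stops returning a next page."""
    after = None
    while True:
        page = fetch_page(after)
        yield from page.get("results", [])
        after = page.get("paging", {}).get("next", {}).get("after")
        if after is None:
            break


def newer_than(records, last_seen: str):
    """Incremental filter: keep records updated after the stored cursor.

    ISO 8601 UTC timestamps in one format compare correctly as strings.
    """
    return [r for r in records if r["updatedAt"] > last_seen]


# Fake two-page API so the sketch runs without network access.
PAGES = {
    None: {
        "results": [
            {"id": "1", "updatedAt": "2024-01-01T00:00:00Z"},
            {"id": "2", "updatedAt": "2024-03-01T00:00:00Z"},
        ],
        "paging": {"next": {"after": "2"}},
    },
    "2": {"results": [{"id": "3", "updatedAt": "2024-05-01T00:00:00Z"}]},
}


def fake_fetch(after: Optional[str]) -> dict:
    return PAGES[after]


deals = list(paginate(fake_fetch))
fresh = newer_than(deals, last_seen="2024-02-01T00:00:00Z")
```

The scaffolded pipeline handles this plus auth, schema inference, and state storage. The sketch is only meant to show what is being taken off your plate.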
5. Generate mock data from the ontology.
The ontology already knows what real contacts and companies look like: natural keys, GDPR-flagged fields, SCD2 columns, write rules, and so on. That's enough for the workbench to build a realistic mock replica of our production database in DuckDB, with both the happy path and the bad path covered. Synthetic rows that hit the same edge cases the real data will hit, not just the easy ones.
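A hand-written sketch of what "happy path plus bad path" can look like, assuming a hypothetical make_mock_contacts generator and made-up edge cases. The workbench derives its cases from the ontology; the ones below are only examples of the kind of row you want in there.

```python
import random


def make_mock_contacts(n: int, seed: int = 0):
    """Generate n clean contacts plus a few deliberately awkward ones."""
    rng = random.Random(seed)  # seeded so fixtures are reproducible
    rows = [
        {
            "source_id": str(i),
            "email": f"user{i}@example.com",
            "phone": f"+4930{rng.randint(1000000, 9999999)}",
            "is_current": True,
            "is_active": True,
        }
        for i in range(n)
    ]
    # Bad path 1: same person as user0, with messy casing, whitespace and phone format.
    rows.append({"source_id": str(n), "email": "  User0@Example.COM ",
                 "phone": "(030) 123 45 67", "is_current": True, "is_active": True})
    # Bad path 2: contact deleted in the source, must never reach the destination.
    rows.append({"source_id": str(n + 1), "email": "deleted@example.com",
                 "phone": None, "is_current": True, "is_active": False})
    # Bad path 3: missing natural key.
    rows.append({"source_id": str(n + 2), "email": None,
                 "phone": None, "is_current": True, "is_active": True})
    return rows


mock = make_mock_contacts(3, seed=1)
```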
6. Turn the ontology into transformations. The transformation toolkit takes four inputs:
- ontology
- taxonomy
- Attio and CDM schema
- field-by-field mapping table
The ontology from Step 2 does the heavy lifting. The workbench reads each entity definition and writes the Python, honoring every field in it. Surrogate keys, GDPR filtering (SCD2 rows with is_active = FALSE never make it out), email lowercasing, phone normalization to E.164, and the per-entity write rules are all generated for you. It runs locally first against DuckDB, so we can see the exact rows that would land in Attio before any of them do. The transformations are SQL on top of dlt and ibis, so switching the destination from DuckDB to BigQuery is a one-line change thanks to dlt's portability, with no dialect rewrites.
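To give a feel for the rules being generated, here is a plain-Python sketch of two of them: the is_active filter and phone normalization. It is not the generated code. The to_e164 helper is a naive approximation with an assumed default country code of 49, and the output field names are hypothetical. Real phone handling is better left to a dedicated library such as phonenumbers.

```python
import re
from typing import Optional


def to_e164(raw: Optional[str], default_country_code: str = "49") -> Optional[str]:
    """Naive E.164 normalization; assumes a default country for national numbers."""
    if not raw:
        return None
    digits = re.sub(r"\D", "", raw)
    if not digits:
        return None
    if raw.strip().startswith("+"):
        return "+" + digits
    if digits.startswith("00"):      # international prefix written as 00
        return "+" + digits[2:]
    if digits.startswith("0"):       # national trunk prefix
        digits = digits[1:]
    return "+" + default_country_code + digits


def rows_for_attio(rows):
    """Only current, active rows leave the CDM; everything else is filtered out."""
    out = []
    for row in rows:
        if not (row["is_current"] and row["is_active"]):
            continue
        out.append({
            "source_id": row["source_id"],
            "email_addresses": [row["email"].strip().lower()],
            "phone_e164": to_e164(row["phone"]),
        })
    return out


cdm_rows = [
    {"source_id": "1", "email": " Ada@Example.com", "phone": "(030) 123 45 67",
     "is_current": True, "is_active": True},
    {"source_id": "2", "email": "gone@example.com", "phone": None,
     "is_current": True, "is_active": False},
]
payloads = rows_for_attio(cdm_rows)
```

Running something like this against local data is what lets you read the exact outgoing rows before anything is written to the destination.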
Iterate
7. Bug → fix → regression test. Every time you find a bug, after you fix it, have the AI write a test for the exact case. The tests run against the mock data from Step 5, so you catch regressions locally without touching the real Attio workspace. The test suite becomes the AI’s memory of what it got wrong here — next time it walks into this code, the failing test will catch the regression before you do.
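A sketch of what one of those regression tests can look like. The bug is invented for illustration (a blank additional-email field producing an empty entry in the destination), and both the split_emails helper and the semicolon-separated input format are assumptions.

```python
def split_emails(primary, additional):
    """Merge a primary email with a semicolon-separated list of additional ones."""
    candidates = [primary] + (additional or "").split(";")
    seen, out = set(), []
    for candidate in candidates:
        key = (candidate or "").strip().lower()
        if key and key not in seen:   # the fix: skip blanks and duplicates
            seen.add(key)
            out.append(key)
    return out


def test_empty_additional_emails_yield_no_blank_entry():
    # Regression: "" used to be split into [""] and sent as an empty address.
    assert split_emails("ada@example.com", "") == ["ada@example.com"]


def test_duplicate_in_additional_is_collapsed():
    assert split_emails("ada@example.com", "Ada@Example.com; ada@work.example") == [
        "ada@example.com",
        "ada@work.example",
    ]
```

Saved in a test file, pytest picks these functions up by their test_ prefix. Naming each test after the bug it pins down keeps the suite readable as a history of what went wrong.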
8. Controlled rollout. We never pointed anything at prod on the first try. Sandbox workspace, dry-run, then a live sync with the limits dialed way down. Only once that looked right and PRs passed did we move to production, and even then with caps on and someone reading sample rows before we lifted them. No entity type goes live without somebody having stared at a sample payload.
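The dry-run and cap mechanics are simple enough to sketch in a few lines. The sync function, its parameters, and the report format are hypothetical; the send callable stands in for whatever actually writes to the destination.

```python
def sync(rows, send, *, dry_run=True, limit=None):
    """Send rows to the destination, defaulting to a dry run with an optional cap."""
    batch = rows if limit is None else rows[:limit]
    report = {
        "planned": len(batch),
        "sent": 0,
        "skipped_by_limit": len(rows) - len(batch),
    }
    if dry_run:
        return report  # nothing leaves the machine
    for row in batch:
        send(row)
        report["sent"] += 1
    return report


rows = [{"id": i} for i in range(5)]
sent_rows = []

dry_report = sync(rows, sent_rows.append)                            # dry run by default
capped_report = sync(rows, sent_rows.append, dry_run=False, limit=2)  # live, capped at 2
```

Making dry_run default to True means a live write has to be asked for explicitly, which is the safer failure mode when an agent is the one calling the function.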
The guard rails that let us move this fast
These are the guard rails that ran the whole time:
- pytest as the AI’s memory of past bugs
- local DuckDB dev loop — full pipeline run in seconds, nothing remote
- Attio sandbox workspace — never touch prod until sandbox is green
- dry-run before any write
- mock data generated from the ontology
- sync limits that cap row count per run
- _dlt_load_id (the per-load identifier dlt attaches to every row) for clean revert of a bad load
- change log for every payload we send
- dead-letter for rows the destination rejects
- self-review before submitting any PR: read your own diff like a stranger would
None of these are especially clever. Together they let us run a tighter ship at every iteration.
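Two of those guard rails, sketched under assumptions: reverting a bad load by its load identifier (dlt stores it in a _dlt_load_id column on each row) and collecting rejected rows in a dead-letter list. The revert_load and write_with_dead_letter helpers are hypothetical, the table name must come from trusted code, and SQLite is used only to keep the example self-contained.

```python
import sqlite3


def revert_load(conn, table, load_id):
    """Delete every row written by one load. `table` must be a trusted name."""
    cur = conn.execute(f'DELETE FROM "{table}" WHERE _dlt_load_id = ?', (load_id,))
    conn.commit()
    return cur.rowcount


def write_with_dead_letter(rows, send):
    """Try each row; keep rejected ones with their error instead of failing the run."""
    dead = []
    for row in rows:
        try:
            send(row)
        except ValueError as exc:
            dead.append({"row": row, "error": str(exc)})
    return dead


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE deals (id TEXT, _dlt_load_id TEXT)")
conn.executemany(
    "INSERT INTO deals VALUES (?, ?)",
    [("1", "load_a"), ("2", "load_b"), ("3", "load_b")],
)
reverted = revert_load(conn, "deals", "load_b")
left = [row[0] for row in conn.execute("SELECT id FROM deals")]


def picky_send(row):
    # Stand-in for a destination that rejects rows without an email.
    if not row.get("email"):
        raise ValueError("email is required")


dead_letters = write_with_dead_letter(
    [{"email": "ada@example.com"}, {"email": None}], picky_send
)
```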
What dltHub Pro + AI didn’t solve
Code review. With AI writing faster than humans read, review became the rate-limiting step in the whole project. Stacked PRs kept under ten files each made any single diff tractable, but somebody still has to sit and read every one — and that somebody needs to be a human that goes at human speed. No amount of workbench polish moves that needle.
There's a catch. "You read the diff, you ship it" only works when the diffs stay small. The rules we used to keep each PR cheap to review — a rough file-count cap, one entity per PR, stacking them in order — are the same rules that kept the total reading load manageable. Drop them and the speed AI gives you stops mattering.
The eight steps, end to end:
Setup
- Stakeholder meeting, recorded. Scope, architecture and entity decisions. The transcript becomes input to the next step.
- Build the ontology. Feed the transcript, the CDM, and the destination schema to the workbench. Prune to minimum viable context — every line must earn its place.
- Plan the PR stack. One entity per PR, small enough to keep the context in the reviewer’s memory, ordered so every PR sets up the next. Decided before the first commit, not at review time.
Build
- Extract missing data with the REST API toolkit. One prompt: auth, pagination, schema, and incremental loading all get scaffolded and implemented (usually in one shot).
- Generate the canonical-to-destination mapping with the transformation toolkit. Runs locally against DuckDB so you see the exact rows before they hit the destination.
- Generate mock data from the ontology. It covers the happy path, the broken cases, and keeps references between entities intact — so you get a working set of test fixtures without writing any
Iterate
- Bug → fix → regression test. Every bug becomes a pytest case. The suite becomes another form of AI memory.
- Controlled rollout. Sandbox → prod with sync limits → spot-check → lift the cap. No entity ships without a human reading a sample payload.
Try it
If you want to see what an AI-native data engineering workflow actually looks like:
- Agentic Data Engineering — our free course teaching the workflow this post documents, end-to-end. If you read one link, read this one.
- dltHub Pro — the AI workbench, REST API toolkit, and transformation toolkit used in this project
- Slack community
The AI didn’t migrate our CRM. The workflow around it did. The workflow is what we built — and what you can build too.