How Flatiron Health used dlt for privacy-enhancing data processing

Highlights

Improving time-to-data
Flatiron Health cut the time needed to introduce load and normalization pipelines for new data sources from weeks to just days, thanks to dlt’s automatic schema evolution and Python-based approach. This agility freed the team to focus on refining healthcare data insights rather than wrestling with rigid schemas.
Eliminating costly dependencies
By shifting data normalization away from Snowflake, Flatiron realized a 50% reduction in overall pipeline costs overnight. dlt’s multi-CPU capability matched the desired performance without the high spend.
Providing a clear path from prototype to production
With dlt’s DuckDB integration, Flatiron was able to prototype transformations locally before scaling to Snowflake, dramatically cutting iteration time. This smooth migration from simple local tests to robust cloud solutions helped maintain the team’s rapid delivery pace.

Data Stack

Data sources: Electronic Health Record (EHR) systems used in hospitals and clinics in a variety of formats, including JSON, XML, CSV, and even German-specific formats.
Destinations: Snowflake
Orchestration: AWS Step Functions & AWS ECS
Transformation: dbt

Challenge: More data, more complexity, more problems with custom tools

Flatiron Health is a company with a mission to improve cancer care through data-driven insights. That mission requires processing complex healthcare data efficiently and to the highest standards of security. As part of its international expansion, Flatiron encountered challenges in integrating diverse, complex data formats from Electronic Health Records (EHRs) into Snowflake.

Flatiron initially developed a custom ingestion tool that generated dbt models from source data schemas and used Snowflake for data normalization.

However, as the number and complexity of data sources increased, the home-grown solution presented several challenges:

Complexity and Maintainability: With custom solutions at every turn, the tool grew complex and had limited documentation, making it difficult to maintain and extend, especially with a small team.
High Cost: The tooling relied heavily on Snowflake for computation, resulting in high processing costs, especially for normalization.
Limited Agility: The dependence on pre-defined source schemas hindered agile development and slowed down the onboarding of new data sources.

Solution: Go open-source, solve problems today and in the future

Flatiron discovered dlt, an open-source Python library for data movement, at the EuroPython conference in Prague. Recognizing dlt’s potential to overcome the limitations of custom tools, Flatiron conducted a hackathon and developed a prototype that successfully demonstrated dlt’s capabilities.

“I walked past the stand at EuroPython and I don’t exactly remember which piece of information caught my attention, but I saw that instantly, this was my problem, and dlt could be my solution to really help me to replace all the custom ingestions. I knew that I was solving today’s data problems right now, but solving tomorrow’s problems will get harder and harder and harder.”

- Florian Stefan, Staff Engineer, Flatiron Health

According to Flatiron, dlt offered a compelling solution because of its core functionality, such as schema extraction and automatic normalization:

Library Approach: dlt seamlessly integrated into Flatiron's existing Python-based data stack, leveraging their team's expertise and allowing them to customize the pipeline as needed.
Schema Evolution: dlt automatically handled schema changes in source data, enabling Flatiron to ingest data without needing pre-defined schemas, thus fostering a more agile development process.
Flexibility and Extensibility: dlt's support for various data sources and destinations, including Snowflake and DuckDB, provided Flatiron with the flexibility to adapt the pipeline to evolving requirements.
Compliance: Compliance was a significant requirement for any tool Flatiron adopted. Flatiron Health was able to choose dlt because it allows them to keep their data within their controlled environment.
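
To make the schema evolution point concrete, here is a minimal sketch of the general idea: columns are inferred from the data as it arrives, and new fields extend the schema instead of breaking the load. This is a simplified illustration and not dlt's implementation; the function names, type names, and sample records are all invented for the example.

```python
from typing import Any


def infer_type(value: Any) -> str:
    """Map a Python value to a coarse column type name (illustrative only)."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "bool"
    if isinstance(value, int):
        return "bigint"
    if isinstance(value, float):
        return "double"
    return "text"


def evolve_schema(schema: dict, rows: list) -> list:
    """Add columns seen in `rows` but missing from `schema`; return the new names."""
    added = []
    for row in rows:
        for name, value in row.items():
            if value is None or name in schema:
                continue  # nothing to infer, or column already known
            schema[name] = infer_type(value)
            added.append(name)
    return added


# Invented sample records: the second batch introduces a field the first lacked.
schema: dict = {}
first_batch = [{"record_id": 1, "site": "clinic-a"}]
second_batch = [{"record_id": 2, "site": "clinic-b", "unit": "kg"}]

new_in_first = evolve_schema(schema, first_batch)
new_in_second = evolve_schema(schema, second_batch)
```

With no pre-defined schema, the first batch creates the table's columns and the second batch only adds the one column that is new, which is the behavior the list above describes at a high level.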

Implementation: Speedy success

Flatiron replaced its custom ingestion tool with dlt, leveraging it to ingest data from S3 buckets into Snowflake. The data, often in various formats like JSON, is first aggregated and uploaded to S3 in a secure and structured manner, providing a centralized storage layer. dlt's ability to efficiently handle hierarchical data formats like JSON significantly simplified the normalization process, which had been a bottleneck with the previous custom tool.
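
As a rough illustration of what normalizing hierarchical JSON involves, the sketch below flattens nested objects into prefixed columns and moves nested lists into child tables linked back to their parent row. It is a simplified stand-in, not dlt's normalizer: the double-underscore naming loosely follows the convention dlt documents for nested data, while the `_id` and `_parent_id` columns, the function name, and the sample document are assumptions made for this example.

```python
from collections import defaultdict
from itertools import count


def normalize(record: dict, table: str, tables: dict, ids, parent_id=None) -> None:
    """Flatten one JSON document into relational rows stored in `tables`."""
    row = {"_id": next(ids)}
    if parent_id is not None:
        row["_parent_id"] = parent_id  # link child rows to their parent row
    tables[table].append(row)

    def walk(obj: dict, prefix: str) -> None:
        for key, value in obj.items():
            column = f"{prefix}{key}"
            if isinstance(value, dict):
                # Nested object: keep it in the same row with prefixed columns.
                walk(value, f"{column}__")
            elif isinstance(value, list):
                # Nested list: each item becomes a row in a child table.
                for item in value:
                    child = item if isinstance(item, dict) else {"value": item}
                    normalize(child, f"{table}__{column}", tables, ids, row["_id"])
            else:
                row[column] = value

    walk(record, "")


# Invented sample document with one nested object and one nested list.
document = {
    "id": "enc-1",
    "provider": {"name": "Clinic A", "country": "DE"},
    "observations": [
        {"code": "weight", "value": 70},
        {"code": "height", "value": 175},
    ],
}

tables = defaultdict(list)
normalize(document, "encounters", tables, count(1))
```

Running this kind of flattening in Python rather than in warehouse SQL is the shift the paragraph above describes: the warehouse only receives rows that are already relational.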

Flatiron also utilized dlt’s ability to target DuckDB for prototyping transformations in a restricted environment before running them in Snowflake. This enabled parallel development and faster iteration cycles. A strength of dlt is that pipelines and transformations developed and prototyped locally with DuckDB can be scaled to a cloud environment in seconds.
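
A minimal sketch of how such a switch could look in code, assuming the destination is chosen from an environment variable. The variable `PIPELINE_ENV`, the helper names, and the pipeline, dataset, and table names are hypothetical and not taken from Flatiron's setup; the `dlt.pipeline(...)` and `pipeline.run(...)` calls follow dlt's documented interface, and running `load_records` requires dlt with the relevant destination extras installed.

```python
import os
from typing import Iterable, Mapping


def pick_destination(env: Mapping[str, str]) -> str:
    """Use Snowflake only when PIPELINE_ENV is 'prod'; default to local DuckDB."""
    return "snowflake" if env.get("PIPELINE_ENV") == "prod" else "duckdb"


def load_records(records: Iterable[dict], env: Mapping[str, str] = os.environ):
    """Run the same pipeline code against whichever destination was selected."""
    import dlt  # imported here so pick_destination works without dlt installed

    pipeline = dlt.pipeline(
        pipeline_name="ehr_prototype",  # hypothetical name
        destination=pick_destination(env),
        dataset_name="raw_ehr",  # hypothetical dataset
    )
    # Snowflake credentials are expected to come from dlt's config or secrets
    # (for example environment variables); none are hard-coded here.
    return pipeline.run(records, table_name="encounters")
```

Because only the destination name changes, the extraction and normalization code exercised locally against DuckDB is the same code that later runs against Snowflake.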

"We achieved significant improvements with dlt very quickly. It was fast to implement and see results. And there's a clear migration path from a very simple solution to more sophisticated solutions when you actually need that."

- Florian Stefan, Staff Engineer, Flatiron Health

Results: 50% cost reduction is just the start

Implementing dlt yielded significant benefits for Flatiron Health:

Cost Reduction: By moving normalization out of Snowflake, dlt helped Flatiron reduce pipeline costs by 50% overnight.
Improved Performance: Despite moving computation out of Snowflake, overall pipeline runtime remained comparable, with dlt efficiently utilizing multi-CPU architecture for data processing.
Enhanced Agility: dlt's schema evolution capabilities allowed Flatiron to work more iteratively, onboard new data sources faster, and prototype transformations more efficiently.
Compliance: As a library running within Flatiron’s infrastructure, dlt addressed data compliance concerns by eliminating the need to share sensitive data with external vendors.

“dlt’s iterative approach, ability to connect with various technologies, and simple migration path from simple to sophisticated solutions make it a perfect fit for our agile development process. It’s a valuable component in building an open-source, compliant data platform.”

- Florian Stefan, Staff Engineer, Flatiron Health

By incorporating dlt into its data workflow, Flatiron Health has enhanced its ability to efficiently ingest and transform complex healthcare data. dlt's flexibility, cost-effectiveness, and compliance advantages make it an ideal data integration solution for organizations working with sensitive data in dynamic environments.

“dlt is a great option when you need to work iteratively, starting in a simple way without getting stuck with a simple solution. The migration path is available by connecting with more heavyweight technology and it becomes very sophisticated. It’s compliant, and it’s performant.”

- Florian Stefan, Staff Engineer, Flatiron Health

About the customer

Flatiron Health

FLATIRON HEALTH® is a healthtech company dedicated to improving cancer treatment and advancing research. As the pioneer in real-world evidence for oncology, they provide technology and services to support patient care and make every person’s story count. They partner with hundreds of cancer centers, 20+ top global developers of oncology therapeutics, and researchers and regulators around the world.

