Highlights
- Optimized data extraction: Artsy replaced a slow, custom-built data pipeline with dlt, significantly reducing data load times.
- Standardized data sources: Adding a new data source now requires just a few lines of code, a vast improvement over the previous process.
- Customizable solution: dlt's customization capabilities allowed Artsy to build an interface for their teams to add data sources more quickly.
Data Stack
Data sources: 15+ different sources including Postgres, MongoDB, and REST APIs
Destinations: S3, Redshift
Transformation: dbt
Orchestration: Airflow
Challenge: A Legacy Data Stack
Artsy, a New York-based online marketplace for buying and selling art, needed to modernize its data infrastructure. The core issue was a custom-built Ruby data pipeline that had evolved over a 10-year period. The legacy system required a daily full re-extraction and reload of data from 15 different sources, staging everything in S3 before loading it into Redshift, which led to very long processing times.
The inefficiency of this process pushed Artsy to seek a more modern and effective solution: one that was faster, cost-effective, customizable, and easy for stakeholders to maintain both now and into the future.
The need for a more efficient system became critical. As a Senior Data Engineer at Artsy explained, the team was spending too much time on data extraction and maintenance, especially with one source whose complex API caused frequent breakages.
The thing that really spurred us to action, after doing an audit of our sources, was that the daily extraction and load times were starting to exceed 2.5 hours.
- Senior Data Engineer, Artsy
With a firm grip on the changes required, Artsy's data team was tasked with modernizing the data stack.
Solution: Embracing dlt to Deliver Wins
Artsy discovered dlt on the r/dataengineering subreddit while researching data integration tools. The team evaluated other popular SaaS ingestion tools, but none met their needs: some were too expensive, others couldn't be customized enough, many were difficult-to-troubleshoot black boxes, and some were simply too complicated to set up. In the end, dlt was the best option for cost-effectiveness and customization.
Key factors in Artsy’s choice of dlt included:
- Incremental loading: dlt made it easy to set up incremental extracts from various sources, a capability the team considered a deal breaker if absent (a sketch follows the quote below).
- Parallelization: dlt's parallelization features significantly sped up data extraction.
- Customization: because dlt is code, integrating with Artsy's existing custom Spark jobs was straightforward.
A key requirement for us to reduce our load times was incremental loading, and dlt could handle that and did it well. I was also really impressed how easy it was to parallelize parts of it and that gave us some big improvements right out of the gate.
- Senior Data Engineer, Artsy
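To illustrate the first two factors, here is a minimal sketch of an incremental, parallelized dlt resource. The endpoint, cursor field, and pagination scheme are hypothetical rather than Artsy's actual sources; the sketch also uses dlt's requests helper, which wraps the standard requests library with automatic retries.

```python
import dlt
from dlt.sources.helpers import requests  # dlt's requests wrapper with built-in retries


# Hypothetical REST endpoint, cursor field, and pagination; Artsy's real sources differ.
@dlt.resource(write_disposition="append", parallelized=True)
def artworks(
    updated_at=dlt.sources.incremental("updated_at", initial_value="1970-01-01T00:00:00Z")
):
    # Incremental loading: only request records newer than the last successful load.
    params = {"updated_since": updated_at.last_value, "page": 1}
    while True:
        page = requests.get("https://api.example.com/artworks", params=params).json()
        if not page.get("items"):
            break
        yield page["items"]
        params["page"] += 1


pipeline = dlt.pipeline(
    pipeline_name="example_api",
    destination="filesystem",  # an S3 bucket_url would be supplied via config/secrets
    dataset_name="raw",
)
print(pipeline.run(artworks))
```

Because dlt stores the cursor state between runs, each scheduled run fetches only new or updated records, which is what drives the load-time reductions described below.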
Implementation
Artsy uses dlt primarily for data extraction. Data is loaded into S3 and then transformed by a custom Spark job before being loaded into Redshift; the Spark step handles schema changes, since fields in the source databases were not well defined. dlt's customizable nature made it easy to integrate with Artsy's existing Spark jobs.
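A rough sketch of that extraction stage, assuming dlt's built-in sql_database source and filesystem destination (the table names and bucket path are placeholders, and Artsy's actual Spark integration is internal):

```python
import dlt
from dlt.sources.sql_database import sql_database  # built-in SQL source; credentials come from secrets/config

# Placeholder table names; Artsy's schemas are different.
source = sql_database().with_resources("artworks", "orders")
source.artworks.apply_hints(incremental=dlt.sources.incremental("updated_at"))
source.orders.apply_hints(incremental=dlt.sources.incremental("updated_at"))

pipeline = dlt.pipeline(
    pipeline_name="postgres_to_s3",
    destination=dlt.destinations.filesystem(bucket_url="s3://example-bucket/raw"),
    dataset_name="postgres_raw",
)

# Parquet files land in S3, where a downstream Spark job can reshape them
# before the final load into Redshift.
load_info = pipeline.run(source, loader_file_format="parquet")
print(load_info)
```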

The Artsy team ran an in-depth proof of concept, finding that dlt worked well with their main source types, including Postgres databases, MongoDB, and REST APIs.
Our proof-of-concept was a rigorous process overall. We wanted to truly test whether dlt could deliver for us. On paper, the user reports looked promising, and we did see lots of people using it. But we kept asking, ‘Does it actually work in practice?’. I decided to put it to the test, the team backed it, and in the end, it did exactly what we needed.
- Senior Data Engineer, Artsy
Results
dlt has enabled Artsy to streamline its data pipelines with substantial improvements:
- Reduced loading times: One daily pipeline's runtime dropped from 75 minutes to 1.5 minutes (benchmarked over multiple runs)
- Standardized process: Adding a new extract to an existing source now requires a pull request and just a handful of lines added to a YAML config file, even for people outside the core team (see the sketch after this list)
- Simplified maintenance: dlt's requests helper resolved tricky issues with malformed data from one API
- Faster recovery: Artsy significantly reduced workflow recovery time, minimizing delays when processes fail. Previously, retries could take over an hour; with dlt, recovery now happens in minutes.
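The YAML-driven workflow is an internal Artsy interface, so the following is only a speculative sketch of how a thin wrapper over dlt might look; the config format, table names, and helper function are invented for illustration.

```python
import dlt
import yaml  # requires pyyaml
from dlt.sources.sql_database import sql_database

# Invented config format; Artsy's actual YAML schema is internal and may differ.
CONFIG = yaml.safe_load("""
sources:
  - name: postgres_core
    tables:
      - name: artworks
        cursor: updated_at
      - name: orders
        cursor: updated_at
""")


def build_sources(config):
    # Turn each YAML entry into a dlt source with incremental hints applied,
    # so adding a table is just a few new lines of YAML.
    for entry in config["sources"]:
        source = sql_database().with_resources(*(t["name"] for t in entry["tables"]))
        for table in entry["tables"]:
            source.resources[table["name"]].apply_hints(
                incremental=dlt.sources.incremental(table["cursor"])
            )
        yield entry["name"], source


pipeline = dlt.pipeline(pipeline_name="yaml_driven", destination="filesystem", dataset_name="raw")
for name, source in build_sources(CONFIG):
    pipeline.run(source)
```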
In regards to key data extraction from more complex API sources, moving that over to dlt was surprisingly easy. I was really impressed!
- Senior Data Engineer, Artsy
Future: More Migration
Adopting dlt has helped Artsy successfully address its challenges with slow, inflexible data pipelines. Artsy is working towards reducing its total daily data extraction and load time from 2.5 hours to under 30 minutes, a goal it hopes to comfortably beat. The team currently uses dlt to extract from 3 of its 5 highest-priority data sources and plans to migrate the remaining 2, which are also the most data-intensive. They are also exploring other ways dlt could help, such as populating their staging environment more reliably.
About Artsy
Artsy envisions a future where everyone is moved by art every day. To get there, they’re expanding the art market to support more artists and art around the world. As the leading marketplace to discover and buy fine art, Artsy believes that the process of buying art should be as brilliant as art itself. That’s why they’re dedicated to making a joyful, welcoming experience that connects collectors with the artists and artworks they love.