Highlights
- Massive cost reductions: High cloud ingestion costs prompted a rethink, and PostHog chose dlt, the open-source library from dltHub.
- Enhanced scalability: dlt's versatility and customization make it well suited to ingesting large datasets into PostHog's data warehouse, which had previously been a stumbling block.
- Community power: PostHog's dlt journey was supercharged by the vibrant community around dlt. Through collaboration and problem solving, PostHog tapped into a wealth of collective knowledge and experience to solve problems and build new features. In turn, that activity helped strengthen the overall dlt ecosystem.
Data Stack
Data sources: Stripe, HubSpot, Zendesk, Snowflake, Postgres, MySQL, Custom Sources
Destinations: S3, R2, GCS, Azure
Orchestration: Temporal
Challenge: Finding the right ingestion partner for a new data warehouse product
PostHog is an all-in-one open-source platform that helps developers build successful products. As of 2024, it has over 200,000 users and is one of the top 0.01% most popular repos on GitHub. The platform includes product analytics, session replays, feedback tools, feature flags, and experimentation. At the end of 2023, Tim Glaser, CTO and co-founder of PostHog, led the charge to add a data warehouse product to the platform.
Glaser put together a Data Warehouse team to work on the product, decide on the infrastructure and implementation, and integrate it into the wider platform.
The vision was for the new product to be a core building block of the platform, offering customers a one-click solution that syncs all user data into a single production database. This would enable teams using PostHog to analyze other data sources alongside their product data, for example by pulling data into PostHog from tools such as Zendesk, HubSpot, and Stripe.
The early prototypes were built on a popular but expensive cloud solution for data ingestion, and this quickly became a major challenge. At the core of the issue was that the warehouse would be ingesting users' data, often in vast datasets, at a high associated cost.
The tool we were using initially was just too much money. For the data warehouse, we’re not ingesting our data; we’re ingesting our users’ data. So we’d incur those costs ourselves and potentially have to pass those on to users. And we didn’t want that.
- Tom Owers, Senior Software Engineer, PostHog
Solution: Go open-source, and leverage the community
To build the data warehouse at a PostHog level of quality and cost, the team needed an ingestion solution that could compete on cost, reliability, and maturity, and that could scale with a company serving very large clients.
The eventual choice of dlt over other tools came down to a mix of unique features, cost-effectiveness, and the ease with which dlt could be added to existing PostHog code.
After discovering dlt, PostHog's team identified two key reasons why it appealed to them: it is a popular open-source library, and it is simple to understand and begin using.
It’s real open source, and it is solving a very complex problem in a very simple way: the problem of moving the data from location A into location B. It’s simple and clear and that’s quite appealing.
- Tom Owers, Senior Software Engineer, PostHog
With dltHub being a relatively young company, the PostHog team had some common concerns about whether the solution could meet all their needs and expectations. Those concerns were quickly alleviated once Tom Owers on the Data Warehouse product team started using dlt and saw how versatile and easy to use it was. He also found the active Slack community invaluable: it allowed him to learn a great deal about dlt and get his questions answered, often in real time.
He was also able to discuss ideas and suggest new features there, and features he needed were added to the library as a result.
New features that were built as part of this process include:
- Backendless delta tables destination with data fusion
- Merge support for delta tables with schema evolution
- Ability to materialize empty tables with predefined schemas
Scalability was another concern, given the scale PostHog needed to support. The data warehouse must be able to ingest large amounts of user data, so any ingestion solution has to scale to meet that need.
dlt, a customizable Python library, was designed for this challenge. It integrates simply with any Python environment and scales with the infrastructure it runs on. In addition, the ease of implementing techniques such as chunking meant that it could handle tens of thousands of API calls and bring in millions of records in a fault-tolerant way.
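To illustrate what chunked, fault-tolerant ingestion can look like, here is a minimal sketch in plain Python. It is not PostHog's code and does not use dlt's API; the names `fetch_with_retry`, `iter_chunks`, and `fetch_page` are hypothetical. It assumes a paginated API where an empty page signals the end of the data, and it mirrors the generator style in which dlt resources are commonly written, yielding one page at a time so that memory use stays bounded.

```python
import time
from typing import Callable, Iterator

Page = list[dict]


def fetch_with_retry(
    fetch_page: Callable[[int], Page],
    page: int,
    max_attempts: int = 3,
    backoff_s: float = 1.0,
) -> Page:
    """Fetch one page, retrying transient connection failures."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch_page(page)
        except ConnectionError:
            if attempt == max_attempts:
                raise  # give up after the final attempt
            time.sleep(backoff_s * attempt)  # simple linear backoff
    # Only reached if max_attempts < 1.
    raise ValueError("max_attempts must be at least 1")


def iter_chunks(
    fetch_page: Callable[[int], Page],
    start_page: int = 0,
    **retry_kwargs,
) -> Iterator[Page]:
    """Yield one page (chunk) of records at a time until an empty page."""
    page = start_page
    while True:
        rows = fetch_with_retry(fetch_page, page, **retry_kwargs)
        if not rows:
            return
        yield rows
        page += 1
```

Because each chunk is yielded as soon as it arrives, a failure on one page only costs a retry of that page rather than a restart of the whole sync, and a caller could resume from a saved `start_page`.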
Rolling out the warehouse
Once in place, dlt proved to be a simple drop-in replacement for the existing cloud solution. Initial testing showed positive performance and results.
PostHog already had Temporal workers orchestrating its workflows, and these could easily be configured to run dlt scripts without additional setup. The team started out with a single source, Stripe, during testing. As they rolled out the tool and began receiving user data, adding more connectors proved straightforward. These connectors leveraged dlt's verified sources, with Python scripts that were easily adapted to fit PostHog's specific requirements. Temporal and dlt are the key building blocks of the product.
Results
With dlt as the ingestion layer, PostHog successfully built its data warehouse product. Because dlt is open source, scalable, and reliable, PostHog could replace its expensive cloud solution with it, empowering users to easily build production data warehouses.
The product left beta in July 2024. Today, the PostHog Data Warehouse product runs over 20,000 jobs daily, ingesting data from sources like Stripe, HubSpot, Postgres, Snowflake, and Zendesk, into destinations like AWS S3, Cloudflare R2, Google Cloud Storage, and Azure.
Customers use it to analyze how sign-ups correlate with MRR and to identify specific in-app actions that qualify leads. They also use it as a tool for analyzing Google AdWords data. The vision is to make building a data warehouse accessible, even for organizations without a CTO to do the tinkering.
Future: Less code, more maintainability
Looking ahead, PostHog’s Tom Owers sees significant potential in dlt’s REST API source and expects it to streamline the codebase by abstracting much of the code used for other sources. This generic, declarative style of code will make maintenance easier and enhance overall reliability.
It is very powerful and has helped us simplify a lot of code. Now we can add the REST API source once and have every other source use the same code.
- Tom Owers, Senior Software Engineer, PostHog
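The idea of declaring sources once and sharing the fetching code can be sketched as follows. This is an illustrative example in plain Python, not dlt's REST API source and not PostHog's implementation: the `SOURCE_CONFIGS` dictionary, the `build_resources` function, the injected `http_get` callable, and the example.com URLs are all hypothetical.

```python
from typing import Callable, Iterator

# Hypothetical per-source declarations: data only, no fetching logic.
SOURCE_CONFIGS = {
    "billing": {
        "base_url": "https://billing.example.com/v1",
        "endpoints": ["customers", "invoices"],
    },
    "support": {
        "base_url": "https://support.example.com/api",
        "endpoints": ["tickets"],
    },
}

# Any function that takes a URL and returns a list of records.
HttpGet = Callable[[str], list[dict]]


def build_resources(
    config: dict, http_get: HttpGet
) -> dict[str, Callable[[], Iterator[dict]]]:
    """Turn one declaration into named record readers using shared code."""

    def make_reader(endpoint: str) -> Callable[[], Iterator[dict]]:
        url = f"{config['base_url']}/{endpoint}"

        def read() -> Iterator[dict]:
            # The same fetching code serves every source and endpoint.
            yield from http_get(url)

        return read

    return {endpoint: make_reader(endpoint) for endpoint in config["endpoints"]}
```

Under this design, adding a source means adding an entry to the declarations rather than writing new fetching code, so fixes to pagination, retries, or authentication in the shared path would apply to every source at once.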
About PostHog
PostHog is an all-in-one open-source platform with tools that help developers build successful products. This includes product analytics, session replays, feedback tools, feature flags, experimentation, and data management.
PostHog has fun with what it does (see Hedgehog mode!) and breaks the conventions of how traditional SaaS companies operate. Despite the contrarian attitude, the work it does across its nine products is serious: it counts heavyweight clients like Airbus and DHL on its roster. PostHog is transparent about its products and processes, and it publishes information on how it uses dlt.