Deploy with Dagster

Introduction to Dagster

Dagster is an orchestrator designed for developing and maintaining data assets, such as tables, datasets, machine learning models, and reports. It ensures the processes that produce these assets are reliable, and it centers on software-defined assets (SDAs) to simplify complex data management, improve code reuse, and provide a better understanding of data.

To read more, please refer to Dagster’s documentation.

Dagster Cloud Features

Dagster Cloud offers an enterprise-level orchestration service with serverless or hybrid deployment options. It incorporates native branching and built-in CI/CD to prioritize the developer experience, and it enables scalable, cost-effective operations without the hassle of infrastructure management.

Dagster deployment options: Serverless versus Hybrid

The serverless option fully hosts the orchestration engine, while the hybrid model offers the flexibility to use your own computing resources, with Dagster managing the control plane. This reduces operational overhead while maintaining security.

For more info, please refer to the Dagster Cloud docs.

Using Dagster for Free

Dagster offers a 30-day free trial during which you can explore its features, such as pipeline orchestration, data quality checks, and embedded ELT. You can try Dagster through its open-source version or by signing up for the trial.

Building Data Pipelines with dlt

dlt is an open-source Python library that allows you to declaratively load data sources into well-structured tables or datasets through automatic schema inference and evolution. It simplifies building data pipelines with support for extract and load processes.
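
As a quick illustration of dlt's declarative loading and automatic schema inference, here is a minimal self-contained sketch; the duckdb destination and the sample payload are assumptions chosen so it can run locally:

    import dlt

    # a small nested payload; dlt infers the schema and creates a child table for "labels"
    data = [{"id": 1, "title": "Fix bug", "labels": [{"name": "bug"}, {"name": "p1"}]}]

    # duckdb is used here only as a lightweight local destination
    pipeline = dlt.pipeline(
        pipeline_name="quickstart",
        destination="duckdb",
        dataset_name="demo",
    )
    load_info = pipeline.run(data, table_name="issues")
    print(load_info)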

How does dlt integrate with Dagster for pipeline orchestration?

dlt integrates with Dagster to provide a streamlined process for building, enhancing, and managing data pipelines. Developers can rely on dlt to handle data extraction and loading, and on Dagster's orchestration features to manage and monitor those pipelines efficiently.

Orchestrating a dlt pipeline on Dagster

Here's a concise guide to orchestrating a dlt pipeline with Dagster, using the project "Ingesting GitHub issues data from a repository and storing it in BigQuery" as an example.

More details can be found in the article “Orchestrating unstructured data pipelines with dagster and dlt”.

The steps are as follows:

  1. Create a dlt pipeline. For more, please refer to the documentation: Creating a pipeline.
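
     For this example, the GitHub issues resource might look like the following minimal sketch; the repository URL and the use of plain requests are illustrative assumptions (the GitHub API paginates and rate-limits real requests):

      import dlt
      import requests

      @dlt.resource(name="github_issues", write_disposition="replace")
      def github_issues_resource():
          # fetch open issues from a public repository; the repository here is an assumption
          url = "https://api.github.com/repos/dlt-hub/dlt/issues"
          response = requests.get(url)
          response.raise_for_status()
          yield response.json()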

  2. Set up a Dagster project, configure resources, and define the asset as follows:

    1. To create a Dagster project:

      mkdir dagster_github_issues  
      cd dagster_github_issues
      dagster project scaffold --name github-issues
    2. Define dlt as a Dagster resource:

      from dagster import ConfigurableResource
      import dlt

      class DltPipeline(ConfigurableResource):
          # pipeline configuration exposed as Dagster resource fields
          pipeline_name: str
          dataset_name: str
          destination: str

          def create_pipeline(self, resource_data, table_name):
              # configure the pipeline with your destination details
              pipeline = dlt.pipeline(
                  pipeline_name=self.pipeline_name,
                  destination=self.destination,
                  dataset_name=self.dataset_name,
              )

              # run the pipeline with your parameters
              load_info = pipeline.run(resource_data, table_name=table_name)

              return load_info
    3. Define the asset as:

      from dagster import asset, get_dagster_logger

      @asset
      def issues_pipeline(pipeline: DltPipeline):
          # github_issues_resource is the dlt resource created in step 1
          logger = get_dagster_logger()
          results = pipeline.create_pipeline(github_issues_resource, table_name='github_issues')
          logger.info(results)

      For more information, please refer to Dagster’s documentation.

  3. Next, define Dagster definitions as follows:

    from dagster import Definitions, define_asset_job, load_assets_from_modules

    from . import assets

    all_assets = load_assets_from_modules([assets])
    simple_pipeline = define_asset_job(name="simple_pipeline", selection=['issues_pipeline'])

    defs = Definitions(
        assets=all_assets,
        jobs=[simple_pipeline],
        resources={
            "pipeline": DltPipeline(
                pipeline_name="github_issues",
                dataset_name="dagster_github_issues",
                destination="bigquery",
            ),
        },
    )
  4. Finally, start the Dagster web server:

    dagster dev

    This launches the Dagster UI (at http://localhost:3000 by default), where you can materialize the issues_pipeline asset and monitor runs.
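
If you prefer to trigger the job from code rather than the UI, here is a minimal sketch using Dagster's in-process execution API; it assumes the defs object from step 3 and is mainly useful for quick local testing:

    # execute the asset job in-process; convenient for local verification
    result = defs.get_job_def("simple_pipeline").execute_in_process()
    assert result.success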
info

For the complete hands-on project on “Orchestrating unstructured data pipelines with dagster and dlt”, please refer to the article. The author offers a detailed overview and step-by-step instructions for ingesting GitHub issues data from a repository and storing it in BigQuery. You can use a similar approach to build your own pipelines.

Additional Resources

  • A general configurable dlt resource orchestrated on Dagster: dlt resource.

  • Configure dlt pipelines for Dagster: dlt pipelines.

  • Configure MongoDB source as an Asset factory:

    Dagster's @multi_asset declaration lets you turn each collection in a database into a separate asset. This makes the pipeline easier to debug when a load fails and keeps the collections independent of one another. A sketch of this pattern follows below.
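
    A minimal sketch of such an asset factory, assuming a hypothetical list of collection names and a hypothetical mongodb_collection helper that returns a dlt resource for a single collection:

      from dagster import AssetOut, Output, multi_asset

      # hypothetical collection names; in practice you would discover these from the database
      COLLECTIONS = ["movies", "comments", "users"]

      @multi_asset(outs={name: AssetOut() for name in COLLECTIONS})
      def mongodb_collections(pipeline: DltPipeline):
          # load each collection through the shared dlt resource, emitting one asset per collection
          for name in COLLECTIONS:
              # mongodb_collection is a hypothetical helper returning a dlt resource
              load_info = pipeline.create_pipeline(mongodb_collection(name), table_name=name)
              yield Output(value=load_info, output_name=name)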

note

These are external repositories and are subject to change.

