
Single pane of glass for pipelines running on various orchestrators

  • Adrian Brudaru,
    Co-Founder & CDO

The challenge under discussion

In large organisations, there are often many data teams serving different departments. These teams usually cannot agree on where to run their infrastructure, so each ends up running things somewhere different. For example:

  • 40 generated GCP projects, each using a different mix of services
  • Native AWS services under no particular orchestrator
  • That on-prem machine that’s the only gateway to some strange corporate data
  • And of course the SaaS orchestrator from the marketing team
  • Together with the event-tracking lambdas from the product team
  • And don’t forget the notebooks someone scheduled

So, what’s going on? Where is the data flowing? What data is it?

The case at hand

At dltHub, we are data people and use data in our daily work.

One of our sources is our community Slack, which we use in 2 ways:

  1. We are on Slack’s free tier, where messages expire quickly. We refer to them in our GitHub issues and plan to use the technical answers to train our GPT helper. For these purposes, we archive the conversations daily. We run this pipeline on GitHub Actions (docs), a serverless runner without the short execution time limits of cloud functions.
  2. We measure the growth rate of the dlt community - for this, it helps to understand when people join Slack. Because we are on the free tier, we cannot request this information from the API, but we can capture the event via a webhook. This runs serverless on cloud functions, set up as in this documentation; a minimal sketch follows this list.
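
For illustration, here is a minimal sketch of what such a webhook capture could look like: an HTTP-triggered cloud function that receives the Slack event and loads it with dlt. The pipeline, dataset and table names are made up for this example, and the Slack URL-verification handshake is omitted - treat it as a sketch, not the exact setup from the linked docs.

import dlt
import functions_framework

@functions_framework.http
def slack_webhook(request):
    # Slack posts the event payload as JSON in the request body
    # (the URL verification "challenge" handshake is omitted here)
    event = request.get_json(silent=True) or {}

    # example names only - pick your own pipeline, destination and dataset
    pipeline = dlt.pipeline(
        pipeline_name="slack_member_events",
        destination="bigquery",
        dataset_name="slack_data",
    )

    # load the raw event; dlt infers and evolves the table schema
    pipeline.run([event], table_name="member_joined_events")
    return "OK"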

So we already have two different serverless runtime environments, each with its own “run reporting”.

Not fun to manage. So how do we achieve a single pane of glass?

Alerts are better than monitoring

Since “checking” things can be tedious, we’d rather not have to remember to do it and instead get notified. For this, we can use Slack to send messages. Docs here.

Here’s a gist of how to use it:

from dlt.common.runtime.slack import send_slack_message

def run_pipeline_and_notify(pipeline, data):
    try:
        load_info = pipeline.run(data)
    except Exception as e:
        # post the failure to the Slack channel behind the configured incoming webhook
        send_slack_message(
            pipeline.runtime_config.slack_incoming_hook,
            f"Pipeline {pipeline.pipeline_name} failed!\nError: {str(e)}",
        )
        raise
    return load_info

Monitoring load metrics is cheaper than scanning entire data sets

As for monitoring, we could always run some queries to count the number of loaded rows ad hoc - but this would scan a lot of data and become costly at larger volumes.

A better way would be to leverage runtime metrics collected by the pipeline such as row counts. You can find docs on how to do that here.
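
As a rough sketch of that idea (assuming the pipeline has already run at least once, and that the row counts collected during the normalize step are exposed on the pipeline trace as described in the docs linked above), the metrics can be pushed into the same Slack channel as the alerts:

from dlt.common.runtime.slack import send_slack_message

def notify_row_counts(pipeline):
    # row counts collected during the normalize step, keyed by table name
    row_counts = pipeline.last_trace.last_normalize_info.row_counts
    summary = "\n".join(f"{table}: {count} rows" for table, count in row_counts.items())
    send_slack_message(
        pipeline.runtime_config.slack_incoming_hook,
        f"Pipeline {pipeline.pipeline_name} loaded:\n{summary}",
    )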

If we care, governance is doable too

Now, not everything needs to be governed. But for the Slack pipelines, we want to tag which columns contain personally identifiable information, so we can delete that information and stay compliant.

One simple way to stay compliant is to annotate your raw data schema and use views for the transformed data, so if you delete the data at the source, it’s gone everywhere.

If you are materialising your transformed tables, you would need to have column-level lineage in the transform layer to facilitate the documentation and deletion of the data. Here’s a write-up of how to capture that info. There are also other ways to grab a schema and annotate it, read more here.
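
As a sketch of what tagging the raw schema could look like, the snippet below declares a custom “x-pii” column annotation on a hypothetical Slack resource and then lists the flagged columns so a deletion job knows what to scrub. The “x-pii” key is our own naming convention, not a built-in dlt hint, and we assume custom “x-” prefixed properties are carried through to the stored schema - check the schema docs linked above before relying on this.

import dlt

@dlt.resource(
    name="slack_messages",
    columns={
        # "x-pii" is a custom annotation, not a built-in dlt hint
        "user_email": {"data_type": "text", "x-pii": True},
        "text": {"data_type": "text", "x-pii": True},
    },
)
def slack_messages():
    # placeholder rows - the real resource would yield archived Slack messages
    yield {"user_email": "jane@example.com", "text": "hello from the community"}

pipeline = dlt.pipeline(pipeline_name="slack", destination="duckdb", dataset_name="slack_data")
pipeline.run(slack_messages())

# list the columns flagged as PII so a deletion/compliance job knows what to scrub
pii_columns = [
    (table["name"], column_name)
    for table in pipeline.default_schema.data_tables()
    for column_name, column in table["columns"].items()
    if column.get("x-pii")
]
print(pii_columns)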

In conclusion

There are many reasons why you’d end up running pipelines in different places, from organisational disagreements to skillset differences or simply technical restrictions.

Having a single pane of glass is not just beneficial but essential for operational coherence.

While solutions exist for different parts of this problem, the data collection still needs to be standardised and supported across different locations.

By using a tool like dlt, you introduce standardisation at the ingestion layer, enabling cross-orchestrator observability and monitoring.

Want to discuss?

Join our Slack community to take part in the conversation.