
Removing columns

Removing columns before loading data into a database is a reliable way to drop sensitive or unnecessary fields. In the example below, a source is created with a "country_code" column, which is then excluded from the data before it is loaded.

Let's create a sample pipeline demonstrating the process of removing a column.

  1. Create a source function that generates dummy data as follows:

    import dlt

    # This function creates a dummy data source.
    @dlt.source
    def dummy_source():
        @dlt.resource(write_disposition="replace")
        def dummy_data():
            for i in range(3):
                yield {"id": i, "name": f"Jane Washington {i}", "country_code": 40 + i}

        return dummy_data()

    This resource yields rows with three columns: id, name, and country_code.
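
    To see what the source yields before any filtering, you can iterate it directly (a quick check; dlt sources are iterable, and the expected output follows from the resource definition above):

    for row in dummy_source():
        print(row)
    # {'id': 0, 'name': 'Jane Washington 0', 'country_code': 40}
    # {'id': 1, 'name': 'Jane Washington 1', 'country_code': 41}
    # {'id': 2, 'name': 'Jane Washington 2', 'country_code': 42}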

  2. Next, create a function to filter out columns from the data before loading it into a database as follows:

    from typing import Dict, List, Optional

    def remove_columns(doc: Dict, remove_columns: Optional[List[str]] = None) -> Dict:
        if remove_columns is None:
            remove_columns = []

        # Iterating over the list of columns to be removed
        for column_name in remove_columns:
            # Removing the column if it exists in the document
            if column_name in doc:
                del doc[column_name]

        return doc

    - doc: The document (dict) from which columns will be removed.
    - remove_columns: List of column names to be removed; defaults to None.
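
    Because remove_columns is a plain function, you can sanity-check it on a single row before wiring it into the pipeline. This standalone check is illustrative only and is not part of the pipeline itself:

    row = {"id": 0, "name": "Jane Washington 0", "country_code": 40}
    print(remove_columns(row, ["country_code"]))
    # {'id': 0, 'name': 'Jane Washington 0'}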

  3. Next, declare the columns to be removed from the table, and then modify the source as follows:

    # Example columns to remove:
    remove_columns_list = ["country_code"]

    # Create an instance of the source so you can edit it.
    data_source = dummy_source()

    # Modify this source instance's resource; add_map returns the mapped resource
    data_source = data_source.dummy_data.add_map(
        lambda doc: remove_columns(doc, remove_columns_list)
    )
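
    If your source exposes several resources that should all be filtered the same way, you can attach the same map to each of them instead of picking one. This is a sketch: multi_source is a hypothetical name, and it assumes the same column list applies to every resource; the resources mapping and add_map are existing dlt APIs:

    multi_source = dummy_source()
    for name in multi_source.resources:
        multi_source.resources[name].add_map(
            lambda doc: remove_columns(doc, remove_columns_list)
        )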
  4. You can optionally inspect the result:

    for row in data_source:
        print(row)
    # {'id': 0, 'name': 'Jane Washington 0'}
    # {'id': 1, 'name': 'Jane Washington 1'}
    # {'id': 2, 'name': 'Jane Washington 2'}
  5. Finally, create and run the pipeline:

    # Integrating with a dlt pipeline
    pipeline = dlt.pipeline(
        pipeline_name='example',
        destination='bigquery',
        dataset_name='filtered_data'
    )
    # Run the pipeline with the transformed source
    load_info = pipeline.run(data_source)
    print(load_info)
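
    To try the pipeline without BigQuery credentials, you can swap in a local destination such as duckdb, which dlt supports out of the box. A minimal sketch, assuming the duckdb extra is installed; the pipeline name here is just an example:

    # Hypothetical local variant for testing without cloud credentials
    pipeline = dlt.pipeline(
        pipeline_name='example_local',
        destination='duckdb',
        dataset_name='filtered_data'
    )
    print(pipeline.run(data_source))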

This demo works on GitHub Codespaces, a development environment available for free to anyone with a GitHub account. You'll be asked to fork the demo repository, and from there the README guides you through the next steps.
The demo uses the Continue VS Code extension.

Off to Codespaces!
