Python Guide: Load Data from Google Sheets to AWS S3 using dlt
This page documents how to use dlt, an open-source Python library, to load data from Google Sheets to AWS S3. Google Sheets lets you create and edit spreadsheets online and share them securely in real time from any device. AWS S3, used through the dlt filesystem destination, stores the loaded data as files, which makes it easy to build data lakes. The destination can write files in JSONL, Parquet, or CSV format. With dlt, the transfer from Google Sheets to AWS S3 is a short Python pipeline that you can run locally and later deploy.
dlt Key Features
- Google Sheets Integration: dlt provides a verified source for Google Sheets that loads data through the Google Sheets API into the destination of your choice.
- Google Storage and Azure Blob Storage: The library supports various storage services, including Google Storage and Azure Blob Storage. You set up your storage credentials and bucket information in the dlt configuration.
- Local File System: dlt can also store files in a local folder if you set bucket_url accordingly. This is especially useful when no secrets are required.
- Easy Initialization: You can initialize a new dlt project with a single command, which sets up the pipeline with the chosen source and destination.
- Support for Multiple Bucket Types: dlt can access various bucket types, including AWS S3, Google Cloud Storage, Azure Blob Storage, and Local Storage. Remote buckets require secret credentials, which you set up in the dlt secrets.
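As a rough illustration of how the bucket type follows from bucket_url, the sketch below maps the URL scheme to a storage backend. The helper name describe_bucket and the lookup table are hypothetical and exist only for this example (they are not part of dlt); the schemes s3, gs, az, and file are assumed to be the ones you use with the filesystem destination.

```python
from urllib.parse import urlparse

# Hypothetical lookup table for this example only; not part of dlt.
BUCKET_TYPES = {
    "s3": "AWS S3",
    "gs": "Google Cloud Storage",
    "az": "Azure Blob Storage",
    "file": "Local Storage",
}


def describe_bucket(bucket_url: str) -> str:
    """Return a human-readable bucket type for a bucket_url."""
    scheme = urlparse(bucket_url).scheme
    # A bare path such as "/tmp/data" has no scheme; treat it as local.
    if scheme == "":
        return BUCKET_TYPES["file"]
    if scheme not in BUCKET_TYPES:
        raise ValueError(f"Unrecognized bucket_url scheme: {scheme!r}")
    return BUCKET_TYPES[scheme]


# "my-example-bucket" is a placeholder bucket name.
print(describe_bucket("s3://my-example-bucket/data"))
```

For this guide the relevant case is the s3 scheme; the other entries only show that the same destination setting selects a different backend.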
Getting started with your pipeline locally
0. Prerequisites
dlt requires Python 3.8 or higher. Additionally, you need to have the pip package manager installed, and we recommend using a virtual environment to manage your dependencies. You can learn more about preparing your computer for dlt in our installation reference.
1. Install dlt
First you need to install the dlt library with the correct extras for AWS S3:
pip install "dlt[filesystem]"
The dlt cli has a useful command to get you started with any combination of source and destination. For this example, we want to load data from Google Sheets to AWS S3. You can run the following commands to create a starting point for loading data from Google Sheets to AWS S3:
# create a new directory
mkdir google_sheets_pipeline
cd google_sheets_pipeline
# initialize a new pipeline with your source and destination
dlt init google_sheets filesystem
# install the required dependencies
pip install -r requirements.txt
The last command will install the required dependencies for your pipeline. The dependencies are listed in the requirements.txt:
google-api-python-client
dlt[filesystem]>=0.3.25
You now have the following folder structure in your project:
google_sheets_pipeline/
├── .dlt/
│   ├── config.toml # configs for your pipeline
│   └── secrets.toml # secrets for your pipeline
├── google_sheets/ # folder with source specific files
│   └── ...
├── google_sheets_pipeline.py # your main pipeline script
├── requirements.txt # dependencies for your pipeline
└── .gitignore # ignore files for git (not required)
2. Configuring your source and destination credentials
The dlt cli will have created a .dlt directory in your project folder. This directory contains a config.toml file and a secrets.toml file that you can use to configure your pipeline. The automatically created versions of these files look like this:
generated config.toml
# put your configuration values here
[runtime]
log_level="WARNING" # the system log level of dlt
# use the dlthub_telemetry setting to enable/disable anonymous usage data reporting, see https://dlthub.com/docs/telemetry
dlthub_telemetry = true
[sources.google_sheets]
spreadsheet_url_or_id = "spreadsheet_url_or_id" # please set me up!
range_names = ["a", "b", "c"] # please set me up!
generated secrets.toml
# put your secret values and credentials here. do not share this file and do not push it to github
[sources.google_sheets.credentials]
client_id = "client_id" # please set me up!
client_secret = "client_secret" # please set me up!
refresh_token = "refresh_token" # please set me up!
project_id = "project_id" # please set me up!
[destination.filesystem]
dataset_name = "dataset_name" # please set me up!
bucket_url = "bucket_url" # please set me up!
[destination.filesystem.credentials]
aws_access_key_id = "aws_access_key_id" # please set me up!
aws_secret_access_key = "aws_secret_access_key" # please set me up!
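If you prefer not to keep secrets in a file, dlt can also read these values from environment variables. The sketch below derives the variable name from a TOML section and key, assuming the dlt convention of upper-casing every part and joining the parts with double underscores. The helper to_env_name is hypothetical and only illustrates the naming rule.

```python
def to_env_name(section: str, key: str) -> str:
    """Build an environment variable name from a TOML section and key.

    Assumes the convention: upper-case each part, join with "__".
    """
    parts = section.split(".") + [key]
    return "__".join(part.upper() for part in parts)


# Sections and keys taken from the generated secrets.toml above.
for section, key in [
    ("destination.filesystem", "bucket_url"),
    ("destination.filesystem.credentials", "aws_access_key_id"),
    ("sources.google_sheets.credentials", "client_id"),
]:
    print(to_env_name(section, key))
```

Setting the printed names in your shell or deployment environment is an alternative to filling in the corresponding entries in secrets.toml.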
2.1. Adjust the generated code to your use case
By default, the filesystem destination will store your files as JSONL. You can tell your pipeline to choose a different format with the loader_file_format property that you can set directly on the pipeline or via your config.toml. Available values are jsonl, parquet and csv:
[pipeline] # in .dlt/config.toml
loader_file_format="parquet"
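The same choice can be made in code. The sketch below passes the format to the run call through its loader_file_format argument and validates it against the three values listed above. The wrapper run_with_format is a hypothetical helper written for this example; it expects a pipeline object created with dlt.pipeline(...), as in the generated script.

```python
from typing import Any

# The formats this guide lists for the filesystem destination.
SUPPORTED_FORMATS = ("jsonl", "parquet", "csv")


def run_with_format(pipeline: Any, data: Any, file_format: str = "parquet") -> Any:
    """Run the pipeline and write files in the requested format.

    `pipeline` is expected to be a dlt pipeline; `data` is any source
    or resource that pipeline.run accepts.
    """
    if file_format not in SUPPORTED_FORMATS:
        raise ValueError(
            f"Unsupported loader_file_format {file_format!r}; "
            f"expected one of {SUPPORTED_FORMATS}"
        )
    return pipeline.run(data, loader_file_format=file_format)
```

Validating the value up front gives a clear error before any data is extracted, instead of a failure later in the run.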
3. Running your pipeline for the first time
The dlt cli has also created a main pipeline script for you at google_sheets_pipeline.py, as well as a folder google_sheets that contains additional Python files for your source. These files are your local copies, which you can modify to fit your needs. In some cases you will only need to make small changes or add some configuration. In other cases the files serve as a working starting point that you will need to adapt further.
The main pipeline script will look something like this:
from typing import Sequence

import dlt
from google_sheets import google_spreadsheet


def load_pipeline_with_ranges(
    spreadsheet_url_or_id: str, range_names: Sequence[str]
) -> None:
    """
    Loads explicitly passed ranges
    """
    pipeline = dlt.pipeline(
        pipeline_name="google_sheets_pipeline",
        destination='filesystem',
        full_refresh=True,
        dataset_name="test",
    )
    data = google_spreadsheet(
        spreadsheet_url_or_id=spreadsheet_url_or_id,
        range_names=range_names,
        get_sheets=False,
        get_named_ranges=False,
    )
    info = pipeline.run(data)
    print(info)

def load_pipeline_with_sheets(spreadsheet_url_or_id: str) -> None:
    """
    Will load all the sheets in the spreadsheet, but it will not load any of the named ranges in the spreadsheet.
    """
    pipeline = dlt.pipeline(
        pipeline_name="google_sheets_pipeline",
        destination='filesystem',
        full_refresh=True,
        dataset_name="sample_google_sheet_data",
    )
    data = google_spreadsheet(
        spreadsheet_url_or_id=spreadsheet_url_or_id,
        get_sheets=True,
        get_named_ranges=False,
    )
    info = pipeline.run(data)
    print(info)

def load_pipeline_with_named_ranges(spreadsheet_url_or_id: str) -> None:
    """
    Will not load the sheets in the spreadsheet, but it will load all the named ranges in the spreadsheet.
    """
    pipeline = dlt.pipeline(
        pipeline_name="google_sheets_pipeline",
        destination='filesystem',
        full_refresh=True,
        dataset_name="sample_google_sheet_data",
    )
    data = google_spreadsheet(
        spreadsheet_url_or_id=spreadsheet_url_or_id,
        get_sheets=False,
        get_named_ranges=True,
    )
    info = pipeline.run(data)
    print(info)

def load_pipeline_with_sheets_and_ranges(spreadsheet_url_or_id: str) -> None:
    """
    Will load all the sheets in the spreadsheet and all the named ranges in the spreadsheet.
    """
    pipeline = dlt.pipeline(
        pipeline_name="google_sheets_pipeline",
        destination='filesystem',
        full_refresh=True,
        dataset_name="sample_google_sheet_data",
    )
    data = google_spreadsheet(
        spreadsheet_url_or_id=spreadsheet_url_or_id,
        get_sheets=True,
        get_named_ranges=True,
    )
    info = pipeline.run(data)
    print(info)

def load_with_table_rename_and_multiple_spreadsheets(
    spreadsheet_url_or_id: str, range_names: Sequence[str]
) -> None:
    """Demonstrates how to load two spreadsheets in one pipeline and how to rename tables"""
    pipeline = dlt.pipeline(
        pipeline_name="google_sheets_pipeline",
        destination='filesystem',
        full_refresh=True,
        dataset_name="sample_google_sheet_data",
    )
    # take data from spreadsheet 1
    data = google_spreadsheet(
        spreadsheet_url_or_id=spreadsheet_url_or_id,
        range_names=[range_names[0]],
        get_named_ranges=False,
    )
    # take data from spreadsheet 2
    data_2 = google_spreadsheet(
        spreadsheet_url_or_id=spreadsheet_url_or_id,
        range_names=[range_names[1]],
        get_named_ranges=False,
    )
    # apply the table name to the existing resource: the resource name is the name of the range
    data.resources[range_names[0]].apply_hints(table_name="first_sheet_data")
    data_2.resources[range_names[1]].apply_hints(table_name="second_sheet_data")
    # load two spreadsheets
    info = pipeline.run([data, data_2])
    print(info)
    # yes the tables are there
    user_tables = pipeline.default_schema.data_tables()
    # check if table is there
    assert {t["name"] for t in user_tables} == {
        "first_sheet_data",
        "second_sheet_data",
        "spreadsheet_info",
    }

if __name__ == "__main__":
    url_or_id = "1HhWHjqouQnnCIZAFa2rL6vT91YRN8aIhts22SUUR580"
    range_names = ["hidden_columns_merged_cells", "Blank Columns"]

    load_pipeline_with_ranges(url_or_id, range_names)
    load_pipeline_with_sheets(url_or_id)
    load_pipeline_with_named_ranges(url_or_id)
    load_pipeline_with_sheets_and_ranges(url_or_id)
    load_with_table_rename_and_multiple_spreadsheets(url_or_id, range_names)
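The first four functions in the script differ only in which arguments they pass to google_spreadsheet. The sketch below summarizes that selection logic as described in their docstrings. The helper describe_selection is hypothetical and does not call the source; it only restates what each combination of range_names, get_sheets, and get_named_ranges asks for when all three are given explicitly.

```python
from typing import List, Optional, Sequence


def describe_selection(
    get_sheets: bool,
    get_named_ranges: bool,
    range_names: Optional[Sequence[str]] = None,
) -> List[str]:
    """List what a given argument combination asks the source to load."""
    selected: List[str] = []
    if range_names:
        selected.append("explicit ranges: " + ", ".join(range_names))
    if get_sheets:
        selected.append("all sheets")
    if get_named_ranges:
        selected.append("all named ranges")
    return selected


# Mirrors load_pipeline_with_ranges, using a range name from the script.
print(describe_selection(False, False, ["Blank Columns"]))
# Mirrors load_pipeline_with_sheets_and_ranges.
print(describe_selection(True, True))
```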
Provided you have set up your credentials, you can run your pipeline like a regular Python script with the following command:
python google_sheets_pipeline.py
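The generated config.toml and secrets.toml mark every unset value with a "# please set me up!" comment, so a quick check can catch values you forgot to fill in. The following is a minimal sketch under that assumption. The function names are hypothetical, and the check only looks for the marker comment, so it will also flag a line where you set the value but left the comment in place.

```python
from pathlib import Path
from typing import List

# Marker comment that the generated files attach to unset values.
PLACEHOLDER_MARK = "# please set me up!"


def find_placeholders(toml_text: str) -> List[str]:
    """Return the keys whose lines still carry the placeholder comment."""
    keys: List[str] = []
    for line in toml_text.splitlines():
        if PLACEHOLDER_MARK in line and "=" in line:
            keys.append(line.split("=", 1)[0].strip())
    return keys


def check_file(path: Path) -> List[str]:
    """Scan one file; a missing file yields no findings."""
    if not path.exists():
        return []
    return find_placeholders(path.read_text(encoding="utf-8"))


# Run from the project folder that contains the .dlt directory.
for name in ("config.toml", "secrets.toml"):
    unset = check_file(Path(".dlt") / name)
    if unset:
        print(f"{name}: still to configure -> {', '.join(unset)}")
```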
4. Inspecting your load result
You can now inspect the state of your pipeline with the dlt cli:
dlt pipeline google_sheets_pipeline info
You can also use Streamlit to inspect the contents of your AWS S3 destination:
# install streamlit
pip install streamlit
# run the streamlit app for your pipeline with the dlt cli:
dlt pipeline google_sheets_pipeline show
5. Next steps to get your pipeline running in production
One of the beauties of dlt is that it is just a plain Python library, so you can run your pipeline in any environment that supports Python >= 3.8. We have a couple of helpers and guides in our docs to get you there:
The Deploy section will show you how to deploy your pipeline:
- Deploy with GitHub Actions: Learn how to deploy your dlt pipeline using GitHub Actions.
- Deploy with Airflow: Follow the guide to deploy your dlt pipeline with Airflow and Google Composer.
- Deploy with Google Cloud Functions: Discover how to deploy your dlt pipeline with Google Cloud Functions.
- Explore other deployment options: Check out the additional resources and guides on how to deploy your dlt pipeline.
The running in production section will teach you about:
- How to Monitor your pipeline: Learn how to effectively monitor your dlt pipeline in production to ensure smooth and efficient operations.
- Set up alerts: Set up alerts to get notified of any issues or anomalies in your dlt pipeline, ensuring you can take immediate action.
- Set up tracing: Implement tracing to gain detailed insights into the execution of your dlt pipeline, making it easier to debug and optimize.
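A simple starting point for monitoring and alerting is to make a run fail loudly when any load job fails, so that your scheduler or CI system can notify you. The sketch below assumes that the object returned by pipeline.run offers a raise_on_failed_jobs() method. The wrapper run_and_verify is a hypothetical helper for this example.

```python
from typing import Any


def run_and_verify(pipeline: Any, data: Any) -> Any:
    """Run the pipeline and raise if any load job failed.

    `pipeline` is expected to be a dlt pipeline; the returned load info
    is assumed to provide raise_on_failed_jobs().
    """
    info = pipeline.run(data)
    # Raises an exception on failed jobs, so the process exits non-zero.
    info.raise_on_failed_jobs()
    return info
```

A non-zero exit code is something most schedulers can already turn into an alert without extra integration work.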
Additional pipeline guides
- Load data from Shopify to Google Cloud Storage in python with dlt
- Load data from IFTTT to YugabyteDB in python with dlt
- Load data from IFTTT to Snowflake in python with dlt
- Load data from HubSpot to YugabyteDB in python with dlt
- Load data from ClickHouse Cloud to Timescale in python with dlt
- Load data from Google Sheets to Azure Cosmos DB in python with dlt
- Load data from Bitbucket to Azure Cosmos DB in python with dlt
- Load data from Microsoft SQL Server to Redshift in python with dlt
- Load data from Qualtrics to Redshift in python with dlt