
Full loading

Full loading is the act of fully reloading the data of your tables. All existing data will be removed and replaced by whatever the source produced on this run. Resources that are not selected while performing a full load will not replace any data in the destination.

Performing a full load

To perform a full load on one or more of your resources, choose write_disposition="replace" for those resources:

import dlt
import requests

# `repo` and `headers` are assumed to be defined elsewhere (repository slug and GitHub auth headers)
p = dlt.pipeline(destination="bigquery", dataset_name="github")

issues = []
reactions = ["%2B1", "-1", "smile", "tada", "thinking_face", "heart", "rocket", "eyes"]
for reaction in reactions:
    for page_no in range(1, 3):
        # fetch two pages of issues per reaction, sorted by that reaction's count
        page = requests.get(f"https://api.github.com/repos/{repo}/issues?state=all&sort=reactions-{reaction}&per_page=100&page={page_no}", headers=headers)
        print(f"got page for {reaction} page {page_no}, requests left", page.headers["x-ratelimit-remaining"])
        issues.extend(page.json())

# replace the destination table with the freshly extracted issues
p.run(issues, write_disposition="replace", primary_key="id", table_name="issues")
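
The write disposition can also be declared directly on a resource rather than passed to run. A minimal sketch, assuming a hypothetical get_issues generator that yields the rows collected above:

@dlt.resource(name="issues", write_disposition="replace", primary_key="id")
def get_issues():
    # hypothetical generator yielding the issue dicts collected above
    yield from issues

p.run(get_issues())

Declared this way, the table is fully replaced on every pipeline run without repeating the write disposition in p.run.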

Choosing the correct replace strategy for your full load

dlt implements three different strategies for performing a full load on your tables: truncate-and-insert, insert-from-staging and staging-optimized. The exact behaviour of these strategies can vary between destinations.

You can select a strategy with a setting in your config.toml file. If you do not select a strategy, dlt will default to truncate-and-insert.

[destination]
# set the optimized replace strategy
replace_strategy = "staging-optimized"
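
The same setting can usually also be supplied through an environment variable instead of config.toml, following dlt's section__key naming convention (shown here as an assumption; verify it against your dlt version):

import os

# assumed environment-variable equivalent of the [destination] replace_strategy setting
os.environ["DESTINATION__REPLACE_STRATEGY"] = "staging-optimized"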

The truncate-and-insert strategy

The truncate-and-insert replace strategy is the default and the fastest of the three. With this setting, the destination tables are truncated at the beginning of the load and the new data is inserted consecutively, but not within the same transaction. The downside of this strategy is that your tables will have no data until the load is completed, and if the load fails mid-run you may end up with new data in some tables and no data in others. Such an incomplete load can, however, be detected by checking whether the _dlt_loads table contains the load id stored in the _dlt_load_id column of the replaced tables. If you prefer to have no data downtime, please use one of the other strategies.
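
For example, rows written by an incomplete load can be spotted by looking for _dlt_load_id values that have no matching entry in _dlt_loads. A sketch using the pipeline's SQL client and the issues table from above (table and column names follow dlt's conventions but should be verified against your destination schema):

with p.sql_client() as client:
    # rows whose load id never made it into _dlt_loads came from an incomplete load
    orphaned = client.execute_sql(
        "SELECT count(*) FROM issues AS i "
        "LEFT JOIN _dlt_loads AS l ON i._dlt_load_id = l.load_id "
        "WHERE l.load_id IS NULL"
    )
    print("rows from incomplete loads:", orphaned[0][0])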

The insert-from-staging strategy

The insert-from-staging strategy is the slowest of the three. It loads all new data into staging tables away from your final destination tables and then truncates and inserts the new data in one transaction. It also maintains a consistent state between child and parent tables at all times. Use this strategy if you require consistent destination datasets with zero downtime and the staging-optimized strategy does not work for you. This strategy behaves the same way across all destinations.

The staging-optimized strategy

The staging-optimized strategy has all the upsides of insert-from-staging but implements certain optimizations for faster loading on some destinations. This comes at the cost of the destination tables being dropped and recreated in some cases, which means that any views or other constraints you have placed on those tables will be dropped with them. If you have a setup where you need to retain your destination tables, do not use the staging-optimized strategy. If you do not care about tables being dropped but want the upsides of insert-from-staging with some performance (and cost) saving opportunities, use this strategy. The staging-optimized strategy behaves differently across destinations:

  • Postgres: After loading the new data into the staging tables, the destination tables will be dropped and replaced by the staging tables. No data needs to be moved, so this strategy is almost as fast as truncate-and-insert.
  • BigQuery: After loading the new data into the staging tables, the destination tables will be dropped and recreated with a clone command from the staging tables. This is a low-cost and fast way to create a second independent table from the data of another. Learn more about table cloning on BigQuery.
  • Snowflake: After loading the new data into the staging tables, the destination tables will be dropped and recreated with a clone command from the staging tables. This is a low-cost and fast way to create a second independent table from the data of another. Learn more about table cloning on Snowflake.

For all other destinations, please look at their respective documentation pages to see if and how the staging-optimized strategy is implemented. If it is not implemented, dlt will fall back to the insert-from-staging strategy.
