DuckDB
Setup Guide
1. Initialize a project with a pipeline that loads to DuckDB by running:
   ```sh
   dlt init chess duckdb
   ```
2. Install the necessary dependencies for DuckDB by running:
   ```sh
   pip install -r requirements.txt
   ```
3. Run the pipeline:
   ```sh
   python3 chess_pipeline.py
   ```
Write disposition
All write dispositions are supported.
Data loading
dlt loads data using large INSERT VALUES statements by default. Loading is multithreaded (20 threads by default). If you are okay with installing pyarrow, we suggest switching to the parquet file format; loading is then faster (and also multithreaded).
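For example, switching to parquet for a single load can be done through the loader_file_format argument of pipeline.run. A minimal sketch (the sample rows and table name are placeholders for your actual source):

```python
import dlt

# placeholder rows standing in for your actual source or resource
data = [{"id": 1, "name": "magnus"}, {"id": 2, "name": "hikaru"}]

pipeline = dlt.pipeline(
    pipeline_name="chess",
    destination="duckdb",
    dataset_name="chess_data",
)

# stage this load as parquet files (requires pyarrow) instead of INSERT VALUES
info = pipeline.run(data, table_name="players", loader_file_format="parquet")
print(info)
```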
Names normalization
dlt uses the standard snake_case naming convention to keep identical table and column identifiers across all destinations. If you want to use duckdb's wide range of characters (i.e., emojis) for table and column names, you can switch to the duck_case naming convention, which accepts almost any string as an identifier: `\n`, `\r`, and `"` are translated to `_`; multiple `_` are translated to a single `_`.
Switch the naming convention using config.toml:
```toml
[schema]
naming="duck_case"
```
or via the environment variable SCHEMA__NAMING, or directly in code:
```python
dlt.config["schema.naming"] = "duck_case"
```
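The environment variable route from above could look like this in a POSIX shell (a sketch; adapt to how your environment sets variables):

```sh
export SCHEMA__NAMING="duck_case"
```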
duckdb identifiers are case insensitive, but display names preserve case. This may create name collisions: if, for example, you load JSON with {"Column": 1, "column": 2}, the data will be mapped to a single column.
Supported file formats
You can configure the following file formats to load data to duckdb:
- insert-values is used by default
- parquet is supported. Note: duckdb cannot COPY many parquet files to a single table from multiple threads. In this situation, dlt serializes the loads. Still, that may be faster than INSERT.
- jsonl is supported but does not work if JSON fields are optional; the missing keys fail the COPY instead of being interpreted as NULL.
Supported column hints
duckdb may create unique indexes for all columns with unique hints, but this behavior is disabled by default because it slows down loading significantly.
Destination Configuration
By default, a DuckDB database will be created in the current working directory with the name <pipeline_name>.duckdb (chess.duckdb in the example above). After loading, it is available in read/write mode via `with pipeline.sql_client() as con:`, which is a wrapper over DuckDBPyConnection. See the duckdb docs for details.
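For example, to inspect the data after a load you can re-attach to the pipeline by name and query it. A minimal sketch ("players" is a hypothetical table name standing in for one of your loaded tables):

```python
import dlt

# re-attaches to the pipeline created by chess_pipeline.py
pipeline = dlt.pipeline(pipeline_name="chess", destination="duckdb", dataset_name="chess_data")

# the sql client is scoped to the pipeline's dataset, so unqualified table names resolve there;
# "players" is a hypothetical table name
with pipeline.sql_client() as con:
    rows = con.execute_sql("SELECT COUNT(*) FROM players")
    print(rows)
```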
The duckdb credentials do not require any secret values. You are free to pass the configuration explicitly via the credentials parameter to the dlt.pipeline or pipeline.run methods. For example:
```python
# will load data to files/data.db database file
p = dlt.pipeline(pipeline_name='chess', destination='duckdb', dataset_name='chess_data', full_refresh=False, credentials="files/data.db")

# will load data to /var/local/database.duckdb
p = dlt.pipeline(pipeline_name='chess', destination='duckdb', dataset_name='chess_data', full_refresh=False, credentials="/var/local/database.duckdb")
```
The destination accepts a duckdb connection instance via credentials, so you can also open a database connection yourself and pass it to dlt to use. :memory: databases are supported.
```python
import duckdb

db = duckdb.connect()
p = dlt.pipeline(pipeline_name='chess', destination='duckdb', dataset_name='chess_data', full_refresh=False, credentials=db)
```
This destination accepts database connection strings in the format used by duckdb-engine.
You can configure a DuckDB destination with secret / config values (e.g. using a secrets.toml file):
```toml
destination.duckdb.credentials=duckdb:///_storage/test_quack.duckdb
```
Unique indexes may be created during loading if the following config value is set:
```toml
[destination.duckdb]
create_indexes=true
```
dbt support
This destination integrates with dbt via dbt-duckdb, which is a community-supported package. The duckdb database is shared with dbt. In rare cases, you may see a message that the binary database format does not match the database format expected by dbt-duckdb. You can avoid that by updating the duckdb package in your dlt project with `pip install -U`.
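For example, a dbt package can be run on the same database through dlt's dbt runner. A minimal sketch (the "dbt_chess" package location is an assumption; see the dbt integration docs for venv and profile handling):

```python
import dlt

pipeline = dlt.pipeline(pipeline_name="chess", destination="duckdb", dataset_name="chess_data")

# "dbt_chess" is a hypothetical path to your dbt package; it runs against the same duckdb file
dbt = dlt.dbt.package(pipeline, "dbt_chess")
models = dbt.run_all()
for m in models:
    print(f"{m.model_name}: {m.status} in {m.time}")
```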
Syncing of dlt state
This destination fully supports dlt state sync.