Snowflake
Setup Guideโ
1. Initialize a project with a pipeline that loads to snowflake by running
dlt init chess snowflake
2. Install the necessary dependencies for snowflake by running
pip install -r requirements.txt
This will install dlt with snowflake extra which contains Snowflake Python dbapi client.
3. Create a new database, user and give dlt access
Read the next chapter below.
4. Enter your credentials into .dlt/secrets.toml
.
It should now look like
[destination.snowflake.credentials]
database = "dlt_data"
password = "<password>"
username = "loader"
host = "kgiotue-wn98412"
warehouse = "COMPUTE_WH"
role = "DLT_LOADER_ROLE"
In case of snowflake host is your Account Identifier. You can get in Admin/Accounts by copying account url: https://kgiotue-wn98412.snowflakecomputing.com and extracting the host name (kgiotue-wn98412)
The warehouse and role are optional if you assign defaults to your user. In the example below we do not do that, so we set them explicitly.
Setup the database user and permissionsโ
Instructions below assume that you use the default account setup that you get after creating Snowflake account. You should have default warehouse named COMPUTE_WH and snowflake account. Below we create a new database, user and assign permissions. The permissions are very generous. A more experienced user can easily reduce dlt
permissions to just one schema in the database.
--create database with standard settings
CREATE DATABASE dlt_data;
-- create new user - set your password here
CREATE USER loader WITH PASSWORD='<password>'
-- we assign all permission to a role
CREATE ROLE DLT_LOADER_ROLE;
GRANT ROLE DLT_LOADER_ROLE TO USER loader;
-- give database access to new role
GRANT USAGE ON DATABASE dlt_data TO DLT_LOADER_ROLE;
-- allow dlt to create new schemas
GRANT CREATE SCHEMA ON DATABASE dlt_data TO ROLE DLT_LOADER_ROLE
-- allow access to a warehouse named COMPUTE_WH
GRANT USAGE ON WAREHOUSE COMPUTE_WH TO DLT_LOADER_ROLE;
-- grant access to all future schemas and tables in the database
GRANT ALL PRIVILEGES ON FUTURE SCHEMAS IN DATABASE dlt_data TO DLT_LOADER_ROLE;
GRANT ALL PRIVILEGES ON FUTURE TABLES IN DATABASE dlt_data TO DLT_LOADER_ROLE;
Now you can use the user named LOADER
to access database DLT_DATA
and log in with specified password.
You can also decrease the suspend time for your warehouse to 1 minute (Admin/Warehouses in Snowflake UI)
Authentication typesโ
Snowflake destination accepts three authentication types
- password authentication
- key pair authentication
- external authentication
The password authentication is not any different from other databases like Postgres or Redshift. dlt
follows the same syntax as SQLAlchemy dialect.
You can also pass credentials as a database connection string. For example:
# keep it at the top of your toml file! before any section starts
destination.snowflake.credentials="snowflake://loader:<password>@kgiotue-wn98412/dlt_data?warehouse=COMPUTE_WH&role=DLT_LOADER_ROLE"
In key pair authentication you replace password with a private key string that should be in Base64-encoded DER format (DBT also recommends base64-encoded private keys for Snowflake connections). The private key may also be encrypted. In that case you must provide a passphrase alongside with the private key.
[destination.snowflake.credentials]
database = "dlt_data"
username = "loader"
host = "kgiotue-wn98412"
private_key = "LS0tLS1CRUdJTiBFTkNSWVBURUQgUFJJ....Qo="
private_key_passphrase="passphrase"
You can easily get the base64-encoded value of your private key by running
base64 -i <path-to-private-key-file>.pem
in your terminal
If you pass a passphrase in the connection string, please url encode it.
# keep it at the top of your toml file! before any section starts
destination.snowflake.credentials="snowflake://loader:<password>@kgiotue-wn98412/dlt_data?private_key=<base64 encoded pem>&private_key_passphrase=<url encoded passphrase>"
In external authentication you can use oauth provider like Okta or external browser to authenticate. You pass your authenticator and refresh token as below:
[destination.snowflake.credentials]
database = "dlt_data"
username = "loader"
authenticator="..."
token="..."
or in connection string as query parameters. Refer to Snowflake OAuth for more details.
Write dispositionโ
All write dispositions are supported
If you set the replace
strategy to staging-optimized
the destination tables will be dropped and
recreated with a clone command from the staging tables.
Data loadingโ
The data is loaded using internal Snowflake stage. We use PUT
command and per-table built-in stages by default. Stage files are immediately removed (if not specified otherwise).
Supported file formatsโ
- insert-values is used by default
- parquet is supported
- jsonl is supported
When staging is enabled:
โ When loading from
parquet
, Snowflake will storecomplex
types (JSON) inVARIANT
as string. Usejsonl
format instead or usePARSE_JSON
to update the `VARIANT`` field after loading.
Supported column hintsโ
Snowflake supports the following column hints:
cluster
- creates a cluster column(s). Many column per table are supported and only when a new table is created.
Table and column identifiersโ
Snowflake makes all unquoted identifiers uppercase and then resolves them case-insensitive in SQL statements. dlt
(effectively) does not quote identifies in DDL preserving default behavior.
Names of tables and columns in schemas are kept in lower case like for all other destinations. This is the pattern we observed in other tools ie. dbt
. In case of dlt
it is however trivial to define your own uppercase naming convention
Staging supportโ
Snowflake supports s3 and gcs as a file staging destinations. dlt will upload files in the parquet format to the bucket provider and will ask snowflake to copy their data directly into the db.
Alternavitely to parquet files, you can also specify jsonl as the staging file format. For this set the loader_file_format
argument of the run
command of the pipeline to jsonl
.
Snowflake and Amazon S3โ
Please refer to the S3 documentation to learn how to set up your bucket with the bucket_url and credentials. For s3 The dlt Redshift loader will use the aws credentials provided for s3 to access the s3 bucket if not specified otherwise (see config options below). Alternatively you can create a stage for your S3 Bucket by following the instructions provided in the Snowflake S3 documentation. The basic steps are as follows:
- Create a storage integration linked to GCS and the right bucket
- Grant access to this storage integration to the snowflake role you are using to load the data into snowflake.
- Create a stage from this storage integration in the PUBLIC namespace, or the namespace of the schema of your data.
- Also grant access to this stage for the role you are using to load data into snowflake.
- Provide the name of your stage (including the namespace) to dlt like so:
To prevent dlt from forwarding the s3 bucket credentials on every command, and set your s3 stage, change these settings:
[destination]
stage_name=PUBLIC.my_s3_stage
To run Snowflake with s3 as staging destination:
# Create a dlt pipeline that will load
# chess player data to the snowflake destination
# via staging on s3
pipeline = dlt.pipeline(
pipeline_name='chess_pipeline',
destination='snowflake',
staging='filesystem', # add this to activate the staging location
dataset_name='player_data'
)
Snowflake and Google Cloud Storageโ
Please refer to the Google Storage filesystem documentation to learn how to set up your bucket with the bucket_url and credentials. For gcs you can define a stage in Snowflake and provide the stage identifier in the configuration (see config options below.) Please consult the snowflake Documentation on how to create a stage for your GCS Bucket. The basic steps are as follows:
- Create a storage integration linked to GCS and the right bucket
- Grant access to this storage integration to the snowflake role you are using to load the data into snowflake.
- Create a stage from this storage integration in the PUBLIC namespace, or the namespace of the schema of your data.
- Also grant access to this stage for the role you are using to load data into snowflake.
- Provide the name of your stage (including the namespace) to dlt like so:
[destination]
stage_name=PUBLIC.my_gcs_stage
To run Snowflake with gcs as staging destination:
# Create a dlt pipeline that will load
# chess player data to the snowflake destination
# via staging on gcs
pipeline = dlt.pipeline(
pipeline_name='chess_pipeline',
destination='snowflake',
staging='filesystem', # add this to activate the staging location
dataset_name='player_data'
)
Snowflake and Azure Blob Storageโ
Please refer to the Azure Blob Storage filesystem documentation to learn how to set up your bucket with the bucket_url and credentials. For azure the Snowflake loader will use the filesystem credentials for your azure blob storage container if not specified otherwise (see config options below). Alternatively you can define an external stage in Snowflake and provide the stage identifier. Please consult the snowflake Documentation on how to create a stage for your Azure Blob Storage Container. The basic steps are as follows:
- Create a storage integration linked to Azure Blob Storage and the right container
- Grant access to this storage integration to the snowflake role you are using to load the data into snowflake.
- Create a stage from this storage integration in the PUBLIC namespace, or the namespace of the schema of your data.
- Also grant access to this stage for the role you are using to load data into snowflake.
- Provide the name of your stage (including the namespace) to dlt like so:
[destination]
stage_name=PUBLIC.my_azure_stage
To run Snowflake with azure as staging destination:
# Create a dlt pipeline that will load
# chess player data to the snowflake destination
# via staging on azure
pipeline = dlt.pipeline(
pipeline_name='chess_pipeline',
destination='snowflake',
staging='filesystem', # add this to activate the staging location
dataset_name='player_data'
)
Additional destination optionsโ
You can define your own stage to PUT files and disable removing of the staged files after loading.
[destination.snowflake]
# Use an existing named stage instead of the default. Default uses the implicit table stage per table
stage_name="DLT_STAGE"
# Whether to keep or delete the staged files after COPY INTO succeeds
keep_staged_files=true
dbt supportโ
This destination integrates with dbt via dbt-snowflake. Both password and key pair authentication is supported and shared with dbt runners.
Syncing of dlt
stateโ
This destination fully supports dlt state sync