
Snowflake

Setup Guide

1. Initialize a project with a pipeline that loads to Snowflake by running:

dlt init chess snowflake

2. Install the necessary dependencies for Snowflake by running:

pip install -r requirements.txt

This will install dlt with the snowflake extra, which contains the Snowflake Python dbapi client.

3. Create a new database and user, and give dlt access.

Read the next chapter below.

4. Enter your credentials into .dlt/secrets.toml. It should now look like this:

[destination.snowflake.credentials]
database = "dlt_data"
password = "<password>"
username = "loader"
host = "kgiotue-wn98412"
warehouse = "COMPUTE_WH"
role = "DLT_LOADER_ROLE"

In the case of Snowflake, the host is your Account Identifier. You can get it in Admin/Accounts by copying the account URL (e.g. https://kgiotue-wn98412.snowflakecomputing.com) and extracting the host name (kgiotue-wn98412).

The warehouse and role are optional if you assign defaults to your user. In the example below we do not do that, so we set them explicitly.
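
Once the credentials are in place, you can verify them with a tiny test load. Below is a minimal sketch using an in-memory list; the pipeline, dataset, and table names are arbitrary examples:

import dlt

# a minimal smoke test: load a small in-memory dataset into Snowflake
pipeline = dlt.pipeline(
    pipeline_name='snowflake_smoke_test',
    destination='snowflake',
    dataset_name='smoke_test_data',
)
load_info = pipeline.run(
    [{'id': 1, 'name': 'alice'}, {'id': 2, 'name': 'bob'}],
    table_name='people',
)
print(load_info)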

Set up the database user and permissions

The instructions below assume that you use the default account setup that you get after creating a Snowflake account. You should have a default warehouse named COMPUTE_WH. Below we create a new database and user and assign permissions. The permissions are quite generous; a more experienced user can easily reduce dlt's permissions to just one schema in the database.

-- create database with standard settings
CREATE DATABASE dlt_data;
-- create a new user - set your password here
CREATE USER loader WITH PASSWORD='<password>';
-- we assign all permissions to a role
CREATE ROLE DLT_LOADER_ROLE;
GRANT ROLE DLT_LOADER_ROLE TO USER loader;
-- give database access to the new role
GRANT USAGE ON DATABASE dlt_data TO ROLE DLT_LOADER_ROLE;
-- allow dlt to create new schemas
GRANT CREATE SCHEMA ON DATABASE dlt_data TO ROLE DLT_LOADER_ROLE;
-- allow access to a warehouse named COMPUTE_WH
GRANT USAGE ON WAREHOUSE COMPUTE_WH TO ROLE DLT_LOADER_ROLE;
-- grant access to all future schemas and tables in the database
GRANT ALL PRIVILEGES ON FUTURE SCHEMAS IN DATABASE dlt_data TO ROLE DLT_LOADER_ROLE;
GRANT ALL PRIVILEGES ON FUTURE TABLES IN DATABASE dlt_data TO ROLE DLT_LOADER_ROLE;

Now you can use the user named LOADER to access the database DLT_DATA, logging in with the specified password.

You can also decrease the suspend time for your warehouse to 1 minute (Admin/Warehouses in the Snowflake UI).

Authentication types

The Snowflake destination accepts three authentication types: password, key pair, and external authentication.

Password authentication is no different from other databases like Postgres or Redshift. dlt follows the same syntax as the SQLAlchemy dialect.

You can also pass credentials as a database connection string. For example:

# keep it at the top of your toml file! before any section starts
destination.snowflake.credentials="snowflake://loader:<password>@kgiotue-wn98412/dlt_data?warehouse=COMPUTE_WH&role=DLT_LOADER_ROLE"
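
You can also pass the same connection string in code rather than in the toml file. A sketch, assuming the credentials argument of dlt.pipeline is available in your dlt version (the account identifier and password are the placeholders from above):

import dlt

# the connection string from above, passed in code instead of secrets.toml
pipeline = dlt.pipeline(
    pipeline_name='chess_pipeline',
    destination='snowflake',
    dataset_name='player_data',
    credentials='snowflake://loader:<password>@kgiotue-wn98412/dlt_data'
                '?warehouse=COMPUTE_WH&role=DLT_LOADER_ROLE',
)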

In key pair authentication, you replace the password with a private key string that should be in Base64-encoded DER format (dbt also recommends Base64-encoded private keys for Snowflake connections). The private key may also be encrypted. In that case, you must provide a passphrase alongside the private key.

[destination.snowflake.credentials]
database = "dlt_data"
username = "loader"
host = "kgiotue-wn98412"
private_key = "LS0tLS1CRUdJTiBFTkNSWVBURUQgUFJJ....Qo="
private_key_passphrase="passphrase"

You can easily get the Base64-encoded value of your private key by running base64 -i <path-to-private-key-file>.pem in your terminal.

If you pass a passphrase in the connection string, make sure to URL-encode it.

# keep it at the top of your toml file! before any section starts
destination.snowflake.credentials="snowflake://loader:<password>@kgiotue-wn98412/dlt_data?private_key=<base64 encoded pem>&private_key_passphrase=<url encoded passphrase>"
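
If you assemble that connection string in code, the standard library covers both encodings. A sketch (the key file path and passphrase are placeholders):

import base64
from urllib.parse import quote

# Base64-encode the private key file, equivalent to:
#   base64 -i <path-to-private-key-file>.pem
with open('rsa_key.pem', 'rb') as f:  # placeholder path
    private_key_b64 = base64.b64encode(f.read()).decode('ascii')

# URL-encode both values so they are safe to embed as query parameters
query = (
    'private_key=' + quote(private_key_b64, safe='')
    + '&private_key_passphrase=' + quote('passphrase', safe='')  # placeholder passphrase
)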

In external authentication, you can use an OAuth provider like Okta or an external browser to authenticate. You pass your authenticator and refresh token as below:

[destination.snowflake.credentials]
database = "dlt_data"
username = "loader"
authenticator="..."
token="..."

or pass them in the connection string as query parameters. Refer to Snowflake OAuth for more details.

Write disposition

All write dispositions are supported.

If you set the replace strategy to staging-optimized, the destination tables will be dropped and recreated using a CLONE command from the staging tables.
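
For example, a run that replaces the destination table on every load; a minimal sketch with placeholder data (the replace strategy itself, such as staging-optimized, is configured separately in your dlt config):

import dlt

pipeline = dlt.pipeline(
    pipeline_name='chess_pipeline',
    destination='snowflake',
    dataset_name='player_data',
)
# 'replace' rewrites the destination table on every run
pipeline.run(
    [{'id': 1, 'rating': 2850}],  # placeholder data
    table_name='players',
    write_disposition='replace',
)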

Data loading

The data is loaded using an internal Snowflake stage. We use the PUT command and per-table built-in stages by default. Staged files are removed immediately after loading (unless specified otherwise).

Supported file formats

When staging is enabled:

❗ When loading from parquet, Snowflake will store complex types (JSON) in VARIANT as a string. Use the jsonl format instead, or use PARSE_JSON to update the VARIANT field after loading (see the sketch below).

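To avoid the caveat above, you can switch the load format for a given run. A sketch using the loader_file_format argument described in the Staging support section below (placeholder data; assumes a filesystem staging destination is configured):

import dlt

pipeline = dlt.pipeline(
    pipeline_name='chess_pipeline',
    destination='snowflake',
    staging='filesystem',
    dataset_name='player_data',
)
# load this run as jsonl instead of parquet so complex (JSON) columns
# are stored in VARIANT as parsed JSON rather than strings
pipeline.run(
    [{'id': 1, 'name': 'alice'}],  # placeholder data
    table_name='players',
    loader_file_format='jsonl',
)
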
Supported column hints

Snowflake supports the following column hints:

  • cluster - creates clustering on the given column(s). Multiple columns per table are supported, but only when a new table is created (see the sketch below).
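
A sketch of passing the cluster hint through a resource's column schema; the resource name, column names, and data are placeholders, assuming the columns argument of dlt.resource accepts the hint:

import dlt

# mark `event_date` as a clustering column via the column schema hint
@dlt.resource(
    name='events',
    columns={'event_date': {'data_type': 'date', 'cluster': True}},
)
def events():
    yield [{'event_date': '2023-01-01', 'payload': 'example'}]

pipeline = dlt.pipeline(
    pipeline_name='events_pipeline',
    destination='snowflake',
    dataset_name='events_data',
)
pipeline.run(events())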

Table and column identifiers

Snowflake makes all unquoted identifiers uppercase and then resolves them case-insensitively in SQL statements. dlt (effectively) does not quote identifiers in DDL, preserving the default behavior.

Names of tables and columns in schemas are kept in lower case, as for all other destinations. This is the pattern we observed in other tools, e.g. dbt. With dlt, however, it is trivial to define your own uppercase naming convention.

Staging support

Snowflake supports s3, gcs, and azure as file staging destinations. dlt will upload files in the parquet format to the configured bucket and ask Snowflake to copy their data directly into the database.

Alternatively to parquet files, you can also specify jsonl as the staging file format. To do so, set the loader_file_format argument of the pipeline's run command to jsonl.

Snowflake and Amazon S3

Please refer to the S3 documentation to learn how to set up your bucket with the bucket_url and credentials. For s3, the dlt Snowflake loader will use the aws credentials provided for s3 to access the s3 bucket if not specified otherwise (see config options below). Alternatively, you can create a stage for your S3 bucket by following the instructions provided in the Snowflake S3 documentation. The basic steps are as follows:

  • Create a storage integration linked to S3 and the right bucket
  • Grant access to this storage integration to the snowflake role you are using to load the data into snowflake.
  • Create a stage from this storage integration in the PUBLIC namespace, or the namespace of the schema of your data.
  • Also grant access to this stage for the role you are using to load data into snowflake.
  • Provide the name of your stage (including the namespace) to dlt like so:

To prevent dlt from forwarding the s3 bucket credentials on every command and to set your s3 stage, change these settings:

[destination]
stage_name=PUBLIC.my_s3_stage

To run Snowflake with s3 as the staging destination:

# Create a dlt pipeline that will load
# chess player data to the snowflake destination
# via staging on s3
pipeline = dlt.pipeline(
    pipeline_name='chess_pipeline',
    destination='snowflake',
    staging='filesystem', # add this to activate the staging location
    dataset_name='player_data'
)
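
Continuing from the pipeline above, running it is unchanged; a minimal sketch with placeholder data (dlt uploads parquet files to the bucket and asks Snowflake to copy them into the destination tables):

# placeholder data; with staging='filesystem' configured, the files go
# through the s3 bucket before Snowflake copies them into the table
load_info = pipeline.run(
    [{'player_id': 1, 'username': 'magnus'}],
    table_name='players',
)
print(load_info)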

Snowflake and Google Cloud Storage

Please refer to the Google Storage filesystem documentation to learn how to set up your bucket with the bucket_url and credentials. For gcs, you can define a stage in Snowflake and provide the stage identifier in the configuration (see config options below). Please consult the Snowflake documentation on how to create a stage for your GCS bucket. The basic steps are as follows:

  • Create a storage integration linked to GCS and the right bucket
  • Grant access to this storage integration to the snowflake role you are using to load the data into snowflake.
  • Create a stage from this storage integration in the PUBLIC namespace, or the namespace of the schema of your data.
  • Also grant access to this stage for the role you are using to load data into snowflake.
  • Provide the name of your stage (including the namespace) to dlt like so:
[destination]
stage_name=PUBLIC.my_gcs_stage

To run Snowflake with gcs as the staging destination:

# Create a dlt pipeline that will load
# chess player data to the snowflake destination
# via staging on gcs
pipeline = dlt.pipeline(
    pipeline_name='chess_pipeline',
    destination='snowflake',
    staging='filesystem', # add this to activate the staging location
    dataset_name='player_data'
)

Snowflake and Azure Blob Storage

Please refer to the Azure Blob Storage filesystem documentation to learn how to set up your bucket with the bucket_url and credentials. For azure, the Snowflake loader will use the filesystem credentials for your Azure Blob Storage container if not specified otherwise (see config options below). Alternatively, you can define an external stage in Snowflake and provide the stage identifier. Please consult the Snowflake documentation on how to create a stage for your Azure Blob Storage container. The basic steps are as follows:

  • Create a storage integration linked to Azure Blob Storage and the right container
  • Grant access to this storage integration to the snowflake role you are using to load the data into snowflake.
  • Create a stage from this storage integration in the PUBLIC namespace, or the namespace of the schema of your data.
  • Also grant access to this stage for the role you are using to load data into snowflake.
  • Provide the name of your stage (including the namespace) to dlt like so:
[destination]
stage_name=PUBLIC.my_azure_stage

To run Snowflake with azure as the staging destination:

# Create a dlt pipeline that will load
# chess player data to the snowflake destination
# via staging on azure
pipeline = dlt.pipeline(
    pipeline_name='chess_pipeline',
    destination='snowflake',
    staging='filesystem', # add this to activate the staging location
    dataset_name='player_data'
)

Additional destination options

You can define your own stage to PUT files into and disable the removal of staged files after loading.

[destination.snowflake]
# Use an existing named stage instead of the default. Default uses the implicit table stage per table
stage_name="DLT_STAGE"
# Whether to keep or delete the staged files after COPY INTO succeeds
keep_staged_files=true

dbt support

This destination integrates with dbt via dbt-snowflake. Both password and key pair authentication are supported and shared with dbt runners.
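
A sketch of running a dbt package through dlt's dbt runner against this destination, assuming the dlt.dbt helpers (get_venv, package, run_all) are available in your dlt version; the dbt package location is a placeholder:

import dlt

# the pipeline's Snowflake credentials (password or key pair) are shared with dbt
pipeline = dlt.pipeline(
    pipeline_name='chess_pipeline',
    destination='snowflake',
    dataset_name='player_data',
)
venv = dlt.dbt.get_venv(pipeline)  # assumed helper: creates a venv with dbt-snowflake
dbt = dlt.dbt.package(pipeline, 'dbt/chess_models', venv=venv)  # placeholder package path
models = dbt.run_all()
for model in models:
    print(model)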

Syncing of dlt state

This destination fully supports dlt state sync.
