Version: 0.3.25

Weaviate

Weaviate is an open source vector database. It allows you to store data objects and perform similarity searches over them. This destination helps you to load data into Weaviate from dlt resources.

Setup Guide

To use Weaviate as a destination, make sure dlt is installed with the 'weaviate' extra:

pip install dlt[weaviate]

Next, configure the destination in the dlt secrets file. The file is located at ~/.dlt/secrets.toml by default. Add the following section to the secrets file:

[destination.weaviate.credentials]
url = "https://your-weaviate-url"
api_key = "your-weaviate-api-key"

[destination.weaviate.credentials.additional_headers]
X-OpenAI-Api-Key = "your-openai-api-key"

This setup guide uses Weaviate Cloud Services to provide the Weaviate instance and the OpenAI API to generate embeddings through the text2vec-openai module.
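dlt can generally also read configuration from environment variables instead of secrets.toml. The sketch below assumes dlt's usual naming convention (upper-case section and key names joined by double underscores); env_var_name is a hypothetical helper written for this illustration, not part of dlt, and the values are the placeholders from the example above.

```python
import os


def env_var_name(section: str, key: str) -> str:
    """Build an environment variable name from a TOML section path and a key.

    Assumption: upper-case names joined by double underscores.
    """
    parts = section.split(".") + [key]
    return "__".join(part.upper() for part in parts)


# Placeholder values copied from the secrets.toml example.
os.environ[env_var_name("destination.weaviate.credentials", "url")] = "https://your-weaviate-url"
os.environ[env_var_name("destination.weaviate.credentials", "api_key")] = "your-weaviate-api-key"
```

Setting the variables in the shell or in your deployment environment has the same effect as setting them from Python.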

You can also host your own Weaviate instance using Docker Compose, Kubernetes, or embedded Weaviate. Refer to Weaviate's How-to: Install or to the dlt recipe we use for our tests. In that case, you can skip the credentials part altogether:

[destination.weaviate.credentials.additional_headers]
X-OpenAI-Api-Key = "your-openai-api-key"

The url defaults to http://localhost:8080 and api_key is left undefined, which matches the defaults of the Weaviate container.

Define the source of the data. For starters, let's load some data from a simple data structure:

import dlt
from dlt.destinations.weaviate import weaviate_adapter

movies = [
    {
        "title": "Blade Runner",
        "year": 1982,
    },
    {
        "title": "Ghost in the Shell",
        "year": 1995,
    },
    {
        "title": "The Matrix",
        "year": 1999,
    }
]

Define the pipeline:

pipeline = dlt.pipeline(
    pipeline_name="movies",
    destination="weaviate",
    dataset_name="MoviesDataset",
)

Run the pipeline:

info = pipeline.run(
    weaviate_adapter(
        movies,
        vectorize="title",
    )
)

Check the results:

print(info)

The data is now loaded into Weaviate.

The Weaviate destination differs from other dlt destinations. To use vector search after the data has been loaded, you must specify which fields Weaviate should include in the vector index. You do that by wrapping the data (or dlt resource) with the weaviate_adapter function.

weaviate_adapter

The weaviate_adapter is a helper function that configures the resource for the Weaviate destination:

weaviate_adapter(data, vectorize, tokenization)

It accepts the following arguments:

data: a dlt resource object or a Python data structure (e.g. a list of dictionaries).
vectorize: the name of a field, or a list of field names, that Weaviate should vectorize.
tokenization: a dictionary containing the tokenization configuration for individual fields. The dictionary should have the structure {'field_name': 'method'}. Valid methods are "word", "lowercase", "whitespace", and "field". The default is "word". See Property tokenization in the Weaviate documentation for more details.

Returns: a dlt resource object that you can pass to pipeline.run().

Example:

weaviate_adapter(
    resource,
    vectorize=["title", "description"],
    tokenization={"title": "word", "description": "whitespace"},
)
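Before calling the adapter, it can help to check a tokenization dictionary against the methods listed above. The following is a minimal sketch; check_tokenization is a hypothetical helper written for this illustration and is not part of dlt or Weaviate.

```python
from typing import Dict, Iterable

# Methods listed in the argument description above.
VALID_TOKENIZATION_METHODS = {"word", "lowercase", "whitespace", "field"}


def check_tokenization(tokenization: Dict[str, str], known_fields: Iterable[str]) -> None:
    """Raise ValueError for unknown fields or unsupported tokenization methods."""
    fields = set(known_fields)
    for field_name, method in tokenization.items():
        if field_name not in fields:
            raise ValueError(f"unknown field: {field_name!r}")
        if method not in VALID_TOKENIZATION_METHODS:
            raise ValueError(
                f"unsupported tokenization method for {field_name!r}: {method!r}"
            )


# Passes silently for the configuration used in the example.
check_tokenization(
    {"title": "word", "description": "whitespace"},
    ["title", "description"],
)
```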

tip

A more comprehensive pipeline would load data from some API or use one of dlt's verified sources.

Write disposition

A write disposition defines how the data should be written to the destination. All write dispositions are supported by the Weaviate destination.

Replace

The replace disposition replaces the data in the destination with the data from the resource. It deletes all the classes and objects and recreates the schema before loading the data.

In the movie example from the setup guide, we can use the replace disposition to reload the data every time we run the pipeline:

info = pipeline.run(
    weaviate_adapter(
        movies,
        vectorize="title",
    ),
    write_disposition="replace",
)

Merge

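The merge write disposition merges the data from the resource with the data in the destination. The merge disposition requires a primary_key for the resource. The example below assumes that every item carries a document_id field, which the movies list from the setup guide does not define: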

info = pipeline.run(
    weaviate_adapter(
        movies,
        vectorize="title",
    ),
    primary_key="document_id",
    write_disposition="merge"
)

Internally, dlt uses the primary_key (document_id in the example above) to generate a unique identifier (UUID) for each object in Weaviate. If an object with the same UUID already exists in Weaviate, it is updated with the new data; otherwise, a new object is created.
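The idea of deriving a stable UUID from a primary key can be sketched with Python's standard uuid module. This is an illustration only: the namespace and the exact hashing used by dlt are assumptions here and may differ, and object_uuid is a hypothetical helper.

```python
import uuid

# Hypothetical namespace for this sketch; dlt's real namespace and hashing may differ.
EXAMPLE_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "example.invalid")


def object_uuid(record: dict, primary_key: str) -> uuid.UUID:
    """Derive a deterministic UUID from the primary key value of a record."""
    return uuid.uuid5(EXAMPLE_NAMESPACE, str(record[primary_key]))


first = object_uuid({"document_id": 1, "title": "Blade Runner"}, "document_id")
again = object_uuid({"document_id": 1, "title": "Blade Runner (Final Cut)"}, "document_id")
other = object_uuid({"document_id": 2, "title": "The Matrix"}, "document_id")
```

Because first and again share the same document_id, they map to the same UUID, which is what lets a later load update the existing object instead of creating a duplicate.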

caution

If you are using the merge write disposition, you must set it from the first run of your pipeline; otherwise, the data will be duplicated in the database on subsequent loads.

Append

This is the default disposition. It appends the data to the existing data in the destination and ignores the primary_key field.

Data loading

This section describes how dlt maps data types, dataset names, and identifiers onto Weaviate's schema when loading data.

Data types

Data loaded into Weaviate from various sources might have different types. To ensure compatibility with Weaviate's schema, there's a predefined mapping between the dlt types and Weaviate's native types:

dlt Type	Weaviate Type
text	text
double	number
bool	boolean
timestamp	date
date	date
bigint	int
binary	blob
decimal	text
wei	number
complex	text
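The mapping in the table can be written as a plain lookup. This is a sketch for reference only; DLT_TO_WEAVIATE and weaviate_type are hypothetical names for this illustration and are not the identifiers dlt uses internally.

```python
# dlt type -> Weaviate type, copied from the table above.
DLT_TO_WEAVIATE = {
    "text": "text",
    "double": "number",
    "bool": "boolean",
    "timestamp": "date",
    "date": "date",
    "bigint": "int",
    "binary": "blob",
    "decimal": "text",
    "wei": "number",
    "complex": "text",
}


def weaviate_type(dlt_type: str) -> str:
    """Return the Weaviate type for a dlt type, or raise ValueError if unmapped."""
    try:
        return DLT_TO_WEAVIATE[dlt_type]
    except KeyError:
        raise ValueError(f"no Weaviate mapping for dlt type {dlt_type!r}") from None
```

Note that decimal and complex values end up as text, so numeric range filters will not work on those properties.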

Dataset name

Weaviate uses classes to categorize and identify data. To avoid naming conflicts, especially when multiple datasets have overlapping table names, dlt includes the dataset name in the Weaviate class name. This ensures a unique identifier for every class.

For example, if you have a dataset named movies_dataset and a table named actors, the Weaviate class name would be MoviesDataset_Actors (the default separator is an underscore).

However, if you prefer class names without the dataset prefix, omit the dataset_name argument.

For example:

pipeline = dlt.pipeline(
    pipeline_name="movies",
    destination="weaviate",
)
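The class naming described in this section can be sketched as follows. The helpers to_pascal_case and class_name are hypothetical and simplified (they only handle plain snake case names); they are not dlt's implementation.

```python
from typing import Optional


def to_pascal_case(name: str) -> str:
    """Convert a snake case name to Pascal case, keeping existing capitals."""
    parts = [part for part in name.split("_") if part]
    return "".join(part[0].upper() + part[1:] for part in parts)


def class_name(table_name: str, dataset_name: Optional[str] = None, separator: str = "_") -> str:
    """Combine dataset and table names into a Weaviate class name."""
    table = to_pascal_case(table_name)
    if not dataset_name:
        # Assumption: without a dataset name, the class is named after the table alone.
        return table
    return to_pascal_case(dataset_name) + separator + table


with_prefix = class_name("actors", "movies_dataset")
without_prefix = class_name("actors")
```

With the inputs from the example, with_prefix is MoviesDataset_Actors, and without_prefix is Actors.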

Names normalization

When loading data into Weaviate, dlt tries to maintain naming conventions consistent with the Weaviate schema.

Here's a summary of the naming normalization approach:

Table names

Snake case identifiers such as snake_case_name get converted to SnakeCaseName (aka Pascal case).
Pascal case identifiers such as PascalCaseName remain unchanged.
Leading underscores are removed. Hence, _snake_case_name becomes SnakeCaseName.
Numbers in names are retained, but if a name starts with a number, it's prefixed with a character, e.g., 1_a_1snake_case_name becomes C1A1snakeCaseName.
Double underscores in the middle of names, like Flat__Space, result in a single underscore: Flat_Space. If these appear at the end, they are followed by an 'x', making Flat__Space_ into Flat_Spacex.
Special characters and spaces are replaced with underscores, and emojis are simplified. For instance, Flat Sp!ace becomes Flat_SpAce and Flat_Sp💡ace is changed to Flat_SpAce.

Property names

Snake case and camel case remain unchanged: snake_case_name and camelCaseName.
Names starting with a capital letter have it lowercased: CamelCase -> camelCase
Names with multiple underscores, such as Snake-______c__ase_, are compacted to snake_c_asex. The exception is leading underscores, which are kept: ___snake_case_name becomes ___snake_case_name.
Names starting with a number are prefixed with a "p_". For example, 123snake_case_name becomes p_123snake_case_name.

Reserved property names

Reserved property names like id or additional are prefixed with underscores for differentiation. Therefore, id becomes __id and _id is rendered as ___id.
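A few of the property rules above (first-letter lowercasing, the "p_" prefix, and the reserved-name prefix) can be sketched in a few lines. normalize_property_name is a hypothetical, simplified helper: it does not compact repeated underscores or replace special characters, and the reserved set contains only the two names mentioned above.

```python
# Only the names mentioned in the text; the real reserved list may be longer.
RESERVED_NAMES = {"id", "additional"}


def normalize_property_name(name: str) -> str:
    """Apply a simplified subset of the property naming rules."""
    if not name:
        raise ValueError("property name must not be empty")
    # Reserved names, with or without leading underscores, get a "__" prefix.
    if name.lstrip("_") in RESERVED_NAMES:
        return "__" + name
    # Names starting with a digit get a "p_" prefix.
    if name[0].isdigit():
        return "p_" + name
    # Otherwise only the first letter is lowercased.
    return name[0].lower() + name[1:]


examples = {
    name: normalize_property_name(name)
    for name in ["id", "_id", "CamelCase", "snake_case_name", "123snake_case_name"]
}
```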

Case insensitive naming convention

The default naming convention described above preserves the casing of the properties (except for the first letter, which is lowercased). This generates nice classes in Weaviate, but it also requires that your input data has no property names that clash when compared case-insensitively (for example, caseName and casename). In that case, the Weaviate destination fails to create the classes and reports a conflict.

You can configure an alternative naming convention that lowercases all properties. The clashing properties are then merged and the classes are created. Still, if you have a document with clashing properties like:

{"camelCase": 1, "CamelCase": 2}

it will be normalized to:

{"camelcase": 2}

so your best course of action is to clean up the data yourself before loading and to use the default naming convention. Nevertheless, you can configure the alternative in config.toml:

[schema]
naming="dlt.destinations.weaviate.ci_naming"
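To clean up the data before loading, you can look for keys that differ only by case. The sketch below uses two hypothetical helpers: find_case_clashes reports such keys, and lowercase_keys reproduces the merge shown in the example above, assuming the last value wins.

```python
from collections import defaultdict


def find_case_clashes(record: dict) -> dict:
    """Map each lowercased key to the original keys that collide on it."""
    groups = defaultdict(list)
    for key in record:
        groups[key.lower()].append(key)
    return {lowered: keys for lowered, keys in groups.items() if len(keys) > 1}


def lowercase_keys(record: dict) -> dict:
    """Lowercase all keys; on a clash, the later value overwrites the earlier one."""
    return {key.lower(): value for key, value in record.items()}


doc = {"camelCase": 1, "CamelCase": 2}
clashes = find_case_clashes(doc)
merged = lowercase_keys(doc)
```

Here clashes lists both spellings under camelcase, and merged keeps only the value 2, so the value 1 is silently lost.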

Additional destination options

batch_size: (int) the number of items in the batch insert request. The default is 100.
batch_workers: (int) the maximal number of concurrent threads to run batch import. The default is 1.
batch_consistency: (str) the number of replica nodes in the cluster that must acknowledge a write or read request before it's considered successful. The default is ONE. The available consistency levels are:
- ONE: Only one replica node needs to acknowledge.
- QUORUM: A majority of replica nodes (calculated as replication_factor / 2 + 1) must acknowledge.
- ALL: All replica nodes in the cluster must send a successful response.
batch_retries: (int) the number of retries to create a batch that failed with ReadTimeout. The default is 5.
dataset_separator: (str) the separator to use when generating the class names in Weaviate. The default is an underscore.
conn_timeout and read_timeout: (float) the timeouts (in seconds) for connecting to and reading from the REST API. The defaults are 10.0 and 180.0, respectively.
startup_period: (int) how long to wait for Weaviate to start.
vectorizer: (str) the name of the vectorizer to use. The default is text2vec-openai.
module_config: (dict) the configuration of the Weaviate modules.
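The consistency levels translate into a number of required acknowledgements. The sketch below assumes that the QUORUM formula uses integer division; quorum_size and required_acks are hypothetical helpers for illustration, not part of dlt or the Weaviate client.

```python
def quorum_size(replication_factor: int) -> int:
    """Majority of replicas: replication_factor / 2 + 1, using integer division."""
    if replication_factor < 1:
        raise ValueError("replication_factor must be at least 1")
    return replication_factor // 2 + 1


def required_acks(level: str, replication_factor: int) -> int:
    """Number of replica acknowledgements needed for a consistency level."""
    level = level.upper()
    if level == "ONE":
        return 1
    if level == "QUORUM":
        return quorum_size(replication_factor)
    if level == "ALL":
        return replication_factor
    raise ValueError(f"unknown consistency level: {level!r}")
```

For a replication factor of 3, this gives 1 acknowledgement for ONE, 2 for QUORUM, and 3 for ALL.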

Configure Weaviate modules

The default configuration for the Weaviate destination uses text2vec-openai. To configure another vectorizer or a generative module, replace the default module_config value by updating config.toml:

[destination.weaviate]
module_config={text2vec-openai = {}, generative-openai = {}}

This ensures the generative-openai module is used for generative queries.

Run Weaviate fully standalone

Below is an example that configures the contextionary vectorizer. You can put this into config.toml. This configuration does not need external APIs for vectorization and may be used fully offline.

[destination.weaviate]
vectorizer="text2vec-contextionary"
module_config={text2vec-contextionary = { vectorizeClassName = false, vectorizePropertyName = true}}

You can find a Docker Compose file with instructions on how to run it here

dbt support

Currently, the Weaviate destination does not support dbt.

Syncing of `dlt` state

The Weaviate destination supports syncing of the dlt state.
