Weaviate
Weaviate is an open source vector database. It allows you to store data objects and perform similarity searches over them. This destination helps you to load data into Weaviate from dlt resources.
Setup Guide
- To use Weaviate as a destination, make sure dlt is installed with the 'weaviate' extra:
pip install dlt[weaviate]
- Next, configure the destination in the dlt secrets file. By default, the file is located at ~/.dlt/secrets.toml. Add the following section to it:
[destination.weaviate.credentials]
url = "https://your-weaviate-url"
api_key = "your-weaviate-api-key"
[destination.weaviate.credentials.additional_headers]
X-OpenAI-Api-Key = "your-openai-api-key"
This setup guide uses Weaviate Cloud Services to get a Weaviate instance and the OpenAI API to generate embeddings through the text2vec-openai module.
You can also host your own Weaviate instance using Docker Compose, Kubernetes, or embedded Weaviate. Refer to Weaviate's "How-to: Install" guide or to the dlt recipe we use for our tests. In that case, you can omit url and api_key and keep only the additional headers:
[destination.weaviate.credentials.additional_headers]
X-OpenAI-Api-Key = "your-openai-api-key"
The url defaults to http://localhost:8080 and api_key is left undefined, which are the defaults for the Weaviate container.
- Define the source of the data. For starters, let's load some data from a simple data structure:
import dlt
from dlt.destinations.weaviate import weaviate_adapter
movies = [
{
"title": "Blade Runner",
"year": 1982,
},
{
"title": "Ghost in the Shell",
"year": 1995,
},
{
"title": "The Matrix",
"year": 1999,
}
]
- Define the pipeline:
pipeline = dlt.pipeline(
pipeline_name="movies",
destination="weaviate",
dataset_name="MoviesDataset",
)
- Run the pipeline:
info = pipeline.run(
weaviate_adapter(
movies,
vectorize="title",
)
)
- Check the results:
print(info)
The data is now loaded into Weaviate.
The Weaviate destination differs from other dlt destinations. To use vector search after the data has been loaded, you must specify which fields Weaviate should include in the vector index. You do that by wrapping the data (or dlt resource) with the weaviate_adapter function.
weaviate_adapter
The weaviate_adapter is a helper function that configures the resource for the Weaviate destination:
weaviate_adapter(data, vectorize, tokenization)
It accepts the following arguments:
- data: a dlt resource object or a Python data structure (e.g. a list of dictionaries).
- vectorize: the name of a field, or a list of field names, that Weaviate should vectorize.
- tokenization: a dictionary containing the tokenization configuration for each field, structured as {'field_name': 'method'}. Valid methods are "word", "lowercase", "whitespace", and "field". The default is "word". See Property tokenization in the Weaviate documentation for more details.
Returns: a dlt resource object that you can pass to pipeline.run().
Example:
weaviate_adapter(
resource,
vectorize=["title", "description"],
tokenization={"title": "word", "description": "whitespace"},
)
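A typo in a tokenization method is easy to miss, so a small pre-flight check can be useful. The sketch below is only an illustration: check_tokenization is a hypothetical helper, not part of dlt, and it relies solely on the four method names listed above.

```python
# Hypothetical helper (not part of dlt): validates a tokenization dictionary
# against the four methods named in this guide.
VALID_TOKENIZATION_METHODS = {"word", "lowercase", "whitespace", "field"}


def check_tokenization(tokenization):
    """Return the dictionary unchanged, or raise if a method is not recognized."""
    invalid = {
        field: method
        for field, method in tokenization.items()
        if method not in VALID_TOKENIZATION_METHODS
    }
    if invalid:
        raise ValueError(f"Unsupported tokenization methods: {invalid}")
    return tokenization


checked = check_tokenization({"title": "word", "description": "whitespace"})
```

The checked dictionary could then be passed as the tokenization argument of weaviate_adapter.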
A more comprehensive pipeline would load data from some API or use one of dlt's verified sources.
Write disposition
A write disposition defines how the data should be written to the destination. All write dispositions are supported by the Weaviate destination.
Replace
The replace disposition replaces the data in the destination with the data from the resource. It deletes all the classes and objects and recreates the schema before loading the data.
In the movie example from the setup guide, we can use the replace disposition to reload the data every time we run the pipeline:
info = pipeline.run(
weaviate_adapter(
movies,
vectorize="title",
),
write_disposition="replace",
)
Merge
The merge write disposition merges the data from the resource with the data in the destination.
For the merge disposition, you need to specify a primary_key for the resource:
info = pipeline.run(
weaviate_adapter(
movies,
vectorize="title",
),
primary_key="document_id",
write_disposition="merge"
)
Internally, dlt uses the primary_key (document_id in the example above) to generate a unique identifier (UUID) for each object in Weaviate. If an object with the same UUID already exists in Weaviate, it is updated with the new data. Otherwise, a new object is created.
If you are using the merge write disposition, you must set it from the first run of your pipeline; otherwise, the data will be duplicated in the database on subsequent loads.
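The idea of deriving an object ID from the primary key can be sketched in plain Python. This is an illustration of the principle only, not dlt's actual implementation: the namespace, the way key values are joined, and the in-memory store standing in for Weaviate are all assumptions.

```python
import uuid

# Assumed namespace for this illustration; dlt's real ID scheme may differ.
NAMESPACE = uuid.NAMESPACE_DNS


def object_uuid(class_name, record, primary_key):
    """Derive a deterministic UUID from the class name and primary key values."""
    keys = [primary_key] if isinstance(primary_key, str) else list(primary_key)
    key_material = class_name + "|" + "|".join(str(record[k]) for k in keys)
    return uuid.uuid5(NAMESPACE, key_material)


def merge(store, class_name, records, primary_key):
    """Upsert records into a dict that stands in for the Weaviate class."""
    for record in records:
        # Same primary key -> same UUID -> the existing object is overwritten.
        store[object_uuid(class_name, record, primary_key)] = record
    return store


store = {}
merge(store, "Movies", [{"document_id": 1, "title": "Blade Runner"}], "document_id")
merge(
    store,
    "Movies",
    [
        {"document_id": 1, "title": "Blade Runner (Final Cut)"},
        {"document_id": 2, "title": "The Matrix"},
    ],
    "document_id",
)
```

After the second call the store holds two objects: the record with document_id 1 was updated in place and the record with document_id 2 was added. Objects loaded earlier under IDs that were not derived from the primary key would not match, which is why the merge disposition has to be used from the first run.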
Append
This is the default disposition. It appends the data to the existing data in the destination, ignoring the primary_key field.
Data loading
Loading data into Weaviate from different sources requires a proper understanding of how data is transformed and integrated into Weaviate's schema.
Data types
Data loaded into Weaviate from various sources might have different types. To ensure compatibility with Weaviate's schema, there's a predefined mapping between the dlt types and Weaviate's native types:
dlt Type | Weaviate Type |
---|---|
text | text |
double | number |
bool | boolean |
timestamp | date |
date | date |
bigint | int |
binary | blob |
decimal | text |
wei | number |
complex | text |
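For quick reference in code, the table above can be written as a lookup dictionary. The names DLT_TO_WEAVIATE and weaviate_type are hypothetical and exist only for this sketch; the mapping itself is copied from the table.

```python
# Mapping copied from the table above; the names are illustrative only.
DLT_TO_WEAVIATE = {
    "text": "text",
    "double": "number",
    "bool": "boolean",
    "timestamp": "date",
    "date": "date",
    "bigint": "int",
    "binary": "blob",
    "decimal": "text",
    "wei": "number",
    "complex": "text",
}


def weaviate_type(dlt_type):
    """Look up the Weaviate type for a dlt type, failing loudly on unknown types."""
    try:
        return DLT_TO_WEAVIATE[dlt_type]
    except KeyError:
        raise ValueError(f"No Weaviate mapping for dlt type {dlt_type!r}") from None
```

Note that decimal and complex values both end up as text, so they are not available as numbers or nested objects on the Weaviate side.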
Dataset name
Weaviate uses classes to categorize and identify data. To avoid potential naming conflicts, especially when dealing with multiple datasets that might have overlapping table names, dlt includes the dataset name in the Weaviate class name. This ensures a unique identifier for every class.
For example, if you have a dataset named movies_dataset and a table named actors, the Weaviate class name would be MoviesDataset_Actors (the default separator is an underscore).
However, if you prefer to have class names without the dataset prefix, omit the dataset_name argument.
For example:
pipeline = dlt.pipeline(
pipeline_name="movies",
destination="weaviate",
)
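How the class name is composed can be sketched as follows. This is a simplified illustration under the assumptions stated in the comments, not dlt's implementation; build_class_name and to_pascal are hypothetical names.

```python
def to_pascal(name):
    """Convert a snake_case identifier to PascalCase (simplified)."""
    return "".join(part[:1].upper() + part[1:] for part in name.split("_") if part)


def build_class_name(table_name, dataset_name=None, separator="_"):
    """Compose a class name; the dataset prefix is skipped when no dataset is given."""
    table = to_pascal(table_name)
    if not dataset_name:
        return table
    return to_pascal(dataset_name) + separator + table


with_prefix = build_class_name("actors", dataset_name="movies_dataset")
without_prefix = build_class_name("actors")
```

Here with_prefix is "MoviesDataset_Actors" and without_prefix is "Actors". The separator parameter corresponds to the dataset_separator destination option.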
Names normalization
When loading data into Weaviate, dlt tries to maintain naming conventions consistent with the Weaviate schema.
Here's a summary of the naming normalization approach:
Table names
- Snake case identifiers such as snake_case_name are converted to SnakeCaseName (Pascal case).
- Pascal case identifiers such as PascalCaseName remain unchanged.
- Leading underscores are removed. Hence, _snake_case_name becomes SnakeCaseName.
- Numbers in names are retained, but if a name starts with a number, it is prefixed with a character, e.g. 1_a_1snake_case_name becomes C1A1snakeCaseName.
- Double underscores in the middle of names, as in Flat__Space, result in a single underscore: Flat_Space. If they appear at the end, they are followed by an 'x', turning Flat__Space_ into Flat_Spacex.
- Special characters and spaces are replaced with underscores, and emojis are simplified. For instance, Flat Sp!ace becomes Flat_SpAce and Flat_Sp💡ace is changed to Flat_SpAce.
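Some of the table name rules above can be approximated in a few lines of Python. The sketch covers only the snake case, leading underscore, leading digit, and mid-name double underscore rules; trailing underscores, special characters, and emojis are deliberately left out. normalize_table_name is a hypothetical function, not dlt's naming convention code.

```python
def normalize_table_name(name):
    """Approximate a subset of the table name rules (see the list above)."""
    # Leading underscores are removed.
    name = name.lstrip("_")
    # A double underscore separates sections and survives as a single underscore.
    sections = [section for section in name.split("__") if section]
    result = "_".join(
        # Within a section, each snake_case word gets its first letter uppercased.
        "".join(word[:1].upper() + word[1:] for word in section.split("_") if word)
        for section in sections
    )
    # A name starting with a digit is prefixed with a character.
    if result[:1].isdigit():
        result = "C" + result
    return result
```

For the examples in the list, this returns SnakeCaseName for both snake_case_name and _snake_case_name, leaves PascalCaseName unchanged, and produces C1A1snakeCaseName and Flat_Space for the digit and double underscore cases.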
Property names
- Snake case and camel case remain unchanged: snake_case_name and camelCaseName.
- Names starting with a capital letter have it lowercased: CamelCase -> camelCase.
- Names with multiple underscores, such as Snake-______c__ase_, are compacted to snake_c_asex. Leading underscores are the exception and are kept: a form of snake_case_name that starts with underscores keeps them.
- Names starting with a number are prefixed with "p_". For example, 123snake_case_name becomes p_123snake_case_name.
Reserved property names
Reserved property names like id or additional are prefixed with underscores for differentiation. Therefore, id becomes __id and _id is rendered as ___id.
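The simpler property name rules, together with the reserved name handling, can be sketched as below. This is an approximation under stated assumptions: the reserved set contains only the two names mentioned in the text, the two-underscore prefix is inferred from the id and _id examples, and underscore compaction is not modeled. normalize_property_name is a hypothetical name.

```python
# Only the two reserved names mentioned in the text; the real list may be longer.
RESERVED_PROPERTY_NAMES = {"id", "additional"}


def normalize_property_name(name):
    """Approximate the property name rules; underscore compaction is not modeled."""
    # Reserved names (with or without leading underscores) get a "__" prefix.
    if name.lstrip("_") in RESERVED_PROPERTY_NAMES:
        return "__" + name
    # Names starting with a digit are prefixed with "p_".
    if name[:1].isdigit():
        return "p_" + name
    # Otherwise only the first letter is lowercased; the rest keeps its casing.
    return name[:1].lower() + name[1:]
```

With these rules, id becomes __id, _id becomes ___id, CamelCase becomes camelCase, and 123snake_case_name becomes p_123snake_case_name.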
Case-insensitive naming convention
The default naming convention described above preserves the casing of property names (apart from the first letter, which is lowercased). This generates nice classes in Weaviate, but it also requires that your input data has no property names that clash when compared case-insensitively (i.e. caseName == casename). If such a clash occurs, the Weaviate destination fails to create the classes and reports a conflict.
You can configure an alternative naming convention that lowercases all properties. The clashing properties are then merged and the classes are created. Still, if a document contains clashing properties such as:
{"camelCase": 1, "CamelCase": 2}
it will be normalized to:
{"camelcase": 2}
so your best course of action is to clean up the data yourself before loading and use the default naming convention. Nevertheless, you can configure the alternative in config.toml:
[schema]
naming="dlt.destinations.weaviate.ci_naming"
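To clean up the data before loading, it helps to find the clashing keys first. The following sketch uses hypothetical helper names and plain dictionaries; it reports keys that differ only by case and shows why lowercasing silently drops a value.

```python
from collections import defaultdict


def find_case_clashes(record):
    """Return groups of keys that are equal when compared case-insensitively."""
    groups = defaultdict(list)
    for key in record:
        groups[key.lower()].append(key)
    return {lowered: keys for lowered, keys in groups.items() if len(keys) > 1}


def lowercase_keys(record):
    """Lowercase all keys; when keys clash, the later value overwrites the earlier one."""
    return {key.lower(): value for key, value in record.items()}


document = {"camelCase": 1, "CamelCase": 2}
clashes = find_case_clashes(document)
merged = lowercase_keys(document)
```

For the document above, clashes is {"camelcase": ["camelCase", "CamelCase"]} and merged is {"camelcase": 2}, so the value 1 is lost. Running a check like find_case_clashes on sample records before loading makes such losses visible.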
Additional destination options
- batch_size: (int) the number of items in the batch insert request. The default is 100.
- batch_workers: (int) the maximum number of concurrent threads that run the batch import. The default is 1.
- batch_consistency: (str) the number of replica nodes in the cluster that must acknowledge a write or read request before it's considered successful. The default is ONE. The available consistency levels are:
  - ONE: only one replica node needs to acknowledge.
  - QUORUM: a majority of replica nodes (calculated as replication_factor / 2 + 1) must acknowledge.
  - ALL: all replica nodes in the cluster must send a successful response.
- batch_retries: (int) the number of retries to create a batch that failed with ReadTimeout. The default is 5.
- dataset_separator: (str) the separator to use when generating the class names in Weaviate.
- conn_timeout and read_timeout: (float) timeouts, in seconds, for connecting to and reading from the REST API. The defaults are (10.0, 180.0).
- startup_period: (int) how long to wait for Weaviate to start.
- vectorizer: (str) the name of the vectorizer to use. The default is text2vec-openai.
- module_config: (dict) configurations of various Weaviate modules.
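The consistency levels translate into a number of required acknowledgements. The sketch below works through that arithmetic; it assumes that the division in replication_factor / 2 + 1 is integer division, and required_acks is a hypothetical helper rather than a dlt or Weaviate function.

```python
def required_acks(level, replication_factor):
    """Number of replica nodes that must acknowledge a request at a given level."""
    if replication_factor < 1:
        raise ValueError("replication_factor must be at least 1")
    if level == "ONE":
        return 1
    if level == "QUORUM":
        # Assumption: integer division, so a factor of 3 gives a majority of 2.
        return replication_factor // 2 + 1
    if level == "ALL":
        return replication_factor
    raise ValueError(f"Unknown consistency level: {level!r}")


acks_for_three = {level: required_acks(level, 3) for level in ("ONE", "QUORUM", "ALL")}
```

With a replication factor of 3, ONE needs 1 acknowledgement, QUORUM needs 2, and ALL needs 3.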
Configure Weaviate modules
The default configuration for the Weaviate destination uses text2vec-openai.
To configure another vectorizer or a generative module, replace the default module_config value by updating config.toml:
[destination.weaviate]
module_config={text2vec-openai = {}, generative-openai = {}}
This ensures the generative-openai module is used for generative queries.
Run Weaviate fully standalone
Below is an example that configures the contextionary vectorizer. You can put this into config.toml. This configuration does not need external APIs for vectorization and can be used fully offline.
[destination.weaviate]
vectorizer="text2vec-contextionary"
module_config={text2vec-contextionary = { vectorizeClassName = false, vectorizePropertyName = true}}
You can find a Docker Compose file with instructions on how to run it here.
dbt support
Currently, the Weaviate destination does not support dbt.
Syncing of dlt state
The Weaviate destination supports syncing of the dlt state.