# ask-ai
m
I want to have a custom chunking strategy for a vector db destination. How do I do that?
u
You can configure the chunking strategy for a vector database destination in Airbyte during the setup of the destination connector. Here's how:

1. During setup of the destination connector, you will find a section called "Processing". This is where you specify how individual records are split into chunks.
2. You can specify the text fields that should be split into chunks. All text fields are concatenated into a single string and then split into chunks of the configured length. You can access nested fields in the record using dot notation, e.g. `user.name` will access the `name` field in the `user` object. It's also possible to use wildcards to access all fields in an object, e.g. `users.*.name` will access the `name` field in every entry of the `users` array (see the sketch after the links below).
3. The chunk length is measured in tokens produced by the `tiktoken` library. The maximum is 8191 tokens, which is the maximum input length supported by the `text-embedding-ada-002` model.
4. If specified, metadata fields are stored as-is along with the embedded text chunks. Note that metadata fields can only be used for filtering, not for retrieval, and must be of type string, number, or boolean (all other values are ignored). There is a 40 KB limit on the total size of the metadata saved for each entry.

Please refer to these knowledge sources for more details:
- [PGVector Destination Configuration](https://docs.airbyte.com/integrations/destinations/pgvector#configuration)
- [Snowflake Cortex Destination Configuration](https://docs.airbyte.com/integrations/destinations/snowflake-cortex#configuration)
- [Vector Database (powered by LangChain) Overview](https://docs.airbyte.com/integrations/destinations/langchain#overview)
- [Tutorial on using Milvus and Airbyte](https://airbyte.com/tutorials/use-milvus-and-airbyte-for-similarity-search-on-all-your-data)
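To make steps 2 and 3 concrete, here is a minimal Python sketch (not Airbyte's actual implementation) of how dot-notation and wildcard paths could resolve against a record, and how `tiktoken` measures the resulting text against the `text-embedding-ada-002` limit. The `record`, the field list, and the helper names are invented for the example:

```python
import tiktoken

def resolve_path(value, parts):
    """Resolve a dot-notation path against a record; '*' fans out over a list."""
    if not parts:
        return [value] if isinstance(value, str) else []
    head, rest = parts[0], parts[1:]
    if head == "*" and isinstance(value, list):
        results = []
        for item in value:
            results.extend(resolve_path(item, rest))
        return results
    if isinstance(value, dict) and head in value:
        return resolve_path(value[head], rest)
    return []

record = {  # hypothetical record, for illustration only
    "user": {"name": "Ada"},
    "users": [{"name": "Grace"}, {"name": "Linus"}],
    "body": "Some long text to embed...",
}

# Step 2: select text fields by dot notation and concatenate them.
text_fields = ["user.name", "users.*.name", "body"]
texts = []
for path in text_fields:
    texts.extend(resolve_path(record, path.split(".")))
document = " ".join(texts)

# Step 3: chunk length is measured in tiktoken tokens; 8191 is the
# maximum input length of text-embedding-ada-002.
enc = tiktoken.encoding_for_model("text-embedding-ada-002")
tokens = enc.encode(document)
assert len(tokens) <= 8191, "chunk exceeds the model's token limit"
print(f"{len(tokens)} tokens: {document!r}")
```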
m
What if I want to do some custom chunking strategy, like one that uses LLM calls @kapa.ai
u
Airbyte allows you to configure how each record is split into text fields and metadata fields in the "Processing" section. All text fields are concatenated into a single string and then split into chunks of a configured length. If specified, the metadata fields are stored as-is along with the embedded text chunks. The chunking process uses the [Langchain Python library](https://python.langchain.com/docs/get_started/introduction). The chunk length is measured in tokens produced by the `tiktoken` library. The maximum is 8191 tokens, which is the maximum length supported by the `text-embedding-ada-002` model.

However, if you want to implement a custom chunking strategy that uses LLM calls, you would need to modify the code that handles the chunking process; the provided knowledge sources do not contain specific information on how to do this. For more information, you can refer to the documentation for specific destinations like [Milvus](https://docs.airbyte.com/integrations/destinations/milvus#configuration) and [Snowflake Cortex](https://docs.airbyte.com/integrations/destinations/snowflake-cortex#configuration).
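If you do fork a connector to experiment, here is a minimal sketch of what an LLM-assisted chunker could look like. This is not an official Airbyte or Langchain API: the `call_llm` helper is a hypothetical stand-in for your own LLM client, and the marker-based prompt and token-limit fallback are assumptions made for illustration.

```python
import tiktoken

MAX_TOKENS = 8191  # text-embedding-ada-002 input limit
enc = tiktoken.encoding_for_model("text-embedding-ada-002")

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for your LLM client (OpenAI, local model, ...)."""
    raise NotImplementedError("wire up your own LLM call here")

def token_split(text: str, max_tokens: int = MAX_TOKENS) -> list[str]:
    """Fallback: hard split on token boundaries using tiktoken."""
    tokens = enc.encode(text)
    return [enc.decode(tokens[i:i + max_tokens])
            for i in range(0, len(tokens), max_tokens)]

def llm_chunk(text: str, max_tokens: int = MAX_TOKENS) -> list[str]:
    """Ask the LLM for semantic split points, then enforce the token limit."""
    prompt = (
        "Insert the marker <<<SPLIT>>> between self-contained sections "
        "of the following text. Return the full text with markers only:\n\n"
        + text
    )
    try:
        marked = call_llm(prompt)
        chunks = [c.strip() for c in marked.split("<<<SPLIT>>>") if c.strip()]
    except Exception:
        chunks = [text]  # LLM unavailable or misbehaving: fall back
    # Any chunk still over the embedding model's limit gets a hard token split.
    final = []
    for chunk in chunks:
        if len(enc.encode(chunk)) > max_tokens:
            final.extend(token_split(chunk, max_tokens))
        else:
            final.append(chunk)
    return final
```

The design keeps a deterministic fallback: whatever the LLM returns, every chunk is re-checked against the 8191-token limit so the downstream embedding call cannot fail on input length.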