# ask-ai
m
I want to have a custom chunking strategy for a vector db destination. How do I do that?
u
You can configure the chunking strategy for a vector database destination in Airbyte during the setup of the destination connector. Here's how:

1. During setup of the destination connector, you will find a section called "Processing". This is where you specify how individual records are split into chunks.
2. You can specify the text fields that should be split into chunks. All text fields are concatenated into a single string and then split into chunks of the configured length. You can access nested fields in the record using dot notation, e.g. `user.name` will access the `name` field in the `user` object. It's also possible to use wildcards to access all fields in an object, e.g. `users.*.name` will access the `name` field in every entry of the `users` array (see the sketch after the links below).
3. The chunk length is measured in tokens produced by the `tiktoken` library. The maximum is 8191 tokens, which is the maximum input length supported by the `text-embedding-ada-002` model.
4. If specified, metadata fields are stored as-is along with the embedded text chunks. Note that metadata fields can only be used for filtering, not for retrieval, and must be of type string, number, or boolean (all other values are ignored). There is a 40 KB limit on the total size of the metadata saved for each entry.

Please refer to these knowledge sources for more details:
- [PGVector Destination Configuration](https://docs.airbyte.com/integrations/destinations/pgvector#configuration)
- [Snowflake Cortex Destination Configuration](https://docs.airbyte.com/integrations/destinations/snowflake-cortex#configuration)
- [Vector Database (powered by LangChain) Overview](https://docs.airbyte.com/integrations/destinations/langchain#overview)
- [Tutorial on using Milvus and Airbyte](https://airbyte.com/tutorials/use-milvus-and-airbyte-for-similarity-search-on-all-your-data)
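To make steps 2 and 3 concrete, here is a minimal Python sketch (not Airbyte's actual implementation) of how dot-notation and wildcard paths could resolve against a record, and how `tiktoken` measures the resulting text against the `text-embedding-ada-002` limit. The `record`, the field list, and the helper names are invented for the example:

```python
import tiktoken

def resolve_path(value, parts):
    """Resolve a dot-notation path against a record; '*' fans out over a list."""
    if not parts:
        return [value] if isinstance(value, str) else []
    head, rest = parts[0], parts[1:]
    if head == "*" and isinstance(value, list):
        results = []
        for item in value:
            results.extend(resolve_path(item, rest))
        return results
    if isinstance(value, dict) and head in value:
        return resolve_path(value[head], rest)
    return []

record = {  # hypothetical record, for illustration only
    "user": {"name": "Ada"},
    "users": [{"name": "Grace"}, {"name": "Linus"}],
    "body": "Some long text to embed...",
}

# Step 2: select text fields by dot notation and concatenate them.
text_fields = ["user.name", "users.*.name", "body"]
texts = []
for path in text_fields:
    texts.extend(resolve_path(record, path.split(".")))
document = " ".join(texts)

# Step 3: chunk length is measured in tiktoken tokens; 8191 is the
# maximum input length of text-embedding-ada-002.
enc = tiktoken.encoding_for_model("text-embedding-ada-002")
tokens = enc.encode(document)
assert len(tokens) <= 8191, "chunk exceeds the model's token limit"
print(f"{len(tokens)} tokens: {document!r}")
```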
m
What if I want to do some custom chunking strategy, like one that uses LLM calls @kapa.ai
u
Airbyte allows you to configure how each record is split into text fields and metadata fields in the "Processing" section. All text fields are concatenated into a single string and then split into chunks of a configured length. If specified, the metadata fields are stored as-is along with the embedded text chunks. The chunking process uses the [Langchain Python library](https://python.langchain.com/docs/get_started/introduction). The chunk length is measured in tokens produced by the `tiktoken` library. The maximum is 8191 tokens, which is the maximum length supported by the `text-embedding-ada-002` model.

However, if you want to implement a custom chunking strategy that uses LLM calls, you would need to modify the code that handles the chunking process; the provided knowledge sources do not contain specific information on how to do this. For more information, you can refer to the documentation for specific destinations like [Milvus](https://docs.airbyte.com/integrations/destinations/milvus#configuration) and [Snowflake Cortex](https://docs.airbyte.com/integrations/destinations/snowflake-cortex#configuration).
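If you do fork a connector to experiment, here is a minimal sketch of what an LLM-assisted chunker could look like. This is not an official Airbyte or Langchain API: the `call_llm` helper is a hypothetical stand-in for your own LLM client, and the marker-based prompt and token-limit fallback are assumptions made for illustration.

```python
import tiktoken

MAX_TOKENS = 8191  # text-embedding-ada-002 input limit
enc = tiktoken.encoding_for_model("text-embedding-ada-002")

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for your LLM client (OpenAI, local model, ...)."""
    raise NotImplementedError("wire up your own LLM call here")

def token_split(text: str, max_tokens: int = MAX_TOKENS) -> list[str]:
    """Fallback: hard split on token boundaries using tiktoken."""
    tokens = enc.encode(text)
    return [enc.decode(tokens[i:i + max_tokens])
            for i in range(0, len(tokens), max_tokens)]

def llm_chunk(text: str, max_tokens: int = MAX_TOKENS) -> list[str]:
    """Ask the LLM for semantic split points, then enforce the token limit."""
    prompt = (
        "Insert the marker <<<SPLIT>>> between self-contained sections "
        "of the following text. Return the full text with markers only:\n\n"
        + text
    )
    try:
        marked = call_llm(prompt)
        chunks = [c.strip() for c in marked.split("<<<SPLIT>>>") if c.strip()]
    except Exception:
        chunks = [text]  # LLM unavailable or misbehaving: fall back
    # Any chunk still over the embedding model's limit gets a hard token split.
    final = []
    for chunk in chunks:
        if len(enc.encode(chunk)) > max_tokens:
            final.extend(token_split(chunk, max_tokens))
        else:
            final.append(chunk)
    return final
```

The design keeps a deterministic fallback: whatever the LLM returns, every chunk is re-checked against the 8191-token limit so the downstream embedding call cannot fail on input length.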