# ask-ai
@kapa.ai I want to add concurrency for a stream. The stream is incremental; all the slices are generated using the `stream_slices` method. The slices have multiple properties that aren't based on time. To make it incremental, I store the "latest" processed slice in the state, and when generating the slices I only generate those that come after the "latest" slice stored in the state. Something like:
```python
from itertools import dropwhile

def stream_slices(self, stream_state, **kwargs):
    last_successful_slice = self.get_last_successful_slice(stream_state)
    slice_params = self.get_slice_params()

    if last_successful_slice:
        # Resume from the last successful slice onward (note: dropwhile
        # stops *at* the matching slice, so it is yielded again)
        yield from dropwhile(
            lambda stream_slice: stream_slice != last_successful_slice,
            self.generate_slices(slice_params)
        )
    else:
        yield from self.generate_slices(slice_params)
```
how can I add concurrency for a stream like this one?
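The resumption logic above can be seen on plain data; the slice contents here are made-up placeholders, not anything from the real stream:

```python
from itertools import dropwhile

slices = [{"id": "a"}, {"id": "b"}, {"id": "c"}, {"id": "d"}]
last_successful = {"id": "b"}

# dropwhile discards slices while the predicate is true; it stops at the
# first slice equal to last_successful, so that slice is yielded again.
resumed = list(dropwhile(lambda s: s != last_successful, slices))
print(resumed)  # [{'id': 'b'}, {'id': 'c'}, {'id': 'd'}]
```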
To add concurrency to your stream, you would need to use the Concurrent CDK provided by Airbyte. Here are the steps you would need to follow:

1. Import the necessary libraries: You would need to import the necessary libraries for concurrency. These include `StreamFacade`, `ConcurrentCursor`, `FinalStateCursor`, and `EpochValueConcurrentStreamStateConverter`, among others. Here is an example of how to import these libraries:

```python
from airbyte_cdk.sources.streams.concurrent.adapters import StreamFacade
from airbyte_cdk.sources.streams.concurrent.cursor import CursorField, ConcurrentCursor, FinalStateCursor
from airbyte_cdk.sources.streams.concurrent.state_converters.datetime_stream_state_converter import EpochValueConcurrentStreamStateConverter
```
(Source: [Custom-python-connector > Concurrent Part 2](https://docs.airbyte.com/connector-development/tutorials/custom-python-connector/concurrency))

2. Update the `streams` method: You would need to update the `streams` method to wrap the streams in an adapter class to enable concurrency.
3. Use `ConcurrentCursor` for state management: You would need to use `ConcurrentCursor` to support concurrent state management. If your stream does not support incremental reads, you can use a `FinalStateCursor` instead.
4. Add a state converter: You would need to add a state converter to describe how the state cursor is formatted. For example, you can use the `EpochValueConcurrentStreamStateConverter` if the `get_updated_state` method returns the cursor as a timestamp.
5. Define your own `stream_slices` method: If you want your parent and child streams to both be incremental, they should share the same cursor field, and you need to define your own `stream_slices` method on the child stream, reading the parent stream with `SyncMode.incremental`. Here is an example of how to define your own `stream_slices` method:
```python
def stream_slices(self, stream_state: Mapping[str, Any] = None, **kwargs) -> Iterable[Optional[Mapping[str, any]]]:
    for _slice in self._parent_stream.stream_slices():
        for parent_record in self._parent_stream.read_records(sync_mode=SyncMode.full_refresh, stream_slice=_slice):
            yield parent_record
```
(Source: [Custom-python-connector > Reading from a subresource Part 2](https://docs.airbyte.com/connector-development/tutorials/custom-python-connector/reading-from-a-subresource)) Please note that the implementation of concurrency would depend on the specific requirements of your stream and may require additional steps or modifications. For more information, you can refer to the [Airbyte documentation on concurrency](https://docs.airbyte.com/connector-development/tutorials/custom-python-connector/concurrency).
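As a library-free illustration of step 4: a state converter just maps the cursor value to and from the serialized form stored in the state. This toy class (the class and method names are mine, not Airbyte's) mimics what an epoch-based converter conceptually does:

```python
from datetime import datetime, timezone

class EpochStateConverter:
    """Toy converter: serialize a datetime cursor as an epoch integer."""

    def to_state_value(self, cursor: datetime) -> int:
        # Store the cursor as seconds since the Unix epoch
        return int(cursor.timestamp())

    def from_state_value(self, value: int) -> datetime:
        # Rebuild the datetime cursor from the stored epoch value
        return datetime.fromtimestamp(value, tz=timezone.utc)

converter = EpochStateConverter()
cursor = datetime(2024, 1, 1, tzinfo=timezone.utc)
state = converter.to_state_value(cursor)
print(state)  # 1704067200
assert converter.from_state_value(state) == cursor
```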
@kapa.ai can you explain what the concurrent state does? I don't see the point of using some of the classes you told me to use; they seem to be based on a "cursor_field" that I'm not using
A message from kapa.ai
In Airbyte, the concurrent state is used to manage the state of data synchronization when multiple streams are being processed concurrently. It helps to keep track of the progress of each stream and ensures that the data synchronization process can resume from where it left off in case of interruptions.

The `ConcurrentCursor` class is used to manage this concurrent state. It provides methods to update and retrieve the state of each stream. The state is usually based on a "cursor field", which is a field in the data records that is used to determine the order or "recency" of the records. This cursor field is used to identify new or updated records since the last sync.

However, since you mentioned that you are not using a cursor field, you might not need the `ConcurrentCursor` class or the other cursor-field-related classes. Instead, you are using the "last successful slice" as the state, which is a different approach to managing state. If you want to add concurrency to your stream, you would need to ensure that your state management approach is compatible with concurrent processing. This might involve modifying your `stream_slices` method to handle multiple slices concurrently and updating the state accordingly. Here is an example of how you might modify your `stream_slices` method for concurrent processing:

```python
import concurrent.futures
from itertools import dropwhile

def stream_slices(self, stream_state, **kwargs):
    last_successful_slice = self.get_last_successful_slice(stream_state)
    slice_params = self.get_slice_params()

    if last_successful_slice:
        # Use a thread pool to process the slices concurrently
        with concurrent.futures.ThreadPoolExecutor() as executor:
            # Map the generate_slices function to each slice
            executor.map(self.generate_slices, dropwhile(
                lambda stream_slice: stream_slice != last_successful_slice,
                self.generate_slices(slice_params)
            ))
    else:
        # Process the slices concurrently if there is no last successful slice
        with concurrent.futures.ThreadPoolExecutor() as executor:
            executor.map(self.generate_slices, self.generate_slices(slice_params))
```
Please note that this is just a basic example and the actual implementation might vary depending on the specific requirements of your stream. For more information on concurrency in Airbyte, you can refer to the [Airbyte documentation on concurrency](https://docs.airbyte.com/connector-development/tutorials/custom-python-connector/concurrency).
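One subtlety with making a "last successful slice" state concurrent: slices can finish out of order, so the stored slice may only advance past the contiguous prefix of completed slices, or a crash could skip unfinished work. A minimal, library-free sketch of that checkpoint rule (all names are illustrative, not Airbyte APIs):

```python
def latest_checkpoint(slice_order, completed):
    """Return the last slice of the contiguous completed prefix, or None.

    slice_order: slices in the order they were generated.
    completed: set of slices that have finished processing (any order).
    """
    checkpoint = None
    for s in slice_order:
        if s not in completed:
            break  # a gap: nothing past this point is safe to checkpoint
        checkpoint = s
    return checkpoint

order = ["s1", "s2", "s3", "s4"]
# s3 finished before s2, so the state may only advance to s1
print(latest_checkpoint(order, {"s1", "s3"}))        # s1
print(latest_checkpoint(order, {"s1", "s2", "s3"}))  # s3
```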
@kapa.ai what is concurrent.futures.ThreadPoolExecutor used for?
A message from kapa.ai
The `concurrent.futures.ThreadPoolExecutor` is a class in Python's `concurrent.futures` module. It is used for creating a pool of worker threads and submitting tasks to this pool. The tasks are then executed concurrently by the worker threads. Here's a basic example of how it's used:

```python
import concurrent.futures

def some_function(param):
    # some task here
    return param * 2

with concurrent.futures.ThreadPoolExecutor() as executor:
    future = executor.submit(some_function, 21)
    return_value = future.result()  # 42
```
In this example, `executor.submit(some_function, param)` submits the function to the pool of worker threads and returns a `Future` object. This object represents the pending execution of the function and allows you to retrieve its result with `future.result()`. This is particularly useful when you have tasks that can be executed independently and concurrently, especially I/O-bound tasks, since overlapping the waiting time improves the performance of your program. For more information, you can refer to the [Python documentation on `concurrent.futures`](https://docs.python.org/3/library/concurrent.futures.html).
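`executor.map` is the variant used in the earlier snippet; it applies a function to each item of an iterable on the worker threads and yields results in input order:

```python
import concurrent.futures

def square(n):
    return n * n

# Each call to square runs on a worker thread; results come back
# in the same order as the inputs, regardless of completion order.
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(square, [1, 2, 3, 4]))

print(results)  # [1, 4, 9, 16]
```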
@kapa.ai but with your code, I'll be generating the slices in parallel, and what I want to do is to transform the stream to be concurrent 😕 I don't have problems with the slice generation, but I want to concurrently read the stream. And I'm not using a cursor_field
A message from kapa.ai
I see, you want to read the stream slices concurrently, not generate them concurrently. My apologies for misunderstanding your question earlier. In Airbyte, you can achieve this by using the `ConcurrentSourceAdapter` class. This class allows you to read multiple stream slices in parallel, improving the performance of your connector. Here's a basic example of how to use `ConcurrentSourceAdapter`:

```python
from airbyte_cdk.sources.concurrent_source.concurrent_source_adapter import ConcurrentSourceAdapter, ConcurrentSource

class SourceSurveyMonkeyDemo(ConcurrentSourceAdapter):
    def __init__(self, config: Optional[Mapping[str, Any]], state: Optional[Mapping[str, Any]]):
        if config:
            concurrency_level = min(config.get("num_workers", _DEFAULT_CONCURRENCY), _MAX_CONCURRENCY)
        else:
            concurrency_level = _DEFAULT_CONCURRENCY
        _logger.info(f"Using concurrent cdk with concurrency level {concurrency_level}")
        concurrent_source = ConcurrentSource.create(
            concurrency_level, concurrency_level // 2, _logger, self._slice_logger, self.message_repository
        )
        super().__init__(concurrent_source)
        self._config = config
        self._state = state
```
(Source: [Custom-python-connector > Concurrent Part 1](https://docs.airbyte.com/connector-development/tutorials/custom-python-connector/concurrency)) In this example, `ConcurrentSourceAdapter` is used to create a source that reads multiple stream slices concurrently. The `concurrency_level` determines the maximum number of slices that can be read in parallel. Please note that this is a basic example and the actual implementation might vary depending on the specific requirements of your stream; you would need to adapt it to fit your use case. For more information on concurrency in Airbyte, you can refer to the [Airbyte documentation on concurrency](https://docs.airbyte.com/connector-development/tutorials/custom-python-connector/concurrency).
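Independently of the Airbyte classes, the essence of "reading slices concurrently" can be sketched with the standard library alone. Here `read_slice` and the slice list are hypothetical stand-ins for real per-slice reads, not anything from the CDK:

```python
import concurrent.futures

def read_slice(stream_slice):
    # Hypothetical stand-in for fetching all records of one slice
    return [f"record-{stream_slice}-{i}" for i in range(2)]

def read_concurrently(slices, concurrency_level=2):
    # Read each slice on a worker thread, but emit records grouped by
    # slice in slice order, so a "last successful slice" state stays
    # meaningful even though the reads overlap.
    with concurrent.futures.ThreadPoolExecutor(max_workers=concurrency_level) as executor:
        futures = [executor.submit(read_slice, s) for s in slices]
        for future in futures:
            yield from future.result()

records = list(read_concurrently(["a", "b"]))
print(records)  # ['record-a-0', 'record-a-1', 'record-b-0', 'record-b-1']
```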