# ask-ai
I want to ingest data from a Postgres source using CDC. My targets are Iceberg tables. I want to know how Airbyte internally writes to the Iceberg files. Does it use an internal Spark job?
Airbyte uses logical replication of the Postgres write-ahead log (WAL), via a replication plugin, to incrementally capture deletes for Change Data Capture (CDC) [source](https://docs.airbyte.com/integrations/sources/postgres#setup-using-cdc).

When writing to Iceberg, the connector maps an incoming `stream` to an Iceberg `table` and a `namespace` to an Iceberg `database`. Fields in the Airbyte message become columns in the Iceberg table. Each table contains the following columns:

- `_airbyte_ab_id`: a randomly generated UUID.
- `_airbyte_emitted_at`: a timestamp representing when the event was received from the data source.
- `_airbyte_data`: a JSON text representing the extracted data [source](https://docs.airbyte.com/integrations/destinations/iceberg#iceberg).
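To make that mapping concrete, here is a minimal sketch of how such a table could be created with the Iceberg Java API. The `HadoopCatalog`, warehouse path, namespace, and table name are assumptions for illustration; this shows the layout described above, not the connector's actual internals:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.PartitionSpec;
import org.apache.iceberg.Schema;
import org.apache.iceberg.Table;
import org.apache.iceberg.catalog.Namespace;
import org.apache.iceberg.catalog.TableIdentifier;
import org.apache.iceberg.hadoop.HadoopCatalog;
import org.apache.iceberg.types.Types;

public class CreateAirbyteStyleTable {
    public static void main(String[] args) {
        // Hypothetical warehouse location; substitute your own storage path and catalog type.
        HadoopCatalog catalog = new HadoopCatalog(new Configuration(), "s3a://my-warehouse");

        // Airbyte namespace -> Iceberg database, Airbyte stream -> Iceberg table.
        TableIdentifier id = TableIdentifier.of(Namespace.of("public"), "users");

        // The three raw columns described above.
        Schema schema = new Schema(
            Types.NestedField.required(1, "_airbyte_ab_id", Types.StringType.get()),
            Types.NestedField.required(2, "_airbyte_emitted_at", Types.TimestampType.withZone()),
            Types.NestedField.required(3, "_airbyte_data", Types.StringType.get()));

        Table table = catalog.createTable(id, schema, PartitionSpec.unpartitioned());
        System.out.println("Created " + table.name());
    }
}
```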
Every ten thousand incoming records in a stream (one batch) produce one data file (Parquet/Avro) in the Iceberg table; this batch size can be configured via the `Data file flushing batch size` property. As the number of Iceberg data files grows, it leads to an unnecessary amount of metadata and less efficient queries due to file-open costs. Iceberg provides a data file compaction action to improve this, and the connector also offers an auto-compact action when a stream closes, controlled by the `Auto compact data files` property. You can specify the target size of the compacted Iceberg data files [source](https://docs.airbyte.com/integrations/destinations/iceberg#iceberg).

There is no mention of using an internal Spark job in the provided knowledge sources.
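For intuition on the batch-to-data-file behavior, here is a hedged sketch of writing one batch as a single Parquet data file with the Iceberg Java API and committing it with an append, i.e. without Spark. The file name, batch contents, and the assumption of an unpartitioned table are illustrative only, not a claim about how the connector is implemented:

```java
import java.io.IOException;
import java.time.OffsetDateTime;
import java.util.List;
import java.util.UUID;

import org.apache.iceberg.DataFile;
import org.apache.iceberg.Table;
import org.apache.iceberg.data.GenericRecord;
import org.apache.iceberg.data.parquet.GenericParquetWriter;
import org.apache.iceberg.io.DataWriter;
import org.apache.iceberg.io.OutputFile;
import org.apache.iceberg.parquet.Parquet;

public class WriteOneBatch {
    static void writeBatch(Table table, List<String> jsonRecords) throws IOException {
        // One batch -> one Parquet data file (illustrative file name; assumes an unpartitioned table).
        OutputFile out = table.io().newOutputFile(table.location() + "/data/batch-00001.parquet");

        DataWriter<GenericRecord> writer = Parquet.writeData(out)
            .schema(table.schema())
            .createWriterFunc(GenericParquetWriter::buildWriter)
            .withSpec(table.spec())
            .overwrite()
            .build();

        try {
            for (String json : jsonRecords) {
                GenericRecord rec = GenericRecord.create(table.schema());
                rec.setField("_airbyte_ab_id", UUID.randomUUID().toString());
                rec.setField("_airbyte_emitted_at", OffsetDateTime.now());
                rec.setField("_airbyte_data", json);
                writer.write(rec);
            }
        } finally {
            writer.close();
        }

        // Commit the new file to the table as a single append snapshot.
        DataFile dataFile = writer.toDataFile();
        table.newAppend().appendFile(dataFile).commit();
    }
}
```

Under this kind of scheme, each flushed batch adds one data file plus a small metadata commit, which is why many small batches inflate metadata and motivate the compaction options mentioned above.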
Does that mean Airbyte uses the Iceberg Java API directly, without Spark?