# ask-ai
I want to ingest data from a Postgres source using CDC. My targets are Iceberg tables. I want to know how Airbyte internally writes to the Iceberg files. Does it use an internal Spark job?
Airbyte uses logical replication of the Postgres write-ahead log (WAL), via a replication plugin, to incrementally capture deletes for Change Data Capture (CDC) [source](https://docs.airbyte.com/integrations/sources/postgres#setup-using-cdc).

When writing to Iceberg, the connector maps an incoming `stream` to an Iceberg `table` and a `namespace` to an Iceberg `database`. Fields in the Airbyte message become columns in the Iceberg table. Each table contains the following columns:

- `_airbyte_ab_id`: a randomly generated UUID.
- `_airbyte_emitted_at`: a timestamp representing when the event was received from the data source.
- `_airbyte_data`: a JSON text representing the extracted data [source](https://docs.airbyte.com/integrations/destinations/iceberg#iceberg).
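To make that mapping concrete, here is a minimal sketch of how such a table could be created with the Iceberg Java API. The `HadoopCatalog`, warehouse path, namespace, and table name are assumptions for illustration; this shows the layout described above, not the connector's actual internals:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.PartitionSpec;
import org.apache.iceberg.Schema;
import org.apache.iceberg.Table;
import org.apache.iceberg.catalog.Namespace;
import org.apache.iceberg.catalog.TableIdentifier;
import org.apache.iceberg.hadoop.HadoopCatalog;
import org.apache.iceberg.types.Types;

public class CreateAirbyteStyleTable {
    public static void main(String[] args) {
        // Hypothetical warehouse location; substitute your own storage path and catalog type.
        HadoopCatalog catalog = new HadoopCatalog(new Configuration(), "s3a://my-warehouse");

        // Airbyte namespace -> Iceberg database, Airbyte stream -> Iceberg table.
        TableIdentifier id = TableIdentifier.of(Namespace.of("public"), "users");

        // The three raw columns described above.
        Schema schema = new Schema(
            Types.NestedField.required(1, "_airbyte_ab_id", Types.StringType.get()),
            Types.NestedField.required(2, "_airbyte_emitted_at", Types.TimestampType.withZone()),
            Types.NestedField.required(3, "_airbyte_data", Types.StringType.get()));

        Table table = catalog.createTable(id, schema, PartitionSpec.unpartitioned());
        System.out.println("Created " + table.name());
    }
}
```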
Every ten thousand incoming records in a stream (one batch) produce one data file (Parquet/Avro) in the Iceberg table; this batch size can be configured via the `Data file flushing batch size` property. As the number of Iceberg data files grows, it leads to an unnecessary amount of metadata and less efficient queries due to file-open costs. Iceberg provides a data file compaction action to improve this, and the connector also offers an auto-compact action when a stream closes, controlled by the `Auto compact data files` property. You can specify the target size of the compacted Iceberg data files [source](https://docs.airbyte.com/integrations/destinations/iceberg#iceberg).

There is no mention of using an internal Spark job in the provided knowledge sources.
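For intuition on the batch-to-data-file behavior, here is a hedged sketch of writing one batch as a single Parquet data file with the Iceberg Java API and committing it with an append, i.e. without Spark. The file name, batch contents, and the assumption of an unpartitioned table are illustrative only, not a claim about how the connector is implemented:

```java
import java.io.IOException;
import java.time.OffsetDateTime;
import java.util.List;
import java.util.UUID;

import org.apache.iceberg.DataFile;
import org.apache.iceberg.Table;
import org.apache.iceberg.data.GenericRecord;
import org.apache.iceberg.data.parquet.GenericParquetWriter;
import org.apache.iceberg.io.DataWriter;
import org.apache.iceberg.io.OutputFile;
import org.apache.iceberg.parquet.Parquet;

public class WriteOneBatch {
    static void writeBatch(Table table, List<String> jsonRecords) throws IOException {
        // One batch -> one Parquet data file (illustrative file name; assumes an unpartitioned table).
        OutputFile out = table.io().newOutputFile(table.location() + "/data/batch-00001.parquet");

        DataWriter<GenericRecord> writer = Parquet.writeData(out)
            .schema(table.schema())
            .createWriterFunc(GenericParquetWriter::buildWriter)
            .withSpec(table.spec())
            .overwrite()
            .build();

        try {
            for (String json : jsonRecords) {
                GenericRecord rec = GenericRecord.create(table.schema());
                rec.setField("_airbyte_ab_id", UUID.randomUUID().toString());
                rec.setField("_airbyte_emitted_at", OffsetDateTime.now());
                rec.setField("_airbyte_data", json);
                writer.write(rec);
            }
        } finally {
            writer.close();
        }

        // Commit the new file to the table as a single append snapshot.
        DataFile dataFile = writer.toDataFile();
        table.newAppend().appendFile(dataFile).commit();
    }
}
```

Under this kind of scheme, each flushed batch adds one data file plus a small metadata commit, which is why many small batches inflate metadata and motivate the compaction options mentioned above.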
Does that mean Airbyte uses the Iceberg Java API directly, without Spark?