Ashika Umanga Umagiliya
07/17/2024, 7:54 AM

kapa.ai
07/17/2024, 7:55 AM
This connector maps a stream to an Iceberg table and a namespace to an Iceberg database. Fields in the Airbyte message become different columns in the Iceberg table. Each table will contain the following columns:
- `_airbyte_ab_id`: a randomly generated UUID.
- `_airbyte_emitted_at`: a timestamp representing when the event was received from the data source.
- `_airbyte_data`: a JSON text representing the extracted data ([source](https://docs.airbyte.com/integrations/destinations/iceberg#iceberg)).

Every ten thousand incoming Airbyte records in a stream (one batch) produce one data file (Parquet/Avro) in the Iceberg table. This batch size can be configured via the `Data file flushing batch size` property.
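Not from the docs cited above, just a minimal PySpark sketch of what this output schema looks like when queried; the catalog, namespace, table, and `email` field names are hypothetical placeholders:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import get_json_object

# Assumes an Iceberg-enabled Spark session; catalog/namespace/table
# names are hypothetical, not from the Airbyte docs.
spark = (
    SparkSession.builder
    .appName("inspect-airbyte-iceberg-output")
    .getOrCreate()
)

# Each Airbyte stream maps to one Iceberg table in the namespace.
df = spark.table("my_catalog.my_db.users")

# The three Airbyte-generated columns described above.
df.select("_airbyte_ab_id", "_airbyte_emitted_at", "_airbyte_data").show(5)

# Individual source fields live inside the _airbyte_data JSON text,
# so extract them with JSON functions ('email' is a hypothetical field).
df.select(get_json_object("_airbyte_data", "$.email").alias("email")).show(5)
```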
As the quantity of Iceberg data files grows, it creates an unnecessary amount of metadata and makes queries less efficient because of file-open costs. Iceberg provides a data file compaction action to improve this. The connector also offers automatic compaction when a stream closes, enabled by the `Auto compact data files` property, and you can specify the target size of the compacted Iceberg data files ([source](https://docs.airbyte.com/integrations/destinations/iceberg#iceberg)).
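For illustration only, since the docs above describe the connector's built-in auto-compaction: a similar result can be achieved manually with Iceberg's `rewrite_data_files` Spark procedure. The catalog/table names and the 128 MB target size below are assumptions:

```python
from pyspark.sql import SparkSession

# Assumes an Iceberg-enabled Spark session, as in the previous sketch.
spark = SparkSession.builder.appName("compact-airbyte-iceberg").getOrCreate()

# Iceberg's rewrite_data_files procedure merges small data files into
# larger ones; the table name and the 128 MB (134217728 bytes) target
# size are hypothetical choices, not values from the Airbyte docs.
spark.sql("""
    CALL my_catalog.system.rewrite_data_files(
        table => 'my_db.users',
        options => map('target-file-size-bytes', '134217728')
    )
""")
```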
There is no mention of using an internal Spark job in the provided knowledge sources.

Ashika Umanga Umagiliya
07/17/2024, 7:59 AM