#ask-community-for-troubleshooting

Anton Podviaznikov

05/24/2022, 1:26 PM
Hi everyone. I'm currently evaluating Airbyte and comparing it to PipelineWise. I use Airbyte `0.38.4-alpha` on k8s and am trying to sync one table from PG to Snowflake. The table has 32 million records. It takes Airbyte anywhere from 2h30m to 3h30m to do the initial sync on this table. PipelineWise takes 37 min. I'm not sure how to get the same numbers. Another thing that confuses me: after the sync is done, both tables in Snowflake have 32 million records, but the table created by PipelineWise is 2.6 GB and the one created by Airbyte is 5 GB. On top of that, why does the Airbyte UI show that 49.25 GB worth of data was processed? Those numbers don't match. Why is that? Any ideas?

Augustin Lafanechere (Airbyte)

05/24/2022, 5:52 PM
Hey, we migrated our support to our online forum, our team is there ready to help you, do you mind posting this question on it?

Davin Chia (Airbyte)

05/25/2022, 9:19 AM
@Anton Podviaznikov did you manage to ask this in the online forum?

Liren Tu (Airbyte)

05/25/2022, 9:11 PM
The reason why Snowflake has more data from the Airbyte sync is that the Airbyte connector first writes the data into `raw` tables, those prefixed with `_raw`. If normalization is enabled, the connector then triggers `dbt` to normalize those raw tables into the final normalized tables. Hence the physical size is roughly 2x.
The `49.25GB` on the UI is the size of the serialized data in JSON format. It is usually an overestimation of the actual data. For example, the actual data may be a number `1234` from column `value`, the serialized JSON looks like `{"value":1234}`, and its size is significantly larger.
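To make the overhead concrete, here is a minimal Python sketch that compares the byte size of a compact JSON record with the byte size of its values alone. The helper name `serialized_overhead` is hypothetical, and counting values as plain text is only an assumption for illustration; it is not how Airbyte measures processed data or how Snowflake stores columns.

```python
import json


def serialized_overhead(record: dict) -> tuple:
    """Return (json_bytes, value_bytes) for a single record.

    json_bytes: size of the record as compact JSON (keys, quotes, braces included).
    value_bytes: size of the values alone, rendered as text (illustrative assumption).
    """
    compact = json.dumps(record, separators=(",", ":"))
    json_bytes = len(compact.encode("utf-8"))
    value_bytes = sum(len(str(v).encode("utf-8")) for v in record.values())
    return json_bytes, value_bytes


# The example from the message above: a column named "value" holding 1234.
json_bytes, value_bytes = serialized_overhead({"value": 1234})
print(json_bytes, value_bytes)  # 14 4
```

The column name is repeated in every serialized record, so for narrow rows the JSON size can be several times the size of the values themselves.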

Anton Podviaznikov

05/26/2022, 5:57 PM
@Liren Tu (Airbyte) thank you! The difference in size now makes sense. I can totally see how JSON data would be way larger.