# ask-community-for-troubleshooting

Jiyuan Zheng

09/06/2021, 6:27 PM
Hi, our company (Bolt) is considering using Airbyte to replicate data from Postgres to BigQuery on an ongoing basis (roughly 500 GB of data, ~50 tables). Ideally, I would love to keep the destination as up-to-date as possible, so I am setting the sync frequency to 5 minutes. So far I have been testing Airbyte with one connection for the entire job and noticed a few pain points:
• The sync takes a long time: 25 GB/h, so 500 GB would take more than 20 hours, I assume.
• I don’t have the flexibility to add or change the tables to replicate. Every time I make any change to the existing connection, I need to restart the whole replication from scratch.
• Sometimes sync jobs fail without a clear reason. I found it very hard to debug in general; resetting the connection fixes the issue most of the time, but it then takes a very long time to bring the destination up to date again.
Because of the reasons above, I am thinking about setting up multiple connections, but my concerns are:
• Operations: I need to manage a lot of replication slots myself. To create a new connection, I need to manually go to the production DB, create a new replication slot, and use that slot during setup. Any good solution for this (Jenkins, CI/CD, Terraform, or anything else)?
• Monitoring: when replication lags and the WAL logs build up in the source DB, it is hard to identify which connection is problematic. Also, it only supports Slack webhooks now (ideally on-call engineers should be alerted). What/how do you do monitoring?
Thank you so much for spending time reading my long question, and please let me know your thoughts and recommendations. 🙏
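The manual slot-creation step described above can be scripted so CI/CD creates one slot per planned connection. A minimal sketch, assuming Python with psycopg2 and a `PG_DSN` environment variable pointing at the source database (both assumptions, not from the thread); `pg_create_logical_replication_slot` and the `pgoutput` plugin are standard Postgres:

```python
import os

def slot_name(connection_label: str) -> str:
    """Derive a valid Postgres slot name (lowercase, alphanumerics and _ only, <=63 chars)."""
    cleaned = "".join(c if c.isalnum() else "_" for c in connection_label.lower())
    return f"airbyte_{cleaned}"[:63]

def create_slot_sql(name: str) -> str:
    # pg_create_logical_replication_slot is a built-in Postgres function;
    # 'pgoutput' is the built-in logical decoding plugin used for CDC.
    return f"SELECT pg_create_logical_replication_slot('{name}', 'pgoutput');"

if __name__ == "__main__":
    import psycopg2  # assumption: installed in the CI environment
    conn = psycopg2.connect(os.environ["PG_DSN"])
    with conn, conn.cursor() as cur:
        cur.execute(create_slot_sql(slot_name("orders-to-bigquery")))
```

Running this from a pipeline (rather than by hand on the production DB) also leaves an audit trail of which slots exist and why.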
[DEPRECATED] Marcos Marx

09/06/2021, 10:46 PM
1. The default batch size is 1000 records; if you have the resources, you can change this number and decrease the sync time by reading more records each time. See the issue, and also a Slack discussion showing how to change this.
2. Schema evolution for new tables and new columns is something we're working to solve and release soon, but for now you need to reset the connection. Check the issue here.
Operations: you can probably use our API to handle this with a scripting language and CI/CD to trigger the action.
Monitoring: you have the option to use any webhook, not only the Slack one, so you can send notifications to another tool and reach your engineers.
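Using the API from CI/CD, as suggested above, could look like the following sketch. The endpoint path (`api/v1/connections/create`), payload shape, and default port are assumptions based on Airbyte's OSS Configuration API; verify them against your Airbyte version:

```python
import json

AIRBYTE_URL = "http://localhost:8000"  # assumption: default OSS deployment

def connection_payload(source_id: str, destination_id: str, label: str) -> dict:
    """Build a (simplified, hypothetical) connection-create request body."""
    return {
        "sourceId": source_id,
        "destinationId": destination_id,
        "status": "active",
        # Sync every 5 minutes, matching the frequency discussed in the thread.
        "schedule": {"units": 5, "timeUnit": "minutes"},
        "name": f"pg-to-bq-{label}",
    }

if __name__ == "__main__":
    import requests  # assumption: installed in the CI environment
    resp = requests.post(
        f"{AIRBYTE_URL}/api/v1/connections/create",
        json=connection_payload("<source-uuid>", "<dest-uuid>", "orders"),
        timeout=30,
    )
    resp.raise_for_status()
    print(json.dumps(resp.json(), indent=2))
```

A real request also needs the connection's `syncCatalog` (which streams to sync and in which mode), omitted here for brevity.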
👍 1
Davin Chia (Airbyte)

09/07/2021, 12:49 AM
Is it 500GB every single job or is that only the first load? What is the expected size of each 5 min job?
Jiyuan Zheng

09/07/2021, 3:03 AM
500GB is the initial sync. The data is growing by ~2 GB per hour for these tables, so I guess every 5-minute job would have less than 200 MB of data.
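The per-job sizing estimate above checks out arithmetically: 2 GB/hour split across twelve 5-minute windows is roughly 170 MB per job.

```python
growth_gb_per_hour = 2
windows_per_hour = 60 // 5          # twelve 5-minute sync jobs per hour
mb_per_window = growth_gb_per_hour * 1024 / windows_per_hour
assert mb_per_window < 200          # ~171 MB, consistent with the estimate
```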
Subodh (Airbyte)

09/07/2021, 7:37 AM
@Jiyuan Zheng this is great feedback! Thanks for sharing. I have a few questions/thoughts.
1. I see you mentioned `replication slots`, so I assume you are using CDC. In that case, `batch_size` cannot be set from your end.
2. Which phase of the sync is taking most of the time? Is it reading the data from the source, loading it into the destination, or the normalization step?
3. How big do your WAL logs get? I am curious whether we are spending too much time trying to identify where to start in the WAL logs when they get too big.
4. The failure that you observed: which part of the sync did you see it in, the source, destination, or normalization? Also, were these interim failures, i.e. when you restarted the sync, did they go away on their own? I am trying to understand whether it's related to the protocol or the connectors.
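On the WAL-size question, and the earlier concern about identifying which connection is lagging: Postgres exposes per-slot backlog directly, so a small check can alert on the specific slot. A sketch, assuming one slot per Airbyte connection; `pg_replication_slots`, `pg_current_wal_lsn`, and `pg_wal_lsn_diff` are standard in Postgres 10+, while the 5 GiB threshold is an illustrative assumption (about one hour of WAL at the rates discussed here):

```python
# Query to run against the source DB: backlog in bytes for each logical slot.
LAG_QUERY = """
SELECT slot_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) AS lag_bytes
FROM pg_replication_slots
WHERE slot_type = 'logical';
"""

def lsn_to_bytes(lsn: str) -> int:
    """Convert a Postgres LSN like '16/B374D848' to an absolute byte position."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) + int(lo, 16)

def lag_bytes(current_lsn: str, confirmed_lsn: str) -> int:
    """The same arithmetic pg_wal_lsn_diff performs, for client-side alerting."""
    return lsn_to_bytes(current_lsn) - lsn_to_bytes(confirmed_lsn)

if __name__ == "__main__":
    THRESHOLD = 5 * 1024**3  # assumption: alert once a slot falls 5 GiB behind
    backlog = lag_bytes("17/0", "16/B374D848")  # example LSNs
    if backlog > THRESHOLD:
        print(f"ALERT: replication slot lagging by {backlog} bytes")
```

Feeding `LAG_QUERY` results into whatever pages the on-call rotation would address the "which connection is problematic" gap without waiting on webhook improvements.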
Jiyuan Zheng

09/08/2021, 1:51 AM
@Subodh (Airbyte)
1. Yes, I am using CDC, but I would assume `batch_size` will still be used during the initial sync?
2. The main issue is during the initial sync. It is taking a long time and the connection is unstable. I haven't encountered many issues during subsequent syncs yet.
3. The WAL log size will be about 5 GB per hour in production. Currently, I am testing on staging, which has about half of that, 2 ~ 2.5 GB per hour.
4. Normalization tends to fail quite often, so we decided to turn it off and purely evaluate the CDC feature for now. Airbyte reprocesses all the data during normalization right now, so we can't use it as-is, and we are potentially looking into building our own incremental normalization. If you can point us to some resources on how to achieve incremental normalization, that would help a lot 🙏
Subodh (Airbyte)

09/08/2021, 4:25 PM
@Jiyuan Zheng
1. Currently we don't have a mechanism for users to set the batch size via the setup wizard for CDC. I assume increasing the batch size should improve interim issues with the initial sync. I have created an issue to track this: https://github.com/airbytehq/airbyte/issues/5916
2. Regarding incremental normalization, unfortunately there is no way to do that at this point. We have it on our roadmap: https://github.com/airbytehq/airbyte/issues/4286. Please comment on the issue to highlight its importance.
👍 1
s

09/09/2021, 5:46 PM
@Jiyuan Zheng regarding point #3 in your OP, do you have logs/examples of these syncs failing without any clear reason? As many examples as you can provide would be super helpful.
Jiyuan Zheng

09/10/2021, 2:19 AM
s

09/10/2021, 3:10 AM
thanks!