# feedback-and-requests
m
Hi, is it supported to connect multiple sources (i.e. multiple instances of Facebook Marketing, each pointing to a different account) to the same destination (i.e. the same Redshift tables)? Would something conflict? Specifically, I am interested in syncing `ads_insights` incrementally (and possibly deduped, if a PK exists). The generated Redshift tables have a minimum size of 47 GB due to all the columns and sub-tables.
u
Sure, you can have multiple sources pointing to the same destination by modifying the schema name so they don't conflict. By the way, how did you manage to get your Facebook Marketing API access? Do you have an application verified by FB? We're stuck at that part.
u
no, I actually want to merge all the data into the same tables, to avoid overallocating space in the cluster (e.g. 5 sources would waste 200 GB)
u
I managed to sync `ads_insights` (and variants with breakdowns) without increasing API call rates (I just created a "business" type app). All other streams (`ads`, `campaigns`, ...) will fail.
u
Wouldn’t you need to track CDC individually by source? I think normally you would sync individually and UNION all into one using something like dbt.
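(For illustration — a minimal dbt sketch of that UNION approach, assuming each connection writes its own `ads_insights` table into its own schema; the `fb_account_1` / `fb_account_2` source names are made up and would need to be declared as dbt sources:)

```sql
-- models/ads_insights_all_accounts.sql (dbt)
-- Hypothetical: each Facebook Marketing connection syncs into its own
-- schema, and this model UNIONs them into a single table.
{{ config(materialized='table') }}

select 'account_1' as source_account, *
from {{ source('fb_account_1', 'ads_insights') }}

union all

select 'account_2' as source_account, *
from {{ source('fb_account_2', 'ads_insights') }}
```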
u
that's why I am asking. if it isn't possible, I'll customize dbt normalization to only expand the couple of columns I am interested in
u
I think for now the best solution would be to use custom dbt for that
u
Hey @mauro, don't mean to intrude on your thread, I'm curious how you're managing to sync your data without increasing call rates. Thanks to your heads up, I removed the other streams (`ads`, `campaigns`, etc.) and only used `ads_insights`, and voila, it actually syncs! Though, within a minute or two of syncing, it errors out with a `Utilization is too high (91)%, pausing for 1 minute`. I have a team looking into the business app verification thing, but it's not looking great since we're only using the Marketing API for Airbyte.

Also, I was thinking, like you did, to reduce utilisation going forward with incremental sync instead of full refresh (though, were you able to figure out what the primary keys are? Or is using incremental impractical here?) Also, where can one find more documentation on this, if you've found the answer?

Btw, custom dbt would work ON TOP of the pre-synced Airbyte data. So it creates views over the data that you'll still have to store with Airbyte. Someone correct me if I'm wrong, but Airbyte works as an ELT instead of an ETL, so you won't be able to customise the underlying data, only a view over it, so the additional storage is still incurred.
u
Happy to connect further and look deeper into FB + Airbyte with you :airbyte-rocket:
n
for now I am only syncing `ads_insights` (no breakdowns), incremental, with a start date of last week (no backfill).
u
I am seeing those pause logs, but it doesn't seem to make the sync fail. it's only slower.
u
dbt can materialize its models by creating actual tables. I am trying to replace "basic normalization" with custom models that only extract the columns I want
u
Yes, it pauses every 1000 records for me too. And yes, I also ambitiously tried to make all the other `ads_insights` breakdowns work, which resulted in failed attempts being logged.
u
so I will (hopefully) have the raw table (with a single json column) and one final table with the ~10 columns I need instead of 115 columns + 46 nested tables
u
> dbt can materialize its models by creating actual tables

You're using a custom-built airbyte?
n
AFAIK, the "Custom dbt" built into the UI runs only AFTER the sync has occured. So it's still ELT
u
Even creating tables with the custom dbt would come after the sync
u
yes, but I can disable "basic normalization" and only have the `_airbyte_raw_xxx` table in Redshift
u
then I add a dbt step with a modified version of the normalization that skips all the bloated columns
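(For illustration, a minimal sketch of what such a slimmed-down model could look like on Redshift — the raw table/source names and the handful of extracted fields are assumptions, not the actual normalization code:)

```sql
-- models/ads_insights_slim.sql (dbt, Redshift)
-- Reads the Airbyte raw table (single JSON _airbyte_data column) and
-- extracts only the columns needed, instead of the ~115 columns and
-- 46 nested tables basic normalization would create.
{{ config(materialized='table') }}

select
    json_extract_path_text(_airbyte_data, 'account_id')            as account_id,
    json_extract_path_text(_airbyte_data, 'campaign_id')           as campaign_id,
    json_extract_path_text(_airbyte_data, 'ad_id')                 as ad_id,
    json_extract_path_text(_airbyte_data, 'date_start')::date      as date_start,
    json_extract_path_text(_airbyte_data, 'impressions')::bigint   as impressions,
    json_extract_path_text(_airbyte_data, 'clicks')::bigint        as clicks,
    json_extract_path_text(_airbyte_data, 'spend')::decimal(20, 6) as spend,
    _airbyte_emitted_at
from {{ source('airbyte_raw', '_airbyte_raw_ads_insights') }}
```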
u
All "basic normalization" does is change to localised types, like dates, INT, etc. Otherwise raw means all is in TEXT/VARCHAR column types
u
Ahh, yes. Then I'm familiar with that part of the docs. I have a few custom dbt's set up.
n
> raw contains a single json `_airbyte_data` column
Ok - does that help you reduce storage space ultimately?
u
it should for small tables
u
Redshift allocates a 1 MB block for each column, for each slice
u
that is blowing up because I have 31 nodes = 62 slices
u
so the minimum size is huge because there are 115 columns in the main table, plus 46 sub-tables with 3 to 6 columns each
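(For illustration, rough arithmetic under the 1 MB-block-per-column-per-slice minimum, using the column counts mentioned above; everything else is approximate, and `svv_table_info` is the Redshift view to check what is actually allocated:)

```sql
-- Minimum footprint, roughly: columns * slices * 1 MB
--   main ads_insights table: 115 cols * 62 slices * 1 MB ≈ 7 GB
--   46 sub-tables * ~5 cols : ~230 cols * 62 slices * 1 MB ≈ 14 GB
-- plus Airbyte metadata columns (_airbyte_ab_id, _airbyte_emitted_at, ...)
-- on every table and the raw tables themselves, which is how a small sync
-- can still occupy tens of GB.

-- Check the actual allocation (size is reported in 1 MB blocks):
select "table", size as size_mb, tbl_rows
from svv_table_info
where "table" like '%ads_insights%'
order by size desc;
```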