# feedback-and-requests
m
Hi, is it supported to connect multiple sources (i.e. multiple instances of Facebook Marketing, each pointing to a different account) to the same destination (i.e. the same Redshift tables)? Would something conflict? Specifically, I am interested in syncing `ads_insights` incrementally (and possibly deduped, if a PK exists). The generated Redshift tables have a minimum size of 47 GB due to all the columns and sub-tables.
u
Sure, you can have multiple sources pointing to the same destination by modifying the schema name so they don't conflict. By the way, how did you manage to get your Facebook Marketing API access? Do you have an application verified by FB? We're stuck at that part.
u
no, I actually want to merge all the data into the same tables, to avoid overallocating space in the cluster (e.g. 5 sources would waste 200 GB)
u
I managed to sync `ads_insights` (and variants with breakdowns) without increasing API call rates (I just created a "business" type app). All other streams (`ads`, `campaigns`, ...) will fail.
u
Wouldn’t you need to track CDC individually by source? I think normally you would sync individually and UNION all into one using something like dbt.
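(For illustration — a minimal dbt sketch of that UNION approach, assuming each connection writes its own `ads_insights` table into its own schema; the `fb_account_1` / `fb_account_2` source names are made up and would need to be declared as dbt sources:)

```sql
-- models/ads_insights_all_accounts.sql (dbt)
-- Hypothetical: each Facebook Marketing connection syncs into its own
-- schema, and this model UNIONs them into a single table.
{{ config(materialized='table') }}

select 'account_1' as source_account, *
from {{ source('fb_account_1', 'ads_insights') }}

union all

select 'account_2' as source_account, *
from {{ source('fb_account_2', 'ads_insights') }}
```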
u
that's why I am asking. if it isn't possible, I'll customize dbt normalization to only expand the couple of columns I am interested in
u
I think for now the best solution would be to use custom dbt for that
u
Hey @mauro, don't mean to intrude on your thread, I'm curious how you're managing to sync your data without increasing call rates. Thanks to your heads up, I removed the other streams (`ads`, `campaigns`, etc.) and only used `ads_insights`, and voila, it actually syncs! Though, within a minute or two of syncing, it errors out with a `Utilization is too high (91)%, pausing for 1 minute`. I have a team looking into the business app verification thing, but it's not looking great since we're only using the Marketing API for Airbyte.

Also, I was thinking, like you did, to reduce utilisation going forward with incremental sync instead of full refresh (though, were you able to figure out what the primary keys are? Or is using incremental impractical here?) Also, where can one find more documentation on this, if you've found the answer?

Btw, custom dbt would work ON TOP of the pre-synced Airbyte data. So it creates views over the data that you'll still have to store with Airbyte. Someone correct me if I'm wrong, but Airbyte works as an ELT instead of an ETL, so you won't be able to customise the underlying data, only a view over it, so the additional storage is still incurred.
u
Happy to connect further and look deeper into FB + Airbyte with you :airbyte-rocket:
n
for now I am only syncing `ads_insights` (no breakdowns), incremental, with a start date of last week (no backfill).
u
I am seeing those pause logs, but it doesn't seem to make the sync fail. it's only slower.
u
dbt can materialize its models by creating actual tables. I am trying to replace "basic normalization" with custom models that only extract the columns I want
u
Yes, it pauses every 1000 records for me too. And yes, I also ambitiously tried to make all the other `ads_insights` breakdowns work, which resulted in failed attempts being logged.
u
so I will (hopefully) have the raw table (with a single json column) and one final table with the ~10 columns I need instead of 115 columns + 46 nested tables
u
> dbt can materialize its models by creating actual tables

You're using a custom-built airbyte?
n
AFAIK, the "Custom dbt" built into the UI runs only AFTER the sync has occured. So it's still ELT
u
Even creating tables with the custom dbt would come after the sync
u
yes, but I can disable "basic normalization" and only have the `_airbyte_raw_xxx` table in Redshift
u
then I add a dbt step with a modified version of the normalization that skips all the bloated columns
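(For illustration, a minimal sketch of what such a slimmed-down model could look like on Redshift — the raw table/source names and the handful of extracted fields are assumptions, not the actual normalization code:)

```sql
-- models/ads_insights_slim.sql (dbt, Redshift)
-- Reads the Airbyte raw table (single JSON _airbyte_data column) and
-- extracts only the columns needed, instead of the ~115 columns and
-- 46 nested tables basic normalization would create.
{{ config(materialized='table') }}

select
    json_extract_path_text(_airbyte_data, 'account_id')            as account_id,
    json_extract_path_text(_airbyte_data, 'campaign_id')           as campaign_id,
    json_extract_path_text(_airbyte_data, 'ad_id')                 as ad_id,
    json_extract_path_text(_airbyte_data, 'date_start')::date      as date_start,
    json_extract_path_text(_airbyte_data, 'impressions')::bigint   as impressions,
    json_extract_path_text(_airbyte_data, 'clicks')::bigint        as clicks,
    json_extract_path_text(_airbyte_data, 'spend')::decimal(20, 6) as spend,
    _airbyte_emitted_at
from {{ source('airbyte_raw', '_airbyte_raw_ads_insights') }}
```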
u
All "basic normalization" does is change to localised types, like dates, INT, etc. Otherwise raw means all is in TEXT/VARCHAR column types
u
Ahh, yes. Then I'm familiar with that part of the docs. I have a few custom dbt's set up.
n
> raw contains a single json `_airbyte_data` column
Ok - does that help you reduce storage space ultimately?
u
it should for small tables
u
Redshift allocates a 1 MB block for each column, for each slice
u
that is blowing up because I have 31 nodes = 62 slices
u
so the minimum size is huge because there are 115 columns in the main table, plus 46 sub-tables with 3 to 6 columns each
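(For illustration, rough arithmetic under the 1 MB-block-per-column-per-slice minimum, using the column counts mentioned above; everything else is approximate, and `svv_table_info` is the Redshift view to check what is actually allocated:)

```sql
-- Minimum footprint, roughly: columns * slices * 1 MB
--   main ads_insights table: 115 cols * 62 slices * 1 MB ≈ 7 GB
--   46 sub-tables * ~5 cols : ~230 cols * 62 slices * 1 MB ≈ 14 GB
-- plus Airbyte metadata columns (_airbyte_ab_id, _airbyte_emitted_at, ...)
-- on every table and the raw tables themselves, which is how a small sync
-- can still occupy tens of GB.

-- Check the actual allocation (size is reported in 1 MB blocks):
select "table", size as size_mb, tbl_rows
from svv_table_info
where "table" like '%ads_insights%'
order by size desc;
```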