# ask-community-for-troubleshooting
l
Hi all, I have a Google Ads -> Redshift connection with 7 streams. I've set the Incremental sync mode (deduped + history), with a sync every 24 hours. I see the Redshift unblended cost is 450€ per day!!!!, which is impossible. Can you set up a meeting with me to go over the best practices for working with the Redshift destination? Because as you can see in the logs below, it doesn't load the data in one batch but runs a query every 5 seconds.
h
Hey, sure - before that, could you share the sync log file here? Also happy to help you out.
l
I cancelled it when we realized it pushes every 5 seconds
This is what the engineer sent me. There is a query that runs every 5 seconds (!!!!!!) and is the main cause of the mess we see:
INSERT INTO indiana._airbyte_tmp_zyc_indiana_clickout (
    _airbyte_ab_id, _airbyte_data, _airbyte_emitted_at
) VALUES
    ($1, JSON_PARSE($2), $3),
    ($4, JSON_PARSE($5), $6),
    ($7, JSON_PARSE($8), $9),
    [... the same three-parameter tuple repeated for every record in the batch, up to ($373, JSON_PARSE($374), ... where the captured statement is cut off]
and this is the Redshift billing
Good Morning!
h
Oh my god. I wish we'd had this information earlier 😅. Sending this thread across to the team to see if someone can take over and help you in more detail.
l
Can we have a meeting? I'm desperate about this, and we need to give a good solution to our client.
e
It looks like this connection is using standard inserts - switching the upload method to S3 staging should reduce this cost by a lot. (our docs have some info about the difference - the COPY strategy is recommended for production workloads)
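to give a rough picture of the difference (the table, bucket and IAM role below are placeholders, not what the connector actually generates):

-- Standard inserts: lots of small multi-row INSERT statements, like the one in your log,
-- each running as its own query against the cluster.
INSERT INTO indiana._airbyte_tmp_example (_airbyte_ab_id, _airbyte_data, _airbyte_emitted_at)
VALUES ('uuid-1', JSON_PARSE('{"clicks": 10}'), GETDATE());

-- S3 staging: records are written to S3 files first, then loaded in one bulk COPY
-- (exact file format and credentials depend on your destination settings).
COPY indiana._airbyte_tmp_example
FROM 's3://your-staging-bucket/airbyte/indiana_clickout/'
IAM_ROLE 'arn:aws:iam::123456789012:role/your-redshift-role'
CSV GZIP;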
l
Thanks, I'll try it. But do you know if I should send the data without normalization, and do the normalization and transformation myself in my Redshift destination?
e
are you referring to the transformation setting? (i.e. raw data vs normalized) that's more a question of what you need the data to look like - raw data means you'll just get the *_raw tables, which just have your data as JSON blobs; normalized data will expand that out into normalized tables (and also handles deduping, if you're using the incremental dedup + history sync mode)
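as a sketch of what that looks like in practice (the stream and field names here are invented - only the _airbyte_* columns are real conventions; given the JSON_PARSE in your log, _airbyte_data is a SUPER column):

-- Raw: one JSON blob per record; you pull fields out yourself.
SELECT
    r._airbyte_data.campaign_id AS campaign_id,
    r._airbyte_data.clicks      AS clicks
FROM indiana._airbyte_raw_campaigns r;

-- Normalized: the same fields already expanded into typed columns
-- (and deduped, under incremental dedup + history).
SELECT campaign_id, clicks
FROM indiana.campaigns;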
l
Hi, I would prefer normalized data, with incremental + deduped history, but it consumes A LOT of money on my destination (Redshift).
But as I read in the forum, someone mentioned they think normalization is what consumes the Redshift resources. So I'm thinking of taking normalization out of Airbyte. Ideally I would prefer to do everything through Airbyte, but we received a predicted monthly cost from Redshift of 23,000€, just for loading a few tables..
e
got it - normalization does consume some Redshift compute, but (once you're past the first two syncs) it should run incrementally, i.e. only processing recent data. Have you had a chance to run a sync using S3 staging + normalized tables? Certainly, if your client is cost-sensitive, it's potentially worth disabling normalization though.
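conceptually, "runs incrementally" means something like the following - not the exact SQL the generated dbt models run, and the batch table name is made up, but it shows the idea of only touching rows newer than what's already normalized:

-- Only process raw rows emitted after the newest row already in the normalized table.
INSERT INTO indiana.campaigns_incremental_batch
SELECT r._airbyte_ab_id, r._airbyte_data, r._airbyte_emitted_at
FROM indiana._airbyte_raw_campaigns r
WHERE r._airbyte_emitted_at >
      (SELECT MAX(n._airbyte_emitted_at) FROM indiana.campaigns n);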
l
Hi.. I will try today. I intend to create a data lake on S3 and store the raw data there (I haven't found in the documentation whether it's possible to set incremental + history deduped in that connection). If it's not possible, I don't know where to transform the data.. I wanted to use Glue, but the client insists on dbt
e
hm, are you planning to directly use the destination-s3 connector for this? That connector doesn’t support normalization (since it’s just writing raw blobs to s3)
l
I just realized that.. I can't find a good solution. I just want to have a few sources via Airbyte, transform the data and send it to the Redshift destination.. but as dbt custom transformations and normalization are done at the destination.. not sure what to do
e
mmm. I would recommend giving destination-redshift via S3 staging a try - under this loading method, it'll do pretty much what you want (i.e. push data to S3 and then COPY it into Redshift). And then you'd be able to choose basic normalization or a custom dbt transformation
l
I tried it. It's really good, but it sends 4 tables to Redshift for every table I sync (the SCD, the stg, the raw, and the normalized one). So if a table weighs 30 GB, I'll transfer it 4 times to Redshift. I'm trying the 2-connections option: Google Ads -> S3 works correctly - I see data in Avro format in the bucket. Which output format should I use when setting up the S3 destination? And for S3 -> Redshift, the sync works, but no data gets transferred/normalized. Any ideas why?
e
did your S3 -> Redshift sync have normalization enabled? (it sounds like you got the _raw tables, but not the normalized tables? Or did you mean that no data got transferred at all?) generally this setup feels a little redundant - the S3 -> Redshift sync will do a second copy into S3, and then do a COPY into Redshift. It would be more efficient to do a single GAds -> Redshift sync directly, unless I'm misunderstanding the setup. Creating 4 tables is (currently) baked into how normalization works - if that's problematic cost-wise, then you'd probably need to roll your own custom transformation unfortunately, which would pull data from the _raw table and generate a normalized table. (but certainly this is useful feedback, I'll pass it on to the rest of the team)
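for what it's worth, a hand-rolled transformation like that could be as small as one CTAS over the raw table - the field names and target table below are invented, so adjust to your actual stream:

-- Hypothetical replacement for normalization: build only the one table you need,
-- straight from the raw JSON blobs (SUPER column), casting the fields you care about.
CREATE TABLE indiana.clickout_slim AS
SELECT
    r._airbyte_data.clickout_id::varchar AS clickout_id,
    r._airbyte_data.campaign_id::varchar AS campaign_id,
    r._airbyte_data.cost_micros::bigint  AS cost_micros,
    r._airbyte_emitted_at
FROM indiana._airbyte_raw_indiana_clickout r;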
l
The purpose of the second connection, the S3 -> Redshift one, is for me to be able to select only the data I really need, transform it, and write it to Redshift. The connection succeeded but I got empty tables (raw, SCD and normalized). I'm confident the bucket contains data. Here is the log: https://airbyte.apps.7lds.de/workspaces/9d9aea6f-8f2e-4ef1-9c3e-0bbf57935780/connections/87d5fd47-3c9d-4062-9d17-5acd5b569aec/status#272::0
I must say that I'm thankful for your time! Thanks a lot!
e
select only data I really need
interesting! is this at a greater granularity than what the google ads source provides? I.e. you’re looking for more control than just “sync streams X, Y, Z”?
log
there's an auth layer in front of this (which is good!) but it means I can't access it 🙂 can you download the logs and upload them here?
l
Here is the log. I have multiple sources.. so I'm trying to optimize..
e
hm. you mentioned that the S3 bucket contains Avro data - it looks like source-s3 is trying to read CSV? (at 2022-11-07 18:47:08 - 'filetype': 'csv') I'm not super familiar with source-s3, but IIRC there's a config option to have it read Avro instead. And of course - it's always useful to double-check that the files on S3 match whatever path is configured on source-s3, just in case
l
no.. I switched to CSV, and both are in CSV.. ugh, what a day 😅
e
😕 in the replication tab on your connection, does it at least detect the schema correctly?
l
Hi, good morning! Yes, it detects the schema
Another question.. Let's say I have 4 tables in the Google Ads source, and a Redshift destination (with the S3 staging upload method). I set up the connection (incremental + history deduped). I also have basic normalization, and I set a dbt transformation in the connection that creates a table with fields from the 4 source tables. In Redshift I only need the table I created in dbt, which might be a table of 20 rows. Are the other 4 tables that I synced from the source also going to be written to the Redshift destination?
e
yeah, they’ll need to get written to redshift before the dbt transformation can run
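e.g. a custom dbt model along these lines (all table and column names invented) can only run once the synced tables have already landed in Redshift, because it selects from them:

-- Hypothetical reporting table built by a custom transformation; it reads the
-- synced tables, so they have to exist in Redshift before this can run.
CREATE TABLE indiana.campaign_summary AS
SELECT
    c.campaign_id,
    c.campaign_name,
    SUM(a.clicks)      AS clicks,
    SUM(a.cost_micros) AS cost_micros
FROM indiana.campaigns c
JOIN indiana.ad_group_ad_report a
  ON a.campaign_id = c.campaign_id
GROUP BY c.campaign_id, c.campaign_name;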
lemme reach out to the team about source-s3 - I’m not really sure what’s happening there
the API team (which owns source-s3) suggested opening a GitHub issue with some more details:
• screenshots of your configuration (especially the S3 source config)
• some sample files (if possible, or at least with sensitive info removed)
also, can you expand the gaS3 stream (click the little arrow on the left) to confirm that it has the expected fields?