# advice-data-warehouses
  • s

    Shawn Wang

    05/27/2022, 1:41 PM
    TIL of 3 approaches to DWH structure: https://towardsdatascience.com/how-should-organizations-structure-their-data-c19b66d629e
    👀 1
    c
    s
    • 3
    • 5
  • a

    Augustin Lafanechere (Airbyte)

    06/02/2022, 1:59 PM
    Hey @Lilashree Sahoo please post troubleshooting questions on our forum 🙏🏻
  • k

    Kha Nguyen

    06/07/2022, 1:58 PM
    Hi, I am looking to deploy Airbyte to my own AWS infrastructure. Currently there is only a docker-compose.yml. Is there an existing template to deploy this to ECS, or should we craft the cloud deployment ourselves?
    k
    l
    • 3
    • 3
  • m

    Marcos Marx (Airbyte)

    06/23/2022, 6:39 PM
    Hello 👋 I’m sending this message to help you identify whether this channel is the best place to post your question. Airbyte has a few channels for open discussion about data topics (architecture, ingestion, quality, etc.). In these channels you may ask general questions related to the particular topic. If you’re having problems deploying or running a connection in Airbyte, this is not the right place. We recommend you open a Discourse topic where our support team will help you troubleshoot your issue.
    👍 1
    l
    • 2
    • 2
  • o

    Olivier AGUDO PEREZ

    06/28/2022, 9:53 AM
    Hello, I am using Airbyte to replicate data from MongoDB to BigQuery. I would like to have just one final table in BQ per collection in Mongo, but I end up with the table itself, which is fine, plus a "raw" table holding each Mongo document as one big unnormalized JSON object. Is that expected? I can't find a way to disable those tables. I would like to keep "table_1" and get rid of "airbyte_raw_table_1".
    👍 1
    m
    • 2
    • 1
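
A note on the raw tables asked about above: as far as I know, the _airbyte_raw_* tables are how Airbyte stages records before normalization, so there is no setting that simply turns them off. If the goal is only to keep the dataset tidy, one workaround is to drop them periodically after normalization has run. A minimal sketch with the google-cloud-bigquery client (the project, dataset, and exact raw-table prefix are placeholders and may differ in your setup):

```python
# Sketch: drop Airbyte raw staging tables from a BigQuery dataset.
# Assumes google-cloud-bigquery is installed and credentials are configured.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # placeholder project
dataset_ref = bigquery.DatasetReference("my-project", "my_dataset")  # placeholder dataset

for table in client.list_tables(dataset_ref):
    if table.table_id.startswith("_airbyte_raw_"):  # prefix may vary by connector/version
        client.delete_table(table.reference)
        print(f"Dropped {table.table_id}")
```

Keep in mind that Airbyte may recreate and repopulate these tables on the next sync, so this is housekeeping rather than a way to disable them.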
  • a

    Arkadiusz Grzedzinski

    06/30/2022, 1:08 PM
    How does Airbyte write to S3? Does it write after each file is created, or does it collect all the output from the source, divide it into parts, and then write everything at once? Asking because I wonder whether the logs would show any activity while running a big download.
    l
    • 2
    • 2
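
For context on the S3 question above: object stores do not support appending to a file, so destination connectors generally buffer records and upload them in chunks rather than streaming row by row, which is why logs can look quiet for stretches of a large sync. The sketch below shows the generic multipart-upload pattern with boto3, not Airbyte's actual destination code; the bucket, key, and part size are placeholders.

```python
# Sketch: buffer records in memory and upload to S3 in multipart chunks.
import io

import boto3

s3 = boto3.client("s3")
bucket, key = "my-bucket", "exports/part-0001.jsonl"  # placeholders

mpu = s3.create_multipart_upload(Bucket=bucket, Key=key)
parts, buffer, part_number = [], io.BytesIO(), 1

def flush(buf, number):
    """Upload the buffered bytes as one part and record its ETag."""
    buf.seek(0)
    resp = s3.upload_part(Bucket=bucket, Key=key, UploadId=mpu["UploadId"],
                          PartNumber=number, Body=buf.read())
    return {"ETag": resp["ETag"], "PartNumber": number}

# Stand-in for records coming from a source.
for record in (b'{"id": %d}\n' % i for i in range(1_000_000)):
    buffer.write(record)
    if buffer.tell() >= 8 * 1024 * 1024:  # flush roughly every 8 MB (S3 parts must be >= 5 MB)
        parts.append(flush(buffer, part_number))
        buffer, part_number = io.BytesIO(), part_number + 1

if buffer.tell():  # final, possibly smaller, part
    parts.append(flush(buffer, part_number))

s3.complete_multipart_upload(Bucket=bucket, Key=key, UploadId=mpu["UploadId"],
                             MultipartUpload={"Parts": parts})
```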
  • a

    Ashley Baer

    07/01/2022, 12:48 PM
    Hello all. Does Airbyte plan to expand the Databricks destination to include support for Azure Databricks? And in the meantime, is anyone aware of an existing community-developed connector that would support this?
    m
    • 2
    • 1
  • s

    Sefath Chowdhury

    07/08/2022, 12:36 AM
    Hello All! I want to build a data warehouse for my company.
    Current situation: my current stack takes all of our separate databases (RDS Postgres) and uses AWS DMS with CDC to replicate the data into its own representative schema in one huge Postgres instance.
    (ie: Microservice_1_Database | public schema ->  Huge_Postgres_Instance |  microservice_1 schema)
    There are two problems with this: 1. AWS DMS is not resilient to DDL changes on the source DB. 2. A huge Postgres instance is still a Postgres instance, designed to be OLTP and not OLAP (we wanted to use Redshift, but many existing analytics queries break; that is something we are okay with when moving to cloud-agnostic Snowflake).
    Desired situation: Airbyte -> Snowflake, using AWS RDS Postgres (CDC enabled).
    1. Does anyone have a stack like this that uses Airbyte to replicate to Snowflake in real time?
    2. Which logical decoding output plug-in are you using (hopefully you are using AWS RDS instances and that plugin is compliant), and why?
    3. Did you deploy Airbyte in a pod in a k8s cluster? If so, how did you determine the specs it needed? I would assume ongoing replication is a heavy lift, and I am unsure how to calculate the specs for this deployment.
    k
    • 2
    • 2
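
On question 2 in the message above: Airbyte's Postgres CDC source reads from a logical replication slot through a publication, and pgoutput is the output plugin that ships with Postgres (and works on RDS once rds.logical_replication is enabled in the parameter group). Below is a hedged sketch of the source-side preparation using psycopg2; the connection string, slot and publication names, and table list are placeholders, so follow the Airbyte Postgres CDC docs for the authoritative steps.

```python
# Sketch: prepare an RDS Postgres source for CDC with the pgoutput plugin.
# Assumes wal_level=logical (on RDS: rds.logical_replication = 1) and psycopg2 installed.
import psycopg2

conn = psycopg2.connect("host=my-rds-host dbname=mydb user=airbyte password=...")  # placeholders
conn.autocommit = True
cur = conn.cursor()

# Replication slot that the Airbyte source will consume changes from.
cur.execute("SELECT pg_create_logical_replication_slot('airbyte_slot', 'pgoutput');")

# Publication restricting CDC to the tables you actually sync.
cur.execute("CREATE PUBLICATION airbyte_publication FOR TABLE public.orders, public.customers;")

cur.close()
conn.close()
```

Sizing (question 3) is harder to sketch; most of the load sits in the source and destination connector pods during syncs, so one common approach is to start with modest resource requests and adjust from observed sync usage.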
  • g

    Gerard Barrull

    07/11/2022, 10:41 AM
    Hey all! I'm using Snowflake as a data lake for my company and loading all the raw data coming from different sources into it with Airbyte.
    First question: is it OK to use Snowflake as a data lake?
    Second question: do you have any advice on how to structure the data (for data-lake purposes)? I was thinking of:
    • An independent database for all the raw data in the "data lake": all data from Airbyte or other sources, in raw format.
    ◦ A schema for Airbyte data; I'd have another schema for data coming from other sources.
    ▪︎ One table for each source (i.e. Stripe, Google Ads, etc.)?
    What do you think about it? Do you have any advice on doing it differently? Is it just a matter of structure preference, or are there other pros and cons? Thanks!
    👀 1
    a
    • 2
    • 5
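
To make the layout described above concrete, here is one way it could be expressed as DDL through the Snowflake Python connector: a dedicated raw database, with separate schemas for Airbyte-loaded data and for other sources. This is a hedged sketch with placeholder names, not a claim that this is the only sensible structure.

```python
# Sketch: a RAW database with separate schemas for Airbyte data and other raw sources.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="...",  # placeholder credentials
    role="SYSADMIN", warehouse="LOADING_WH",
)
cur = conn.cursor()

cur.execute("CREATE DATABASE IF NOT EXISTS RAW")
# Each Airbyte connection can target its own schema under RAW (e.g. one per source system).
for schema in ("AIRBYTE_STRIPE", "AIRBYTE_GOOGLE_ADS", "OTHER_SOURCES"):
    cur.execute(f"CREATE SCHEMA IF NOT EXISTS RAW.{schema}")

cur.close()
conn.close()
```

One practical note: Airbyte usually creates several tables per source (one per stream), so a schema per source tends to scale better than a single table per source, but that is a preference rather than a hard rule.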
  • a

    Ari Bajo (Airbyte)

    07/27/2022, 7:47 PM
    Hello, how do you optimize your data warehouse costs? @Madison Mae wrote a great article featuring 6 ways to reduce Snowflake costs:
    1. Change your warehouse size
    2. Decrease the number of warehouses running at the same time
    3. Decrease the sync frequency of your data ingestion tool
    4. Decrease the warehouse’s auto-suspend period
    5. Use materialized views
    6. Change your Snowflake account plan
    I am curious to know which optimization has brought the most savings for you.
    a
    e
    • 3
    • 4
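
Items 1 and 4 in that list come down to warehouse settings that can be changed with a single ALTER WAREHOUSE statement. A hedged example through the Snowflake Python connector; the warehouse name and values are placeholders, and AUTO_SUSPEND is expressed in seconds.

```python
# Sketch: shrink a warehouse and let it suspend sooner to cut idle credit burn.
import snowflake.connector

conn = snowflake.connector.connect(account="my_account", user="my_user", password="...")  # placeholders
cur = conn.cursor()

cur.execute("""
    ALTER WAREHOUSE ANALYTICS_WH SET
        WAREHOUSE_SIZE = 'XSMALL',  -- item 1: smaller size
        AUTO_SUSPEND   = 60,        -- item 4: suspend after 60 seconds of idle time
        AUTO_RESUME    = TRUE       -- wake up automatically when queried
""")

cur.close()
conn.close()
```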
  • j

    Jose Luis Cases

    08/09/2022, 9:12 PM
    Hi, I'm trying to build a data lake using GCP. My idea is to use Google Cloud Storage with JSONL format as the destination instead of BigQuery, but I don't know whether this is the best way to analyze the data with ML or with dashboards querying GCS.
  • j

    Jose Luis Cases

    08/09/2022, 9:12 PM
    Some advice please?
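
One middle ground for the GCS-versus-BigQuery question above: keep landing JSONL files in Cloud Storage and expose them to BigQuery as an external table, so dashboards and ML jobs can query them with SQL without a separate load step. A hedged sketch follows (the project, dataset, bucket and path are placeholders, the dataset must already exist, and external tables are generally slower than native BigQuery storage).

```python
# Sketch: define a BigQuery external table over JSONL files stored in GCS.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # placeholder project
client.query("""
    CREATE OR REPLACE EXTERNAL TABLE lake.events_raw    -- placeholder dataset.table
    OPTIONS (
        format = 'NEWLINE_DELIMITED_JSON',
        uris   = ['gs://my-data-lake/events/*.jsonl']   -- placeholder bucket/path
    )
""").result()
```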
  • m

    Marcos Marx (Airbyte)

    08/11/2022, 6:37 PM
    Hello 👋 I’m sending this message to help you identify whether this channel is the best place to post your question. Airbyte has a few channels for open discussion about data topics (architecture, ingestion, quality, etc.). In these channels you may ask general questions related to the particular topic. If you’re having problems deploying or running a connection in Airbyte, this is not the right place. We recommend you open a Discourse topic where our support team will help you troubleshoot your issue.
  • j

    James Egan

    08/16/2022, 10:48 AM
    I have set up a Facebook Marketing Ads to BQ connection, and when I select custom fields the sync keeps failing with the following message: "Last attempt: NaN Bytes | no records | no records | 3m 6s | Sync Failure Origin: normalization, Message: Something went wrong during normalization 11:27AM 08/16 3 attempts 2022-08-16 104054 - Additional Failure Information: message='io.temporal.serviceclient.CheckedExceptionWrapper: java.util.concurrent.ExecutionException: io.airbyte.workers.exception.WorkerException: Normalization Failed.', type='java.lang.RuntimeException', nonRetryable=false"
  • j

    James Egan

    08/16/2022, 10:48 AM
    When I remove the custom fields, the tables sync but show no records, even though if I look in FB Ads there are 2 records.
  • v

    Vincent Koppen

    08/17/2022, 10:01 AM
    Hello all, I am using Airbyte Open Source to transfer data from Amazon Ads to BigQuery. In the Connection under Replication it seems that the only available sync modes are Full Refresh (Overwrite and Append). Is there no Incremental Sync Mode in this case?
  • a

    Abba

    08/17/2022, 2:27 PM
    Try scrolling down on the dataset in BigQuery
  • j

    James Egan

    08/17/2022, 2:50 PM
    I have done; the raw and the tmp files are in there, the Avro file is sitting in my GCS, but there are no rows in my dataset, just the schema.
  • h

    Hakeem Olu

    08/18/2022, 2:30 PM
    Hi everyone, glad to be here. I am having an issue with my data sync: data from Redshift is not showing up in Snowflake, even though everything shows as having run successfully.
    Deployment: Using docker for deployment
    Airbyte Version: 0.39.39-alpha
    Source name/version: Redshift
    Destination name/version: Snowflake
    Step: The issue is happening during sync
    Description: Data not showing in snowflake from redshift.
    
    Versions:
    From the airbyte
    Redshift: 0.3.11
    Snowflake: 0.4.34
    
    AWS Redshift version: 1.0.40182
  • h

    Hakeem Olu

    08/18/2022, 2:30 PM
    So basically I am seeing the tables show up, but there is no data in them. I have about 300+ tables. Also, if I sync 1 to 20 tables instead of 300+ it works; it just doesn't work for my entire set of tables.
  • s

    Sebastian Brickel

    08/22/2022, 10:38 AM
    Hi, I set up a connection from Bing Ads to BigQuery using Airbyte OSS. The connection works fine as long as I do not include ad_group_performance_report_hourly and campaign_performance_report_hourly. Including those gives:
    Failure Origin: source, Message: Checking source connection failed - please review this connection's configuration to prevent future syncs from failing
    and
    Additional Failure Information: Server raised fault: 'Invalid client data. Check the SOAP fault details for more information.
    Including only {ad,campaign}_performance_report_{daily,monthly,weekly} works fine. Does anyone have an idea why that could be and how I could fix it? Thank you
    e
    • 2
    • 2
  • s

    Shawn Wang (Airbyte)

    08/22/2022, 8:30 PM
    https://airbytehq-team.slack.com/archives/C01AB7G87NE/p1661200240451719
  • t

    Thomas

    08/26/2022, 12:05 PM
    Question: is it possible to write the unique identifier of the Airbyte sync run to the data warehouse?
    c
    m
    • 3
    • 9
  • b

    Brendan McDonald

    08/30/2022, 7:23 PM
    does anyone have experience setting up a HubSpot connector? I am trying to pull the marketing_emails object, however it seems to be limited to 250 records (we have a total of 720 in HubSpot). I am assuming this is because of some sort of rate limit on the API. Is there a way to backfill all the data if the only way around this is an incremental load setup? For further context, I was able to get all records via the API directly in Python using pagination. I am just not sure how to configure this via the Airbyte UI. Looking at the source code, it looks like there is a 250-record limit set up for each pull. This is definitely a newbie question, but how do you get around the pagination limit here?
    • 1
    • 1
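
For the pagination question above: the connector normally pages through results on its own, so the 250 in the source code is typically the per-request page size rather than a hard cap, and there should be nothing to configure in the UI; if only 250 rows land in the destination, it is more likely a sync or normalization issue than pagination itself. For comparison, the manual loop described above looks roughly like this with requests; the endpoint path, limit/offset parameters, and response shape are assumptions based on HubSpot's legacy marketing-emails API, so double-check them against the API docs.

```python
# Sketch: page through HubSpot marketing emails 250 records at a time.
# Endpoint, parameters, and response shape are assumptions; verify against HubSpot docs.
import requests

BASE_URL = "https://api.hubapi.com/marketing-emails/v1/emails"  # assumed endpoint
HEADERS = {"Authorization": "Bearer <private-app-token>"}        # placeholder token

emails, offset, limit = [], 0, 250
while True:
    resp = requests.get(BASE_URL, headers=HEADERS,
                        params={"limit": limit, "offset": offset})
    resp.raise_for_status()
    page = resp.json().get("objects", [])  # assumed response key
    emails.extend(page)
    if len(page) < limit:
        break
    offset += limit

print(f"Fetched {len(emails)} marketing emails")
```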
  • d

    Dmytro Vorotyntsev

    08/31/2022, 5:11 AM
    hi 👋 I’ve set up a connection from Postgres to Amazon Redshift. Redshift has a mechanism to improve query performance and optimize storage with compression encoding, and it observed:
    An analysis of the cluster’s workload and database schema identified columns that will significantly benefit from using a different compression encoding.
    All the suggested tables are those configured with Postgres CDC (Deduped History), and its suggestion is:
    ALTER TABLE "public"."tatable_1_scd" ALTER COLUMN "_airbyte_unique_key_scd" ENCODE lzo;
    ALTER TABLE "public"."table_2_scd" ALTER COLUMN "_airbyte_unique_key_scd" ENCODE lzo;
    ALTER TABLE "public"."table_3_scd" ALTER COLUMN "_airbyte_unique_key_scd" ENCODE lzo;
    ALTER TABLE "public"."tatable_1_scd" ALTER COLUMN "_airbyte_emitted_at" ENCODE az64;
    ALTER TABLE "public"."table_3_scd" ALTER COLUMN "_airbyte_emitted_at" ENCODE az64;
    ALTER TABLE "public"."table_4_scd" ALTER COLUMN "_airbyte_emitted_at" ENCODE az64;
    Is this a relevant suggestion? Would it break the Airbyte sync logic if the encoding were updated? Thanks
    • 1
    • 1
  • s

    Shivam Thakkar

    08/31/2022, 3:17 PM
    Hi all, we are currently building a system that needs a data warehouse and are exploring the possible open-source options. We are planning to use Airbyte for ETL/ELT. We had narrowed things down to HDFS, but soon found out that Airbyte has no direct support for it, referring to the destinations list provided by Airbyte: https://airbyte.com/connectors?connector-type=Destinations . Is my understanding correct that there is no support for HDFS as of now? For our research, I would like to seek advice on the following:
    1. Irrespective of Airbyte support, what are some of the open-source technologies we should look at for data warehousing?
    2. Any suggestions for open-source technologies we could use for data warehousing from among the ones currently supported by Airbyte?
    a
    • 2
    • 1
  • l

    Lucas Wiley

    09/15/2022, 11:08 PM
    Hi. Has anyone had success with key-pair authentication for Snowflake destination on OSS? I'm unsure of the issue just yet and I've tried a handful of variations on the keys and jdbc params. In any case it's throwing the following trace:
    Could not connect with provided configuration. net.snowflake.client.jdbc.SnowflakeSQLLoggedException: Private key provided is invalid or not supported: rsa_key.p8: Cannot invoke "net.snowflake.client.jdbc.internal.org.bouncycastle.util.io.pem.PemObject.getContent()" because the return value of "net.snowflake.client.jdbc.internal.org.bouncycastle.util.io.pem.PemReader.readPemObject()" is null
    • 1
    • 2
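
That PemReader.readPemObject() is null error generally means the driver could not parse the key material at all, e.g. an encrypted or non-PKCS#8 key, a truncated file, or stray characters around the PEM headers. A hedged way to sanity-check the key and re-export it as unencrypted PKCS#8 using the cryptography package (file names are placeholders; whether you should instead keep a passphrase and pass it via the connector config is a separate decision):

```python
# Sketch: verify a private key parses, then re-export it as unencrypted PKCS#8 PEM.
from cryptography.hazmat.primitives import serialization

with open("rsa_key.p8", "rb") as f:  # placeholder path
    key = serialization.load_pem_private_key(f.read(), password=None)  # use password=b"..." if encrypted

pkcs8 = key.private_bytes(
    encoding=serialization.Encoding.PEM,
    format=serialization.PrivateFormat.PKCS8,
    encryption_algorithm=serialization.NoEncryption(),
)
with open("rsa_key_pkcs8.p8", "wb") as f:
    f.write(pkcs8)
print("Key parsed and re-exported as unencrypted PKCS#8 PEM")
```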
  • s

    swyx (Airbyte)

    09/16/2022, 1:29 PM
    great discussion on snowflake today https://www.reddit.com/r/dataengineering/comments/xex3f5/what_makes_snowflake_different/
    a
    k
    • 3
    • 2
  • p

    Parham

    09/21/2022, 10:19 AM
    @support 🤠 https://discuss.airbyte.io/t/error-while-syncing-the-clickhouse-to-google-bigquery/2663 #clickhouse -> #bigquery #sync_issue
    • 1
    • 1
  • a

    Alexis Charrier

    09/30/2022, 12:28 PM
    Hello folks, is anyone having trouble running Airbyte with BigQuery? Since last night, on incremental jobs I get the error below:
    500 An internal error occurred and the request could not be completed. This is usually caused by a transient issue. Retrying the job with back-off as described in the BigQuery SLA should solve the problem: <https://cloud.google.com/bigquery/sla>. If the error continues to occur please contact support at <https://cloud.google.com/support>. Error: 5423415
    The Google status page is not reporting any issue with the BigQuery service 🤔 any ideas?
    c
    • 2
    • 6