# advice-data-ingestion
  • k

    Kyle MacKenzie

    06/08/2022, 9:10 AM
    From the docs:
    We do not support schema changes automatically for CDC sources. We recommend resetting and resyncing data if you make a schema change.
    Does this also wipe the data loaded to the raw json variant table? Would this not mean losing all historical data each time there’s a schema change in the source? Is there some recommended way to handle this so that all data isn’t wiped every time there’s a new column etc?
    👀 1
    m
    • 2
    • 2
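    A minimal sketch of one way to keep history across a reset, assuming a Snowflake destination and a hypothetical stream named `orders` (Airbyte's raw tables follow the `_airbyte_raw_<stream>` naming convention): snapshot the raw table before resetting, then reconcile from the backup afterwards if needed.
    ```sql
    -- Hypothetical schema and stream names; adjust to your setup.
    CREATE SCHEMA IF NOT EXISTS airbyte_backup;

    -- Snapshot the raw JSON table so its history survives the reset/resync.
    CREATE TABLE airbyte_backup._airbyte_raw_orders AS
    SELECT * FROM public._airbyte_raw_orders;
    ```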
  • m

    Martin Carlsson

    06/08/2022, 11:05 AM
    Hi, I'm setting up a MySQL to Snowflake connection using CDC. The data is flowing without errors. However, there are a lot of duplicates in the destination Snowflake that are not in the source MySQL. My best guess is that this is because the rows have been updated multiple times. In our data warehouse we are only interested in what is currently in the source system, not the history. Is there any way I can get rid of that old data? Like a setting in Airbyte or a filter in Snowflake?
    m
    • 2
    • 7
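    A minimal sketch of the Snowflake-side filter Martin is asking about, assuming a hypothetical `customers` table with primary key `id` and Airbyte's `_airbyte_emitted_at` metadata column: keep only the most recently emitted row per key.
    ```sql
    -- Hypothetical table and key names; _airbyte_emitted_at is the
    -- timestamp Airbyte attaches to each record it writes.
    CREATE OR REPLACE VIEW customers_latest AS
    SELECT *
    FROM customers
    QUALIFY ROW_NUMBER() OVER (
        PARTITION BY id
        ORDER BY _airbyte_emitted_at DESC
    ) = 1;
    ```
    On the Airbyte side, the Incremental | Deduped + history sync mode (where the source supports it) produces a deduplicated final table without such a view.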
  • a

    Asadbek Muminov

    06/08/2022, 4:02 PM
    I think this is the best-fitting channel to ask the following question: I'm developing an Airbyte Destination in Java. What is the easiest way to make an HTTP API request from my connector? Or do I have to use the Python CDK to make any HTTP request?
    m
    • 2
    • 2
  • r

    Ryan Cheatham

    06/08/2022, 5:36 PM
    Hi, is this a good place to ask questions about developing a new connector?
    m
    • 2
    • 1
  • j

    Jeff Crooks

    06/08/2022, 6:39 PM
    Is anyone able to get custom fields from the Jira source?
    m
    r
    • 3
    • 4
  • d

    Dusty Shapiro

    06/09/2022, 1:12 AM
    👋 Hello fellow Airbyters. Quick question: I'm curious how others handle changes to a source database's schema. For example, let's say a table upstream gets a new column; how would my Airbyte connection handle that? Ideally, I would prefer it to fail and notify me of the upstream change, so I can migrate downstream before syncing. Thanks all
    k
    • 2
    • 2
  • s

    Sylvain Sinay

    06/09/2022, 9:30 AM
    Hi there 🙂 Hope this is the right channel for such a question. For the Google Search Console connector, does anyone know why there is no "country" field in the query/page table? Country granularity is essential for deep analysis. Is there a how-to for requesting such an improvement?
    m
    • 2
    • 2
  • a

    Apostol Tegko

    06/09/2022, 4:42 PM
    Hey all, we're trying to add a new stream without resetting the connection. What would you recommend as the best way to go about this? We're considering whether we should clone the relevant tables and copy the content back after a reset. I also saw this link suggesting interacting via the API. The important part is that this stream is also only just being created, so I'm not sure it will be accessible when hitting the update endpoint?
    m
    • 2
    • 1
  • v

    v e

    06/09/2022, 10:34 PM
    Hi all, we are planning to ingest CDC from MS SQL Server using Airbyte and write the results to Kafka before writing them back to S3 in the Delta format. My question is: how do you handle schema changes, like a new column, a dropped column, or a renamed column?
    m
    • 2
    • 1
  • d

    Dorian Lacaisse

    06/13/2022, 8:44 AM
    Hi all, I used Airbyte to transfer data from a CRM and Postgres to Azure Blob Storage. I then needed to use Azure Data Factory and Databricks for some transformation. However, neither ADF nor Databricks can read the file created by Airbyte: 'Incorrect Blob type, please use the correct Blob type to access a blob on the server. Expected BLOCK_BLOB, actual APPEND_BLOB.' I tried several things but can't manage to get a BLOCK_BLOB; only APPEND_BLOB is created by Airbyte. Looking for advice to resolve this!
    🙏 1
    b
    • 2
    • 3
  • y

    Yifan Sun

    06/13/2022, 11:47 PM
    Hello, I just found a solution here, but I have trouble getting the "base_python" package. I tried to pip install base_python but got an error. Does anyone know whether it is a module from the Airbyte codebase, or how I can get it? Thanks!
    s
    • 2
    • 2
  • m

    Mangesh Nehete

    06/14/2022, 11:25 AM
    Hi Team, We have set up 2 connections as follows:

    | Connection | Source | Destination | No. of streams | Sync mode (all streams) | Frequency | Data size/sync | Avg time taken/sync |
    | --- | --- | --- | --- | --- | --- | --- | --- |
    | Connection1 | Oracle DB | Snowflake | 15 | Full refresh \| Overwrite | Every 24 hours | 11.28 GB | 1 hour 38 minutes |
    | Connection2 | Oracle DB | Snowflake | 4 | Full refresh \| Overwrite | Every 24 hours | 8.32 GB | 42 minutes |

    We want to reduce the average time taken for these 2 jobs. Could you please suggest an approach/solution to achieve this? Thanks & Regards, Mangesh Nehete
    • 1
    • 1
  • g

    Gabriel Meisel

    06/15/2022, 4:35 PM
    Hello, has anyone encountered this error when trying to connect to MySQL: `io.airbyte.workers.general.DefaultReplicationWorker$SourceException: Source process exited with non-zero exit code 1`
    m
    • 2
    • 2
  • l

    Lee Harrington

    06/17/2022, 5:51 PM
    Is there a way to have Airbyte preserve the column order from the source when it creates the table in the destination? I searched, but could not find the answer
    a
    • 2
    • 2
  • t

    Tomas Perez

    06/17/2022, 7:28 PM
    I'm having some trouble ingesting data from Shopify using Airbyte. Has anyone been able to set up the connection? I think I need a guide on that 😂
    m
    • 2
    • 1
  • v

    v e

    06/17/2022, 11:20 PM
    Hi team, quick question about the Incremental | Append sync mode. In this mode, does Airbyte just pull changes since the last ingestion using the cursor field and perform a merge operation on the destination (Databricks Delta)? I see it only doing appends; is it possible to upsert instead?
    m
    • 2
    • 1
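    Incremental | Append only inserts records; it does not merge them. A hedged sketch of doing the upsert downstream on Databricks, assuming hypothetical `raw.customers` / `curated.customers` tables with key `id`, columns `name` and `updated_at`, and Airbyte's `_airbyte_emitted_at` for ordering:
    ```sql
    -- Spark SQL on Databricks Delta; all names here are hypothetical.
    -- Collapse the appended history to the latest row per key,
    -- then MERGE it into a curated table.
    MERGE INTO curated.customers AS t
    USING (
      SELECT id, name, updated_at
      FROM (
        SELECT id, name, updated_at,
               ROW_NUMBER() OVER (
                 PARTITION BY id
                 ORDER BY _airbyte_emitted_at DESC
               ) AS rn
        FROM raw.customers
      ) ranked
      WHERE rn = 1
    ) AS s
    ON t.id = s.id
    WHEN MATCHED THEN UPDATE SET name = s.name, updated_at = s.updated_at
    WHEN NOT MATCHED THEN INSERT (id, name, updated_at) VALUES (s.id, s.name, s.updated_at);
    ```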
  • t

    Tomas Perez

    06/17/2022, 11:27 PM
    Done. PASS=109 WARN=0 ERROR=11 SKIP=177 TOTAL=297
    Keep getting this error when moving data from Shopify to Snowflake; after that, several tables remain empty. Any clue what might be wrong? This is my current setup.
    m
    • 2
    • 1
  • p

    Prakash

    06/20/2022, 7:53 AM
    Hi, after doing the transformation with the Airbyte tool, column names are being converted to lowercase. I am using Postgres as the destination; in the table, all columns appear in lowercase. What could be the solution for this, please? I am using column aliases in the .sql file to get some specific column names, but it converts them all to lowercase.
    m
    • 2
    • 1
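    This is standard Postgres behavior rather than an Airbyte setting: unquoted identifiers are folded to lowercase, so a mixed-case alias in a custom .sql transformation survives only if it is double-quoted. A minimal sketch with hypothetical names:
    ```sql
    -- Unquoted alias: Postgres folds it to lowercase ("customerid").
    SELECT customer_id AS CustomerId FROM raw_customers;

    -- Double-quoted alias: the exact case is preserved, but every
    -- later query must quote "CustomerId" the same way.
    SELECT customer_id AS "CustomerId" FROM raw_customers;
    ```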
  • r

    Rohan Chitalia

    06/21/2022, 4:10 PM
    Hi folks - do we have an updated timeline for Airbyte Cloud availability?
    a
    • 2
    • 10
  • s

    Simon Thelin

    06/22/2022, 8:24 AM
    Hello! Is there any general advice/docs explaining how to tune Airbyte? It is painfully slow at the moment. It might be that I have too few resources allocated within my k8s deployment, but is there anything else that can be tuned in terms of parallelism?
    • 1
    • 1
  • b

    Bryce Macdonald

    06/23/2022, 7:30 AM
    I have a Kafka source with some JSON messages on a topic that I want to send to a Postgres database. When creating the connection I get "Failed to fetch schema. Please try again". The status just before the error shows "fetching schema of the data source". What am I missing?
    a
    • 2
    • 1
  • s

    Sheshan

    06/23/2022, 11:15 AM
    Is this your first time deploying Airbyte: No
    OS Version / Instance: Ubuntu 18.6
    Memory / Disk: 8 CPU, 16 GB RAM
    Deployment: Docker
    Airbyte Version: 0.33.12-alpha
    Step: Sync data
    Description: Trying to sync data from a BigQuery source. Experiencing heavy RAM use by the BigQuery container; a single container takes 4 GB of RAM. Because of this, other services get stuck and the sync fails. Can anyone suggest a way to handle this?
    a
    • 2
    • 1
  • j

    James Kwon

    06/23/2022, 3:02 PM
    Hello. Any advice on running an initial data sync of a large (90+ million row) data set in small batches? The cursor would be a date field, and if there's a way to set a range of dates per batch, that would be very helpful.
    👀 1
    a
    • 2
    • 1
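    One common workaround, sketched here under the assumption of a SQL source: point the connection at a view that bounds the cursor date, widen the bound between syncs, and let the incremental cursor take over once the backfill has caught up. All names below are hypothetical.
    ```sql
    -- Backfill batch 1: expose only rows up to a chosen cutoff.
    CREATE OR REPLACE VIEW events_for_airbyte AS
    SELECT *
    FROM events
    WHERE event_date < DATE '2020-01-01';

    -- Later batches: move the cutoff forward (e.g. to '2021-01-01')
    -- and resync, then drop the filter once the backfill is complete.
    ```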
  • o

    Olivier AGUDO PEREZ

    06/23/2022, 3:28 PM
    Hello, in the Airbyte Open Source documentation on connectors, I can see 2 versions of the MongoDB source connector, and one of them supports the incremental append sync mode. However, when I try to add the connector to my pipeline, I can only choose between the full refresh methods. Is this normal?
    a
    • 2
    • 5
  • d

    David Boissier

    06/23/2022, 4:19 PM
    Hello, we are evaluating the Airbyte Open Source version installed on an EC2 instance (m6a.large) and we have to ingest some internal geo data from a PostgreSQL database (the size given by `select pg_size_pretty(pg_database_size('<geo_database>'))` is around 1.5 GB). Unfortunately, during the sync process, we encountered the following error:
    2022-06-23 13:50:38 normalization > 13:50:35.753298 [error] [MainThread]: Database Error in model geolocation_geolocation (models/generated/airbyte_tables/gis_raw/geolocation_geolocation.sql)
    2022-06-23 13:50:38 normalization > 13:50:35.753720 [error] [MainThread]:   could not write to file "base/pgsql_tmp/pgsql_tmp1360.7": No space left on device
    2022-06-23 13:50:38 normalization > 13:50:35.754073 [error] [MainThread]:   compiled SQL at ../build/run/airbyte_utils/models/generated/airbyte_tables/gis_raw/geolocation_geolocation.sql
    We did some investigating to understand what the cause could be (disk space on the EC2 instance, disk space allocated to the containers) but found nothing. Do you have an idea? Thanks in advance.
    a
    • 2
    • 1
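    The `pgsql_tmp` path in the error suggests it is a Postgres server's temporary-file space that fills up while dbt normalization builds the model, rather than the EC2 host or the Airbyte containers. A hedged diagnostic, run against the database that normalization writes to, showing how much temp space each database has consumed:
    ```sql
    -- Temp-file usage per database since statistics were last reset
    -- (columns come from the pg_stat_database system view).
    SELECT datname,
           temp_files,
           pg_size_pretty(temp_bytes) AS temp_written
    FROM pg_stat_database
    ORDER BY temp_bytes DESC;
    ```
    Checking free space on that database server's data volume (e.g. with `df -h`) is the other half of the picture.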
  • r

    Richmond Eweh

    06/23/2022, 5:19 PM
    Hey everyone, I've been using Airbyte for a few months now, and I absolutely love it. I started with the cloud version until I finally figured out how to use the Open Source version via Docker. My first connection (HubSpot --> BigQuery) worked flawlessly 🤌🏽. But ironically, a similar but newer connection with the same source and destination (but different repos) doesn't work. It's pulling 0-1 rows of data (from HubSpot), which is ridiculous since there are 1500+ companies in my HubSpot account. Here are the connections I currently have.
    a
    • 2
    • 7
  • t

    Tomas Perez

    06/23/2022, 9:55 PM
    Regarding the Shopify source: I want to get the first order each customer made on the app. I see Shopify allows this via `customer.orders` (see the link). I don't know if the current version of the source connector supports this or if it should be supported. Help 😅 Does anyone know how to retrieve `first_order` data per customer?
    • 1
    • 1
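    If the orders stream is already syncing, the first order per customer can be derived in the warehouse rather than waiting for connector support. A minimal sketch assuming a Snowflake destination and a hypothetical `orders` table with `customer_id` and `created_at` columns:
    ```sql
    -- Earliest order per customer from the synced orders table.
    SELECT *
    FROM orders
    QUALIFY ROW_NUMBER() OVER (
        PARTITION BY customer_id
        ORDER BY created_at ASC
    ) = 1;
    ```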
  • a

    Alexandre Cazé

    06/24/2022, 6:32 AM
    Hello there 👋 I have JSON files pushed to an S3 bucket that I want to load into my DWH (a Redshift cluster). It seems that the S3 connector only allows reading CSV, Parquet, or Avro. Am I mistaken? Thanks 🙂
    a
    • 2
    • 2
  • v

    Vijay

    06/24/2022, 10:38 AM
    Hi all, are there benchmarks on how long a sync with a given source (say Facebook Ads) takes? Is it always service-provider bound, or does it improve with the CPU or memory of the host machine?
    m
    • 2
    • 1
  • d

    Dusty Shapiro

    06/27/2022, 4:53 PM
    Seeing this error on a first-time Hubspot --> Postgres sync:
    2022-06-27 16:39:51 normalization > 2022-06-27 16:39:46.671752 (MainThread): Database Error in model ticket_pipelines (models/generated/airbyte_tables/hubspot/ticket_pipelines.sql)
    2022-06-27 16:39:51 normalization > 2022-06-27 16:39:46.672106 (MainThread):   invalid input syntax for type bigint: "1970-01-01T00:00:00Z"
    2022-06-27 16:39:51 normalization > 2022-06-27 16:39:46.672442 (MainThread):   compiled SQL at ../build/run/airbyte_utils/models/generated/airbyte_tables/hubspot/ticket_pipelines.sql
    m
    • 2
    • 1
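    The failure means normalization is trying to load an ISO-8601 timestamp string into a column it typed as bigint. A hedged way to find the offending field is to inspect the raw records that normalization reads, assuming Airbyte's default raw-table naming and that the stream landed in a `hubspot` schema:
    ```sql
    -- The _airbyte_raw_* tables keep the untyped JSON payload in
    -- _airbyte_data; look for the field holding '1970-01-01T00:00:00Z'.
    SELECT _airbyte_data
    FROM hubspot._airbyte_raw_ticket_pipelines
    LIMIT 10;
    ```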