Mahesh
03/01/2022, 10:22 AM

Mahesh
03/01/2022, 11:37 AM

BERKIN
03/02/2022, 11:48 AM

Kemp Po
03/08/2022, 4:15 PM
s3a:// instead of gs://?
All default settings except compression codec = SNAPPY

konrad schlatte
03/10/2022, 12:08 PM
2022-03-10 08:35:13 INFO () DefaultAirbyteStreamFactory(internalLog):90 - Done retrieving results from 'sent' endpoint
2022-03-10 08:35:13 INFO () DefaultAirbyteStreamFactory(internalLog):90 - Updating state.
2022-03-10 08:35:13 INFO () DefaultAirbyteStreamFactory(internalLog):90 - Fetching sent from 2022-03-09T12:00:00Z to 2022-03-09T12:30:00Z
2022-03-10 08:35:13 INFO () DefaultAirbyteStreamFactory(internalLog):90 - Making RETRIEVE call to 'sent' endpoint with filters '{'Property': 'EventDate', 'SimpleOperator': 'between', 'Value': ['2022-03-09T12:00:00Z', '2022-03-09T12:30:00Z']}'.
2022-03-10 08:35:13 ERROR () DefaultAirbyteStreamFactory(internalLog):88 - Request failed with 'Error: Execution Timeout Expired. The timeout period elapsed prior to completion of the operation or the server is not responding.'
2022-03-10 08:35:13 ERROR () DefaultAirbyteStreamFactory(internalLog):88 - Traceback (most recent call last):
2022-03-10 08:35:13 ERROR () DefaultAirbyteStreamFactory(internalLog):88 - File "/usr/local/lib/python3.7/site-packages/tap_exacttarget/__init__.py", line 135, in do_sync
I can resolve this by reducing the "pagination window" from 30 minutes to 5 minutes, for example; i.e. it appears that at the 30-minute interval there is too much data to process, hence the timeout. I am wondering whether there is another way to handle this error.
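In case it helps others hitting the same timeout, the workaround amounts to slicing the sync range into smaller windows before each RETRIEVE call. A minimal sketch of the idea in plain Python; window_chunks and chunk_minutes are illustrative names, not part of the tap_exacttarget connector:
```python
from datetime import datetime, timedelta

def window_chunks(start, end, chunk_minutes=5):
    """Yield (chunk_start, chunk_end) pairs covering [start, end) in small slices.

    Smaller slices mean each RETRIEVE call returns less data, which has the same
    effect as shrinking the connector's pagination window.
    """
    step = timedelta(minutes=chunk_minutes)
    cursor = start
    while cursor < end:
        chunk_end = min(cursor + step, end)
        yield cursor, chunk_end
        cursor = chunk_end

# Break the failing 30-minute window into 5-minute slices:
for lo, hi in window_chunks(datetime(2022, 3, 9, 12, 0), datetime(2022, 3, 9, 12, 30)):
    print(lo.isoformat() + "Z", "->", hi.isoformat() + "Z")
```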
There is an outstanding PR for this connector as well: https://github.com/airbytehq/airbyte/pull/10026

Oluwapelumi Adeosun
03/11/2022, 6:52 AM
refresh source schema. Is this a bug, or how can I ensure all the tables from the source are loaded into the destination I specified?
The source is a PostgreSQL DB running on Amazon RDS.

Gary K
03/11/2022, 7:29 AM
number -> double precision conversion that appears to be happening with the postgres connector (0.3.15 in airbyte 0.35.42-alpha)? I've got a mysql source bigint column stored with full precision in the _airbyte_data json, but the normalization is converting it to a double and I'm losing precision 😱
(Note, I'd rather not have to do a custom normalisation (from raw) of all the connection streams manually; i.e. no heavy lifting on my part if possible 🏋️)
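A side note on the mechanics, since it explains the symptom: a double has a 53-bit mantissa, so any bigint above 2**53 can silently change when it passes through double precision. A quick illustration in plain Python, nothing Airbyte-specific:
```python
big_id = 9007199254740993        # 2**53 + 1, fits comfortably in a bigint
as_double = float(big_id)        # what a double precision column effectively stores
print(int(as_double))            # 9007199254740992 -> last digit changed
print(big_id == int(as_double))  # False: precision was lost
```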
Connor Francis
03/11/2022, 10:51 PM
moe, larry and curly. All three of these source schemas have the same table called stooges. My destination would only have a single schema called public, and I would like all three sources to dump into the same stooges table in this destination schema; however, I would like to add an additional text column in the destination table called source_schema, which would take on the value of moe, larry or curly.
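To make the target shape concrete, a post-load view that unions the three raw tables and stamps each row with its source schema would look roughly like the sketch below. Schema, table, and column names are taken from the message; how the raw tables land in the destination is an assumption.
```python
# Hypothetical post-load view: one public.stooges table with a source_schema column.
schemas = ["moe", "larry", "curly"]

selects = [
    f"SELECT '{s}' AS source_schema, t.* FROM {s}.stooges AS t"
    for s in schemas
]
view_sql = "CREATE OR REPLACE VIEW public.stooges AS\n" + "\nUNION ALL\n".join(selects) + ";"
print(view_sql)
```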
Adam Schmidt
03/14/2022, 6:53 AM
my-org%2fsome-subgroup as the group ID (which works!)
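For anyone puzzled by that value: it is just the group path with the slash percent-encoded, which you can reproduce with the standard library (the group path below is the placeholder from the message):
```python
from urllib.parse import quote

group_path = "my-org/some-subgroup"
print(quote(group_path, safe=""))  # my-org%2Fsome-subgroup (%2f and %2F are equivalent)
```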
03/14/2022, 10:32 AM-memory
parameter seems a good option but I am not sure if it works well within Airbyte deployments : https://docs.docker.com/config/containers/resource_constraints/#limit-a-containers-access-to-memory
• The Airbyte specific parameter JOB_MAIN_CONTAINER_MEMORY_LIMIT
works at a job level if I am not mistaken ; As I don’t know how many Airbyte jobs can be triggered at the same time, if I have 10 jobs consuming only 1GB of RAM at the same time it will cause the same issue, which is why I would prefer to set a global RAM threshold.
What do you think would be the best option ?Arash Layeghi
03/14/2022, 2:00 PM

Nitin Jain
03/14/2022, 3:17 PM

Robert Andrews
03/14/2022, 4:32 PM

Filipe Araújo
03/14/2022, 5:43 PM

Madhup Sukoon
03/14/2022, 6:04 PM
error validating data: unknown object type "nil" in Secret.data.postgresql-password
I'm trying to get it to run with an external AWS RDS PGSQL DB. I have defined the following params:
postgresql.enabled
externalDatabase.host
externalDatabase.user
externalDatabase.existingSecret
externalDatabase.existingSecretPasswordKey
externalDatabase.database
I have not defined externalDatabase.password (because I want it to take the password from the secret) or the port number (the default should be correct).
Any ideas where I might be going wrong?

William Graham
03/14/2022, 6:44 PM

Owen Kephart
03/14/2022, 8:24 PM
streamName field in the jobs/get response. Intuitively, I expected this name to be the same as the name field for the matching source in the syncCatalog of connections/get, but it seems that streamName actually includes the prefix, while name does not. So for example, if I had a connector with a prefix of foo, streamName would be foo_actions, while name would be just actions.
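A tiny sketch of the mapping this implies when joining the two API responses, assuming streamName is literally prefix + name (the prefix value below is illustrative):
```python
def catalog_name(stream_name: str, prefix: str) -> str:
    """Map a jobs/get streamName back to the connections/get catalog name
    by stripping the connection prefix, since streamName = prefix + name."""
    if prefix and stream_name.startswith(prefix):
        return stream_name[len(prefix):]
    return stream_name

print(catalog_name("foo_actions", "foo_"))  # actions
```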
Aditya Rane
03/15/2022, 1:17 AM

Octavia Squidington III
03/15/2022, 7:03 AM

Octavia Squidington III
03/15/2022, 8:07 AM

Kevin Soenandar
03/15/2022, 9:52 AM
companies table's ticket associations would have the following value:
[ {"company_id": <some_value>, "ticket_id": <some_value>}, {"company_id": <some_value>, "ticket_id": <some_value>} ]
My expectation is it should create a separate table once ingested into my Snowflake warehouse, with company_id and ticket_id as the fields, per this documentation. However, this is not the case. Any idea what I'm missing here?
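For reference, the child table being described is just one row per element of that associations array, something shaped like this (values are placeholders, nothing here is Airbyte's actual normalization code):
```python
# One parent record with an array of associations...
record = {
    "ticket_associations": [
        {"company_id": 101, "ticket_id": 9001},
        {"company_id": 101, "ticket_id": 9002},
    ],
}

# ...is expected to normalize into one row per element,
# with company_id and ticket_id as the columns.
rows = [(a["company_id"], a["ticket_id"]) for a in record["ticket_associations"]]
print(rows)  # [(101, 9001), (101, 9002)]
```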
Keshav Agarwal
03/15/2022, 10:30 AM

Octavia Squidington III
03/15/2022, 11:12 AM

Brian Soares
03/15/2022, 12:18 PM

Nitin Jain
03/15/2022, 12:22 PM
With the INSERT replica strategy, data is being synced but the pipeline is very slow. Looking at the docs, we changed the replica strategy to COPY by giving the S3 credentials in the Redshift destination. With the COPY replica strategy, CSV files are being written to S3, but only some partial data is being inserted into our Redshift DB. In the example below, you can see the pipeline read 39,100 records; I verified 4 different CSVs were written to S3, one having 16252 records, another having somewhere around 22k records, another with 2k records. But the number of records written to the Redshift DB is around 16301. I have seen that if multiple files are written to S3, only one of the files (randomly chosen) is being synced to the DB. I'm using full refresh | append mode for the pipeline. Attaching an image for better understanding.
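One way to confirm the gap is to count the rows staged in the S3 CSVs and compare against the destination table. A rough diagnostic sketch (not Airbyte code) using boto3 and psycopg2; the bucket, prefix, table, and connection details are placeholders, and it assumes a small number of staged files and counts header rows if present:
```python
import csv, io
import boto3
import psycopg2

s3 = boto3.client("s3")
bucket, prefix = "my-staging-bucket", "airbyte-staging/my_stream/"

# Count rows across all staged CSV objects under the prefix.
staged_rows = 0
for obj in s3.list_objects_v2(Bucket=bucket, Prefix=prefix).get("Contents", []):
    body = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"].read().decode("utf-8")
    staged_rows += sum(1 for _ in csv.reader(io.StringIO(body)))

# Count rows that actually landed in Redshift.
with psycopg2.connect(host="redshift-host", port=5439, dbname="db",
                      user="user", password="***") as conn:
    with conn.cursor() as cur:
        cur.execute("SELECT COUNT(*) FROM my_schema.my_stream")
        loaded_rows = cur.fetchone()[0]

print(f"staged in S3: {staged_rows}, loaded in Redshift: {loaded_rows}")
```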
Jayesh Patil
03/15/2022, 1:07 PM

Maxime Sabran
03/15/2022, 1:31 PM

Michael Horvath
03/15/2022, 1:39 PM

Drew Fustin
03/15/2022, 1:57 PM

Saman Arefi
03/15/2022, 1:58 PM
t2.large instance and describe, in detail, how Airbyte is mainly memory and disk bound.
I've been testing stuff out now on a t3.xlarge and noticed the following:
Loading one large-ish Oracle table (~9GB, 7M rows) takes me about 30min, which I think is pretty good. Now, loading two at the same time via the same connector (9GB, 7M rows and 13GB, 7M rows) takes an hour in total, with each taking roughly the full hour.
What gives?
Looking at htop, I seem to be running more into a CPU limit as well, so I'm not sure what's causing this. These are my two largest tables, but in production I'd use Airbyte for another 30 or so tables, each between 10k and 1M rows, so this doesn't seem to scale well. Or am I doing something wrong?