# advice-data-ingestion
  • r

    Rocky Appiah

    06/27/2022, 7:15 PM
    Is there a way to make the columns in the destination table (Snowflake destination) match the order from the source (Postgres source)? It might be confusing to users who are looking at the schema to have the columns all shifted in the destination.
    m
    • 2
    • 3
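    One possible workaround (a hedged sketch only, not an Airbyte feature) is a Snowflake view that selects the columns in the source order, so anyone browsing the schema sees the familiar layout. All object and column names below are placeholders, and the trailing _AIRBYTE_* columns are assumed from default normalization output.
    import snowflake.connector  # pip install snowflake-connector-python
    
    # Placeholder credentials and object names -- adjust for your account.
    conn = snowflake.connector.connect(
        account="my_account", user="my_user", password="...",
        warehouse="MY_WH", database="MY_DB", schema="MY_SCHEMA",
    )
    
    # Reorder columns to match the Postgres source; Airbyte's metadata
    # columns (if present) are kept at the end.
    conn.cursor().execute("""
        CREATE OR REPLACE VIEW CUSTOMERS_ORDERED AS
        SELECT ID, FIRST_NAME, LAST_NAME, CREATED_AT,
               _AIRBYTE_AB_ID, _AIRBYTE_EMITTED_AT
        FROM CUSTOMERS
    """)
    conn.close()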
  • r

    Rocky Appiah

    06/27/2022, 8:30 PM
    I have a Postgres source using wal2json for replication, and the publications are set up correctly. Unable to sync; I'm getting this in the error log:
    Copy code
    2022-06-27 20:14:05 source > 2022-06-27 20:14:05 ERROR i.d.p.ErrorHandler(setProducerThrowable):35 - Producer failure
    2022-06-27 20:14:05 source > java.lang.NoSuchMethodError: 'java.time.OffsetDateTime org.postgresql.jdbc.TimestampUtils.toOffsetDateTime(java.lang.String)'
    2022-06-27 20:14:05 source >    at io.debezium.connector.postgresql.connection.PostgresDefaultValueConverter.lambda$createDefaultValueMappers$23(PostgresDefaultValueConverter.java:169) ~[debezium-connector-postgres-1.9.2.Final.jar:1.9.2.Final]
    2022-06-27 20:14:05 source >    at io.debezium.connector.postgresql.connection.PostgresDefaultValueConverter.parseDefaultValue(PostgresDefaultValueConverter.java:77) ~[debezium-connector-postgres-1.9.2.Final.jar:1.9.2.Final]
    2022-06-27 20:14:05 source >    at io.debezium.relational.TableSchemaBuilder.lambda$addField$9(TableSchemaBuilder.java:391) ~[debezium-core-1.9.2.Final.jar:1.9.2.Final]
    2022-06-27 20:14:05 source >    at java.util.Optional.flatMap(Optional.java:289) ~[?:?]
    a
    • 2
    • 3
  • t

    Tomas Perez

    06/27/2022, 8:38 PM
    Is there any reason why DynamoDB is not a source? I have some data stored in the service and want to move it to Snowflake. If there's already an issue/PR in progress I'd be willing to help; if not, I'd be willing to code the source-dynamodb connector myself.
    m
    • 2
    • 6
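    For reference, the core read loop such a connector would need is just a paginated DynamoDB scan; a minimal sketch with boto3 (table name and region are placeholders):
    import boto3  # pip install boto3
    
    # Placeholder table/region -- adjust for your environment.
    table = boto3.resource("dynamodb", region_name="us-east-1").Table("my_table")
    
    def scan_all(table):
        """Yield every item, following DynamoDB's pagination cursor."""
        kwargs = {}
        while True:
            page = table.scan(**kwargs)
            yield from page.get("Items", [])
            last_key = page.get("LastEvaluatedKey")
            if not last_key:
                break
            kwargs["ExclusiveStartKey"] = last_key
    
    for item in scan_all(table):
        print(item)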
  • j

    Jules Druelle

    06/28/2022, 10:09 AM
    Hi guys, I think there is a problem when I try to add a BigQuery destination using the GCS staging method:
    Copy code
    Could not connect to the Gcs bucket with the provided configuration. The specified key does not exist. (Service: Amazon S3; Status Code: 404; Error Code: NoSuchKey; Request ID: null; S3 Extended Request ID: null; Proxy: null)
    The error mentions Amazon S3 while I am trying to set up a connection with BigQuery on GCP. Does anyone else get this strange error?
    ✅ 1
    g
    • 2
    • 2
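    A possible explanation for the S3 wording, plus a connectivity check: the GCS staging path can be exercised through GCS's S3-compatible XML API with HMAC keys, which is why an S3-style NoSuchKey error can surface. The sketch below is a hedged way to verify the bucket and path outside Airbyte; bucket name, prefix, and keys are placeholders.
    import boto3  # pip install boto3
    
    # GCS exposes an S3-compatible endpoint; the HMAC key/secret come from
    # the GCS "Interoperability" settings. All values below are placeholders.
    s3 = boto3.client(
        "s3",
        endpoint_url="https://storage.googleapis.com",
        aws_access_key_id="GOOG1E_HMAC_KEY",
        aws_secret_access_key="HMAC_SECRET",
    )
    
    s3.head_bucket(Bucket="my-staging-bucket")          # is the bucket reachable?
    resp = s3.list_objects_v2(Bucket="my-staging-bucket", Prefix="airbyte/")
    print([o["Key"] for o in resp.get("Contents", [])])  # any staged objects?
    If this check fails with a 404/NoSuchKey, the bucket name or staging path in the destination config is the likely culprit.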
  • a

    Amanda Murphy

    06/28/2022, 5:04 PM
    Hi y'all, we want to run a Mailchimp integration but we get rate limited on our own infrastructure. I'm looking at the code and I'm wondering if y'all handle their rate limiting? https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/connectors/source-mailchimp/source_mailchimp/streams.py#L123
    m
    • 2
    • 3
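    For context, a generic pattern for handling an HTTP 429 by honoring the server's Retry-After header -- a sketch with the requests library, not the connector's actual code; the commented-out URL and auth are placeholders.
    import time
    import requests
    
    def get_with_backoff(url, max_retries=5, **kwargs):
        """GET a URL, sleeping on 429 responses per the Retry-After header."""
        for attempt in range(max_retries):
            resp = requests.get(url, **kwargs)
            if resp.status_code != 429:
                resp.raise_for_status()
                return resp
            retry_after = resp.headers.get("Retry-After")
            try:
                wait = float(retry_after) if retry_after else 2 ** attempt
            except ValueError:          # Retry-After may be an HTTP date
                wait = 2 ** attempt     # fall back to exponential backoff
            time.sleep(wait)
        raise RuntimeError(f"Still rate limited after {max_retries} retries")
    
    # Placeholder endpoint -- substitute a real Mailchimp API URL and API key.
    # resp = get_with_backoff("https://usX.api.mailchimp.com/3.0/lists", auth=("anystring", "API_KEY"))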
  • t

    Tomas Perez

    06/29/2022, 4:43 AM
    Moving some data from S3 to Snowflake, I get this error when choosing Incremental | Deduped + history. Any idea why?
    m
    • 2
    • 2
  • g

    Gustavo Guerra

    06/29/2022, 1:50 PM
    Problem connecting with Postgres RDS
    • Is this your first time deploying Airbyte?: No
    • OS Version / Instance: Ubuntu
    • Memory / Disk: 20GB
    • Deployment: Docker
    • Airbyte Version: 0.39.21
    • Source name/version: Postgres 0.4.26
    • Destination name/version: None
    • Step: Creating a new source
    • Description: Getting an error while trying to connect to a Postgres RDS database. The version is 12.x.x. I tested the same credentials in another connector and they are working.
    • The error is:
    Could not connect with provided configuration. Error: HikariPool-1 - Connection is not available, request timed out after 60002ms.
    Copy code
    2022-06-28 19:45:45 INFO i.a.w.i.DefaultAirbyteStreamFactory(lambda$create$0):61 - at org.postgresql.core.SetupQueryRunner.run(SetupQueryRunner.java:55) ~[postgresql-42.3.4.jar:42.3.4]
    2022-06-28 19:45:45 INFO i.a.w.i.DefaultAirbyteStreamFactory(lambda$create$0):61 - at org.postgresql.core.v3.ConnectionFactoryImpl.runInitialQueries(ConnectionFactoryImpl.java:871) ~[postgresql-42.3.4.jar:42.3.4]
    2022-06-28 19:45:45 INFO i.a.w.i.DefaultAirbyteStreamFactory(lambda$create$0):61 - at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:304) ~[postgresql-42.3.4.jar:42.3.4]
    2022-06-28 19:45:45 INFO i.a.w.i.DefaultAirbyteStreamFactory(lambda$create$0):61 - at org.postgresql.core.ConnectionFactory.openConnection(ConnectionFactory.java:49) ~[postgresql-42.3.4.jar:42.3.4]
    2022-06-28 19:45:45 INFO i.a.w.i.DefaultAirbyteStreamFactory(lambda$create$0):61 - at org.postgresql.jdbc.PgConnection.<init>(PgConnection.java:223) ~[postgresql-42.3.4.jar:42.3.4]
    2022-06-28 19:45:45 INFO i.a.w.i.DefaultAirbyteStreamFactory(lambda$create$0):61 - at org.postgresql.Driver.makeConnection(Driver.java:402) ~[postgresql-42.3.4.jar:42.3.4]
    2022-06-28 19:45:45 INFO i.a.w.i.DefaultAirbyteStreamFactory(lambda$create$0):61 - at org.postgresql.Driver.connect(Driver.java:261) ~[postgresql-42.3.4.jar:42.3.4]
    2022-06-28 19:45:45 INFO i.a.w.i.DefaultAirbyteStreamFactory(lambda$create$0):61 - at com.zaxxer.hikari.util.DriverDataSource.getConnection(DriverDataSource.java:138) ~[HikariCP-5.0.1.jar:?]
    2022-06-28 19:45:45 INFO i.a.w.i.DefaultAirbyteStreamFactory(lambda$create$0):61 - at com.zaxxer.hikari.pool.PoolBase.newConnection(PoolBase.java:359) ~[HikariCP-5.0.1.jar:?]
    2022-06-28 19:45:45 INFO i.a.w.i.DefaultAirbyteStreamFactory(lambda$create$0):61 - at com.zaxxer.hikari.pool.PoolBase.newPoolEntry(PoolBase.java:201) ~[HikariCP-5.0.1.jar:?]
    2022-06-28 19:45:45 INFO i.a.w.i.DefaultAirbyteStreamFactory(lambda$create$0):61 - at com.zaxxer.hikari.pool.HikariPool.createPoolEntry(HikariPool.java:470) ~[HikariCP-5.0.1.jar:?]
    2022-06-28 19:45:45 INFO i.a.w.i.DefaultAirbyteStreamFactory(lambda$create$0):61 - at com.zaxxer.hikari.pool.HikariPool$PoolEntryCreator.call(HikariPool.java:733) ~[HikariCP-5.0.1.jar:?]
    2022-06-28 19:45:45 INFO i.a.w.i.DefaultAirbyteStreamFactory(lambda$create$0):61 - at com.zaxxer.hikari.pool.HikariPool$PoolEntryCreator.call(HikariPool.java:712) ~[HikariCP-5.0.1.jar:?]
    2022-06-28 19:45:45 INFO i.a.w.i.DefaultAirbyteStreamFactory(lambda$create$0):61 - at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
    2022-06-28 19:45:45 INFO i.a.w.i.DefaultAirbyteStreamFactory(lambda$create$0):61 - at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?]
    2022-06-28 19:45:45 INFO i.a.w.i.DefaultAirbyteStreamFactory(lambda$create$0):61 - at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
    2022-06-28 19:45:45 INFO i.a.w.i.DefaultAirbyteStreamFactory(lambda$create$0):61 - at java.lang.Thread.run(Thread.java:833) ~[?:?]
    2022-06-28 19:45:45 INFO i.a.w.i.DefaultAirbyteStreamFactory(lambda$create$0):61 - 2022-06-28 19:45:45 INFO c.z.h.HikariDataSource(close):350 - HikariPool-1 - Shutdown initiated...
    2022-06-28 19:45:46 INFO i.a.w.i.DefaultAirbyteStreamFactory(lambda$create$0):61 - 2022-06-28 19:45:46 INFO c.z.h.HikariDataSource(close):352 - HikariPool-1 - Shutdown completed.
    2022-06-28 19:45:47 INFO i.a.w.t.TemporalAttemptExecution(get):134 - Stopping cancellation check scheduling...
    m
    • 2
    • 1
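    Since a Hikari "Connection is not available ... timed out" often means the database host never answered (security group, VPC routing, or a non-public RDS instance) rather than bad credentials, a quick reachability check from the Airbyte host can narrow it down. A small sketch; hostname and port are placeholders.
    import socket
    
    HOST = "my-db.xxxxxx.eu-west-1.rds.amazonaws.com"  # placeholder RDS endpoint
    PORT = 5432
    
    try:
        with socket.create_connection((HOST, PORT), timeout=10):
            print("TCP connection OK -- the problem is likely auth/SSL config")
    except OSError as err:
        print(f"Cannot reach {HOST}:{PORT} -> {err} (check security group / network)")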
  • r

    Rocky Appiah

    06/29/2022, 8:48 PM
    Is there a way to have a Sync mode of Incremental | Append on a table which has a unique key instead of a primary key? I've already configured the REPLICA IDENTITY on the Postgres table to match the unique key, and I can see on the Postgres side of things that changes are being written to the replication slot.
    m
    • 2
    • 4
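    For reference, a sketch of the Postgres side of that setup via psycopg2 (table, column, and index names are placeholders). REPLICA IDENTITY USING INDEX requires a unique, non-partial index on NOT NULL columns.
    import psycopg2  # pip install psycopg2-binary
    
    conn = psycopg2.connect("dbname=mydb user=postgres password=... host=...")
    conn.autocommit = True
    cur = conn.cursor()
    
    # Unique index on the business key (columns must be NOT NULL).
    cur.execute("CREATE UNIQUE INDEX IF NOT EXISTS orders_order_no_idx ON orders (order_no)")
    
    # Tell logical decoding to emit old-row images keyed on that index.
    cur.execute("ALTER TABLE orders REPLICA IDENTITY USING INDEX orders_order_no_idx")
    
    conn.close()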
  • l

    Leo G

    06/29/2022, 8:58 PM
    Does cloud-based or Docker-based Airbyte support Oracle Active Data Guard connectivity?
    m
    • 2
    • 14
  • g

    Giorgos Tzanakis

    06/30/2022, 12:37 PM
    Dear all, this is my first message, so I hope I'm posting in the right place. At my company we use AWS RDS with MySQL 5.7 for our transactional data; our warehouse is BigQuery. We have a production MySQL instance as well as a replica. I want to ingest our database data (CDC) into BigQuery using the Airbyte cloud service (we already bought credits). Following the article here, I asked our technical team to (1) enable binary logging on our replica, and (2) create an airbyte user with access to the replica and permissions including RELOAD, REPLICATION SLAVE, and REPLICATION CLIENT.
    My technical team challenged me about this, asking:
    1. What do these privileges do exactly, and why are they needed in my case?
    2. Will it affect anything on the production database (such as performance, or anything else for that matter)?
    3. Is there some kind of guarantee that it is safe to use a third-party service such as Airbyte with our private data? (We are based in the EU and want to comply with GDPR.)
    Now, forgive my lack of knowledge; my background in databases is fairly basic. I understand that REPLICATION SLAVE/CLIENT are used to access the binary logs. I'm trying to understand what RELOAD does, but I'm not 100% sure; I see some mention of accessing the master database for the logs, so I'm a bit worried. In any case, I don't feel confident enough to answer the second question specifically, i.e. whether this will affect our production database somehow. Could somebody please explain whether this is the case? Any other comments on those questions are more than welcome. Thank you very much in advance.
    👀 1
    ✅ 1
    a
    • 2
    • 4
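    For reference, the kind of statements that creating such a user involves -- a sketch only, to be checked against the current Airbyte/Debezium MySQL CDC docs; the host, password, and exact privilege list here are assumptions.
    import pymysql  # pip install pymysql
    
    # Run against the *replica* as an admin user; all values are placeholders.
    conn = pymysql.connect(host="replica-host", user="admin", password="...")
    with conn.cursor() as cur:
        cur.execute("CREATE USER IF NOT EXISTS 'airbyte'@'%' IDENTIFIED BY 'strong-password'")
        # REPLICATION SLAVE/CLIENT expose the binlog and replication status;
        # RELOAD is typically used for consistent snapshots; all are global (ON *.*).
        cur.execute(
            "GRANT SELECT, RELOAD, SHOW DATABASES, REPLICATION SLAVE, "
            "REPLICATION CLIENT ON *.* TO 'airbyte'@'%'"
        )
    conn.close()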
  • a

    Anh-Tho (Lago)

    06/30/2022, 3:37 PM
    Hello! First message here, so I hope this is where I should ask. I came across Stripe Data Pipeline; basically we need to pay €0.03 per transaction to extract the data from Stripe and push it into our warehouse. Why should any business pay this when we could just use Airbyte? Did I miss anything?
    • 1
    • 2
  • s

    Simon Thelin

    07/02/2022, 7:12 AM
    Currently, Airbyte is painfully slow. Is there any general guidance on how to improve performance when moving Postgres -> S3 (Parquet), around ~50-100 GB? The current deployment syncs around ~10 GB per hour. The workers utilise almost no memory (~200-300 MB), and each worker only spawns one single writer. The writer utilises around 1.5-2 GB, and its core usage is around 0-1. I am running this in K8s. Has anyone been able to tweak this in a similar scenario and made it work a bit faster? Can I force the writer to use more cores to increase parallelism?
    • 1
    • 1
  • a

    Anton Peniaziev

    07/03/2022, 2:08 PM
    Hello friends 🙂 Maybe somebody has experience with the following setup or has seen similar errors. I've set up an S3 source and Postgres as a destination, trying to ingest snappy Parquet files from S3 (a Databricks table). The table is created with the correct schema, but no data is ingested and the job fails with the following log:
    Copy code
    2022-07-03 13:56:20 ERROR i.a.c.i.LineGobbler(voidCall):82 - SLF4J: Class path contains multiple SLF4J bindings.
    2022-07-03 13:56:20 ERROR i.a.c.i.LineGobbler(voidCall):82 - SLF4J: Found binding in [jar:file:/airbyte/lib/log4j-slf4j-impl-2.17.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    2022-07-03 13:56:20 ERROR i.a.c.i.LineGobbler(voidCall):82 - SLF4J: Found binding in [jar:file:/airbyte/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    2022-07-03 13:56:20 ERROR i.a.c.i.LineGobbler(voidCall):82 - SLF4J: Found binding in [jar:file:/airbyte/lib/logback-classic-1.2.3.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    2022-07-03 13:56:20 ERROR i.a.c.i.LineGobbler(voidCall):82 - SLF4J: See <http://www.slf4j.org/codes.html#multiple_bindings> for an explanation.
    2022-07-03 13:56:20 ERROR i.a.c.i.LineGobbler(voidCall):82 - SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
    logs-2.txt
    y
    a
    • 3
    • 4
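    One way to rule out the files themselves (a sketch, assuming local access to one of the Parquet objects; the path is a placeholder) is to read them with pyarrow and confirm the schema and row count look sane:
    import pyarrow.parquet as pq  # pip install pyarrow
    
    # Placeholder path -- download one object from the S3 prefix first.
    table = pq.read_table("part-00000.snappy.parquet")
    print(table.schema)      # column names and types Airbyte will see
    print(table.num_rows)    # confirms the file actually contains rows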
  • a

    Albinas Plėšnys

    07/04/2022, 9:50 AM
    Hi, I hope I am writing to the right channel and that my question makes sense 😄 I'm trying to write my first Airbyte connector. I did the initial setup, but when trying to test my connector I get a "no such file ... spec.json" error. However, I have a spec.yaml file and intend to use it instead of a .json. How should I proceed to point to the YAML spec?
    a
    • 2
    • 9
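    If the CDK version in use only looks for spec.json, one stop-gap (a sketch, not the official fix -- newer CDK versions read spec.yaml directly) is to generate spec.json from the YAML file:
    import json
    import yaml  # pip install pyyaml
    
    # Convert the connector's spec.yaml into the spec.json an older CDK expects.
    with open("spec.yaml") as src, open("spec.json", "w") as dst:
        json.dump(yaml.safe_load(src), dst, indent=2)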
  • r

    Rocky Appiah

    07/05/2022, 1:55 PM
    Can anyone assist with this?
    a
    • 2
    • 2
  • r

    Rocky Appiah

    07/07/2022, 7:38 PM
    Is it possible to specify the database (not just the schema) on the destination side for snowflake?
    • 1
    • 2
  • t

    Thomas Gerber

    07/07/2022, 8:56 PM
    Hi! When you write custom connectors (like the open source ones we have here: https://github.com/faros-ai/airbyte-connectors), how can we have icons associated with them when someone adds them to an instance? It seems right now that the source definition must point to an svg in https://github.com/airbytehq/airbyte/tree/master/airbyte-config/init/src/main/resources/icons; is there any other way?
    m
    • 2
    • 2
  • m

    Martin Carlsson

    07/08/2022, 5:34 AM
    Hi, I have connected our main production database (source) to Snowflake via Airbyte (self-hosted). Our production database has more than 100 tables, and we only use around 10 tables right now. Every time I add a new table to the data load, Airbyte resets all tables. Some of the tables are quite large, so this causes significant interruptions in our data delivery. Is there any way to set Airbyte not to reload all tables when I add a new table?
    m
    • 2
    • 3
  • v

    Vaibhav Pal Singh

    07/10/2022, 4:18 AM
    Hi team, I have data ingestion requirements where I need to ingest data using custom SQL. For example, I have the following two scenarios:
    1. Ingest only specific data, e.g. only the last 2 years of data, with Full Refresh | Overwrite. We have a very large table and require only two years of data, and I can't figure out a way to do something like: SELECT * FROM TABLE WHERE DT >= '2021-01-01'
    2. Incrementally ingest data based on multiple cursor columns and conditions. My data has two fields to track changes, Updated_On and Created_On. Created_On is populated at insert time and Updated_On stays NULL at that point; Updated_On is populated only when the record changes. I want to do something like SELECT *, COALESCE(Updated_On, Created_On) AS CURSOR_COLUMN FROM TABLE.
    Note: ETL teams rarely have write access to source systems, and since we deal with hundreds or thousands of tables, creating views for requirements involving custom SQL is also not feasible. Is there any way this can be achieved with Airbyte? Otherwise it doesn't seem like a very flexible solution.
    s
    • 2
    • 2
  • r

    Rafael Soufraz

    07/11/2022, 8:30 AM
    Hi people, how are you? I am facing some issues with a Typeform historical sync. The log error is clear and I already added more memory and started the sync again, but I would like to know if there is a sync method that writes files directly into the destination instead of holding everything in memory. Would that be a bad or good practice? When running Airbyte with low memory, what is the best workaround while syncing millions of records? https://pastebin.com/Td6Ls5N1 (last sync tail log) Airbyte is running on a single n1-standard-4 (15 GB memory and 4 vCPUs). Thanks 🙂
    a
    • 2
    • 2
  • j

    Jensen Yap

    07/11/2022, 1:57 PM
    I followed everything y'all wrote in the docs at https://docs.airbyte.com/connector-development/cdk-tutorial-python-http/install-dependencies and nothing works.
    a
    k
    a
    • 4
    • 9
  • k

    Kavin Rajagopal

    07/11/2022, 2:15 PM
    Hello, I have connections running from Google Analytics to an S3 bucket. It is a daily incremental append sync. Every day when the sync runs, it brings in data not just from where the previous sync ended but from 2 days before it. So when a sync runs on June 28 it brings in data from June 25, 26, 27, and when the sync runs on June 29 it brings in data from June 26, 27, 28, so there is duplication of data. How do I fix this issue?
    a
    • 2
    • 1
  • s

    sarath saleem

    07/11/2022, 11:18 PM
    Hello, I was doing a test run reading an Airbyte S3 Parquet file using Dremio. I put a sample .parquet file in S3 and ran a sync to another folder, then tried to read both the source file and the copied file from S3 with Dremio. The source file is readable, but the Airbyte-copied file is not. From the Dremio logs I got this error: Unable to coerce from the file's data type "timestamp" to the column's data type "int64" in table "2022_07_11_1657578941501_0.parquet", column "_ab_source_file_last_modified.member0". I attached the schema shown in my connector settings, according to which _ab_source_file_last_modified should be a string. Also, in Dremio's file preview feature I can see that _ab_source_file_last_modified is detected as an Object. So what should the _ab_source_file_last_modified type be, and why is the Parquet file created by Airbyte not readable? Attached the file and screenshots. Any help is appreciated.
    2022_07_11_1657578941501_0.parquet
    a
    • 2
    • 1
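    To see what type Airbyte actually wrote for that column (a quick check, assuming local access to the output file), inspecting the Parquet footer directly avoids guessing from Dremio's preview:
    import pyarrow.parquet as pq  # pip install pyarrow
    
    # The Airbyte output file referenced above, downloaded locally.
    schema = pq.read_schema("2022_07_11_1657578941501_0.parquet")
    print(schema)  # look at the _ab_source_file_last_modified field's type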
  • k

    Kerry Chu

    07/12/2022, 12:14 AM
    Hello, not sure where to post this issue, but the Airbyte Specification link in this doc page is not found.
    a
    a
    • 3
    • 20
  • p

    Pipat Methavanitpong

    07/12/2022, 4:09 AM
    Hi. I'm having trouble with removing columns before loading into Redshift. Is there a way to select which columns to sync in the UI? Or do I have to let Airbyte run with the default transformation once and pick its dbt models out every time the source models change?
    a
    • 2
    • 3
  • r

    Rahul Yadav

    07/12/2022, 5:43 AM
    Hey Airbyte team, I want to sync data from 2000+ API calls. Can I do that sync in parallel in Airbyte?
    m
    • 2
    • 2
  • p

    Pranit

    07/12/2022, 8:04 AM
    If I try to refresh the schema of one table, it refreshes all the tables in the connection, which seems pointless. Is that correct, or did I do something wrong?
    m
    • 2
    • 1
  • g

    Gustavo Guerra

    07/12/2022, 2:04 PM
    Hello guys, when using HubSpot incremental sync I get empty columns for associations; i.e., when syncing Deals, the companies, contacts, and line items associations are empty. I noticed that when I use full overwrite I get the correct values for the companies and contacts associations (line items are still empty in this case). Is this supposed to happen due to an API endpoint limitation, or is it actually a bug?
    👀 1
    m
    • 2
    • 1
  • t

    Tobias Troelsen

    07/12/2022, 8:58 PM
    Pipedrive custom connector question | Endpoints (e.g. Find subscription by deal) that require another endpoint (e.g. Get all deals) as input.
    Hi all, looking for some assistance/direction on an expansion of the Pipedrive connector I am trying to do. Exemplified by this:
    • I need to retrieve all subscriptions with the Find subscription by deal endpoint.
    • That endpoint takes as an input parameter the deal id that I am searching for a subscription for.
    • That deal id is retrieved through the Get all deals endpoint.
    • The same applies for Get all payments of a Subscription, where I need the subscription ID retrieved through the Find subscription by deal endpoint described above.
    How do I go about implementing logic that solves this for me? THANKS! 🙌
    a
    • 2
    • 5
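    Outside the CDK specifics, the underlying pattern is just a parent/child fetch: list the parent records, then call the child endpoint once per parent id. A sketch with requests; the endpoint paths and pagination fields are assumptions to verify against the Pipedrive API docs, and the token is a placeholder.
    import requests
    
    API = "https://api.pipedrive.com/v1"
    TOKEN = {"api_token": "YOUR_TOKEN"}  # placeholder
    
    def get_all_deals():
        """Yield deals, following Pipedrive's offset-based pagination."""
        start = 0
        while True:
            resp = requests.get(f"{API}/deals", params={**TOKEN, "start": start}).json()
            yield from resp.get("data") or []
            pagination = resp.get("additional_data", {}).get("pagination", {})
            if not pagination.get("more_items_in_collection"):
                break
            start = pagination["next_start"]
    
    def get_subscription_for_deal(deal_id):
        # Assumed path for "Find subscription by deal" -- verify in the API docs.
        resp = requests.get(f"{API}/subscriptions/find/{deal_id}", params=TOKEN).json()
        return resp.get("data")
    
    for deal in get_all_deals():
        sub = get_subscription_for_deal(deal["id"])
        if sub:
            print(deal["id"], sub.get("id"))
    In CDK terms, this parent/child relationship is typically modeled with stream slices, with the deals stream acting as the parent.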
  • v

    Vishal Jain

    07/13/2022, 1:06 AM
    I am running Airbyte for the first time on Mac M1. I get this error message:
    Copy code
    2022-07-13 01:01:26 ERROR i.a.c.i.LineGobbler(voidCall):82 - WARNING: The requested image's platform (linux/amd64) does not match the detected host platform (linux/arm64/v8) and no specific platform was requested
    Can I ignore this or is this something I should address?
    s
    s
    • 3
    • 5