# ask-community-for-troubleshooting
  • Zacharias Markakis
    02/02/2022, 11:26 AM
    Hello everyone! I recently started playing around with Airbyte, and I had a question about the API: is there a way to enforce authentication through API tokens on the endpoints? There is an “Authentication” section here for providing a Bearer token, but it doesn’t seem to be enforced. Our use case is to be able to programmatically set up airflow without having to use the UI, but in a secure way. Thanks!
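    A minimal sketch of what such a programmatic call could look like, assuming the Bearer token is checked by a reverse proxy placed in front of the Airbyte server (as noted above, open-source Airbyte does not enforce it itself); the URL, token, and endpoint usage are placeholders based on the public configuration API:
    Copy code
    import requests

    AIRBYTE_URL = "http://localhost:8000"   # placeholder: local deployment
    API_TOKEN = "my-secret-token"           # placeholder: token enforced by a reverse proxy, not by Airbyte itself

    # List workspaces through the Airbyte Configuration API instead of the UI.
    resp = requests.post(
        f"{AIRBYTE_URL}/api/v1/workspaces/list",
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        json={},
    )
    resp.raise_for_status()
    print(resp.json())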
  • Matus Pavliscak
    02/02/2022, 3:02 PM
    Hello everyone, I’m looking into options for setting up MySQL -> Snowflake replication using CDC (binary logs) - is it possible through Airbyte? Thank you.
  • Raj C
    02/03/2022, 2:57 AM
    Setting up a local source: I'm on a Windows machine. How do I specify the path in the URL?
  • Arvi
    02/03/2022, 3:28 AM
    Hi there, a generic question: I saw that Airbyte 0.35.12-alpha uses dbt version 0.21.1. Just curious whether there are any plans to move to dbt version 1.0.0.
  • Kunal Chauhan
    02/03/2022, 9:37 AM
    Hello, I am trying to sync data from a self-hosted MongoDB setup to MongoDB Atlas with the Namespace Configuration “Mirror source structure”, an empty table prefix, and the sync mode “Full refresh | Append”. It shows the proper schema in the “Select the data you want to sync” section, but the data in the destination database comes out in the form below. I would like the source data object to be split into separate keys instead of being nested inside “_airbyte_data”. How can I implement that?
    Copy code
    {
      "_id": {
        "$oid": "61fb98c874f7580e76c626dc"
      },
      "_airbyte_data": <source_data_object>,
      "_airbyte_data_hash": "fd0b96e8-61c7-36d4-a266-af2bf7e43988",
      "_airbyte_emitted_at": "2022-02-03T08:56:40.120"
    }
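    A rough sketch of one way to expand “_airbyte_data” into separate keys outside of Airbyte, by flattening the raw documents with pandas; the Atlas URI, database, and collection names are placeholders:
    Copy code
    import pandas as pd
    from pymongo import MongoClient

    client = MongoClient("mongodb+srv://user:pass@cluster.example.net/db")  # placeholder Atlas URI
    raw = pd.DataFrame(list(client["db"]["my_collection"].find()))

    # json_normalize turns each nested _airbyte_data document into flat columns.
    flat = pd.json_normalize(raw["_airbyte_data"].tolist())
    flat["_airbyte_emitted_at"] = raw["_airbyte_emitted_at"].values
    print(flat.head())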
  • Andrei Batomunkuev
    02/03/2022, 9:13 PM
    Hi everyone! I have a question regarding data preprocessing (data transformation). Airbyte is an ELT tool which extracts data from sources and uploads it to the destination as raw data. I would like to preprocess that raw data with Pandas (Python), and I want it to be automated as well. My plan:
    1. Extract Shopify data using Airbyte and store it in a Postgres database.
    2. Get the data for each particular product from the products table (by product_type).
    3. Preprocess this data (extract additional information about the product from the tags field and add it to the corresponding fields).
    4. Store the final (preprocessed) data in separate tables in Postgres. For example, I get the data about shoes, preprocess it using Pandas (extract information from tags), add the extracted data to additional fields, and save it as a separate table in the Postgres database (shoes_table).
    Therefore, my question: is there a way to preprocess the data using Pandas (Python code) in Airbyte? Or are there other approaches, for example using Airflow + Airbyte together?
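    A sketch of steps 2-4 of the plan above, assuming the Shopify data already sits in a products table and that tags arrive as a comma-separated "key:value" string; the connection string, table, and column names are placeholders, and in practice a script like this would typically be scheduled (e.g. by Airflow) after the Airbyte sync completes:
    Copy code
    import pandas as pd
    from sqlalchemy import create_engine

    engine = create_engine("postgresql://user:pass@localhost:5432/shop")  # placeholder DSN

    # Step 2: pull one product_type from the raw products table.
    products = pd.read_sql(
        "SELECT id, title, product_type, tags FROM products WHERE product_type = 'shoes'",
        engine,
    )

    # Step 3: extract additional fields from the tags column (assumed "key:value, key:value" format).
    def parse_tags(tags: str) -> dict:
        pairs = (t.split(":", 1) for t in tags.split(",") if ":" in t)
        return {k.strip(): v.strip() for k, v in pairs}

    tag_columns = products["tags"].fillna("").apply(parse_tags).apply(pd.Series)
    shoes = pd.concat([products.drop(columns=["tags"]), tag_columns], axis=1)

    # Step 4: store the preprocessed data as its own table.
    shoes.to_sql("shoes_table", engine, if_exists="replace", index=False)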
  • Renzo B
    02/03/2022, 9:18 PM
    Does anyone know how I can specify (in the helm chart) image URIs for JOB_POD_SOCAT_IMAGE, JOB_POD_BUSYBOX_IMAGE, JOB_POD_CURL_IMAGE? (deployment: K8s / helm chart -- 0.33.15-alpha)
  • Yiyang (Heap.io)
    02/03/2022, 9:45 PM
    I plan to deploy Airbyte to AWS. I have a couple of customers, and I will extract data on their behalf. Should I create a workspace for each customer or use the same workspace for all customers? I saw that workspaces are available through the API, but not the UI. Thanks.
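    A rough sketch of creating one workspace per customer through the configuration API (since workspaces are not exposed in the UI); the server URL, customer list, and e-mail values are placeholders, and field names follow the public configuration API:
    Copy code
    import requests

    AIRBYTE_URL = "http://localhost:8000"  # placeholder deployment URL

    for customer in ["acme", "globex"]:    # placeholder customer list
        resp = requests.post(
            f"{AIRBYTE_URL}/api/v1/workspaces/create",
            json={"name": f"workspace-{customer}", "email": f"data@{customer}.example.com"},
        )
        resp.raise_for_status()
        print(customer, resp.json()["workspaceId"])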
  • Phoebe Yang
    02/04/2022, 12:22 AM
    Hi! I’ve just started experimenting with Airbyte. It’s a powerful tool and I want to adopt it for production use at my organization. A couple of questions I have:
    • Can Airbyte handle a big data migration, ~1TB (from Heroku Postgres to AWS Postgres), and what’s the best way to optimize the migration? For context, we cannot have downtime for the source, and we need to move a lot of data over and perform custom dbt transformations after loading the data to the destination. Is this sync feasible and how long does it usually take? If so, what are the machine requirements (for both the Airbyte host and the destination Postgres) to accomplish this as fast as possible?
    • How does the sync handle schema changes/updates after the initial full refresh? If we change the source schema or the transformation logic, will the destination be updated automatically? The next follow-up is more of a dbt question, but if I change the transformation logic and run a full refresh on a big table (~300GB), how long does this usually take and will the table be locked during the refresh?
    Performance and availability are important for my use case. Would love to get thoughts and recommendations on these. Thanks in advance!
  • Guilherme Calixto
    02/04/2022, 3:54 AM
    Hi guys! I'm trying to get started with Airbyte, but when I try to clone the repo I keep getting errors like: error: unable to create file airbyte-integrations/bases/base-normalization/integration_tests/normalization_test_output/snowflake/test_nested_streams/second_output/airbyte_incremental/scd/TEST_NORMALIZATION/NESTED_STREAM_WITH_COMPLEX_COLUMNS_RESULTING_INTO_LONG_NAMES_SCD.sql: Filename too long. Does anyone know how to proceed?
  • Arvi
    02/04/2022, 5:40 AM
    Hi all, quick question regarding the upgrade. The current version is v0.35.16-alpha and I am currently on 0.35.12-alpha. As per the documentation, “Airbyte intelligently performs upgrades automatically based off of your version defined in your `.env` file and will handle data migration for you.” My questions are: 1. Does it mean that we would get a notification to upgrade? 2. How will it decide whether to upgrade or not?
  • Shah Newaz Khan
    02/04/2022, 6:22 AM
    Hello team, I am running a BigQuery destination with GCS staging. I see the _airbyte_tmp tables show up in the target dataset and it looks like the connection from the source is running. However, I don't see any of the .avro files accumulating in GCS, and the _airbyte_tmp tables are empty. I have set GCS staging to not delete the tmp files. How can I tell if data is being lifted and shifted?
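    One quick way to check whether staging files are actually landing is to list the configured GCS path directly; the bucket name and prefix below are placeholders for whatever the destination's GCS staging settings point at:
    Copy code
    from google.cloud import storage

    client = storage.Client()
    # Print whatever has been written under the staging prefix so far.
    for blob in client.list_blobs("my-staging-bucket", prefix="airbyte/staging/"):
        print(blob.name, blob.size, blob.updated)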
  • Arvi
    02/04/2022, 11:13 AM
    Hi Airbyte Team, I am curious to see if you have any timeline for Kubernetes version to move to alpha
  • Peem Warayut
    02/04/2022, 11:34 AM
    Hi Airbyte team, I have a problem. Source: Postgres. Destination: Google Cloud Storage (GCS). 1. I created one sync job. 2. I ran the job. 3. I checked the files in Google Cloud Storage (GCS). 4. Columns are missing. 5. I checked the data at the source: the columns are missing because their data is all NULL. Can you suggest a way to prevent missing columns?
  • Vikram Kumar
    02/04/2022, 11:42 AM
    Hello. We are using Airbyte to replicate Aurora to Postgres RDS, with about 2T rows and a 2TB database. We have barely processed 200M rows in 24 hours; at this rate we are looking at 10-15 days for the initial sync. Is there any way the initial full refresh can be accelerated?
  • Justin Cole
    02/04/2022, 1:38 PM
    Good morning! Is there a way to get the output prefix to be applied to the filename rather than a containing folder?
  • Олег Томарович
    02/04/2022, 2:59 PM
    Hey guys! How can I do a non-dbt transformation using Airbyte? E.g. we have a Google Sheets source which has only one sheet with one column: URLs. This column is populated with some URLs, and some of them return 404. So our case is: take the URL data from the source, send a request to every URL using Python, and populate a second column, "Response code", in the Google BQ destination.
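    A sketch of that flow outside of dbt, assuming the sheet has already been synced into a raw BigQuery table: read the URL column, request each URL with Python, and load a (url, response_code) table back into BigQuery. The project, dataset, and table names are placeholders:
    Copy code
    import requests
    import pandas as pd
    from google.cloud import bigquery

    client = bigquery.Client(project="my-project")  # placeholder project
    urls = [row.url for row in client.query("SELECT url FROM raw.sheet_urls").result()]

    def status_of(url: str) -> int:
        try:
            return requests.get(url, timeout=10).status_code
        except requests.RequestException:
            return -1  # unreachable URL, DNS failure, etc.

    result = pd.DataFrame({"url": urls, "response_code": [status_of(u) for u in urls]})
    client.load_table_from_dataframe(result, "my-project.marts.sheet_urls_checked").result()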
  • Lukas Novotny
    02/04/2022, 4:02 PM
    Hello 👋, we're deploying Airbyte open source on Kubernetes using the provided helm charts. We're running on dynamic nodes, which may cause downtime if we run only one replica of the deployments. However, simply bumping up the replicas may cause race conditions if the infrastructure is not ready for it. My question is: which of these deployments may run multiple replicas in parallel - scheduler, server, temporal, webapp, worker? Thanks
  • Alexander Uryumtsev
    02/06/2022, 4:43 PM
    Hi, all. I'm trying to set up an MSSQL 2006 database as a source, and I'm getting the error
    The server selected protocol version TLS10 is not accepted by client preferences [TLS13, TLS12]
    I saw the answer from @Noah Kawasaki (https://airbytehq.slack.com/archives/C01VDDEGL7M/p1641908509097000?thread_ts=1641875909.083000&cid=C01VDDEGL7M), and I'm wondering how to follow the second option that he proposed:
     The other thing it could be is the JDK Airbyte is running with not allowing TLS 1.0 (it was turned off by default in JDK 11) and there is a JVM argument you can change to re-enable it.
    Can anyone explain how to change the JVM argument mentioned in the quote?
  • Anand
    02/06/2022, 7:01 PM
    Hi, a newbie question here: I'm trying to get Incremental + Dedup sync from HubSpot to Postgres. When a new record comes in, or there are other updates to an existing record, the sync happens properly; however, I read that records deleted at the source do not get updated at the destination (I guess it's a limitation). Is there any alternative to get the records cleaned up at the destination with the incremental option?
  • gunu
    02/07/2022, 7:20 AM
    CDC incremental dedupe question here
    Copy code
    2 connections:
    - MySQL --> S3 (Incremental Append)
    - S3 --> Snowflake (Incremental Dedupe)
    Once the data is written to S3, it now contains the additional metadata
    Copy code
    {
      "_airbyte_ab_id": "b00c41e6-8a2f-4ed7-a10f-123",
      "_airbyte_emitted_at": 1644216617782,
      "_ab_cdc_log_pos": 123,
      "_ab_cdc_log_file": "mysql-bin-changelog.123",
      "_ab_cdc_updated_at": "2022-02-07T06:04:14Z"
    }
    When configuring the S3 --> Snowflake connection, the cursor field is source-defined, but for the primary key, can I now use one of the metadata columns, e.g. _airbyte_ab_id, or do I still need to use the primary keys that were defined in the original MySQL table?
  • Tyler Buth
    02/07/2022, 5:56 PM
    Looking at standard replication on MySQL, the docs say it “will not be able to represent deletions incrementally”. Does that mean that on tables using incremental sync methods it won’t process deletions? Also, what about updates?
  • Tyler Buth
    02/07/2022, 6:02 PM
    Also, does standard replication take table changes into account? I’d assume CDC does since it uses binlogs
  • Arvi
    02/08/2022, 5:38 AM
    Hi Airbyters (if that's a thing)! Quick question: is there a way to copy existing sources, targets, and connections so we don't have to fill in repetitive information and only edit the parts we need?
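    A rough sketch of "copying" an existing source through the configuration API: fetch its configuration, change only the fields that differ, and create a new source from it. The source id and the edited field are placeholders, and the endpoints/field names follow the public configuration API:
    Copy code
    import requests

    AIRBYTE_URL = "http://localhost:8000"                        # placeholder deployment URL
    EXISTING_SOURCE_ID = "00000000-0000-0000-0000-000000000000"  # placeholder source id

    src = requests.post(f"{AIRBYTE_URL}/api/v1/sources/get",
                        json={"sourceId": EXISTING_SOURCE_ID}).json()

    new_config = dict(src["connectionConfiguration"])
    new_config["database"] = "other_db"  # placeholder: edit only what differs

    resp = requests.post(
        f"{AIRBYTE_URL}/api/v1/sources/create",
        json={
            "workspaceId": src["workspaceId"],
            "sourceDefinitionId": src["sourceDefinitionId"],
            "connectionConfiguration": new_config,
            "name": src["name"] + " (copy)",
        },
    )
    resp.raise_for_status()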
  • Oluwapelumi Adeosun
    02/08/2022, 7:43 AM
    Hello. I'm trying to extract and load some data from a Postgres DB to BigQuery. It synced successfully, but no records are added to the destination dataset.
  • Ram
    02/08/2022, 8:50 AM
    Hi team, I see a lot of source & destination connectors, but on the documentation page I could not find the actual Git repo of the sources. For example, if a source is Redshift, where can I find the Git repo of the Redshift source connector?
  • Daniel Eduardo Portugal Revilla
    02/08/2022, 12:52 PM
    Hello! I was trying to connect to MongoDB, but I need to use an SSH connection because I use a .pem file, and the MongoDB connector does not have this option. Is that right?
  • Daniel Eduardo Portugal Revilla
    02/08/2022, 2:15 PM
    What is the current version of Airbyte?
  • Elliot Trabac
    02/08/2022, 9:30 PM
    Hey there! I’m running Airbyte on GCP and I’m wondering how far a single VM can scale; I have no idea about the volume of resources needed by the application. I’ll have 4-5 connections that extract 4-5M rows per week in total. What is your recommendation?
  • Pedro Machado
    02/08/2022, 11:49 PM
    Hi everyone. Has anyone used the Redshift source to sync data to s3 or Snowflake? We are discussing a migration of an expensive Redshift cluster to Snowflake. The ideal initial step is to replicate the data without logic changes so we can move the BI load from Redshift to Snowflake with minimal refactoring. Once the data is replicated to Snowflake and Looker is not querying Redshift anymore, we can scale down Redshift and work on refactoring to eventually discontinue Redshift. Looking at the docs, it looks like the Redshift source does not support incremental syncs, but apparently it's planned. Is there a target date? Even then, would the source be suitable for syncing a lot of data (many GB per day) or is it designed to work with small to medium size tables?