# advice-data-architecture
  • xiang chen

    08/07/2022, 8:22 AM
    Hello everyone, I am a new user of Airbyte and would like to run it on k8s. I recently read an article about the k8s deployment (https://airbyte.com/blog/scaling-data-pipelines-kubernetes), and at the end of the article it seems that Airbyte is working on a V2 refactoring of the k8s deployment. Can anyone tell me where I can follow the progress of V2? I am interested in this and would like to help.
  • Zach Brak

    08/09/2022, 3:29 PM
    Hey all, I wanted to make my case for prioritizing GCS as the primary load destination while working in GCP, and to promote my issue asking for hive partitioning to be available as a path option on the GCS destination. Previously, we had been loading via the BigQuery Denormalized connector, as we were happy receiving raw JSON data in its source schema format. However, a build came along that was injecting `big_query_array` values into the schema, thereby corrupting the source schema.
    • There is a pending fix for this, but it has stalled because the schema change can impact existing running connections and require data resets.
    • The fix also only removes `big_query_array` schema values from top-level objects, not from those nested lower down.
    Looking at what the bigquery-denormalized connector with GCS staging actually does:
    • Read data from the application
    • Load to GCS as Avro
    • Upload the Avro to BigQuery
    This leads me to ask: why rely on the Avro interpretation of the source schema if I feel it isn't aligning with my source schema? Our newer approach:
    • Use the GCS destination instead of the bigquery-denormalized destination, loading JSON files directly to a GCS bucket.
    • Create external table definitions in BigQuery that read from the GCS bucket.
    ◦ The table options then let you interpret or declare a schema.
    ◦ You can also set `ignore_unknown_values` to `true`, allowing reads across changing schemas.
    ◦ Multiple table definitions can be created over the same source data to serve different purposes.
    ◦ This should allow more granular control over managing historical data.
    The main challenge with this newer approach is being selective about which parts of the GCS bucket you query, so that reads over the objects stay efficient. To solve this I have been rapidly creating and destroying external definitions that look only at the last day or few days of data. This is why I would love to have the hive partition spec as an option for the upload file path: it would solve reading only portions of the bucket, and would effectively provide a partition filter for the entire collection. I appreciate any comments, questions, or suggestions here. I'll say we've had decent success in the past week, and fewer load failures overall, taking this approach.
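    For reference, a sketch of the external-table setup described above, assuming the JSON files already sit under hive-style key=value paths (which is what the feature request would make the GCS destination produce); the project, dataset, and bucket names are placeholders:
    bq query --use_legacy_sql=false '
    CREATE EXTERNAL TABLE `my-project.analytics.events_ext`
    WITH PARTITION COLUMNS              -- infer the hive partition columns
    OPTIONS (
      format = "NEWLINE_DELIMITED_JSON",
      uris = ["gs://my-bucket/airbyte/events/*"],
      hive_partition_uri_prefix = "gs://my-bucket/airbyte/events",
      ignore_unknown_values = true      -- keep reading across changing schemas
    )'
    Partition pruning on the inferred columns would then replace the create-and-destroy dance over external definitions.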
  • Don H

    08/11/2022, 5:41 PM
    Hello, I was wondering if and when there will be support for deploying Airbyte open source on AWS ECS. If ECS will not be supported anytime soon, what is the recommended way to host Airbyte on AWS: EC2 or K8s? Thanks in advance.
  • Jordan Fox

    08/16/2022, 5:05 PM
    Anyone using Airbyte behind a router with an NGINX reverse proxy and port forwarding? I've got it set up so public_ip_address:200 is forwarded to the NGINX box, NGINX forwards to private_ip_address:8000, and Airbyte is running smoothly on private_ip_address.
    • When I'm on a network box and I type private_ip_address:8000, it works.
    • When I'm on the Airbyte box and type localhost:8000, it works.
    • When I'm on a network box and I type nginx_private_ip_address:200, it works.
    • When I type public_ip_address:200, I get the banner in the tab and the sidebar showing source, destination, and connection, but it just spins and eventually goes to "something went wrong, maybe the server isn't up".
    Any thoughts? The config for the forward is:
    server {
        listen 200;
        server_name _;
        location / {
            proxy_pass http://{private ip address}:8000;
            proxy_http_version 1.1;
            proxy_set_header Upgrade $http_upgrade;
            proxy_set_header Connection keep-alive;
            proxy_set_header Host $host;
            proxy_cache_bypass $http_upgrade;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-Forwarded-Proto $scheme;
        }
    }
    Does Airbyte require any specific headers or additional config? I've looked at https://shadabshaukat.medium.com/deploy-and-secure-airbyte-with-nginx-reverse-proxy-basic-authentication-lets-encrypt-ssl-72bee223a4d9 but I'm not at the step of setting up HTTPS yet, and my config looks similar.
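    One detail worth checking in the config above: it forwards an Upgrade header but pins Connection to keep-alive. If any part of the UI relies on WebSockets, the usual nginx pattern (from the general nginx proxying docs, not Airbyte-specific) is:
    # at the http {} level
    map $http_upgrade $connection_upgrade {
        default upgrade;
        ''      close;
    }
    # then inside the location block:
    #     proxy_set_header Connection $connection_upgrade;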
  • kylashpriya NA

    08/18/2022, 8:56 AM
    Hello Team, upon setting up self-hosted Airbyte locally, our infra team wrote this comment to us: "While installing Airbyte I found that Airbyte only has versions marked as `alpha`. From this standpoint I absolutely cannot recommend using an alpha version in production environments. Was it carefully evaluated, tested, and decided to really use it for a production environment? Here you can find what an alpha version is and what the risks are: Software release life cycle." Could someone help us with the above? Is it still in the alpha phase, or should we try a "stable" release? We followed the setup documentation page at https://docs.airbyte.com/quickstart/deploy-airbyte/?_ga=2.89522395.1160840054.1659428690-1169156739.1659428688 Thanks in advance!
  • Lenin Mishra

    08/19/2022, 12:26 PM
    Hey folks, Does Airbyte have a connector card feature like Fivetran? https://www.fivetran.com/blog/powered-by-fivetran-connect-cards
  • Ramesh Shanmugam

    08/23/2022, 10:23 PM
    Does Airbyte release binary packages (like an application.tar)? The GitHub release tag archive contains only source code.
  • Ignacio Reyna

    08/24/2022, 12:28 AM
    Hey! Not sure if this is the right channel, but I'm trying to install the latest tag (which currently is 0.40.1) and I'm getting an error about a missing variable:
    /docker-entrypoint.sh: /docker-entrypoint.d/ is not empty, will attempt to perform configuration
    /docker-entrypoint.sh: Looking for shell scripts in /docker-entrypoint.d/
    /docker-entrypoint.sh: Launching /docker-entrypoint.d/10-listen-on-ipv6-by-default.sh
    10-listen-on-ipv6-by-default.sh: info: IPv6 listen already enabled
    /docker-entrypoint.sh: Launching /docker-entrypoint.d/20-envsubst-on-templates.sh
    20-envsubst-on-templates.sh: Running envsubst on /etc/nginx/templates/default.conf.template to /etc/nginx/conf.d/default.conf
    /docker-entrypoint.sh: Launching /docker-entrypoint.d/30-tune-worker-processes.sh
    /docker-entrypoint.sh: Configuration complete; ready for start up
    2022/08/24 00:22:26 [emerg] 1#1: unknown "is_demo" variable
    nginx: [emerg] unknown "is_demo" variable
    I wanted to take a look at the source code, so I went to the repo and saw that the last commit removed demo mode from the UI. Could that commit be related to my error?
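    If the removed demo mode is indeed the culprit, a version mismatch between the webapp image and the checked-out templates would explain the unknown variable. A hedged workaround is to pin every image to the tag being installed instead of latest, e.g. for the docker-compose setup:
    git checkout v0.40.1      # the tag being installed
    grep ^VERSION .env        # the repo's .env pins VERSION=0.40.1 for compose
    docker-compose up -d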
  • Daniel Meyer

    08/24/2022, 7:11 AM
    Is it possible to configure Airbyte to use an external instance of Temporal? I already have Temporal up in production and would love to configure Airbyte to use that rather than running two instances of Temporal. 🙂
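    One possibly relevant knob, sketched under the assumption that the docker-compose setup is in use: the Temporal address is wired through an environment variable, so pointing it at an existing cluster may be a matter of overriding it (the host below is a placeholder):
    # in .env (or the container environment) of the Airbyte deployment
    TEMPORAL_HOST=temporal.internal.example.com:7233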
  • Ramon Vermeulen

    08/24/2022, 7:47 AM
    Hi, I don't really know where is a good place to ask this, but the airbyte/scheduler image on Docker Hub hasn't been updated in quite some time (https://hub.docker.com/r/airbyte/scheduler/tags?page=1), while all the other images have been upgraded to the latest version (0.40.2). Is there a specific reason for this? For instance, is the scheduler container no longer needed in the deployment?
  • Gopal Chand

    08/24/2022, 10:50 AM
    Dear Team, I installed Airbyte recently, but it stops responding after some time, every time. Please assist; it is hosted on AWS with 2 vCPUs and 4 GB RAM.
  • Rolex

    08/25/2022, 1:20 PM
    For MySQL to BigQuery, why can't updates and deletes be synced incrementally?
  • Rolex

    08/25/2022, 1:21 PM
    For MySQL to BigQuery, how does incremental sync work in principle?
  • Andre Gallo

    08/27/2022, 12:23 AM
    Not intending to spam all channels, but perhaps this is where I should've asked in the first place? Thanks for any pointers!
  • Federico Cipriani Corvalan

    08/31/2022, 2:20 AM
    Is there any built-in way of specifying the temporal and temporal_visibility databases to be something other than the defaults?
  • Federico Cipriani Corvalan

    09/02/2022, 2:12 AM
    The latest Airbyte release introduced a new environment variable called `NORMALIZATION_JOB_MAIN_CONTAINER_MEMORY_REQUEST`. Is there anything about it documented anywhere?
  • Craig Condie

    09/12/2022, 4:56 PM
    Got a question; I'm not sure of the best way to do this. We've got a Postgres database that has hundreds of tables, and that same database schema is copied a hundred times with just a slightly different host (for different clients). Anyone have a good way to replicate all the data into a data warehouse?
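    One possible angle, sketched below: drive source creation through Airbyte's API so the hundred near-identical hosts come from a loop rather than manual setup. $SOURCE_DEF_ID, $WORKSPACE_ID, hosts.txt, and the connection fields are all placeholders:
    # create one Postgres source per client host via the Airbyte API
    while read -r host; do
      curl -s -X POST http://localhost:8000/api/v1/sources/create \
        -H 'Content-Type: application/json' \
        -d "{
          \"sourceDefinitionId\": \"$SOURCE_DEF_ID\",
          \"workspaceId\": \"$WORKSPACE_ID\",
          \"name\": \"postgres-$host\",
          \"connectionConfiguration\": {
            \"host\": \"$host\", \"port\": 5432, \"database\": \"app\",
            \"username\": \"airbyte\", \"password\": \"...\", \"ssl\": false
          }
        }"
    done < hosts.txt
    The same pattern extends to /api/v1/connections/create, so each new client database becomes one more line in hosts.txt.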
  • Hakeem Olu

    09/14/2022, 4:58 PM
    Hi guys, just curious: during normalization, `*_AIRBYTE_RAW` tables always get created for each table. What is their purpose, and is it possible to hide them from SHOW TABLES?
  • Dominik Mall

    09/15/2022, 12:02 PM
    Update: looks like the repo was fixed, but it would still be nice to know the answer to the original question. -- Hi, I've been trying to set up a k8s deployment with helm; when I look at `helm search repo <name>`, it's missing the `<name>/airbyte` chart. The GitHub repo seems to have been updated ~2 hours ago, which I guess broke something. Is there a way I can use the previous version when doing `helm repo add …`?
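    If the repo breaks again, a chart version can be pinned at install time rather than at helm repo add (a sketch; <chart-version> is a placeholder to fill from the search output):
    helm repo add airbyte https://airbytehq.github.io/helm-charts
    helm repo update
    helm search repo airbyte/airbyte --versions    # list published chart versions
    helm install airbyte airbyte/airbyte --version <chart-version>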
  • Pedro Manuel

    09/15/2022, 1:24 PM
    Hi guys. I don't know if this is the correct channel to talk about this, but here it goes: is Airbyte FedRAMP compliant?
  • Ihor Konovalenko

    09/15/2022, 1:48 PM
    Hi guys. We successfully deployed Airbyte on AWS Elastic Kubernetes Service, following the instructions in the docs. But the docs use the `eksctl` command-line utility. Does a `Terraform` module (or just a script) exist that creates everything needed to deploy Airbyte on EKS?
  • Andrii Zelinskyi

    09/16/2022, 1:00 PM
    Hi guys. I'm currently using the `/v1/sources/get` Airbyte API call and trying to get the values in `connectionConfiguration` that are marked in spec.json as `"airbyte_secret": true`. Is there any way to get the value of an `airbyte_secret` property via the API instead of `'**********'`? Or do I have to update all source connectors by removing `"airbyte_secret": true`?
  • Don H

    09/16/2022, 3:15 PM
    Hello, I am trying to start Airbyte in Kubernetes via Helm. When I run
    helm install <name> airbyte/airbyte
    it starts all the services, but the server fails to start with the following message.
    2022-09-16 15:05:10 ERROR i.a.w.WorkerApp(main):592 - Worker app failed
    java.lang.IllegalArgumentException: 'INTERNAL_API_HOST' environment variable cannot be null
    	at com.google.common.base.Preconditions.checkArgument(Preconditions.java:220) ~[guava-31.0.1-jre.jar:?]
    	at io.airbyte.config.EnvConfigs.getEnsureEnv(EnvConfigs.java:1107) ~[io.airbyte.airbyte-config-config-models-0.40.6.jar:?]
    	at io.airbyte.config.EnvConfigs.getAirbyteApiHost(EnvConfigs.java:490) ~[io.airbyte.airbyte-config-config-models-0.40.6.jar:?]
    	at io.airbyte.workers.WorkerApiClientFactoryImpl.<init>(WorkerApiClientFactoryImpl.java:35) ~[io.airbyte-airbyte-workers-0.40.6.jar:?]
    	at io.airbyte.workers.WorkerApp.initializeCommonDependencies(WorkerApp.java:442) ~[io.airbyte-airbyte-workers-0.40.6.jar:?]
    	at io.airbyte.workers.WorkerApp.main(WorkerApp.java:578) [io.airbyte-airbyte-workers-0.40.6.jar:?]
    I can see that in airbyte/charts/airbyte/templates/env-configmap.yaml it is supposed to be set with the following value:
    INTERNAL_API_HOST: {{ .Release.Name }}-server-svc:{{ .Values.server.service.port }}
    Any insight into why I am seeing this issue and how to proceed? Thanks
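    A hedged first debugging step, assuming the chart is otherwise healthy: render the templates locally and check whether the variable actually lands in the env configmap the server consumes (the configmap name below may differ between chart versions):
    helm template <name> airbyte/airbyte | grep -B2 -A2 INTERNAL_API_HOST
    kubectl get configmaps                   # look for a *-airbyte-env configmap
    kubectl get configmap <name>-airbyte-env -o yaml | grep INTERNAL_API_HOST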
  • Zaza Javakhishvili

    09/17/2022, 12:48 AM
    Hi, the denormalized BigQuery destination connector has issues: https://github.com/airbytehq/airbyte/issues/16841
  • Albert Lie

    09/18/2022, 2:03 AM
    Does anyone have a recommendation for getting multiple streams with the same format ingested into one destination? https://github.com/airbytehq/airbyte/issues/2224
  • Jovan Sakovic

    09/21/2022, 6:55 PM
    Hi all 👋 Not an Airbyte question, more of a data-engineering ask that I'd appreciate any thoughts and input on 🤔
    I'm working on replicating a set of MySQL tables into Snowflake. They are not huge, but there is a limitation in our current approach, which is AWS Data Pipelines into S3, and then Snowpipes to get it from S3 into Snowflake ❄️
    There are a couple of constraints/requirements:
    • We are not using Airbyte (or similar), due to SRE pushing back against us, the data team, pulling data from MySQL and thus having access to the database(s). Instead, the preferred approach would be something that they could implement and maintain, where they'd be pushing the data we need into a destination we can access (or even manage).
    • There are multiple of these MySQL databases, all with the same tables and schemas. So it would have to be something that the SRE team could very easily "copy-paste" across all databases, and implement for any new ones that may arise.
    • It must allow incremental updates of all tables (see the limitation of the current approach).
    • Preferably managed by Terraform.
    • If possible, leveraging AWS services as much as possible.
    Current approach & limitations: as mentioned, we wanted to make things work with AWS Data Pipelines (for each table, across all databases). It would dump all the tables into S3, with a nested file structure, on top of which we have a Snowpipe for each table. It's managed with Terraform, and could relatively easily be applied to additional databases/tables/fields/… However, the only way it can do incremental updates is with a `last_modified_at` type of column, using the start time of the pipeline at runtime. Unfortunately, not all of our tables have a timestamp column that would allow this… Note that these tables do have an auto-increment `id` field, so technically a form of incremental extract is possible if we saved that state somewhere. 💡 I believe we bounced around ideas about being hacky with the AWS Data Pipelines by fetching the last loaded id of the table from DynamoDB (or anything else for that matter) and injecting it into the Data Pipeline (DP) parameter value, but that's out of bounds of what the DP service can do 🙄 (a sketch of this idea follows below).
    Bottom line Q: ideas for dumping MySQL data into S3, reproducible across multiple DBs, and allowing for incremental updates. If you've read this far, thanks a bunch; I appreciate your time! ♥️ Here for any additional questions, context or feedback 🙏
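    For what it's worth, the "save the last loaded id somewhere" idea can be sketched with nothing but the mysql client and the AWS CLI; the table, bucket, and state-store names are placeholders, and the id is assumed to be the first column:
    # fetch the last id we loaded (seeds to 0 on the first run)
    LAST_ID=$(aws dynamodb get-item --table-name sync_state \
      --key '{"tbl": {"S": "orders"}}' \
      --query 'Item.last_id.N' --output text 2>/dev/null)
    [ -z "$LAST_ID" ] || [ "$LAST_ID" = "None" ] && LAST_ID=0
    # pull only the new rows
    mysql -h "$DB_HOST" -u "$DB_USER" -p"$DB_PASS" appdb --batch --skip-column-names \
      -e "SELECT * FROM orders WHERE id > $LAST_ID ORDER BY id" > orders.tsv
    if [ -s orders.tsv ]; then
      # ship to S3 (Snowpipe picks it up from here) and advance the bookmark
      aws s3 cp orders.tsv "s3://my-bucket/appdb/orders/dt=$(date +%F)/orders.tsv"
      NEW_LAST_ID=$(tail -n1 orders.tsv | cut -f1)
      aws dynamodb put-item --table-name sync_state \
        --item "{\"tbl\": {\"S\": \"orders\"}, \"last_id\": {\"N\": \"$NEW_LAST_ID\"}}"
    fi
    Since it is plain shell plus AWS services, the SRE team could run it from cron, ECS scheduled tasks, or a Lambda, and Terraform can template it per database.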
  • Göktuğ Aşcı

    09/28/2022, 5:06 PM
    Hi everybody, I am new to Airbyte. Is there a built-in authentication system that comes with the default docker images?
  • terekete

    09/29/2022, 1:17 PM
    Hi Airbyte team, I am spinning up Airbyte via podman-compose and the error below is coming up. Any ideas why? The host system is CentOS. Thanks.
    ERRO[0000] json-file logging specified but not supported. Choosing k8s-file logging instead
    podman start -a airbyte-db
    2022-09-29 05:45:50 INFO i.a.c.EnvConfigs(getEnvOrDefault):1096 - Using default value for environment variable SHOULD_RUN_SYNC_WORKFLOWS: 'true'
    2022-09-29 05:45:50 INFO i.a.c.EnvConfigs(getEnvOrDefault):1096 - Using default value for environment variable WORKER_PLANE: 'CONTROL_PLANE'
    2022-09-29 05:45:50 INFO c.z.h.HikariDataSource(<init>):80 - HikariPool-1 - Starting...
    Exception in thread "main" java.lang.RuntimeException: Driver org.postgresql.Driver claims to not accept jdbcUrl, ${CONFIG_DATABASE_URL:-}
    at com.zaxxer.hikari.util.DriverDataSource.<init>(DriverDataSource.java:110)
    at com.zaxxer.hikari.pool.PoolBase.initializeDataSource(PoolBase.java:326)
    at com.zaxxer.hikari.pool.PoolBase.<init>(PoolBase.java:112)
    at com.zaxxer.hikari.pool.HikariPool.<init>(HikariPool.java:93)
    at com.zaxxer.hikari.HikariDataSource.<init>(HikariDataSource.java:81)
    at io.airbyte.db.factory.DataSourceFactory$DataSourceBuilder.build(DataSourceFactory.java:304)
    at io.airbyte.db.factory.DataSourceFactory.create(DataSourceFactory.java:40)
    at io.airbyte.bootloader.BootloaderApp.main(BootloaderApp.java:224)
    exit code: 1
    ERRO[0000] json-file logging specified but not supported. Choosing k8s-file logging instead
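    The unexpanded ${CONFIG_DATABASE_URL:-} in the log suggests the compose file's variable substitution never happened, which podman-compose handles differently from Docker Compose in some versions. A hedged thing to try is exporting the repo's defaults explicitly before starting:
    set -a            # auto-export everything sourced below
    source .env       # Airbyte's repo ships its defaults in .env
    set +a
    podman-compose up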
  • Don H

    09/29/2022, 7:56 PM
    Hey, I am new to Kubernetes and I am having some difficulty deploying Airbyte and exposing the webapp to applications on the same network. How do you recommend I expose the webapp when deploying Airbyte with K8s? I was able to change the webapp service from ClusterIP to NodePort by installing it with the following command:
    helm install airbyte-helm airbyte/airbyte --set webapp.service.type=NodePort
    This "works", but it requires me to keep track of the node's DNS name, and if that node is replaced by another in the cluster, I don't think it would work anymore.
    curl --location --request POST 'ip-xx-x-x-xx.ec2.internal:30334/api/v1/workspaces/list'
    I also tried using an ingress to see what the results would be:
    helm install --values ../airbyte/charts/airbyte/test-values.yaml airbyte-helm airbyte/airbyte
    where test-values.yaml looked like this:
    webapp:
      ingress:
        enabled: true
        className: ""
        annotations: {}
        hosts:
         - host: chart-example.local
           paths:
           - path: /
             pathType: ImplementationSpecific
    
        tls: []
    However, I received the following error (looks like an issue in ingress.yaml):
    Error: INSTALLATION FAILED: template: airbyte/charts/webapp/templates/ingress.yaml:53:33: executing "airbyte/charts/webapp/templates/ingress.yaml" at <.Release.Name>: nil pointer evaluating interface {}.Name
    I will be deploying this cluster via AWS CDK and need to know how to reach the webapp at the time of deployment. This is a requirement, since I will pass the hostname down to other CDK stacks to reference. Using NodePort will work because I can get the node's address from the cluster, but my concern about the nodes changing in the future still exists. If I use a load balancer, I will need to know the hostname before it is deployed. The service type LoadBalancer may work, but it generates the hostname when helm runs, and I would need to query the cluster to figure it out. How do you recommend I expose that service? Thanks in advance; I know there is a lot to that question.
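    A hedged sketch of the LoadBalancer route, which at least removes the dependency on a particular node: install with the service type overridden, then read the generated hostname back from the cluster (the service name varies, so list the services first):
    helm install airbyte-helm airbyte/airbyte --set webapp.service.type=LoadBalancer
    kubectl get svc                            # find the webapp service's name
    kubectl get svc <webapp-svc-name> \
      -o jsonpath='{.status.loadBalancer.ingress[0].hostname}'
    For a hostname known before deployment, one option is to point a fixed DNS record (e.g. Route 53, which CDK can manage) at that generated address and pass the fixed name to the other stacks.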
  • tanuj soni

    09/30/2022, 9:26 AM
    Hi Airbyte team, could anyone please guide me on how to remove the MinIO service from our Airbyte deployment and set the state storage location to GCS?
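    For reference, a hedged sketch using the state-storage variables from Airbyte's environment configuration (check your version's docs for the exact names; the bucket and credentials path are placeholders):
    WORKER_STATE_STORAGE_TYPE=GCS
    STATE_STORAGE_GCS_BUCKET_NAME=my-airbyte-state
    STATE_STORAGE_GCS_APPLICATION_CREDENTIALS=/secrets/gcs-creds.json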