# advice-data-warehouses

    Simon Späti

    03/30/2022, 7:34 AM
    Hi there, which is the best Cloud Data Warehouse these days? Snowflake, Google BigQuery, Amazon Redshift, Azure Synapse, Firebolt or any others?
    ❄️ 7
    firebolt 2
    👍 2

    Chase Roberts

    03/31/2022, 8:06 PM
While the cloud data warehouse is great b/c it’s fully managed, it still struggles for ML/AI/data science/event streaming workloads and gets crazy expensive at scale. Layer on the adjacent tools like ELT, observability, rETL, dbt, BI, etc — all of which push compute into the warehouse — and the costs are skyrocketing. Conversely, the “data lakehouse” expands the aperture of possible use cases beyond analytics, but it’s still very “DIY” (sorting & clustering, space reclamation, file sizing, CDC, log events ingest, etc). Databricks seems well poised here, but it’s still not great for SQL workloads. Regardless of how this space unfolds, it still seems like a safe bet that Databricks will build an empire. It seems like the best option would be to store anywhere and toggle the processing engine according to what makes the most sense for the use case. If there was a non-DIY way to do this, that would be 🔥 . What does everyone else think?

    William Phillips

    04/06/2022, 2:17 AM
My team is thinking about using Redshift for our DW. Is it smart to use RA3 nodes? I like the cross-database query option that comes with them. Does it make sense to create separate databases, such as raw and analytics, since you can query across them now? How is the performance of cross-database querying?
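For reference, RA3 cross-database queries address tables with three-part `database.schema.table` identifiers, so a raw/analytics split stays queryable from one connection. A minimal sketch of composing such a query; the database, schema, and table names here are hypothetical examples, not a recommendation:

```python
# Sketch of the raw/analytics cross-database layout being discussed.
# All database/schema/table names are hypothetical.
def qualify(database: str, schema: str, table: str) -> str:
    """Build the three-part identifier Redshift cross-database queries use."""
    return f"{database}.{schema}.{table}"

# An analytics query that joins raw-layer data living in another database:
query = (
    f"SELECT o.order_id, c.segment\n"
    f"FROM {qualify('raw', 'public', 'orders')} o\n"
    f"JOIN {qualify('analytics', 'marts', 'customer_segments')} c\n"
    f"  ON o.customer_id = c.customer_id;"
)
```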

    Nicolas M

    04/14/2022, 5:25 PM
Hi all, any general advice (max start date, block size) on connection settings for Google Ads -> Google Cloud Storage? The connection is confirmed but fails during synchronization.

    Alexander Butler

    04/20/2022, 1:59 AM
    All of my connectors to BigQuery are now failing with the same error. Our mission critical pipelines are affected. Anyone else using BigQuery destination seeing this?
    errors: $.credential: is not defined in the schema and the schema does not allow additional properties, $.part_size_mb: is not defined in the schema and the schema does not allow additional properties, $.gcs_bucket_name: is not defined in the schema and the schema does not allow additional properties, $.gcs_bucket_path: is not defined in the schema and the schema does not allow additional properties, $.keep_files_in_gcs-bucket: is not defined in the schema and the schema does not allow additional properties, $.method: must be a constant value Standard
    This seems very bad 😕

    Raju K

    04/22/2022, 10:30 PM
Hello team, can somebody help with a bulk data load?
    ✅ 1

    Mario Burian

    04/29/2022, 7:11 PM
Hello, we have an error while syncing from Postgres to Postgres:
    at io.temporal.internal.activity.POJOActivityTaskHandler$POJOActivityInboundCallsInterceptor.execute(POJOActivityTaskHandler.java:214) ~[temporal-sdk-1.8.1.jar:?]
    	at io.temporal.internal.activity.POJOActivityTaskHandler$POJOActivityImplementation.execute(POJOActivityTaskHandler.java:180) ~[temporal-sdk-1.8.1.jar:?]
    	at io.temporal.internal.activity.POJOActivityTaskHandler.handle(POJOActivityTaskHandler.java:120) ~[temporal-sdk-1.8.1.jar:?]
    	at io.temporal.internal.worker.ActivityWorker$TaskHandlerImpl.handle(ActivityWorker.java:204) ~[temporal-sdk-1.8.1.jar:?]
    	at io.temporal.internal.worker.ActivityWorker$TaskHandlerImpl.handle(ActivityWorker.java:164) ~[temporal-sdk-1.8.1.jar:?]
    	at io.temporal.internal.worker.PollTaskExecutor.lambda$process$0(PollTaskExecutor.java:93) ~[temporal-sdk-1.8.1.jar:?]
    	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
    	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
    	at java.lang.Thread.run(Thread.java:833) [?:?]
    Caused by: java.util.concurrent.CancellationException
    	at java.util.concurrent.CompletableFuture.cancel(CompletableFuture.java:2478) ~[?:?]
    	at io.airbyte.workers.temporal.TemporalAttemptExecution.lambda$getCancellationChecker$3(TemporalAttemptExecution.java:201) ~[io.airbyte-airbyte-workers-0.36.3-alpha.jar:?]
    	at io.airbyte.workers.temporal.CancellationHandler$TemporalCancellationHandler.checkAndHandleCancellation(CancellationHandler.java:53) ~[io.airbyte-airbyte-workers-0.36.3-alpha.jar:?]
    	at io.airbyte.workers.temporal.TemporalAttemptExecution.lambda$getCancellationChecker$4(TemporalAttemptExecution.java:204) ~[io.airbyte-airbyte-workers-0.36.3-alpha.jar:?]
    	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539) ~[?:?]
    	at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305) ~[?:?]
    	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305) ~[?:?]
    	
    	at sun.nio.cs.StreamEncoder.flush(StreamEncoder.java:160) ~[?:?]
    		at java.io.OutputStreamWriter.flush(OutputStreamWriter.java:248) ~[?:?]
    		at java.io.BufferedWriter.flush(BufferedWriter.java:257) ~[?:?]
    		at io.airbyte.workers.protocols.airbyte.DefaultAirbyteDestination.notifyEndOfStream(DefaultAirbyteDestination.java:98) ~[io.airbyte-airbyte-workers-0.36.3-alpha.jar:?]
    		at io.airbyte.workers.protocols.airbyte.DefaultAirbyteDestination.close(DefaultAirbyteDestination.java:111) ~[io.airbyte-airbyte-workers-0.36.3-alpha.jar:?]
    		at io.airbyte.workers.DefaultReplicationWorker.run(DefaultReplicationWorker.java:126) ~[io.airbyte-airbyte-workers-0.36.3-alpha.jar:?]
    		at io.airbyte.workers.DefaultReplicationWorker.run(DefaultReplicationWorker.java:57) ~[io.airbyte-airbyte-workers-0.36.3-alpha.jar:?]
    		at io.airbyte.workers.temporal.TemporalAttemptExecution.lambda$getWorkerThread$2(TemporalAttemptExecution.java:155) ~[io.airbyte-airbyte-workers-0.36.3-alpha.jar:?]
    		at java.lang.Thread.run(Thread.java:833) [?:?]

    Bikram Dhoju

    05/03/2022, 9:42 AM
Hi there, I am having an issue while syncing Postgres to BigQuery; some jobs are halting at
i.a.c.h.LogClientSingleton(setJobMdc):137 - Setting kube job mdc
Please help; what could be the possible cause of this issue?
    logs-173.txt

    Shawn Wang

    05/27/2022, 1:41 PM
    TIL of 3 approaches to DWH structure: https://towardsdatascience.com/how-should-organizations-structure-their-data-c19b66d629e
    👀 1

    Augustin Lafanechere (Airbyte)

    06/02/2022, 1:59 PM
    Hey @Lilashree Sahoo please post troubleshooting questions on our forum 🙏🏻

    Kha Nguyen

    06/07/2022, 1:58 PM
    Hi, I am looking to deploy Airbyte to my own AWS infrastructure. Currently there is only a docker-compose.yml. Is there an existing template to deploy this to ECS, or should we craft the deployment to cloud ourselves?

    Marcos Marx (Airbyte)

    06/23/2022, 6:39 PM
Hello 👋 I’m sending this message to help you identify whether this channel is the best place to post your question. Airbyte has a few channels for open discussion about data topics (architecture, ingestion, quality, etc.). In these channels you may ask general questions related to the particular topic. If you’re having a problem deploying or running a connection in Airbyte, this is not the topic. We recommend you open a Discourse topic, where our support team will help you troubleshoot your issue.
    👍 1

    Olivier AGUDO PEREZ

    06/28/2022, 9:53 AM
Hello, I am using Airbyte to replicate data from MongoDB to BigQuery. I would like to have just one final table in BQ per collection in Mongo, but I end up with the table itself (which is ok) plus a "raw" table holding each Mongo document as a big unnormalized JSON object. Is that expected? I can't find a way to disable those tables. I would like to keep "table_1" and get rid of "airbyte_raw_table_1".
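For context, the raw table is where the destination lands each record as a JSON blob, and basic normalization then expands it into the typed final table. A rough illustration of that relationship (not Airbyte's actual code; the document fields are hypothetical):

```python
import json

# Rough illustration of what basic normalization does: expand the JSON blob
# stored in an _airbyte_raw_* table into typed columns, carrying the Airbyte
# record id along. Field names here are hypothetical.
raw_row = {
    "_airbyte_ab_id": "uuid-1",
    "_airbyte_emitted_at": 1650000000000,
    "_airbyte_data": json.dumps({"_id": "abc", "name": "Ada", "score": 3}),
}

def normalize(row: dict) -> dict:
    """Flatten one raw row into a typed record, keeping the Airbyte record id."""
    record = json.loads(row["_airbyte_data"])
    record["_airbyte_ab_id"] = row["_airbyte_ab_id"]
    return record

normalized = normalize(raw_row)
```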
    👍 1

    Arkadiusz Grzedzinski

    06/30/2022, 1:08 PM
How does Airbyte write to S3? Does it write after each file is created, or does it collect all the output from the source, divide it into parts, and then write all at once? Asking because I wonder whether the logs would show any activity while running a big download.
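Object-store destinations generally sit between those two extremes: records accumulate in a buffer and a part is flushed whenever the buffer reaches a size threshold, rather than one write per record or one write at the very end. A toy in-memory sketch of that buffer-and-flush pattern (the part size is an arbitrary assumption, and this is not Airbyte's actual implementation):

```python
# Toy sketch of buffer-and-flush part writing, the general pattern
# S3-style destinations use. Part size is an arbitrary assumption.
class PartBufferWriter:
    def __init__(self, part_size_bytes: int):
        self.part_size_bytes = part_size_bytes
        self.buffer = bytearray()
        self.parts = []  # stands in for uploaded S3 parts

    def write_record(self, record: bytes) -> None:
        self.buffer.extend(record)
        if len(self.buffer) >= self.part_size_bytes:
            self.flush()

    def flush(self) -> None:
        if self.buffer:
            self.parts.append(bytes(self.buffer))  # "upload" one part
            self.buffer.clear()

writer = PartBufferWriter(part_size_bytes=10)
for i in range(8):
    writer.write_record(b"rec%d\n" % i)  # 5 bytes each -> flush every 2 records
writer.flush()  # flush any final partial part on close
```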

    Ashley Baer

    07/01/2022, 12:48 PM
    Hello all. Does Airbyte plan to expand the Databricks destination to include support for Azure Databricks? And in the meantime, is anyone aware of an existing community-developed connector that would support this?

    Sefath Chowdhury

    07/08/2022, 12:36 AM
Hello all! I want to build a data warehouse for my company.

Current situation: my current stack takes all of our separate databases (RDS Postgres) and uses AWS DMS with CDC to replicate the data into its own representative schema in one huge Postgres instance:
(ie: Microservice_1_Database | public schema ->  Huge_Postgres_Instance |  microservice_1 schema)
There are two problems with this:
1. AWS DMS is not resilient to DDL changes on the source DB.
2. A huge Postgres instance is still a Postgres instance: designed to be OLTP, not OLAP. (We wanted to use Redshift, but many existing analytics queries break. This is something we are okay with when moving to cloud-agnostic Snowflake.)

Desired situation: Airbyte -> Snowflake using AWS RDS Postgres (CDC enabled).
1. Does anyone have a stack like this that uses Airbyte to replicate to Snowflake in real time?
2. What logical decoding output plug-in are you using (hopefully you are using AWS RDS instances and that plugin is compliant), and why?
3. Did you deploy Airbyte in a pod in a k8s cluster? If so, how did you determine the specs it needed? I assume ongoing replication is a heavy lift, and I am unsure how to calculate the specs for this deployment.

    Gerard Barrull

    07/11/2022, 10:41 AM
Hey all! I'm using Snowflake as a data lake for my company and loading all the raw data coming from different sources into it with Airbyte. First question: is it ok to use Snowflake as a data lake? Second question: do you have any advice on how to structure the data (for data-lake purposes)? I was thinking of:
• An independent database for all the raw data in the "data lake": all data from Airbyte or other sources in raw format.
◦ A schema for Airbyte data; I'd have another schema for data coming from other sources.
▪︎ One table for each source (i.e. Stripe, Google Ads, etc.)?
What do you think about it? Do you have any advice on doing it differently? Is it just a matter of structural preference, or are there other pros and cons? Thanks!
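One way to make a layout like that concrete is a naming helper that maps (ingestion origin, source, stream) to a fully qualified identifier, so every raw table lands predictably. All names below are hypothetical, and this is just one possible convention, not a recommendation:

```python
# Hypothetical naming convention for a raw layer: one RAW database,
# one schema per ingestion origin + source, one table per stream.
def raw_identifier(origin: str, source: str, stream: str) -> str:
    """e.g. ('airbyte', 'stripe', 'charges') -> 'RAW.AIRBYTE_STRIPE.CHARGES'"""
    return f"RAW.{origin.upper()}_{source.upper()}.{stream.upper()}"
```

A convention like this keeps Airbyte-loaded data separated from other ingestion paths at the schema level while staying in a single raw database.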
    👀 1

    Ari Bajo (Airbyte)

    07/27/2022, 7:47 PM
Hello, how do you optimize your data warehouse costs? @Madison Mae wrote a great article featuring 6 ways to reduce Snowflake costs:
1. Change your warehouse size
2. Decrease the number of warehouses running at the same time
3. Decrease the sync frequency of your data ingestion tool
4. Decrease the warehouse’s auto-resume period
5. Use materialized views
6. Change your Snowflake account plan
I am curious to know which optimization has brought the most savings for you.
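A back-of-envelope sketch of why items 1 and 3 tend to dominate: Snowflake credit consumption doubles with each warehouse size step (XS = 1 credit/hour, S = 2, M = 4, and so on), so cost scales multiplicatively with both size and sync-driven runtime. The price per credit below ($3) depends on your account plan and is an assumption:

```python
# Documented Snowflake credit rates per warehouse size; price per credit
# ($3 here) varies by account plan and is an assumption for illustration.
CREDITS_PER_HOUR = {"XS": 1, "S": 2, "M": 4, "L": 8, "XL": 16}

def monthly_cost(size: str, hours_per_day: float, price_per_credit: float = 3.0) -> float:
    """Estimated monthly spend for one warehouse running hours_per_day."""
    return CREDITS_PER_HOUR[size] * hours_per_day * 30 * price_per_credit

# Downsizing M -> S and halving sync-driven runtime from 8h to 4h per day:
before = monthly_cost("M", 8)  # 4 credits/h * 8 h * 30 d * $3
after = monthly_cost("S", 4)   # 2 credits/h * 4 h * 30 d * $3
```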

    Jose Luis Cases

    08/09/2022, 9:12 PM
Hi, I'm trying to build a data lake using GCP. My idea is to use Google Cloud Storage with JSONL format as the destination instead of BigQuery, but I don't know if this is the best way to analyze data with ML or dashboards querying GCS.

    Jose Luis Cases

    08/09/2022, 9:12 PM
    Some advice please?
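For reference, the JSONL layout in question is just one JSON object per line, which both GCS-based tooling and BigQuery load jobs can read. A minimal stdlib sketch (an in-memory buffer stands in for the GCS object; the path and fields are hypothetical):

```python
import io
import json

# One JSON object per line: the JSONL layout being discussed.
# io.StringIO stands in for gs://my-lake/events/2022-08-09.jsonl (hypothetical).
records = [{"id": 1, "event": "signup"}, {"id": 2, "event": "purchase"}]

buf = io.StringIO()
for rec in records:
    buf.write(json.dumps(rec) + "\n")

# Reading it back is just line-by-line JSON decoding:
parsed = [json.loads(line) for line in buf.getvalue().splitlines()]
```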

    Marcos Marx (Airbyte)

    08/11/2022, 6:37 PM
Hello 👋 I’m sending this message to help you identify whether this channel is the best place to post your question. Airbyte has a few channels for open discussion about data topics (architecture, ingestion, quality, etc.). In these channels you may ask general questions related to the particular topic. If you’re having a problem deploying or running a connection in Airbyte, this is not the topic. We recommend you open a Discourse topic, where our support team will help you troubleshoot your issue.

    James Egan

    08/16/2022, 10:48 AM
    I have set up a facebook marketing ads to BQ connection and when I select custom fields it keeps failing in sync with the following message "Last attempt: NaN Bytes | no records | no records | 3m 6s | Sync Failure Origin: normalization, Message: Something went wrong during normalization 11:27AM 08/16 3 attempts 2022-08-16 104054 - Additional Failure Information: message='io.temporal.serviceclient.CheckedExceptionWrapper: java.util.concurrent.ExecutionException: io.airbyte.workers.exception.WorkerException: Normalization Failed.', type='java.lang.RuntimeException', nonRetryable=false"

    James Egan

    08/16/2022, 10:48 AM
When I remove the custom fields, the tables sync but report no records, even though if I look in FB Ads there are 2 records.

    Vincent Koppen

    08/17/2022, 10:01 AM
    Hello all, I am using Airbyte Open Source to transfer data from Amazon Ads to BigQuery. In the Connection under Replication it seems that the only available sync modes are Full Refresh (Overwrite and Append). Is there no Incremental Sync Mode in this case?
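Generally, incremental modes work by tracking a cursor field (e.g. an updated-at timestamp) in sync state and emitting only records past it; whether a given stream offers incremental depends on the source connector declaring such a cursor, which is why some streams only expose full refresh. A toy illustration of the cursor logic (field names and values are hypothetical):

```python
# Toy illustration of incremental sync: keep a cursor value in state and
# emit only records newer than it. Field names are hypothetical.
def incremental_read(records, state, cursor_field="updated_at"):
    """Return (new records past the cursor, updated cursor state)."""
    new_records = [r for r in records if r[cursor_field] > state]
    new_state = max((r[cursor_field] for r in new_records), default=state)
    return new_records, new_state

records = [
    {"id": 1, "updated_at": "2022-08-15"},
    {"id": 2, "updated_at": "2022-08-17"},
]
# A sync whose saved state is "2022-08-16" emits only record 2:
emitted, state = incremental_read(records, state="2022-08-16")
```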

    Abba

    08/17/2022, 2:27 PM
Try scrolling down on the dataset in BigQuery.

    James Egan

    08/17/2022, 2:50 PM
I have done; the raw and the tmp files are in there, and the Avro file is sitting in my GCS, but there are no rows in my dataset, just the schema.

    Hakeem Olu

    08/18/2022, 2:30 PM
Hi everyone, glad to be here. I am having an issue with my data sync: data from Redshift is not showing in Snowflake, even though everything ran successfully.
    Deployment: Using docker for deployment
    Airbyte Version: 0.39.39-alpha
    Source name/version: Redshift
    Destination name/version: Snowflake
    Step: The issue is happening during sync
    Description: Data not showing in snowflake from redshift.
    
    Versions:
    From the airbyte
    Redshift: 0.3.11
    Snowflake: 0.4.34
    
    AWS Redshift version: 1.0.40182

    Hakeem Olu

    08/18/2022, 2:30 PM
So basically I'm seeing the tables show up, but there is no data in them. I have about 300+ tables. If I sync 1 to 20 tables instead of 300+ it works; it's just not working for the entire set of tables.
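Since the 1-to-20-table subsets do sync, one workaround is to split the stream list into several smaller connections. A simple chunking helper for that (the batch size of 20 is taken from the working case above; table names are placeholders):

```python
# Split a large stream list into batches sized like the subset that syncs
# reliably. Batch size 20 mirrors the working case described above.
def chunk(streams, size):
    """Partition streams into consecutive batches of at most `size`."""
    return [streams[i:i + size] for i in range(0, len(streams), size)]

batches = chunk([f"table_{n}" for n in range(300)], size=20)
```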

    Sebastian Brickel

    08/22/2022, 10:38 AM
Hi, I set up a connection from Bing Ads to BigQuery using Airbyte OSS. The connection works fine as long as I do not include
ad_group_performance_report_hourly
and
campaign_performance_report_hourly
. Including them gives:
Failure Origin: source, Message: Checking source connection failed - please review this connection's configuration to prevent future syncs from failing
and
Additional Failure Information: Server raised fault: 'Invalid client data. Check the SOAP fault details for more information.
Including only
{ad,campaign}_performance_report_{daily,weekly,monthly}
works fine. Does anyone have an idea why that could be and how I could fix it? Thank you

    Shawn Wang (Airbyte)

    08/22/2022, 8:30 PM
    https://airbytehq-team.slack.com/archives/C01AB7G87NE/p1661200240451719