# troubleshooting
n
@here We are syncing JSON events from Kafka to Redshift with basic normalisation. We first tried the `INSERT` replica strategy; data is synced, but the pipeline is very slow. Following the docs, we switched the replica strategy to `COPY` by providing S3 credentials in the Redshift destination. With the `COPY` strategy, CSV files are written to S3, but only part of the data is inserted into our Redshift DB. In the example below, you can see the pipeline read 39,100 records; I verified that 4 different CSVs were written to S3, one with 16,252 records, another with around 22k records, and another with 2k records. But the number of records written to the Redshift DB is around 16,301. I have seen that when multiple files are written to S3, only one of the files (seemingly chosen at random) is synced to the DB. I'm using full refresh | append mode for the pipeline. Attaching an image for better understanding.
@Harshith (Airbyte) @Agustin Cano Alvarez Can you guys help me here? Either I'm doing something wrong or there is a bug on the Airbyte side. We are using version `0.35.45-alpha` on k8s.
I tried the same setup twice. As far as I can tell, Airbyte only syncs the last CSV file written to S3, not the full list of CSVs generated and uploaded to S3. I verified that the number of records in the last CSV matches the number of records being inserted.
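For reference (this is not the exact SQL Airbyte generates, and the bucket, prefix, and table names are placeholders), a Redshift `COPY` can load every staged CSV either by pointing at a common key prefix or by using a manifest that lists each file; a command that references only a single object key would explain why just one file's rows show up in the table:

```sql
-- Sketch only: load all staged CSVs listed in a manifest file.
-- Bucket, prefix, credentials, and table name are made up for illustration.
COPY my_schema._airbyte_raw_events
FROM 's3://my-staging-bucket/airbyte/events/part.manifest'
CREDENTIALS 'aws_access_key_id=<key>;aws_secret_access_key=<secret>'
MANIFEST
CSV GZIP;

-- Alternative sketch: a key prefix loads every object under it,
-- whereas 'part-00003.csv' alone would load only that one file.
COPY my_schema._airbyte_raw_events
FROM 's3://my-staging-bucket/airbyte/events/'
CREDENTIALS 'aws_access_key_id=<key>;aws_secret_access_key=<secret>'
CSV GZIP;
```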
a
Hi @Nitin Jain, could you please first check whether the count of records in the raw tables is correct? If it is, that would mean the normalization is filtering out some records. If you need more help from our side, please attach your full sync logs.
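For anyone following along, a quick way to run that check (schema and stream names are placeholders; Airbyte's raw tables are usually named `_airbyte_raw_<stream>`):

```sql
-- Records Airbyte landed in the raw table before normalization
SELECT COUNT(*) FROM my_schema._airbyte_raw_events;

-- Records in the normalized table produced by basic normalization
SELECT COUNT(*) FROM my_schema.events;

-- Both counts can then be compared against the records-emitted
-- number reported in the sync logs (39,100 in the example above).
```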
n
@[DEPRECATED] Augustin Lafanechere Attached the logs. The number of rows in the raw table is also not correct. Let me know if you need anything else.
a
Hi @Nitin Jain, @Drew Fustin has a similar problem. Would you mind trying the same steps Drew shared in this issue and in https://github.com/airbytehq/airbyte/issues/11158, and continuing the conversation on GitHub? Thanks!
a
@Drew Fustin @Nitin Jain I am facing the same issue as well, trying to connect source MSSQL to destination Snowflake. With S3 staging it creates 4 different files in S3 (the table has 250k records), but only the latest file is sent to Snowflake; the other 3 files stay in S3. Do you have any idea about this issue? I have also created an issue on GitHub: https://github.com/airbytehq/airbyte/issues/11052
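Not an Airbyte command, but for debugging the Snowflake side it can help to compare what was staged with what was actually loaded (the stage, database, and table names below are placeholders):

```sql
-- List the files Airbyte staged under the stream's prefix
LIST @my_db.my_schema.airbyte_s3_stage/events/;

-- Show which staged files Snowflake actually loaded into the target
-- table in the last 24 hours, and how many rows came from each
SELECT file_name, row_count, status
FROM TABLE(information_schema.copy_history(
    table_name => 'EVENTS',
    start_time => DATEADD(hour, -24, CURRENT_TIMESTAMP())));
```

If only one of the four staged files shows up in the copy history, that matches the "only the latest file is loaded" symptom described above.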