# troubleshooting
n
@here We are syncing JSON events from Kafka to Redshift with basic normalisation. We first tried the `INSERT` replica strategy; data is synced, but the pipeline is very slow. Following the docs, we switched the replica strategy to `COPY` by providing S3 credentials in the Redshift destination. With the `COPY` strategy, CSV files are written to S3, but only part of the data is inserted into our Redshift DB. In the example below, you can see the pipeline read 39,100 records; I verified that 4 different CSVs were written to S3, one with 16,252 records, another with around 22k records, and another with 2k records. But the number of records written to the Redshift DB is around 16,301. I have seen that when multiple files are written to S3, only one of the files (seemingly chosen at random) is synced to the DB. I'm using full refresh | append mode for the pipeline. Attaching an image for better understanding.
@Harshith (Airbyte) @Agustin Cano Alvarez Can you guys help me here? Either I'm doing something wrong or there is a bug on the Airbyte side. We are using version `0.35.45-alpha` on k8s.
I tried the same setup twice. As far as I can tell, Airbyte only syncs the last CSV file written to S3, not the full list of CSVs generated and uploaded to S3. I verified that the number of records in the last CSV matches the number of records being inserted.
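For reference (this is not the exact SQL Airbyte generates, and the bucket, prefix, and table names are placeholders), a Redshift `COPY` can load every staged CSV either by pointing at a common key prefix or by using a manifest that lists each file; a command that references only a single object key would explain why just one file's rows show up in the table:

```sql
-- Sketch only: load all staged CSVs listed in a manifest file.
-- Bucket, prefix, credentials, and table name are made up for illustration.
COPY my_schema._airbyte_raw_events
FROM 's3://my-staging-bucket/airbyte/events/part.manifest'
CREDENTIALS 'aws_access_key_id=<key>;aws_secret_access_key=<secret>'
MANIFEST
CSV GZIP;

-- Alternative sketch: a key prefix loads every object under it,
-- whereas 'part-00003.csv' alone would load only that one file.
COPY my_schema._airbyte_raw_events
FROM 's3://my-staging-bucket/airbyte/events/'
CREDENTIALS 'aws_access_key_id=<key>;aws_secret_access_key=<secret>'
CSV GZIP;
```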
a
Hi @Nitin Jain, could you please first check whether the count of records in the raw tables is correct? If it is, that would mean the normalization is filtering out some records. If you need more help from our side, please attach your full sync logs.
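For anyone following along, a quick way to run that check (schema and stream names are placeholders; Airbyte's raw tables are usually named `_airbyte_raw_<stream>`):

```sql
-- Records Airbyte landed in the raw table before normalization
SELECT COUNT(*) FROM my_schema._airbyte_raw_events;

-- Records in the normalized table produced by basic normalization
SELECT COUNT(*) FROM my_schema.events;

-- Both counts can then be compared against the records-emitted
-- number reported in the sync logs (39,100 in the example above).
```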
n
@[DEPRECATED] Augustin Lafanechere Attached the logs. The number of rows in the raw table is also not correct. Let me know if you need anything else.
a
Hi @Nitin Jain, @Drew Fustin has a similar problem. Would you mind trying the same steps Drew shared in this issue and in https://github.com/airbytehq/airbyte/issues/11158, and continuing the conversation on GitHub? Thanks!
a
@Drew Fustin @Nitin Jain I am facing the same issue as well, trying to connect source MSSQL to destination Snowflake. With S3 staging it creates 4 different files in S3 (the table has 250k records), but only the latest file is sent to Snowflake; the other 3 files stay in S3. Do you have any idea about this issue? I have also created an issue on GitHub: https://github.com/airbytehq/airbyte/issues/11052
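Not an Airbyte command, but for debugging the Snowflake side it can help to compare what was staged with what was actually loaded (the stage, database, and table names below are placeholders):

```sql
-- List the files Airbyte staged under the stream's prefix
LIST @my_db.my_schema.airbyte_s3_stage/events/;

-- Show which staged files Snowflake actually loaded into the target
-- table in the last 24 hours, and how many rows came from each
SELECT file_name, row_count, status
FROM TABLE(information_schema.copy_history(
    table_name => 'EVENTS',
    start_time => DATEADD(hour, -24, CURRENT_TIMESTAMP())));
```

If only one of the four staged files shows up in the copy history, that matches the "only the latest file is loaded" symptom described above.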