Hi team, We are currently having an issue with th...
# ask-community-for-troubleshooting
t
Hi team, we are currently having an issue with the salesforce connector. We notice that every time we sync the Case Salesforce object there is a significant number of duplicates (i.e. per case_id we could have up to 13 duplicates) in the source table (i.e. _airbyte_raw_case). It looks like the normalization stage takes care of the deduplication in the transform stage; however, the duplication of the raw data is causing significant overhead when loading into a DWH. I was thinking of raising this as an issue on GH but want to first make sure:
1. This is not expected behaviour?
2. Could this be a quick fix, and if so, does anybody know where in the codebase this issue might be originating from?
Thanks
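One way to quantify the duplication described above is to count how often each record id appears in the raw rows. The following is a minimal sketch, assuming the raw table stores each record as a JSON string in an `_airbyte_data` column and that the Salesforce id field is named `Id`; the helper name `duplicates_per_id` and the sample rows are hypothetical and only for illustration.

```python
import json
from collections import Counter


def duplicates_per_id(raw_rows, id_field="Id"):
    """Count how many times each record id appears in the raw rows.

    raw_rows: iterable of dicts holding a JSON string under `_airbyte_data`
    (assumed raw-table layout; adjust to what your destination actually has).
    """
    counts = Counter()
    for row in raw_rows:
        record = json.loads(row["_airbyte_data"])
        counts[record[id_field]] += 1
    # Keep only the ids that occur more than once.
    return {record_id: n for record_id, n in counts.items() if n > 1}


# Hypothetical sample rows standing in for an export of _airbyte_raw_case.
sample_rows = [
    {"_airbyte_data": json.dumps({"Id": "500A", "Status": "New"})},
    {"_airbyte_data": json.dumps({"Id": "500A", "Status": "New"})},
    {"_airbyte_data": json.dumps({"Id": "500B", "Status": "Closed"})},
    {"_airbyte_data": json.dumps({"Id": "500A", "Status": "New"})},
]

dupes = duplicates_per_id(sample_rows)
print(dupes)
```

Running the same count on a sample of the real raw table would show whether the duplicates are spread evenly across ids or concentrated in a few.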
u
@Marcos Marx (Airbyte) turned this message into Zendesk ticket 2430 to ensure timely resolution!
u
Duplicates in destination raw tables can happen due to the way incremental syncs use checkpointing to update state. There is some more info on the design considerations here: https://discuss.airbyte.io/t/incremental-syncing-issues-on-several-connectors/504
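To illustrate why checkpointing can lead to duplicates, here is a toy model of an at-least-once sync. It is not Airbyte's implementation: the function `sync_with_checkpoints` and its parameters are hypothetical, and it only assumes that state is saved every few records and that a failed attempt resumes from the last saved state.

```python
def sync_with_checkpoints(records, checkpoint_every, fail_at=None):
    """Toy at-least-once sync: on failure, resume from the last checkpoint."""
    emitted = []
    state = 0       # index of the first record not yet covered by a checkpoint
    position = 0
    failed = False
    while position < len(records):
        if fail_at is not None and position == fail_at and not failed:
            failed = True
            # Restart from the last checkpoint; records emitted after it
            # are sent to the destination a second time.
            position = state
            continue
        emitted.append(records[position])
        position += 1
        if position % checkpoint_every == 0:
            state = position
    return emitted


source = list(range(10))
emitted = sync_with_checkpoints(source, checkpoint_every=4, fail_at=6)
print(emitted)
```

In this model every source record arrives at least once, and only the records between the last checkpoint and the failure point are repeated, so the expected overhead is bounded by the checkpoint interval per failure.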
u
It might also be worth confirming whether the duplicates exist in the source data. If so, maybe they can be cleaned up before ingestion?
b
Hi @Sunny Hashmi (Airbyte), thanks for getting back. I'm in @Tiri Georgiou's team. To add to the issue description:
• Our Case object has ~3M records, but when we try to sync with Airbyte, it reads 20M+ records. We have to stop the sync because it takes days, and it would be impossible to load the data into our destination anyway given the size of the raw files. This looks like massive duplication.
• This magnitude of duplication does not occur on other objects. We have tested several of them, and they sync quickly with very limited duplicates.
Should we expect "at-least-once delivery" to produce such duplication factors?
u
Thanks for the additional context, that is definitely more than I would expect. I'm trying to reproduce this and will let you know.
b
@Sunny Hashmi (Airbyte) I think we can rule out the incremental sync issue from the picture. Yesterday we reset the connection and started a Full refresh run on just the Case object. It's still running, currently at Records read: 3885000 (75 GB), and I just checked: our Case table in Salesforce has 1058326 records.
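For reference, the figures quoted in this message can be turned into a rough read-to-source ratio. This is only a sketch of the arithmetic; the variable names are illustrative, and because the sync was still running, the result is a lower bound rather than a final duplication factor.

```python
records_read = 3_885_000   # "Records read" reported by the running sync
source_rows = 1_058_326    # row count of the Case table in Salesforce

factor = records_read / source_rows
excess = records_read - source_rows

print(f"read/source ratio so far: {factor:.2f}x")
print(f"records read beyond the source row count: {excess}")
```

A full refresh would be expected to read roughly one record per source row, so a ratio well above 1 on a run that has not finished yet points at repeated reads rather than at incremental checkpointing.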
u
I wasn't able to reproduce this issue but I also do not have the kind of scale you are ingesting. Which version of Airbyte and salesforce connector are you running, and are you able to share full logs? Also, can you try sending to another destination to see if it has the same behaviour?
b
The scale is probably key as our Case table is one of the biggest tables in our salesforce db.
We are running airbyte 0.40.7 and salesforce connector 1.0.16
I've shared the full logs with you in a direct message yesterday. Let me know if you haven't received it
Started a sync 7 hours ago with destination S3; it's still running and sailing past the actual number of records in our source.
It has read 20 GB already, which to me indicates massive duplication.
s
Hey @Benoit Fayolle, I'm still reviewing the log file and will update again when I have more clues, but I did find this github issue that matches some of the messages, and similar behaviour re hanging on syncs https://github.com/airbytehq/airbyte/issues/17148 If your team wants to add thoughts and a thumbs up to the issue that would be helpful. Are you also running in kubernetes?
Also just to confirm, this is only happening on the Case object?
Ah, doesn't look like kubernetes based on the logs 👍
b
Thanks @Sunny Hashmi (Airbyte), indeed we don't use k8s. I confirm it's only happening on the Case object. Opportunity, Account and OpportunityLineItem run fine.
I posted on the above issue. Do you think the hanging sync issue could be related to the problem of duplication?
s
@Paul Charlet
u
This issue has been escalated to the connectors team to be investigated, if you see any other related behaviour please update there. https://github.com/airbytehq/airbyte/issues/17148
m
Just bumping this, was there any resolution? I am also seeing duplicates with the Salesforce connector.
t
Hello, any updates on this? I'm currently trying to replicate the "Case" stream and still having the same duplicate issue in November 2024.
For anyone facing this issue (Case stream on the Salesforce connector capturing duplicated data), I have a "non-conclusive" statement that may help:
• The problem continues to occur on Salesforce source version 2.6.3 (latest as of today) with "force use bulk API" OFF.
• I was not seeing this behavior with version 2.5.24 with "force use bulk API" ON.
• I haven't had the chance to test whether the flag or the version update is what's causing this, because each sync takes literal days due to the size of the Case table, but that's what I saw on my end.
• I found this post from 2 years ago mentioning a similar problem, though I have no idea why it is becoming a problem again after, judging by how the thread goes, a fix was applied some time ago on much older versions of the connector: https://github.com/airbytehq/airbyte/issues/17148