# announcements
s
We did a POC with Airbyte (and Salesforce) and the team came up with the following questions. Can someone help me with these?
• We noticed duplication of data into 3 different tables for the same object with the syncMode='incremental' and destinationSyncMode='append_dedup' configuration. How can we archive the unwanted data from those tables?
• Despite providing the syncCatalog configuration to fetch specific fields from the tables, Airbyte actually syncs the entire object into the _airbyte_raw_<object> table as a JSON object, and it is then duplicated in the _scd and actual object tables. Overall, the data is stored three times in our datastore. How can we reduce data duplication?
• How can we monitor whether the data is getting synced properly? Is there any way to determine from the logs whether the sync failed for a specific workspace (for example, via a workspace identifier in the logs)?
• What is the difference between the free and paid plans? What additional features do we get apart from support?
c
yes, you’re right, for the moment the data is stored with duplicates in different tables…
How can we archive the unwanted data from those tables?
What is the concern here? Is it that the extra space increases cost? Could you share more details? For archiving, you could, for example, choose not to keep the _scd tables and reduce the duplication to two tables (raw and final). Would that be OK?
Is there any way from the logs to determine if the sync failed for a specific workspace (like having some workspace identifier in logs)
Workspaces do have IDs. Is that what you are referring to, as described in the following docs? https://docs.airbyte.io/tutorials/exploring-workspace-folder#identifying-workspace-ids
d
So in this 9 is the workspace id, correct me if I am wrong.
c
yes 9 is the workspace and the 2 is the attempt id
d
Yes, the concern is that it would store more data in the DB and hence increase the cost. With large volumes of data this becomes a real concern.
Is there a way to archive the raw data too, to reduce disk usage?
c
So yes, the first implementation of this “destination sync mode” is called append_dedup because it both appends to the raw table and then deduplicates the final table. We would need to iterate and implement the dedup sync mode (or upsert_dedup), which would only keep minimal data in the destination (even drop raw tables?) to minimize storage.
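To make the storage trade-off concrete, here is a minimal Python sketch of the behaviour described above: every synced record is appended to a raw store, while the final store keeps only the latest version per primary key. This is an illustration only, not Airbyte's actual normalization code. The field names `Id` and `SystemModstamp` are assumed here as the primary key and cursor for a Salesforce object, and the function name is hypothetical.

```python
from typing import Any, Dict, List

Record = Dict[str, Any]


def append_dedup(
    raw: List[Record],
    final: Dict[Any, Record],
    batch: List[Record],
    key: str = "Id",                 # assumed primary key field
    cursor: str = "SystemModstamp",  # assumed cursor (last-modified) field
) -> None:
    """Append every record to `raw`; keep only the newest per key in `final`."""
    for record in batch:
        # The raw store is append-only, so it grows with every sync.
        raw.append(record)
        current = final.get(record[key])
        # The final store is overwritten when the incoming record is newer.
        if current is None or record[cursor] >= current[cursor]:
            final[record[key]] = record


raw_store: List[Record] = []
final_store: Dict[Any, Record] = {}

# First sync: one account.
append_dedup(raw_store, final_store, [
    {"Id": "001", "Name": "Acme", "SystemModstamp": "2021-05-01T00:00:00Z"},
])
# Second sync: the same account was updated and a new one was created.
append_dedup(raw_store, final_store, [
    {"Id": "001", "Name": "Acme Corp", "SystemModstamp": "2021-05-02T00:00:00Z"},
    {"Id": "002", "Name": "Globex", "SystemModstamp": "2021-05-02T00:00:00Z"},
])
```

After the two syncs the raw store holds three records and the final store holds two, which is where the extra disk usage comes from. A dedup or upsert_dedup mode, as discussed, would aim to keep only something like the final store.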
d
This is not currently available, right?
c
unfortunately, no
d
So in the current circumstances there is no way to either avoid duplicating the raw and _scd tables or archive these tables after some time?
c
If you export the normalization files that are generated for your sync by following this tutorial: https://docs.airbyte.io/tutorials/connecting-el-with-t-using-dbt, you could tune and customize them to deduplicate and handle incremental updates of the dedup table yourself (while dropping the raw table every time your transformation step has completed). In the meantime, we’ll be working on https://github.com/airbytehq/airbyte/issues/2959 to make such customization easier. Feel free to create an issue or comment on the existing issues with your use cases if your requirements are not covered yet, so we can reprioritize topics based on community needs. There is also this ticket discussing some possible improvements: https://github.com/airbytehq/airbyte/issues/2566
So in the current circumstances there is no way to either avoid duplicating the raw and _scd tables or archive these tables after some time?
You could set up your own transformations, running once in a while, that clean up rows from the raw table that no longer exist in the deduplicated final table. It would be safe to drop the SCD table entirely too if you don’t need it.
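A periodic cleanup along those lines could look roughly like the sketch below. It is only an illustration: it uses an in-process SQLite connection as a stand-in for the real destination, assumes the raw and final tables share an `_airbyte_ab_id` column that identifies a record version, and the function names are hypothetical. The SQL may need adjusting for your destination's dialect.

```python
import sqlite3


def prune_raw_rows(conn, raw_table, final_table, id_column="_airbyte_ab_id"):
    """Delete raw rows that have no matching row in the final table.

    Returns the number of rows deleted. Table and column names are
    interpolated into the SQL, so pass trusted identifiers only.
    """
    sql = (
        f'DELETE FROM "{raw_table}" '
        f"WHERE NOT EXISTS ("
        f'SELECT 1 FROM "{final_table}" AS f '
        f'WHERE f."{id_column}" = "{raw_table}"."{id_column}")'
    )
    cur = conn.execute(sql)
    conn.commit()
    return cur.rowcount


def drop_scd_table(conn, scd_table):
    """Drop the SCD table if history tracking is not needed."""
    conn.execute(f'DROP TABLE IF EXISTS "{scd_table}"')
    conn.commit()
```

For example, calling `prune_raw_rows(conn, "_airbyte_raw_account", "account")` after each sync would remove superseded record versions from the raw table (the table names here are placeholders). Run it only once normalization for that sync has completed, so rows that have not yet been processed are not deleted.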
d
Ok
Can you please also answer the last question?
c
To start answering that question, we have the following page: https://airbyte.io/pricing/
For your issue with duplicated data in multiple tables, could you add your concerns as a comment on an existing GitHub issue or create your own? It helps us to search and reprioritize areas to focus on. For example, maybe this issue would fit? https://github.com/airbytehq/airbyte/issues/2683
d
Sure will look into it and take action accordingly
m
@Deep most of our focus today is on the community edition. For the standard plan we are currently charging for support & SLA, meaning we help companies with install, configuration, upgrades, migration and operations. We are working on a fully hosted solution as well.