# ingestion
p
I got a response that this might not be solvable with Kafka either. Yet to look at a fix… thought someone who has seen this issue might be able to help…
w
I think the SqlAlchemy ingestion module would need a code change to have the behaviour you're asking for. The feature sounds reasonable to me, perhaps controllable via a flag. I imagine that on a heavily used Redshift cluster, the ingestion module will have a high probability of failing due to these kinds of issues. Maybe you can create an issue about it on github?
p
any workaround that you can think of, @wonderful-quill-11255 (divide the workload by schema, etc.)?
Thank you for your reply, will raise an issue
w
Yes, workload division like that should lower the probability of triggering the issue.
p
and is there some way to do that?
w
You have the `table_pattern` and `schema_pattern` parameters to work with to specify inclusion/exclusion patterns.
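For illustration, here is a minimal sketch of a recipe using those patterns, written as a Python dict and run through the Python `Pipeline` API (the same config shape as the YAML recipe passed to `datahub ingest -c`). The host, credentials, and regexes below are placeholders, and exact config keys can vary between acryl-datahub versions.

```python
# Minimal recipe sketch (Python dict form of the YAML recipe) showing
# schema_pattern / table_pattern allow/deny lists for the Redshift source.
# Host, credentials, and regexes are placeholders; exact config keys can
# differ between acryl-datahub versions.
from datahub.ingestion.run.pipeline import Pipeline

recipe = {
    "source": {
        "type": "redshift",
        "config": {
            "host_port": "my-cluster.example.amazonaws.com:5439",
            "database": "analytics",
            "username": "datahub",
            "password": "change-me",  # placeholder
            # Only walk these schemas...
            "schema_pattern": {"allow": ["^public$", "^marts_.*"]},
            # ...and skip tables that are frequently created and dropped.
            "table_pattern": {"deny": [".*_tmp$", ".*_staging$"]},
        },
    },
    "sink": {
        "type": "datahub-rest",
        "config": {"server": "http://localhost:8080"},
    },
}

pipeline = Pipeline.create(recipe)
pipeline.run()
pipeline.raise_from_status()
```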
p
thanks… will see if I can use these patterns to allow/deny
Done (hope it is as per the guidelines): https://github.com/linkedin/datahub/issues/2627
l
@gray-shoe-75895 ^ please take a look
Thanks for reporting @powerful-telephone-71997
p
Thanks @loud-island-88694 @gray-shoe-75895
g
This is a fun edge case - the way ingestion works is that it lists all the tables, and then iterates through each one to fetch additional information. Skipping the table completely might not be the right approach in all cases since it might hide real bugs, but I can totally see where having an option to skip tables that were concurrently dropped during ingestion would be a useful thing to have
I’ll work on adding an option for this
If possible in the meantime, I’d suggest using the table_pattern options to completely exclude the tables that might get dropped during ingestion
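To make the proposed option concrete, here is a rough sketch (not DataHub's actual ingestion code) of what skipping concurrently dropped tables could look like on top of SQLAlchemy's inspector, with a hypothetical `skip_dropped` flag:

```python
# Illustrative sketch only (not DataHub's actual code): list the tables up
# front, then tolerate per-table failures caused by a concurrent DROP when a
# hypothetical skip_dropped flag is set.
import logging

from sqlalchemy import create_engine, inspect
from sqlalchemy.exc import NoSuchTableError, ProgrammingError

logger = logging.getLogger("ingest")


def describe_schema(url: str, schema: str, skip_dropped: bool = True):
    engine = create_engine(url)
    inspector = inspect(engine)
    for table in inspector.get_table_names(schema=schema):
        try:
            columns = inspector.get_columns(table, schema=schema)
        except (NoSuchTableError, ProgrammingError):
            if skip_dropped:
                # The table existed when the schema was listed but vanished
                # before it could be described, so log and move on.
                logger.warning(
                    "Skipping %s.%s: possibly dropped during ingestion",
                    schema,
                    table,
                )
                continue
            raise
        yield schema, table, columns
```

The key point is that the listing and the per-table describes are separate round trips, and that gap is the window in which a concurrent DROP can cause a failure.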
p
A simple `if exists` check should go a long way. Also, I feel there is a need to batch instead of ingesting one table at a time. I have seen Amundsen do this whole ingestion in less than 10 minutes for 8k tables, but it looks like DataHub does something different (maybe something I do not know)… here is what it does to pull all the metadata at once from postgres/redshift:
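Not the exact snippet referenced above, but a rough illustration of that batch style: one query against `information_schema.columns` returning every table and column in a single round trip, instead of one query per table.

```python
# Rough illustration of a single-round-trip metadata pull (not the exact
# query referenced above): information_schema.columns already exposes every
# schema/table/column, so one query covers the whole database.
from sqlalchemy import create_engine, text

BATCH_METADATA_SQL = text(
    """
    SELECT table_schema, table_name, column_name, data_type, ordinal_position
    FROM information_schema.columns
    WHERE table_schema NOT IN ('information_schema', 'pg_catalog')
    ORDER BY table_schema, table_name, ordinal_position
    """
)


def fetch_all_metadata(url: str):
    engine = create_engine(url)
    with engine.connect() as conn:
        return conn.execute(BATCH_METADATA_SQL).fetchall()
```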
It took more than an hour with DataHub; I need to check where the performance bottleneck is.
Also, a question on whether incremental updates are possible with DataHub ingestion, picking up only the schema changes instead of doing a full ingestion each time.
Ideally ingestion should autodetect changes and not make the user worry about specifying incremental vs. full, but rather just go through and finish the incremental ingestion.
Dropped tables are in any case not required in the catalog, so skipping is fine in my view, with a report in the logs that the table might have been dropped during the ingestion.
Or update the dropped time if that is possible (mark it as deleted),
because it will eventually be important to catalog versions as well.
Happy to have a call to discuss, and thank you @gray-shoe-75895 for looking into this… 🙏
I would also suggest having some switch to restart the last ingestion from where it left off.
g
Yep absolutely - I’ve spun up a Redshift instance of my own and am playing around to see if we can do batch ingest instead of doing one table at a time
I’m hoping that this will also make it fast enough to mitigate the incremental ingestion problem
p
I tried this on a blank database with just the metadata (no actual data) and it took about 10-15 mins for 9300 tables…
g
@powerful-telephone-71997 this PR should make it significantly faster https://github.com/linkedin/datahub/pull/2635
I’ll ping again when it’s merged and included in a pip module release
@powerful-telephone-71997 acryl-datahub 0.4.0 is up
p
Hi @gray-shoe-75895, is this available via Docker?
And does it include the fix for both the performance issue and the drop-table handling part?
Thank you
g
yep, the datahub-ingestion:head image will always refer to the latest commit, so it’s also available there
I believe improving the performance issue will mitigate the drop table piece since it will be closer to an atomic operation