# ingestion
p
I got a response that this might not be solvable with Kafka either. Yet to look at a fix… thought someone who has seen this issue might be able to help…
w
I think the SqlAlchemy ingestion module would need a code change to have the behaviour you're asking for. The feature sounds reasonable to me, perhaps controllable via a flag. I imagine that on a heavily used Redshift cluster, the ingestion module will have a high probability of failing due to these kinds of issues. Maybe you can create an issue about it on github?
p
any workaround that you can think of, @wonderful-quill-11255 (divide the workload by schema, etc.)?
Thank you for your reply, will raise an issue
w
Yes, workload division like that should lower the probability of triggering the issue.
p
and is there some way to do that?
w
You have the `table_pattern` and `schema_pattern` parameters to work with to specify inclusion/exclusion patterns.
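For illustration, here is a minimal sketch of a recipe using those patterns, written as a Python dict and run through the Python `Pipeline` API (the same config shape as the YAML recipe passed to `datahub ingest -c`). The host, credentials, and regexes below are placeholders, and exact config keys can vary between acryl-datahub versions.

```python
# Minimal recipe sketch (Python dict form of the YAML recipe) showing
# schema_pattern / table_pattern allow/deny lists for the Redshift source.
# Host, credentials, and regexes are placeholders; exact config keys can
# differ between acryl-datahub versions.
from datahub.ingestion.run.pipeline import Pipeline

recipe = {
    "source": {
        "type": "redshift",
        "config": {
            "host_port": "my-cluster.example.amazonaws.com:5439",
            "database": "analytics",
            "username": "datahub",
            "password": "change-me",  # placeholder
            # Only walk these schemas...
            "schema_pattern": {"allow": ["^public$", "^marts_.*"]},
            # ...and skip tables that are frequently created and dropped.
            "table_pattern": {"deny": [".*_tmp$", ".*_staging$"]},
        },
    },
    "sink": {
        "type": "datahub-rest",
        "config": {"server": "http://localhost:8080"},
    },
}

pipeline = Pipeline.create(recipe)
pipeline.run()
pipeline.raise_from_status()
```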
p
thanks… will see if I can use these patterns to allow/deny
Done (hope it is as per the guidelines): https://github.com/linkedin/datahub/issues/2627
l
@gray-shoe-75895 ^ please take a look
Thanks for reporting @powerful-telephone-71997
p
Thanks @loud-island-88694 @gray-shoe-75895
g
This is a fun edge case - the way ingestion works is that it lists all the tables, and then iterates through each one to fetch additional information. Skipping the table completely might not be the right approach in all cases since it might hide real bugs, but I can totally see where having an option to skip tables that were concurrently dropped during ingestion would be a useful thing to have
I’ll work on adding an option for this
If possible in the meantime, I’d suggest using the table_pattern options to completely exclude the tables that might get dropped during ingestion
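To make the proposed option concrete, here is a rough sketch (not DataHub's actual ingestion code) of what skipping concurrently dropped tables could look like on top of SQLAlchemy's inspector, with a hypothetical `skip_dropped` flag:

```python
# Illustrative sketch only (not DataHub's actual code): list the tables up
# front, then tolerate per-table failures caused by a concurrent DROP when a
# hypothetical skip_dropped flag is set.
import logging

from sqlalchemy import create_engine, inspect
from sqlalchemy.exc import NoSuchTableError, ProgrammingError

logger = logging.getLogger("ingest")


def describe_schema(url: str, schema: str, skip_dropped: bool = True):
    engine = create_engine(url)
    inspector = inspect(engine)
    for table in inspector.get_table_names(schema=schema):
        try:
            columns = inspector.get_columns(table, schema=schema)
        except (NoSuchTableError, ProgrammingError):
            if skip_dropped:
                # The table existed when the schema was listed but vanished
                # before it could be described, so log and move on.
                logger.warning(
                    "Skipping %s.%s: possibly dropped during ingestion",
                    schema,
                    table,
                )
                continue
            raise
        yield schema, table, columns
```

The key point is that the listing and the per-table describes are separate round trips, and that gap is the window in which a concurrent DROP can cause a failure.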
p
A simple `if exists` check should go a long way. Also, I feel there is a need to batch instead of ingesting one table at a time. I have seen Amundsen do this whole ingestion in less than 10 minutes for 8k tables, but it looks like DataHub does something different (maybe something I do not know)… here is what it does to pull all the metadata at once from postgres/redshift:
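Not the exact snippet referenced above, but a rough illustration of that batch style: one query against `information_schema.columns` returning every table and column in a single round trip, instead of one query per table.

```python
# Rough illustration of a single-round-trip metadata pull (not the exact
# query referenced above): information_schema.columns already exposes every
# schema/table/column, so one query covers the whole database.
from sqlalchemy import create_engine, text

BATCH_METADATA_SQL = text(
    """
    SELECT table_schema, table_name, column_name, data_type, ordinal_position
    FROM information_schema.columns
    WHERE table_schema NOT IN ('information_schema', 'pg_catalog')
    ORDER BY table_schema, table_name, ordinal_position
    """
)


def fetch_all_metadata(url: str):
    engine = create_engine(url)
    with engine.connect() as conn:
        return conn.execute(BATCH_METADATA_SQL).fetchall()
```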
It took more than an hour with DataHub; I need to check where the performance bottleneck is.
Also, a question on whether incremental updates are possible with DataHub ingestion, picking up only the schema changes instead of doing a full ingestion each time.
Ideally ingestion should autodetect changes and not make the user worry about specifying incremental vs. full, but rather just go through and finish the incremental ingestion.
Dropped tables are in any case not required in the catalog, so skipping is fine in my view, with a report in the logs that the table might have been dropped during the ingestion.
Or update the dropped time if that is possible (mark it as deleted),
because it will eventually be important to catalog versions as well.
Happy to have a call to discuss, and thank you @gray-shoe-75895 for looking into this… 🙏
I would also suggest having some switch to restart the last ingestion from where it left off.
g
Yep absolutely - I’ve spun up a Redshift instance of my own and am playing around to see if we can do batch ingest instead of doing one table at a time
I’m hoping that this will also make it fast enough to mitigate the incremental ingestion problem
p
I tried this on a blank database with just the metadata (no actual data) and it took about 10-15 mins for 9300 tables…
g
@powerful-telephone-71997 this PR should make it significantly faster https://github.com/linkedin/datahub/pull/2635
I’ll ping again when it’s merged and included in a pip module release
@powerful-telephone-71997 acryl-datahub 0.4.0 is up
p
Hi @gray-shoe-75895, is this available via Docker?
And does it include the fix for both the performance issue and the drop-table handling part?
Thank you
g
yep, the datahub-ingestion:head image will always refer to the latest commit, so it’s also available there
I believe improving the performance issue will mitigate the drop table piece since it will be closer to an atomic operation