# random
s
Hi all, when a MySQL column changes, will DataHub get the metadata change message right away? Is that possible?
i
Hello 👋, DataHub can support such a use case indirectly by ingesting MCE events directly from Kafka or via the ingestion REST endpoint. There is nothing built in to address this for now (it is planned, I think), but you have the primitives to do so.
Something like enabling CDC (change data capture) in MySQL, then running a custom process that listens on those events and, when it finds a metadata change, forwards it to DataHub either via a Kafka message or a REST call.
You need to convert the CDC message into something that DataHub understands: an MCE (Metadata Change Event). Perhaps have a look at this: https://datahubproject.io/docs/metadata-ingestion#using-as-a-library
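For illustration, here is a minimal sketch of that last step using the acryl-datahub Python library. The server URL, platform, and table name are placeholders, and newer library versions emit MetadataChangeProposals rather than MCEs; the idea is the same:
```python
# Minimal sketch: push a metadata change for one table to DataHub over REST.
# Assumes the acryl-datahub package; the server URL and names are placeholders.
from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import DatasetPropertiesClass

emitter = DatahubRestEmitter(gms_server="http://localhost:8080")

# In a real pipeline this aspect would be built from the decoded CDC event,
# e.g. an updated schema parsed from the DDL statement.
mcp = MetadataChangeProposalWrapper(
    entityUrn=make_dataset_urn(platform="mysql", name="mydb.mytable", env="PROD"),
    aspect=DatasetPropertiesClass(description="Updated after a column change"),
)
emitter.emit(mcp)
```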
Alternatively, you could run the ingestion framework for the MySQL source periodically, or as soon as you detect the change through some other mechanism.
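As a sketch, the framework can also be invoked from Python, so a cron job or scheduler could run something like the following (connection details and credentials are placeholders):
```python
# Sketch: run the MySQL ingestion source programmatically, e.g. from a
# cron-triggered script. Assumes the acryl-datahub package with the mysql
# plugin installed; all connection details below are placeholders.
from datahub.ingestion.run.pipeline import Pipeline

pipeline = Pipeline.create(
    {
        "source": {
            "type": "mysql",
            "config": {
                "host_port": "localhost:3306",
                "database": "mydb",
                "username": "datahub",
                "password": "secret",
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "http://localhost:8080"},
        },
    }
)
pipeline.run()
pipeline.raise_from_status()  # fail loudly if the run had errors
```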
s
Thanks for all your replies. So if I schedule it with cron every minute, what's the backend logic? Will it take a lot of time? If I have 100,000 tables, how long will it take for just the changed table's metadata to land in the DataHub store?
i
I don’t have information on how fast the crawler would be at that scale. I’m pretty sure, however, that the crawler will retrieve all the metadata for those 100k tables, and then DataHub’s backend will diff what was crawled against what already exists in DataHub and apply the difference.
This means the crawling part should take roughly the same time on every run; the diff calculation might be the bottleneck. Suppose crawling 100k tables in MySQL takes 2 minutes; that would not change for subsequent crawl executions. I’m assuming no statistics or usage profiling is configured in the crawler, as that would slow it down. Calculating the diff on DataHub’s backend may take a while; how long, I don’t know. Either way, running the crawler every minute does not seem adequate.
The best approach is push-based ingestion, something like: CDC event -> Kafka topic -> filter function -> transformer function to MCE -> DataHub’s ingestion topic. See the sketch below.
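A rough sketch of the filter/transform steps in the middle of that pipeline, assuming Debezium-style schema-change events on a Kafka topic. The broker, topic, and event field names are illustrative assumptions, and the final hop uses the REST emitter for simplicity rather than writing to the ingestion topic directly:
```python
# Rough sketch: consume CDC events, keep only schema (DDL) changes, and
# forward a metadata update to DataHub. Assumes confluent-kafka and
# acryl-datahub; topic/broker names and the event shape are assumptions.
import json

from confluent_kafka import Consumer
from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import DatasetPropertiesClass

consumer = Consumer(
    {"bootstrap.servers": "localhost:9092", "group.id": "cdc-to-datahub"}
)
consumer.subscribe(["mysql.schema-changes"])  # hypothetical CDC topic
emitter = DatahubRestEmitter(gms_server="http://localhost:8080")

while True:
    msg = consumer.poll(timeout=1.0)
    if msg is None or msg.error():
        continue
    event = json.loads(msg.value())
    # Filter: only react to DDL (schema) changes, not row-level data changes.
    ddl = event.get("ddl", "")
    if "ALTER TABLE" not in ddl.upper():
        continue
    # Field names here depend on the CDC tool; these are assumed.
    table = f"{event['databaseName']}.{event['tableName']}"
    # Transform: build a DataHub aspect and push it. A real implementation
    # would emit updated SchemaMetadata parsed from the DDL statement.
    emitter.emit(
        MetadataChangeProposalWrapper(
            entityUrn=make_dataset_urn(platform="mysql", name=table, env="PROD"),
            aspect=DatasetPropertiesClass(customProperties={"last_ddl": ddl}),
        )
    )
```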
s
Yes, I appreciate your solution; otherwise there could be delays that leave the metadata stale and unreliable.
Thanks again.