# random
s
Hi all, when a MySQL column changes, will DataHub get the metadata change message right away? Is that possible?
i
Hello 👋, DataHub can support such a use case indirectly by ingesting MCE events directly from Kafka or via the ingestion REST endpoint. There is nothing built in to address this for now (it is planned, I think), but you have the primitives to do so.
Something like enabling CDC (change data capture) in MySQL, then running a custom process that listens on those events and, when it finds a metadata change, forwards it to DataHub either via a Kafka message or a REST call.
You need to convert the CDC message into something that DataHub understands: an MCE (Metadata Change Event). Perhaps have a look at this: https://datahubproject.io/docs/metadata-ingestion#using-as-a-library
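For illustration, here is a minimal sketch of that last step using the acryl-datahub Python library. The server URL, platform, and table name are placeholders, and newer library versions emit MetadataChangeProposals rather than MCEs; the idea is the same:
```python
# Minimal sketch: push a metadata change for one table to DataHub over REST.
# Assumes the acryl-datahub package; the server URL and names are placeholders.
from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import DatasetPropertiesClass

emitter = DatahubRestEmitter(gms_server="http://localhost:8080")

# In a real pipeline this aspect would be built from the decoded CDC event,
# e.g. an updated schema parsed from the DDL statement.
mcp = MetadataChangeProposalWrapper(
    entityUrn=make_dataset_urn(platform="mysql", name="mydb.mytable", env="PROD"),
    aspect=DatasetPropertiesClass(description="Updated after a column change"),
)
emitter.emit(mcp)
```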
Alternatively, you could run the ingestion framework for the MySQL source periodically, or as soon as you detect the change through some other mechanism.
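As a sketch, the framework can also be invoked from Python, so a cron job or scheduler could run something like the following (connection details and credentials are placeholders):
```python
# Sketch: run the MySQL ingestion source programmatically, e.g. from a
# cron-triggered script. Assumes the acryl-datahub package with the mysql
# plugin installed; all connection details below are placeholders.
from datahub.ingestion.run.pipeline import Pipeline

pipeline = Pipeline.create(
    {
        "source": {
            "type": "mysql",
            "config": {
                "host_port": "localhost:3306",
                "database": "mydb",
                "username": "datahub",
                "password": "secret",
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "http://localhost:8080"},
        },
    }
)
pipeline.run()
pipeline.raise_from_status()  # fail loudly if the run had errors
```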
s
Thanks for all your replies. So if I schedule it with cron every minute, what's the backend logic? Will it take a lot of time? If I have 100,000 tables, how long will it take for just the changed table's metadata to land in the DataHub store?
i
I don’t have information on how fast the crawler would be at that scale. I’m pretty sure, however, that the crawler will retrieve all the metadata for those 100k tables, and then DataHub’s backend will diff what was crawled against what already exists in DataHub and apply the difference.
This means the crawling part should take roughly the same time on every run; the diff calculation might be the bottleneck. Suppose crawling 100k tables in MySQL takes 2 minutes; that would not change for subsequent crawl executions. I’m assuming no statistics or usage profiling is configured in the crawler, as that would slow it down. Calculating the diff on DataHub’s backend may take a while; how long, I don’t know. Either way, running the crawler every minute does not seem adequate.
The best approach is push-based ingestion, something like: CDC event -> Kafka topic -> filter function -> transformer function to MCE -> DataHub’s ingestion topic. See the sketch below.
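A rough sketch of the filter/transform steps in the middle of that pipeline, assuming Debezium-style schema-change events on a Kafka topic. The broker, topic, and event field names are illustrative assumptions, and the final hop uses the REST emitter for simplicity rather than writing to the ingestion topic directly:
```python
# Rough sketch: consume CDC events, keep only schema (DDL) changes, and
# forward a metadata update to DataHub. Assumes confluent-kafka and
# acryl-datahub; topic/broker names and the event shape are assumptions.
import json

from confluent_kafka import Consumer
from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import DatasetPropertiesClass

consumer = Consumer(
    {"bootstrap.servers": "localhost:9092", "group.id": "cdc-to-datahub"}
)
consumer.subscribe(["mysql.schema-changes"])  # hypothetical CDC topic
emitter = DatahubRestEmitter(gms_server="http://localhost:8080")

while True:
    msg = consumer.poll(timeout=1.0)
    if msg is None or msg.error():
        continue
    event = json.loads(msg.value())
    # Filter: only react to DDL (schema) changes, not row-level data changes.
    ddl = event.get("ddl", "")
    if "ALTER TABLE" not in ddl.upper():
        continue
    # Field names here depend on the CDC tool; these are assumed.
    table = f"{event['databaseName']}.{event['tableName']}"
    # Transform: build a DataHub aspect and push it. A real implementation
    # would emit updated SchemaMetadata parsed from the DDL statement.
    emitter.emit(
        MetadataChangeProposalWrapper(
            entityUrn=make_dataset_urn(platform="mysql", name=table, env="PROD"),
            aspect=DatasetPropertiesClass(customProperties={"last_ddl": ddl}),
        )
    )
```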
s
Yes, I appreciate your solution; otherwise there could be delays that leave the metadata stale and unreliable.
Thanks again.