# ingestion
rich-winter-40155:
Hi All, we are setting up Hive metadata ingestion on the Airflow cluster. We are trying to follow the architecture proposed here: https://github.com/linkedin/datahub/blob/master/docs/architecture/metadata-ingestion.md. How do we publish Hive events to Kafka and on to DataHub? If there is any example config, can you please point me to it? Thanks.
Need help here. If you can point me to any docs, I'll take a look. We are trying to see if there is a way to push events from Hive to DataHub, and how we should configure Kafka in between. Thank you
incalculable-ocean-74010:
Hello Archie, please take a look at the ingestion documentation. Here are the details for Hive: https://datahubproject.io/docs/metadata-ingestion/source_docs/hive
Once you configure the source, you need to configure the sink as well, which in this case you can point straight at DataHub's backend (GMS): https://datahubproject.io/docs/metadata-ingestion/sink_docs/datahub (see the sketch after this message)
DataHub receives the events at the GMS endpoint and internally publishes them to Kafka for queue-based processing.
You can check more info about the ingestion framework and lifecycle here: https://datahubproject.io/docs/architecture/metadata-ingestion
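[For reference, a minimal sketch of such a source-plus-sink recipe, run through the ingestion framework's programmatic Python API instead of `datahub ingest -c recipe.yml`; the Hive host, credentials, and GMS address below are placeholders to adapt, not values from this thread:]
```python
# A minimal sketch of the Hive -> DataHub recipe described above, run
# programmatically. Assumes `acryl-datahub[hive]` is installed; host_port,
# credentials, and the GMS server address are placeholders.
from datahub.ingestion.run.pipeline import Pipeline

pipeline = Pipeline.create(
    {
        "source": {
            "type": "hive",
            "config": {
                "host_port": "my-hiveserver2:10000",  # HiveServer2 endpoint
                "database": "my_db",
                "username": "user",
                "password": "pass",
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "http://localhost:8080"},  # GMS REST endpoint
        },
    }
)

pipeline.run()
pipeline.raise_from_status()  # fail loudly if any records were dropped
```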
rich-winter-40155:
Thanks @incalculable-ocean-74010. I am confused now 🙂. If I understand correctly: 1. the Hive source polls Hive to get metadata, 2. the sink ingests into DataHub's backend (GMS), 3. which publishes into Kafka. How do the Kafka events get processed? Do I need to add any connector, or does it happen within DataHub directly? I went through the doc but I am still confused about how to enable push-based ingestion from Hive. Thank you
incalculable-ocean-74010:
Once GMS has received the metadata from the Hive source, it will generate an MCE and publish it to a Kafka topic (the MCE proposal stream) that another component, the MCE consumer, will consume and process.
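[Side note: if you would rather have the ingestion job publish MCEs to that Kafka topic itself, instead of going through GMS's REST endpoint, my understanding is that the sink in the sketch above can be swapped for the `datahub-kafka` sink; broker and schema-registry addresses are placeholders:]
```python
# Hypothetical alternative sink for the Pipeline.create() sketch above:
# publish MCEs straight to the Kafka MCE topic and let the MCE consumer
# perform the actual write into DataHub. Addresses are placeholders.
kafka_sink = {
    "type": "datahub-kafka",
    "config": {
        "connection": {
            "bootstrap": "localhost:9092",
            "schema_registry_url": "http://localhost:8081",
        }
    },
}
```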
rich-winter-40155:
@incalculable-ocean-74010 Thank you. How do I configure the MCE consumer? Are these the right docs to follow: https://datahubproject.io/docs/metadata-jobs/mce-consumer-job/, or is it internal to the GMS service? We are trying to bring up all the services on non-docker, bare-bones machines.
incalculable-ocean-74010:
You do not configure the MCE consumer if using docker-compose or K8s directly; it is an internal component of DataHub whose defaults are usually good enough. If you wish to run the MCE consumer job in a non-docker, bare-bones deployment, I would suggest looking at the contents of the MCE consumer Dockerfile, which you can find here: https://github.com/linkedin/datahub/tree/master/docker/datahub-mce-consumer That, plus the docker-compose definition or K8s Helm definition used to launch the service, should give you a sense of any configuration required to launch the MCE consumer.
gorgeous-optician-32034:
@rich-winter-40155 & @incalculable-ocean-74010: I think there might be a misunderstanding here. I believe @rich-winter-40155 was initially asking how to push metadata from their Hive metastore to DataHub. The solutions you're linking to pull snapshots of the metadata at a given time. There are two things to think about here.

First, for push, I believe one has to use Hive hooks and publish Metadata Change Events (MCEs) to DataHub. This is not as well documented, but there are some emitters and examples in Java and Python. There has been some talk here in Slack about DataHub potentially writing the Hive hooks integration, but I've seen just as many calls for other community members to do it. I can say we at Wikimedia will be looking at this and potentially writing one.

Second, especially since the Python emitter above is a non-blocking publish to Kafka, there is the potential to lose some events and create an inconsistent replication of metadata. If correctness is important, it's probably a good idea to use a lambda architecture by both pushing and pulling. We will probably start with pulling and see where it takes us. It's important here to realize the difference between delta propagation and snapshot replication, explained really well by the folks behind Amundsen.
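[To make the push path concrete, a minimal sketch of what such a hook might emit via the Python Kafka emitter; the URN, broker, and schema-registry addresses are illustrative placeholders and the aspect is a trivial example, not a working Hive hook:]
```python
# Illustrative sketch of push-based ingestion: emit one MCE to Kafka, the way
# a Hive hook might on table creation. URN and connection addresses below are
# placeholders for your environment.
from datahub.emitter.kafka_emitter import DatahubKafkaEmitter, KafkaEmitterConfig
from datahub.metadata.schema_classes import (
    DatasetPropertiesClass,
    DatasetSnapshotClass,
    MetadataChangeEventClass,
)

mce = MetadataChangeEventClass(
    proposedSnapshot=DatasetSnapshotClass(
        urn="urn:li:dataset:(urn:li:dataPlatform:hive,my_db.my_table,PROD)",
        aspects=[DatasetPropertiesClass(description="Created via a Hive hook")],
    )
)

emitter = DatahubKafkaEmitter(
    KafkaEmitterConfig.parse_obj(
        {
            "connection": {
                "bootstrap": "localhost:9092",
                "schema_registry_url": "http://localhost:8081",
            }
        }
    )
)

# emit_mce_async is non-blocking (hence the consistency caveat above); the
# callback surfaces per-message delivery failures.
emitter.emit_mce_async(mce, callback=lambda err, msg: print(err or "delivered"))
emitter.flush()
```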
rich-winter-40155:
Thanks @gorgeous-optician-32034. This is exactly what we are looking for: a Hive hook. If you have an open-source hook, please share.