# ingestion
rich-winter-40155:
Hi All, we are setting up Hive metadata ingestion on the Airflow cluster. We are trying to follow the architecture proposed here: https://github.com/linkedin/datahub/blob/master/docs/architecture/metadata-ingestion.md. How do we publish Hive events to Kafka and on to DataHub? If there is any example config, can you please point me to it? Thanks.
Need help here. If you can point me to any docs, I'll take a look. We are trying to see if there is a way to push events from Hive to DataHub, and how we should configure Kafka in between. Thank you
incalculable-ocean-74010:
Hello Archie, please take a look at the ingestion documentation. Here are the details for Hive: https://datahubproject.io/docs/metadata-ingestion/source_docs/hive
Once you configure the source, you need to configure the sink as well, which in this case you can point straight at DataHub's backend (GMS): https://datahubproject.io/docs/metadata-ingestion/sink_docs/datahub (see the sketch after this message)
DataHub receives the events at the GMS endpoint and internally publishes them to Kafka for queue-based processing.
You can check more info about the ingestion framework and lifecycle here: https://datahubproject.io/docs/architecture/metadata-ingestion
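[For reference, a minimal sketch of such a source-plus-sink recipe, run through the ingestion framework's programmatic Python API instead of `datahub ingest -c recipe.yml`; the Hive host, credentials, and GMS address below are placeholders to adapt, not values from this thread:]
```python
# A minimal sketch of the Hive -> DataHub recipe described above, run
# programmatically. Assumes `acryl-datahub[hive]` is installed; host_port,
# credentials, and the GMS server address are placeholders.
from datahub.ingestion.run.pipeline import Pipeline

pipeline = Pipeline.create(
    {
        "source": {
            "type": "hive",
            "config": {
                "host_port": "my-hiveserver2:10000",  # HiveServer2 endpoint
                "database": "my_db",
                "username": "user",
                "password": "pass",
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "http://localhost:8080"},  # GMS REST endpoint
        },
    }
)

pipeline.run()
pipeline.raise_from_status()  # fail loudly if any records were dropped
```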
rich-winter-40155:
Thanks @incalculable-ocean-74010. I am confused now 🙂. If I understand correctly: 1. the Hive source polls Hive to get metadata, 2. the sink ingests into DataHub's backend (GMS), 3. which publishes into Kafka. How do the Kafka events get processed? Do I need to add any connector, or does it happen within DataHub directly? I went through the doc but I am still confused about how to enable push-based ingestion from Hive. Thank you
incalculable-ocean-74010:
Once GMS has received the metadata from the Hive source, it will generate an MCE and publish it to a Kafka topic (the MCE proposal stream) that another component, the MCE consumer, will consume and process.
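[Side note: if you would rather have the ingestion job publish MCEs to that Kafka topic itself, instead of going through GMS's REST endpoint, my understanding is that the sink in the sketch above can be swapped for the `datahub-kafka` sink; broker and schema-registry addresses are placeholders:]
```python
# Hypothetical alternative sink for the Pipeline.create() sketch above:
# publish MCEs straight to the Kafka MCE topic and let the MCE consumer
# perform the actual write into DataHub. Addresses are placeholders.
kafka_sink = {
    "type": "datahub-kafka",
    "config": {
        "connection": {
            "bootstrap": "localhost:9092",
            "schema_registry_url": "http://localhost:8081",
        }
    },
}
```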
rich-winter-40155:
@incalculable-ocean-74010 Thank you. How do I configure the MCE consumer? Are these the right docs to follow: https://datahubproject.io/docs/metadata-jobs/mce-consumer-job/, or is it internal to the GMS service? We are trying to bring up all the services on non-docker, bare-bones machines.
incalculable-ocean-74010:
You do not configure the MCE consumer if using docker-compose or K8s directly; it is an internal component of DataHub whose defaults are usually good enough. If you wish to run the MCE consumer job in a non-docker, bare-bones deployment, I would suggest looking at the contents of the MCE consumer Dockerfile, which you can find here: https://github.com/linkedin/datahub/tree/master/docker/datahub-mce-consumer That, plus the docker-compose definition or K8s Helm definition used to launch the service, should give you a sense of any configuration required to launch the MCE consumer.
gorgeous-optician-32034:
@rich-winter-40155 & @incalculable-ocean-74010: I think there might be a misunderstanding here. I believe @rich-winter-40155 was initially asking how to push metadata from their Hive metastore to DataHub. The solutions you're linking to pull snapshots of the metadata at a given time. There are two things to think about here.

First, for push, I believe one has to use Hive hooks and publish Metadata Change Events (MCEs) to DataHub. This is not as well documented, but there are some emitters and examples in Java and Python. There has been some talk here in Slack about DataHub potentially writing the Hive hooks integration, but I've seen just as many calls for other community members to do it. I can say we at Wikimedia will be looking at this and potentially writing one.

Second, especially since the Python emitter above is a non-blocking publish to Kafka, there is the potential to lose some events and create an inconsistent replication of metadata. If correctness is important, it's probably a good idea to use a lambda architecture by both pushing and pulling. We will probably start with pulling and see where it takes us. It's important here to realize the difference between delta propagation and snapshot replication, explained really well by the folks behind Amundsen.
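[To make the push path concrete, a minimal sketch of what such a hook might emit via the Python Kafka emitter; the URN, broker, and schema-registry addresses are illustrative placeholders and the aspect is a trivial example, not a working Hive hook:]
```python
# Illustrative sketch of push-based ingestion: emit one MCE to Kafka, the way
# a Hive hook might on table creation. URN and connection addresses below are
# placeholders for your environment.
from datahub.emitter.kafka_emitter import DatahubKafkaEmitter, KafkaEmitterConfig
from datahub.metadata.schema_classes import (
    DatasetPropertiesClass,
    DatasetSnapshotClass,
    MetadataChangeEventClass,
)

mce = MetadataChangeEventClass(
    proposedSnapshot=DatasetSnapshotClass(
        urn="urn:li:dataset:(urn:li:dataPlatform:hive,my_db.my_table,PROD)",
        aspects=[DatasetPropertiesClass(description="Created via a Hive hook")],
    )
)

emitter = DatahubKafkaEmitter(
    KafkaEmitterConfig.parse_obj(
        {
            "connection": {
                "bootstrap": "localhost:9092",
                "schema_registry_url": "http://localhost:8081",
            }
        }
    )
)

# emit_mce_async is non-blocking (hence the consistency caveat above); the
# callback surfaces per-message delivery failures.
emitter.emit_mce_async(mce, callback=lambda err, msg: print(err or "delivered"))
emitter.flush()
```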
rich-winter-40155:
Thanks @gorgeous-optician-32034. This is exactly what we are looking for: a Hive hook. If you have an open-source hook, please share.