@rich-winter-40155 & @incalculable-ocean-74010: I think there might be a misunderstanding here. I believe @rich-winter-40155 was initially asking how to push metadata from their Hive metastore to DataHub; the solutions you're linking to pull snapshots of the metadata at a given point in time. There are two things to think about here.
First, for push, I believe one has to use Hive hooks and publish Metadata Change Events (MCEs) to DataHub. This is not as well documented, but there are some emitters and examples in Java and Python. There is some talk here in Slack about DataHub potentially writing the Hive hooks integration, but I've seen just as many calls for other community members to do it. I can say we at Wikimedia will be looking at this and may end up writing one.
Second, especially since the Python emitter above does a non-blocking publish to Kafka, there is the potential to lose some events and end up with an inconsistent replica of the metadata. If correctness is important, it's probably a good idea to use a lambda architecture: push deltas as they happen and periodically pull snapshots to reconcile (rough sketch below). We will probably start with pulling and see where it takes us. It's important here to realize the difference between delta propagation and snapshot replication, which is explained really well by the folks behind Amundsen.
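To make the lambda idea concrete, here is what the periodic reconciliation ("pull") side could look like next to the push path. The three helper functions are hypothetical placeholders for metastore and DataHub client plumbing, not real APIs; the point is just that a scheduled snapshot diff catches whatever the fire-and-forget push dropped:

```python
def list_hive_tables() -> set:
    """Hypothetical: full snapshot of table names from the Hive metastore."""
    raise NotImplementedError

def list_datahub_tables() -> set:
    """Hypothetical: table names DataHub currently knows about."""
    raise NotImplementedError

def build_mce_for_table(table):
    """Hypothetical: build an MCE like the one in the sketch above."""
    raise NotImplementedError

def reconcile(emitter):
    # Anything the push path lost (or never saw) gets re-emitted here.
    for table in list_hive_tables() - list_datahub_tables():
        emitter.emit_mce_async(
            build_mce_for_table(table),
            callback=lambda err, msg: err and print(f"still failing: {err}"),
        )
    emitter.flush()
```

Run that on a schedule and the push path only has to be best-effort; tables dropped on the Hive side would additionally need soft-deletes in DataHub, which I'm leaving out here.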