# all-things-deployment
Hi, I'll be implementing DataHub. My infrastructure will contain various data lakes, each with a different owner. I want everyone to be able to access the metadata from all of these data lakes: each data lake owner should be able to ingest metadata only into their own section and shouldn't be able to change other data lakes' metadata in DataHub, but should still be able to see all the metadata from all data lakes. So I thought of a centralized DataHub, but then I read about Federated Metadata Serving on the DataHub website. I'm trying to grasp this concept and want to know the advantages of implementing it instead of just ingesting the metadata from all the data lakes into one DataHub. I'd also like to know if there is any information on how to implement federated metadata serving. Thank you.
Hi @red-window-75368! Thanks for the question and sorry for not replying to your previous attempt at asking the same question 🙂
I think if your goal is primarily to prevent writers from overwriting metadata in domains they don't own, while still being able to see and explore across all domains, you would be fine implementing a central (single) DataHub strategy.
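For the write-isolation piece, DataHub's access policies can express this: grant each owner group edit privileges scoped to their own Domain, and rely on view access for everyone else. Policies are normally managed through the UI (Settings > Permissions) or the API; the YAML below is just a sketch of the intent of one such policy, and the group and domain URNs are placeholders:
```yaml
# Hypothetical sketch only -- DataHub policies are usually created in the
# UI or via the API; this illustrates the shape of a per-domain policy.
- name: "lake-a-owners-can-edit-lake-a"
  type: METADATA
  state: ACTIVE
  actors:
    groups:
      - "urn:li:corpGroup:lake-a-owners"   # placeholder owner group
  privileges:
    - "EDIT_ENTITY"                        # write access to entities...
  resources:
    filter:
      domain: "urn:li:domain:lake-a"       # ...scoped to this domain only
```
Out of the box, DataHub's default policies generally allow all users to view entity pages, so in practice you'd mainly be adding one scoped edit policy like this per data lake / domain.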
What you could do from an operational perspective is segregate the input pipes coming into the central DataHub by putting Kafka in front of DataHub. This is useful when you don't want remote HTTP connections going from different environments into a single central service, and it also decouples the availability of the metadata stream from the metadata service itself.
Depending on whether your company has a multi-environment deployment of Kafka, you could opt for this topology:
```
[Data Lake A] -> [Kafka local to A] -> (MirrorMaker) -> [Kafka aggregate] -> [DataHub central]
```
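On the ingestion side, each data lake's recipes would then point their sink at the lake-local Kafka instead of the central GMS over HTTP. A minimal sketch of such a recipe, assuming a Glue source and placeholder broker / schema-registry addresses (swap in whatever source type and hostnames you actually use):
```yaml
# Sketch of an ingestion recipe for Data Lake A. The source type and
# all connection addresses below are illustrative placeholders.
source:
  type: glue
  config:
    aws_region: "us-east-1"

sink:
  type: datahub-kafka            # emit metadata change proposals to Kafka
  config:
    connection:
      bootstrap: "kafka-lake-a:9092"                   # lake-local broker
      schema_registry_url: "http://schema-reg-a:8081"  # lake-local registry
```
MirrorMaker then only needs to replicate the DataHub topics (e.g. `MetadataChangeProposal_v1`) from each lake-local cluster into the aggregate cluster that the central DataHub consumes from.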
Happy to hop on a call if you'd like to discuss this pattern further!