# ingestion
t
Hi guys, calling out to anyone who has experience spreading self-service DataHub usage throughout their organisation. I would appreciate any tips on how to make adoption easier. (E.g. are you using the available connectors to extract metadata from databases and allow it to be used by anyone, or are you building abstracted APIs on top to have more control over what users can do?)
b
We're planning to use a data catalog and have some thoughts on how we're gonna do it. Can you explain your second point?
t
There are numerous connectors that can connect to different databases, so it is easy for a central Data Platform team to extract the metadata and load it into the Data Catalog. However, in this process you might want to enrich the metadata with some mandatory data (for example ownership and lineage). If these metadata collection libraries were given to users as they are, it would be very hard to maintain order in the metadata. How are you planning to manage these aspects? In what form would you expect the teams to provide the lineage data in cases where it is not available from the connector? Would you require it to come in the form that DataHub expects, or would you maybe construct an API endpoint which takes in human-readable input and transforms it into the form DataHub expects?
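To make that last option concrete, here is a minimal sketch of what such an endpoint could look like, assuming FastAPI for the HTTP layer and the acryl-datahub emitter underneath (the server URL, default platform, and payload shape are made-up placeholders, not a reference implementation):

```python
# Hypothetical endpoint: accept a human-readable lineage payload and
# translate it into the UpstreamLineage aspect that DataHub expects.
from typing import List

from fastapi import FastAPI
from pydantic import BaseModel

from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    DatasetLineageTypeClass,
    UpstreamClass,
    UpstreamLineageClass,
)

app = FastAPI()
emitter = DatahubRestEmitter(gms_server="http://localhost:8080")  # assumed GMS address


class LineageRequest(BaseModel):
    downstream: str       # e.g. "mydb.analytics.view_1"
    upstreams: List[str]  # e.g. ["mydb.raw.table1", "mydb.raw.table2"]
    platform: str = "postgres"


@app.post("/lineage")
def add_lineage(req: LineageRequest) -> dict:
    # Convert the friendly names into URNs and build the lineage aspect.
    upstreams = [
        UpstreamClass(
            dataset=make_dataset_urn(req.platform, name),
            type=DatasetLineageTypeClass.TRANSFORMED,
        )
        for name in req.upstreams
    ]
    emitter.emit(
        MetadataChangeProposalWrapper(
            entityUrn=make_dataset_urn(req.platform, req.downstream),
            aspect=UpstreamLineageClass(upstreams=upstreams),
        )
    )
    return {"status": "ok"}
```

The point is that users only ever send plain table and view names; the URN construction and aspect shapes stay hidden behind the endpoint.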
s
We are starting out with DataHub, so sharing thoughts rather than experience. TL;DR: I am hoping to slowly introduce things which solve one of people's problems. That should give a positive impression, and they may start using it for at least that purpose, and hopefully for more things later.
• We use Apache Superset heavily in the org for analytics. As part of the central data team I have ingested the databases which are added in Superset. That gave us a starting point, as people were mostly already aware those things exist. They have questions about them which I am hoping to get answered via DataHub.
• Initially it did not have anything: no lineage, no ownership. I had to talk to a few people to get the correct owners added. Made it a bit simpler by making teams the owners instead of individual people (see the sketch after this message). Later we may add individuals as primary owners.
• I had to change the browse paths so that the structure of DataHub feels similar to what people are already used to in Superset.
• There is a lot of institutional knowledge which gets thrown around on Slack, like:
    ◦ "use this table for THIS instead of that, it will be faster"
    ◦ "this table is not reliable, it has values missing in this column"
• Whenever I hear such institutional knowledge, I ask the people discussing it whether they could add it in DataHub, giving them a direct link to the relevant tables. I am doing it myself for some of the important tables so that people have an idea what to write and can see examples.
• Show inside the company how it is solving a problem for me and for some of the other people.
Still quite early days and very few people are using it. We have not rolled it out to the whole company yet because access control is not there yet. Hoping to add some useful things to it before the roll-out happens.
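For what it's worth, making a team the owner is a small call against DataHub's Python emitter; a rough sketch, assuming the acryl-datahub package (the group name, platform, dataset, and server URL are all made up):

```python
# Sketch: assign a team (corp group) as owner of a dataset,
# rather than an individual person.
from datahub.emitter.mce_builder import make_dataset_urn, make_group_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    OwnerClass,
    OwnershipClass,
    OwnershipTypeClass,
)

emitter = DatahubRestEmitter(gms_server="http://localhost:8080")  # assumed

ownership = OwnershipClass(
    owners=[
        OwnerClass(
            owner=make_group_urn("analytics-team"),  # a team URN, not a person
            type=OwnershipTypeClass.DATAOWNER,
        )
    ]
)
emitter.emit(
    MetadataChangeProposalWrapper(
        entityUrn=make_dataset_urn("postgres", "mydb.analytics.view_1"),
        aspect=ownership,
    )
)
```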
Regarding the lineage data, I have added some thin functions on top of DataHub's API and am using them. Hoping to refine them to be easy to use so others can also use them. E.g. I added a simple config file which contains table -> view lineage, to track dependencies:
```python
# Each view maps to the upstream tables it depends on.
LINEAGE = dict()
LINEAGE['VIEW_1'] = ['TABLE1', 'TABLE2']
LINEAGE['VIEW_2'] = ['TABLE1', 'TABLE3']
```
This is used by a small CLI which we also use to send lineage on an ad-hoc basis through Jenkins, when needed. Should be quite simple to add an API on top of such things. DataHub's ingestion codebase is easy to extend, so it should not be a problem to do such stuff if you wanted to add an API.
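Something like the following would cover the core of such a function; a simplified sketch assuming the acryl-datahub REST emitter (the platform and server values are placeholders, and real code would want error handling):

```python
# Sketch: emit a LINEAGE dict like the one above as upstream-lineage
# aspects, one MetadataChangeProposal per view.
from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    DatasetLineageTypeClass,
    UpstreamClass,
    UpstreamLineageClass,
)


def send_lineage(lineage: dict, platform: str = "postgres",
                 server: str = "http://localhost:8080") -> None:
    emitter = DatahubRestEmitter(gms_server=server)
    for view, tables in lineage.items():
        emitter.emit(
            MetadataChangeProposalWrapper(
                entityUrn=make_dataset_urn(platform, view),
                aspect=UpstreamLineageClass(
                    upstreams=[
                        UpstreamClass(
                            dataset=make_dataset_urn(platform, table),
                            type=DatasetLineageTypeClass.TRANSFORMED,
                        )
                        for table in tables
                    ]
                ),
            )
        )


send_lineage({'VIEW_1': ['TABLE1', 'TABLE2'], 'VIEW_2': ['TABLE1', 'TABLE3']})
```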
b
Note that we also intend to build a Python SDK, leveraging our GraphQL API under the hood, to support more understandable operations (add a tag, add an owner, add lineage, etc.). Timeline TBD, but this SDK may serve as a useful foundation around which to build custom infra.
👍 1
It will differ from the ingestion SDK in that it will provide two-way communication with DataHub: as opposed to just pushing metadata in, you'll also be able to query for metadata.
👍 1
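In the meantime, something along those lines is already possible with the DataHubGraph client in the existing acryl-datahub package; a minimal sketch, assuming a recent library version (the server URL and dataset URN are placeholders):

```python
# Sketch: two-way access via the DataHubGraph client -
# read an aspect back instead of only pushing metadata in.
from datahub.ingestion.graph.client import DataHubGraph, DatahubClientConfig
from datahub.metadata.schema_classes import UpstreamLineageClass

graph = DataHubGraph(DatahubClientConfig(server="http://localhost:8080"))  # assumed

urn = "urn:li:dataset:(urn:li:dataPlatform:postgres,mydb.analytics.view_1,PROD)"
lineage = graph.get_aspect(entity_urn=urn, aspect_type=UpstreamLineageClass)
if lineage:
    for upstream in lineage.upstreams:
        print(upstream.dataset)  # URNs of the tables feeding this view
```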