# getting-started
  • square-activity-64562

    08/13/2021, 7:57 AM
    It seems some avatar images are missing in the DataHub demo project.
  • few-doctor-33619

    08/13/2021, 1:59 PM
    Hi! Are there plans in the near future to have demo data in the ML models/groups sections in the live demo?
  • mammoth-bear-12532

    08/13/2021, 7:00 PM
    <!here> 📣 Release 0.8.10 is here. It fixes the issues reported by the community on 0.8.9:
    • mae-consumer and datahub-upgrade should now run without issues.
    • The pip package has been upgraded to 0.8.10.0; run `pip install -U acryl_datahub` to get the default timeout increase.
    • Helm charts have been updated as well.
    Let us know if you face any issues!
    ➕ 2
    🙌 4
  • big-carpet-38439

    08/15/2021, 4:18 PM
    Welcome @dazzling-fish-16225!!
  • ripe-horse-50846

    08/16/2021, 3:21 AM
    Has anyone tested DataHub connectivity with GCP?
  • ambitious-airline-8020

    08/16/2021, 6:45 AM
    Hi team, sorry for the cross-post. I see similar questions in a search, but did not find an answer that works for me: https://datahubspace.slack.com/archives/C01HPV8EKTK/p1628836962028100
  • square-activity-64562

    08/16/2021, 10:26 AM
    Is there some way to map entities to GitHub? E.g. we have datasets which are domain models, and those domain models are maintained in GitHub repositories. It would be good to be able to link the two together and show an icon or button, like the "View in Airflow" button for Airflow pipelines, which links DataHub back to Airflow.
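One way to approximate a "View in GitHub" button today is to attach the repository URL to the dataset as a link. A minimal Python sketch, assuming the `institutionalMemory` aspect shape used in DataHub's MCE JSON files; the repo URL, actor, and helper name below are made-up examples, not an official API:

```python
# Hypothetical sketch: build an institutionalMemory aspect payload that
# attaches a "View in GitHub" link to a dataset. The JSON shape mirrors
# the aspect entries seen in DataHub's sample MCE files; verify against
# your DataHub version before emitting it.
import json
import time


def github_link_aspect(repo_url: str, description: str = "View in GitHub") -> dict:
    """Return an institutionalMemory aspect dict linking a dataset to a repo."""
    return {
        "com.linkedin.common.InstitutionalMemory": {
            "elements": [
                {
                    "url": repo_url,
                    "description": description,
                    # audit stamp: epoch millis plus the acting user (placeholder)
                    "createStamp": {
                        "time": int(time.time() * 1000),
                        "actor": "urn:li:corpuser:datahub",
                    },
                }
            ]
        }
    }


aspect = github_link_aspect("https://github.com/acme/domain-models")
print(json.dumps(aspect, indent=2))
```

The resulting dict can be dropped into an MCE file or emitted through the ingestion framework; the links typically render in the dataset's sidebar rather than as a dedicated button.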
  • square-activity-64562

    08/16/2021, 10:59 AM
    I noticed in the recent town hall that the ML features showed a primary key for a column in the schema. Will that be added to datasets (primary and foreign keys)? Basically, it would be great if we had something like "Answer how to best use this table — with queries" from https://medium.com/amundsen-io/amundsen-monthly-update-may-2021-d384e3d1be50. Frequent joins can be shown via queries, but even without queries we should be able to record in the schema itself which column can be joined with what. And for the cases where this information is spread across databases, we can add it manually.
  • orange-airplane-6566

    08/16/2021, 2:56 PM
    Would admins here consider disabling join and leave messages? These are the messages like "[person] joined #general along with 3 others". In my experience, those messages don't provide much value for spaces this big, but they do have a noticeable impact on the time it takes to refresh your messages when loading Slack.
  • big-carpet-38439

    08/16/2021, 7:45 PM
    Who is using Google Cloud Identity or Google Workspace for managing their SSO?
    🙋‍♂️ 1
  • abundant-dinner-2901

    08/17/2021, 10:03 AM
    @chilly-holiday-80781 @mammoth-bear-12532 I tried to reproduce the usage of MLModel similar to the presentation from the July 23rd meeting, but I cannot get the same lineage for Training dataset -> MLModel as was presented in the meeting. Below you can see screenshots from the `bootstrap_mce.json` in `metadata-ingestion`. I had to add an MLModelGroup to make the `scienceModel` selectable in the UI. These are the issues I found with this sample `bootstrap_mce.json`:
    1. The `scienceModel` MLModel details page doesn't have any lineage buttons or any description of connections to training/evaluation datasets (compare it with the view from the demo). The ML Group has a View Graph button, so the Model -> Group lineage is visible in the graph.
    2. Browse ML Models and ML Groups don't show any entities, but those entities are searchable.
    3. The MLModel has `TrainingData` and `EvaluationData` defined, but this is not visible in the model details, and it's also not visible in the dataset details. I added the `pageViewsHive` dataset to the ingestion sample, but it didn't show any relations to `scienceModel` in the lineage graph or in the Downstream section of the lineage details.
    I tested this on `v0.8.8`, `v0.8.10`, and `master` from yesterday, with both `elasticsearch` and `neo4j` graph storage, with the same results. What have I missed to make it work?
  • breezy-guitar-97226

    08/18/2021, 3:55 PM
    Hi folks! We are currently evaluating DataHub as a more modular and lightweight alternative to Apache Atlas. My question is about the best practices to follow when modelling our custom business entities. When using Atlas, we used to model each business entity as a new type inheriting from a common supertype: Dataset. To the subtypes we would then add their specific fields (aspects). Search and lineage would also work out of the box, because each of these subtypes would also be (extend) the Dataset supertype, and e.g. relationships are defined with a Dataset type as source and destination. In this scenario, each new business entity would be mapped to a different subtype, all of them ultimately being Datasets. In DataHub, afaik, this concept of inheritance does not exist as such. What would then be the best way to model custom business entities (all conceptually datasets, but with different properties) according to its data model? I found two options:
    • Onboard a new DataHub entity for each of our custom business entities. This looks fine in principle; however, it is immediately noticeable how lineage and search are closed off and defined within the DataHub Dataset entity itself.
    • Add new aspects to the existing DataHub Dataset entity to cover the peculiar characteristics of our custom business entities. This looks fine too at first glance; however, I'm worried that it could drift into a quite messy flat structure with tons of different aspects, which I believe can be quite difficult to maintain.
    We also thought about using custom properties, but the lack of a schema, plus the impossibility of searching against them, does not make this solution very attractive. This said, I would be very grateful if you could make me aware of any best practices to follow in this case. I apologise in advance if I missed some points or if the explanation above contains imprecisions! Many thanks!
    👍 1
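For context on the second option: DataHub defines aspects in Pegasus (PDL) schemas. A hypothetical sketch of a custom aspect attached to the existing Dataset entity, with an invented namespace, record name, and fields purely for illustration:

```
namespace com.acme.metadata

/**
 * Hypothetical custom aspect carrying the business-entity fields
 * that would have become Dataset subtypes in Atlas.
 */
record BusinessEntityInfo {
  /** Which business concept this dataset represents */
  businessEntityType: string

  /** Searchable business identifier */
  businessKey: string

  /** Team that owns the business definition */
  owningTeam: optional string
}
```

Because every such dataset stays a Dataset, lineage and search keep working, at the cost of the flat-aspect sprawl the message above worries about.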
  • narrow-kitchen-1309

    08/19/2021, 5:50 PM
    Hi folks, I would like to raise a question: what's the vision for supporting Egeria open standards in DataHub? Let me know if there are any considerations or timelines for this feature.
  • breezy-guitar-97226

    08/20/2021, 2:57 PM
    Hi here, I was playing a bit with the Python ingestion library, and I found the way the Python classes are generated from the data model very convenient: they let you interact with high-level objects, and it is really a pleasure to work with. However, due to the problem I'm solving, ingesting data is only one of the aspects I have to deal with. The other is getting and showing information from the DataHub GMS API. This side of the problem is proving a little more challenging, as there is no ready-to-use library (afaik) to convert the JSON documents into Python objects. Such a library, imho, would result in cleaner code (no need to mix two different clients with different abstraction levels) and would also be very convenient for seamless get-modify-ingest operations. Therefore I have a couple of questions:
    • Is there any plan to add a REST client using the same high-level Python object abstractions?
    • Is there currently a way to deserialise the JSON objects from the REST API into Python classes?
    Many thanks!
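Until a typed REST client exists, one stopgap is to hand-roll dataclasses for the slices of the GMS response you care about. A minimal stdlib-only sketch, assuming a simplified response shape; the field names and payload below are illustrative, not the actual GMS schema:

```python
# Sketch: deserialize a (simplified, hypothetical) GMS JSON payload into
# typed Python objects using only the standard library.
import json
from dataclasses import dataclass
from typing import List


@dataclass
class SchemaField:
    fieldPath: str
    nativeDataType: str


@dataclass
class DatasetSnapshot:
    urn: str
    fields: List[SchemaField]


def parse_dataset(payload: str) -> DatasetSnapshot:
    """Convert a JSON document into a DatasetSnapshot, typing each field."""
    doc = json.loads(payload)
    fields = [
        SchemaField(f["fieldPath"], f["nativeDataType"])
        for f in doc.get("fields", [])
    ]
    return DatasetSnapshot(urn=doc["urn"], fields=fields)


raw = (
    '{"urn": "urn:li:dataset:(urn:li:dataPlatform:hive,pageViewsHive,PROD)",'
    ' "fields": [{"fieldPath": "userId", "nativeDataType": "string"}]}'
)
ds = parse_dataset(raw)
print(ds.urn)  # -> urn:li:dataset:(urn:li:dataPlatform:hive,pageViewsHive,PROD)
```

This keeps the get-modify-ingest loop in one abstraction level, though it must be kept in sync with the server's schema by hand.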
  • big-carpet-38439

    08/23/2021, 4:40 PM
    Good morning Community! Hope everyone has / had a great Monday 🙂
    hihi 4
  • nutritious-bird-77396

    08/23/2021, 9:37 PM
    Hi! Could someone provide a sample rest.li call for the BrowsePaths endpoint? I don't see it here: https://github.com/linkedin/datahub/tree/master/metadata-service#sample-api-calls
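Whatever the exact endpoint turns out to be, rest.li calls that take an entity URN need it percent-encoded, since URNs contain colons, commas, and parentheses. A sketch of building such a request URL; the `/browsePaths` path is an assumption for illustration, not confirmed from the metadata-service docs:

```python
# Sketch: percent-encode a DataHub URN for use in a rest.li request URL.
# The endpoint path is hypothetical; only the encoding step is the point here.
from urllib.parse import quote


def browse_paths_url(gms_host: str, entity_urn: str) -> str:
    # safe='' forces ':', '(', ')', and ',' to all be percent-encoded
    return f"{gms_host}/browsePaths?urn={quote(entity_urn, safe='')}"


url = browse_paths_url(
    "http://localhost:8080",
    "urn:li:dataset:(urn:li:dataPlatform:hive,SampleHiveDataset,PROD)",
)
print(url)
```

The encoded URN can then be pasted into a curl command against GMS once the correct endpoint name is known.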
  • mammoth-bear-12532

    08/24/2021, 1:58 AM
    <!here> 📣 DataHub Town Hall is this Friday!
    • When: Aug 27th at 9am US PT 🕘
    • Signup to get a calendar invite: here
    • Town Hall Zoom: https://zoom.datahubproject.io
    • Agenda:
      ◦ Project updates by Shirshanka
        ▪︎ Aug release highlights
      ◦ Demo: Role-Based Access Control by @big-carpet-38439 (Acryl Data)
      ◦ Case study: DataHub and Redash by @square-greece-86505 (Warung Pintar Group)
      ◦ Deep dive: Performance monitoring by @early-lamp-41924 (Acryl Data)
      ◦ One more thing: Surprise!
    🎉 8
    🚀 4
  • plain-football-97599

    08/24/2021, 11:31 AM
    @mammoth-bear-12532 Is there a resource we could use to correct or update comparisons of DataHub with other projects/tools, if needed? E.g. this article from early 2020: https://medium.com/bigeye/data-discovery-in-2020-8c85eed328bb
  • future-smartphone-53257

    08/24/2021, 1:32 PM
    Hi, https://datahubproject.io/docs/rfc/active/2042-graphql_frontend/queries/ uses the acronym GQL. Does that refer to GraphQL or to GQL? The page seems to switch between the two, and it's not clear whether the author simply decided to call GraphQL "GQL" or is actually talking about GQL.
  • handsome-football-66174

    08/24/2021, 7:10 PM
    General: I was able to connect Airflow and run a generic recipe (following this guide: https://datahubproject.io/docs/metadata-ingestion/#lineage-with-airflow). How do we view the Airflow DAGs in DataHub?
  • high-hospital-85984

    08/25/2021, 6:01 AM
    A quick question about the GraphQL and REST API moves to what used to be GMS. The release note says: "Container names are not changed, but folks using specific read-write permissions on Kafka topics might need to expand them for datahub-metadata-service." Is this related to collecting the DataHub usage metrics, or something else?
  • square-dream-22991

    08/25/2021, 3:53 PM
    Hi, how do I change the username and password when using the Docker quickstart?
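For reference, in the quickstart the datahub-frontend container checks logins against a JAAS PropertyFileLoginModule backed by a `user.props` file, so one approach is to mount a modified copy of that file into the container. A sketch of the file's format, with placeholder passwords; confirm the file path and mount point against your DataHub version's docs:

```
# user.props: one "username:password" entry per line.
# The quickstart default is datahub:datahub; replace or extend it here.
datahub:MyNewStrongPassword
data_steward:AnotherPassword
```

After mounting the file, restart the frontend container so the new credentials take effect.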
  • crooked-toddler-8683

    08/25/2021, 8:45 PM
    Hello everyone. I had to reinstall Ubuntu on my box and am now working on installing DataHub. I have Ubuntu 20.04, Docker 20.10.7, and docker-compose 1.29.2, and installed everything required before the "docker quickstart" step. When I try to do python3 -m datahub docker quickstart I get this:
  • square-activity-64562

    08/26/2021, 8:32 AM
    It would be great if, as part of the pull requests themselves, we could record in a common file all the deprecations or breaking changes that are happening. Creating release notes from commits is OK, but since people run DataHub in production, they need to know about any deprecations, breaking changes, config changes, or new configs. This will become a serious concern once people start using DataHub heavily: everyone running DataHub in production would have to go over all commits to find out whether something is a breaking change. Apache Superset does this. They have a CHANGELOG (https://github.com/apache/superset/blob/master/CHANGELOG.md), which is generated from commits, and a separate UPDATING.md file that contains the breaking changes (https://github.com/apache/superset/blob/master/UPDATING.md). The UPDATING file is updated in every PR that has a breaking change; it is part of their process. During a release, all of those can optionally be added to the release notes. It would be great if DataHub could adopt a similar process. Otherwise, updating DataHub in production will carry risk every time.
    ➕ 5
  • clever-river-85776

    08/26/2021, 8:52 AM
    Hi all. What mechanisms exist for publishing a glossary to DataHub? And is the pattern that people would edit / curate these definitions within DataHub itself (I can't find any edit tools in the UI), or in an external tool?
  • thousands-tailor-5575

    08/26/2021, 2:29 PM
    Hi everyone, calling anyone who has experience spreading self-service DataHub usage throughout their organisation. I would appreciate any tips on how to make adoption easier. (E.g. are you using the available connectors to extract metadata from databases and allowing them to be used by anyone, or are you building abstracted APIs on top to have more control over what users can do?)
    ➕ 1
  • modern-napkin-86408

    08/27/2021, 4:47 PM
    A bit of feedback on the fine-grained access control: it would be great if changes were blocked before submission. I can imagine it would be frustrating for a user to spend a lot of time editing a text field, then lose their work when they try to submit. I realize this may be non-trivial, given that the policy is currently enforced at write time.
    ➕ 1
  • square-activity-64562

    08/27/2021, 5:01 PM
    Questions/suggestions about today's demo:
    • How will we log in as the datahub user to set up the initial policies when login is done via OIDC?
    • View restrictions will be added, right? Or is viewing open to everyone? I am not 100% sure we would be OK with everyone being able to view everything.
    • One suggestion for huge lineage graphs is to look at the graph library used by https://obsidian.md/. The tool itself is not open source, but on their forums (https://forum.obsidian.md/) they mention the libraries used. It renders a directed graph. The first image shows the number of nodes in my personal graph; the second is a zoomed-in view showing the arrows.
  • crooked-toddler-8683

    08/27/2021, 8:11 PM
    Any suggestions on how to implement an authentication/authorization piece if you've never worked with one before? What's the easiest way to start developing it? I will be creating it exclusively to support the DataHub app; we use Active Directory (non-Azure) for the rest of our apps.
  • nutritious-bird-77396

    08/27/2021, 9:11 PM
    We are hitting the Kafka broker's message size limit when a huge schema is pushed in our MAE (the message comes via the GMS API). Are there any plans to compress messages sent to / read from Kafka by GMS?
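Independent of any DataHub-specific plans, plain Kafka already offers knobs for this: producer-side compression and larger size caps. A sketch of the relevant standard Kafka settings; the 5 MB values are example numbers, and how to surface these settings through a GMS deployment is environment-specific:

```
# Producer side (the client emitting MAEs): compress batches before sending
compression.type=gzip
# Raise the producer's per-request cap if single messages are still large (bytes)
max.request.size=5242880

# Topic/broker side: allow larger messages on the MAE topic (bytes)
max.message.bytes=5242880
```

Compression often resolves schema-blob size problems on its own, since large schemas are highly repetitive JSON and compress well.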