# getting-started
  • f

    fresh-waiter-73197

    08/17/2022, 9:51 AM
    Hi everyone, I've just started my DataHub journey. How can I synchronize field descriptions throughout the whole lineage? For example: a table "revenue" references a table "products" that already has a description for "product_name". How can I adopt this description into my "revenue" table?
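    A minimal sketch, assuming a REST GMS at localhost:8080 and placeholder platform/table/field names, of copying a column description from the upstream table to the downstream one with the DataHub Python SDK (exact method signatures can vary slightly between CLI versions):
    # Hedged sketch: copy the "product_name" description from "products" onto "revenue".
    from datahub.emitter.mce_builder import make_dataset_urn
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.ingestion.graph.client import DataHubGraph, DatahubClientConfig
    from datahub.metadata.schema_classes import (
        EditableSchemaFieldInfoClass,
        EditableSchemaMetadataClass,
        SchemaMetadataClass,
    )

    graph = DataHubGraph(DatahubClientConfig(server="http://localhost:8080"))

    products_urn = make_dataset_urn(platform="postgres", name="shop.products", env="PROD")
    revenue_urn = make_dataset_urn(platform="postgres", name="shop.revenue", env="PROD")

    # Read the upstream schema and pick up the description of "product_name".
    schema = graph.get_aspect(entity_urn=products_urn, aspect_type=SchemaMetadataClass)
    description = next(f.description for f in schema.fields if f.fieldPath == "product_name")

    # Write it onto the matching column of the downstream table as an editable (UI) description.
    mcp = MetadataChangeProposalWrapper(
        entityUrn=revenue_urn,
        aspect=EditableSchemaMetadataClass(
            editableSchemaFieldInfo=[
                EditableSchemaFieldInfoClass(fieldPath="product_name", description=description)
            ]
        ),
    )
    graph.emit_mcp(mcp)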
  • q

    quiet-wolf-56299

    08/17/2022, 1:43 PM
    Am I missing something? I added a transformer to an ingestion file to mark the data as test, meaning it should go into the test environment. I ran the ingestion, and if I click through to the source and an object directly I can see it's in test, but under Explore your metadata -> Datasets the "test" environment doesn't show up, only Prod. Did I miss something in the docs for setting it up so that test shows? I'm using the default root datahub user, so it shouldn't be a permissions issue.
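    For comparison, a minimal sketch (source type, credentials, and env value are placeholders) of pinning the environment directly in the source config of a programmatic recipe, which is an alternative to using a transformer; the same keys apply in a YAML recipe:
    # Hedged sketch: run an ingestion recipe whose datasets land in the TEST environment.
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(
        {
            "source": {
                "type": "postgres",
                "config": {
                    "host_port": "localhost:5432",
                    "database": "example_db",
                    "username": "user",
                    "password": "pass",
                    "env": "TEST",  # fabric/environment stamped on emitted datasets
                },
            },
            "sink": {
                "type": "datahub-rest",
                "config": {"server": "http://localhost:8080"},
            },
        }
    )
    pipeline.run()
    pipeline.raise_from_status()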
  • r

    rhythmic-stone-77840

    08/17/2022, 3:05 PM
    Hi! I'm trying to confirm that OpenAPI doesn't have a way to query the downstream lineage of a given URN. I see the
    upstreamLineage
    schema, but I'm assuming that's for setting the upstream lineage of a given URN. Am I missing anything here about how to query downstream lineage?
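    For context, a hedged sketch of fetching downstreams through the Rest.li /relationships endpoint on GMS rather than OpenAPI; the URN, server address, token, and exact parameter format are assumptions and may differ by version:
    # Hedged sketch: list entities that are DownstreamOf a given dataset via GMS.
    import requests

    dataset_urn = "urn:li:dataset:(urn:li:dataPlatform:hive,SampleHiveDataset,PROD)"  # placeholder

    resp = requests.get(
        "http://localhost:8080/relationships",
        params={
            "direction": "INCOMING",        # edges pointing at this URN...
            "types": "List(DownstreamOf)",  # ...of type DownstreamOf, i.e. its downstreams
            "urn": dataset_urn,
        },
        headers={"Authorization": "Bearer <token>"},  # only if metadata service auth is enabled
    )
    resp.raise_for_status()
    print(resp.json())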
  • b

    best-fireman-42901

    08/17/2022, 3:07 PM
    Does anyone know what configuration I need to use in order to integrate with our Kafka MSK cluster in AWS? We use SCRAM/SASL, but the guide here only advises using plaintext: https://datahubproject.io/docs/deploy/aws/
  • s

    silly-finland-62382

    08/17/2022, 4:46 PM
    hey team,
  • s

    silly-finland-62382

    08/17/2022, 4:46 PM
    I basically want to know how DataHub uses Elasticsearch and MySQL: which data and indexes does each of them store?
  • s

    silly-finland-62382

    08/17/2022, 4:46 PM
    Can an expert help me with this?
  • m

    miniature-painting-28571

    08/17/2022, 6:22 PM
    Posting this question here as well to try to gather a response: Hi folks, we are trying to gather metrics for the Glossary to meet corporate requirements. I spoke with a couple of our developers on Friday afternoon and found out our company uses DataDog to gather metrics on what I call the "Data Catalog" side of the DataHub tool. We can see via Amazon's OpenSearch Service that there are CloudWatch logs which are enabled, i.e. Slow Search Logs, Index Slow Logs, and Error Logs. What is not enabled are "Audit logs". Based on John Joyce's suggestion, we should use the "datahub_usageevent" index to find the information we are looking for. Our company's developer has the following questions: 1) Can all of these logs be strung together via Splunk? (I am guessing so he can locate this specific index and other useful information in one massive Splunk-based log file to extract all of the various metrics we want to gather from one log.) 2) Our development team is more comfortable with DataDog than OpenSearch; can they use DataDog instead of Kibana/Elasticsearch? Thank you for any ideas or methods anyone has used to gather metrics specifically about the Glossary inside of DataHub. We don't care at this point whether the metrics are displayed inside of DataHub; we just want to gather details like:
    • Number of users
    • Number of users with > 1 search
      ◦ > 5 searches
      ◦ > 20 searches
      ◦ Re-evaluate the exact numbers with more data
    • Number of distinct site visits?
      ◦ TBD - who is coming back for more searches?
    • Total searches per week
      ◦ Per month
      ◦ Per quarter
      ◦ Per year
      ◦ Lifetime
    • Average searches per user
      ◦ For users with > 0 searches
      ◦ Can use # of new hires as the denominator to see how good the adoption rate is during orientation
        ▪︎ Can we pull hire date and match against email?
        ▪︎ Parking lot
    • Number of searches for a particular term
      ◦ See what's most popular (or misunderstood)
    • Number of terms
    • Number of edited terms
    • Number of edits by term
      ◦ Has one term been edited or contested multiple times?
      ◦ Rejected edits will be reportable via Jira
    • Average number of terms added per month
      ◦ Per quarter
      ◦ Per year
    • Number of Glossary requests (Jira) for new terms (Slack for now)
    • Number of searches with no results
    • Glossary downtime
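    As one possible starting point, a hedged sketch of counting search events per user straight from the usage-event index in Elasticsearch/OpenSearch; the index name, event type value, and field names are assumptions based on the message above and may differ per deployment (or need .keyword suffixes):
    # Hedged sketch: aggregate search events per user from the DataHub usage-event index.
    import requests

    ES = "http://localhost:9200"       # Elasticsearch / OpenSearch endpoint (placeholder)
    INDEX = "datahub_usage_event"      # referred to as "datahub_usageevent" above

    query = {
        "size": 0,
        "query": {"term": {"type": "SearchEvent"}},
        "aggs": {"searches_per_user": {"terms": {"field": "actorUrn", "size": 100}}},
    }

    resp = requests.post(f"{ES}/{INDEX}/_search", json=query)
    resp.raise_for_status()
    for bucket in resp.json()["aggregations"]["searches_per_user"]["buckets"]:
        print(bucket["key"], bucket["doc_count"])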
  • l

    late-rocket-94535

    08/18/2022, 7:27 AM
    Hi all. Is there a way to use OpenAPI to get lists, for example a list of domains? As far as I can see, the GET requests only return information for a specific URN.
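    One hedged workaround is the GraphQL search API rather than OpenAPI; the server address and token below are placeholders:
    # Hedged sketch: list domains via a GraphQL search instead of OpenAPI.
    import requests

    query = """
    query domains {
      search(input: { type: DOMAIN, query: "*", start: 0, count: 50 }) {
        total
        searchResults { entity { urn } }
      }
    }
    """

    resp = requests.post(
        "http://localhost:8080/api/graphql",
        json={"query": query},
        headers={"Authorization": "Bearer <token>"},
    )
    resp.raise_for_status()
    print(resp.json())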
  • a

    alert-ram-30868

    08/18/2022, 8:55 AM
    What does profiling in the Glue source do? And can we manually ingest and generate the stats info shown in the UI? Referring to this thread for what I am looking for: https://datahubspace.slack.com/archives/D03TZCZHD6W/p1660728436278209
  • a

    alert-fall-82501

    08/18/2022, 9:39 AM
    Hi team - getting this error while starting DataHub:
  • a

    alert-fall-82501

    08/18/2022, 9:39 AM
    Fetching docker-compose file https://raw.githubusercontent.com/datahub-project/datahub/master/docker/quickstart/docker-compose-without-neo4j.quickstart.yml from GitHub [2022-08-18 15:04:41,571] ERROR {datahub.entrypoints:188} - Command failed with HTTPSConnectionPool(host='raw.githubusercontent.com', port=443): Max retries exceeded with url: /datahub-project/datahub/master/docker/quickstart/docker-compose-without-neo4j.quickstart.yml (Caused by SSLError(SSLEOFError(8, 'EOF occurred in violation of protocol (_ssl.c:1131)'))). Run with --debug to get full trace
  • g

    gentle-camera-33498

    08/18/2022, 2:50 PM
    Hello guys, I want to share my recent experience with DataHub deployment in the hope that it helps someone. To be more precise, I want to show some characteristics of our use case.
    Data stack:
    - Data warehouse on Google BigQuery.
    - Metabase as the BI platform.
    - Airflow as our principal ETL tool.
    - Everything is deployed on GKE.
    - We use Helm to deploy DataHub on our cluster.
    The volume:
    - More than 1k database tables.
    - About 300 views.
    - About 500 dashboards (including personal dashboards).
    - About 7k charts.
    The problem:
    - Using the default implementation of the DataHub chart, a lot of errors occurred.
    - We could not configure the standalone consumers well.
    - The frontend experience was awful because of the volume of errors (500 status code error messages).
    What we decided to do:
    - Implement our own Helm chart with a different structure and configuration options (and with CI/CD workflows).
    - Not use standalone consumers; configure ingestion via Airflow and KubernetesPodOperator, and disable frontend ingestion.
    - Change the Elasticsearch default parameters (2 GB wasn't enough for our volume): the JVM heap size is now 4 GB, and we use 3 replicas with 1 master node.
    - Increase all timeout variables and Elasticsearch threads, reduce GMS max connections to 8, and raise the Play memory buffer size to 50 MB (the POST requests are quite large).
    With the actions above, we deployed a stable platform. Now we are discovering the next steps and expecting to help the community. Feel free to ask me anything. I will try to help!
  • b

    busy-dusk-4970

    08/19/2022, 9:36 AM
    I'm getting this error during
    ./gradlew build
    :metadata-service:auth-ranger-impl:test
    Upon further inspection of the test I see this 🧵
  • s

    silly-finland-62382

    08/19/2022, 11:36 AM
    Hey @little-megabyte-1074 @bulky-soccer-26729 @big-carpet-38439, hope you are doing well!
  • s

    silly-finland-62382

    08/19/2022, 11:36 AM
    Can you please tell me the use case of Kafka and the Kafka Schema Registry in DataHub? What kind of data does Kafka store?
  • d

    delightful-sugar-63810

    08/21/2022, 11:56 AM
    Hey folks 👋🏻 While navigating the docs, I realized that some critical features of DataHub are missing from the docs, or I just couldn't find them documented:
    • Entity health check status
    • Column-level lineage
    • Groups
    • Entity usage statistics & profiling
    • A figure of which platforms DataHub supports for ingestion, similar to the one on the main page.
    These pages could include the recommended use cases for these features, integration details, and sample screenshots of the UI. Also, it would be nice if there were two demo pages: one an actual demo page, and one more like a sandbox environment for everybody to try out and discover DataHub. FYI
  • l

    lemon-engine-23512

    08/21/2022, 5:37 PM
    Hi team, can we post lineage to the DataHub API, e.g. from some custom code? If so, can you please share how to do it?
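    A minimal sketch, assuming a REST GMS at localhost:8080 and placeholder dataset names, of emitting dataset-level upstream lineage from custom code with the Python emitter (constructor arguments can vary slightly between CLI versions):
    # Hedged sketch: declare that target_db.events_clean is derived from source_db.events.
    from datahub.emitter.mce_builder import make_dataset_urn
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import (
        DatasetLineageTypeClass,
        UpstreamClass,
        UpstreamLineageClass,
    )

    emitter = DatahubRestEmitter(gms_server="http://localhost:8080")

    upstream = UpstreamClass(
        dataset=make_dataset_urn("hive", "source_db.events", "PROD"),
        type=DatasetLineageTypeClass.TRANSFORMED,
    )

    mcp = MetadataChangeProposalWrapper(
        entityUrn=make_dataset_urn("hive", "target_db.events_clean", "PROD"),
        aspect=UpstreamLineageClass(upstreams=[upstream]),
    )
    emitter.emit_mcp(mcp)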
  • w

    wonderful-egg-79350

    08/21/2022, 11:14 PM
    Hello all. How could I delete all data in MySQL (the datahub database) using the DataHub CLI? Or is there another method to delete all data?
  • s

    some-cpu-10022

    08/22/2022, 12:26 PM
    Hi everybody, I have a glossary-related question. I've seen it's possible to link terms in the 'Related Terms' tab, with a choice between 'Contains'/'Inherits' as the relation type. Is it possible to easily add custom relation types? If yes, via the UI/API/metadata store? @little-megabyte-1074
  • v

    victorious-xylophone-76105

    08/22/2022, 2:37 PM
    👋 Hi everyone! I am trying to install an older version of DataHub with:
    datahub docker quickstart --version 0.8.41
    and getting a few errors immediately:
    Pulling elasticsearch          ... done
    Pulling elasticsearch-setup    ... error
    Pulling mysql                  ... done
    Pulling datahub-gms            ... error
    Pulling datahub-frontend-react ... error
    Pulling datahub-actions        ... done
    Pulling mysql-setup            ... error
    Pulling zookeeper              ... done
    Pulling broker                 ... done
    Pulling schema-registry        ... done
    Pulling kafka-setup            ... error
    and eventually a lot of messages:
    ERROR: manifest for linkedin/datahub-elasticsearch-setup:0.8.41 not found: manifest unknown: manifest unknown
    before it eventually fails. How do I make it work?
  • c

    clever-helicopter-29529

    08/22/2022, 6:25 PM
    Hi everyone! I'm just getting started and trying to follow along with the quickstart docs. I've installed all of the prerequisites and am familiar with Docker and Docker Compose, but when I run the quickstart command
    datahub docker quickstart
    I get the following error:
    [2022-08-22 14:07:59,661] ERROR    {datahub.entrypoints:188} - Command failed with [Errno 2] No such file or directory: 'docker-compose'. Run with --debug to get full trace
    [2022-08-22 14:07:59,661] INFO     {datahub.entrypoints:191} - DataHub CLI version: 0.8.43.2 at /home/zenith/.local/lib/python3.9/site-packages/datahub/__init__.py
    I can't see any issue in my Docker environment; the only thing I think it could be is that my system recognizes
    docker compose
    and doesn't recognize
    docker-compose
    Has anyone else encountered this? I made an alias:
    alias docker-compose="docker compose"
    and confirmed it worked with
    docker-compose version
    but I'm still getting the above error on the
    datahub docker quickstart
    command.
  • g

    great-branch-515

    08/23/2022, 8:30 AM
    Hi everyone! Quick question: how does DataHub compare to Databricks Unity Catalog?
  • b

    breezy-shoe-41523

    08/23/2022, 9:40 AM
    Hi team, is there any way to programmatically generate a PAT (Personal Access Token) without already having one?
  • m

    mammoth-insurance-91360

    08/23/2022, 10:22 AM
    Hi, getting started with the install of DataHub and having a few issues (mainly due to packages not downloading because of network restrictions). I'm running "datahub docker quickstart" and it is stalling on the download of a jar file. Is it possible to do an entirely offline install? Any guidance appreciated.
  • b

    breezy-shoe-41523

    08/23/2022, 2:24 PM
    Hello team, is there a GraphQL API to generate a personal access token from an ID/password?
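    One hedged approach: authenticate against the frontend with a username/password to get a session cookie, then call the createAccessToken GraphQL mutation with that session. The URLs, credentials, actor URN, and token settings below are placeholders, and token-based auth must be enabled on the metadata service:
    # Hedged sketch: log in with username/password, then mint a PAT via GraphQL.
    import requests

    session = requests.Session()

    # 1) Authenticate against the frontend to obtain a session cookie.
    login = session.post(
        "http://localhost:9002/logIn",
        json={"username": "datahub", "password": "datahub"},
    )
    login.raise_for_status()

    # 2) Use the session to call the createAccessToken mutation.
    mutation = """
    mutation {
      createAccessToken(input: {
        type: PERSONAL,
        actorUrn: "urn:li:corpuser:datahub",
        duration: ONE_MONTH,
        name: "my-automation-token"
      }) {
        accessToken
      }
    }
    """
    resp = session.post("http://localhost:9002/api/v2/graphql", json={"query": mutation})
    resp.raise_for_status()
    print(resp.json()["data"]["createAccessToken"]["accessToken"])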
  • s

    silly-finland-62382

    08/24/2022, 6:40 AM
    Hey,
  • s

    silly-finland-62382

    08/24/2022, 6:41 AM
    Can you please tell me the best method to ingest lineage from Spark-based jobs into DataHub? Which method should I use: the DataHub Spark listener or the emitter? @bulky-soccer-26729 @little-megabyte-1074 @big-carpet-38439
  • s

    silly-finland-62382

    08/24/2022, 6:41 AM
    Is there any reference code for ingesting Spark lineage into DataHub?
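    A hedged PySpark sketch of wiring up the DataHub Spark lineage listener; the artifact version, job name, paths, and GMS address are placeholders, and the same settings can be passed via spark-submit --conf:
    # Hedged sketch: enable the DataHub Spark lineage listener on a PySpark session.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("my-spark-job")
        .config("spark.jars.packages", "io.acryl:datahub-spark-lineage:0.8.41")  # pick a released version
        .config("spark.extraListeners", "datahub.spark.DatahubSparkListener")
        .config("spark.datahub.rest.server", "http://localhost:8080")
        # .config("spark.datahub.rest.token", "<token>")  # if metadata service auth is enabled
        .getOrCreate()
    )

    # Lineage is then captured automatically for reads/writes done through this session.
    df = spark.read.parquet("s3://example-bucket/input/")
    df.write.mode("overwrite").parquet("s3://example-bucket/output/")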
  • s

    silly-finland-62382

    08/24/2022, 12:32 PM
    Hey