DataHub #getting-started

wonderful-quill-11255

02/25/2021, 8:07 AM

Hello. I'm looking around a bit at how to monitor the different components. The MCE and MAE consumers use Spring's Actuator library, but for GMS and the frontend I'm not seeing anything similar. Do they have something similar that I'm just not seeing? If not, would it make sense to try to use the same library for them as well?

incalculable-ocean-74010

02/26/2021, 5:19 PM

Hello, does datahub support entity versioning & history? As in, suppose a given dataset that defines a SQL table changes over time (columns added/removed/updated); can a user see the metadata at a given point in time?

gentle-exabyte-43102

02/26/2021, 8:30 PM

Hello there! Anyone seen this before?

$ ./docker/quickstart.sh
Pulling elasticsearch        ... done
Pulling mysql                ... done
Pulling elasticsearch-setup  ... done
Pulling kibana               ... done
Pulling neo4j                ... done
Pulling zookeeper            ... done
Pulling broker               ... done
Pulling schema-registry      ... done
Pulling schema-registry-ui   ... done
Pulling kafka-setup          ... done
Pulling datahub-mae-consumer ... done
Pulling kafka-rest-proxy     ... done
Pulling kafka-topics-ui      ... done
Pulling datahub-gms          ... done
Pulling datahub-mce-consumer ... done
Pulling datahub-frontend     ... done
Building elasticsearch-setup
Sending build context to Docker daemon  27.78MB

Step 1/10 : ARG APP_ENV=prod
Step 2/10 : FROM jwilder/dockerize:0.6.1 AS base
 ---> 849596ab86ff
Step 3/10 : RUN apk add --no-cache curl jq
 ---> Running in d6b9d0968be4
fetch http://dl-cdn.alpinelinux.org/alpine/v3.6/main/x86_64/APKINDEX.tar.gz
WARNING: Ignoring http://dl-cdn.alpinelinux.org/alpine/v3.6/main/x86_64/APKINDEX.tar.gz: temporary error (try again later)
fetch http://dl-cdn.alpinelinux.org/alpine/v3.6/community/x86_64/APKINDEX.tar.gz
WARNING: Ignoring http://dl-cdn.alpinelinux.org/alpine/v3.6/community/x86_64/APKINDEX.tar.gz: temporary error (try again later)
ERROR: unsatisfiable constraints:
  curl (missing):
    required by: world[curl]
  jq (missing):
    required by: world[jq]
The command '/bin/sh -c apk add --no-cache curl jq' returned a non-zero code: 2

curved-magazine-23582

03/01/2021, 2:56 AM

Will UserGroup be used as a mechanism for dataset access control? Or is there such a thing on the roadmap for DataHub?

big-carpet-38439

03/01/2021, 7:03 PM

PSA: This Wednesday we will be hosting the inaugural React Office Hours sessions! 🎉 Feel free to stop in to ask questions or just hack on the app with @green-football-43791 and me! We will be conducting 2 sessions:
• Morning: 8-10am PST
• Afternoon: 3-5pm PST
Both will be hosted at https://meet.google.com/rbr-vbsy-yuy?authuser=1

mammoth-bear-12532

03/02/2021, 3:38 AM

We are aware that datahub's build is broken right now due to a linkedin jfrog artifactory issue. @microscopic-receptionist-23548 is looking into it.

acceptable-architect-70237

03/02/2021, 4:39 PM

Hello team, a general question about data replay strategy. For example, in our case, we need to calculate each dataset's data quality. The data quality is calculated based on the aspects of a dataset. Since all datasets are already in the datastores (MySQL, Neo4j and Elasticsearch), we need a way to pull the data and do the calculation. Right now we are pulling data from MySQL using a Python script. Do you have any suggestions?

incalculable-ocean-74010

03/02/2021, 5:29 PM

Hello, does datahub support deleting concrete entities? From https://github.com/linkedin/datahub/tree/master/gms I see get/search/update & list but no delete.

nutritious-bird-77396

03/03/2021, 12:04 AM

As I am working on getting the PR out for the GraphQL MLModel query, I am facing an issue in the MLModels client where the Snapshot aspects array is empty here: https://github.com/linkedin/datahub/blob/master/gms/impl/src/main/java/com/linkedin/metadata/resources/ml/MLModels.java#L121 Any clues on where the issue might be?

gentle-exabyte-43102

03/04/2021, 12:11 AM

Fresh install of datahub: browsing to /browse/datasets, I see "An error occurred. Please try again shortly." and in the console a request to api/v2/browse?type=dataset&count=100&start=0 is a 400 with "Bad Request. type parameter can not be null"
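One quick sanity check on the report above is to parse the quoted request path and confirm what the client actually sent. The sketch below uses only the Python standard library; the path is copied from the message, and `query_params` is a helper name made up for this note. It says nothing about how the server handles the request.

```python
from urllib.parse import urlsplit, parse_qs

# Request path exactly as quoted in the message above.
REQUEST = "api/v2/browse?type=dataset&count=100&start=0"

def query_params(url: str) -> dict:
    """Return the query-string parameters of a URL as a dict of lists."""
    return parse_qs(urlsplit(url).query)

if __name__ == "__main__":
    params = query_params(REQUEST)
    # The quoted request does carry a type parameter.
    print(params["type"])  # ['dataset']
```

Since the quoted URL does include `type=dataset`, the "type parameter can not be null" response is not explained by the query string as quoted.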

curved-crayon-1929

03/04/2021, 7:28 AM

Hi, I am new to datahub. After cloning per https://github.com/linkedin/datahub/blob/master/docs/quickstart.md, when I run

./docker/quickstart.sh

it gets stuck as below and keeps repeating the same output. Can someone help me?

nutritious-bird-77396

03/04/2021, 10:15 PM

We are looking at a use case where data-profiling information such as count of events, max, min, etc. is pushed every few minutes for every dataset in the org. Has LinkedIn dealt with such a use case? What special considerations need to be taken care of in the architecture? For example: data profiling info for 30,000 datasets pushed every 5 minutes.
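For a sense of scale, here is a rough back-of-the-envelope sketch of the load in the example above. It assumes one profiling event per dataset per push, which the message does not state, so treat the numbers as a lower bound under that assumption.

```python
# Back-of-the-envelope load estimate for the scenario in the question.
# Assumption (not from the thread): each dataset emits exactly one
# profiling event per push interval.
DATASETS = 30_000
PUSH_INTERVAL_SECONDS = 5 * 60

def average_events_per_second(datasets: int, interval_seconds: int) -> float:
    """Average event rate if pushes were spread evenly over the interval."""
    return datasets / interval_seconds

def events_per_day(datasets: int, interval_seconds: int) -> int:
    """Total events emitted over 24 hours."""
    pushes_per_day = (24 * 60 * 60) // interval_seconds
    return datasets * pushes_per_day

if __name__ == "__main__":
    print(average_events_per_second(DATASETS, PUSH_INTERVAL_SECONDS))  # 100.0
    print(events_per_day(DATASETS, PUSH_INTERVAL_SECONDS))  # 8640000
```

If all datasets push at the same moment rather than evenly, the peak rate would be well above the 100 events per second average.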

high-hospital-85984

03/05/2021, 11:22 AM

@clean-bear-94984 (or someone else): there has been some work on adding support for DataJobs and DataTasks: https://github.com/linkedin/datahub/pull/2008 but it seems like the feature is not fully implemented yet. Any plans on doing so? If not, mind if we pick up the work?

billions-scientist-31934

03/06/2021, 1:26 PM

Hi all. I've been spending some time digging into datahub's backend and I had a quick question. I noticed that the MAEs have an internal Java representation that can be serialized into Avro, but no part of them seems to get put into any formal query intermediate representation (Calcite, for example). I thought that Pegasus was this, but it looks like Pegasus is just an object format to help decorate the REST layer. Does this mean that datahub is meant to be strictly a federated metadata discovery tool, unlike a tool like Dremio, which is meant to be more like a federated query or execution engine? If so (apologies in advance if I overlooked something), is the long-term plan to collide with the coral / dali community to start to get the execution side? Since coral only supports hive view definitions, what is the interim plan to get things like pushdown optimization into queries before it supports more of the backends that datahub currently supports? Or is datahub meant to avoid approaching query execution altogether and only focus on metadata queries?

mammoth-bear-12532

03/09/2021, 5:00 PM

@here News Alert! We've just published the project roadmap for the first half of 2021. Check it out here! https://datahubproject.io/docs/roadmap/

👍 8

🥳 1

🙌 1

incalculable-ocean-74010

03/10/2021, 9:17 PM

Hello, does datahub provide operational metric endpoints like jmx metrics for Prometheus? Is there documentation on this?

some-crayon-90964

03/11/2021, 5:48 PM

Hey guys, I am reading this document, so I have a question. What is the difference between Entity and Snapshot, conceptually and technically? @fancy-advantage-41244 fyi

mammoth-bear-12532

03/12/2021, 4:59 AM

Some good news after all those build failures 🙂
• SSO using OIDC is now in master! 🎉
• Please take it for a spin and let @big-carpet-38439 know if you run into any issues.
• We've tested it with Google SSO and Okta.
• Docs here: https://datahubproject.io/docs/how/configure-oidc-react

🎉 2

🙌 1

gentle-exabyte-43102

03/12/2021, 7:49 PM

DatasetUrns look to be of the form

urn:li:dataset:(urn:li:dataPlatform:{platform},{dataset_name},PROD)

where platform seems to be an enum, something like hive, hdfs, kafka, mysql, etc. Is it possible to specify other values for platform? Can I supply whatever value I want? It seems like I can't; I'm getting pegasus errors.
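To make the URN shape quoted above concrete, here is a minimal sketch that only assembles the string. `build_dataset_urn` is a hypothetical helper defined for this note, and it performs no validation of the platform value, so it does not answer whether arbitrary platforms are accepted.

```python
# Illustrative only: builds a string in the URN shape quoted in the message.
# build_dataset_urn is a hypothetical helper defined here; it does not
# check the platform or environment against DataHub's models.
def build_dataset_urn(platform: str, dataset_name: str, env: str = "PROD") -> str:
    """Assemble a dataset URN string from its three parts."""
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{dataset_name},{env})"

if __name__ == "__main__":
    print(build_dataset_urn("mysql", "db.table"))
    # urn:li:dataset:(urn:li:dataPlatform:mysql,db.table,PROD)
```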

incalculable-ocean-74010

03/15/2021, 4:40 PM

Hello, is there a particular reason why docker images are created directly from datahub's source instead of relying on published artifacts, i.e. published jars for GMS or published packages for Python? Right now, if I need to modify a particular image, I need to have the entire codebase locally available to perform relatively minor changes.

astonishing-yak-92682

03/15/2021, 4:46 PM

Getting this error while trying to log in to the datahub React application using the quickstart-react script

curved-magazine-23582

03/17/2021, 4:17 AM

Is it possible to add AWS S3 to the list of dataPlatforms? Most of our datasets are in an AWS S3 lake.

worried-flower-88750

03/17/2021, 10:24 PM

Hello everyone 👋 Is there a way to edit descriptions through the UI? Just curious

mammoth-bear-12532

03/19/2021, 2:32 AM

Folks: an important announcement: We are officially on Elasticsearch-7 now! 🚀 Thanks to everyone who worked hard for this milestone: @microscopic-waitress-95820, @microscopic-receptionist-23548 and a cameo by @early-lamp-41924. There is a migration guide if you need it here: https://datahubproject.io/docs/advanced/es-7-upgrade. Happy searching!

acoustic-printer-83045

03/21/2021, 10:29 PM

👋 Just wondering if anyone else is experiencing an elasticsearch failure when running ./docker/quickstart.sh. When I try to fire up elasticsearch I get this (snipped) log:

elasticsearch             | {"type": "server", "timestamp": "2021-03-21T22:25:21,301Z", "level": "ERROR", "component": "o.e.b.ElasticsearchUncaughtExceptionHandler", "cluster.name": "docker-cluster", "node.name": "elasticsearch", "message": "uncaught exception in thread [main]", 
elasticsearch             | "stacktrace": ["org.elasticsearch.bootstrap.StartupException: java.lang.IllegalStateException: failed to obtain node locks, tried [[/usr/share/elasticsearch/data]] with lock id [0]; maybe these locations are not writable or multiple nodes were started without increasing [node.max_local_storage_nodes] (was [1])?",
elasticsearch             | "at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:174) ~[elasticsearch-7.9.3.jar:7.9.3]",
elasticsearch             | "at org.elasticsearch.bootstrap.Elasticsearch.execute(Elasticsearch.java:161) ~[elasticsearch-7.9.3.jar:7.9.3]",
elasticsearch             | "at org.elasticsearch.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:86) ~[elasticsearch-7.9.3.jar:7.9.3]",
elasticsearch             | "at org.elasticsearch.cli.Command.mainWithoutErrorHandling(Command.java:127) ~[elasticsearch-cli-7.9.3.jar:7.9.3]",
elasticsearch             | "at org.elasticsearch.cli.Command.main(Command.java:90) ~[elasticsearch-cli-7.9.3.jar:7.9.3]",
elasticsearch             | "at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:126) ~[elasticsearch-7.9.3.jar:7.9.3]",
elasticsearch             | "at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:92) ~[elasticsearch-7.9.3.jar:7.9.3]",
elasticsearch             | "Caused by: java.lang.IllegalStateException: failed to obtain node locks, tried [[/usr/share/elasticsearch/data]] with lock id [0]; maybe these locations are not writable or multiple nodes were started without increasing [node.max_local_storage_nodes] (was [1])?",

I don't think this is caused by resource contention but I could be wrong. Thanks!

high-hospital-85984

03/22/2021, 10:58 AM

We just tried to update from 0.6.1 to 0.7.0 and suddenly the MCE isn't consuming messages anymore. No errors in the log, and no configs have been changed. Any ideas as to what could be the issue?

incalculable-ocean-74010

03/23/2021, 2:29 PM

Hello everyone. With the introduction of Elasticsearch 7, we no longer need to define mappings.json files, right? The docker image for elastic setup still uses the mappings files; is this now legacy?

some-crayon-90964

03/23/2021, 2:39 PM

image.png

some-crayon-90964

03/23/2021, 2:39 PM

At this point, we don't know what to do in order to fix this. Please advise.

👀 1

mammoth-bear-12532

03/23/2021, 5:38 PM

Hi folks, just wanted to let you know that we merged in the dbt source last night. Thanks to great work by @acoustic-printer-83045! Please give it a spin in your dbt environment and let us know how it works for you! (https://datahubproject.io/docs/metadata-ingestion#dbt-dbt)

👍 1

🙌 4