# getting-started
  • little-megabyte-1074 (08/12/2020, 4:08 PM)
    Hi folks! My team is going to be building out a POC for DataHub in our upcoming sprint and I’m curious if anyone has suggestions/thoughts about the best way to collect usage analytics? I want to be able to quantify how folks are using the tool, what features have the highest adoption, etc. Our company uses Segment, but I’m not sure we want to invest too much time integrating with Segment for a POC.
  • billowy-eye-48149 (08/12/2020, 8:25 PM)
    Hello Team, why is the mysql container setup missing in the new docker compose file?
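    (For anyone hitting the same gap: a minimal sketch of a mysql service that could be added back to the compose file, assuming the default datahub credentials used by the quickstart; the image tag and init-script path are assumptions, so check the repo's docker/ directory for the canonical definition.)
      mysql:
        image: mysql:5.7
        environment:
          MYSQL_DATABASE: datahub
          MYSQL_USER: datahub
          MYSQL_PASSWORD: datahub
          MYSQL_ROOT_PASSWORD: datahub
        ports:
          - "3306:3306"
        volumes:
          # assumed location of the schema init script
          - ./mysql/init.sql:/docker-entrypoint-initdb.d/init.sql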
  • calm-minister-22324 (08/31/2020, 2:58 PM)
    Hi everyone o/ I've been reading the documentation, and I wanted to check whether DataHub would support my use case and whether I understood the docs correctly.
    Use Case: I need to expose an API with the date of the latest available data on external tables in Redshift.
    Solution: Create a SQL ingestion to push the latest samples from the tables as datasets, and then use the dataset to extract/query the latest date available.
    Would this be correct?
  • fast-exabyte-18411 (08/31/2020, 4:06 PM)
    Hi all, working on a POC of DataHub. Running into some issues with connecting to our external Kafka provider, as it uses SSL and Basic Auth for the schema registry. I was hoping there would be a way to configure the consumers to use SSL via environment variables, but it looks like the Spring Kafka library doesn't support these specific configs. There's some Stack Overflow discussion that implies they at least don't have it for SSL: https://stackoverflow.com/questions/51316017/spring-boot-spring-kafka-ssl-configuration-by-environment-variables-impossible So it would seem that this configuration has to be done here: https://github.com/linkedin/datahub/blob/master/metadata-jobs/mae-consumer-job/src/main/resources/application.properties Has anybody else run into this? I'm thinking of testing out a fix locally and opening a PR; would love to know if I'm missing something.
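    (A rough sketch of what such entries might look like in that application.properties, assuming Spring Boot's spring.kafka.properties.* passthrough to the underlying Kafka clients; the truststore path, registry URL and credentials below are placeholders, not DataHub defaults.)
    # broker SSL (standard Kafka client configs)
    spring.kafka.properties.security.protocol=SSL
    spring.kafka.properties.ssl.truststore.location=/etc/kafka/secrets/client.truststore.jks
    spring.kafka.properties.ssl.truststore.password=<truststore-password>
    # schema registry basic auth (Confluent serializer configs)
    spring.kafka.properties.schema.registry.url=https://<schema-registry-host>
    spring.kafka.properties.basic.auth.credentials.source=USER_INFO
    spring.kafka.properties.basic.auth.user.info=<user>:<password>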
  • silly-apple-97303 (09/04/2020, 9:53 PM)
    I'm trying to configure access to schema registry with basic auth enabled for the GMS. I was able to configure schema registry access for the MAE/MCE services with the following env variables:
    - name: SPRING_KAFKA_PROPERTIES_BASIC_AUTH_CREDENTIALS_SOURCE
      value: USER_INFO
    - name: SPRING_KAFKA_PROPERTIES_BASIC_AUTH_USER_INFO
      valueFrom:
        secretKeyRef:
          name: "kafka-schema-registry-credentials"
          key: "user-info"
    And the logs from both MAE/MCE look like:
    16:21:26.721 [main] INFO  i.c.k.s.KafkaAvroDeserializerConfig - KafkaAvroDeserializerConfig values: 
    	schema.registry.url = [redacted]
	basic.auth.user.info = [hidden]
    	auto.register.schemas = true
    	max.schemas.per.subject = 1000
    	basic.auth.credentials.source = USER_INFO
	schema.registry.basic.auth.user.info = [hidden]
    	specific.avro.reader = false
    	value.subject.name.strategy = class io.confluent.kafka.serializers.subject.TopicNameStrategy
    	key.subject.name.strategy = class io.confluent.kafka.serializers.subject.TopicNameStrategy
    
    16:21:26.857 [main] INFO  o.a.kafka.common.utils.AppInfoParser - Kafka version: 2.2.1-cp1
    However, doing the same for the GMS is not working. Specifically I get these warn log messages on startup and the configs are not attached to the serializer:
    16:20:23.481 [main] INFO  i.c.k.s.KafkaAvroSerializerConfig - KafkaAvroSerializerConfig values: 
    	schema.registry.url = [redacted]
    	max.schemas.per.subject = 1000
    
    16:20:24.213 [main] WARN  o.a.k.c.producer.ProducerConfig - The configuration 'basic.auth.user.info' was supplied but isn't a known config.
    16:20:24.215 [main] WARN  o.a.k.c.producer.ProducerConfig - The configuration 'basic.auth.credentials.source' was supplied but isn't a known config.
    
    16:20:24.217 [main] INFO  o.a.kafka.common.utils.AppInfoParser - Kafka version: 2.3.0
    When digging into this I noticed the MAE/MCE are using Kafka 2.2.1-cp1 (the Confluent Platform build) while the GMS is using 2.3.0 (the non-Confluent build). I'm thinking regular non-Confluent clients might not support the same set of schema registry configurations.
  • swift-account-97627 (09/07/2020, 12:28 PM)
    I've noticed various references to a "top consumers" feature in the front-end code, but nowhere else. I'm guessing this is an internal feature? I don't see it in the roadmap. Is it anticipated to be open sourced at some point or is it internal forever?
  • swift-account-97627 (09/07/2020, 12:40 PM)
    I have a number of related use cases for what I'd describe as "Data Profile" and/or "Data Quality" attributes, most of which are per-field. For example: completeness (~non-null percentage), distinct values, or histograms of distinct values in enumeration-like fields. I've been doing some quick-and-dirty prototyping by adding these attributes to the SchemaField model, but that feels wrong. It seems like "Data Profile" is really a separate aspect to "Schema", but they both contain per-field information, so I'm not sure how best to model this. I could add "DataProfile" to the set of dataset aspects, but then I'd have two aspects containing field-level information (SchemaMetadata.SchemaFields and something like DataProfile.FieldProfiles). If this is the correct model, what would be a good way to associate each particular FieldProfile with a particular SchemaField? Or is there a different model that would be better? More generally, it seems like there's a tension between two models for field-level aspects:
    1. Dataset has many Aspects, some of which have metadata for many Fields
    2. Dataset has many Fields, some of which have multiple Aspects
    I don't have an opinion on which of these models is more "correct", but the current implementation only seems to really support one aspect per field, and pushes any extensions to favour model (1) above. Is this a conscious design decision, or has this question just not come up yet?
  • high-hospital-85984 (09/14/2020, 6:15 PM)
    Hi all! Complete newbie question. After running quickstart.sh and ingestion/ingestion.sh I see items in the Upstream/Downstream tables in the “Relationship” tab for all dummy datasets. The lineage data is however empty, as well as the graph visualisation. Am I missing something?
  • able-garden-99963 (09/15/2020, 11:00 AM)
    Hi datahub team! A couple of quick questions: 1. Is it possible to add constraints to entity fields (say, my CorpUser entity has an "email" field and I want it to be unique)? 2. Is it possible to add custom queries based on entity fields? (say, my CorpUser entity has "email" field and I want to query CorpUsers by email)? Thanks!
  • high-hospital-85984 (09/15/2020, 7:05 PM)
    Another question, from a newb trying to understand how DataHub works. In datahub/metadata-models/src/main/pegasus/com/linkedin/metadata/snapshot/ we define e.g. ChartSnapshot and MLModelSnapshot. However, in Snapshot.pdl we only list MLModelSnapshot and not ChartSnapshot in the union. Why is that? Similarly, in datahub/metadata-models/src/main/pegasus/com/linkedin/metadata/entity/ we define a ChartEntity, but not an MLModelEntity, and the ChartEntity is not listed in the union in Entity.pdl. Why is that?
  • high-hospital-85984 (09/16/2020, 1:47 PM)
    What's the best practice for keeping up to date with master while working with custom entities in our own fork/version? It seems like it could be prone to merge conflicts.
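    (The usual Git answer, independent of DataHub: keep linkedin/datahub as an upstream remote and rebase or merge it into the fork regularly, keeping custom entities in clearly separated packages to limit conflicts. A minimal sketch:)
    git remote add upstream https://github.com/linkedin/datahub.git
    git fetch upstream
    git rebase upstream/master    # or: git merge upstream/master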
  • aloof-fall-4769 (09/16/2020, 11:53 PM)
    Hi All, I was trying out DataHub and am having difficulties building the project; can anyone help me out? "The import com.linkedin.pegasus2avro cannot be resolved". Also, in metadata-events/mxe-schemas there is a script named rename-namespace.sh which renames all com.linkedin.* to com.linkedin.pegasus2avro.*.
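    (If this is the same generated-code issue discussed elsewhere in this channel, the com.linkedin.pegasus2avro classes are produced by that rename step rather than checked in, so building the mxe-schemas module first may resolve the import; a sketch, assuming a stock checkout:)
    ./gradlew :metadata-events:mxe-schemas:build
    ./gradlew build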
  • swift-account-97627 (09/21/2020, 10:29 AM)
    (Moved to #datahub-ui )
  • high-hospital-85984 (09/21/2020, 11:39 AM)
    I’m trying to build the datahub-frontend image, but hitting "Launcher Chrome not found. Not installed?". I couldn't find any mention of this in the docs. Any ideas?
  • some-crayon-90964 (09/22/2020, 5:47 PM)
    Hello all, I have got an issue trying to run the mysql docker for DataHub, but I got this. I have tried to grant permissions to /var/run/mysqld but it doesn't seem to help. Anyone have an idea how I can fix this? Thanks in advance.
  • high-hospital-85984 (09/25/2020, 4:37 PM)
    Has anyone extended their metadata tracking to APIs, either third or first party? As this seems to require a new entity etc, I’d be interested to know how you modelled it? Thanks!
  • some-crayon-90964 (09/25/2020, 8:36 PM)
    I am having trouble running GMS; it fails with an NPE. I have explored the channels and found a solved case that said it was a SQL config issue, but I don't see anything wrong with my SQL config. I am running mysql with:
    docker run --name mysql --hostname mysql -e "MYSQL_DATABASE=datahub" -e "MYSQL_USER=datahub" -e "MYSQL_PASSWORD=datahub" -e "MYSQL_ROOT_PASSWORD=datahub" -v mysql:/docker-entrypoint-initdb.d -v mysqldata:/var/lib/mysql -v mysql:/var/run/mysqld:rw --publish 3306:3306 --tmpfs /tmp:rw --tmpfs /run:rw --tmpfs /var/run:rw --network geotab_docker_bridge --read-only --security-opt=no-new-privileges -d gcr.io/data-infrastructure-test-env/myql:5.7 --character-set-server=utf8mb4 --collation-server=utf8mb4_unicode_ci
    Not sure what is going wrong; what credentials / setup is GMS looking for? Thanks in advance!
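    (For reference, GMS reads its datasource settings from environment variables along these lines; the names below are believed to match the docker env files but should be verified against the GMS docker directory in the repo, and the values assume the quickstart defaults:)
    EBEAN_DATASOURCE_USERNAME=datahub
    EBEAN_DATASOURCE_PASSWORD=datahub
    EBEAN_DATASOURCE_HOST=mysql:3306
    EBEAN_DATASOURCE_URL=jdbc:mysql://mysql:3306/datahub?verifyServerCertificate=false&useSSL=true
    EBEAN_DATASOURCE_DRIVER=com.mysql.jdbc.Driver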
  • acceptable-architect-70237 (09/25/2020, 8:56 PM)
    I want to share my experience using Apache SkyWalking with DataHub. Apache SkyWalking is an application performance monitoring tool for distributed systems, especially designed for microservices, cloud-native and container-based (Docker, K8s, Mesos) architectures. https://github.com/liangjun-jiang/distributed-tracing-in-datahub-with-skywalking/blob/master/README.md
  • high-hospital-85984 (09/30/2020, 7:57 AM)
    Hi! We’re moving ahead with our PoC, and now we’re wondering what the minimum/maximum supported versions are for the databases:
    • MySQL or Postgres
    • Neo4J
    • Elasticsearch
  • flat-answer-18123 (09/30/2020, 11:49 AM)
    Hi, I am trying to build the datahub project but it is failing. Can anyone help me?
  • strong-pharmacist-65336 (09/30/2020, 7:19 PM)
    I am not able to start the datahub-frontend container. Could you please help me start it?
  • some-crayon-90964 (09/30/2020, 7:51 PM)
    Question: does DataHub currently have an audit log that tracks user activities, such as "user A accessed dataset 123"? Or is the LinkedIn team planning to add that?
  • chilly-barista-6524 (10/06/2020, 9:30 AM)
    ./gradlew build is failing with the following error:
    > Task :datahub-web:emberWorkspaceTest FAILED
    
    FAILURE: Build failed with an exception.
    
    * What went wrong:
    Execution failed for task ':datahub-web:emberWorkspaceTest'.
    > Process 'command '/home/shubham.gupta2/datahub/datahub-web/build/yarn/yarn-v1.13.0/bin/yarn'' finished with non-zero exit value 1
    Can someone help with this? This is hosted on an EC2 instance.
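    (If the goal is just to get the backend artifacts built, one workaround, assuming nothing downstream needs the web test output, is to exclude that task:)
    ./gradlew build -x :datahub-web:emberWorkspaceTest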
  • strong-pharmacist-65336 (10/06/2020, 11:54 AM)
    Hello @here, I am getting the error "unable to find valid certification path to requested target" while executing this command from https://github.com/linkedin/datahub/tree/master/metadata-ingestion:
    ./gradlew :metadata-events:mxe-schemas:build
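    ("unable to find valid certification path" usually means the JVM doesn't trust the certificate presented by a proxy or internal mirror. One common fix, assuming you can obtain the corporate root CA certificate, is to import it into the JDK truststore; the cert file and alias below are placeholders:)
    # on JDK 8 the truststore is $JAVA_HOME/jre/lib/security/cacerts
    keytool -importcert -alias corp-root-ca \
      -keystore "$JAVA_HOME/lib/security/cacerts" \
      -storepass changeit \
      -file corp-root-ca.crt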
  • hallowed-dinner-34937 (10/07/2020, 2:00 PM)
    Hello LinkedIn Team, running a docker container for elasticsearch today gave me this in the logs: "# License [will expire] on [Saturday, October 31, 2020]. If you have a new license, please update it. # Otherwise, please reach out to your support contact." Wondering if this is something that needs to be looked into (if not already being looked into).
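    (If that message is the x-pack trial license counting down, one option, assuming the stock elasticsearch image rather than the -oss one, is to switch to the free basic license via the container environment, e.g. in the compose file:)
    elasticsearch:
      environment:
        - xpack.license.self_generated.type=basic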
  • hallowed-dinner-34937 (10/07/2020, 4:25 PM)
    Hello Team, I was wondering what the LinkedIn GitHub release cycle is like. This is to see if we can regularly update the code on our side with changes made by LinkedIn. Thank you
  • nutritious-bird-77396 (10/07/2020, 10:52 PM)
    Dear Team… I am working on exposing an API endpoint to populate DatasetSnapshot metadata. I am having some issues when deserializing the fields -> type -> type within the SchemaMetadata aspect: https://github.com/linkedin/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/schema/SchemaFieldDataType.pdl I guess I should set the type of data as a value in my Jackson deserialization in order for me to set the corresponding type, but I am having challenges with that. If LinkedIn or anyone in the community has handled such a case with Jackson deserialization, kindly help out. Details of the input/error are in the thread.
  • high-hospital-85984 (10/08/2020, 9:58 AM)
    We’re trying to get an understanding of the storage volume needs for ES and Neo4J. Is anyone willing to share some numbers, for example GB in Neo4J/ES versus the number of elements in DataHub?
  • high-hospital-85984 (10/09/2020, 8:52 AM)
    We’re thinking about the level of persistence needed, especially for Neo4J, as we might end up hosting it ourselves. In case we need to regenerate the data in Neo4J, what's the best approach? I guess we can keep a full history in the Kafka topic and reset the cursor manually, but that feels suboptimal from a scaling perspective. Is there a way to tell GMS to resend the MAE messages based on what's in MySQL? Backups are of course nice, but we see a risk of the ES and Neo4J backups getting out of sync in case of a disaster. Therefore, it would be nice to have a way to repopulate the DBs as a fallback. Or maybe we're just overthinking this 😅
  • hallowed-dinner-34937 (10/09/2020, 2:06 PM)
    Hello, this may be a question that could be easily answered if I just ventured into the GitHub repo and looked around, but I decided to take this route instead, aha! I'm wondering if there is any documentation, or if someone could point me towards the code, that would show me how the DataHub API works. We're currently looking to ingest data into DataHub from external processes. For example, an external process will create some documentation in Google Drive when a new table/entity is created; after this, another separate process will push the link to this Google Doc into DataHub along with the new table.
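    (As a rough sketch of the REST route only: GMS exposes a Rest.li ingest endpoint, and external links such as a Google Doc typically live in a dataset's InstitutionalMemory aspect. The endpoint, port, URN and payload below are illustrative and should be checked against the GMS API docs before use:)
    curl 'http://localhost:8080/datasets?action=ingest' -X POST \
      -H 'X-RestLi-Protocol-Version: 2.0.0' \
      --data '{
        "snapshot": {
          "urn": "urn:li:dataset:(urn:li:dataPlatform:hive,my_db.my_table,PROD)",
          "aspects": [{
            "com.linkedin.common.InstitutionalMemory": {
              "elements": [{
                "url": "https://docs.google.com/document/d/<doc-id>",
                "description": "Auto-generated table documentation",
                "createStamp": {"time": 0, "actor": "urn:li:corpuser:datahub"}
              }]
            }
          }]
        }
      }'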