# getting-started

    bumpy-keyboard-50565

    03/07/2020, 2:17 PM
    Welcome @dazzling-judge-80093 ! Since most people are signing up using their personal email addresses, you might wanna share your company name as well.

    magnificent-exabyte-60779

    03/17/2020, 7:37 PM
    Ah, is that the line saying ‘Dataset & field-level commenting’ (6m to 1y)? Are there already some ideas forming around this topic that we can learn from / contribute to? Or others that are interested in this part? I saw the previous town hall some questions were asked about this by Jordan Preston and Anand Mehrotra.

    agreeable-boots-73250

    03/23/2020, 4:56 AM
    hi

    nutritious-ghost-21337

    03/26/2020, 5:14 PM
    Hello #general, I'd like to raise a question: can we use DataHub to store metadata beyond just the metadata structure? Some context to better understand my question: my application processes a high daily volume of inbound feeds for several customers. Each feed is transformed to a data model, processed to clean/compute/add some information, and then stored as a Parquet file in a given location (so far we are working on Hadoop, but this may change at some point). For each of those feeds I'd be interested in storing information like:
    - the version of the parser/processor/etc. that generated it
    - the version of the data model used (and whether it's deprecated)
    - the inbound feed that generated it
    - the date when it was generated
    - the location where I can find the output/input feed
    - the customer owning the inbound feed
    Those are just a few examples of the metadata I would like to attach to a given dataset, mainly so that I can search through it later on. At the same time I would need to implement an ACL to restrict access to that metadata. I'm currently evaluating DataHub to assess whether it could satisfy those requirements. I therefore cloned the repository and tried some data ingestion; I've played with Rest.li to create some datasets. My first impression is that DataHub is mainly meant to store, manage, and search through the metadata structure only, but maybe I'm approaching this tool from the wrong point of view. Given the use case described above, can you tell me whether DataHub could fit my needs after implementing the appropriate extensions to the current model?
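    For illustration, a minimal sketch of attaching free-form key/value metadata to a dataset by posting a snapshot to the GMS over HTTP. The endpoint path, the exact payload shape, and the customProperties field on DatasetProperties are assumptions for this sketch, not something confirmed in this thread:
    Copy code
    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class IngestCustomProperties {
      public static void main(String[] args) throws Exception {
        // Hypothetical GMS endpoint; adjust host/port and path to your deployment.
        String gmsIngestUrl = "http://localhost:8080/datasets?action=ingest";

        // The snapshot shape (urn + aspects list) follows the MAE log shown later in
        // this channel; the customProperties field is assumed, not verified here.
        String body = "{\"snapshot\": {"
            + "\"urn\": \"urn:li:dataset:(urn:li:dataPlatform:hdfs,customer_feed_output,PROD)\","
            + "\"aspects\": [{"
            + "\"com.linkedin.dataset.DatasetProperties\": {"
            + "\"customProperties\": {"
            + "\"parserVersion\": \"2.3.1\","
            + "\"datamodelVersion\": \"v5\","
            + "\"generatedOn\": \"2020-03-26\","
            + "\"outputLocation\": \"hdfs:///feeds/customer_a/out\""
            + "}}}]}}";

        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create(gmsIngestUrl))
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(body))
            .build();

        HttpResponse<String> response =
            HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
      }
    }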

    mammoth-whale-58647

    03/26/2020, 8:31 PM
    I am a bit confused as to whether the GMS API should expose the aspect model or not. I see for instance that there are two Ownerships, one that seems to be a generic aspect, assignable to any URN and the other being an ownership dedicated to dataset ownership. There is a lot of repetition there, for instance between the "generic" OwnershipType and the dataset-specific "OwnershipCategory". I initially assumed that this was because the aspect model should not be directly exposed through the GMS but rather abstracted away into payloads designed specifically for the GMS API. However, then I noticed that the GMS's dataset ownership resource actually exposes the generic Ownership aspect, and not the dedicated dataset ownership payload from the API module. So now, I don't fully understand if we should abstract away the aspect-driven model in the gms or not.

    mammoth-whale-58647

    03/30/2020, 8:13 PM
    Hi everyone, I have another question. I noticed the GMS uses the ComplexKeyResourceTask template underneath, yet I did not find any asynchronous implementations. Is there a particular reason why the async resource template was chosen over the synchronous one? Are there any plans for an async DAO, for instance?
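    For readers less familiar with the sync/async distinction being asked about, here is a rough, self-contained analogy in plain Java using CompletableFuture (not the actual Rest.li/ParSeq Task API): the async style hands back a handle to a result that is still being computed, so the request thread is not blocked on a slow DAO call.
    Copy code
    import java.util.concurrent.CompletableFuture;

    public class AsyncVsSyncSketch {
      // Synchronous style: the calling thread blocks until the lookup returns.
      static String getDatasetSync(String urn) {
        return slowDaoLookup(urn);
      }

      // Asynchronous style: return a future immediately; the lookup runs on a
      // worker thread and the caller is free to do other work in the meantime.
      static CompletableFuture<String> getDatasetAsync(String urn) {
        return CompletableFuture.supplyAsync(() -> slowDaoLookup(urn));
      }

      // Stand-in for a blocking DAO/database call.
      static String slowDaoLookup(String urn) {
        try {
          Thread.sleep(200);
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        }
        return "snapshot for " + urn;
      }

      public static void main(String[] args) {
        System.out.println(getDatasetSync("urn:li:dataset:example"));
        getDatasetAsync("urn:li:dataset:example").thenAccept(System.out::println).join();
      }
    }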

    plain-arm-6774

    04/14/2020, 3:28 PM
    Hello all, I came across this blog post on how LinkedIn does authorizations. I was wondering whether DataHub also follows that model. A quick github repo search didn't find obvious ways to configure authorizations. Could someone link me to docs or code that may point me in the right direction? Thanks!

    brash-lock-91510

    04/27/2020, 7:19 AM
    Hello all, here to learn more about how to use DataHub.

    wide-teacher-69432

    04/27/2020, 10:02 AM
    Each dataset is associated with a platform through its URN definition. There exists a model definition (com.linkedin.dataplatform.DataPlatformInfo.pdsc). I believe this data platform model is currently not used. Was the idea to have data platform as an entity along with the other 3 entities (dataset, user, userGroup)? I can imagine that information about the platform where datasets are stored might be interesting…
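    As background for the question above, the platform association is already encoded in the dataset URN itself; a trivial sketch of pulling the pieces out of a URN of the form used elsewhere in this channel (naive string parsing, purely for illustration):
    Copy code
    public class DatasetUrnSketch {
      public static void main(String[] args) {
        // Format seen in this channel: urn:li:dataset:(<platform urn>,<name>,<fabric>)
        String datasetUrn = "urn:li:dataset:(urn:li:dataPlatform:kafka,SampleKafkaDataset,PROD)";

        String inner = datasetUrn.substring(datasetUrn.indexOf('(') + 1, datasetUrn.lastIndexOf(')'));
        String[] parts = inner.split(",");
        System.out.println("platform: " + parts[0]); // urn:li:dataPlatform:kafka
        System.out.println("name:     " + parts[1]); // SampleKafkaDataset
        System.out.println("fabric:   " + parts[2]); // PROD
      }
    }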

    mammoth-whale-58647

    04/28/2020, 6:13 PM
    Hello everyone, I have another modeling question. I have noticed that for most "standard" resource operations, a complex key (e.g. DatasetKey) is used, whereas for the actions (e.g. ingest, backfill) the URN seems to be the preferred identifier. Is there a particular reason for this? The reason I ask is that I would like to model "sub-entities" where both the root and the sub-entity can have aspects. I decided against associations because the sub-resource really cannot exist without the parent, and I didn't want to embed the sub-resource because it is often accessed and modified independently of the parent resource. Sub-resources generally work quite well in Rest.li, and they do here for the "standard" resource operations, where I can access/combine parent and child keys as I see fit. It is slightly different for the actions, though. To give you some context, my parent resource has a composite key of 3 fields, and my sub-entity adds a fourth field to that key. Just to access the actions on the sub-resource, I already need to provide those 3 fields, e.g. /parent/field1,field2,field3/child?action=myaction. But then I would have to provide the full URN to the sub-resource, which repeats those 3 fields and adds a fourth. It seems redundant, and it had me wondering why the ingest and backfill actions aren't just at the entity resource level, leveraging the same keys that the GET functions use. Before making such modifications, I wanted to better grasp why URNs are sometimes preferred and document keys at other times.

    wide-teacher-69432

    04/29/2020, 5:51 AM
    When trying to build datahub, gradle hangs here:
    Copy code
    <====---------> 34% EXECUTING [1h 20m 42s]
    > :datahub-web:emberWorkspaceTest
    > :datahub-frontend:compilePlayBinaryScala
    This happens on OSX as well as on Linux. Any idea how this can be resolved?

    wide-teacher-69432

    04/29/2020, 8:39 PM
    …another newbie question… I’ve followed the “Compensation” tutorial from one of the last town hall recordings - many thanks for that, really useful! It works and I do see the changes as expected. I do see, however, also an exception in mae-consumer-job (which I’ve restarted after a clean build to pick up the changes). So I’m wondering what I was missing… I don’t expect that the new compensation aspects will automagically appear in Elastic or in Neo4J, I’m just wondering about that exception… Thanks for your support and your patience!
    Copy code
    19:23:26.513 [mae-consumer-job-client-StreamThread-1] INFO  c.l.m.k.MetadataAuditEventsProcessor - {com.linkedin.metadata.snapshot.CorpUserSnapshot={urn=urn:li:corpuser:datahub, aspects=[{com.linkedin.identity.CompensationPackage={weeklyPay=1234, targetBonus=10}}]}}
    19:23:26.544 [mae-consumer-job-client-StreamThread-1] ERROR c.l.m.k.MetadataAuditEventsProcessor - java.lang.RuntimeException:
    com.linkedin.metadata.dao.utils.RecordUtils.invokeProtectedMethod(RecordUtils.java:257)
     com.linkedin.metadata.dao.utils.RecordUtils.getRecordTemplateField(RecordUtils.java:176)
     com.linkedin.metadata.builders.graph.BaseGraphBuilder.build(BaseGraphBuilder.java:41)
     com.linkedin.metadata.kafka.MetadataAuditEventsProcessor.updateNeo4j(MetadataAuditEventsProcessor.java:78)
     com.linkedin.metadata.kafka.MetadataAuditEventsProcessor.processSingleMAE(MetadataAuditEventsProcessor.java:62)
     com.linkedin.metadata.kafka.config.KafkaStreamsConfig.lambda$kStream$0(KafkaStreamsConfig.java:64)
     org.apache.kafka.streams.kstream.internals.KStreamPeek$KStreamPeekProcessor.process(KStreamPeek.java:42)
     org.apache.kafka.streams.processor.internals.ProcessorNode.process(ProcessorNode.java:117)
     org.apache.kafka.streams.processor.internals.ProcessorContextImpl.forward(ProcessorContextImpl.java:201)
     org.apache.kafka.streams.processor.internals.ProcessorContextImpl.forward(ProcessorContextImpl.java:180)
     org.apache.kafka.streams.processor.internals.ProcessorContextImpl.forward(ProcessorContextImpl.java:133)
     org.apache.kafka.streams.processor.internals.SourceNode.process(SourceNode.java:87)
     org.apache.kafka.streams.processor.internals.StreamTask.process(StreamTask.java:366)
     org.apache.kafka.streams.processor.internals.AssignedStreamsTasks.process(AssignedStreamsTasks.java:199)
     org.apache.kafka.streams.processor.internals.TaskManager.process(TaskManager.java:420)
     org.apache.kafka.streams.processor.internals.StreamThread.runOnce(StreamThread.java:890)
     org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:805)
     org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:774)

    billowy-eye-48149

    04/30/2020, 6:04 PM
    Hello all, I ingested data I extracted from Google BigQuery into DataHub. Everything is working except the "view as JSON" for schema fields. Could you help me understand what is required for viewing the schema as JSON?

    billowy-eye-48149

    05/05/2020, 8:14 PM
    Hi everyone, is there any reason for using elasticsearch version 5.6.8? Can we use a newer version of Elasticsearch? The reason I am asking: due to permission restrictions on our servers, we have to get the Dockerfile from the Docker GitHub repository and update the entrypoint.sh script to build the image, but it seems the Dockerfile for version 5.6.8 is not present on GitHub.

    wide-teacher-69432

    05/07/2020, 4:10 PM
    Hi datahubers. I do have questions about aspect versions. When running the metadata-ingestion bootstrap multiple times, I do see that the SchemaMetadata aspect exists in multiple versions but other aspects have only one version. My question is: When does the system decide to create a new version? A related question is about the ingestion agents (metadata producers): Is the idea that they are stateless, e.g. should they run periodically and push metadata into datahub independent of whether metadata has changed or not? And the final question: Assume that a table gets removed from a relational database. How and when should this table be removed from datahub and by whom? Below is the content of the MySQL table where SchemaMetadata has multiple versions although the content is always the same.
    Copy code
    > select urn, aspect,version from metadata_aspect where urn='urn:li:dataset:(urn:li:dataPlatform:kafka,SampleKafkaDataset,PROD)';
    +--------------------------------------------------------------------+-----------------------------------------+---------+
    | urn                                                                | aspect                                  | version |
    +--------------------------------------------------------------------+-----------------------------------------+---------+
    | urn:li:dataset:(urn:li:dataPlatform:kafka,SampleKafkaDataset,PROD) | com.linkedin.common.InstitutionalMemory |       0 |
    | urn:li:dataset:(urn:li:dataPlatform:kafka,SampleKafkaDataset,PROD) | com.linkedin.common.Ownership           |       0 |
    | urn:li:dataset:(urn:li:dataPlatform:kafka,SampleKafkaDataset,PROD) | com.linkedin.common.PlatformLocation    |       0 |
    | urn:li:dataset:(urn:li:dataPlatform:kafka,SampleKafkaDataset,PROD) | com.linkedin.dataset.DatasetProperties  |       0 |
    | urn:li:dataset:(urn:li:dataPlatform:kafka,SampleKafkaDataset,PROD) | com.linkedin.dataset.UpstreamLineage    |       0 |
    | urn:li:dataset:(urn:li:dataPlatform:kafka,SampleKafkaDataset,PROD) | com.linkedin.schema.SchemaMetadata      |       0 |
    | urn:li:dataset:(urn:li:dataPlatform:kafka,SampleKafkaDataset,PROD) | com.linkedin.schema.SchemaMetadata      |       1 |
    | urn:li:dataset:(urn:li:dataPlatform:kafka,SampleKafkaDataset,PROD) | com.linkedin.schema.SchemaMetadata      |       2 |
    | urn:li:dataset:(urn:li:dataPlatform:kafka,SampleKafkaDataset,PROD) | com.linkedin.schema.SchemaMetadata      |       3 |
    +--------------------------------------------------------------------+-----------------------------------------+---------+

    fancy-advantage-41244

    05/11/2020, 4:04 PM
    Hi everyone, did anyone implement table-level tagging (popularity, quality/health indicator, compliance related, etc.)? I know the LinkedIn team is currently working on field-level tagging, but I would love to learn what others are doing in the meantime for table-level tagging.

    fancy-advantage-41244

    05/12/2020, 12:11 PM
    Is the Chat icon (one below the last modified) supposed to have some functionality?

    green-holiday-21281

    05/17/2020, 1:08 AM
    Hi everyone, I am building an application for data scientists to track and document their work. I am thinking of using DataHub as the backend for storing all the metadata and dependencies (datasets, notebooks, dashboards...). It seems a lot of those entities are on the roadmap but not available yet. I am concerned that this would require too many customizations in the short term, and that it would be easier to start with a generic database for now and maybe come back to DataHub later on. Any opinions out there? Sorry, I am not very familiar with the code yet and am just trying to get a sense of how difficult it is to customize. Any pointers would be appreciated if anyone has attempted something similar.

    bumpy-keyboard-50565

    05/21/2020, 5:52 PM
    <!here> Now that https://github.com/linkedin/datahub/pull/1678 has been merged, please use PDL instead of PDSC for modeling going forward.

    many-accountant-26574

    05/29/2020, 12:34 AM
    I feel really bad for asking this, and I would say I am quite well versed in Docker, docker-compose, databases, and the like, but DataHub has me stumped at the moment. I've got it all set up quite nicely, but the one thing I can't seem to fathom is how to create users, or even how to change the default datahub user's password. I've scanned the initial init.sql, but it does not reveal how the user datahub was assigned the password datahub. And why are these controls (not yet?) in the front-end? I've read almost every README.md I could find in the GitHub repo and scanned the issues for multiple keywords, but I am none the wiser. Am I missing something?

    many-accountant-26574

    05/29/2020, 12:54 AM
    Oh great, just scrolled upwards and read it hasn't been implemented yet.

    many-accountant-26574

    05/29/2020, 11:06 AM
    It occurs to me that there's no pegasus schema type for timestamps or date(times).

    many-accountant-26574

    05/29/2020, 8:05 PM
    Got a question about the consumers: how can I configure them to listen to an external Kafka cluster, and more importantly to use HTTPS and basic auth headers for API keys? I am currently using Confluent Cloud as a test environment. I couldn't find a descriptive environment variable for the specific security settings.
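    In case it helps frame the question: standard Kafka clients reach Confluent Cloud with SASL_SSL and an API key/secret in the JAAS config, and the Confluent schema registry client carries its key/secret as HTTP basic auth. A generic client-properties sketch follows; the property names are standard Kafka/Confluent client settings, but how they get wired into the DataHub consumers' environment is exactly the open question here:
    Copy code
    import java.util.Properties;

    public class ConfluentCloudProps {
      public static Properties clientProps(String bootstrap, String apiKey, String apiSecret,
                                           String srUrl, String srKey, String srSecret) {
        Properties props = new Properties();
        // Broker endpoint of the external (Confluent Cloud) cluster.
        props.put("bootstrap.servers", bootstrap);
        // Confluent Cloud brokers use TLS plus SASL/PLAIN; the API key is the
        // username and the API secret is the password.
        props.put("security.protocol", "SASL_SSL");
        props.put("sasl.mechanism", "PLAIN");
        props.put("sasl.jaas.config",
            "org.apache.kafka.common.security.plain.PlainLoginModule required "
                + "username=\"" + apiKey + "\" password=\"" + apiSecret + "\";");
        // Schema registry over HTTPS with basic auth (Confluent serializer settings).
        props.put("schema.registry.url", srUrl);
        props.put("basic.auth.credentials.source", "USER_INFO");
        props.put("basic.auth.user.info", srKey + ":" + srSecret);
        return props;
      }
    }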

    nutritious-bird-77396

    06/01/2020, 6:34 PM
    Ahh…got it…So you are saying that this is not an issue. In that case, if I try to use the generated avsc in another project to generate Java objects, it throws an error because it doesn't understand "com.linkedin.common.Ownership". Does that mean I will have to manually change the avsc in this case?

    many-accountant-26574

    06/03/2020, 12:29 AM
    🙂 got confluent cloud working.

    many-accountant-26574

    06/03/2020, 12:30 AM
    Now I want to set up a few tests, but I am a bit of a newbie to GraphQL, especially now with Avro/Pegasus combined, so please help me out haha. I've run the ms-cli tasks to ingest the sample datasets, but: 1. Can't figure out how to delete them.

    many-accountant-26574

    06/03/2020, 12:33 AM
    But why is the value of that key a tuple instead of yet another dict?

    acceptable-architect-70237

    06/03/2020, 4:16 AM
    I didn't see the status, but removed:false just showed up.

    plain-arm-6774

    06/24/2020, 5:09 AM
    Hi! I see that we have a model for DeploymentInfo but I can’t find the association to MetadataChangeEvent. Is there no way to define it via Kafka stream?

    fancy-analyst-83222

    08/11/2020, 7:29 PM
    Hello everyone, I am new to DataHub and was exploring a little. Is there an option for a minimal setup? Can I bypass Kafka and the schema registry and just use the rest? Or is there a hard dependency on these parts?