hello team a general question about `data replay strategy` f DataHub #getting-started


acceptable-architect-70237

03/02/2021, 4:39 PM

hello, team, a general question about `data replay strategy`. For example, in our case, we need to calculate a dataset's data quality. The data quality is calculated based on the aspects of a dataset. Since all datasets are already in the datastores (MySQL, Neo4j and Elasticsearch), we need a way to pull the data and do the calculation. Right now we are pulling data from MySQL using a Python script. Do you guys have any suggestions?
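For illustration only, here is a minimal sketch of the kind of calculation described above, assuming "data quality" means the fraction of expected aspects that a dataset actually has. The aspect names in `EXPECTED_ASPECTS`, the scoring rule, and the `quality_scores` helper are hypothetical and not taken from the thread; the input is any iterable of `(urn, aspect_name)` pairs, for example rows returned by a SQL cursor.

```python
from collections import defaultdict

# Hypothetical set of aspects a "complete" dataset is expected to have.
EXPECTED_ASPECTS = ("schemametadata", "ownership", "datasetproperties")


def quality_scores(rows, expected=EXPECTED_ASPECTS):
    """Return {urn: score}, where score is the share of expected aspects present.

    rows: iterable of (urn, aspect_name) pairs, e.g. from a SQL cursor.
    """
    seen = defaultdict(set)
    for urn, aspect in rows:
        # Aspect names may be stored fully qualified (assumption);
        # keep only the last segment and compare case-insensitively.
        seen[urn].add(aspect.rsplit(".", 1)[-1].lower())
    return {
        urn: sum(name in present for name in expected) / len(expected)
        for urn, present in seen.items()
    }
```

Keeping the scoring function independent of the database driver means the same logic could run against rows from MySQL or any other source.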

mammoth-bear-12532

03/02/2021, 5:02 PM

@acceptable-architect-70237: do you mean "data quality" as metadata quality scores computed on the aspects stored in datahub?

acceptable-architect-70237

03/02/2021, 5:09 PM

yes, it's one of my use cases. Another use case might be that, for example, I add a new property to an aspect, `number of schema fields`, and need ES to index this property. I need to `replay` all data to do so.
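As a rough sketch of what such a replay could look like, the snippet below pages through stored aspect rows and derives a field count for each dataset. It is an assumption-heavy illustration, not DataHub's own mechanism: the table name `metadata_aspect`, its columns (`urn`, `aspect`, `version`, `metadata`), the convention that `version = 0` is the latest row, the aspect name `schemaMetadata`, and the `fields` key in the JSON are all assumed here. It uses `sqlite3` as a stand-in connection so it runs without a server; a MySQL driver would typically use `%s` placeholders instead of `?`.

```python
import json
import sqlite3


def replay_field_counts(conn, page_size=500):
    """Yield (urn, number_of_schema_fields) by paging through stored aspects."""
    last_urn = ""
    while True:
        # Keyset pagination: resume after the last urn seen instead of
        # using OFFSET, so each page costs roughly the same.
        page = conn.execute(
            "SELECT urn, metadata FROM metadata_aspect "
            "WHERE aspect = ? AND version = 0 AND urn > ? "
            "ORDER BY urn LIMIT ?",
            ("schemaMetadata", last_urn, page_size),
        ).fetchall()
        if not page:
            return
        for urn, metadata in page:
            # The derived property: how many fields the schema declares.
            yield urn, len(json.loads(metadata).get("fields", []))
        last_urn = page[-1][0]
```

Each yielded pair could then be written to the search index by whatever client is already in use; that step is left out because it depends on the index mapping.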

mammoth-bear-12532

03/04/2021, 6:01 AM

@acceptable-architect-70237: sorry for dropping this thread. The answer is not as straightforward today as we would like it, even though we have a good source-of-truth story with our metadata. I'll create an issue to track this.

mammoth-bear-12532

03/04/2021, 6:12 AM

https://github.com/linkedin/datahub/issues/2170

acceptable-architect-70237

03/04/2021, 2:25 PM

Thanks. I was trying to get an idea of what you guys think, but I see you have already provided your feedback on possible solutions in the issue.
