# getting-started
n
We are looking at a use-case where data-profiling information such as count of events, max, min, etc. is pushed every few minutes for every dataset in the org. Has LinkedIn dealt with such a use-case? What special considerations need to be taken care of in the architecture? For example: data-profiling info for 30,000 datasets pushed every 5 mins.
l
@mammoth-bear-12532 can comment. Also @gray-shoe-75895 who is looking into data profiling
just curious - why every 5 mins?
is this for streaming use-cases?
m
@nutritious-bird-77396: we haven't done this at LinkedIn yet, although it was always on the roadmap :) The volume of events is fine, but you would want to skip the MySQL / relational store and add something like Pinot as a consumer to the Kafka stream. GMA doesn't support this yet.
I mean the volume of events is fine for Kafka, but you wouldn't want to store it in MySQL
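To put the numbers in perspective: 30,000 profile events every 5 minutes is only about 100 messages per second, which Kafka handles easily. Below is a minimal sketch of the producer side, assuming the kafka-python client and a hypothetical `dataset_profile_events` topic; this is not the GMA event model, just an illustration of the shape of the stream a Pinot real-time table could consume instead of the MySQL-backed store.

```python
# Minimal sketch (not the GMA event model): push one profile event per dataset
# to a Kafka topic that a Pinot real-time table could consume.
# Assumptions: kafka-python client, hypothetical topic "dataset_profile_events".
import json
import time

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def emit_profile(dataset_urn: str, row_count: int, max_val: float, min_val: float) -> None:
    """Send one data-profiling event; 30,000 of these per 5 min is ~100 msg/s."""
    event = {
        "datasetUrn": dataset_urn,
        "rowCount": row_count,
        "max": max_val,
        "min": min_val,
        "profileTimeMs": int(time.time() * 1000),
    }
    # Keyed by dataset URN so events for one dataset stay ordered within a partition.
    producer.send("dataset_profile_events", key=dataset_urn.encode("utf-8"), value=event)

emit_profile("urn:li:dataset:(urn:li:dataPlatform:kafka,PageViewEvent,PROD)", 123456, 99.9, 0.1)
producer.flush()
```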
l
hourly should be sufficient for the majority of datasets?
n
@loud-island-88694 Yes, it's for streaming datasets. @mammoth-bear-12532 Not only MySQL but also Neo4j; we saw issues in the past with Neo4j sync when one of our producers mistakenly sent metadata for all the datasets every 5 mins.
l
yeah for high freq data like this, we need something like Pinot
n
If we do decide to take the Pinot route... is there value in passing through DataHub? Wouldn't directly storing and retrieving from Pinot be sufficient?
b
For visualization, yes, but if you aren't visualizing via DH UI I'm not so sure
l
It is important when you get into Data Observability use cases. Let's say you notice a week-over-week difference in a data profile and you need to debug: did it happen because of a change in a dimension table that you are joining against in your streaming pipeline, or because the completeness of the data is different compared to last week? (Metadata about the completeness of partitions will be attached to datasets.)
Lineage information is important for debugging differences in profile
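To make the debugging workflow concrete: once the profile events land in Pinot, the week-over-week comparison above is a single aggregation query per window. A rough sketch, assuming the pinotdb Python client and a hypothetical `dataset_profile_events` table with `datasetUrn`, `rowCount`, and `profileTimeMs` columns (the real table and column names would come from whatever schema you register).

```python
# Rough sketch of the week-over-week check, assuming the pinotdb client and a
# hypothetical Pinot table dataset_profile_events(datasetUrn, rowCount, profileTimeMs).
import time

from pinotdb import connect

conn = connect(host="localhost", port=8099, path="/query/sql", scheme="http")
cursor = conn.cursor()

def avg_row_count(urn: str, start_ms: int, end_ms: int) -> float:
    """Average profiled row count for one dataset over [start_ms, end_ms)."""
    cursor.execute(
        f"SELECT AVG(rowCount) FROM dataset_profile_events "
        f"WHERE datasetUrn = '{urn}' "
        f"AND profileTimeMs >= {start_ms} AND profileTimeMs < {end_ms}"
    )
    row = cursor.fetchone()
    return float(row[0]) if row and row[0] is not None else 0.0

WEEK_MS = 7 * 24 * 3600 * 1000
now_ms = int(time.time() * 1000)
urn = "urn:li:dataset:(urn:li:dataPlatform:kafka,PageViewEvent,PROD)"

this_week = avg_row_count(urn, now_ms - WEEK_MS, now_ms)
last_week = avg_row_count(urn, now_ms - 2 * WEEK_MS, now_ms - WEEK_MS)
print(f"week-over-week change: {this_week - last_week:+.0f} rows on average")
```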