# getting-started
n
We are looking at a use-case where data-profiling information such as count of events, max, min, etc. is pushed every few minutes for every dataset in the org. Has LinkedIn dealt with such a use-case? What special considerations need to be taken care of in the architecture? For example: data-profiling info for 30,000 datasets pushed every 5 mins.
l
@mammoth-bear-12532 can comment. Also @gray-shoe-75895 who is looking into data profiling
just curious - why every 5 mins?
is this for streaming use-cases?
m
@nutritious-bird-77396: we haven't done this at LinkedIn yet, although it was always on the roadmap :) The volume of events is fine, but you would want to skip the MySQL / relational store and add something like Pinot as a consumer to the Kafka stream. GMA doesn't support this yet.
I mean the volume of events is fine for Kafka, but you wouldn't want to store it in MySQL
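To put the numbers in perspective: 30,000 profile events every 5 minutes is only about 100 messages per second, which Kafka handles easily. Below is a minimal sketch of the producer side, assuming the kafka-python client and a hypothetical `dataset_profile_events` topic; this is not the GMA event model, just an illustration of the shape of the stream a Pinot real-time table could consume instead of the MySQL-backed store.

```python
# Minimal sketch (not the GMA event model): push one profile event per dataset
# to a Kafka topic that a Pinot real-time table could consume.
# Assumptions: kafka-python client, hypothetical topic "dataset_profile_events".
import json
import time

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def emit_profile(dataset_urn: str, row_count: int, max_val: float, min_val: float) -> None:
    """Send one data-profiling event; 30,000 of these per 5 min is ~100 msg/s."""
    event = {
        "datasetUrn": dataset_urn,
        "rowCount": row_count,
        "max": max_val,
        "min": min_val,
        "profileTimeMs": int(time.time() * 1000),
    }
    # Keyed by dataset URN so events for one dataset stay ordered within a partition.
    producer.send("dataset_profile_events", key=dataset_urn.encode("utf-8"), value=event)

emit_profile("urn:li:dataset:(urn:li:dataPlatform:kafka,PageViewEvent,PROD)", 123456, 99.9, 0.1)
producer.flush()
```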
l
hourly should be sufficient for the majority of datasets?
n
@loud-island-88694 Yes, it's for streaming datasets. @mammoth-bear-12532 Not only MySQL but also Neo4j; we saw issues in the past with Neo4j sync when one of our producers mistakenly sent metadata for all the datasets every 5 mins.
l
yeah for high freq data like this, we need something like Pinot
n
If we do decide to take the Pinot route... is there value in passing through DataHub? Wouldn't directly storing and retrieving from Pinot be sufficient?
b
For visualization, yes, but if you aren't visualizing via DH UI I'm not so sure
l
It is important when you get into Data Observability use cases. Let's say you notice a week-over-week difference in a data profile and you need to debug: did it happen because of a change in a dimension table that you are joining against in your streaming pipeline, or because the completeness of the data is different compared to last week? (Metadata about the completeness of partitions will be attached to datasets.)
Lineage information is important for debugging differences in profile
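To make the debugging workflow concrete: once the profile events land in Pinot, the week-over-week comparison above is a single aggregation query per window. A rough sketch, assuming the pinotdb Python client and a hypothetical `dataset_profile_events` table with `datasetUrn`, `rowCount`, and `profileTimeMs` columns (the real table and column names would come from whatever schema you register).

```python
# Rough sketch of the week-over-week check, assuming the pinotdb client and a
# hypothetical Pinot table dataset_profile_events(datasetUrn, rowCount, profileTimeMs).
import time

from pinotdb import connect

conn = connect(host="localhost", port=8099, path="/query/sql", scheme="http")
cursor = conn.cursor()

def avg_row_count(urn: str, start_ms: int, end_ms: int) -> float:
    """Average profiled row count for one dataset over [start_ms, end_ms)."""
    cursor.execute(
        f"SELECT AVG(rowCount) FROM dataset_profile_events "
        f"WHERE datasetUrn = '{urn}' "
        f"AND profileTimeMs >= {start_ms} AND profileTimeMs < {end_ms}"
    )
    row = cursor.fetchone()
    return float(row[0]) if row and row[0] is not None else 0.0

WEEK_MS = 7 * 24 * 3600 * 1000
now_ms = int(time.time() * 1000)
urn = "urn:li:dataset:(urn:li:dataPlatform:kafka,PageViewEvent,PROD)"

this_week = avg_row_count(urn, now_ms - WEEK_MS, now_ms)
last_week = avg_row_count(urn, now_ms - 2 * WEEK_MS, now_ms - WEEK_MS)
print(f"week-over-week change: {this_week - last_week:+.0f} rows on average")
```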