# feature-requests
a
Hi Datahub team! I’m looking through the Q3 2021 Roadmap and was trying to find the Data Quality and Health visualization features. Have those been released yet?
l
Hi Juan! We haven’t released it yet - we are aiming to tackle some of the foundational modeling work required to support it by the end of 2021; surfacing that detail in the UI will come thereafter 🙂
thankyou 2
a
Thanks @little-megabyte-1074! Super interested in that feature, so if there’s a way we can collaborate on the requirements that would be great!
b
@acceptable-potato-35922 We'd love to! What do you have in mind?
a
Hey @big-carpet-38439! One of our areas of focus is building trust in the data, and we think transparent Quality and Health metrics would be a great way to build that trust. We are looking for a module where we can systematically collect metrics for the data as opposed to manually “stamping” it as good or bad data. We haven’t fully defined what our quality KPIs would be, but off the top of my head I’m thinking of something along the lines of:
• On time delivery based on a digital contract and the % of times the dataset has been on time in the past X days (e.g. 7, 60, 90 days)
• Data points within the normal distribution (and highlight outliers)
• Consistent record/volume increases (based on past heuristics)
• Completeness (i.e. was data for all markets/users collected or are any of them still missing?)
It would be great to have a module in DataHub that can reflect these metrics agnostically and then let the user decide if that quality is good enough for them to use or not.
Any chance we could get a walk through for our team of what the DataHub team is planning on building for Data Quality/Health? It’ll help us get an idea of where the roadmap is, and we can start collaborating from there.
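To make the four proposed metrics concrete, here is a minimal Python sketch of how each could be computed. It is only an illustration under stated assumptions: the function names, the `(due_at, delivered_at)` pair shape, the z-score rule for outliers, and the `tolerance` parameter are all hypothetical and are not part of DataHub or any library mentioned in this thread.

```python
from datetime import datetime, timedelta
from statistics import mean, stdev


def on_time_rate(deliveries, window_days, now):
    """Share of deliveries due in the last `window_days` that arrived on time.

    `deliveries` is a list of (due_at, delivered_at) datetime pairs (assumed shape).
    Returns None when no delivery falls inside the window.
    """
    cutoff = now - timedelta(days=window_days)
    recent = [(due, got) for due, got in deliveries if due >= cutoff]
    if not recent:
        return None
    return sum(got <= due for due, got in recent) / len(recent)


def outliers(values, z_threshold=3.0):
    """Values whose z-score exceeds `z_threshold` (assumes roughly normal data)."""
    if len(values) < 2:
        return []
    mu, sigma = mean(values), stdev(values)
    if sigma == 0:
        return []
    return [v for v in values if abs(v - mu) / sigma > z_threshold]


def volume_is_consistent(history, latest, tolerance=0.5):
    """True if `latest` row count is within `tolerance` (a fraction) of the historical mean."""
    baseline = mean(history)
    return abs(latest - baseline) <= tolerance * baseline


def completeness(expected, observed):
    """Fraction of expected keys (e.g. markets) that actually appear in the data."""
    expected = set(expected)
    if not expected:
        return 1.0
    return len(expected & set(observed)) / len(expected)


if __name__ == "__main__":
    now = datetime(2021, 11, 1)
    deliveries = [
        (datetime(2021, 10, 30, 6), datetime(2021, 10, 30, 5)),  # on time
        (datetime(2021, 10, 31, 6), datetime(2021, 10, 31, 7)),  # late
        (datetime(2021, 9, 1, 6), datetime(2021, 9, 1, 9)),      # late, older
    ]
    for days in (7, 60, 90):
        print(days, on_time_rate(deliveries, days, now))
    print(completeness(["US", "DE", "BR", "JP"], ["US", "DE", "JP"]))
```

Each function returns a plain number or list rather than a pass/fail verdict, which matches the idea of surfacing the metrics and letting the consumer decide whether the quality is good enough.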
l
Hi @acceptable-potato-35922, apologies for the delay on my side! Let me get back to you next week on next steps; I’m juggling a few different roadmap conversations & should know more then 🙂
a
Perfect! Thanks Maggie
b
This sounds great - one thing we've definitely been interested in is modeling "contracts" on DataHub. But since DataHub is not the place where that contract is implemented, the best we'd be able to do is reactively monitor that the contract is being adhered to, as you are alluding to.
p
Hi @acceptable-potato-35922 - Defining the data quality rules and having them periodically checked has already been "solved" by the likes of AWS Deequ or Great Expectations. Certainly your last three bullets can be covered by these DQ libraries. For the first bullet, I guess you need Airflow's SLAs or something similar, not entirely sure. Very happy the DataHub team is indeed working on surfacing these stats one way or another in the DataHub UI, but IMHO, the real problem afterwards is the CRUD of these rules/checks (including auto-suggested rules based off profiling data). Not sure whether that is territory where DataHub wants/needs to play. Commercial platforms such as Collibra have also identified this need to bring DQ closer to the business user. That's why they bought OwlDQ last year (https://dq-docs.collibra.com/)
👍 1
a
@big-carpet-38439 - Yes, completely agree. I too am thinking about using DataHub as a way to surface these metrics, but not necessarily as a place to create or implement digital contracts. I would love to see if we can get a sneak peek of what you all are thinking for the Data Quality & Health module.
@proud-addition-27250 Thanks a lot for that. Great Expectations looks very interesting! I was poking around in the documentation and it looks like a tool to connect to the different pipelines that are creating the data (which I think is great). Do you know if it’s able to graphically reflect the results…ideally in DataHub 😉 ? (I couldn’t find that part). I’m basically trying to get to what Kaggle does in their Data Explorer. Here’s an example: https://www.kaggle.com/omkarborikar/top-10000-popular-movies
b
Really like this! So what’s on our roadmap is integration with Great Expectations to display the result of your expectation (test) suite. I think @miniature-tiger-96062 has been looking into this more deeply, perhaps he can provide a better look :)
I think it’d also be useful to be able to combine metadata signals and checks into a single health summary, but that will be layered on top
And is not super clear to us yet. We’d love to collaborate on this
a
Awesome. We’d love that as well. We would love to hop on a call and get a walkthrough of where you guys are at (if possible), and from there we can determine how to collaborate on it. Just a heads up that we are still in very very early stages in our Data Quality journey.
b
cc. @little-megabyte-1074 Can we set something up to this effect?
l
Yep, it’s on my to-do list for this week to figure out next steps!
m
Hi @acceptable-potato-35922, as John pointed out, GE integration with DataHub is definitely on our roadmap for this year. Starting off, we'd at least want to reflect the quality check, the result of the check, and the associated dataset in the DataHub UI. We are still working on the details but hope that gives some insight.
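As a rough illustration of the three pieces mentioned here (the check, its result, and the associated dataset), the relationship could be sketched in Python as below. This is not DataHub's actual metadata model, which was still being designed at the time of this thread. `QualityCheck`, `CheckResult`, and `summarize` are hypothetical names, and the dataset URN string and check names are just example values.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List


@dataclass(frozen=True)
class QualityCheck:
    """A single quality rule, e.g. one expectation from a test suite (hypothetical type)."""
    name: str
    description: str


@dataclass(frozen=True)
class CheckResult:
    """Outcome of running one check against one dataset (hypothetical type)."""
    check: QualityCheck
    dataset_urn: str
    passed: bool
    run_at: datetime


def summarize(results: List[CheckResult], dataset_urn: str) -> dict:
    """Collect the results for one dataset into a small summary a UI could render."""
    mine = [r for r in results if r.dataset_urn == dataset_urn]
    failed = [r.check.name for r in mine if not r.passed]
    return {
        "dataset": dataset_urn,
        "total": len(mine),
        "passed": len(mine) - len(failed),
        "failed_checks": failed,
    }


if __name__ == "__main__":
    # Example values only; the URN and check names are illustrative.
    urn = "urn:li:dataset:(urn:li:dataPlatform:hive,sales.orders,PROD)"
    not_null = QualityCheck("order_id_not_null", "order_id must never be null")
    row_count = QualityCheck("row_count_in_range", "daily row count within expected range")
    now = datetime(2021, 11, 29, tzinfo=timezone.utc)
    results = [
        CheckResult(not_null, urn, True, now),
        CheckResult(row_count, urn, False, now),
    ]
    print(summarize(results, urn))
```

Keeping the check definition separate from its result means the same rule can accumulate a history of runs per dataset, which is what a health view over time would need.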
a
That sounds great! Really looking forward to it 😀
👍 1
l
Hi @acceptable-potato-35922! Hope you had a great weekend 🙂 We’re planning to begin design/development for data quality (specifically targeting Great Expectations) starting the week of November 29th; we’d love to chat with you that week! Is there a day/time that works best for you?
👀 1
You’re also welcome to grab time here
a
Thanks @little-megabyte-1074! I grabbed some time on that Thursday. Just 30 mins for now to get the ball rolling.
l
Looking forward to it!!
p
Hi @acceptable-potato-35922 - sorry for the late reply, it has been busy at work. Take a look at GE Data Docs; I think it includes data profiling (which is what I saw on the Kaggle link you sent).
thankyou 1
@little-megabyte-1074 if you want, I could also make some time to discuss DQ in Datahub. I am involved in capturing requirements at the moment for a DQ platform in our company. Let me know if you are interested to connect around that topic.
l
Hi @proud-addition-27250, that would be amazing!! Would you mind scheduling some time with me here?
p
Done!