Just a quick update about where I'm at with my work... DataHub #integrate-iceberg-datahub

modern-monitor-81461

01/18/2022, 10:33 PM

Just a quick update about where I'm at with my work. I have been working on profiling for the past few days and I managed to get this. You can ignore the Distinct Count and Distinct % since it is not possible to compute those using only the manifest metrics: there is a set of metrics for each data file, so a distinct value in file A and a distinct value in file B do not mean that we have 2 distinct values in the table (it could be the same value, in which case the distinct count for the table would be 1). Min, Max, Null Count and Null % are reliable though. Is it a problem if the Iceberg source profiling does not provide a full picture? My code would greatly benefit from a review since I don't think I leveraged all the tooling from DataHub ingestion. What would you recommend? That I polish it as much as I can and then ask for a review, or that I ask sooner in case I need to do a big refactoring? I don't want to waste your time too much, but I don't want to waste mine either! 😉
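
To illustrate the point about per-file metrics, here is a minimal Python sketch. It is not the actual Iceberg source code: `FileColumnMetrics`, `ColumnProfile` and `merge_metrics` are hypothetical names, and integer bounds are assumed for simplicity. It shows why min, max and null counts can be combined across data files, while distinct counts cannot.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FileColumnMetrics:
    """Hypothetical per-data-file metrics for one column (not an Iceberg API)."""
    row_count: int
    null_count: int
    lower_bound: Optional[int]  # None when the file has no bound for the column
    upper_bound: Optional[int]


@dataclass
class ColumnProfile:
    """Hypothetical table-level profile for one column."""
    row_count: int
    null_count: int
    null_pct: float
    min_value: Optional[int]
    max_value: Optional[int]


def merge_metrics(files: List[FileColumnMetrics]) -> ColumnProfile:
    """Combine per-file metrics into a table-level profile.

    Counts are additive, and the table min/max is the min/max of the
    per-file bounds, so these can be merged without reading the data.
    """
    rows = sum(f.row_count for f in files)
    nulls = sum(f.null_count for f in files)
    lowers = [f.lower_bound for f in files if f.lower_bound is not None]
    uppers = [f.upper_bound for f in files if f.upper_bound is not None]
    return ColumnProfile(
        row_count=rows,
        null_count=nulls,
        null_pct=(nulls / rows) if rows else 0.0,
        min_value=min(lowers) if lowers else None,
        max_value=max(uppers) if uppers else None,
    )


# Distinct counts are not additive: the same value can appear in several files.
file_a_values = [7, 7, 7]
file_b_values = [7]
summed_distinct = len(set(file_a_values)) + len(set(file_b_values))
true_distinct = len(set(file_a_values + file_b_values))
```

In this toy example `summed_distinct` is 2 while `true_distinct` is 1, which is the situation described above: an exact distinct count needs the values themselves, not just per-file summaries.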

mammoth-bear-12532

01/19/2022, 1:39 AM

this looks really cool! @chilly-holiday-80781 might be a great person to help shepherd your code.

chilly-holiday-80781

01/19/2022, 2:12 AM

Wow, this is great! I’d love to help you out with a review. It should be fine if we can’t get the complete set of metrics, and I can check if there are any tools that could make your life easier.

little-megabyte-1074

01/19/2022, 7:17 PM

@modern-monitor-81461 this is SO EXCITING!!!! Thank you so, so much for driving this forward!
