# integrate-iceberg-datahub
m
Just a quick update about where I'm at with my work. I have been working on profiling for the past few days and I managed to get this. You can ignore the `Distinct Count` and `Distinct %` since it is not possible to compute those using only the manifest metrics (there is a set of metrics for each data file, so a distinct value in file A and a distinct value in file B do not mean that we have 2 distinct values in the table... it could be the same value, so the distinct count for the table would be 1). `Min`, `Max`, `Null Count` and `Null %` are reliable though. Is it a problem if the Iceberg source profiling does not provide a full picture? My code would greatly benefit from a review since I don't think I leveraged all the tooling from DataHub ingestion. What would you recommend? That I try to polish it as much as I think I can and then ask for a review, or do this sooner in case I need to do a big refactoring? I don't want to waste your time too much, but I don't want to waste mine either! 😉
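To make the point about per-file metrics concrete, here is a minimal sketch of the aggregation. It is not the actual ingestion code: `FileMetrics` and `merge_metrics` are hypothetical names made up for illustration and do not come from the Iceberg or DataHub APIs. The assumption is that each data file carries a row count, a null count and lower/upper bounds for a column.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FileMetrics:
    """Hypothetical per-data-file metrics for one column (not the Iceberg API)."""
    row_count: int
    null_count: int
    lower_bound: Optional[int]
    upper_bound: Optional[int]


def merge_metrics(files: List[FileMetrics]) -> dict:
    """Combine per-file metrics into table-level Min, Max, Null Count and Null %."""
    rows = sum(f.row_count for f in files)
    nulls = sum(f.null_count for f in files)
    lowers = [f.lower_bound for f in files if f.lower_bound is not None]
    uppers = [f.upper_bound for f in files if f.upper_bound is not None]
    return {
        # The min of the per-file mins is the table min; same idea for max.
        "min": min(lowers) if lowers else None,
        "max": max(uppers) if uppers else None,
        # Null counts are additive across files.
        "null_count": nulls,
        "null_pct": 100.0 * nulls / rows if rows else None,
        # No distinct count here: the value 7 may appear in both files,
        # so per-file distinct counts cannot simply be added up.
    }


file_a = FileMetrics(row_count=3, null_count=1, lower_bound=7, upper_bound=7)
file_b = FileMetrics(row_count=2, null_count=0, lower_bound=7, upper_bound=9)
table_profile = merge_metrics([file_a, file_b])
```

Min, max and null counts combine cleanly because they are associative across files; a distinct count would need the actual values (or a mergeable sketch of them), which the manifest metrics do not carry.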
m
this looks really cool! @chilly-holiday-80781 might be a great person to help shepherd your code.
c
Wow, this is great! I’d love to help you out with a review. It should be fine if we can’t get the complete set of metrics, and I can check if there are any tools that could make your life easier.
l
@modern-monitor-81461 this is SO EXCITING!!!! Thank you so, so much for driving this forward