since those cannot be computed from the manifest metrics alone (there is a set of metrics for each data file, so a distinct value in file A plus a distinct value in file B does not mean the table has 2 distinct values... it could be the same value in both files, in which case the table's distinct count would be 1).
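To make the point concrete, here is a small sketch (the per-file metric dicts and their contents are purely illustrative, not the actual Iceberg manifest format): additive metrics like null counts can be summed across files, but distinct counts cannot, because the files may overlap in values.

```python
# Hypothetical per-file column metrics; "values" stands in for the
# actual file contents, which manifest metrics do not expose.
file_a = {"distinct_count": 1, "null_count": 0, "values": {"x"}}
file_b = {"distinct_count": 1, "null_count": 2, "values": {"x"}}

# Null counts are additive across files, so summing is safe.
table_null_count = file_a["null_count"] + file_b["null_count"]

# Distinct counts are NOT additive: both files hold the same value "x".
naive_distinct = file_a["distinct_count"] + file_b["distinct_count"]  # 2
true_distinct = len(file_a["values"] | file_b["values"])  # 1
```

The naive sum (2) overestimates the true table-level distinct count (1), which is exactly why the manifest metrics alone are not enough here.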
are reliable though. Is it a problem if the Iceberg source profiling does not provide a full picture? My code would greatly benefit from a review, since I don't think I leveraged all the tooling from DataHub ingestion. What would you recommend? Should I polish it as much as I can and then ask for a review, or ask sooner in case a big refactoring is needed? I don't want to waste your time too much, but I don't want to waste mine either! 😉
But they would have two different Azure URLs (
My question is how should the Iceberg source deal with this? How does it compare to AWS S3? How would it look for someone using a local filesystem?
type=self._converter._get_column_type(
    actual_schema.type,
    (
        getattr(actual_schema, "logical_type", None)
        or actual_schema.props.get("logicalType")
    ),
),