Hey everyone, I am looking to partner with someone...
# contribute-code
Hey everyone, I'm looking to partner with someone to help me implement a new addition to the GraphQL API for the SQL profiling stats options: an option for distinct field sample values on low-cardinality fields. Below is some code I mocked up that I think would achieve this. I took the existing sample values code and modified it for my use case. I would also be happy to work with someone to convert it so it doesn't depend on Great Expectations. Anyway, I'm going to start diving into updating the GraphQL API following the instructions here: https://github.com/datahub-project/datahub/tree/master/datahub-graphql-core. I've never contributed to an open source project before, so I definitely need some help navigating the codebase. If anyone wants to help me build this out, I would love to learn. Thanks!

file: datahub/ingestion/source/ge_data_profiler.py (added at line 510)
    @_run_with_query_combiner
    def _get_dataset_column_distinct_values(
        self, column_profile: DatasetFieldProfileClass, column: str, unique_count: int, nonnull_count: int
    ) -> None:
        if not self.config.include_field_distinct_values or unique_count > 25:
            return

        try:
            # TODO do this without GE
            self.dataset.set_config_value("interactive_evaluation", True)
            
            # Check for distinct values in ever larger increments
            pct_dataset = [0.01, 0.05, 0.10, 0.25, 0.5, 1.0]

            for pct in pct_dataset:
                # Use a whole number of rows, and always sample at least one
                samples_to_check = max(1, int(nonnull_count * pct))

                res = self.dataset.expect_column_values_to_be_in_set(
                    column,
                    [],
                    result_format={
                        "result_format": "SUMMARY",
                        "partial_unexpected_count": samples_to_check,
                    },
                ).result

                # Deduplicate the sampled values
                distinct_values = set(res["partial_unexpected_list"])

                if len(distinct_values) == unique_count:
                    # Store each distinct value once rather than the raw sample
                    column_profile.distinctValues = sorted(
                        str(v) for v in distinct_values
                    )
                    # Exit loop if the distinct values are all captured
                    break

        except Exception as e:
            logger.debug(
                f"Caught exception while attempting to get distinct values for column {column}. {e}"
            )
            self.report.report_warning(
                "Profiling - Unable to get column distinct values",
                f"{self.dataset_name}.{column}",
            )
Just realized I don't need a new entity, so I'm following the instructions here instead: https://datahubproject.io/docs/metadata-modeling/extending-the-metadata-model/#step_3