# troubleshoot
j
Hi all, please help me here. I am trying to profile a BigQuery table from an Airflow Composer DAG with the following yaml:
complete_json = {
        "source": {
            "type": "bigquery",
            "config": {
                "project_id": "",
                "credential": cred_json,
                "include_views": "true",
                "include_tables": "true",
                "include_table_lineage": "true",
                "upstream_lineage_in_report": "true",
                "schema_pattern": {
                    "ignoreCase": "true",
                    "allow": ["^webengage_mum$"]
                },
                "table_pattern": {
                    "ignoreCase": "true",
                    "deny": [r"^.*\.temp_.*"]  # raw string avoids the invalid-escape warning for \.
                },
                "profile_pattern": {
                    "allow": [r"^.*\.application.*"]
                },
                "stateful_ingestion": {
                    "enabled": "true",
                    "remove_stale_metadata": "true",
                    "state_provider": {
                        "type": "datahub",
                        "config": {
                            "datahub_api": {
                                "server": datahub_gms_url,
                                "token": datahub_gms_token
                            }
                        }
                    }
                },
                "profiling": {
                    "enabled": "true",
                    "bigquery_temp_table_schema": ".datahub",
                    "turn_off_expensive_profiling_metrics": "true",
                    "query_combiner_enabled": "false",
                    "max_number_of_fields_to_profile": 1000,
                    "profile_table_level_only": "true",
                    "include_field_null_count": "true",
                    "include_field_min_value": "true",
                    "include_field_max_value": "true",
                    "include_field_mean_value": "true",
                    "include_field_median_value": "true",
                    "include_field_stddev_value": "true",
                    "include_field_quantiles": "true",
                    "include_field_distinct_value_frequencies": "true",
                    "include_field_histogram": "true",
                    "include_field_sample_values": "true"
                }
            },
        },
        "pipeline_name": "biquery_profiling_tables",
        "sink": {
            "type": "datahub-kafka",
            "config": {
                "connection": {
                    "bootstrap": bootstrap_url,
                    "schema_registry_url": schema_registry_url,
                },
            },
        },
    }
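As a side note on the `table_pattern` and `profile_pattern` entries above: these are regular expressions matched against fully qualified table names, and in a Python dict literal the `\.` escape is safest written as a raw string. A minimal sketch of how the deny pattern behaves (the table names below are made up for illustration):

```python
import re

# Intended meaning of the deny pattern: skip any table whose name
# (after the dataset qualifier and a literal dot) starts with "temp_".
deny_pattern = r"^.*\.temp_.*"

def is_denied(fqn: str) -> bool:
    """Return True if the fully qualified table name matches the deny pattern."""
    return re.match(deny_pattern, fqn, flags=re.IGNORECASE) is not None

# "webengage_mum.temp_scratch" matches; "webengage_mum.application_events" does not.
```

This is only a sketch of the pattern semantics, not of DataHub's internal matching code.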
The job runs for some time and then fails with the following error:
[2022-10-26, 05:26:34 UTC] {ge_data_profiler.py:918} ERROR - Encountered exception while profiling <dataset>.<tableName>
Traceback (most recent call last):
  File "/opt/python3.8/lib/python3.8/site-packages/datahub/ingestion/source/ge_data_profiler.py", line 892, in _generate_single_profile
    batch = self._get_ge_dataset(
  File "/opt/python3.8/lib/python3.8/site-packages/datahub/ingestion/source/ge_data_profiler.py", line 951, in _get_ge_dataset
    batch = ge_context.data_context.get_batch(
  File "/opt/python3.8/lib/python3.8/site-packages/great_expectations/data_context/data_context/base_data_context.py", line 1642, in get_batch
    return self._get_batch_v2(
  File "/opt/python3.8/lib/python3.8/site-packages/great_expectations/data_context/data_context/base_data_context.py", line 1336, in _get_batch_v2
    datasource = self.get_datasource(batch_kwargs.get("datasource"))
  File "/opt/python3.8/lib/python3.8/site-packages/great_expectations/data_context/data_context/base_data_context.py", line 2062, in get_datasource
    raise ValueError(
ValueError: Unable to load datasource `my_sqlalchemy_datasource-548b19eb-6db0-4fa2-8673-0e62306a3c7d` -- no configuration found or invalid configuration.
[2022-10-26, 05:26:35 UTC] {ge_data_profiler.py:773} INFO - Profiling 1 table(s) finished in 2.387 seconds
Can someone help please?
@astonishing-answer-96712
a
Hi Prasoon, what is the filename of the yaml you posted? Is it being referenced in the error?
j
I did not use a separate yaml; I created a single DAG. Uploading it here:
Hi @astonishing-answer-96712, were you able to figure anything out here?
a
@gray-shoe-75895 any ideas here?
g
This looks like a bug in our profiler. Could you run `pip freeze` in your Airflow environment and let me know what it outputs?
j
Actually, we are running it on Google Composer, so running such a command won't be possible. Is there anything else I can provide?
g
Ah, in that case there should be logs from the `pip install` process somewhere which will contain the version numbers. I haven't used Google Cloud Composer myself so I don't know exactly what they're called, but I know that other people have been able to provide them before.
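One alternative when there is no shell access (this is a hedged sketch, not something suggested in the thread): a small task inside the DAG itself can report installed package versions via the stdlib `importlib.metadata`, which reads the same distribution metadata `pip freeze` does. The `prefixes` filter and function name below are illustrative choices:

```python
import importlib.metadata

def list_package_versions(prefixes: tuple = ("acryl", "datahub", "great")):
    """Return sorted name==version lines for installed distributions whose
    name starts with one of the given prefixes (like `pip freeze | grep`)."""
    lines = []
    for dist in importlib.metadata.distributions():
        name = (dist.metadata["Name"] or "").lower()
        if name.startswith(prefixes):
            lines.append(f"{name}=={dist.version}")
    return sorted(lines)

# In Composer, this could be called from a PythonOperator task so the
# versions show up in the task logs.
```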