# ingestion
p
Latest version 0.8.26.7 seems to not set the project_id correctly in Great Expectations when running profiling for BQ tables. Logs in thread:
```
[2022-02-22, 16:21:22 UTC] {pod_launcher.py:149} INFO - Traceback (most recent call last):
[2022-02-22, 16:21:22 UTC] {pod_launcher.py:149} INFO -   File "/usr/local/lib/python3.7/site-packages/datahub/ingestion/source/ge_data_profiler.py", line 813, in _generate_single_profile
[2022-02-22, 16:21:22 UTC] {pod_launcher.py:149} INFO -     pretty_name=pretty_name,
[2022-02-22, 16:21:22 UTC] {pod_launcher.py:149} INFO -   File "/usr/local/lib/python3.7/site-packages/datahub/ingestion/source/ge_data_profiler.py", line 869, in _get_ge_dataset
[2022-02-22, 16:21:22 UTC] {pod_launcher.py:149} INFO -     **batch_kwargs,
[2022-02-22, 16:21:22 UTC] {pod_launcher.py:149} INFO -   File "/usr/local/lib/python3.7/site-packages/great_expectations/data_context/data_context.py", line 1645, in get_batch
[2022-02-22, 16:21:22 UTC] {pod_launcher.py:149} INFO -     batch_parameters=batch_parameters,
[2022-02-22, 16:21:22 UTC] {pod_launcher.py:149} INFO -   File "/usr/local/lib/python3.7/site-packages/great_expectations/data_context/data_context.py", line 1348, in _get_batch_v2
[2022-02-22, 16:21:22 UTC] {pod_launcher.py:149} INFO -     return validator.get_dataset()
[2022-02-22, 16:21:22 UTC] {pod_launcher.py:149} INFO -   File "/usr/local/lib/python3.7/site-packages/great_expectations/validator/validator.py", line 1942, in get_dataset
[2022-02-22, 16:21:22 UTC] {pod_launcher.py:149} INFO -     **self.batch.batch_kwargs.get("dataset_options", {}),
[2022-02-22, 16:21:22 UTC] {pod_launcher.py:149} INFO -   File "/usr/local/lib/python3.7/site-packages/great_expectations/dataset/sqlalchemy_dataset.py", line 641, in __init__
[2022-02-22, 16:21:22 UTC] {pod_launcher.py:149} INFO -     "No BigQuery dataset specified. Use bigquery_temp_table batch_kwarg or a specify a "
[2022-02-22, 16:21:22 UTC] {pod_launcher.py:149} INFO - ValueError: No BigQuery dataset specified. Use bigquery_temp_table batch_kwarg or a specify a default dataset in engine url
[2022-02-22, 16:21:22 UTC] {pod_launcher.py:149} INFO - Limit and offset parameters are ignored when using query-based batch_kwargs; consider adding limit and offset directly to the generated query.
[2022-02-22, 16:21:22 UTC] {pod_launcher.py:149} INFO - Encountered exception while profiling discord-data-analytics-prd.dem.sketch_guild_num_users_with_voice_or_video_call_instance
[2022-02-22, 16:21:22 UTC] {pod_launcher.py:149} INFO - Traceback (most recent call last):
[2022-02-22, 16:21:22 UTC] {pod_launcher.py:149} INFO -   File "/usr/local/lib/python3.7/site-packages/datahub/ingestion/source/ge_data_profiler.py", line 813, in _generate_single_profile
[2022-02-22, 16:21:22 UTC] {pod_launcher.py:149} INFO -     pretty_name=pretty_name,
[2022-02-22, 16:21:22 UTC] {pod_launcher.py:149} INFO -   File "/usr/local/lib/python3.7/site-packages/datahub/ingestion/source/ge_data_profiler.py", line 869, in _get_ge_dataset
[2022-02-22, 16:21:22 UTC] {pod_launcher.py:149} INFO -     **batch_kwargs,
[2022-02-22, 16:21:22 UTC] {pod_launcher.py:149} INFO -   File "/usr/local/lib/python3.7/site-packages/great_expectations/data_context/data_context.py", line 1645, in get_batch
[2022-02-22, 16:21:22 UTC] {pod_launcher.py:149} INFO -     batch_parameters=batch_parameters,
[2022-02-22, 16:21:22 UTC] {pod_launcher.py:149} INFO -   File "/usr/local/lib/python3.7/site-packages/great_expectations/data_context/data_context.py", line 1348, in _get_batch_v2
[2022-02-22, 16:21:22 UTC] {pod_launcher.py:149} INFO -     return validator.get_dataset()
[2022-02-22, 16:21:22 UTC] {pod_launcher.py:149} INFO -   File "/usr/local/lib/python3.7/site-packages/great_expectations/validator/validator.py", line 1942, in get_dataset
[2022-02-22, 16:21:22 UTC] {pod_launcher.py:149} INFO -     **self.batch.batch_kwargs.get("dataset_options", {}),
[2022-02-22, 16:21:22 UTC] {pod_launcher.py:149} INFO -   File "/usr/local/lib/python3.7/site-packages/great_expectations/dataset/sqlalchemy_dataset.py", line 641, in __init__
[2022-02-22, 16:21:22 UTC] {pod_launcher.py:149} INFO -     "No BigQuery dataset specified. Use bigquery_temp_table batch_kwarg or a specify a "
[2022-02-22, 16:21:22 UTC] {pod_launcher.py:149} INFO - ValueError: No BigQuery dataset specified. Use bigquery_temp_table batch_kwarg or a specify a default dataset in engine url
```
d
Could you set the `bigquery_temp_table_schema` property in the config? To be able to profile partitioned datasets, Great Expectations (the framework we use for profiling under the hood) needs to create temporary tables, and for that we need a schema where we can create them. These tables are purged at the end. https://datahubproject.io/docs/metadata-ingestion/source_docs/bigquery#profiling
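For reference, a minimal recipe sketch showing where that property goes. The project and dataset names are placeholders, not taken from this thread, and the exact option layout should be checked against the linked docs for your DataHub version:

```yaml
# Sketch of a DataHub BigQuery ingestion recipe with profiling enabled.
# "my-gcp-project" and "profiling_tmp" are illustrative placeholders.
source:
  type: bigquery
  config:
    project_id: my-gcp-project
    # Schema (BigQuery dataset) where Great Expectations may create
    # its temporary tables while profiling partitioned tables:
    bigquery_temp_table_schema: my-gcp-project.profiling_tmp
    profiling:
      enabled: true

sink:
  type: datahub-rest
  config:
    server: http://localhost:8080
```

The service account running ingestion would also need permission to create and drop tables in that dataset.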
p
Sure, but I didn't have to do this before 0.8.26.7. So what exactly changed for this to be a requirement?
21 days ago, it seems: https://github.com/linkedin/datahub/commit/928ab74f33ec300a5ce34eae9cd74b6aaaca5aed Wondering if there's a better way we can communicate changes like this that will break existing configs.
d
Well, the thing is that earlier we did not support profiling partitioned/sharded data. From now on we profile the latest partition automatically, but for that we need this.
But you are right, we should have communicated this better. Sorry about that.
p
follow up: is there a way to disable this feature?
and no need to apologize! our fault too for not keeping on track of things
d
Currently not, but I'm creating a PR to make it optional.
p
Hey @dazzling-judge-80093! Just wondering what the status is for the above change
d
Hey, I have the PR ready to merge -> https://github.com/linkedin/datahub/pull/4228 Hopefully it will be merged and released soon.
In the meantime the PR was merged, and it will ship with the next Python release.
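Once that release lands, the opt-out would presumably live under the profiling config, something like the sketch below. The flag name `partition_profiling_enabled` is an assumption based on the PR's intent, not confirmed anywhere in this thread, so verify it against the released docs:

```yaml
# Hypothetical opt-out sketch; flag name is assumed, not confirmed.
source:
  type: bigquery
  config:
    project_id: my-gcp-project   # placeholder
    profiling:
      enabled: true
      # Assumed flag from PR #4228: skip automatic profiling of the
      # latest partition, avoiding the temp-table requirement:
      partition_profiling_enabled: false
```

With partition profiling disabled, `bigquery_temp_table_schema` should no longer be required, matching the pre-0.8.26.7 behavior.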
p
that's awesome, thanks for the update