# ingestion
p
Latest version 0.8.26.7 seems to not set the project_id correctly in Great Expectations when running profiling for BQ tables. Logs in thread:
```
[2022-02-22, 16:21:22 UTC] {pod_launcher.py:149} INFO - Traceback (most recent call last):
[2022-02-22, 16:21:22 UTC] {pod_launcher.py:149} INFO -   File "/usr/local/lib/python3.7/site-packages/datahub/ingestion/source/ge_data_profiler.py", line 813, in _generate_single_profile
[2022-02-22, 16:21:22 UTC] {pod_launcher.py:149} INFO -     pretty_name=pretty_name,
[2022-02-22, 16:21:22 UTC] {pod_launcher.py:149} INFO -   File "/usr/local/lib/python3.7/site-packages/datahub/ingestion/source/ge_data_profiler.py", line 869, in _get_ge_dataset
[2022-02-22, 16:21:22 UTC] {pod_launcher.py:149} INFO -     **batch_kwargs,
[2022-02-22, 16:21:22 UTC] {pod_launcher.py:149} INFO -   File "/usr/local/lib/python3.7/site-packages/great_expectations/data_context/data_context.py", line 1645, in get_batch
[2022-02-22, 16:21:22 UTC] {pod_launcher.py:149} INFO -     batch_parameters=batch_parameters,
[2022-02-22, 16:21:22 UTC] {pod_launcher.py:149} INFO -   File "/usr/local/lib/python3.7/site-packages/great_expectations/data_context/data_context.py", line 1348, in _get_batch_v2
[2022-02-22, 16:21:22 UTC] {pod_launcher.py:149} INFO -     return validator.get_dataset()
[2022-02-22, 16:21:22 UTC] {pod_launcher.py:149} INFO -   File "/usr/local/lib/python3.7/site-packages/great_expectations/validator/validator.py", line 1942, in get_dataset
[2022-02-22, 16:21:22 UTC] {pod_launcher.py:149} INFO -     **self.batch.batch_kwargs.get("dataset_options", {}),
[2022-02-22, 16:21:22 UTC] {pod_launcher.py:149} INFO -   File "/usr/local/lib/python3.7/site-packages/great_expectations/dataset/sqlalchemy_dataset.py", line 641, in __init__
[2022-02-22, 16:21:22 UTC] {pod_launcher.py:149} INFO -     "No BigQuery dataset specified. Use bigquery_temp_table batch_kwarg or a specify a "
[2022-02-22, 16:21:22 UTC] {pod_launcher.py:149} INFO - ValueError: No BigQuery dataset specified. Use bigquery_temp_table batch_kwarg or a specify a default dataset in engine url
[2022-02-22, 16:21:22 UTC] {pod_launcher.py:149} INFO - Limit and offset parameters are ignored when using query-based batch_kwargs; consider adding limit and offset directly to the generated query.
[2022-02-22, 16:21:22 UTC] {pod_launcher.py:149} INFO - Encountered exception while profiling discord-data-analytics-prd.dem.sketch_guild_num_users_with_voice_or_video_call_instance
[2022-02-22, 16:21:22 UTC] {pod_launcher.py:149} INFO - Traceback (most recent call last):
[2022-02-22, 16:21:22 UTC] {pod_launcher.py:149} INFO -   File "/usr/local/lib/python3.7/site-packages/datahub/ingestion/source/ge_data_profiler.py", line 813, in _generate_single_profile
[2022-02-22, 16:21:22 UTC] {pod_launcher.py:149} INFO -     pretty_name=pretty_name,
[2022-02-22, 16:21:22 UTC] {pod_launcher.py:149} INFO -   File "/usr/local/lib/python3.7/site-packages/datahub/ingestion/source/ge_data_profiler.py", line 869, in _get_ge_dataset
[2022-02-22, 16:21:22 UTC] {pod_launcher.py:149} INFO -     **batch_kwargs,
[2022-02-22, 16:21:22 UTC] {pod_launcher.py:149} INFO -   File "/usr/local/lib/python3.7/site-packages/great_expectations/data_context/data_context.py", line 1645, in get_batch
[2022-02-22, 16:21:22 UTC] {pod_launcher.py:149} INFO -     batch_parameters=batch_parameters,
[2022-02-22, 16:21:22 UTC] {pod_launcher.py:149} INFO -   File "/usr/local/lib/python3.7/site-packages/great_expectations/data_context/data_context.py", line 1348, in _get_batch_v2
[2022-02-22, 16:21:22 UTC] {pod_launcher.py:149} INFO -     return validator.get_dataset()
[2022-02-22, 16:21:22 UTC] {pod_launcher.py:149} INFO -   File "/usr/local/lib/python3.7/site-packages/great_expectations/validator/validator.py", line 1942, in get_dataset
[2022-02-22, 16:21:22 UTC] {pod_launcher.py:149} INFO -     **self.batch.batch_kwargs.get("dataset_options", {}),
[2022-02-22, 16:21:22 UTC] {pod_launcher.py:149} INFO -   File "/usr/local/lib/python3.7/site-packages/great_expectations/dataset/sqlalchemy_dataset.py", line 641, in __init__
[2022-02-22, 16:21:22 UTC] {pod_launcher.py:149} INFO -     "No BigQuery dataset specified. Use bigquery_temp_table batch_kwarg or a specify a "
[2022-02-22, 16:21:22 UTC] {pod_launcher.py:149} INFO - ValueError: No BigQuery dataset specified. Use bigquery_temp_table batch_kwarg or a specify a default dataset in engine url
```
d
Could you set the `bigquery_temp_table_schema` property in the config? To be able to profile partitioned datasets, Great Expectations (the framework we use for profiling under the hood) needs to create temporary tables, and for that we need a schema where we can create them. These tables are purged at the end. https://datahubproject.io/docs/metadata-ingestion/source_docs/bigquery#profiling
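For reference, a minimal recipe sketch showing where that property goes. The project and dataset names are placeholders, not taken from this thread, and the exact option layout should be checked against the linked docs for your DataHub version:

```yaml
# Sketch of a DataHub BigQuery ingestion recipe with profiling enabled.
# "my-gcp-project" and "profiling_tmp" are illustrative placeholders.
source:
  type: bigquery
  config:
    project_id: my-gcp-project
    # Schema (BigQuery dataset) where Great Expectations may create
    # its temporary tables while profiling partitioned tables:
    bigquery_temp_table_schema: my-gcp-project.profiling_tmp
    profiling:
      enabled: true

sink:
  type: datahub-rest
  config:
    server: http://localhost:8080
```

The service account running ingestion would also need permission to create and drop tables in that dataset.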
p
Sure, but I didn't have to do this before 0.8.26.7. So what exactly changed for this to be a requirement?
21 days ago, it seems: https://github.com/linkedin/datahub/commit/928ab74f33ec300a5ce34eae9cd74b6aaaca5aed Wondering if there's a better way we can communicate changes like this that will break existing configs.
d
Well, the thing is that earlier we did not support profiling partitioned/sharded data. From now on we profile the latest partition automatically, but for that we need this.
But you are right, we should have communicated this better. Sorry about that.
p
follow up: is there a way to disable this feature?
and no need to apologize! our fault too for not keeping on track of things
d
Currently not, but I'm creating a PR to make it optional.
p
Hey @dazzling-judge-80093! Just wondering what the status is for the above change
d
Hey, I have the PR ready to merge -> https://github.com/linkedin/datahub/pull/4228 Hopefully it will be merged and released soon.
In the meantime the PR was merged, and it will ship with the next Python release.
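Once that release lands, the opt-out would presumably live under the profiling config, something like the sketch below. The flag name `partition_profiling_enabled` is an assumption based on the PR's intent, not confirmed anywhere in this thread, so verify it against the released docs:

```yaml
# Hypothetical opt-out sketch; flag name is assumed, not confirmed.
source:
  type: bigquery
  config:
    project_id: my-gcp-project   # placeholder
    profiling:
      enabled: true
      # Assumed flag from PR #4228: skip automatic profiling of the
      # latest partition, avoiding the temp-table requirement:
      partition_profiling_enabled: false
```

With partition profiling disabled, `bigquery_temp_table_schema` should no longer be required, matching the pre-0.8.26.7 behavior.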
p
that's awesome, thanks for the update