# troubleshoot
w
Hello channel!!!! I'm testing the new feature that shows Great Expectations validations on DataHub. I don't know if I'm executing something wrong (quite sure) or maybe I found an issue. When I tried to run it over my `batch_request` created from a SQL query, it showed a strange error:
```
ERROR: name 'MetadataSQLParser' is not defined
```
Debugging, I added the `parse_table_names_from_sql` parameter referenced in the documentation to discover why it is not sending the results to DataHub, but it seems that the provider class does not have this parameter and fails:
```
TypeError                                 Traceback (most recent call last)
~/F14/gitlab/great-expectations/.venv/lib/python3.7/site-packages/great_expectations/data_context/util.py in instantiate_class_from_config(config, runtime_environment, config_defaults)
    114     try:
--> 115         class_instance = class_(**config_with_defaults)
    116     except TypeError as e:

TypeError: __init__() got an unexpected keyword argument 'parse_table_names_from_sql'
```
And indeed, the installed `DataHubValidationAction` signature does not accept it:
```
class DataHubValidationAction(ValidationAction):
    def __init__(
        self,
        data_context: DataContext,
        server_url: str,
        env: str = builder.DEFAULT_ENV,
        platform_instance_map: Optional[Dict[str, str]] = None,
        graceful_exceptions: bool = True,
        token: Optional[str] = None,
        timeout_sec: Optional[float] = None,
        retry_status_codes: Optional[List[int]] = None,
        retry_max_times: Optional[int] = None,
        extra_headers: Optional[Dict[str, str]] = None,
    ):
        super().__init__(data_context)
        self.server_url = server_url
        self.env = env
        self.platform_instance_map = platform_instance_map
        self.graceful_exceptions = graceful_exceptions
        self.token = token
        self.timeout_sec = timeout_sec
        self.retry_status_codes = retry_status_codes
        self.retry_max_times = retry_max_times
        self.extra_headers = extra_headers
```
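For context, the failing setup is just this parameter added to the DataHub action in the checkpoint config. A minimal sketch of what I mean, assuming a configured GE project and with `my_checkpoint` and the `server_url` as placeholders; on plugin versions without the parameter, `instantiate_class_from_config` raises the `TypeError` above:
```
from great_expectations.data_context import DataContext

context = DataContext()  # assumes a configured great_expectations/ project

checkpoint_config = {
    "name": "my_checkpoint",          # placeholder name
    "config_version": 1.0,
    "class_name": "Checkpoint",
    "action_list": [
        {
            "name": "datahub_action",
            "action": {
                "module_name": "datahub.integrations.great_expectations.action",
                "class_name": "DataHubValidationAction",
                "server_url": "http://localhost:8080",  # placeholder GMS endpoint
                # the parameter that older plugin releases reject:
                "parse_table_names_from_sql": True,
            },
        },
    ],
}
context.add_checkpoint(**checkpoint_config)
```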
When I configure `graceful_exceptions` to false, the next error occurs:
```
ERROR: Error running action with name datahub_action
Traceback (most recent call last):
  File "/Users/guido/F14/gitlab/great-expectations/.venv/lib/python3.7/site-packages/great_expectations/validation_operators/validation_operators.py", line 452, in _run_actions
    checkpoint_identifier=checkpoint_identifier,
  File "/Users/guido/F14/gitlab/great-expectations/.venv/lib/python3.7/site-packages/great_expectations/checkpoint/actions.py", line 74, in run
    **kwargs,
  File "/Users/guido/F14/gitlab/great-expectations/.venv/lib/python3.7/site-packages/datahub/integrations/great_expectations/action.py", line 128, in _run
    datasets = self.get_dataset_partitions(batch_identifier, data_asset)
  File "/Users/guido/F14/gitlab/great-expectations/.venv/lib/python3.7/site-packages/datahub/integrations/great_expectations/action.py", line 613, in get_dataset_partitions
    tables = MetadataSQLSQLParser(query).get_tables()
  File "/Users/guido/F14/gitlab/great-expectations/.venv/lib/python3.7/site-packages/datahub/utilities/sql_parser.py", line 57, in __init__
    self._parser = MetadataSQLParser(sql_query)
NameError: name 'MetadataSQLParser' is not defined
```
Seems that it does now work with SQL queries on batch requests
When I switched to a "great_expectations" dataset, I could not manage to work out how to set the mapping so the validations go to one specific table. I have Redshift on dev, so the URN:
```
http://192.168.0.14:9002/dataset/urn:li:dataset:(urn:li:dataPlatform:redshift,database.schema.table,DEV)
platform_instance_map = { "redshift": "database.schema.table" }
```
I did not succeed in getting the validations to show up.
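For what it's worth, the URN shape above can be reproduced with DataHub's builder helper; a minimal sketch, with `database.schema.table` standing in for the real table path:
```
from datahub.emitter.mce_builder import make_dataset_urn

# "database.schema.table" is a placeholder for the real Redshift table path
urn = make_dataset_urn(platform="redshift", name="database.schema.table", env="DEV")
print(urn)  # urn:li:dataset:(urn:li:dataPlatform:redshift,database.schema.table,DEV)
```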
l
@hundreds-photographer-13496 ^
also @big-carpet-38439 ^
b
Looking into this.. Seems that we need to make sure a) the environment (DEV / PROD) can be easily configured, and b) the dependencies for SQL parsing are all present when you pip install the great expectations plugin... Strange that this dependency issue is occurring. Platform instance is used only if you've set a platform instance in your Redshift ingestion script! If you did not, you can leave this blank. @hundreds-photographer-13496 and I will get back to you soon on this. Have you tried to ingest assertions that are not freeform-SQL based?
"Seems that do now work with sql queries on batch requests" -> Does this mean you were at least able to ingest results based on SQL query batch request?
w
Sorry for my misspelling: now = not, a typo while writing!
For this one, "b) the dependencies for SQL parsing are all present when you pip install the great expectations plugin", I can reinstall the environment.
I could run without the error with the SQL query on the batch request, but the results are still not appearing. I reinstalled the environment from scratch (without version numbers in the requirements).
Do you need me to try something for "a) the environment (DEV / PROD) can be easily configured"? I tried to send to the PROD environment, but nothing was created (I do not have a PROD environment created yet). When I run an Airflow task against a "ghost" table, it creates it blank and creates lineage over a dummy table. That behaviour is not repeated here (I do not know if this is useful information).
I found "my" problem debugging the code. The difference is here, in this function:
```
def get_platform_from_sqlalchemy_uri(sqlalchemy_uri: str) -> str:
    if sqlalchemy_uri.startswith("bigquery"):
        return "bigquery"
    if sqlalchemy_uri.startswith("clickhouse"):
        return "clickhouse"
    if sqlalchemy_uri.startswith("druid"):
        return "druid"
    if sqlalchemy_uri.startswith("mssql"):
        return "mssql"
    if (
        sqlalchemy_uri.startswith("jdbc:postgres:")
        and sqlalchemy_uri.index("redshift.amazonaws") > 0
    ) or sqlalchemy_uri.startswith("redshift"):
        return "redshift"
    if sqlalchemy_uri.startswith("snowflake"):
        return "snowflake"
    if sqlalchemy_uri.startswith("presto"):
        return "presto"
    if sqlalchemy_uri.startswith("postgresql"):
        return "redshift"
    if sqlalchemy_uri.startswith("pinot"):
        return "pinot"
    if sqlalchemy_uri.startswith("oracle"):
        return "oracle"
    if sqlalchemy_uri.startswith("mysql"):
        return "mysql"
    if sqlalchemy_uri.startswith("mongodb"):
        return "mongodb"
    if sqlalchemy_uri.startswith("hive"):
        return "hive"
    if sqlalchemy_uri.startswith("awsathena"):
        return "athena"
    return "external"
It is this function, which returns the platform, that parses the `sqlalchemy_uri`. I created a custom logger inside `get_dataset_partitions(self, batch_identifier, data_asset)` and printed the URI just before it calls the next function (line 627 in action.py):
```
dataset_urn = make_dataset_urn_from_sqlalchemy_uri(
    sqlalchemy_uri,
    None,
    table,
    self.env,
    self.get_platform_instance(
        data_asset.active_batch_definition.datasource_name
    ),
)
```
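The custom logger was nothing fancy; a sketch of what I dropped in (the logger name is arbitrary, and `sqlalchemy_uri` is the local variable inside `get_dataset_partitions`):
```
import logging

logger = logging.getLogger("datahub_action_debug")  # arbitrary name
logging.basicConfig(level=logging.INFO)

# placed just before the make_dataset_urn_from_sqlalchemy_uri call:
logger.info("sqlalchemy_uri: %s", sqlalchemy_uri)
```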
The URI started with `postgresql+psycopg2://`, so the function returned the platform as `postgres` and not `redshift`. I tried changing the return value to `redshift`, and the info was emitted OK. The question here is: is this a problem with my library, the one that makes the query to Redshift?
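My hardcoded `return "redshift"` was only a probe, of course. A less blunt fix might disambiguate on the cluster hostname instead; a sketch of that idea with hypothetical URIs (I don't know whether the upstream fix takes this exact approach):
```
def platform_for_postgresql_uri(sqlalchemy_uri: str) -> str:
    # Redshift is reached through the postgresql+psycopg2 dialect, so the
    # scheme alone is ambiguous; fall back to the cluster hostname.
    if "redshift.amazonaws" in sqlalchemy_uri:
        return "redshift"
    return "postgres"

print(platform_for_postgresql_uri(
    "postgresql+psycopg2://user:pw@examplecluster.abc123.us-east-1.redshift.amazonaws.com:5439/dev"
))  # redshift
print(platform_for_postgresql_uri("postgresql+psycopg2://user:pw@localhost:5432/dev"))  # postgres
```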
Hello @big-carpet-38439 … I explained the problem I found and solved (at least in my environment) here ☝️
b
thank you!!!
w
Just to bring some additional info: I understand that what I did was only a temporary hack to uncover the bug, not a permanent fix. I don't know of other "redshift" drivers besides psycopg2 (I am working locally in a macOS environment). Please let me know if I can provide any additional info.
h
Hi @wooden-football-7175, thank you for sharing the details on Redshift URIs getting their platform extracted as postgres. I believe this PR should fix the issue: https://github.com/datahub-project/datahub/pull/4421. The fix for the dependency import error `ERROR: name 'MetadataSQLParser' is not defined` and the new config param `parse_table_names_from_sql` (False by default) were added quite recently (PR) and are not released yet. They should be available in the next release.
w
Hello @hundreds-photographer-13496. I could add the `parse_table_names_from_sql` prop to the checkpoint config and execute correctly with that. I guess the `ERROR: name 'MetadataSQLParser' is not defined` error was generated by not being on the latest version of GE and its dependencies. 😃 Excellent PR 😃 🚀 Glad to help
b
Thank you, Guido! And Mayuri, for the follow-up & resolution!!