# troubleshoot
h
Hi Team, I'm facing this issue when connecting to Redshift tables (I followed the prerequisites shared for Redshift here, https://datahubproject.io/docs/generated/ingestion/sources/redshift/#prerequisites-1, for the user we are using to connect to Redshift):
```
datahub.configuration.common.PipelineExecutionError: ('Source reported errors', RedshiftReport(workunits_produced=0, workunit_ids=[], warnings={}, failures={'version': ["Error: invalid literal for int() with base 10: 'redshift:'"]}, cli_version='0.8.41', cli_entry_location='/root/.venvs/airflow/lib/python3.7/site-packages/datahub/__init__.py', py_version='3.7.10 (default, Jun  3 2021, 00:02:01) \n[GCC 7.3.1 20180712 (Red Hat 7.3.1-13)]', py_exec_path='/root/.venvs/airflow/bin/python', os_details='Linux-4.14.287-215.504.amzn2.x86_64-x86_64-with-glibc2.2.5', tables_scanned=0, views_scanned=0, entities_profiled=0, filtered=[], soft_deleted_stale_entities=[], query_combiner=None, saas_version='', upstream_lineage={}))
```
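For context, the failure key is 'version', and the message is Python's standard int() parse error on the token 'redshift:'. It reproduces directly, which suggests the connector tried to parse a non-numeric token out of a version string:

```python
# The message in the report is Python's standard int() parse error;
# it reproduces exactly on the token 'redshift:'.
try:
    int("redshift:")
except ValueError as e:
    print(e)  # invalid literal for int() with base 10: 'redshift:'
```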
g
What does your recipe look like?
h
@gray-shoe-75895
```json
{
  "source": {
    "type": "redshift",
    "config": {
      "env": "PROD",
      "host_port": "host",
      "database": "dbname",
      "username": "username",
      "password": "pwd",
      "include_views": "True",
      "include_tables": "True"
    }
  },
  "transformers": [
    {
      "type": "set_dataset_browse_path",
      "config": {
        "path_templates": [
          "/Path/PLATFORM/DATASET_PARTS"
        ]
      }
    }
  ]
}
```
@gray-shoe-75895 - Does this seem OK?
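For reference, a recipe like this is typically executed from Python via DataHub's Pipeline API. A minimal sketch, reusing the placeholder values from the recipe above (the sink server is also a placeholder, taken from the log further down):

```python
# Minimal sketch of running the recipe above programmatically.
# All connection values are placeholders copied from the recipe.
from datahub.ingestion.run.pipeline import Pipeline

pipeline = Pipeline.create(
    {
        "source": {
            "type": "redshift",
            "config": {
                "env": "PROD",
                "host_port": "host",  # expected form is hostname:port
                "database": "dbname",
                "username": "username",
                "password": "pwd",
                "include_views": True,
                "include_tables": True,
            },
        },
        "sink": {"type": "datahub-rest", "config": {"server": "https://hostname"}},
    }
)
pipeline.run()
pipeline.raise_from_status()  # raises PipelineExecutionError if the source reported failures
```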
g
Are you running inside of Airflow?
This seems like a bug with our Redshift connector, but I can't confirm that until I can reproduce it.
h
@gray-shoe-75895 - Yes, running inside Airflow.
Could you let me know if this is a bug? Also, is there a way we can ingest Redshift while bypassing this issue?
g
Are there any additional logs / stack traces from that error? Additionally, could you try running without the transformer to see if that might be causing any issues? cc @dazzling-judge-80093
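One way to try the no-transformer suggestion, assuming the recipe is held as a Python dict like the one posted above (the name `recipe` here is hypothetical):

```python
# Sketch: re-run the same recipe with the transformers removed,
# to rule the transformer out as the cause.
from datahub.ingestion.run.pipeline import Pipeline

# `recipe` is assumed to be the dict posted earlier in the thread.
recipe_without_transformers = {k: v for k, v in recipe.items() if k != "transformers"}

pipeline = Pipeline.create(recipe_without_transformers)
pipeline.run()
pipeline.raise_from_status()
```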
h
@gray-shoe-75895 @dazzling-judge-80093 -
```
[2022-08-22, 20:15:34 UTC] {{pipeline.py:160}} INFO - Sink configured successfully. DataHubRestEmitter: configured to talk to https://hostname
[2022-08-22, 20:15:35 UTC] {{logging_mixin.py:115}} WARNING - /root/.venvs/airflow/lib64/python3.7/site-packages/snowflake/connector/options.py:99 UserWarning: You have an incompatible version of 'pyarrow' installed (9.0.0), please install a version that adheres to: 'pyarrow<8.1.0,>=8.0.0; extra == "pandas"'
[2022-08-22, 20:15:36 UTC] {{taskinstance.py:1889}} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/root/.venvs/airflow/lib/python3.7/site-packages/airflow/decorators/base.py", line 179, in execute
    return_value = super().execute(context)
  File "/root/.venvs/airflow/lib/python3.7/site-packages/airflow/operators/python.py", line 171, in execute
    return_value = self.execute_callable()
  File "/root/.venvs/airflow/lib/python3.7/site-packages/airflow/operators/python.py", line 189, in execute_callable
    return self.python_callable(*self.op_args, **self.op_kwargs)
  File "/root/airflow/dags/dagname.py", line 103, in ingest_metadata
    pipeline.raise_from_status()
  File "/root/.venvs/airflow/lib/python3.7/site-packages/datahub/ingestion/run/pipeline.py", line 349, in raise_from_status
    "Source reported errors", self.source.get_report()
datahub.configuration.common.PipelineExecutionError: ('Source reported errors', RedshiftReport(workunits_produced=0, workunit_ids=[], warnings={}, failures={'version': ["Error: invalid literal for int() with base 10: 'redshift:'"]}, cli_version='0.8.41', cli_entry_location='/root/.venvs/airflow/lib/python3.7/site-packages/datahub/__init__.py', py_version='3.7.10 (default, Jun  3 2021, 00:02:01) \n[GCC 7.3.1 20180712 (Red Hat 7.3.1-13)]', py_exec_path='/root/.venvs/airflow/bin/python', os_details='Linux-4.14.287-215.504.amzn2.x86_64-x86_64-with-glibc2.2.5', tables_scanned=0, views_scanned=0, entities_profiled=0, filtered=[], soft_deleted_stale_entities=[], query_combiner=None, saas_version='', upstream_lineage={}))
[2022-08-22, 20:15:36 UTC] {{taskinstance.py:1400}} INFO - Marking task as UP_FOR_RETRY. dag_id=dagname, task_id=ingest_metadata, execution_date=20220822T201515, start_date=20220822T201533, end_date=20220822T201536
[2022-08-22, 20:15:36 UTC] {{standard_task_runner.py:97}} ERROR - Failed to execute job 28834 for task ingest_metadata (('Source reported errors', RedshiftReport(workunits_produced=0, workunit_ids=[], warnings={}, failures={'version': ["Error: invalid literal for int() with base 10: 'redshift:'"]}, cli_version='0.8.41', cli_entry_location='/root/.venvs/airflow/lib/python3.7/site-packages/datahub/__init__.py', py_version='3.7.10 (default, Jun  3 2021, 00:02:01) \n[GCC 7.3.1 20180712 (Red Hat 7.3.1-13)]', py_exec_path='/root/.venvs/airflow/bin/python', os_details='Linux-4.14.287-215.504.amzn2.x86_64-x86_64-with-glibc2.2.5', tables_scanned=0, views_scanned=0, entities_profiled=0, filtered=[], soft_deleted_stale_entities=[], query_combiner=None, saas_version='', upstream_lineage={})); 18569)
[2022-08-22, 20:15:36 UTC] {{local_task_job.py:156}} INFO - Task exited with return code 1
```
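The traceback shows the DAG calls pipeline.raise_from_status() directly, so the report only surfaces inside the exception string. A sketch for logging the report before raising, using the same attributes that appear in the traceback and report above:

```python
# Sketch: log the source report before raising, so the failure details
# are visible in the Airflow task log even if the exception gets truncated.
pipeline.run()
report = pipeline.source.get_report()  # the RedshiftReport embedded in the error above
print(report.failures)  # {'version': ["Error: invalid literal for int() with base 10: 'redshift:'"]}
print(report.warnings)
pipeline.raise_from_status()
```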
g
The culprit lines in the source seem to be here: https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/source/sql/redshift.py#L567-L578. My current best guess is that it can't connect to your Redshift instance, likely due to the host_port field.
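If host_port is the suspect, a quick connectivity check outside DataHub can confirm it. A sketch with placeholder credentials; it uses SQLAlchemy with the redshift+psycopg2 dialect (the source file linked above is SQLAlchemy-based), and 5439 is Redshift's default port:

```python
# Sketch: verify Redshift connectivity independently of DataHub.
# All credentials are placeholders; requires the sqlalchemy-redshift dialect.
from sqlalchemy import create_engine, text

engine = create_engine("redshift+psycopg2://username:pwd@host:5439/dbname")
with engine.connect() as conn:
    # A healthy connection prints the cluster's version string, the same
    # string the connector's version check tries to parse.
    print(conn.execute(text("select version()")).scalar())
```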
h
@gray-shoe-75895 - let me look at the configurations a bit more closely and debug.