orange-flag-48535 (10/13/2022, 4:56 AM)
alert-fall-82501 (10/13/2022, 6:52 AM)
full-chef-85630 (10/13/2022, 2:04 PM)
flaky-soccer-57765 (10/13/2022, 3:52 PM)
brave-businessperson-3969 (10/13/2022, 4:12 PM)
microscopic-mechanic-13766 (10/13/2022, 4:49 PM)
best-eve-12546 (10/13/2022, 5:22 PM)
millions-waiter-49836 (10/13/2022, 8:54 PM)
white-xylophone-3944 (10/13/2022, 8:38 AM)
wooden-jackal-88380 (10/13/2022, 2:55 PM)
alert-fall-82501 (10/14/2022, 12:05 PM)
lemon-hydrogen-83671 (10/14/2022, 4:40 PM)
Sink (***-kafka) report:
{'current_time': '2022-10-14 16:39:57.173070 (now).',
'failures': [],
'records_written_per_second': '2',
'start_time': '2022-10-14 16:35:33.365443 (4 minutes and 23.81 seconds ago).',
'total_duration_in_seconds': '263.81',
'total_records_written': '771',
'warnings': []}
I’m using Kafka as my sink, so I’m a bit surprised it’s so slow.
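For reference, the report's own numbers imply roughly 3 records per second; a quick check in Python (values copied from the sink report above):

# Throughput implied by the sink report above.
total_records = 771
total_seconds = 263.81
print(f"{total_records / total_seconds:.2f} records/second")  # ~2.92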
ripe-apple-36185 (10/14/2022, 6:24 PM)
.csv. The problem I am seeing is that the Snowflake plugin will create lineage to the files, but uses _csv in the URN.
The problem this creates is that I have two datasets for the same file and broken lineage.
Has anyone else seen this?
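To make the duplication concrete, a minimal sketch (the bucket and file path are hypothetical, not from the message above) showing how the same file surfaces under two URNs that differ only in the ".csv" vs "_csv" suffix:

from datahub.emitter.mce_builder import make_dataset_urn

# Hypothetical S3 path: the S3 source keeps ".csv", while the Snowflake
# lineage described above uses "_csv", yielding two distinct datasets.
print(make_dataset_urn(platform="s3", name="my-bucket/data/orders.csv", env="PROD"))
print(make_dataset_urn(platform="s3", name="my-bucket/data/orders_csv", env="PROD"))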
chilly-scientist-91160 (10/14/2022, 6:44 PM)
salmon-jackal-36326 (10/14/2022, 6:56 PM)
alert-fall-82501 (10/15/2022, 4:10 PM)
Can anybody suggest on this? ... Ingesting metadata from BigQuery to a private DataHub server fails on host ip-10-231-6-97.ec2.internal:
*** Reading remote log from Cloudwatch log_group: airflow-dt-airflow-prod-Task log_stream: datahub_bigquery_ingest/mp5_ingest/2022-10-14T06_00_00+00_00/1.log.
[2022-10-15 06:00:20,042] {{taskinstance.py:1035}} INFO - Dependencies all met for <TaskInstance: datahub_bigquery_ingest.mp5_ingest scheduled__2022-10-14T06:00:00+00:00 [queued]>
[2022-10-15 06:00:20,123] {{taskinstance.py:1035}} INFO - Dependencies all met for <TaskInstance: datahub_bigquery_ingest.mp5_ingest scheduled__2022-10-14T06:00:00+00:00 [queued]>
[2022-10-15 06:00:20,123] {{taskinstance.py:1241}} INFO -
--------------------------------------------------------------------------------
[2022-10-15 06:00:20,124] {{taskinstance.py:1242}} INFO - Starting attempt 1 of 2
[2022-10-15 06:00:20,124] {{taskinstance.py:1243}} INFO -
--------------------------------------------------------------------------------
[2022-10-15 06:00:20,213] {{taskinstance.py:1262}} INFO - Executing <Task(BashOperator): mp5_ingest> on 2022-10-14 06:00:00+00:00
[2022-10-15 06:00:20,224] {{standard_task_runner.py:52}} INFO - Started process 630 to run task
[2022-10-15 06:00:20,297] {{standard_task_runner.py:76}} INFO - Running: ['airflow', 'tasks', 'run', 'datahub_bigquery_ingest', 'mp5_ingest', 'scheduled__2022-10-14T06:00:00+00:00', '--job-id', '47734', '--raw', '--subdir', 'DAGS_FOLDER/dt_datahub/pipelines/bigquery_metadata_dag.pay.py', '--cfg-path', '/tmp/tmpqhkznspm', '--error-file', '/tmp/tmpmd2axjcd']
[2022-10-15 06:00:20,298] {{standard_task_runner.py:77}} INFO - Job 47734: Subtask mp5_ingest
[2022-10-15 06:00:20,528] {{logging_mixin.py:109}} INFO - Running <TaskInstance: datahub_bigquery_ingest.mp5_ingest scheduled__2022-10-14T06:00:00+00:00 [running]> on host ip-10-231-6-97.ec2.internal
[2022-10-15 06:00:21,041] {{taskinstance.py:1429}} INFO - Exporting the following env vars:
AIRFLOW_CTX_DAG_EMAIL=data-engineering@xxxx.com
AIRFLOW_CTX_DAG_OWNER=data-engineering
AIRFLOW_CTX_DAG_ID=datahub_bigquery_ingest
AIRFLOW_CTX_TASK_ID=mp5_ingest
AIRFLOW_CTX_EXECUTION_DATE=2022-10-14T06:00:00+00:00
AIRFLOW_CTX_DAG_RUN_ID=scheduled__2022-10-14T06:00:00+00:00
[2022-10-15 06:00:21,084] {{subprocess.py:62}} INFO - Tmp dir root location:
/tmp
[2022-10-15 06:00:21,085] {{subprocess.py:74}} INFO - Running command: ['bash', '-c', 'python3 -m datahub ingest -c /usr/local/airflow/dags/dt_datahub/recipes/prod/bigquery/mp5.yaml']
[2022-10-15 06:00:21,102] {{subprocess.py:85}} INFO - Output:
[2022-10-15 06:00:26,521] {{subprocess.py:89}} INFO - [2022-10-15 06:00:26,521] INFO {datahub.cli.ingest_cli:179} - DataHub CLI version: 0.8.44
[2022-10-15 06:06:40,084] {{subprocess.py:89}} INFO - [2022-10-15 06:06:40,032] ERROR {datahub.entrypoints:192} -
[2022-10-15 06:06:40,084] {{subprocess.py:89}} INFO - Traceback (most recent call last):
[2022-10-15 06:06:40,084] {{subprocess.py:89}} INFO - File "/usr/local/airflow/.local/lib/python3.7/site-packages/urllib3/connection.py", line 175, in _new_conn
[2022-10-15 06:06:40,084] {{subprocess.py:89}} INFO - (self._dns_host, self.port), self.timeout, **extra_kw
[2022-10-15 06:06:40,084] {{subprocess.py:89}} INFO - File "/usr/local/airflow/.local/lib/python3.7/site-packages/urllib3/util/connection.py", line 95, in create_connection
[2022-10-15 06:06:40,084] {{subprocess.py:89}} INFO - raise err
[2022-10-15 06:06:40,085] {{subprocess.py:89}} INFO - File "/usr/local/airflow/.local/lib/python3.7/site-packages/urllib3/util/connection.py", line 85, in create_connection
[2022-10-15 06:06:40,085] {{subprocess.py:89}} INFO - sock.connect(sa)
[2022-10-15 06:06:40,085] {{subprocess.py:89}} INFO - socket.timeout: timed out
[2022-10-15 06:06:40,086] {{warnings.py:110}} WARNING - /usr/local/airflow/.local/lib/python3.7/site-packages/watchtower/__init__.py:349: WatchtowerWarning: Received empty message. Empty messages cannot be sent to CloudWatch Logs
warnings.warn("Received empty message. Empty messages cannot be sent to CloudWatch Logs", WatchtowerWarning)
[2022-10-15 06:06:40,086] {{logging_mixin.py:109}} WARNING - Traceback (most recent call last):
[2022-10-15 06:06:40,086] {{logging_mixin.py:109}} WARNING - File "/usr/local/airflow/config/cloudwatch_logging.py", line 161, in emit
self.sniff_errors(record)
[2022-10-15 06:06:40,087] {{logging_mixin.py:109}} WARNING - File "/usr/local/airflow/config/cloudwatch_logging.py", line 211, in sniff_errors
if pattern.search(record.message):
[2022-10-15 06:06:40,087] {{logging_mixin.py:109}} WARNING - AttributeError: 'LogRecord' object has no attribute 'message'
[2022-10-15 06:06:40,087] {{subprocess.py:89}} INFO - During handling of the above exception, another exception occurred:
[2022-10-15 06:06:40,087] {{subprocess.py:89}} INFO - Traceback (most recent call last):
[2022-10-15 06:06:40,087] {{subprocess.py:89}} INFO - File "/usr/local/airflow/.local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 710, in urlopen
[2022-10-15 06:06:40,087] {{subprocess.py:89}} INFO - chunked=chunked,
[2022-10-15 06:06:40,088] {{subprocess.py:89}} INFO - File "/usr/local/airflow/.local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 386, in _make_request
[2022-10-15 06:06:40,088] {{subprocess.py:89}} INFO - self._validate_conn(conn)
[2022-10-15 06:06:40,088] {{subprocess.py:89}} INFO - File "/usr/local/airflow/.local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 1042, in _validate_conn
[2022-10-15 06:06:40,088] {{subprocess.py:89}} INFO - conn.connect()
[2022-10-15 06:06:40,088] {{subprocess.py:89}} INFO - File "/usr/local/airflow/.local/lib/python3.7/site-packages/urllib3/connection.py", line 358, in connect
[2022-10-15 06:06:40,088] {{subprocess.py:89}} INFO - self.sock = conn = self._new_conn()
[2022-10-15 06:06:40,088] {{subprocess.py:89}} INFO - File "/usr/local/airflow/.local/lib/python3.7/site-packages/urllib3/connection.py", line 182, in _new_conn
[2022-10-15 06:06:40,088] {{subprocess.py:89}} INFO - % (self.host, self.timeout),
[2022-10-15 06:06:40,088] {{subprocess.py:89}} INFO - urllib3.exceptions.ConnectTimeoutError: (<urllib3.connection.HTTPSConnection object at 0x7fbb89ffccd0>, 'Connection to datahub-gms.xxxxx.com timed out. (connect timeout=30)')
[2022-10-15 06:06:40,089] {{subprocess.py:89}} INFO - During handling of the above exception, another exception occurred:
[2022-10-15 06:06:40,089] {{subprocess.py:89}} INFO - Traceback (most recent call last):
[2022-10-15 06:06:40,089] {{subprocess.py:89}} INFO - File "/usr/local/airflow/.local/lib/python3.7/site-packages/requests/adapters.py", line 499, in send
[2022-10-15 06:06:40,089] {{subprocess.py:89}} INFO - timeout=timeout,
[2022-10-15 06:06:40,089] {{subprocess.py:89}} INFO - File "/usr/local/airflow/.local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 828, in urlopen
[2022-10-15 06:06:40,089] {{subprocess.py:89}} INFO - **response_kw
[2022-10-15 06:06:40,089] {{subprocess.py:89}} INFO - File "/usr/local/airflow/.local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 828, in urlopen
[2022-10-15 06:06:40,090] {{subprocess.py:89}} INFO - **response_kw
[2022-10-15 06:06:40,090] {{subprocess.py:89}} INFO - File "/usr/local/airflow/.local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 828, in urlopen
[2022-10-15 06:06:40,090] {{subprocess.py:89}} INFO - **response_kw
[2022-10-15 06:06:40,090] {{subprocess.py:89}} INFO - File "/usr/local/airflow/.local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 788, in urlopen
[2022-10-15 06:06:40,090] {{subprocess.py:89}} INFO - method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
[2022-10-15 06:06:40,090] {{subprocess.py:89}} INFO - File "/usr/local/airflow/.local/lib/python3.7/site-packages/urllib3/util/retry.py", line 592, in increment
[2022-10-15 06:06:40,090] {{subprocess.py:89}} INFO - raise MaxRetryError(_pool, url, error or ResponseError(cause))
[2022-10-15 06:06:40,090] {{subprocess.py:89}} INFO - urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='datahub-gms.xxxxx.com', port=8080): Max retries exceeded with url: /config (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7fbb89ffccd0>, 'Connection to datahub-gms.xxxxxx.com timed out. (connect timeout=30)'))
_record_initialization_failure
[2022-10-15 06:06:40,103] {{subprocess.py:89}} INFO - raise PipelineInitError(msg) from e
[2022-10-15 06:06:40,103] {{subprocess.py:89}} INFO - datahub.ingestion.run.pipeline.PipelineInitError: Failed to set up framework context
[2022-10-15 06:06:40,103] {{subprocess.py:89}} INFO - [2022-10-15 06:06:40,033] ERROR {datahub.entrypoints:196} - Command failed:
[2022-10-15 06:06:40,103] {{subprocess.py:89}} INFO - Failed to set up framework context due to
[2022-10-15 06:06:40,103] {{subprocess.py:89}} INFO - 'Failed to connect to DataHub' due to
[2022-10-15 06:06:40,103] {{subprocess.py:89}} INFO - 'HTTPSConnectionPool(host='datahub-gms.XXXXX.com', port=8080): Max retries exceeded with url: /config (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7fbb89ffccd0>, 'Connection to datahub-gms.XXXXX.com timed out. (connect timeout=30)'))'.
[2022-10-15 06:06:40,103] {{subprocess.py:89}} INFO - Run with --debug to get full stacktrace.
[2022-10-15 06:06:40,103] {{subprocess.py:89}} INFO - e.g. 'datahub --debug ingest -c /usr/local/airflow/dags/dt_datahub/recipes/prod/bigquery/mp5.yaml'
[2022-10-15 06:06:40,391] {{subprocess.py:93}} INFO - Command exited with return code 1
[2022-10-15 06:06:40,869] {{taskinstance.py:1703}} ERROR - Task failed with exception
result = execute_callable(context=context)
File "/usr/local/lib/python3.7/site-packages/airflow/operators/bash.py", line 188, in execute
f'Bash command failed. The command returned a non-zero exit code {result.exit_code}.'
airflow.exceptions.AirflowException: Bash command failed. The command returned a non-zero exit code 1.
[2022-10-15 06:06:40,947] {{local_task_job.py:154}} INFO - Task exited with return code 1
[2022-10-15 06:06:40,990] {{local_task_job.py:264}} INFO - 0 downstream tasks scheduled from follow-on schedule check
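The failure above is a plain connect timeout from the Airflow worker to GMS on port 8080 (note the traceback opens an HTTPSConnection to a port that usually serves plain HTTP, so a scheme mismatch or a VPC/security-group rule are both worth ruling out). A minimal reachability check to run from the same worker, assuming a placeholder GMS address:

import requests

# Placeholder host; substitute the gms server from the recipe (mp5.yaml).
GMS = "http://datahub-gms.example.com:8080"

# /config is the endpoint the CLI hits first (see the MaxRetryError above);
# a timeout here reproduces the failure outside of Airflow.
resp = requests.get(f"{GMS}/config", timeout=30)
print(resp.status_code)
print(resp.json())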
fierce-monkey-46092 (10/16/2022, 5:42 PM)
wooden-jackal-88380 (10/17/2022, 3:13 PM)
nutritious-finland-99092 (10/17/2022, 5:41 PM)
fields.append(
    SchemaFieldClass(
        fieldPath=column,
        type=SchemaFieldDataTypeClass(type=column_type_class),
        nativeDataType=f"{table_columns.get(column, {}).get('max_length', 0)}",
        nullable=True,
        description=table_columns.get(column, {}).get('description', ''),
        lastModified=AuditStampClass(
            time=int(round(datetime.timestamp(datetime.now()))),
            actor="urn:li:corpuser:carol",
        ),
    )
)
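For context, a minimal sketch of how a fields list built this way is typically attached to a dataset and emitted; the dataset name, platform, and GMS address below are illustrative assumptions, not taken from the snippet above:

from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import OtherSchemaClass, SchemaMetadataClass

# 'fields' is the list of SchemaFieldClass objects populated by the loop above.
schema = SchemaMetadataClass(
    schemaName="customer",                     # illustrative schema name
    platform="urn:li:dataPlatform:mssql",      # illustrative platform
    version=0,
    hash="",
    platformSchema=OtherSchemaClass(rawSchema=""),
    fields=fields,
)

emitter = DatahubRestEmitter(gms_server="http://localhost:8080")
emitter.emit(
    MetadataChangeProposalWrapper(
        entityUrn=make_dataset_urn(platform="mssql", name="dbo.customer", env="PROD"),
        aspect=schema,
    )
)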
able-evening-90828 (10/17/2022, 7:06 PM)
After upgrading to 0.9.0, the details of previous runs of UI ingestion were all lost. This includes the various columns and the execution logs ("DETAILS"). Is this a bug?
ambitious-magazine-36012 (10/18/2022, 12:26 AM)
silly-oil-35180 (10/18/2022, 2:43 AM)
10/18/2022, 2:43 AMQueries
Tab using GarphQL API.
I cannot find any mutation api to update Queries. updateDataset
exists, however
datasetupdateinput
doesn’t include Queries tab information(https://datahubproject.io/docs/graphql/inputObjects/#datasetupdateinput).
anyone who updated Queries to your custom query..?rhythmic-gold-76195
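For reference, a minimal sketch of calling the updateDataset mutation mentioned above against DataHub's /api/graphql endpoint; the URL, token, and dataset URN are placeholders, and, as noted, DatasetUpdateInput exposes no field for the Queries tab, so this can only touch things like editable properties:

import requests

# Placeholder endpoint and token.
URL = "http://localhost:9002/api/graphql"
HEADERS = {"Authorization": "Bearer <token>"}

# updateDataset is the mutation referenced above; its DatasetUpdateInput
# has no Queries-tab field, which is exactly the limitation described.
QUERY = """
mutation updateDataset($urn: String!, $input: DatasetUpdateInput!) {
  updateDataset(urn: $urn, input: $input) { urn }
}
"""
variables = {
    "urn": "urn:li:dataset:(urn:li:dataPlatform:hive,db.table,PROD)",  # placeholder
    "input": {"editableProperties": {"description": "updated via GraphQL"}},
}
resp = requests.post(URL, json={"query": QUERY, "variables": variables}, headers=HEADERS)
print(resp.json())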
rhythmic-gold-76195 (10/18/2022, 6:52 AM)
quick-pizza-8906 (10/18/2022, 1:43 PM)
prehistoric-fireman-61692 (10/18/2022, 2:29 PM)
1. Is a .yaml file the only way to ingest a large existing Business Glossary? What is the best method: via the Business Glossary UI plugin or via the CLI?
2. Which data sources are supported for the Stats and Queries tabs? Is there any way to extract/update the Stats and Queries metadata if it isn't natively supported through the ingestion integration?
miniature-plastic-43224 (10/18/2022, 4:12 PM)
sparse-planet-56664 (10/19/2022, 9:34 AM)
alert-fall-82501 (10/19/2022, 12:38 PM)
alert-fall-82501 (10/19/2022, 12:38 PM)
thankful-monitor-87245 (10/19/2022, 2:55 PM)