eager-monitor-4683
07/03/2023, 4:31 AM

future-yak-13169
07/03/2023, 6:35 AM

abundant-apartment-78179
07/03/2023, 9:22 AM

delightful-school-94725
07/03/2023, 11:26 AM

classification:
  enabled: true
  info_type_to_term:
    Email_Address: Email
  classifiers:
    - type: datahub
      config:
        confidence_level_threshold: 0.7
        info_types_config:
          Street_Address:
            prediction_factors_and_weights:
              name: 1
              description: 0
              datatype: 0
              values: 0
            name:
              regex:
                - Account_Territory
                - account_territory
            datatype:
              type:
                - str
            values:
              prediction_type: library
              regex: []
              library:
                - spacy
          Full_name:
            prediction_factors_and_weights:
              name: 1
              description: 0
              datatype: 0
              values: 0
            name:
              regex:
                - AccountName
                - accountname
            datatype:
              type:
                - str
            values:
              prediction_type: regex
              regex:
                - '^[a-zA-Z ]+.*'
              library: []
stocky-guitar-68560
07/03/2023, 2:25 PM

import datahub.emitter.mce_builder as builder
from datahub.emitter.rest_emitter import DatahubRestEmitter
lineage_mce = builder.make_lineage_mce(
[
builder.make_dataset_urn("kafka", "topic-A"), # Upstream
],
builder.make_dataset_urn("bigquery", "dataset-A"), # Downstream
)
emitter = DatahubRestEmitter("metabase-gms-endpoint")
emitter.emit_mce(lineage_mce)
The above code generates the lineage between Kafka topic-A and BigQuery dataset-A.
But if I run the same script with kafka topic-A and bigquery dataset-B, it actually creates another link between topic-A and dataset-B. Now there are two edges from topic-A, i.e. from topic-A to dataset-A and from topic-A to dataset-B.
I want to override the existing lineage and keep only the latest ingested lineage, i.e. topic-A to dataset-B.
Can someone help me with this?
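[Editor's note] A possible approach, as a sketch rather than a confirmed answer: lineage is stored as an upstreamLineage aspect on each downstream dataset, so emitting lineage for dataset-B never touches the aspect already sitting on dataset-A. Emitting the aspect directly lets you clear the stale edge and set the new one; the GMS endpoint URL below is an assumption.

import datahub.emitter.mce_builder as builder
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    DatasetLineageTypeClass,
    UpstreamClass,
    UpstreamLineageClass,
)

emitter = DatahubRestEmitter("http://localhost:8080")  # assumed GMS endpoint

# Clear the stale edge: an empty upstream list on dataset-A removes topic-A -> dataset-A.
emitter.emit_mcp(
    MetadataChangeProposalWrapper(
        entityUrn=builder.make_dataset_urn("bigquery", "dataset-A"),
        aspect=UpstreamLineageClass(upstreams=[]),
    )
)

# Write the new edge: topic-A -> dataset-B.
emitter.emit_mcp(
    MetadataChangeProposalWrapper(
        entityUrn=builder.make_dataset_urn("bigquery", "dataset-B"),
        aspect=UpstreamLineageClass(
            upstreams=[
                UpstreamClass(
                    dataset=builder.make_dataset_urn("kafka", "topic-A"),
                    type=DatasetLineageTypeClass.TRANSFORMED,
                )
            ]
        ),
    )
)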
ripe-lock-98414
07/04/2023, 4:23 AM

shy-dog-84302
07/04/2023, 5:48 AM

staleness flag into ingestion configuration. I am looking for a safe way to query and soft/hard delete those entries. Can someone help me with the datahub delete command or a GraphQL query that can give me the URNs of such data in DataHub?
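[Editor's note] Not an authoritative answer, but for reference: the delete CLI can filter by platform and supports a dry run, so a cautious sequence might look like the lines below (flags as described in the deleting-metadata docs; verify against your CLI version). Note that a platform filter matches everything on that platform, not just stale entries, so the preview step matters.

datahub delete --platform <platform> --soft --dry-run   # preview the URNs that would be affected
datahub delete --platform <platform> --soft             # soft delete: hidden from the UI, recoverable
datahub delete --urn "<urn>" --hard                     # hard delete a single entity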
bland-orange-13353
07/04/2023, 8:34 AM

worried-butcher-72025
07/04/2023, 10:27 AM

Execution finished with errors.
{'exec_id': '61e7151b-2e5c-4d4d-8336-88becfa736c3',
'infos': ['2023-07-04 10:20:22.711650 INFO: Starting execution for task with name=RUN_INGEST',
"2023-07-04 10:21:01.296376 INFO: Failed to execute 'datahub ingest'",
'2023-07-04 10:21:01.296911 INFO: Caught exception EXECUTING task_id=61e7151b-2e5c-4d4d-8336-88becfa736c3, name=RUN_INGEST, '
'stacktrace=Traceback (most recent call last):\n'
' File "/usr/local/lib/python3.10/site-packages/acryl/executor/execution/default_executor.py", line 122, in execute_task\n'
' task_event_loop.run_until_complete(task_future)\n'
' File "/usr/local/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete\n'
' return future.result()\n'
' File "/usr/local/lib/python3.10/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 231, in execute\n'
' raise TaskError("Failed to execute \'datahub ingest\'")\n'
"acryl.executor.execution.task.TaskError: Failed to execute 'datahub ingest'\n"],
'errors': []}
This is the final part of the error:
packages/pydantic/_internal/_generate_schema.py", line 578, in _arbitrary_type_schema
raise PydanticSchemaGenerationError(
pydantic.errors.PydanticSchemaGenerationError: Unable to generate pydantic-core schema for datahub.utilities.lossy_collections.LossyList[str]. Set `arbitrary_types_allowed=True` in the model_config to ignore this error or implement `__get_pydantic_core_schema__` on your type to fully support it.
If you got this error by calling handler(<some type>) within `__get_pydantic_core_schema__` then you likely need to call `handler.generate_schema(<some type>)` since we do not call `__get_pydantic_core_schema__` on `<some type>` otherwise to avoid infinite recursion.
For further information visit <https://errors.pydantic.dev/2.0/u/schema-for-unknown-type>
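[Editor's note] This looks like the well-known pydantic 2.x incompatibility: pydantic 2.0 was released at the end of June 2023, and DataHub CLI versions from that period required pydantic 1.x, so a freshly built ingestion venv that resolves to 2.0 fails with exactly this LossyList[str] schema error. If that is the cause (an assumption, not verified against this environment), pinning the dependency should work:

pip install 'pydantic<2.0'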
astonishing-dusk-99990
07/05/2023, 10:36 AM

limited-forest-73733
07/05/2023, 2:52 PM

bitter-waitress-17567
07/05/2023, 5:45 PM

if not username.startswith("urn:li:corpuser:")
AttributeError: 'list' object has no attribute 'startswith'
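[Editor's note] The AttributeError means a list reached code that expects a single username string. The surrounding function is not visible in this thread, so the following is only an illustrative guard, not the actual fix:

def to_corpuser_urn(username):
    # Hypothetical helper: the real call site is not shown above.
    if isinstance(username, list):
        # Unwrap a single-element list; the real fix is to find where the
        # list is produced rather than papering over it here.
        username = username[0]
    if not username.startswith("urn:li:corpuser:"):
        username = f"urn:li:corpuser:{username}"
    return username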
bitter-waitress-17567
07/05/2023, 5:46 PM

rich-crowd-33361
07/06/2023, 12:02 AM

quiet-scientist-40341
07/06/2023, 3:04 AM

quiet-scientist-40341
07/06/2023, 3:05 AM

worried-rocket-84695
07/06/2023, 5:23 AM

many-rocket-80549
07/06/2023, 9:52 AM

pip install 'acryl-datahub[hana]'
and
pip install pyhdb
which are mentioned in the documentation. However, we are still seeing the following error:
~~~~ Execution Summary - RUN_INGEST ~~~~
Execution finished with errors.
{'exec_id': '38bab7e3-419e-45f2-a56e-7563b182c83d',
'infos': ['2023-07-06 09:49:46.028237 INFO: Starting execution for task with name=RUN_INGEST',
"2023-07-06 09:49:50.106268 INFO: Failed to execute 'datahub ingest'",
'2023-07-06 09:49:50.106989 INFO: Caught exception EXECUTING task_id=38bab7e3-419e-45f2-a56e-7563b182c83d, name=RUN_INGEST, '
'stacktrace=Traceback (most recent call last):\n'
' File "/usr/local/lib/python3.10/site-packages/acryl/executor/execution/default_executor.py", line 122, in execute_task\n'
' task_event_loop.run_until_complete(task_future)\n'
' File "/usr/local/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete\n'
' return future.result()\n'
' File "/usr/local/lib/python3.10/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 231, in execute\n'
' raise TaskError("Failed to execute \'datahub ingest\'")\n'
"acryl.executor.execution.task.TaskError: Failed to execute 'datahub ingest'\n"],
'errors': []}
~~~~ Ingestion Report ~~~~
{
"cli": {
"cli_version": "0.10.0.7",
"cli_entry_location": "/usr/local/lib/python3.10/site-packages/datahub/__init__.py",
"py_version": "3.10.10 (main, Mar 14 2023, 02:37:11) [GCC 10.2.1 20210110]",
"py_exec_path": "/usr/local/bin/python",
"os_details": "Linux-5.15.0-76-generic-x86_64-with-glibc2.31",
"peak_memory_usage": "75.97 MB",
"mem_info": "75.97 MB"
},
"source": {
"type": "hana",
"report": {
"events_produced": 0,
"events_produced_per_sec": 0,
"entities": {},
"aspects": {},
"warnings": {},
"failures": {},
"soft_deleted_stale_entities": [],
"tables_scanned": 0,
"views_scanned": 0,
"entities_profiled": 0,
"filtered": [],
"start_time": "2023-07-06 09:49:47.606186 (now)",
"running_time": "0.19 seconds"
}
},
"sink": {
"type": "datahub-rest",
"report": {
"total_records_written": 0,
"records_written_per_second": 0,
"warnings": [],
"failures": [],
"start_time": "2023-07-06 09:49:47.395413 (now)",
"current_time": "2023-07-06 09:49:47.793147 (now)",
"total_duration_in_seconds": 0.4,
"gms_version": "v0.10.3",
"pending_requests": 0
}
}
}
~~~~ Ingestion Logs ~~~~
Obtaining venv creation lock...
Acquired venv creation lock
venv setup time = 0
This version of datahub supports report-to functionality
datahub ingest run -c /tmp/datahub/ingest/38bab7e3-419e-45f2-a56e-7563b182c83d/recipe.yml --report-to /tmp/datahub/ingest/38bab7e3-419e-45f2-a56e-7563b182c83d/ingestion_report.json
[2023-07-06 09:49:47,325] INFO {datahub.cli.ingest_cli:173} - DataHub CLI version: 0.10.0.7
[2023-07-06 09:49:47,400] INFO {datahub.ingestion.run.pipeline:184} - Sink configured successfully. DataHubRestEmitter: configured to talk to <http://datahub-gms:8080>
[2023-07-06 09:49:47,627] INFO {datahub.ingestion.run.pipeline:201} - Source configured successfully.
[2023-07-06 09:49:47,628] INFO {datahub.cli.ingest_cli:129} - Starting metadata ingestion
[2023-07-06 09:49:47,800] INFO {datahub.ingestion.reporting.file_reporter:52} - Wrote UNKNOWN report successfully to <_io.TextIOWrapper name='/tmp/datahub/ingest/38bab7e3-419e-45f2-a56e-7563b182c83d/ingestion_report.json' mode='w' encoding='UTF-8'>
[2023-07-06 09:49:47,801] INFO {datahub.cli.ingest_cli:134} - Source (hana) report:
{'events_produced': 0,
'events_produced_per_sec': 0,
'entities': {},
'aspects': {},
'warnings': {},
'failures': {},
'soft_deleted_stale_entities': [],
'tables_scanned': 0,
'views_scanned': 0,
'entities_profiled': 0,
'filtered': [],
'start_time': '2023-07-06 09:49:47.606186 (now)',
'running_time': '0.19 seconds'}
[2023-07-06 09:49:47,801] INFO {datahub.cli.ingest_cli:137} - Sink (datahub-rest) report:
{'total_records_written': 0,
'records_written_per_second': 0,
'warnings': [],
'failures': [],
'start_time': '2023-07-06 09:49:47.395413 (now)',
'current_time': '2023-07-06 09:49:47.801202 (now)',
'total_duration_in_seconds': 0.41,
'gms_version': 'v0.10.3',
'pending_requests': 0}
[2023-07-06 09:49:48,004] ERROR {datahub.entrypoints:188} - Command failed: Can't load plugin: sqlalchemy.dialects:hana.hdbcli
Traceback (most recent call last):
File "/usr/local/lib/python3.10/site-packages/datahub/entrypoints.py", line 175, in main
sys.exit(datahub(standalone_mode=False, **kwargs))
File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1130, in __call__
return self.main(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1055, in main
rv = self.invoke(ctx)
File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1657, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1657, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/local/lib/python3.10/site-packages/click/core.py", line 760, in invoke
return __callback(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/click/decorators.py", line 26, in new_func
return f(get_current_context(), *args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/datahub/telemetry/telemetry.py", line 379, in wrapper
raise e
File "/usr/local/lib/python3.10/site-packages/datahub/telemetry/telemetry.py", line 334, in wrapper
res = func(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/datahub/utilities/memory_leak_detector.py", line 95, in wrapper
return func(ctx, *args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/datahub/cli/ingest_cli.py", line 198, in run
loop.run_until_complete(run_func_check_upgrade(pipeline))
File "/usr/local/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
return future.result()
File "/usr/local/lib/python3.10/site-packages/datahub/cli/ingest_cli.py", line 158, in run_func_check_upgrade
ret = await the_one_future
File "/usr/local/lib/python3.10/site-packages/datahub/cli/ingest_cli.py", line 149, in run_pipeline_async
return await loop.run_in_executor(
File "/usr/local/lib/python3.10/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
File "/usr/local/lib/python3.10/site-packages/datahub/cli/ingest_cli.py", line 140, in run_pipeline_to_completion
raise e
File "/usr/local/lib/python3.10/site-packages/datahub/cli/ingest_cli.py", line 132, in run_pipeline_to_completion
pipeline.run()
File "/usr/local/lib/python3.10/site-packages/datahub/ingestion/run/pipeline.py", line 339, in run
for wu in itertools.islice(
File "/usr/local/lib/python3.10/site-packages/datahub/utilities/source_helpers.py", line 85, in auto_stale_entity_removal
for wu in stream:
File "/usr/local/lib/python3.10/site-packages/datahub/utilities/source_helpers.py", line 36, in auto_status_aspect
for wu in stream:
File "/usr/local/lib/python3.10/site-packages/datahub/ingestion/source/sql/sql_common.py", line 505, in get_workunits_internal
for inspector in self.get_inspectors():
File "/usr/local/lib/python3.10/site-packages/datahub/ingestion/source/sql/sql_common.py", line 379, in get_inspectors
engine = create_engine(url, **self.config.options)
File "<string>", line 2, in create_engine
File "/usr/local/lib/python3.10/site-packages/sqlalchemy/util/deprecations.py", line 309, in warned
return fn(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/sqlalchemy/engine/create.py", line 522, in create_engine
entrypoint = u._get_entrypoint()
File "/usr/local/lib/python3.10/site-packages/sqlalchemy/engine/url.py", line 655, in _get_entrypoint
cls = registry.load(name)
File "/usr/local/lib/python3.10/site-packages/sqlalchemy/util/langhelpers.py", line 343, in load
raise exc.NoSuchModuleError(
sqlalchemy.exc.NoSuchModuleError: Can't load plugin: sqlalchemy.dialects:hana.hdbcli
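[Editor's note] A possible explanation, offered as an assumption rather than a confirmed diagnosis: SQLAlchemy resolves a hana+hdbcli:// URL through the sqlalchemy-hana dialect, which wraps SAP's hdbcli driver; pyhdb is a different, deprecated driver and does not provide the hana.hdbcli entrypoint the traceback is looking for. If so, the dialect and driver need to be importable in the environment the executor actually builds for the run:

pip install 'acryl-datahub[hana]' sqlalchemy-hana hdbcli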
witty-butcher-82399
07/06/2023, 10:32 AM

fancy-monitor-63529
07/06/2023, 11:42 AM

'failures': {'lineage-exported-gcp-audit-logs': ['Error: 400 Could not cast literal "20230703" to type TIMESTAMP at [13:18]\n\nLocation: US\nJob ID: (removed by me)\n']},
I will attach the old and new recipes below. I have gone over the wiki many times and cannot tell what I am missing. Perhaps my service account needs new permissions now.

quaint-appointment-83049
07/06/2023, 12:20 PM"pipeline_name": f"bigquery_metadata_ingestion_{ingestion.project_id}",
"source": {
"type": "bigquery",
"config": {
"env": worker_event.environment,
"project_id": f"{ingestion.project_id}",
"project_on_behalf": config.PROJECT_ID,
"profiling": {"enabled": False},
"column_limit": 900,
"use_exported_bigquery_audit_metadata": False,
"match_fully_qualified_names": True,
"dataset_pattern": {
# Specify datasets to be excluded
"deny": ingestion.exclusion_dataset_patterns,
},
"table_pattern": {
# Specify tables to be excluded
"deny": ingestion.exclusion_table_patterns,
},
"view_pattern": {
# Specify views to be excluded
"deny": ingestion.exclusion_view_patterns,
},
"stateful_ingestion": {"enabled": True},
# credential add BigQuery Credential for pipline source
# <https://datahubproject.io/docs/generated/ingestion/sources/bigquery#cli-based-ingestion-2>
"credential": self.credential,
},
},
"sink": {
"type": "datahub-rest",
"config": {
"server": config.DATAHUB_SERVER,
"token": config.DATAHUB_TOKEN,
"retry_max_times": 4,
"max_threads": 3,
},
},
brainy-butcher-66683
07/06/2023, 2:06 PM

source:
  type: mysql
  config:
    host_port: '********'
    database: null
    username: ****
    include_tables: true
    include_views: false
    profiling:
      enabled: true
      profile_table_level_only: false
    stateful_ingestion:
      enabled: true
    password: '${courier_chat_na}'
    schema_pattern:
      allow:
        - courier_chat
sink:
  type: datahub-rest
  config:
    server: '<datahub url>/api/gms'
    token: '${GMS_key}'
WARNING: These logs appear to be stale. No new logs have been received since 2023-07-05 23:25:45.280969 (297 seconds ago). However, the ingestion process still appears to be running and may complete normally.
acceptable-computer-51491
07/06/2023, 2:56 PM

[2023-07-06 09:16:10,979] DEBUG {datahub.emitter.rest_emitter:247} - Attempting to emit to DataHub GMS; using curl equivalent to:\n',
'2023-07-06 09:16:11.149010 [exec_id=280a9dbb-5208-4212-95ee-d28a9e4d4afc] INFO: Caught exception EXECUTING '
'task_id=280a9dbb-5208-4212-95ee-d28a9e4d4afc, name=RUN_INGEST, stacktrace=Traceback (most recent call last):\n'
' File "/usr/local/lib/python3.10/asyncio/streams.py", line 525, in readline\n'
' line = await self.readuntil(sep)\n'
' File "/usr/local/lib/python3.10/asyncio/streams.py", line 603, in readuntil\n'
' raise exceptions.LimitOverrunError(\n'
'asyncio.exceptions.LimitOverrunError: Separator is not found, and chunk exceed the limit\n'
'\n'
'During handling of the above exception, another exception occurred:\n'
'\n'
'Traceback (most recent call last):\n'
' File "/usr/local/lib/python3.10/site-packages/acryl/executor/execution/default_executor.py", line 123, in execute_task\n'
' task_event_loop.run_until_complete(task_future)\n'
' File "/usr/local/lib/python3.10/asyncio/base_events.py", line 646, in run_until_complete\n'
' return future.result()\n'
' File "/usr/local/lib/python3.10/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 147, in execute\n'
' await tasks.gather(_read_output_lines(), _report_progress(), _process_waiter())\n'
' File "/usr/local/lib/python3.10/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 99, in _read_output_lines\n'
' line_bytes = await ingest_process.stdout.readline()\n'
' File "/usr/local/lib/python3.10/asyncio/streams.py", line 534, in readline\n'
' raise ValueError(e.args[0])\n'
'ValueError: Separator is not found, and chunk exceed the limit\n']}
Datahub => v0.8.45 deployed on AWS EKS
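[Editor's note] For what it's worth, the LimitOverrunError above is asyncio's line reader giving up: StreamReader.readline() fails once a single line exceeds its buffer limit (64 KiB by default), and the DEBUG "curl equivalent" dump in the first log line is exactly the kind of very long line that can overflow it. A standalone illustration of the mechanism, not the executor's actual code:

import asyncio

async def main():
    # readline() raises LimitOverrunError (re-raised as ValueError) when one
    # line exceeds the reader's limit; a larger limit sidesteps the problem.
    proc = await asyncio.create_subprocess_exec(
        "datahub", "ingest", "-c", "recipe.yml",  # placeholder command
        stdout=asyncio.subprocess.PIPE,
        limit=16 * 1024 * 1024,  # 16 MiB per line vs. the 64 KiB default
    )
    while line := await proc.stdout.readline():
        print(line.decode(errors="replace"), end="")
    await proc.wait()

asyncio.run(main())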
delightful-school-94725
07/06/2023, 5:39 PM

bitter-waitress-17567
07/06/2023, 6:46 PM

rich-restaurant-61261
07/06/2023, 10:42 PM

delightful-school-94725
07/07/2023, 12:33 PM

numerous-address-22061
07/07/2023, 5:15 PM