shy-lion-56425
09/21/2022, 8:50 PM
source:
  type: s3
  config:
    path_specs:
      - include: "s3://cseo-global-cloudtrail/AWSLogs/057183463473/{table}/{partition[0]}/{partition[1]}/{partition[2]}/{partition[3]}/*_CloudTrail-Digest_*.json.gz"
      - exclude: "**/AWSLogs/057183463473/CloudTrail-Digest/**"
    aws_config:
      aws_access_key_id: "{aws_key}"
      aws_secret_access_key: "{aws_secret}"
      aws_region: us-east-1
    profiling:
      enabled: false
Error:
[2022-09-21 15:47:53,596] ERROR {datahub.ingestion.run.pipeline:127} - 'include'
Traceback (most recent call last):
File "/Users/raithels/opt/anaconda3/lib/python3.9/site-packages/datahub/ingestion/run/pipeline.py", line 178, in __init__
self.source: Source = source_class.create(
File "/Users/raithels/opt/anaconda3/lib/python3.9/site-packages/datahub/ingestion/source/s3/source.py", line 321, in create
config = DataLakeSourceConfig.parse_obj(config_dict)
File "pydantic/main.py", line 521, in pydantic.main.BaseModel.parse_obj
File "pydantic/main.py", line 339, in pydantic.main.BaseModel.__init__
File "pydantic/main.py", line 1056, in pydantic.main.validate_model
File "pydantic/fields.py", line 868, in pydantic.fields.ModelField.validate
File "pydantic/fields.py", line 901, in pydantic.fields.ModelField._validate_sequence_like
File "pydantic/fields.py", line 1067, in pydantic.fields.ModelField._validate_singleton
File "pydantic/fields.py", line 857, in pydantic.fields.ModelField.validate
File "pydantic/fields.py", line 1074, in pydantic.fields.ModelField._validate_singleton
File "pydantic/fields.py", line 1121, in pydantic.fields.ModelField._apply_validators
File "pydantic/class_validators.py", line 313, in pydantic.class_validators._generic_validator_basic.lambda12
File "pydantic/main.py", line 704, in pydantic.main.BaseModel.validate
File "pydantic/main.py", line 339, in pydantic.main.BaseModel.__init__
File "pydantic/main.py", line 1082, in pydantic.main.validate_model
File "/Users/raithels/opt/anaconda3/lib/python3.9/site-packages/datahub/ingestion/source/aws/path_spec.py", line 104, in validate_path_spec
if "**" in values["include"]:
KeyError: 'include'
[2022-09-21 15:47:53,598] INFO {datahub.cli.ingest_cli:119} - Starting metadata ingestion
[2022-09-21 15:47:53,598] INFO {datahub.cli.ingest_cli:137} - Finished metadata ingestion
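The validator at path_spec.py:104 reads values["include"] for every entry in path_specs, so the KeyError comes from the second list item, which only has an exclude key. exclude belongs inside the same path_spec entry, as a list. A minimal sketch of the corrected shape, with the same paths:

path_specs:
  - include: "s3://cseo-global-cloudtrail/AWSLogs/057183463473/{table}/{partition[0]}/{partition[1]}/{partition[2]}/{partition[3]}/*_CloudTrail-Digest_*.json.gz"
    exclude:
      - "**/AWSLogs/057183463473/CloudTrail-Digest/**"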
few-sugar-84064
09/22/2022, 3:17 AM
Error parsing DAG for Glue job. The script s3://steadio-glue-info/scripts/test-datahub-lineage.py cannot be processed by Glue (this usually occurs when it has been user-modified): An error occurred (InvalidInputException) when calling the GetDataflowGraph operation: line 11:87 no viable alternative at input \'## @type: DataSource\\n## @args: [catalog_connection = "redshiftconnection", connection_options = {"database" =\'']}
• Dataset job code - I have no idea what I need to put for the job id and flow id.
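If it helps, a minimal sketch of where the flow id and job id go with the DataHub Python emitter; the ids and the GMS URL here are placeholders, not values from this thread:

from datahub.api.entities.datajob import DataFlow, DataJob
from datahub.emitter.rest_emitter import DatahubRestEmitter

emitter = DatahubRestEmitter("http://localhost:8080")  # placeholder GMS endpoint

# The flow id names the pipeline as a whole...
flow = DataFlow(orchestrator="glue", cluster="prod", id="my_glue_pipeline")
flow.emit(emitter)

# ...and the job id names one task within that flow, scoped by flow_urn.
job = DataJob(flow_urn=flow.urn, id="my_glue_job")
job.emit(emitter)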
cool-vr-73109
09/22/2022, 8:06 AM
kind-scientist-44426
09/22/2022, 9:33 AM
source:
  type: "ldap"
  config:
    ldap_server: <server>
    ldap_user: "cn=<user_name>,dc=example,dc=org"
    ldap_password: "<password>"
    base_dn: "dc=example,dc=org"
But when running this recipe from the UI, I'm getting the errors below:
ERROR {datahub.ingestion.run.pipeline:127} - LDAP connection failed\n'
"AttributeError: 'Pipeline' object has no attribute 'source'\n"
"[2022-09-22 07:19:56,128] ERROR {datahub.entrypoints:188} - Command failed with 'Pipeline' object has no attribute 'source'
Can someone suggest the reason?
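The AttributeError at the end is secondary fallout (the pipeline never finished constructing its source), so the root cause is the "LDAP connection failed" line. One common cause worth ruling out, as a guess since the server value is redacted above: ldap_server needs to be a full LDAP URI including the scheme, for example:

source:
  type: "ldap"
  config:
    ldap_server: "ldap://ldap.example.org:389"  # hypothetical host; use ldaps:// for TLS
    ldap_user: "cn=admin,dc=example,dc=org"
    ldap_password: "<password>"
    base_dn: "dc=example,dc=org"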
mammoth-air-95743
09/22/2022, 10:37 AM
'[2022-09-20 09:41:44,078] INFO {datahub.ingestion.source.s3.source:519} - Extracting table schema from file: '
's3://path/to/file.json\n'
'[2022-09-20 09:41:44,078] INFO {datahub.ingestion.source.s3.source:527} - Creating dataset urn with name: '
'path/to/file.json\n'
mammoth-air-95743
09/22/2022, 10:40 AM
careful-action-61962
09/22/2022, 11:32 AM
some-printer-33912
09/22/2022, 3:03 PM"[2022-09-22 14:54:17,208] ERROR {datahub.entrypoints:188} - Command failed with HTTPConnectionPool(host='localhost', port=8080): Max "
"retries exceeded with url: /config (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f1859651c90>: Failed "
"to establish a new connection: [Errno 111] Connection refused')). Run with --debug to get full trace\n"
~~~~ Execution Summary ~~~~
RUN_INGEST - {'errors': [],
'exec_id': '2547645d-cd36-4752-b64d-d5bf7552b7c6',
'infos': ['2022-09-22 14:54:03.509045 [exec_id=2547645d-cd36-4752-b64d-d5bf7552b7c6] INFO: Starting execution for task with name=RUN_INGEST',
'2022-09-22 14:54:17.675643 [exec_id=2547645d-cd36-4752-b64d-d5bf7552b7c6] INFO: stdout=venv setup time = 0\n'
'This version of datahub supports report-to functionality\n'
'datahub ingest run -c /tmp/datahub/ingest/2547645d-cd36-4752-b64d-d5bf7552b7c6/recipe.yml --report-to '
'/tmp/datahub/ingest/2547645d-cd36-4752-b64d-d5bf7552b7c6/ingestion_report.json\n'
'[2022-09-22 14:54:04,862] INFO {datahub.cli.ingest_cli:170} - DataHub CLI version: 0.8.42\n'
"[2022-09-22 14:54:17,208] ERROR {datahub.entrypoints:188} - Command failed with HTTPConnectionPool(host='localhost', port=8080): Max "
"retries exceeded with url: /config (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f1859651c90>: Failed "
"to establish a new connection: [Errno 111] Connection refused')). Run with --debug to get full trace\n"
'[2022-09-22 14:54:17,208] INFO {datahub.entrypoints:191} - DataHub CLI version: 0.8.42 at '
'/tmp/datahub/ingest/venv-mssql-0.8.42/lib/python3.10/site-packages/datahub/__init__.py\n',
"2022-09-22 14:54:17.677042 [exec_id=2547645d-cd36-4752-b64d-d5bf7552b7c6] INFO: Failed to execute 'datahub ingest'",
'2022-09-22 14:54:17.677216 [exec_id=2547645d-cd36-4752-b64d-d5bf7552b7c6] INFO: Caught exception EXECUTING '
'task_id=2547645d-cd36-4752-b64d-d5bf7552b7c6, name=RUN_INGEST, stacktrace=Traceback (most recent call last):\n'
' File "/usr/local/lib/python3.10/site-packages/acryl/executor/execution/default_executor.py", line 123, in execute_task\n'
' task_event_loop.run_until_complete(task_future)\n'
' File "/usr/local/lib/python3.10/asyncio/base_events.py", line 646, in run_until_complete\n'
' return future.result()\n'
' File "/usr/local/lib/python3.10/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 168, in execute\n'
' raise TaskError("Failed to execute \'datahub ingest\'")\n'
"acryl.executor.execution.task.TaskError: Failed to execute 'datahub ingest'\n"]}
Execution finished with errors.
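The connection refused on localhost:8080 usually means the ingestion subprocess runs inside the actions/executor container, where GMS is not reachable on localhost. A minimal sink sketch pointing at the GMS service instead; the hostname datahub-gms is an assumption, substitute whatever your deployment calls the GMS service:

sink:
  type: datahub-rest
  config:
    server: "http://datahub-gms:8080"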
limited-forest-73733
09/22/2022, 3:16 PM
limited-forest-73733
09/22/2022, 3:16 PM
thankful-morning-85093
09/22/2022, 4:33 PM
bland-balloon-48379
09/22/2022, 7:12 PM
careful-engine-38533
09/23/2022, 4:19 AM
'/usr/local/bin/run_ingest.sh: line 40: 79 Killed ( datahub ingest run -c "${recipe_file}" ${report_option} )\n',
"2022-09-22 06:29:49.739560 [exec_id=29430983-bfd2-4551-b153-c869537f5fe5] INFO: Failed to execute 'datahub ingest'",
'2022-09-22 06:29:49.739831 [exec_id=29430983-bfd2-4551-b153-c869537f5fe5] INFO: Caught exception EXECUTING '
'task_id=29430983-bfd2-4551-b153-c869537f5fe5, name=RUN_INGEST, stacktrace=Traceback (most recent call last):\n'
' File "/usr/local/lib/python3.9/site-packages/acryl/executor/execution/default_executor.py", line 122, in execute_task\n'
' self.event_loop.run_until_complete(task_future)\n'
' File "/usr/local/lib/python3.9/site-packages/nest_asyncio.py", line 89, in
cuddly-arm-8412
09/23/2022, 2:18 AM
from datahub.api.entities.datajob import DataFlow, DataJob
from datahub.emitter.rest_emitter import DatahubRestEmitter

emitter = DatahubRestEmitter("http://localhost:8080")  # assumed GMS endpoint

jobFlow = DataFlow(cluster="prod", orchestrator="airflow", id="flow_new_api")
jobFlow.emit(emitter)
dataJob = DataJob(flow_urn=jobFlow.urn, id="flow_new_api_job1")
dataJob.emit(emitter)
But in the UI, I found that they were not related. How should I troubleshoot this?
narrow-toothbrush-13209
09/23/2022, 7:16 AM
acryl-executor pip package?
brave-tomato-16287
09/23/2022, 11:13 AM
'urn:li:dataset:(urn:li:dataPlatform:tableau,d1ad8766-18c8-6938-770e-42929141371c,PROD)\\n Cause: ERROR :: '
'/upstreams/0/dataset :: \\"Provided urn '
"urn:li:dataset:(urn:li:dataPlatform:google-sheets,temp_0ufiu670cqle3e165n9eh12vw5vo.'am, bad debt users, to make "
'bal$\',PROD)\\" is invalid: Failed to convert urn to entity key: urns parts and key fields do not have same length\\n", '
'"message": "Invalid urn format for aspect: {upstreams=[{type=TRANSFORMED, auditStamp={actor=urn:li:corpuser:unknown, time=0}, '
'dataset=urn:li:dataset:(urn:li:dataPlatform:google-sheets,temp_0ufiu670cqle3e165n9eh12", "status": 400, "id": '
'"urn:li:dataset:(urn:li:dataPlatform:tableau,d1ad8766-18c8-6938-770e-42929141371c,PROD)"}}], "failures": [{"error": "Unable '
'to emit metadata to DataHub GMS", "info": {"exceptionClass": "com.linkedin.restli.server.RestLiServiceException",
glamorous-wire-83850
09/23/2022, 11:47 AM
source:
  type: bigquery
  config:
    project_id: service_acc_project
    storage_project_id: second_project
    credential:
      project_id: service_acc_project
      private_key_id: '${BQ_PRIVATE_KEY_ID2}'
      client_email: abc-abc@service.iam.gserviceaccount.com
      private_key: '${BQ_PRIVATE_KEY2}'
      client_id: '11111111'
    include_tables: true
    include_views: true
    include_table_lineage: true
lemon-engine-23512
09/23/2022, 11:48 AM
adamant-rain-51672
09/23/2022, 12:32 PM
~~~~ Execution Summary ~~~~
RUN_INGEST - {'errors': [],
'exec_id': 'e7e241c2-dcbf-43c9-9363-0eb77c8a1fad',
'infos': ['2022-09-23 12:30:07.735004 [exec_id=e7e241c2-dcbf-43c9-9363-0eb77c8a1fad] INFO: Starting execution for task with name=RUN_INGEST',
'2022-09-23 12:30:07.735648 [exec_id=e7e241c2-dcbf-43c9-9363-0eb77c8a1fad] INFO: Caught exception EXECUTING '
'task_id=e7e241c2-dcbf-43c9-9363-0eb77c8a1fad, name=RUN_INGEST, stacktrace=Traceback (most recent call last):\n'
' File "/usr/local/lib/python3.9/site-packages/acryl/executor/execution/default_executor.py", line 121, in execute_task\n'
' self.event_loop.run_until_complete(task_future)\n'
' File "/usr/local/lib/python3.9/site-packages/nest_asyncio.py", line 89, in run_until_complete\n'
' return f.result()\n'
' File "/usr/local/lib/python3.9/asyncio/futures.py", line 201, in result\n'
' raise self._exception\n'
' File "/usr/local/lib/python3.9/asyncio/tasks.py", line 256, in __step\n'
' result = coro.send(None)\n'
' File "/usr/local/lib/python3.9/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 71, in execute\n'
' validated_args = SubProcessIngestionTaskArgs.parse_obj(args)\n'
' File "pydantic/main.py", line 521, in pydantic.main.BaseModel.parse_obj\n'
' File "pydantic/main.py", line 341, in pydantic.main.BaseModel.__init__\n'
'pydantic.error_wrappers.ValidationError: 1 validation error for SubProcessIngestionTaskArgs\n'
'debug_mode\n'
' extra fields not permitted (type=value_error.extra)\n']}
Execution finished with errors.
Do you maybe know what's causing this?
future-smartphone-53257
09/23/2022, 12:57 PM
bumpy-whale-50799
09/23/2022, 1:12 PM
gray-cpu-75769
09/22/2022, 12:40 PM
source:
  type: bigquery
  config:
    credential:
      private_key_id: '${private_key_id}'
      project_id: '${project_id}'
      client_email: '${Client_email}'
      private_key: '${private_key}'
      client_id: '${client_id}'
    profiling:
      enabled: true
    project_id: '${project_id}'
    table_pattern:
      allow:
        - daas-prod-251711.cdo.online_merchant
    profile_pattern:
      allow:
        - daas-prod-251711.cdo.online_merchant
pipeline_name: 'urn:li:dataHubIngestionSource:60e4b0c9-dc16-4138-8fa7-d0c881af095a'
chilly-potato-57465
09/23/2022, 1:33 PM
fresh-nest-42426
09/23/2022, 9:16 PM
' File "/usr/local/lib/python3.9/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 96, in _read_output_lines\n'
' line_bytes = await ingest_process.stdout.readline()\n'
' File "/usr/local/lib/python3.9/asyncio/streams.py", line 549, in readline\n'
' raise ValueError(e.args[0])\n'
'ValueError: Separator is found, but chunk is longer than limit\n']}
I see a related thread here https://datahubspace.slack.com/archives/CUMUWQU66/p1663143783318239
Is there a way to exclude the upstream S3 lineage collection for certain Redshift tables? So far I've had to exclude such tables with extensive S3 upstreams entirely, otherwise ingestion doesn't work. I'm using v0.8.44
and datahub actions 0.0.7.
Thanks!
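Not a definitive answer, but two lineage knobs on the Redshift source may help here; the exact flag names are an assumption on my part, so verify them against the source docs for v0.8.44:

source:
  type: redshift
  config:
    include_table_lineage: true
    include_copy_lineage: false   # assumed flag: skips COPY-from-S3 upstream lineage
    table_lineage_mode: stl_scan_based   # assumed option: cheaper lineage extraction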
green-lion-58215
09/23/2022, 10:47 PM
little-spring-72943
09/24/2022, 9:55 PM
scope = DatasetAssertionScope.DATASET_COLUMN
operator = AssertionStdOperator.BETWEEN
aggregation = AssertionStdAggregation.SUM
UI shows: Column Amount values are between 0 and 1
The Sum aggregation is missing from the rendered text. How can we fix this, or provide custom text here?
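For context, a sketch of how those three values sit together in the assertion aspect, using the *Class names from datahub.metadata.schema_classes (the dataset urn is a placeholder). If the metadata is shaped like this, the dropped "Sum" looks like a UI rendering gap rather than missing data:

from datahub.metadata.schema_classes import (
    AssertionInfoClass,
    AssertionStdAggregationClass,
    AssertionStdOperatorClass,
    AssertionStdParameterClass,
    AssertionStdParametersClass,
    AssertionStdParameterTypeClass,
    AssertionTypeClass,
    DatasetAssertionInfoClass,
    DatasetAssertionScopeClass,
)

# Placeholder dataset urn; substitute your own.
dataset_urn = "urn:li:dataset:(urn:li:dataPlatform:snowflake,db.schema.table,PROD)"

assertion_info = AssertionInfoClass(
    type=AssertionTypeClass.DATASET,
    datasetAssertion=DatasetAssertionInfoClass(
        dataset=dataset_urn,
        scope=DatasetAssertionScopeClass.DATASET_COLUMN,
        operator=AssertionStdOperatorClass.BETWEEN,
        aggregation=AssertionStdAggregationClass.SUM,
        # BETWEEN reads its bounds from the min/max parameters.
        parameters=AssertionStdParametersClass(
            minValue=AssertionStdParameterClass(value="0", type=AssertionStdParameterTypeClass.NUMBER),
            maxValue=AssertionStdParameterClass(value="1", type=AssertionStdParameterTypeClass.NUMBER),
        ),
    ),
)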
acceptable-judge-21659
09/26/2022, 6:52 AM
bumpy-journalist-41369
09/26/2022, 7:33 AM
'2022-09-21 12:33:03.932429 [exec_id=14acb269-e6af-4ca0-871b-684c02a11814] INFO: Caught exception EXECUTING '
'task_id=14acb269-e6af-4ca0-871b-684c02a11814, name=RUN_INGEST, stacktrace=Traceback (most recent call last):\n'
' File "/usr/local/lib/python3.10/asyncio/streams.py", line 525, in readline\n'
' line = await self.readuntil(sep)\n'
' File "/usr/local/lib/python3.10/asyncio/streams.py", line 620, in readuntil\n'
' raise exceptions.LimitOverrunError(\n'
'asyncio.exceptions.LimitOverrunError: Separator is found, but chunk is longer than limit\n'
'\n'
'During handling of the above exception, another exception occurred:\n'
'\n'
'Traceback (most recent call last):\n'
' File "/usr/local/lib/python3.10/site-packages/acryl/executor/execution/default_executor.py", line 123, in execute_task\n'
' task_event_loop.run_until_complete(task_future)\n'
' File "/usr/local/lib/python3.10/asyncio/base_events.py", line 646, in run_until_complete\n'
' return future.result()\n'
' File "/usr/local/lib/python3.10/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 147, in execute\n'
' await tasks.gather(_read_output_lines(), _report_progress(), _process_waiter())\n'
' File "/usr/local/lib/python3.10/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 99, in _read_output_lines\n'
' line_bytes = await ingest_process.stdout.readline()\n'
' File "/usr/local/lib/python3.10/asyncio/streams.py", line 534, in readline\n'
' raise ValueError(e.args[0])\n'
'ValueError: Separator is found, but chunk is longer than limit\n']}
Execution finished with errors.
And eventually the ingestion fails, even though it managed to ingest some of the data. My recipe looks like this:
sink:
  type: datahub-rest
  config:
    server: 'http://datahub-datahub-gms:8080'
source:
  type: glue
  config:
    aws_region: us-east-1
    database_pattern:
      allow:
        - product_metrics
I don’t see any other exceptions in the log. Does anyone know how to fix it?
clean-tomato-22549
09/26/2022, 9:42 AM
My recipe uses type: snowflake and has the following parameters enabled while ingesting. What other settings do I need to light up the tabs?
type: snowflake
ignore_start_time_lineage: true
include_table_lineage: true
include_view_lineage: true
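If the tabs in question are Stats and Queries, those are driven by profiling and usage ingestion rather than the lineage flags above. A sketch under that assumption (snowflake-usage existed as a separate source type in this CLI era; verify both names against your version's docs):

# Stats tab: enable profiling in the snowflake recipe
profiling:
  enabled: true

# Queries tab: run a separate usage recipe (assumed source type)
source:
  type: snowflake-usage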