great-branch-515  09/17/2022, 5:06 AM
full-chef-85630  09/18/2022, 2:25 PM
clean-tomato-22549  09/19/2022, 4:01 AM
without 9999. I tried to provide the external_base_url parameter, but it does not work.
magnificent-lock-58916  09/19/2022, 5:49 AM
creamy-pizza-80433  09/19/2022, 7:10 AM
better-actor-97450  09/19/2022, 10:10 AM
rich-machine-24265  09/19/2022, 10:34 AM
base64.b85encode(bz2.compress(pickle.dumps(self))) instead of json_str_self.encode("utf-8") here: https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/source/state/checkpoint.py#L49? Thank you!
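For context, a minimal sketch of what that expression does compared with a plain UTF-8 JSON encoding; the payload below is hypothetical and only stands in for the checkpoint's self:

import base64
import bz2
import json
import pickle

# Hypothetical checkpoint payload, standing in for `self` in checkpoint.py.
state = {"pipeline_name": "demo", "last_run_ts": 1663600000}

# Plain JSON: human-readable, but limited to JSON-compatible types.
json_bytes = json.dumps(state).encode("utf-8")

# pickle + bz2 + base85: handles arbitrary Python objects, compresses large
# states, and base85 keeps the blob ASCII-safe for storage in a text field.
blob = base64.b85encode(bz2.compress(pickle.dumps(state)))

# Round-trip to show both encodings preserve this payload.
assert pickle.loads(bz2.decompress(base64.b85decode(blob))) == state
assert json.loads(json_bytes.decode("utf-8")) == state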
square-winter-39825  09/19/2022, 3:03 PM
spark.jars.packages io.acryl:datahub-spark-lineage:0.8.44-3 as an external jar to the cluster. I am looking to add the extra listener and URL. Can somebody please provide the steps if they have configured spark-lineage on Databricks clusters? Thanks!
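A sketch of the settings the spark-lineage agent documents, expressed here through a PySpark session builder; on Databricks the same keys would normally go into the cluster's Spark config UI instead, and the GMS URL below is a placeholder:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("lineage-demo")
    # Pull the DataHub Spark lineage agent onto the cluster.
    .config("spark.jars.packages", "io.acryl:datahub-spark-lineage:0.8.44-3")
    # Register the listener that emits lineage events to DataHub.
    .config("spark.extraListeners", "datahub.spark.DatahubSparkListener")
    # Placeholder: point this at your GMS endpoint.
    .config("spark.datahub.rest.server", "http://datahub-gms:8080")
    .getOrCreate()
)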
swift-nail-32514  09/19/2022, 4:43 PM
Databricks sources are being listed with the platform type of Hive, and Synapse sources are being listed as MSSQL, because those are the plugins that need to be used for ingestion. Is there a way to force a differentiation that we're just not using? cc team members @brave-pencil-21289 @great-optician-81135
agreeable-farmer-44067  09/19/2022, 6:18 PM
alert-fall-82501  09/20/2022, 9:22 AM
Failed to create source due to Protocol message FieldDescriptorProto has no "proto3_optional" field.
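That error usually means the protobuf runtime in the ingestion environment predates proto3 optional-field support (added in protobuf 3.12), while the generated code expects it; a quick check, assuming an upgrade of the protobuf package is the fix:

import google.protobuf

# proto3_optional was added to FieldDescriptorProto in protobuf 3.12; older
# runtimes raise this exact error when loading newer generated _pb2 modules.
print(google.protobuf.__version__)
# If the version is below 3.12: pip install --upgrade protobuf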
clean-tomato-22549  09/20/2022, 9:46 AM
profiling.partition_datetime: according to the doc (https://datahubproject.io/docs/generated/ingestion/sources/presto-on-hive), only BigQuery supports this. Is there a plan to support this parameter for Presto on Hive?
microscopic-mechanic-13766  09/20/2022, 12:10 PM
max_workers, but it hasn't improved the times. I don't think it is a problem with the quantity of data (I only have 4 tables, and the largest has 30 rows) or with my deployment, as it didn't use to be this slow. Any tips on how to either speed up the ingestion or determine the actual cause of the problem? (Profiling of other sources is normal, so it isn't a problem with ingestion or profiling in general, but with Hive ingestion specifically.)
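One way to isolate the slowdown is to restrict profiling to table-level statistics and compare timings; a minimal programmatic recipe sketch, where the host names are placeholders and the Hive connection details are assumed to be the standard ones:

from datahub.ingestion.run.pipeline import Pipeline

# profile_table_level_only skips the per-column queries, which is usually
# where Hive profiling time goes; if this run is fast, the cost is in the
# column-level profiling queries rather than in the connection itself.
pipeline = Pipeline.create(
    {
        "source": {
            "type": "hive",
            "config": {
                "host_port": "hive-server:10000",  # placeholder
                "profiling": {
                    "enabled": True,
                    "profile_table_level_only": True,
                },
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "http://datahub-gms:8080"},  # placeholder
        },
    }
)
pipeline.run()
pipeline.raise_from_status()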
brave-pencil-21289  09/20/2022, 12:27 PM
alert-fall-82501  09/20/2022, 12:43 PM
[2022-09-20, 12:32:59 UTC] {subprocess.py:89} INFO - raise NewConnectionError(
[2022-09-20, 12:32:59 UTC] {subprocess.py:89} INFO - urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPSConnection object at 0x7f98aff4f940>: Failed to establish a new connection: [Errno -2] Name or service not known
[2022-09-20, 12:32:59 UTC] {subprocess.py:89} INFO - [2022-09-20, 12:32:59 UTC] WARNING {elasticsearch:293} - GET <https://vpc-prod-test-2-zzshies2gij47gfl4x5sehcebe.eu-central-1.es.amazonaws.com:443/_alias> [status:N/A request:0.803s]
[2022-09-20, 12:32:59 UTC] {subprocess.py:89} INFO - Traceback (most recent call last):
[2022-09-20, 12:32:59 UTC] {subprocess.py:89} INFO - File "/usr/local/lib/python3.8/site-packages/urllib3/connection.py", line 174, in _new_conn
[2022-09-20, 12:32:59 UTC] {subprocess.py:89} INFO - conn = connection.create_connection(
[2022-09-20, 12:32:59 UTC] {subprocess.py:89} INFO - File "/usr/local/lib/python3.8/site-packages/urllib3/util/connection.py", line 73, in create_connection
[2022-09-20, 12:32:59 UTC] {subprocess.py:89} INFO - for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
[2022-09-20, 12:32:59 UTC] {subprocess.py:89} INFO - File "/usr/local/lib/python3.8/socket.py", line 918, in getaddrinfo
[2022-09-20, 12:32:59 UTC] {subprocess.py:89} INFO - for res in _socket.getaddrinfo(host, port, family
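Errno -2 ("Name or service not known") is a DNS resolution failure rather than a timeout; a quick way to confirm from the same host, reusing the Elasticsearch endpoint from the log:

import socket

# Performs the same lookup urllib3 attempts; raises socket.gaierror with
# "Name or service not known" if the host cannot resolve the VPC endpoint,
# e.g. when running outside the VPC or without the right DNS settings.
host = "vpc-prod-test-2-zzshies2gij47gfl4x5sehcebe.eu-central-1.es.amazonaws.com"
print(socket.getaddrinfo(host, 443, proto=socket.IPPROTO_TCP))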
cool-vr-73109  09/20/2022, 3:54 PM
cool-vr-73109  09/20/2022, 4:03 PM
cool-vr-73109  09/20/2022, 4:04 PM
shy-lion-56425  09/20/2022, 4:30 PM
source:
  type: s3
  config:
    profiling:
      enabled: false
    path_specs:
      - include: 's3://MY_EXAMPLE_BUCKET/AWSLogs/0123456789/CloudTrail/us-east-1/2022/08/23/*.*'
      - enable_compresion: true
    aws_config:
      aws_access_key_id: '${AWS_ACCESS_KEY_ID_CLOUDTRAIL}'
      aws_region: us-east-1
      aws_secret_access_key: '${AWS_SECRET_ACCESS_KEY_CLOUDTRAIL}'
However, I get the following error:
~~~~ Execution Summary ~~~~
RUN_INGEST - {'errors': [],
'exec_id': '9e0f190f-05fd-407c-bdb9-16cebaed1d0c',
'infos': ['2022-09-20 16:24:16.151405 [exec_id=9e0f190f-05fd-407c-bdb9-16cebaed1d0c] INFO: Starting execution for task with name=RUN_INGEST',
'2022-09-20 16:24:18.213813 [exec_id=9e0f190f-05fd-407c-bdb9-16cebaed1d0c] INFO: stdout=Elapsed seconds = 0\n'
' --report-to TEXT Provide an output file to produce a\n'
'This version of datahub supports report-to functionality\n'
'datahub --debug ingest run -c /tmp/datahub/ingest/9e0f190f-05fd-407c-bdb9-16cebaed1d0c/recipe.yml --report-to '
'/tmp/datahub/ingest/9e0f190f-05fd-407c-bdb9-16cebaed1d0c/ingestion_report.json\n'
'[2022-09-20 16:24:17,736] INFO {datahub.cli.ingest_cli:170} - DataHub CLI version: 0.8.43.2\n'
'[2022-09-20 16:24:17,769] INFO {datahub.ingestion.run.pipeline:163} - Sink configured successfully. DataHubRestEmitter: configured '
'to talk to <http://datahub-datahub-gms:8080>\n'
"[2022-09-20 16:24:17,770] ERROR {datahub.ingestion.run.pipeline:127} - s3 is disabled; try running: pip install 'acryl-datahub[s3]'\n"
'Traceback (most recent call last):\n'
' File "/usr/local/lib/python3.9/site-packages/datahub/ingestion/api/registry.py", line 85, in _ensure_not_lazy\n'
' plugin_class = import_path(path)\n'
' File "/usr/local/lib/python3.9/site-packages/datahub/ingestion/api/registry.py", line 32, in import_path\n'
' item = importlib.import_module(module_name)\n'
' File "/usr/local/lib/python3.9/importlib/__init__.py", line 127, in import_module\n'
' return _bootstrap._gcd_import(name[level:], package, level)\n'
' File "<frozen importlib._bootstrap>", line 1030, in _gcd_import\n'
' File "<frozen importlib._bootstrap>", line 1007, in _find_and_load\n'
' File "<frozen importlib._bootstrap>", line 986, in _find_and_load_unlocked\n'
' File "<frozen importlib._bootstrap>", line 680, in _load_unlocked\n'
' File "<frozen importlib._bootstrap_external>", line 850, in exec_module\n'
' File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed\n'
' File "/usr/local/lib/python3.9/site-packages/datahub/ingestion/source/s3/__init__.py", line 1, in <module>\n'
' from datahub.ingestion.source.s3.source import S3Source\n'
' File "/usr/local/lib/python3.9/site-packages/datahub/ingestion/source/s3/source.py", line 10, in <module>\n'
' import pydeequ\n'
"ModuleNotFoundError: No module named 'pydeequ'\n"
'\n'
'The above exception was the direct cause of the following exception:\n'
'\n'
'Traceback (most recent call last):\n'
' File "/usr/local/lib/python3.9/site-packages/datahub/ingestion/run/pipeline.py", line 172, in __init__\n'
' source_class = source_registry.get(source_type)\n'
' File "/usr/local/lib/python3.9/site-packages/datahub/ingestion/api/registry.py", line 127, in get\n'
' raise ConfigurationError(\n'
"datahub.configuration.common.ConfigurationError: s3 is disabled; try running: pip install 'acryl-datahub[s3]'\n"
'[2022-09-20 16:24:17,773] INFO {datahub.cli.ingest_cli:119} - Starting metadata ingestion\n'
'[2022-09-20 16:24:17,774] INFO {datahub.cli.ingest_cli:137} - Finished metadata ingestion\n'
"[2022-09-20 16:24:17,919] ERROR {datahub.entrypoints:188} - Command failed with 'Pipeline' object has no attribute 'source'. Run with "
'--debug to get full trace\n'
'[2022-09-20 16:24:17,920] INFO {datahub.entrypoints:191} - DataHub CLI version: 0.8.43.2 at '
'/usr/local/lib/python3.9/site-packages/datahub/__init__.py\n',
"2022-09-20 16:24:18.214118 [exec_id=9e0f190f-05fd-407c-bdb9-16cebaed1d0c] INFO: Failed to execute 'datahub ingest'",
'2022-09-20 16:24:18.214380 [exec_id=9e0f190f-05fd-407c-bdb9-16cebaed1d0c] INFO: Caught exception EXECUTING '
'task_id=9e0f190f-05fd-407c-bdb9-16cebaed1d0c, name=RUN_INGEST, stacktrace=Traceback (most recent call last):\n'
' File "/usr/local/lib/python3.9/site-packages/acryl/executor/execution/default_executor.py", line 122, in execute_task\n'
' self.event_loop.run_until_complete(task_future)\n'
' File "/usr/local/lib/python3.9/site-packages/nest_asyncio.py", line 89, in run_until_complete\n'
' return f.result()\n'
' File "/usr/local/lib/python3.9/asyncio/futures.py", line 201, in result\n'
' raise self._exception\n'
' File "/usr/local/lib/python3.9/asyncio/tasks.py", line 256, in __step\n'
' result = coro.send(None)\n'
' File "/usr/local/lib/python3.9/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 142, in execute\n'
' raise TaskError("Failed to execute \'datahub ingest\'")\n'
"acryl.executor.execution.task.TaskError: Failed to execute 'datahub ingest'\n"]}
Execution finished with errors.
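The root cause in that trace is the ModuleNotFoundError for pydeequ, which is why the registry reports the s3 source as disabled; a quick check of the environment that runs the recipe, with the fix taken from the log itself:

import importlib.util

# The s3 source imports pydeequ at module load time (see source.py line 10
# in the traceback); if this prints None, the plugin's dependencies are
# missing and DataHub reports "s3 is disabled".
print(importlib.util.find_spec("pydeequ"))
# Fix, as the log suggests: pip install 'acryl-datahub[s3]'

Separately, note that the second path_specs entry in the recipe looks like it was meant to be a property of the first one, and enable_compresion would normally be spelled enable_compression.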
cool-vr-73109  09/20/2022, 4:12 PM
creamy-pizza-80433  09/20/2022, 7:04 AM
lemon-cat-72045  09/21/2022, 6:38 AM
dry-hair-98162  09/21/2022, 7:28 AM
magnificent-lock-58916  09/21/2022, 10:24 AM
fancy-alligator-33404  09/21/2022, 1:57 PM
delightful-barista-90363  09/21/2022, 2:26 PM
silly-finland-62382  09/21/2022, 4:34 PM