red-pizza-28006
05/19/2022, 2:30 PM
[2022-05-19 16:27:27,391] WARNING {datahub.ingestion.source.confluent_schema_registry:47} - Failed to get subjects from schema registry: ('Connection aborted.', BadStatusLine('\x15\x03\x03\x00\x02\x02P'))
Looking at the documentation, I understand that I need an implementation of KafkaSchemaRegistryBase,
but where do I implement and deploy this?
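For reference, the usual pattern is to write the subclass in Python, pip-install it into the same environment that runs the ingestion, and reference it by import path from the recipe. A minimal sketch, assuming the kafka source's schema_registry_class config key (the key and the base-class interface can differ between versions, and my_company.ingestion.MyCustomSchemaRegistry is a hypothetical import path):

source:
  type: "kafka"
  config:
    connection:
      bootstrap: "broker:9092"
    # hypothetical import path; the package containing this subclass of
    # KafkaSchemaRegistryBase must be pip-installed wherever `datahub ingest` runs
    schema_registry_class: "my_company.ingestion.MyCustomSchemaRegistry"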
brash-sundown-77702
05/19/2022, 2:33 PM
brash-sundown-77702
05/19/2022, 2:36 PM
some-shoe-34751
05/19/2022, 2:55 PM
dbt ingestion -
we've ingested the dbt JSON files into DataHub, but it has also ingested some of the upstream Glue
schemas as dbt datasets. Ideally it should just link to the existing Glue schema metadata.
This has caused duplicates in the datasets. Is there a way to dedupe, or to skip the upstream schema ingestion for dbt?
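One hedged option: some versions of the dbt source support a node_type_pattern filter, and denying "source" nodes should stop dbt from re-emitting upstream (e.g. Glue-backed) schemas as dbt datasets. A sketch, where the paths and target_platform are placeholders and the key name should be checked against your CLI version's dbt source docs:

source:
  type: dbt
  config:
    manifest_path: ./target/manifest.json    # placeholder paths
    catalog_path: ./target/catalog.json
    target_platform: glue
    node_type_pattern:
      deny:
        - source    # skip dbt "source" nodes that mirror existing Glue tables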
chilly-gpu-46080
05/20/2022, 5:36 AM
Using v3 (Batch Request) API
Calculating Metrics: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████| 35/35 [00:01<00:00, 24.01it/s]
Validation succeeded!
Suite Name Status Expectations met
- incident_data_table_expectations ✔ Passed 9 of 9 (100.0 %)
and I'm certain that the results are being pushed correctly, since when I try to push them without a token I get a 401 error when running the GE checkpoint.
I've added this configuration to my checkpoint:
- name: datahub_action
  action:
    module_name: datahub.integrations.great_expectations.action
    class_name: DataHubValidationAction
    server_url: 'http://host_name:8080'
    token: 'really_long_token'
Any help will be greatly appreciated!
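For comparison, a sketch of the same action block with the token supplied through Great Expectations' variable substitution instead of hard-coded (assumes a DATAHUB_TOKEN entry in config_variables.yml or the shell environment; substitution support depends on your GE version):

- name: datahub_action
  action:
    module_name: datahub.integrations.great_expectations.action
    class_name: DataHubValidationAction
    server_url: http://host_name:8080
    token: ${DATAHUB_TOKEN}    # substituted by GE at runtime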
nice-mechanic-83147
05/20/2022, 6:57 AM
miniature-sandwich-75434
05/20/2022, 7:08 AM
red-pizza-28006
05/20/2022, 11:49 AM
[2022-05-20 13:39:39,780] WARNING {datahub.ingestion.source.confluent_schema_registry:47} - Failed to get subjects from schema registry: HTTPSConnectionPool(host='endpoint', port=8081): Max retries exceeded with url: /subjects (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate in certificate chain (_ssl.c:1125)')))
whereas when I run this, it works fine:
curl --cacert cert.pem <endpoint>:8081/subjects
My config looks like this:
source:
  type: "kafka"
  config:
    connection:
      bootstrap: "<>"
      consumer_config:
        sasl.mechanism: "PLAIN"
        sasl.username: "<>"
        sasl.password: "<>"
        security.protocol: "sasl_ssl"
      schema_registry_url: "https://<>:8081"
      schema_registry_config:
        basic.auth.user.info: <>:<>
        ssl.ca.location: "cert.pem"
# see https://datahubproject.io/docs/metadata-ingestion/sink_docs/datahub for complete documentation
sink:
  type: "datahub-rest"
  config:
    server: "http://gms-datacatalog.data-ing-prod-eks-eu-west-1.sam-app.ro"
Any ideas?
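Two hedged things worth checking, since curl honors --cacert but the Python-side verification fails: make ssl.ca.location an absolute path rather than a relative one, and if the registry REST calls go through the requests library rather than the confluent client, point requests at the same CA bundle before running the ingestion (an assumption; behavior varies by version):

export REQUESTS_CA_BUNDLE=/full/path/to/cert.pem    # assumption: registry calls use requests
datahub ingest -c recipe.yaml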
salmon-angle-92685
05/20/2022, 12:53 PM
nice-mechanic-83147
05/20/2022, 5:36 PM
steep-thailand-61363
05/22/2022, 11:42 AM
astonishing-dusk-99990
05/23/2022, 4:40 AM
source:
  type: superset
  config:
    connect_uri: 'http://xx.xx.xx.xx:8088'
    username: admin
    password: admin
    provider: db
sink:
  type: datahub-rest
  config:
    server: 'http://datahub-gms:8080'
But I'm getting an error on this:
---- (full traceback above) ----
'File "/tmp/datahub/ingest/venv-2bd5bc52-6ba1-45c3-b58a-830ae5dd0254/lib/python3.9/site-packages/datahub/cli/ingest_cli.py", line 103, in '
'run\n'
' pipeline = Pipeline.create(pipeline_config, dry_run, preview, preview_workunits)\n'
'File "/tmp/datahub/ingest/venv-2bd5bc52-6ba1-45c3-b58a-830ae5dd0254/lib/python3.9/site-packages/datahub/ingestion/run/pipeline.py", line '
'203, in create\n'
' return cls(\n'
'File "/tmp/datahub/ingest/venv-2bd5bc52-6ba1-45c3-b58a-830ae5dd0254/lib/python3.9/site-packages/datahub/ingestion/run/pipeline.py", line '
'151, in __init__\n'
' self.source: Source = source_class.create(\n'
'File "/tmp/datahub/ingest/venv-2bd5bc52-6ba1-45c3-b58a-830ae5dd0254/lib/python3.9/site-packages/datahub/ingestion/source/superset.py", '
'line 157, in create\n'
' return cls(ctx, config)\n'
'File "/tmp/datahub/ingest/venv-2bd5bc52-6ba1-45c3-b58a-830ae5dd0254/lib/python3.9/site-packages/datahub/ingestion/source/superset.py", '
'line 137, in __init__\n'
' self.access_token = login_response.json()["access_token"]\n'
'\n'
"KeyError: 'access_token'\n"
'[2022-05-23 04:34:04,688] INFO {datahub.entrypoints:176} - DataHub CLI version: 0.8.34.1 at '
'/tmp/datahub/ingest/venv-2bd5bc52-6ba1-45c3-b58a-830ae5dd0254/lib/python3.9/site-packages/datahub/__init__.py\n'
'[2022-05-23 04:34:04,688] INFO {datahub.entrypoints:179} - Python version: 3.9.9 (main, Dec 21 2021, 10:03:34) \n'
'[GCC 10.2.1 20210110] at /tmp/datahub/ingest/venv-2bd5bc52-6ba1-45c3-b58a-830ae5dd0254/bin/python3 on '
'Linux-4.14.262-200.489.amzn2.x86_64-x86_64-with-glibc2.31\n'
"[2022-05-23 04:34:04,688] INFO {datahub.entrypoints:182} - GMS config {'models': {}, 'versions': {'linkedin/datahub': {'version': "
"'v0.8.35', 'commit': 'f0756460483e84a121410ad16d7acf6f34986978'}}, 'managedIngestion': {'defaultCliVersion': '0.8.34.1', 'enabled': "
"True}, 'statefulIngestionCapable': True, 'supportsImpactAnalysis': False, 'telemetry': {'enabledCli': True, 'enabledIngestion': False}, "
"'datasetUrnNameCasing': False, 'retention': 'true', 'noCode': 'true'}\n",
"2022-05-23 04:34:05.851851 [exec_id=2bd5bc52-6ba1-45c3-b58a-830ae5dd0254] INFO: Failed to execute 'datahub ingest'",
'2022-05-23 04:34:05.852216 [exec_id=2bd5bc52-6ba1-45c3-b58a-830ae5dd0254] INFO: Caught exception EXECUTING '
'task_id=2bd5bc52-6ba1-45c3-b58a-830ae5dd0254, name=RUN_INGEST, stacktrace=Traceback (most recent call last):\n'
' File "/usr/local/lib/python3.9/site-packages/acryl/executor/execution/default_executor.py", line 121, in execute_task\n'
' self.event_loop.run_until_complete(task_future)\n'
' File "/usr/local/lib/python3.9/site-packages/nest_asyncio.py", line 89, in run_until_complete\n'
' return f.result()\n'
' File "/usr/local/lib/python3.9/asyncio/futures.py", line 201, in result\n'
' raise self._exception\n'
' File "/usr/local/lib/python3.9/asyncio/tasks.py", line 256, in __step\n'
' result = coro.send(None)\n'
' File "/usr/local/lib/python3.9/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 115, in execute\n'
' raise TaskError("Failed to execute \'datahub ingest\'")\n'
"acryl.executor.execution.task.TaskError: Failed to execute 'datahub ingest'\n"]}
Execution finished with errors.
Does anyone know how to fix it?
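The KeyError: 'access_token' means Superset's login response carried no token, i.e. the login itself failed before any ingestion started. A quick way to isolate it is to call Superset's standard login endpoint directly with the same credentials (host is the placeholder from the recipe above):

curl -X POST 'http://xx.xx.xx.xx:8088/api/v1/security/login' \
  -H 'Content-Type: application/json' \
  -d '{"username": "admin", "password": "admin", "provider": "db", "refresh": true}'

If the JSON that comes back has no access_token field, the credentials or provider are wrong on the Superset side.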
nutritious-bird-77396
05/23/2022, 2:59 PM
mysterious-lamp-91034
05/23/2022, 6:24 PM
clever-machine-43182
05/24/2022, 5:24 AM
polite-application-51650
05/24/2022, 6:33 AM
rapid-fireman-19686
05/24/2022, 10:07 AM
quick-motorcycle-57957
05/24/2022, 2:12 PM
nutritious-bird-77396
05/24/2022, 4:51 PM
orange-coat-2879
05/24/2022, 9:48 PM
best-umbrella-24804
05/25/2022, 1:15 AM
source:
  type: snowflake-usage
  config:
    host_port: xxxxx.snowflakecomputing.com
    warehouse: DEVELOPER_X_SMALL
    username: DATAHUB_DEV_USER
    password: '${SNOWFLAKE_DEV_PASSWORD}'
    role: DATAHUB_DEV_ACCESS
    top_n_queries: 10
sink:
  type: datahub-rest
  config:
    server: 'xxxxxx'
When running the ingestion, it errors out, and it looks like it's having trouble installing packages. I've attached the full log.
Thanks in advance!
~~~~ Execution Summary ~~~~
RUN_INGEST - {'errors': [],
'exec_id': '897f0639-d673-433a-aab5-76460957d26b',
'infos': ['2022-05-25 00:32:52.050959 [exec_id=897f0639-d673-433a-aab5-76460957d26b] INFO: Starting execution for task with name=RUN_INGEST',
'2022-05-25 00:35:00.631829 [exec_id=897f0639-d673-433a-aab5-76460957d26b] INFO: stdout=Requirement already satisfied: pip in '
'/tmp/datahub/ingest/venv-897f0639-d673-433a-aab5-76460957d26b/lib/python3.9/site-packages (21.2.4)\n'
'Collecting pip\n'
' Using cached pip-22.1.1-py3-none-any.whl (2.1 MB)\n'
'Collecting wheel\n'
' Using cached wheel-0.37.1-py2.py3-none-any.whl (35 kB)\n'
'Requirement already satisfied: setuptools in /tmp/datahub/ingest/venv-897f0639-d673-433a-aab5-76460957d26b/lib/python3.9/site-packages '
'(58.1.0)\n'
'Collecting setuptools\n'
' Using cached setuptools-62.3.2-py3-none-any.whl (1.2 MB)\n'
'Installing collected packages: wheel, setuptools, pip\n'
' Attempting uninstall: setuptools\n'
' Found existing installation: setuptools 58.1.0\n'
' Uninstalling setuptools-58.1.0:\n'
' Successfully uninstalled setuptools-58.1.0\n'
' Attempting uninstall: pip\n'
' Found existing installation: pip 21.2.4\n'
' Uninstalling pip-21.2.4:\n'
' Successfully uninstalled pip-21.2.4\n'
'Successfully installed pip-22.1.1 setuptools-62.3.2 wheel-0.37.1\n'
'Collecting acryl-datahub[datahub-rest,snowflake-usage]==0.8.33\n'
' Using cached acryl_datahub-0.8.33-py3-none-any.whl (756 kB)\n'
'Collecting mixpanel>=4.9.0\n'
' Using cached mixpanel-4.9.0-py2.py3-none-any.whl (8.9 kB)\n'
'Collecting types-termcolor>=1.0.0\n'
' Using cached types_termcolor-1.1.4-py3-none-any.whl (2.1 kB)\n'
'Collecting avro<1.11,>=1.10.2\n'
' Using cached avro-1.10.2-py3-none-any.whl\n'
'Collecting docker\n'
' Using cached docker-5.0.3-py2.py3-none-any.whl (146 kB)\n'
'Collecting markupsafe<=2.0.1,>=1.1.1\n'
' Using cached MarkupSafe-2.0.1-cp39-cp39-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (30 kB)\n'
'Collecting typing-extensions>=3.10.0.2\n'
' Using cached typing_extensions-4.2.0-py3-none-any.whl (24 kB)\n'
'Collecting types-Deprecated\n'
' Using cached types_Deprecated-1.2.8-py3-none-any.whl (3.1 kB)\n'
'Collecting toml>=0.10.0\n'
' Using cached toml-0.10.2-py2.py3-none-any.whl (16 kB)\n'
'Collecting click>=6.0.0\n'
' Using cached click-8.1.3-py3-none-any.whl (96 kB)\n'
'Collecting termcolor>=1.0.0\n'
' Using cached termcolor-1.1.0-py3-none-any.whl\n'
'Collecting python-dateutil>=2.8.0\n'
' Using cached python_dateutil-2.8.2-py2.py3-none-any.whl (247 kB)\n'
'Collecting click-default-group\n'
' Using cached click_default_group-1.2.2-py3-none-any.whl\n'
'Collecting tabulate\n'
' Using cached tabulate-0.8.9-py3-none-any.whl (25 kB)\n'
'Collecting typing-inspect\n'
' Using cached typing_inspect-0.7.1-py3-none-any.whl (8.4 kB)\n'
'Collecting mypy-extensions>=0.4.3\n'
' Using cached mypy_extensions-0.4.3-py2.py3-none-any.whl (4.5 kB)\n'
'Collecting PyYAML\n'
' Using cached PyYAML-6.0-cp39-cp39-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (661 kB)\n'
'Collecting entrypoints\n'
' Using cached entrypoints-0.4-py3-none-any.whl (5.3 kB)\n'
'Collecting psutil>=5.8.0\n'
' Using cached psutil-5.9.1-cp39-cp39-manylinux_2_12_x86_64.manylinux2010_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (281 '
'kB)\n'
'Collecting progressbar2\n'
' Using cached progressbar2-4.0.0-py2.py3-none-any.whl (26 kB)\n'
'Collecting expandvars>=0.6.5\n'
' Using cached expandvars-0.9.0-py3-none-any.whl (6.6 kB)\n'
'Collecting pydantic>=1.5.1\n'
' Using cached pydantic-1.9.1-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.4 MB)\n'
'Collecting stackprinter\n'
' Using cached stackprinter-0.2.6-py3-none-any.whl (28 kB)\n'
'Collecting Deprecated\n'
' Using cached Deprecated-1.2.13-py2.py3-none-any.whl (9.6 kB)\n'
'Collecting avro-gen3==0.7.2\n'
' Using cached avro_gen3-0.7.2-py3-none-any.whl (26 kB)\n'
'Collecting requests\n'
' Using cached requests-2.27.1-py2.py3-none-any.whl (63 kB)\n'
'Collecting sqlalchemy==1.3.24\n'
' Using cached SQLAlchemy-1.3.24-cp39-cp39-manylinux2010_x86_64.whl (1.3 MB)\n'
'Collecting Jinja2<3.1.0\n'
' Using cached Jinja2-3.0.3-py3-none-any.whl (133 kB)\n'
'Collecting more-itertools>=8.12.0\n'
' Downloading more_itertools-8.13.0-py3-none-any.whl (51 kB)\n'
' ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 51.6/51.6 kB 8.3 MB/s eta 0:00:00\n'
'Collecting snowflake-sqlalchemy<=1.2.4\n'
' Using cached snowflake_sqlalchemy-1.2.4-py2.py3-none-any.whl (29 kB)\n'
'Collecting cryptography\n'
' Using cached cryptography-37.0.2-cp36-abi3-manylinux_2_24_x86_64.whl (4.0 MB)\n'
'Collecting greenlet\n'
' Using cached greenlet-1.1.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (153 kB)\n'
'Collecting great-expectations>=0.14.11\n'
' Using cached great_expectations-0.15.6-py3-none-any.whl (5.1 MB)\n'
'Collecting sqlparse\n'
' Downloading sqlparse-0.4.2-py3-none-any.whl (42 kB)\n'
' ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 42.3/42.3 kB 7.8 MB/s eta 0:00:00\n'
'Collecting six\n'
' Using cached six-1.16.0-py2.py3-none-any.whl (11 kB)\n'
'Collecting pytz\n'
' Using cached pytz-2022.1-py2.py3-none-any.whl (503 kB)\n'
'Collecting tzlocal\n'
' Using cached tzlocal-4.2-py3-none-any.whl (19 kB)\n'
'Collecting colorama>=0.4.3\n'
' Using cached colorama-0.4.4-py2.py3-none-any.whl (16 kB)\n'
'Collecting importlib-metadata>=1.7.0\n'
' Using cached importlib_metadata-4.11.4-py3-none-any.whl (18 kB)\n'
'Collecting jsonpatch>=1.22\n'
' Using cached jsonpatch-1.32-py2.py3-none-any.whl (12 kB)\n'
'Collecting pandas>=0.23.0\n'
' Using cached pandas-1.4.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (11.7 MB)\n'
'Collecting nbformat>=5.0\n'
' Using cached nbformat-5.4.0-py3-none-any.whl (73 kB)\n'
'Collecting ruamel.yaml<0.17.18,>=0.16\n'
' Using cached ruamel.yaml-0.17.17-py3-none-any.whl (109 kB)\n'
'Collecting urllib3<1.27,>=1.25.4\n'
' Using cached urllib3-1.26.9-py2.py3-none-any.whl (138 kB)\n'
'Collecting Ipython>=7.16.3\n'
' Using cached ipython-8.3.0-py3-none-any.whl (750 kB)\n'
'Collecting pyparsing<3,>=2.4\n'
' Using cached pyparsing-2.4.7-py2.py3-none-any.whl (67 kB)\n'
'Collecting packaging\n'
' Using cached packaging-21.3-py3-none-any.whl (40 kB)\n'
'Collecting tqdm>=4.59.0\n'
' Using cached tqdm-4.64.0-py2.py3-none-any.whl (78 kB)\n'
'Collecting altair<5,>=4.0.0\n'
' Using cached altair-4.2.0-py3-none-any.whl (812 kB)\n'
'Collecting notebook>=6.4.10\n'
' Using cached notebook-6.4.11-py3-none-any.whl (9.9 MB)\n'
'Collecting cryptography\n'
' Using cached cryptography-36.0.2-cp36-abi3-manylinux_2_24_x86_64.whl (3.6 MB)\n'
'Collecting jsonschema>=2.5.1\n'
' Using cached jsonschema-4.5.1-py3-none-any.whl (72 kB)\n'
'Collecting mistune>=0.8.4\n'
' Using cached mistune-2.0.2-py2.py3-none-any.whl (24 kB)\n'
'Collecting scipy>=0.19.0\n'
'/usr/local/bin/run_ingest.sh: line 16: 424 Killed pip install -r $req_file\n'
'/tmp/datahub/ingest/venv-897f0639-d673-433a-aab5-76460957d26b/bin/python3: No module named datahub\n',
"2022-05-25 00:35:00.632102 [exec_id=897f0639-d673-433a-aab5-76460957d26b] INFO: Failed to execute 'datahub ingest'",
'2022-05-25 00:35:00.632580 [exec_id=897f0639-d673-433a-aab5-76460957d26b] INFO: Caught exception EXECUTING '
'task_id=897f0639-d673-433a-aab5-76460957d26b, name=RUN_INGEST, stacktrace=Traceback (most recent call last):\n'
' File "/usr/local/lib/python3.9/site-packages/acryl/executor/execution/default_executor.py", line 119, in execute_task\n'
' self.event_loop.run_until_complete(task_future)\n'
' File "/usr/local/lib/python3.9/site-packages/nest_asyncio.py", line 81, in run_until_complete\n'
' return f.result()\n'
' File "/usr/local/lib/python3.9/asyncio/futures.py", line 201, in result\n'
' raise self._exception\n'
' File "/usr/local/lib/python3.9/asyncio/tasks.py", line 256, in __step\n'
' result = coro.send(None)\n'
' File "/usr/local/lib/python3.9/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 115, in execute\n'
' raise TaskError("Failed to execute \'datahub ingest\'")\n'
"acryl.executor.execution.task.TaskError: Failed to execute 'datahub ingest'\n"]}
Execution finished with errors.
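The telling line is near the end of that log: run_ingest.sh reports the pip install was Killed, which on Linux typically means the kernel's OOM killer stopped the sandboxed venv build, so the datahub module was never installed. If the executor runs via the Helm chart, raising the actions container's memory is the usual remedy; a sketch with hypothetical value names (check your chart version for the exact keys):

acryl-datahub-actions:
  resources:
    limits:
      memory: 1Gi    # hypothetical figure; size to the ingestion workload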
clever-machine-43182
05/25/2022, 6:47 AM
polite-application-51650
05/25/2022, 8:07 AME0524 21:41:14.098937000 123145790181376 <http://completion_queue.cc:1052]|completion_queue.cc:1052]> Completion queue next failed: {"created":"@1653408674.098904000","description":"Too many open files","errno":24,"file":"src/core/lib/iomgr/wakeup_fd_pipe.cc","file_line":40,"os_error":"Too many open files","syscall":"pipe"}
E0524 21:41:14.100019000 123145571917824 <http://completion_queue.cc:1052]|completion_queue.cc:1052]> Completion queue next failed: {"created":"@1653408674.099962000","description":"Too many open files","errno":24,"file":"src/core/lib/iomgr/wakeup_fd_pipe.cc","file_line":40,"os_error":"Too many open files","syscall":"pipe"}
E0524 21:41:14.100510000 123146495340544 <http://wakeup_fd_pipe.cc:39]|wakeup_fd_pipe.cc:39]> pipe creation failed (24): Too many open files
E0524 21:41:14.100272000 123146478551040 <http://completion_queue.cc:1052]|completion_queue.cc:1052]> Completion queue next failed: {"created":"@1653408674.099987000","description":"Too many open files","errno":24,"file":"src/core/lib/iomgr/wakeup_fd_pipe.cc","file_line":40,"os_error":"Too many open files","syscall":"pipe"}
@dazzling-judge-80093
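errno 24 here means the process ran out of file descriptors rather than anything DataHub-specific; a common mitigation is raising the open-file limit for the shell or container that runs it (the limits below are illustrative):

ulimit -n 65536                                # shell that launches the process
docker run --ulimit nofile=65536:65536 ...     # Docker equivalent, if containerized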
clean-piano-28976
05/25/2022, 10:37 AM
Is there a way to use a curl request to delete all metadata related to a specific platform? In the documentation I only see an example using URNs.
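Not via a single documented curl endpoint as far as I know, but the CLI supports filtered deletes; a sketch assuming a CLI version with the --platform filter (soft-deletes by default; --hard purges):

datahub delete --entity_type dataset --platform <platform>
datahub delete --entity_type dataset --platform <platform> --hard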
steep-painter-66054
05/25/2022, 11:02 AM
handsome-football-66174
05/25/2022, 1:15 PM
salmon-angle-92685
05/25/2022, 1:53 PM
numerous-camera-74294
05/25/2022, 2:17 PM
The DataHub SDK for Python defines the schema field URN as
urn:li:schemaField:(...,fieldName)
and the DataHub SDK for Java defines it as
urn:li:datasetField:(...,fieldName)
Which one is the correct one? I am using v0.8.34 for both SDKs.
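For what it's worth, the Python emitter's builders produce the schemaField form, which suggests that is the canonical one; a quick sanity check, assuming the mce_builder helpers available around v0.8.x:

from datahub.emitter.mce_builder import make_dataset_urn, make_schema_field_urn

dataset_urn = make_dataset_urn(platform="hive", name="db.table", env="PROD")
field_urn = make_schema_field_urn(dataset_urn, "fieldName")
# expected: urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:hive,db.table,PROD),fieldName)
print(field_urn)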
echoing-alligator-70530
05/25/2022, 3:38 PM
billions-twilight-48559
05/25/2022, 8:44 PM