millions-raincoat-77437
08/22/2022, 1:27 PM
melodic-monitor-75886
08/22/2022, 5:08 PM
[2022-08-22 17:00:35,324] ERROR {datahub.ingestion.run.pipeline:127} - The "dnspython" module must be installed to use mongodb+srv:// URIs. To fix this error install pymongo with the srv extra:
    /tmp/datahub/ingest/venv-1481877f-1fce-4dc3-888e-1d27fe819844/bin/python3 -m pip install "pymongo[srv]"
Has anyone encountered this and resolved it?
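The fix is the one the log prints: install pymongo with the srv extra (which pulls in dnspython) into the same venv the ingestion runs from, i.e. the /tmp/datahub/ingest/venv-... path above. A minimal sketch of what the extra enables, assuming pymongo[srv] is installed and with a purely illustrative cluster address:

# mongodb+srv:// URIs require a DNS SRV lookup, which pymongo delegates
# to dnspython; without it, MongoClient raises the ConfigurationError above.
from pymongo import MongoClient

client = MongoClient("mongodb+srv://user:pass@cluster0.example.mongodb.net/admin")
print(client.server_info()["version"])  # simple round-trip to confirm connectivity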
straight-agent-79732
08/21/2022, 6:36 AM
proud-cpu-75817
08/22/2022, 10:12 PM
gray-airplane-39227
08/22/2022, 10:50 PM
Entities and Timeline. I’m wondering, other than the CLI and UI, whether there are any other ways to ingest data.
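Besides the CLI and UI, metadata can be pushed programmatically: DataHub exposes a REST API, and the Python and Java emitters wrap it (a Kafka emitter also exists). A minimal sketch with the Python REST emitter, where the GMS URL and dataset name are illustrative:

# Emits a single aspect (a dataset description) straight to GMS over REST.
from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import ChangeTypeClass, DatasetPropertiesClass

emitter = DatahubRestEmitter("http://localhost:8080")
mcp = MetadataChangeProposalWrapper(
    entityType="dataset",
    changeType=ChangeTypeClass.UPSERT,
    entityUrn=make_dataset_urn("hive", "db.example_table", "PROD"),
    aspectName="datasetProperties",
    aspect=DatasetPropertiesClass(description="Pushed via the Python emitter"),
)
emitter.emit_mcp(mcp)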
bland-orange-13353
08/23/2022, 4:43 AM
busy-glass-61431
08/23/2022, 5:54 AM
alert-fall-82501
08/23/2022, 6:31 AM
microscopic-mechanic-13766
08/23/2022, 7:50 AM
Broken DAG: [/opt/airflow/dags/pruebaDH.py] Traceback (most recent call last):
File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
File "/opt/airflow/dags/pruebaDH.py", line 10, in <module>
from datahub_provider.entities import Dataset
ModuleNotFoundError: No module named 'datahub_provider'
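datahub_provider is not bundled with Airflow itself; around this version it ships with the DataHub package's airflow extra, so the import only resolves after something like pip install 'acryl-datahub[airflow]' in the Airflow environment. A minimal sketch of the import plus the lineage entities it provides (dataset names are made up):

# Requires acryl-datahub[airflow] in the Airflow venv; otherwise this import
# fails with exactly the ModuleNotFoundError above.
from datahub_provider.entities import Dataset

# Hypothetical inlets/outlets for a task, consumed by the DataHub Airflow
# lineage backend to record table-level lineage.
inlets = [Dataset("postgres", "my_db.public.source_table")]
outlets = [Dataset("postgres", "my_db.public.target_table")]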
square-solstice-69079
08/23/2022, 9:43 AM
colossal-hairdresser-6799
08/23/2022, 11:39 AM
Ingesting metadata
BigQuery labels
Hello channel!
For my current assignment we have 100k+ tables that we would like to ingest into DataHub.
For all the tables we want to retrieve the information contained in their BigQuery labels and add it as metadata in DataHub.
What’s a feasible way of achieving this?
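One approach that scales to 100k+ tables is to fetch the labels from the BigQuery API yourself and push them as DataHub tags through the Python emitter; the sketch below assumes that split, and every URL, table name, and label in it is illustrative (newer bigquery source versions may also offer built-in label handling worth checking first):

from datahub.emitter.mce_builder import make_dataset_urn, make_tag_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    ChangeTypeClass,
    GlobalTagsClass,
    TagAssociationClass,
)

emitter = DatahubRestEmitter("http://localhost:8080")  # illustrative GMS URL

# Labels as fetched from the BigQuery tables API (hypothetical values).
labels = {"team": "analytics", "tier": "gold"}

# Attach one tag per label key:value pair to the dataset's globalTags aspect.
mcp = MetadataChangeProposalWrapper(
    entityType="dataset",
    changeType=ChangeTypeClass.UPSERT,
    entityUrn=make_dataset_urn("bigquery", "my-project.my_dataset.my_table", "PROD"),
    aspectName="globalTags",
    aspect=GlobalTagsClass(
        tags=[TagAssociationClass(tag=make_tag_urn(f"{k}:{v}")) for k, v in labels.items()]
    ),
)
emitter.emit_mcp(mcp)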
bland-orange-13353
08/23/2022, 11:58 AM
alert-fall-82501
08/23/2022, 2:11 PM
sparse-forest-98608
08/23/2022, 2:36 PM
sparse-forest-98608
08/23/2022, 2:36 PM
great-cpu-77172
08/23/2022, 3:58 PM
spark = SparkSession.builder \
    .master("local") \
    .appName("datahub-lineage") \
    .config("spark.jars", "postgresql-42.2.14.jar") \
    .config("spark.jars.packages", "io.acryl:datahub-spark-lineage:0.8.24") \
    .config("spark.extraListeners", "datahub.spark.DatahubSparkListener") \
    .config("spark.datahub.rest.server", "http://localhost:8080") \
    .getOrCreate()

flight_details.write \
    .mode("append") \
    .format("jdbc") \
    .option("url", "jdbc:postgresql://localhost:5432/my_database") \
    .option("user", "postgres") \
    .option("password", "password123") \
    .option("driver", "org.postgresql.Driver") \
    .option("dbtable", "flight_details") \
    .save()
little-breakfast-38102
08/23/2022, 5:15 PM
calm-balloon-31412
08/23/2022, 5:34 PM
cool-actor-73767
08/23/2022, 9:50 PM
elegant-article-21703
08/24/2022, 9:20 AM
[2022-08-24 10:13:04,025] INFO {datahub.cli.ingest_cli:99} - DataHub CLI version: 0.8.41
[2022-08-24 10:13:04,691] INFO {datahub.ingestion.run.pipeline:160} - Sink configured successfully. DataHubRestEmitter: configured to talk to https://apitest.project.com/project-gms-test/
[2022-08-24 10:13:04,691] INFO {datahub.cli.ingest_cli:115} - Starting metadata ingestion
[2022-08-24 10:13:04,737] ERROR {datahub.ingestion.run.pipeline:110} - failed to write record with workunit file://./datahub-cli/recipes/users.json:0 with ('Unable to emit metadata to DataHub GMS', {'statusCode': 404, 'message': 'Resource not found'}) and info {'statusCode': 404, 'message': 'Resource not found'}
[2022-08-24 10:13:04,771] ERROR {datahub.ingestion.run.pipeline:110} - failed to write record with workunit file://./datahub-cli/recipes/users.json:1 with ('Unable to emit metadata to DataHub GMS', {'statusCode': 404, 'message': 'Resource not found'}) and info {'statusCode': 404, 'message': 'Resource not found'}
[2022-08-24 10:13:04,772] INFO {datahub.cli.ingest_cli:133} - Finished metadata pipeline
Source (file) report:
{'workunits_produced': 2,
'workunit_ids': ['file://./datahub-cli/recipes/users.json:0', 'file://./datahub-cli/recipes/users.json:1'],
'warnings': {},
'failures': {},
'cli_version': '0.8.41',
'cli_entry_location': '/home/0_GDP/datahub/venv/lib/python3.8/site-packages/datahub/__init__.py',
'py_version': '3.8.10 (default, Jun 22 2022, 20:18:18) \n[GCC 9.4.0]',
'py_exec_path': '/home/0_GDP/datahub/venv/bin/python',
'os_details': 'Linux-5.15.0-46-generic-x86_64-with-glibc2.29'}
Sink (datahub-rest) report:
{'records_written': 0,
'warnings': [],
'failures': [{'error': 'Unable to emit metadata to DataHub GMS', 'info': {'statusCode': 404, 'message': 'Resource not found'}},
{'error': 'Unable to emit metadata to DataHub GMS', 'info': {'statusCode': 404, 'message': 'Resource not found'}}],
'downstream_start_time': None,
'downstream_end_time': None,
'downstream_total_latency_in_seconds': None,
'gms_version': 'v0.8.41'}
Pipeline finished with 0 failures in source producing 2 workunits
And the recipe we are using is the following:
source:
  type: file
  config:
    # Coordinates
    filename: "./datahub-cli/recipes/users.json"

sink:
  type: "datahub-rest"
  config:
    server: "https://apitest.project.com/project-gms-test/"
    extra_headers:
      accept: "*/*"
      accept-language: "en-US,en;q=0.9"
      authorization: "Bearer eyJ0eXAiOiJKV1QiLCJhbGciOiJSUzI1NiIsIng1dCI6IjJaUXBKM1VwYmpBWVhZR2FYRUpsOGxWMFRPSSIsImtpZCI6IjJaUXBKM1VwYmpBWVhZR2FYRUpsOGxWMFRPSSJ9.eyJhdWQiO"
      cache-control: "no-cache"
      content-type: "application/json"
      ocp-apim-subscription-key: "61cf44e0696d"
      sec-fetch-dest: "empty"
      sec-fetch-mode: "cors"
      sec-fetch-site: "cross-site"
In Swagger, we have configured the server so that it points at the GMS URL. Is there something we are missing here?
Thank you all in advance!
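A 404 with "Resource not found" from the sink usually means the request path is wrong rather than auth failing: depending on how GMS sits behind the gateway, the server value may need to drop the trailing slash or gain a suffix such as /api/gms (both are assumptions to verify against your ingress). A quick probe from the same venv, reusing the recipe's values:

# test_connection() hits the GMS /config endpoint and raises if it is
# not reachable at this base path.
from datahub.emitter.rest_emitter import DatahubRestEmitter

emitter = DatahubRestEmitter(
    "https://apitest.project.com/project-gms-test",  # also try .../api/gms
    extra_headers={"ocp-apim-subscription-key": "61cf44e0696d"},
)
emitter.test_connection()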
great-account-95406
08/24/2022, 9:57 AM
Succeeded. Getting this error:
/usr/local/bin/run_ingest.sh: line 40: 1085 Killed ( datahub ingest run -c "${recipe_file}" ${report_option} )
2022-08-24 09:54:45.857765 [exec_id=dda4a3b4-d54c-4764-a22e-55ff65fbb940] INFO: Failed to execute 'datahub ingest'
2022-08-24 09:54:45.858000 [exec_id=dda4a3b4-d54c-4764-a22e-55ff65fbb940] INFO: Caught exception EXECUTING task_id=dda4a3b4-d54c-4764-a22e-55ff65fbb940, name=RUN_INGEST, stacktrace=Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/acryl/executor/execution/default_executor.py", line 122, in execute_task
    self.event_loop.run_until_complete(task_future)
  File "/usr/local/lib/python3.9/site-packages/nest_asyncio.py", line 89, in run_until_complete
    return f.result()
  File "/usr/local/lib/python3.9/asyncio/futures.py", line 201, in result
    raise self._exception
  File "/usr/local/lib/python3.9/asyncio/tasks.py", line 256, in __step
    result = coro.send(None)
  File "/usr/local/lib/python3.9/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 142, in execute
    raise TaskError("Failed to execute 'datahub ingest'")
acryl.executor.execution.task.TaskError: Failed to execute 'datahub ingest'
Execution finished with errors.
Is this the expected behavior?
few-grass-66826
08/24/2022, 12:38 PM
late-bear-87552
08/24/2022, 12:48 PM
val taskMCPW = dataHubRestEmitter.addTaskRunToDataHub(
  "dataProcessInstance",
  "urn:li:dataProcessInstance:(urn:li:dataJob:(urn:li:dataFlow:(spark,gobbler-ingestion-applicationId-1,PROD),dp.groww_staging_22803.gobbler_3.test3_2),avc)")

def addTaskRunToDataHub(entityType: String, urn: String): MetadataChangeProposalWrapper = {
  MetadataChangeProposalWrapper.builder()
    .entityType(entityType)
    .entityUrn(urn)
    .upsert()
    .aspect(new DataProcessInstanceRunEvent()
      .setMessageId("test-1")
      .setStatus(DataProcessRunStatus.COMPLETE)
      .setTimestampMillis(Instant.now.toEpochMilli)) // millis, not epoch seconds
    .build()
}
Failed to validate entity URN urn:li:dataProcessInstance:(urn:li:dataJob:(urn:li:dataFlow:(spark,gobbler-ingestion-applicationId-1,PROD),dp.groww_staging_22803.gobbler_3.test3_2),avc)
	at com.linkedin.metadata.utils.EntityKeyUtils.getUrnFromProposal(EntityKeyUtils.java:33)
	at com.linkedin.metadata.resources.entity.AspectUtils.getAdditionalChanges(AspectUtils.java:33)
	at com.linkedin.metadata.resources.entity.AspectResource.ingestProposal(AspectResource.java:131)
	at sun.reflect.GeneratedMethodAccessor233.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at com.linkedin.restli.internal.server.RestLiMethodInvoker.doInvoke(RestLiMethodInvoker.java:177)
	... 81 more
Caused by: java.lang.IllegalArgumentException: Failed to convert urn to entity key: urns parts and key fields do not have same length
	at com.linkedin.metadata.utils.EntityKeyUtils.convertUrnToEntityKey(EntityKeyUtils.java:97)
	at com.linkedin.metadata.utils.EntityKeyUtils.getUrnFromProposal(EntityKeyUtils.java:31)
	... 87 more
"message": "INTERNAL SERVER ERROR", "status": 500 (HTTP/1.1 500 Server Error, Content-Type: application/json, Content-Length: 9066)
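The Caused by line is the key: a dataProcessInstance key has a single id field, so its URN must be a flat identifier like urn:li:dataProcessInstance:<id>, not a dataJob URN nested in parentheses. A sketch of an emit that should pass validation, written with the Python emitter and an invented run id (the Java builder takes the same flat URN):

import time

from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    ChangeTypeClass,
    DataProcessInstanceRunEventClass,
    DataProcessRunStatusClass,
)

# Flat id; the job/flow context can be linked through other aspects rather
# than being packed into the URN itself.
run_urn = "urn:li:dataProcessInstance:gobbler_3.test3_2-run-1"

mcp = MetadataChangeProposalWrapper(
    entityType="dataProcessInstance",
    changeType=ChangeTypeClass.UPSERT,
    entityUrn=run_urn,
    aspectName="dataProcessInstanceRunEvent",
    aspect=DataProcessInstanceRunEventClass(
        timestampMillis=int(time.time() * 1000),  # millis, not epoch seconds
        status=DataProcessRunStatusClass.COMPLETE,
    ),
)
DatahubRestEmitter("http://localhost:8080").emit_mcp(mcp)  # illustrative GMS URL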
aloof-ram-72401
08/24/2022, 2:24 PM
silly-finland-62382
08/24/2022, 2:45 PM
When I run Spark lineage locally using the Python code below, I am getting an error:
silly-finland-62382
08/24/2022, 2:46 PM
code:
spark = SparkSession.builder \
    .master("local[1]") \
    .appName("Main") \
    .config("spark.sql.warehouse.dir", "/tmp/data") \
    .config("spark.jars.packages", "io.acryl:datahub-spark-lineage:0.8.43") \
    .config("spark.extraListeners", "datahub.spark.DatahubSparkListener") \
    .config("spark.datahub.rest.server", "http://172.31.18.133:8080") \
    .config("spark.datahub.metadata.dataset.platformInstance", "dataset") \
    .config("spark.datahub.rest.token", "eyJhbGciOiJIUzI1NiJ9.eyJhY3RvclR5cGUiOiJVU0VSIiwiYWN0b3JJZCI6Im1vaGl0LmdhcmciLCJ0eXBlIjoiUEVSU09OQUwiLCJ2ZXJzaW9uIjoiMiIsImV4cCI6MTY2MzkxOTkzOSwianRpIjoiMjk2Y2E3MGUtMjA2My00ODM0LTkwNmYtMGIzZjRjMTVlY2RhIiwic3ViIjoibW9oaXQuZ2FyZyIsImlzcyI6ImRhdGFodWItbWV0YWRhdGEtc2VydmljZSJ9.tr2mu_FueVfHKz9Ze2BWmN4dqhOrTwR1t_WrfxspOmY") \
    .enableHiveSupport() \
    .getOrCreate()
silly-finland-62382
08/24/2022, 2:46 PM
/Users/nishchayagarwal/IdeaProjects/python-venv/bin/python /Users/nishchayagarwal/IdeaProjects/prism-catalog/lineage/staging/datahub-spark.py
Ivy Default Cache set to: /Users/nishchayagarwal/.ivy2/cache
The jars for the packages stored in: /Users/nishchayagarwal/.ivy2/jars
:: loading settings :: url = jar:file:/usr/local/lib/python3.9/site-packages/pyspark/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
io.acryl#datahub-spark-lineage added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-1a425c26-0bdd-4fa2-82e7-2e79de959dae;1.0
confs: [default]
found io.acryl#datahub-spark-lineage;0.8.43 in central
:: resolution report :: resolve 236ms :: artifacts dl 3ms
:: modules in use:
io.acryl#datahub-spark-lineage;0.8.43 from central in [default]
---------------------------------------------------------------------
| | modules || artifacts |
| conf | number| search|dwnlded|evicted|| number|dwnlded|
---------------------------------------------------------------------
| default | 1 | 0 | 0 | 0 || 1 | 0 |
---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent-1a425c26-0bdd-4fa2-82e7-2e79de959dae
confs: [default]
0 artifacts copied, 1 already retrieved (0kB/5ms)
22/08/24 20:14:13 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/08/24 20:14:19 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
Process finished with exit code 0
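One detail worth flagging in this run: the SLF4J NOP-binding warning means every log line from DatahubSparkListener is silently discarded, so if the listener fails to reach GMS there is nothing to see. A sketch of making its logs visible by adding an SLF4J binding to the driver classpath (slf4j-simple and its version are assumptions, not something the DataHub docs prescribe):

from pyspark.sql import SparkSession

# Same builder as above, plus an SLF4J binding so the listener's own
# output (including emit failures) shows up on the console.
spark = SparkSession.builder \
    .master("local[1]") \
    .config("spark.jars.packages",
            "io.acryl:datahub-spark-lineage:0.8.43,org.slf4j:slf4j-simple:1.7.36") \
    .config("spark.extraListeners", "datahub.spark.DatahubSparkListener") \
    .config("spark.datahub.rest.server", "http://172.31.18.133:8080") \
    .getOrCreate()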
silly-finland-62382
08/24/2022, 2:47 PM