# ingestion
  • m

    millions-raincoat-77437

    08/22/2022, 1:27 PM
Hi folks, I'm trying to add data lineage between Glue and S3. Is it possible to link these tools automatically or through a YAML file? If yes, please tell me how. (A hedged recipe sketch follows below this message.)
    h
    m
    • 3
    • 8
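A minimal recipe sketch for the question above, assuming the Glue source's `emit_s3_lineage` option is what links Glue tables to their S3 locations automatically; the region and server values are placeholders:

```yaml
source:
  type: glue
  config:
    aws_region: "us-east-1"          # placeholder region
    emit_s3_lineage: true            # ask the Glue source to emit lineage to the backing S3 locations
sink:
  type: datahub-rest
  config:
    server: "http://localhost:8080"  # placeholder GMS endpoint
```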
  • m

    melodic-monitor-75886

    08/22/2022, 5:08 PM
Hey folks, I am trying to connect to a MongoDB Atlas instance for ingestion using the GUI, and I'm getting this error:
    Copy code
    [2022-08-22 17:00:35,324] ERROR    {datahub.ingestion.run.pipeline:127} - The "dnspython" module must be installed to use mongodb+srv:// '
               'URIs. To fix this error install pymongo with the srv extra:\n'
               ' /tmp/datahub/ingest/venv-1481877f-1fce-4dc3-888e-1d27fe819844/bin/python3 -m pip install "pymongo[srv]"\n'
    Has anyone encountered this and resolved it?
    h
    g
    • 3
    • 3
  • s

    straight-agent-79732

    08/21/2022, 6:36 AM
Hi, for the datahub-business-glossary recipe: where does DataHub pick the file up from? Is it the machine running the browser, or the machine hosting DataHub? I tried both and neither seems to work; attaching a reference image. Can someone help us here? (See the recipe sketch below this message.)
    g
    • 2
    • 1
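For reference, a minimal business-glossary recipe sketch; the file path is resolved on the machine (or executor/actions container) where the ingestion actually runs, not on the machine running the browser. The path and server values are placeholders:

```yaml
source:
  type: datahub-business-glossary
  config:
    # Resolved relative to where `datahub ingest` runs (the CLI host, or the
    # actions/executor container for UI-based ingestion), not the browser machine.
    file: ./business_glossary.yml     # placeholder path
sink:
  type: datahub-rest
  config:
    server: "http://localhost:8080"   # placeholder GMS endpoint
```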
  • p

    proud-cpu-75817

    08/22/2022, 10:12 PM
    Just opened my first issue on the DataHub project 🙂 https://github.com/datahub-project/datahub/issues/5706
    teamwork 1
    b
    • 2
    • 3
  • g

    gray-airplane-39227

    08/22/2022, 10:50 PM
Hello folks, I'm wondering whether the DataHub OpenAPI supports ingestion; from the docs it seems it only deals with Entities and Timeline. Other than the CLI and UI, are there any other ways to ingest data? (See the sketch below this message.)
    b
    b
    +2
    • 5
    • 14
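Besides the CLI and the UI, metadata can also be pushed programmatically with the Python/Java emitters, or a recipe can write through the Kafka sink instead of REST. A hedged sketch of the Kafka-sink variant, with placeholder file, broker, and schema-registry values:

```yaml
source:
  type: file
  config:
    filename: "./metadata_events.json"              # placeholder metadata file
sink:
  type: datahub-kafka
  config:
    connection:
      bootstrap: "localhost:9092"                   # placeholder broker
      schema_registry_url: "http://localhost:8081"  # placeholder schema registry
```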
  • b

    bland-orange-13353

    08/23/2022, 4:43 AM
    This message was deleted.
    d
    • 2
    • 1
  • b

    busy-glass-61431

    08/23/2022, 5:54 AM
Has anyone tried the Airflow connector on Airflow v1.10.9? The DataHub documentation says it's supported for Airflow v1.10.15+, but has anyone tested it below that version, or will it simply not work with an Airflow version below that?
    m
    d
    • 3
    • 2
  • a

    alert-fall-82501

    08/23/2022, 6:31 AM
Hi team - I have an S3 delta lake as the source, with a table stored as Parquet files. As the base path I am including "s3://xx.lakehouse.xxx.dev/xxx/data/PartialPayload_daily/date=2021-07-28/*.parquet". With this path I get the whole folder hierarchy on the server side, and the Parquet file name is also turned into a folder. What I want is only the table schema, not the folders. Can anyone suggest how to format the path above to get only the table? (See the path_spec sketch below this message.)
    h
    c
    • 3
    • 4
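A hedged path_spec sketch for the question above, assuming the generic s3/data-lake source (the delta-lake source instead takes a base_path). The `{table}` placeholder tells the source to treat that folder level as the table, so partition folders and files are collapsed under a single dataset; the exact key may be `path_spec` or a `path_specs` list depending on the CLI version:

```yaml
source:
  type: s3
  config:
    path_spec:                  # `path_specs:` (a list) on newer CLI versions
      include: "s3://xx.lakehouse.xxx.dev/xxx/data/{table}/date=*/*.parquet"
    aws_config:
      aws_region: "us-east-1"   # placeholder region
```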
  • m

    microscopic-mechanic-13766

    08/23/2022, 7:50 AM
Good morning team, I am facing an error while trying to test the connection with Airflow. I have followed the steps shown here. My problem is that the DAG cannot be imported into Airflow, even though it is just a copy-paste of the given example. Has anyone faced this error before? Thanks in advance!
    Copy code
    Broken DAG: [/opt/airflow/dags/pruebaDH.py] Traceback (most recent call last):
      File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
      File "/opt/airflow/dags/pruebaDH.py", line 10, in <module>
        from datahub_provider.entities import Dataset
    ModuleNotFoundError: No module named 'datahub_provider'
    d
    • 2
    • 36
  • s

    square-solstice-69079

    08/23/2022, 9:43 AM
What is the status of managed Airflow (MWAA) ingestion? It seems like it is not supported yet, based on this: https://datahubspace.slack.com/archives/CUMUWQU66/p1646928904910199 But is there some kind of workaround? Could someone explain in a bit more detail how to set it up?
    d
    d
    • 3
    • 52
  • c

    colossal-hairdresser-6799

    08/23/2022, 11:39 AM
    Ingesting metadata
    BigQuery labels
Hello channel! For my current assignment we have 100k+ tables that we would like to ingest into DataHub. For all the tables we want to retrieve the information contained in their labels and add it as metadata in DataHub. What's a feasible way of achieving this?
    d
    • 2
    • 4
  • b

    bland-orange-13353

    08/23/2022, 11:58 AM
    This message was deleted.
    h
    • 2
    • 2
  • a

    alert-fall-82501

    08/23/2022, 2:11 PM
Hi team - I have an S3 delta lake as the source, with a table stored as Parquet files. As the base path I am including "s3://xx.lakehouse.xxx.dev/xxx/data/PartialPayload_daily/date=2021-07-28/*.parquet". With this path I get the whole folder hierarchy on the server side, and the Parquet file name is also turned into a folder. What I want is only the table schema, not the folders. Can anyone suggest how to format the path above to get only the table? (edited)
    h
    • 2
    • 1
  • s

    sparse-forest-98608

    08/23/2022, 2:36 PM
Can anyone help with my query?
  • s

    sparse-forest-98608

    08/23/2022, 2:36 PM
I have put a lot of effort into researching this, but I could not ingest a JSON file schema from my local machine into DataHub.
  • g

    great-cpu-77172

    08/23/2022, 3:58 PM
Hi team - I am trying to ingest Spark lineage into my local DataHub; data is read from a CSV file and from Postgres and written to a new table in Postgres, but neither the data nor the lineage is being persisted in DataHub. Any pointers on what could be wrong? I am using Spark 3.3 with a Jupyter notebook.
    Copy code
    spark = SparkSession.builder \
            .master("local") \
            .appName("datahub-lineage") \
            .config("spark.jars", "postgresql-42.2.14.jar") \
            .config("spark.jars.packages", "io.acryl:datahub-spark-lineage:0.8.24") \
            .config("spark.extralisteners", "datahub.spark.DatahubSparkListener") \
            .config("spark.datahub.rest.server", "<http://localhost:8080>") \
            .getOrCreate()
    
    flight_details.write \
        .mode("append") \
        .format("jdbc") \
        .option("url", "jdbc:<postgresql://localhost:5432/my_database>") \
        .option("user", "postgres") \
        .option("password", "password123") \
        .option("driver", "org.postgresql.Driver") \
        .option("dbtable", "flight_details") \
        .save()
    d
    • 2
    • 3
  • l

    little-breakfast-38102

    08/23/2022, 5:15 PM
Hello @incalculable-ocean-74010 / @dazzling-judge-80093, I am using datahub-ingestion-cron to ingest metadata from MSSQL. I am able to run the ingestion successfully after manually editing my CRON job in Lens to add env variables from secrets. When I try to deploy the same change, I run into the error "invalid value in env name". Attaching screenshots from values.yaml and the deployment log. Appreciate any help.
    i
    • 2
    • 8
  • c

    calm-balloon-31412

    08/23/2022, 5:34 PM
👋 Hello! I am trying to write a GraphQL query to get all runs for a set of tasks where one of the custom properties (in my case "execution date") is greater than some date value I pass in the query. Is this possible?
    h
    • 2
    • 2
  • c

    cool-actor-73767

    08/23/2022, 9:50 PM
Hi everyone! I'm using a Glue ingestion process created with the DataHub UI ingestion feature. Recently I realized that some catalog tables aren't loaded. Has anyone come across the same problem, and if so, what is the solution? (See the filter sketch below this message.)
    h
    • 2
    • 4
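One thing worth checking for the question above is whether the recipe's allow/deny filters are excluding the missing tables; a hedged sketch of the relevant Glue source options (the pattern values are placeholders):

```yaml
source:
  type: glue
  config:
    aws_region: "us-east-1"   # placeholder region
    database_pattern:
      allow:
        - ".*"                # placeholder: make sure the missing databases match an allow rule
    table_pattern:
      deny: []                # placeholder: make sure no deny rule drops the missing tables
```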
  • e

    elegant-article-21703

    08/24/2022, 9:20 AM
Hello everyone! In our development environment we are trying to connect to our GMS through an API gateway in Azure. We have loaded the OpenAPI swagger, but once that is done and we test it using a recipe, the answer we receive is the following:
    Copy code
    [2022-08-24 10:13:04,025] INFO     {datahub.cli.ingest_cli:99} - DataHub CLI version: 0.8.41
    [2022-08-24 10:13:04,691] INFO     {datahub.ingestion.run.pipeline:160} - Sink configured successfully. DataHubRestEmitter: configured to talk to <https://apitest.project.com/project-gms-test/>
    [2022-08-24 10:13:04,691] INFO     {datahub.cli.ingest_cli:115} - Starting metadata ingestion
    [2022-08-24 10:13:04,737] ERROR    {datahub.ingestion.run.pipeline:110} - failed to write record with workunit file://./datahub-cli/recipes/users.json:0 with ('Unable to emit metadata to DataHub GMS', {'statusCode': 404, 'message': 'Resource not found'}) and info {'statusCode': 404, 'message': 'Resource not found'}
    [2022-08-24 10:13:04,771] ERROR    {datahub.ingestion.run.pipeline:110} - failed to write record with workunit file://./datahub-cli/recipes/users.json:1 with ('Unable to emit metadata to DataHub GMS', {'statusCode': 404, 'message': 'Resource not found'}) and info {'statusCode': 404, 'message': 'Resource not found'}
    [2022-08-24 10:13:04,772] INFO     {datahub.cli.ingest_cli:133} - Finished metadata pipeline
    
    Source (file) report:
    {'workunits_produced': 2,
     'workunit_ids': ['file://./datahub-cli/recipes/users.json:0', 'file://./datahub-cli/recipes/users.json:1'],
     'warnings': {},
     'failures': {},
     'cli_version': '0.8.41',
     'cli_entry_location': '/home/0_GDP/datahub/venv/lib/python3.8/site-packages/datahub/__init__.py',
     'py_version': '3.8.10 (default, Jun 22 2022, 20:18:18) \n[GCC 9.4.0]',
     'py_exec_path': '/home/0_GDP/datahub/venv/bin/python',
     'os_details': 'Linux-5.15.0-46-generic-x86_64-with-glibc2.29'}
    Sink (datahub-rest) report:
    {'records_written': 0,
     'warnings': [],
     'failures': [{'error': 'Unable to emit metadata to DataHub GMS', 'info': {'statusCode': 404, 'message': 'Resource not found'}},
                  {'error': 'Unable to emit metadata to DataHub GMS', 'info': {'statusCode': 404, 'message': 'Resource not found'}}],
     'downstream_start_time': None,
     'downstream_end_time': None,
     'downstream_total_latency_in_seconds': None,
     'gms_version': 'v0.8.41'}
    
    Pipeline finished with 0 failures in source producing 2 workunits
    And the recipe we are using is the following:
    Copy code
    source:
      type: file
      config:
        # Coordinates
        filename: "./datahub-cli/recipes/users.json"
    
    sink:
      type: "datahub-rest"
      config:
        server: "<https://apitest.project.com/project-gms-test/>"
        extra_headers:
          accept: "*/*"
          accept-language: "en-US,en;q=0.9"
          authorization: "Bearer eyJ0eXAiOiJKV1QiLCJhbGciOiJSUzI1NiIsIng1dCI6IjJaUXBKM1VwYmpBWVhZR2FYRUpsOGxWMFRPSSIsImtpZCI6IjJaUXBKM1VwYmpBWVhZR2FYRUpsOGxWMFRPSSJ9.eyJhdWQiO"
          cache-control: "no-cache"
          content-type: "application/json"
          ocp-apim-subscription-key: "61cf44e0696d"
          sec-fetch-dest: "empty"
          sec-fetch-mode: "cors"
          sec-fetch-site: "cross-site"
In the swagger, we have specified that the server should point to the GMS URL. Is there something that we are missing here? Thank you all in advance!
    h
    o
    b
    • 4
    • 15
  • g

    great-account-95406

    08/24/2022, 9:57 AM
Hi everyone! I'm trying to run multiple ingestions at the same time via the UI, but only one of them is Succeeded. I'm getting this error:
    Copy code
    '/usr/local/bin/run_ingest.sh: line 40:  1085 Killed                  ( datahub ingest run -c "${recipe_file}" ${report_option} )\n',
               "2022-08-24 09:54:45.857765 [exec_id=dda4a3b4-d54c-4764-a22e-55ff65fbb940] INFO: Failed to execute 'datahub ingest'",
               '2022-08-24 09:54:45.858000 [exec_id=dda4a3b4-d54c-4764-a22e-55ff65fbb940] INFO: Caught exception EXECUTING '
               'task_id=dda4a3b4-d54c-4764-a22e-55ff65fbb940, name=RUN_INGEST, stacktrace=Traceback (most recent call last):\n'
               '  File "/usr/local/lib/python3.9/site-packages/acryl/executor/execution/default_executor.py", line 122, in execute_task\n'
               '    self.event_loop.run_until_complete(task_future)\n'
               '  File "/usr/local/lib/python3.9/site-packages/nest_asyncio.py", line 89, in run_until_complete\n'
               '    return f.result()\n'
               '  File "/usr/local/lib/python3.9/asyncio/futures.py", line 201, in result\n'
               '    raise self._exception\n'
               '  File "/usr/local/lib/python3.9/asyncio/tasks.py", line 256, in __step\n'
               '    result = coro.send(None)\n'
               '  File "/usr/local/lib/python3.9/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 142, in execute\n'
               '    raise TaskError("Failed to execute \'datahub ingest\'")\n'
               "acryl.executor.execution.task.TaskError: Failed to execute 'datahub ingest'\n"]}
    Execution finished with errors.
    false
    Is this the expected behavior?
    d
    • 2
    • 6
  • f

    few-grass-66826

    08/24/2022, 12:38 PM
Hi guys, I have a Confluent Docker setup and want to ingest a topic from Confluent Kafka, but the ingestion is stuck in Running status and does nothing. Any ideas, or is this a known bug? (See the recipe sketch below this message.)
    h
    • 2
    • 1
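For reference, a minimal Kafka recipe sketch with placeholder broker, schema-registry, topic, and server values; when a recipe like this hangs in Running, one common cause is that the executor container cannot reach the broker or the schema registry:

```yaml
source:
  type: kafka
  config:
    connection:
      bootstrap: "broker:29092"                           # placeholder broker address
      schema_registry_url: "http://schema-registry:8081"  # placeholder schema registry
    topic_patterns:
      allow:
        - "my_topic"                                      # placeholder topic name
sink:
  type: datahub-rest
  config:
    server: "http://datahub-gms:8080"                     # placeholder GMS endpoint
```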
  • l

    late-bear-87552

    08/24/2022, 12:48 PM
Hello everyone, I am trying to add a run instance to a task of a Spark job in DataHub using the Java emitter, and I am getting the error below. Could you please help me understand how to form the URN?
    Copy code
    val taskMCPW = dataHubRestEmitter.addTaskRunToDataHub(
              "dataProcessInstance",
              "urn:li:dataProcessInstance:(urn:li:dataJob:(urn:li:dataFlow:(spark,gobbler-ingestion-applicationId-1,PROD),dp.groww_staging_22803.gobbler_3.test3_2),avc)")
    Copy code
    def addTaskRunToDataHub(entityType: String, urn: String): MetadataChangeProposalWrapper.Build ={
    
        MetadataChangeProposalWrapper.builder()
          .entityType(entityType)
          .entityUrn(urn)
          .upsert()
          .aspect(new DataProcessInstanceRunEvent()
            .setMessageId("test-1")
            .setStatus(DataProcessRunStatus.COMPLETE)
            .setTimestampMillis(Instant.now.getEpochSecond))
      }
    Copy code
    Failed to validate entity URN urn:li:dataProcessInstance:(urn:li:dataJob:(urn:li:dataFlow:(spark,gobbler-ingestion-applicationId-1,PROD),dp.groww_staging_22803.gobbler_3.test3_2),avc)\n\tat com.linkedin.metadata.utils.EntityKeyUtils.getUrnFromProposal(EntityKeyUtils.java:33)\n\tat com.linkedin.metadata.resources.entity.AspectUtils.getAdditionalChanges(AspectUtils.java:33)\n\tat com.linkedin.metadata.resources.entity.AspectResource.ingestProposal(AspectResource.java:131)\n\tat sun.reflect.GeneratedMethodAccessor233.invoke(Unknown Source)\n\tat sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\n\tat java.lang.reflect.Method.invoke(Method.java:498)\n\tat com.linkedin.restli.internal.server.RestLiMethodInvoker.doInvoke(RestLiMethodInvoker.java:177)\n\t... 81 more\nCaused by: java.lang.IllegalArgumentException: Failed to convert urn to entity key: urns parts and key fields do not have same length\n\tat com.linkedin.metadata.utils.EntityKeyUtils.convertUrnToEntityKey(EntityKeyUtils.java:97)\n\tat com.linkedin.metadata.utils.EntityKeyUtils.getUrnFromProposal(EntityKeyUtils.java:31)\n\t... 87 more\n","message":"INTERNAL SERVER ERROR","status":500}, underlyingResponse=HTTP/1.1 500 Server Error [Date: Wed, 24 Aug 2022 12:41:00 GMT, Content-Type: application/json, Content-Length: 9066, Connection: keep-alive, X-RestLi-Protocol-Version: 2.0.0, Strict-Transport-Security: max-age=15724800; includeSubDomains] [Content-Length: 9066,Chunked: false])
    h
    • 2
    • 1
  • a

    aloof-ram-72401

    08/24/2022, 2:24 PM
Hi, looking for a recommendation on how to handle ingestion of GlobalTags, GlossaryTerms, and Ownership for a Dataset when we have multiple sources that may need to modify these. For example, the source db might emit a couple of tags, but we also want to allow users to add tags via the UI. Is there a way to make sure the source won't overwrite any tags added via the UI every time it runs ingestion, similar to how editableSchemaMetadata works? (See the transformer sketch below this message.)
    m
    • 2
    • 4
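A hedged sketch of one approach for the question above: recent CLI versions let the add-tags/terms/ownership transformers run with PATCH semantics, so ingested values are merged with, rather than replacing, edits made in the UI. The tag URN is a placeholder, and you should check that your transformer/CLI version supports the `semantics` option:

```yaml
transformers:
  - type: simple_add_dataset_tags
    config:
      semantics: PATCH          # merge with existing tags instead of overwriting them
      tag_urns:
        - "urn:li:tag:example"  # placeholder tag
```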
  • s

    silly-finland-62382

    08/24/2022, 2:45 PM
Hey, I am trying to run Spark lineage locally using the Python code below, and I am getting an error.
  • s

    silly-finland-62382

    08/24/2022, 2:46 PM
    Copy code
spark = SparkSession.builder \
        .master("local[1]") \
        .appName("Main") \
        .config("spark.sql.warehouse.dir", "/tmp/data") \
        .config("spark.jars.packages", "io.acryl:datahub-spark-lineage:0.8.43") \
        .config("spark.extraListeners", "datahub.spark.DatahubSparkListener") \
        .config("spark.datahub.rest.server", "<http://172.31.18.133:8080>") \
        .config("spark.datahub.metadata.dataset.platformInstance", "dataset") \
        .config("spark.datahub.rest.token", "eyJhbGciOiJIUzI1NiJ9.eyJhY3RvclR5cGUiOiJVU0VSIiwiYWN0b3JJZCI6Im1vaGl0LmdhcmciLCJ0eXBlIjoiUEVSU09OQUwiLCJ2ZXJzaW9uIjoiMiIsImV4cCI6MTY2MzkxOTkzOSwianRpIjoiMjk2Y2E3MGUtMjA2My00ODM0LTkwNmYtMGIzZjRjMTVlY2RhIiwic3ViIjoibW9oaXQuZ2FyZyIsImlzcyI6ImRhdGFodWItbWV0YWRhdGEtc2VydmljZSJ9.tr2mu_FueVfHKz9Ze2BWmN4dqhOrTwR1t_WrfxspOmY") \
        .enableHiveSupport() \
        .getOrCreate();
    plus1 1
  • s

    silly-finland-62382

    08/24/2022, 2:46 PM
    Error:
    Copy code
    /Users/nishchayagarwal/IdeaProjects/python-venv/bin/python /Users/nishchayagarwal/IdeaProjects/prism-catalog/lineage/staging/datahub-spark.py
    Ivy Default Cache set to: /Users/nishchayagarwal/.ivy2/cache
    The jars for the packages stored in: /Users/nishchayagarwal/.ivy2/jars
    :: loading settings :: url = jar:file:/usr/local/lib/python3.9/site-packages/pyspark/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
    io.acryl#datahub-spark-lineage added as a dependency
    :: resolving dependencies :: org.apache.spark#spark-submit-parent-1a425c26-0bdd-4fa2-82e7-2e79de959dae;1.0
    	confs: [default]
    	found io.acryl#datahub-spark-lineage;0.8.43 in central
    :: resolution report :: resolve 236ms :: artifacts dl 3ms
    	:: modules in use:
    	io.acryl#datahub-spark-lineage;0.8.43 from central in [default]
    	---------------------------------------------------------------------
    	|                  |            modules            ||   artifacts   |
    	|       conf       | number| search|dwnlded|evicted|| number|dwnlded|
    	---------------------------------------------------------------------
    	|      default     |   1   |   0   |   0   |   0   ||   1   |   0   |
    	---------------------------------------------------------------------
    :: retrieving :: org.apache.spark#spark-submit-parent-1a425c26-0bdd-4fa2-82e7-2e79de959dae
    	confs: [default]
    	0 artifacts copied, 1 already retrieved (0kB/5ms)
    22/08/24 20:14:13 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
    Setting default log level to "WARN".
    To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
    22/08/24 20:14:19 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
    SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
    SLF4J: Defaulting to no-operation (NOP) logger implementation
    SLF4J: See <http://www.slf4j.org/codes.html#StaticLoggerBinder> for further details.
    
    Process finished with exit code 0
    d
    • 2
    • 45
  • s

    silly-finland-62382

    08/24/2022, 2:46 PM
Can someone help me with this? @channel
    h
    l
    • 3
    • 2
  • s

    silly-finland-62382

    08/24/2022, 2:46 PM
    @big-carpet-38439
  • s

    silly-finland-62382

    08/24/2022, 2:47 PM
    @bulky-soccer-26729 @little-megabyte-1074