# ingestion
  • m

    millions-raincoat-77437

    08/22/2022, 1:27 PM
Hi folks, I'm trying to add data lineage between Glue and S3. Is it possible to link these tools automatically or through a YAML file? If yes, please tell me how. (A hedged recipe sketch follows below this message.)
    h
    m
    • 3
    • 8
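A minimal recipe sketch for the question above, assuming the Glue source's `emit_s3_lineage` option is what links Glue tables to their S3 locations automatically; the region and server values are placeholders:

```yaml
source:
  type: glue
  config:
    aws_region: "us-east-1"          # placeholder region
    emit_s3_lineage: true            # ask the Glue source to emit lineage to the backing S3 locations
sink:
  type: datahub-rest
  config:
    server: "http://localhost:8080"  # placeholder GMS endpoint
```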
  • m

    melodic-monitor-75886

    08/22/2022, 5:08 PM
Hey folks, I am trying to connect to a MongoDB Atlas instance for ingestion using the GUI, and I'm getting this error:
    Copy code
    [2022-08-22 17:00:35,324] ERROR    {datahub.ingestion.run.pipeline:127} - The "dnspython" module must be installed to use mongodb+srv:// '
               'URIs. To fix this error install pymongo with the srv extra:\n'
               ' /tmp/datahub/ingest/venv-1481877f-1fce-4dc3-888e-1d27fe819844/bin/python3 -m pip install "pymongo[srv]"\n'
    Has anyone encountered this and resolved it?
    h
    g
    • 3
    • 3
  • s

    straight-agent-79732

    08/21/2022, 6:36 AM
Hi, for the datahub-business-glossary recipe: where does DataHub pick the file up from? Is it the machine running the browser, or the machine hosting DataHub? I tried both and neither seems to work; attaching a reference image. Can someone help us here? (See the recipe sketch below this message.)
    g
    • 2
    • 1
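For reference, a minimal business-glossary recipe sketch; the file path is resolved on the machine (or executor/actions container) where the ingestion actually runs, not on the machine running the browser. The path and server values are placeholders:

```yaml
source:
  type: datahub-business-glossary
  config:
    # Resolved relative to where `datahub ingest` runs (the CLI host, or the
    # actions/executor container for UI-based ingestion), not the browser machine.
    file: ./business_glossary.yml     # placeholder path
sink:
  type: datahub-rest
  config:
    server: "http://localhost:8080"   # placeholder GMS endpoint
```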
  • p

    proud-cpu-75817

    08/22/2022, 10:12 PM
    Just opened my first issue on the DataHub project 🙂 https://github.com/datahub-project/datahub/issues/5706
    teamwork 1
    b
    • 2
    • 3
  • g

    gray-airplane-39227

    08/22/2022, 10:50 PM
Hello folks, I'm wondering whether the DataHub OpenAPI supports ingestion; from the docs it seems it only deals with Entities and Timeline. Other than the CLI and UI, are there any other ways to ingest data? (See the sketch below this message.)
    b
    b
    +2
    • 5
    • 14
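Besides the CLI and the UI, metadata can also be pushed programmatically with the Python/Java emitters, or a recipe can write through the Kafka sink instead of REST. A hedged sketch of the Kafka-sink variant, with placeholder file, broker, and schema-registry values:

```yaml
source:
  type: file
  config:
    filename: "./metadata_events.json"              # placeholder metadata file
sink:
  type: datahub-kafka
  config:
    connection:
      bootstrap: "localhost:9092"                   # placeholder broker
      schema_registry_url: "http://localhost:8081"  # placeholder schema registry
```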
  • b

    bland-orange-13353

    08/23/2022, 4:43 AM
    This message was deleted.
    d
    • 2
    • 1
  • b

    busy-glass-61431

    08/23/2022, 5:54 AM
Has anyone tried the Airflow connector on Airflow v1.10.9? The DataHub documentation says it's supported for Airflow v1.10.15+, but has anyone tested it below that version, or will it simply not work with an Airflow version below that?
    m
    d
    • 3
    • 2
  • a

    alert-fall-82501

    08/23/2022, 6:31 AM
Hi team - I have an S3 delta lake as the source, with a table stored as Parquet files. As the base path I am including "s3://xx.lakehouse.xxx.dev/xxx/data/PartialPayload_daily/date=2021-07-28/*.parquet". With this path I get the whole folder hierarchy on the server side, and the Parquet file name is also turned into a folder. What I want is only the table schema, not the folders. Can anyone suggest how to format the path above to get only the table? (See the path_spec sketch below this message.)
    h
    c
    • 3
    • 4
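A hedged path_spec sketch for the question above, assuming the generic s3/data-lake source (the delta-lake source instead takes a base_path). The `{table}` placeholder tells the source to treat that folder level as the table, so partition folders and files are collapsed under a single dataset; the exact key may be `path_spec` or a `path_specs` list depending on the CLI version:

```yaml
source:
  type: s3
  config:
    path_spec:                  # `path_specs:` (a list) on newer CLI versions
      include: "s3://xx.lakehouse.xxx.dev/xxx/data/{table}/date=*/*.parquet"
    aws_config:
      aws_region: "us-east-1"   # placeholder region
```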
  • m

    microscopic-mechanic-13766

    08/23/2022, 7:50 AM
Good morning team, I am facing an error while trying to test the connection with Airflow. I have followed the steps shown here. My problem is that the DAG cannot be imported into Airflow, even though it is just a copy-paste of the given example. Has anyone faced this error before? Thanks in advance!
    Copy code
    Broken DAG: [/opt/airflow/dags/pruebaDH.py] Traceback (most recent call last):
      File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
      File "/opt/airflow/dags/pruebaDH.py", line 10, in <module>
        from datahub_provider.entities import Dataset
    ModuleNotFoundError: No module named 'datahub_provider'
    d
    • 2
    • 36
  • s

    square-solstice-69079

    08/23/2022, 9:43 AM
What is the status of managed Airflow (MWAA) ingestion? It seems like it is not supported yet, based on this: https://datahubspace.slack.com/archives/CUMUWQU66/p1646928904910199 But is there some kind of workaround? Could someone explain in a bit more detail how to set it up?
    d
    d
    • 3
    • 52
  • c

    colossal-hairdresser-6799

    08/23/2022, 11:39 AM
    Ingesting metadata
    BigQuery labels
Hello channel! For my current assignment we have 100k+ tables that we would like to ingest into DataHub. For all the tables we want to retrieve the information contained in their labels and add it as metadata in DataHub. What's a feasible way of achieving this?
    d
    • 2
    • 4
  • b

    bland-orange-13353

    08/23/2022, 11:58 AM
    This message was deleted.
    h
    • 2
    • 2
  • a

    alert-fall-82501

    08/23/2022, 2:11 PM
Hi team - I have an S3 delta lake as the source, with a table stored as Parquet files. As the base path I am including "s3://xx.lakehouse.xxx.dev/xxx/data/PartialPayload_daily/date=2021-07-28/*.parquet". With this path I get the whole folder hierarchy on the server side, and the Parquet file name is also turned into a folder. What I want is only the table schema, not the folders. Can anyone suggest how to format the path above to get only the table? (edited)
    h
    • 2
    • 1
  • s

    sparse-forest-98608

    08/23/2022, 2:36 PM
Can anyone help with my query?
  • s

    sparse-forest-98608

    08/23/2022, 2:36 PM
I have put a lot of effort into researching this, but I could not ingest a JSON file schema from my local machine into DataHub.
  • g

    great-cpu-77172

    08/23/2022, 3:58 PM
Hi team - I am trying to ingest Spark lineage into my local DataHub; data is read from a CSV file and from Postgres and written to a new table in Postgres, but neither the data nor the lineage is being persisted in DataHub. Any pointers on what could be wrong? I am using Spark 3.3 with a Jupyter notebook.
    Copy code
    spark = SparkSession.builder \
            .master("local") \
            .appName("datahub-lineage") \
            .config("spark.jars", "postgresql-42.2.14.jar") \
            .config("spark.jars.packages", "io.acryl:datahub-spark-lineage:0.8.24") \
            .config("spark.extralisteners", "datahub.spark.DatahubSparkListener") \
            .config("spark.datahub.rest.server", "<http://localhost:8080>") \
            .getOrCreate()
    
    flight_details.write \
        .mode("append") \
        .format("jdbc") \
        .option("url", "jdbc:<postgresql://localhost:5432/my_database>") \
        .option("user", "postgres") \
        .option("password", "password123") \
        .option("driver", "org.postgresql.Driver") \
        .option("dbtable", "flight_details") \
        .save()
    d
    • 2
    • 3
  • l

    little-breakfast-38102

    08/23/2022, 5:15 PM
Hello @incalculable-ocean-74010 / @dazzling-judge-80093, I am using datahub-ingestion-cron to ingest metadata from MSSQL. I am able to run the ingestion successfully after manually editing my CRON job in Lens to add env variables from secrets. When I try to deploy the same change, I run into the error "invalid value in env name". Attaching screenshots from values.yaml and the deployment log. Appreciate any help.
    i
    • 2
    • 8
  • c

    calm-balloon-31412

    08/23/2022, 5:34 PM
👋 Hello! I am trying to write a GraphQL query to get all runs for a set of tasks where one of the custom properties (in my case "execution date") is greater than some date value I pass in the query. Is this possible?
    h
    • 2
    • 2
  • c

    cool-actor-73767

    08/23/2022, 9:50 PM
Hi everyone! I'm using a Glue ingestion process created with the DataHub UI ingestion feature. Recently I realized that some catalog tables aren't loaded. Has anyone come across the same problem, and if so, what is the solution? (See the filter sketch below this message.)
    h
    • 2
    • 4
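One thing worth checking for the question above is whether the recipe's allow/deny filters are excluding the missing tables; a hedged sketch of the relevant Glue source options (the pattern values are placeholders):

```yaml
source:
  type: glue
  config:
    aws_region: "us-east-1"   # placeholder region
    database_pattern:
      allow:
        - ".*"                # placeholder: make sure the missing databases match an allow rule
    table_pattern:
      deny: []                # placeholder: make sure no deny rule drops the missing tables
```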
  • e

    elegant-article-21703

    08/24/2022, 9:20 AM
Hello everyone! In our development environment we are trying to connect to our GMS through an API gateway in Azure. We have loaded the OpenAPI swagger, but once that is done and we test it using a recipe, the answer we receive is the following:
    Copy code
    [2022-08-24 10:13:04,025] INFO     {datahub.cli.ingest_cli:99} - DataHub CLI version: 0.8.41
    [2022-08-24 10:13:04,691] INFO     {datahub.ingestion.run.pipeline:160} - Sink configured successfully. DataHubRestEmitter: configured to talk to <https://apitest.project.com/project-gms-test/>
    [2022-08-24 10:13:04,691] INFO     {datahub.cli.ingest_cli:115} - Starting metadata ingestion
    [2022-08-24 10:13:04,737] ERROR    {datahub.ingestion.run.pipeline:110} - failed to write record with workunit file://./datahub-cli/recipes/users.json:0 with ('Unable to emit metadata to DataHub GMS', {'statusCode': 404, 'message': 'Resource not found'}) and info {'statusCode': 404, 'message': 'Resource not found'}
    [2022-08-24 10:13:04,771] ERROR    {datahub.ingestion.run.pipeline:110} - failed to write record with workunit file://./datahub-cli/recipes/users.json:1 with ('Unable to emit metadata to DataHub GMS', {'statusCode': 404, 'message': 'Resource not found'}) and info {'statusCode': 404, 'message': 'Resource not found'}
    [2022-08-24 10:13:04,772] INFO     {datahub.cli.ingest_cli:133} - Finished metadata pipeline
    
    Source (file) report:
    {'workunits_produced': 2,
     'workunit_ids': ['file://./datahub-cli/recipes/users.json:0', 'file://./datahub-cli/recipes/users.json:1'],
     'warnings': {},
     'failures': {},
     'cli_version': '0.8.41',
     'cli_entry_location': '/home/0_GDP/datahub/venv/lib/python3.8/site-packages/datahub/__init__.py',
     'py_version': '3.8.10 (default, Jun 22 2022, 20:18:18) \n[GCC 9.4.0]',
     'py_exec_path': '/home/0_GDP/datahub/venv/bin/python',
     'os_details': 'Linux-5.15.0-46-generic-x86_64-with-glibc2.29'}
    Sink (datahub-rest) report:
    {'records_written': 0,
     'warnings': [],
     'failures': [{'error': 'Unable to emit metadata to DataHub GMS', 'info': {'statusCode': 404, 'message': 'Resource not found'}},
                  {'error': 'Unable to emit metadata to DataHub GMS', 'info': {'statusCode': 404, 'message': 'Resource not found'}}],
     'downstream_start_time': None,
     'downstream_end_time': None,
     'downstream_total_latency_in_seconds': None,
     'gms_version': 'v0.8.41'}
    
    Pipeline finished with 0 failures in source producing 2 workunits
    And the recipe we are using is the following:
    Copy code
    source:
      type: file
      config:
        # Coordinates
        filename: "./datahub-cli/recipes/users.json"
    
    sink:
      type: "datahub-rest"
      config:
        server: "<https://apitest.project.com/project-gms-test/>"
        extra_headers:
          accept: "*/*"
          accept-language: "en-US,en;q=0.9"
          authorization: "Bearer eyJ0eXAiOiJKV1QiLCJhbGciOiJSUzI1NiIsIng1dCI6IjJaUXBKM1VwYmpBWVhZR2FYRUpsOGxWMFRPSSIsImtpZCI6IjJaUXBKM1VwYmpBWVhZR2FYRUpsOGxWMFRPSSJ9.eyJhdWQiO"
          cache-control: "no-cache"
          content-type: "application/json"
          ocp-apim-subscription-key: "61cf44e0696d"
          sec-fetch-dest: "empty"
          sec-fetch-mode: "cors"
          sec-fetch-site: "cross-site"
In the swagger, we have specified that the server should point to the GMS URL. Is there something that we are missing here? Thank you all in advance!
    h
    o
    b
    • 4
    • 15
  • g

    great-account-95406

    08/24/2022, 9:57 AM
Hi everyone! I'm trying to run multiple ingestions at the same time via the UI, but only one of them is Succeeded. I'm getting this error:
    Copy code
    '/usr/local/bin/run_ingest.sh: line 40:  1085 Killed                  ( datahub ingest run -c "${recipe_file}" ${report_option} )\n',
               "2022-08-24 09:54:45.857765 [exec_id=dda4a3b4-d54c-4764-a22e-55ff65fbb940] INFO: Failed to execute 'datahub ingest'",
               '2022-08-24 09:54:45.858000 [exec_id=dda4a3b4-d54c-4764-a22e-55ff65fbb940] INFO: Caught exception EXECUTING '
               'task_id=dda4a3b4-d54c-4764-a22e-55ff65fbb940, name=RUN_INGEST, stacktrace=Traceback (most recent call last):\n'
               '  File "/usr/local/lib/python3.9/site-packages/acryl/executor/execution/default_executor.py", line 122, in execute_task\n'
               '    self.event_loop.run_until_complete(task_future)\n'
               '  File "/usr/local/lib/python3.9/site-packages/nest_asyncio.py", line 89, in run_until_complete\n'
               '    return f.result()\n'
               '  File "/usr/local/lib/python3.9/asyncio/futures.py", line 201, in result\n'
               '    raise self._exception\n'
               '  File "/usr/local/lib/python3.9/asyncio/tasks.py", line 256, in __step\n'
               '    result = coro.send(None)\n'
               '  File "/usr/local/lib/python3.9/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 142, in execute\n'
               '    raise TaskError("Failed to execute \'datahub ingest\'")\n'
               "acryl.executor.execution.task.TaskError: Failed to execute 'datahub ingest'\n"]}
    Execution finished with errors.
    false
    Is this the expected behavior?
    d
    • 2
    • 6
  • f

    few-grass-66826

    08/24/2022, 12:38 PM
Hi guys, I have a Confluent Docker setup and want to ingest a topic from Confluent Kafka, but the ingestion is stuck in Running status and does nothing. Any ideas, or is this a known bug? (See the recipe sketch below this message.)
    h
    • 2
    • 1
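For reference, a minimal Kafka recipe sketch with placeholder broker, schema-registry, topic, and server values; when a recipe like this hangs in Running, one common cause is that the executor container cannot reach the broker or the schema registry:

```yaml
source:
  type: kafka
  config:
    connection:
      bootstrap: "broker:29092"                           # placeholder broker address
      schema_registry_url: "http://schema-registry:8081"  # placeholder schema registry
    topic_patterns:
      allow:
        - "my_topic"                                      # placeholder topic name
sink:
  type: datahub-rest
  config:
    server: "http://datahub-gms:8080"                     # placeholder GMS endpoint
```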
  • l

    late-bear-87552

    08/24/2022, 12:48 PM
Hello everyone, I am trying to add a run instance to a task of a Spark job in DataHub using the Java emitter, and I am getting the error below. Could you please help me understand how to form the URN?
    Copy code
    val taskMCPW = dataHubRestEmitter.addTaskRunToDataHub(
              "dataProcessInstance",
              "urn:li:dataProcessInstance:(urn:li:dataJob:(urn:li:dataFlow:(spark,gobbler-ingestion-applicationId-1,PROD),dp.groww_staging_22803.gobbler_3.test3_2),avc)")
    Copy code
    def addTaskRunToDataHub(entityType: String, urn: String): MetadataChangeProposalWrapper.Build ={
    
        MetadataChangeProposalWrapper.builder()
          .entityType(entityType)
          .entityUrn(urn)
          .upsert()
          .aspect(new DataProcessInstanceRunEvent()
            .setMessageId("test-1")
            .setStatus(DataProcessRunStatus.COMPLETE)
            .setTimestampMillis(Instant.now.getEpochSecond))
      }
    Copy code
    Failed to validate entity URN urn:li:dataProcessInstance:(urn:li:dataJob:(urn:li:dataFlow:(spark,gobbler-ingestion-applicationId-1,PROD),dp.groww_staging_22803.gobbler_3.test3_2),avc)\n\tat com.linkedin.metadata.utils.EntityKeyUtils.getUrnFromProposal(EntityKeyUtils.java:33)\n\tat com.linkedin.metadata.resources.entity.AspectUtils.getAdditionalChanges(AspectUtils.java:33)\n\tat com.linkedin.metadata.resources.entity.AspectResource.ingestProposal(AspectResource.java:131)\n\tat sun.reflect.GeneratedMethodAccessor233.invoke(Unknown Source)\n\tat sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\n\tat java.lang.reflect.Method.invoke(Method.java:498)\n\tat com.linkedin.restli.internal.server.RestLiMethodInvoker.doInvoke(RestLiMethodInvoker.java:177)\n\t... 81 more\nCaused by: java.lang.IllegalArgumentException: Failed to convert urn to entity key: urns parts and key fields do not have same length\n\tat com.linkedin.metadata.utils.EntityKeyUtils.convertUrnToEntityKey(EntityKeyUtils.java:97)\n\tat com.linkedin.metadata.utils.EntityKeyUtils.getUrnFromProposal(EntityKeyUtils.java:31)\n\t... 87 more\n","message":"INTERNAL SERVER ERROR","status":500}, underlyingResponse=HTTP/1.1 500 Server Error [Date: Wed, 24 Aug 2022 12:41:00 GMT, Content-Type: application/json, Content-Length: 9066, Connection: keep-alive, X-RestLi-Protocol-Version: 2.0.0, Strict-Transport-Security: max-age=15724800; includeSubDomains] [Content-Length: 9066,Chunked: false])
    h
    • 2
    • 1
  • a

    aloof-ram-72401

    08/24/2022, 2:24 PM
Hi, looking for a recommendation on how to handle ingestion of GlobalTags, GlossaryTerms, and Ownership for a Dataset when we have multiple sources that may need to modify these. For example, the source db might emit a couple of tags, but we also want to allow users to add tags via the UI. Is there a way to make sure the source won't overwrite any tags added via the UI every time it runs ingestion, similar to how editableSchemaMetadata works? (See the transformer sketch below this message.)
    m
    • 2
    • 4
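A hedged sketch of one approach for the question above: recent CLI versions let the add-tags/terms/ownership transformers run with PATCH semantics, so ingested values are merged with, rather than replacing, edits made in the UI. The tag URN is a placeholder, and you should check that your transformer/CLI version supports the `semantics` option:

```yaml
transformers:
  - type: simple_add_dataset_tags
    config:
      semantics: PATCH          # merge with existing tags instead of overwriting them
      tag_urns:
        - "urn:li:tag:example"  # placeholder tag
```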
  • s

    silly-finland-62382

    08/24/2022, 2:45 PM
Hey, I am trying to run Spark lineage locally using the Python code below, and I am getting an error.
  • s

    silly-finland-62382

    08/24/2022, 2:46 PM
    Copy code
spark = SparkSession.builder \
        .master("local[1]") \
        .appName("Main") \
        .config("spark.sql.warehouse.dir", "/tmp/data") \
        .config("spark.jars.packages", "io.acryl:datahub-spark-lineage:0.8.43") \
        .config("spark.extraListeners", "datahub.spark.DatahubSparkListener") \
        .config("spark.datahub.rest.server", "<http://172.31.18.133:8080>") \
        .config("spark.datahub.metadata.dataset.platformInstance", "dataset") \
        .config("spark.datahub.rest.token", "eyJhbGciOiJIUzI1NiJ9.eyJhY3RvclR5cGUiOiJVU0VSIiwiYWN0b3JJZCI6Im1vaGl0LmdhcmciLCJ0eXBlIjoiUEVSU09OQUwiLCJ2ZXJzaW9uIjoiMiIsImV4cCI6MTY2MzkxOTkzOSwianRpIjoiMjk2Y2E3MGUtMjA2My00ODM0LTkwNmYtMGIzZjRjMTVlY2RhIiwic3ViIjoibW9oaXQuZ2FyZyIsImlzcyI6ImRhdGFodWItbWV0YWRhdGEtc2VydmljZSJ9.tr2mu_FueVfHKz9Ze2BWmN4dqhOrTwR1t_WrfxspOmY") \
        .enableHiveSupport() \
        .getOrCreate();
    plus1 1
  • s

    silly-finland-62382

    08/24/2022, 2:46 PM
    Error:
    Copy code
    /Users/nishchayagarwal/IdeaProjects/python-venv/bin/python /Users/nishchayagarwal/IdeaProjects/prism-catalog/lineage/staging/datahub-spark.py
    Ivy Default Cache set to: /Users/nishchayagarwal/.ivy2/cache
    The jars for the packages stored in: /Users/nishchayagarwal/.ivy2/jars
    :: loading settings :: url = jar:file:/usr/local/lib/python3.9/site-packages/pyspark/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
    io.acryl#datahub-spark-lineage added as a dependency
    :: resolving dependencies :: org.apache.spark#spark-submit-parent-1a425c26-0bdd-4fa2-82e7-2e79de959dae;1.0
    	confs: [default]
    	found io.acryl#datahub-spark-lineage;0.8.43 in central
    :: resolution report :: resolve 236ms :: artifacts dl 3ms
    	:: modules in use:
    	io.acryl#datahub-spark-lineage;0.8.43 from central in [default]
    	---------------------------------------------------------------------
    	|                  |            modules            ||   artifacts   |
    	|       conf       | number| search|dwnlded|evicted|| number|dwnlded|
    	---------------------------------------------------------------------
    	|      default     |   1   |   0   |   0   |   0   ||   1   |   0   |
    	---------------------------------------------------------------------
    :: retrieving :: org.apache.spark#spark-submit-parent-1a425c26-0bdd-4fa2-82e7-2e79de959dae
    	confs: [default]
    	0 artifacts copied, 1 already retrieved (0kB/5ms)
    22/08/24 20:14:13 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
    Setting default log level to "WARN".
    To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
    22/08/24 20:14:19 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
    SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
    SLF4J: Defaulting to no-operation (NOP) logger implementation
    SLF4J: See <http://www.slf4j.org/codes.html#StaticLoggerBinder> for further details.
    
    Process finished with exit code 0
    d
    • 2
    • 45
  • s

    silly-finland-62382

    08/24/2022, 2:46 PM
Can someone help me with this? @channel
    h
    l
    • 3
    • 2
  • s

    silly-finland-62382

    08/24/2022, 2:46 PM
    @big-carpet-38439
  • s

    silly-finland-62382

    08/24/2022, 2:47 PM
    @bulky-soccer-26729 @little-megabyte-1074