# ingestion
  • salmon-rose-54694

    02/25/2022, 5:36 AM
    Hey Team, my Elasticsearch index is broken. How can I rebuild it? I don't want to import everything again.
  • curved-carpenter-44858

    02/25/2022, 6:42 AM
    Hi Everyone, I am trying to test metadata ingestion for the Hive metastore. We have a standalone Hive metastore service running, version 3.1.2. Below is the recipe file I used.
    Copy code
    source:
        type: hive
        config:
            scheme: hive+http
            host_port: 'hive-metastore.hive.svc.cluster.local:9083'
            database: null
            username: null
            password: null
    sink:
        type: datahub-rest
        config:
            server: '<http://datahub-datahub-gms.datahub.svc.cluster.local:8080>'
    When I ran it from the DataHub frontend I got the error below (partial logs pasted).
    Copy code
    ......
        version, status, reason = self._read_status()\n'
               'File "/usr/local/lib/python3.9/http/client.py", line 289, in _read_status\n'
               '    raise RemoteDisconnected("Remote end closed connection without"\n'
               '\n'
               'RemoteDisconnected: Remote end closed connection without response\n',
               "2022-02-25 06:21:18.926125 [exec_id=a071f153-5777-419f-9511-37214e1429b6] INFO: Failed to execute 'datahub ingest'",
               '2022-02-25 06:21:18.926532 [exec_id=a071f153-5777-419f-9511-37214e1429b6] INFO: Caught exception EXECUTING '
               'task_id=a071f153-5777-419f-9511-37214e1429b6, name=RUN_INGEST, stacktrace=Traceback (most recent call last):\n'
               '  File "/usr/local/lib/python3.9/site-packages/acryl/executor/execution/default_executor.py", line 119, in execute_task\n'
               '    self.event_loop.run_until_complete(task_future)\n'
               '  File "/usr/local/lib/python3.9/site-packages/nest_asyncio.py", line 81, in run_until_complete\n'
               '    return f.result()\n'
               '  File "/usr/local/lib/python3.9/asyncio/futures.py", line 201, in result\n'
    .......
    In the metastore logs I found this. Am I missing anything? What could be the reason?
    Copy code
    2022-02-25T06:19:46,599 ERROR [pool-6-thread-200] server.TThreadPoolServer: Thrift error occurred during processing of message.
    org.apache.thrift.protocol.TProtocolException: Missing version in readMessageBegin, old client?
            at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:228) ~[libthrift-0.9.3.jar:0.9.3]
            at org.apache.hadoop.hive.metastore.TUGIBasedProcessor.process(TUGIBasedProcessor.java:76) ~[hive-standalone-metastore-3.1.2.jar:3.1.2]
            at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:286) [libthrift-0.9.3.jar:0.9.3]
            at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_322]
            at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_322]
            at java.lang.Thread.run(Thread.java:750) [?:1.8.0_322]
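    A side note on the recipe above, as a hedged guess: the hive source is SQLAlchemy/PyHive-based, so it expects to talk to HiveServer2 rather than the standalone metastore's Thrift port (9083), and scheme: hive+http switches it to the HTTP transport. That mismatch could explain both the RemoteDisconnected error on the client and the "Missing version in readMessageBegin" error in the metastore log. A minimal sketch, assuming a HiveServer2 endpoint is available (the host name and port below are placeholders, not from the original message):
    Copy code
    source:
        type: hive
        config:
            # point at HiveServer2 (binary Thrift transport, default port 10000),
            # not at the standalone metastore on 9083 - placeholder host
            host_port: 'hiveserver2.hive.svc.cluster.local:10000'
    sink:
        type: datahub-rest
        config:
            server: 'http://datahub-datahub-gms.datahub.svc.cluster.local:8080'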
  • few-air-56117

    02/25/2022, 10:18 AM
    Hi guys, when I run the ingestion, if a table has been deleted, is there a way to remove it from DataHub?
  • few-air-56117

    02/25/2022, 11:10 AM
    Hi guys, does stateful ingestion support BigQuery? I tried to use it in two ways:
    Copy code
    source:
        type: bigquery
        config:
            project_id: <project>
            credential:
            include_table_lineage: true
            stateful_ingestion.enabled: true
    
    
    
    sink:
        type: datahub-rest
        config:
            server: '<http://localhost:8080>'
    but I got this:
    Copy code
    stateful_ingestion.enabled
      extra fields not permitted (type=value_error.extra)
    and
    Copy code
    source:
        type: bigquery
        config:
            project_id: am-dwh-t1
            credential:
            include_table_lineage: true
            stateful_ingestion:
              enabled: true
    
    
    sink:
        type: datahub-rest
        config:
            server: '<http://localhost:8080>'
    but I got:
    Copy code
    : 'BigQuerySource' object has no attribute 'config
    Thx 😄
  • dazzling-judge-80093

    02/25/2022, 11:13 AM
    You need to set pipeline_name -> https://datahubproject.io/docs/metadata-ingestion/source_docs/stateful_ingestion/
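    For reference, a minimal sketch of what that could look like for BigQuery, assuming a nested stateful_ingestion block plus a top-level pipeline_name (project and server values are placeholders):
    Copy code
    # pipeline_name keys the stored checkpoint state for this recipe
    pipeline_name: bigquery_prod_pipeline
    source:
        type: bigquery
        config:
            project_id: <project>
            include_table_lineage: true
            stateful_ingestion:
                enabled: true
                # per the stateful ingestion docs, stale entities can be soft-deleted
                remove_stale_metadata: true
    sink:
        # stateful ingestion needs a DataHub-backed sink such as datahub-rest
        type: datahub-rest
        config:
            server: 'http://localhost:8080'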
  • some-crayon-90964

    02/25/2022, 3:44 PM
    Hey guys, we are trying to implement our ingestion process using the Java Emitter, and we are wondering what happens if a user changes something in the UI and the Java Emitter then ingests the same aspect. Is there anything that allows the Emitter to ignore an aspect that a user has edited? Thanks in advance.
  • shy-island-99768

    02/25/2022, 4:06 PM
    Hi! We are trying to ingest our LookML from our GitHub repository. I'm doing a POC but running into issues:
    Copy code
    Source (lookml) report:
    {'workunits_produced': 0,
     'workunit_ids': [],
     'warnings': {},
     'failures': {'/models/google_ads.model.lkml': ['cannot resolve include /views/vm_datawarehouse/sales/sales_orderline.view.lkml']},
     'models_discovered': 9,
     'models_dropped': [...],
     'views_discovered': 0,
     'views_dropped': []}
    Sink (datahub-rest) report:
    {'records_written': 0,
     'warnings': [],
     'failures': [],
     'downstream_start_time': None,
     'downstream_end_time': None,
     'downstream_total_latency_in_seconds': None}
    
    Pipeline finished with failures
    Note that I'm running this in Docker:
    Copy code
    FROM python:3.8-slim-bullseye
    
    WORKDIR /app
    
    COPY ./models /models
    COPY ./views /views
    COPY ./receipt.yml /app/
    
    RUN pip install acryl-datahub[lookml]
    
    RUN ls /views/vm_datawarehouse/sales/
    CMD ["datahub", "ingest", "-c", "receipt.yml"]
    With the recipe (receipt.yml):
    Copy code
    source:
      type: "lookml"
      config:
        # Coordinates
        base_folder: /models/
    
        # Options
        api:
          # Coordinates for your looker instance
          base_url: <https://host>
    
          client_id: ID
          client_secret: SECRET
    
        github_info:
          repo: VanMoof/looker
        model_pattern:
          allow:
            - "google_ads"
    
    sink:
      type: "datahub-rest"
      config:
        server: "<http://HOST:8080>"
        token: "TOKEN"
    Note that when running the 'ls' I can see the view in the correct folder. Any ideas why this is failing? Something particular related to the absolute paths?
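    One possible explanation, hedged: the lookml source resolves include: paths relative to base_folder, so with base_folder: /models/ an include like /views/... has nowhere to resolve to, even though the folder exists in the image. A sketch, assuming both folders are copied under a single root (the /looker path is illustrative, not from the original setup):
    Copy code
    # with COPY ./models /looker/models and COPY ./views /looker/views in the Dockerfile,
    # point base_folder at the shared root so absolute includes can be resolved:
    source:
      type: "lookml"
      config:
        base_folder: /looker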
  • red-accountant-48681

    02/25/2022, 4:32 PM
    I have an image database that I want to make searchable in DataHub. I have XML metadata about each image and am trying to find a way to ingest this so that it remains searchable (i.e. the tags are carried over). Is there any guide on how to do this? We also have the ownership of the image database as well as the ownership of the individual images.
  • gifted-queen-80042

    02/25/2022, 5:55 PM
    Hi team! Could you please confirm how DataHub obtains profiling stats for BigQuery under the hood? Does it query each table in BigQuery to compute its statistics, or does it obtain this directly from logs? cc. @acceptable-potato-35922
  • rapid-article-86196

    02/27/2022, 10:43 AM
    Hey there, does anyone know if there's a way to enable auto-tagging of columns via ingestion transformers? I see that tagging datasets is working, but I couldn't figure out how to tag columns. For example: any column matching .**email.** should be tagged as email.
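    As far as I know, the bundled transformers at this point work at the dataset level (for example simple_add_dataset_tags); tagging individual columns would likely require a custom transformer that rewrites the schemaMetadata/editableSchemaMetadata aspect. For comparison, a dataset-level sketch (the tag URN is illustrative):
    Copy code
    transformers:
      - type: "simple_add_dataset_tags"
        config:
          tag_urns:
            - "urn:li:tag:email"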
  • rough-van-26693

    02/28/2022, 4:30 AM
    Hi, when I try to delete a particular piece of metadata from the PROD environment, how do I specify the prod server? It is trying to connect to the default server http://localhost:8080.
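    For what it's worth, the CLI reads its target server from its own config rather than from a recipe; a sketch of a ~/.datahubenv (normally written by datahub init) pointed at a prod GMS, with placeholder URL and token:
    Copy code
    gms:
      server: 'http://prod-datahub-gms.example.com:8080'
      token: ''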
  • rough-van-26693

    02/28/2022, 5:10 AM
    When I go to the UI page for the BigQuery platform, the following error occurs.
    Copy code
    The field at path '/searchAcrossEntities/searchResults[0]/entity' was declared as a non null type, but the code involved in retrieving data has wrongly returned a null value. The graphql specification requires that the parent field be set to null, or if that is non nullable that it bubble up null to its parent and so on. The non-nullable type is 'Entity' within parent type 'SearchResult' The field at path '/searchAcrossEntities/searchResults[1]/entity' was declared as a non null type, but the code involved in retrieving data has wrongly returned a null value. The graphql specification requires that the parent field be set to null, or if that is non nullable that it bubble up null to its parent and so on. The non-nullable type is 'Entity' within parent type 'SearchResult' The field at path '/searchAcrossEntities/searchResults[2]/entity' was declared as a non null type, but the code involved in retrieving data has wrongly returned a null value. The graphql specification requires that the parent field be set to null, or if that is non nullable that it bubble up null to its parent and so on. The non-nullable type is 'Entity' within parent type 'SearchResult'
  • rough-van-26693

    02/28/2022, 6:16 AM
    Hi All, how do we delete a container for bigquery?
  • witty-dream-29576

    02/28/2022, 1:23 PM
    Hey everyone, I am fiddling around with ingesting lineage from Airflow. However, the demo script does not really help with my problem: https://github.com/linkedin/datahub/blob/master/metadata-ingestion/src/datahub_provider/example_dags/lineage_backend_demo.py In the demo the datasets "Table A", "Table B" and "Table C" are generated in the Airflow script. Is there a way to ingest lineage for already-existing datasets, i.e. when my first job has already ingested "Table A", "Table B" and "Table C" from the example? Is there any way to pass the dataset URN in the Airflow job?
  • numerous-camera-74294

    02/28/2022, 1:57 PM
    Hi folks! I'm really interested in https://datahubproject.io/docs/metadata-ingestion/source_docs/data_lake. Is there any way to run such profiling over a Glue table, or, broadly speaking, over any kind of dataset other than a file?
  • silly-beach-19296

    02/28/2022, 5:43 PM
    Hi folks, I managed to ingest all of the glossary terms, but when I manually associate the terms with columns, they do not appear under Related Entities.
  • millions-waiter-49836

    02/28/2022, 7:44 PM
    Hi everyone, we are ingesting from two different postgres data sources but with the same db and table names, so naturally we want to use platform_instance to customize the URNs... only to find out the postgres recipe doesn't support platform_instance as MySQL and MSSQL do. Can I ask if there is any special consideration for this?
  • rough-van-26693

    03/01/2022, 2:13 AM
    Hi all, how can I list all the URNs on my server, and how can I perform a bulk deletion?
  • square-solstice-69079

    03/01/2022, 9:22 AM
    Hello, I'm trying to ingest an Oracle database and am getting this error:
    Copy code
    'DatabaseError: (cx_Oracle.DatabaseError) DPI-1047: Cannot locate a 64-bit Oracle Client library: "libclntsh.so: cannot open shared '
               'object file: No such file or directory". See <https://cx-oracle.readthedocs.io/en/latest/user_guide/installation.html> for help\n'
               '(Background on this error at: <http://sqlalche.me/e/13/4xp6>)\n',
               "2022-03-01 09:16:17.371091 [exec_id=9a7d99e0-a12e-433e-ba3e-c4d384602ff6] INFO: Failed to execute 'datahub ingest'",
    I did install the Oracle package with pip install 'acryl-datahub[oracle]'. Any thoughts on where the error could be?
  • numerous-camera-74294

    03/02/2022, 12:29 PM
    Hi folks! Is there any way to use the Spark integration for automatic lineage to point to an Athena/Glue dataset rather than to a Hive dataset?
  • alert-hydrogen-52567

    03/02/2022, 2:50 PM
    Hello, is there any way that I can store my ingestion recipe someplace else along with the neo4j database? For example, this is a recipe for MySQL ingestion from the DataHub UI given in the demo:
    Copy code
    source:
        type: mysql
        config:
            host_port: '175.6.61.131:3306'
            database: gprs
            username: cd
            password: cd20220104
            include_tables: true
            include_views: true
            profiling:
                enabled: false
    sink:
        type: datahub-rest
        config:
            server: 'https://demo.datahubproject.io/api/gms'
    Now, I want to save this data in my PostgreSQL database too. Is there any way I can do that?
  • numerous-holiday-52504

    03/02/2022, 5:31 PM
    Hi all. I am brand new to DataHub and trying my very first connection. I'm trying to configure Snowflake to import a schema and provide definitions. I simply keep getting a connection refused error whenever I try to connect. I think the challenge is with the host_port piece, but I can't figure out what is wrong. Whenever I connect with Python, I seem to have to provide Azure credentials as part of the connection string.
    Copy code
    source:
        type: snowflake
        config:
            host_port: [snowflakeaccount].[azure-region].<http://azure.snowflakecomputing.com|azure.snowflakecomputing.com>
            warehouse: *****
            username: *****
            password: *****
            role: ****
    sink:
        type: datahub-rest
        config:
            server: '<http://localhost:9002/api/gms>'
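    The host_port above looks like it picked up a mangled link; for the snowflake source it is usually just the account host with no scheme, and the datahub-rest sink usually points at GMS directly (port 8080 in the quickstart) rather than at the frontend proxy. A sketch with placeholder values:
    Copy code
    source:
        type: snowflake
        config:
            # account host only, no http:// prefix (placeholder account/region)
            host_port: 'myaccount.west-europe.azure.snowflakecomputing.com'
            warehouse: MY_WAREHOUSE
            username: MY_USER
            password: MY_PASSWORD
            role: MY_ROLE
    sink:
        type: datahub-rest
        config:
            server: 'http://localhost:8080'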
  • nutritious-bird-77396

    03/02/2022, 6:46 PM
    Has anyone in the community used Kafka Connect ingestion with the username and password config parameters? The documentation states it supports them - https://datahubproject.io/docs/metadata-ingestion/source_docs/kafka-connect But in the code I don't see the creds passed for the connection - https://github.com/linkedin/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/source/kafka_connect.py#L728
  • nutritious-bird-77396

    03/02/2022, 9:26 PM
    How does the platform_instance_map parameter work in the Kafka Connect ingestion connector? If there are 2 different postgres instances, each with its own platform_instance name such as instance1 and instance2, how will the map parameter look? I'm not sure how it would work for the same platform having multiple instances with the example - https://datahubproject.io/docs/metadata-ingestion/source_docs/kafka-connect#config-details
  • powerful-nest-24866

    03/03/2022, 2:53 AM
    Hello all, is there a way we can create a new domain in DataHub other than by hitting a GraphQL endpoint? I'm specifically interested in creating a domain via the DataHub REST emitter. TIA
  • narrow-finland-38723

    03/03/2022, 12:35 PM
    Hi everyone! We're testing DataHub features and some questions have come up. Not sure if this is the right channel to ask, but I hope someone can help :) 1. Where should we create a recipe file - in the terminal or elsewhere? a. If in the terminal, what command do we use to create the recipe? The following doesn't seem to work (see screens in thread). b. If the recipe is set in the UI: when created in the UI, we can't see the results. What could we be doing wrong? (also see screens in thread)
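    In case it helps, a recipe is just a YAML file created with any text editor and then passed to the CLI with datahub ingest -c recipe.yml; a minimal sketch (all values below are placeholders):
    Copy code
    # recipe.yml - run with: datahub ingest -c recipe.yml
    source:
        type: mysql
        config:
            host_port: 'localhost:3306'
            username: datahub_user
            password: datahub_password
    sink:
        type: datahub-rest
        config:
            server: 'http://localhost:8080'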
  • high-toothbrush-90528

    03/03/2022, 2:25 PM
    Hi everybody! I am trying to create some containers and also display them under their Domains. I am using this JSON file for ingestion:
    Copy code
    [
      {
        "auditHeader":null,
        "entityType":"container",
        "entityUrn": "urn:li:container:DATAPR",
        "changeType":"UPSERT",
        "aspectName":"containerProperties",
        "aspect":{
          "value":"{\"name\": \"datahub_db\", \"description\": \"DPROD\" }",
          "contentType":"application/json"
        },
        "systemMetadata":null
      },
      {
        "auditHeader":null,
        "entityType":"container",
        "entityUrn": "urn:li:container:DATAPR",
        "changeType":"UPSERT",
        "aspectName":"domains",
        "aspect":{
          "value":"{\"domains\": [\"urn:li:domain:marketing\"] }",
          "contentType":"application/json"
        },
        "systemMetadata":null
      },
    
    
      {
        "auditHeader":null,
        "entityType":"dataset",
        "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:hive,SampleHiveDataset,PROD)",
        "changeType":"UPSERT",
        "aspectName":"container",
        "aspect":{
          "value":"{\"container\": \"urn:li:container:DATAPR\" }",
          "contentType":"application/json"
        },
        "systemMetadata":null
      }
    ]
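    For context, a JSON file of aspect records like this is typically ingested with the file source; a minimal sketch of such a recipe (the filename and server are placeholders):
    Copy code
    source:
        type: file
        config:
            filename: ./containers.json
    sink:
        type: datahub-rest
        config:
            server: 'http://localhost:8080'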
  • rich-policeman-92383

    03/03/2022, 6:28 PM
    Hi DataHub, while ingesting metadata from Hive to DataHub we are getting an error. Ingestion is done using the DataHub CLI.
    Copy code
    Caused by: org.postgresql.util.PSQLException: ERROR: duplicate key value violates unique constraint "pk_metadata_aspect_v2"
  • gifted-queen-80042

    03/03/2022, 7:43 PM
    Hi! I am working on data profiling for our MySQL data source. I can run it locally with profiling.enabled: True in my recipe. I can see datasetProfile aspects in the output, along with the values in the respective aspects being produced, in terms of rowCount, columnCount, fieldProfiles, etc. However, when I ingest into the UI, with the sink type set to datahub-kafka, I don't see the data. The Stats tab remains disabled. Any idea why this might be happening?
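    One way to narrow this down, as a hedged suggestion: point the same recipe at the datahub-rest sink; if the datasetProfile aspects then show up and the Stats tab enables, the profiling side is fine and the problem is likely in the Kafka consumption path (mce/mae consumers) rather than in the recipe. The GMS address below assumes a quickstart deployment:
    Copy code
    sink:
        type: datahub-rest
        config:
            server: 'http://localhost:8080'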
  • mysterious-nail-70388

    03/04/2022, 5:30 AM
    Hi, this exception occurred when I ingested Oracle metadata. What should I do?