# ingestion
  • witty-kilobyte-6731 (04/04/2021, 2:03 PM)
    How would the functionality and implementation compare with the Atlas Hive hook, for example? https://atlas.apache.org/1.2.0/Hook-Hive.html
  • high-hospital-85984 (04/06/2021, 9:43 AM)
    @green-football-43791 we’re still running into some problems while trying to ingest tags. Error in thread.
  • faint-hair-91313 (04/06/2021, 10:00 AM)
    Hi guys, first of all, I really like the product and am seriously considering suggesting it for our organization; I've been looking everywhere for an elegant solution such as DataHub. I am doing a small PoC and would like to see it loaded with our data. We have Oracle as our main source, and saw that it is not yet supported. As a workaround I would like to ingest some manual extracts, i.e. from a file source, but I have difficulty understanding how to get the file into a shape the ingestor can understand, and I couldn't find anything in your documentation. E.g. we have some metadata already trapped in some tables that could easily be exported as JSON or CSV, or anything else. What are the required fields, data structure, etc. to ingest this?
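
    (For context: the file source consumes a JSON array of serialized MetadataChangeEvents (MCEs), so the shape to target is one MCE per table. A minimal sketch of building such an event with the Python classes and pushing it straight to GMS over REST instead; the Oracle platform and table names here are placeholders:)

    # Sketch, assuming acryl-datahub is installed and GMS runs on localhost:8080.
    # The same objects, dumped as a JSON list into a file, are what
    # `source: type: file` reads.
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import (
        DatasetPropertiesClass,
        DatasetSnapshotClass,
        MetadataChangeEventClass,
    )

    mce = MetadataChangeEventClass(
        proposedSnapshot=DatasetSnapshotClass(
            urn="urn:li:dataset:(urn:li:dataPlatform:oracle,myschema.mytable,PROD)",
            aspects=[DatasetPropertiesClass(description="Manual extract from Oracle")],
        )
    )
    DatahubRestEmitter(gms_server="http://localhost:8080").emit_mce(mce)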
  • wonderful-quill-11255 (04/09/2021, 8:52 AM)
    Hi. Regarding the Python ingestion framework, would it make sense to introduce a step between the source and sink where you could attach generic transformations of the events? My use case right now is to be able to modify dataset names. I want to avoid writing custom sources or sinks for just this small reason; if there were a middle step in the pipeline I could just plug in my name transformer and be done. WDYT?
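
    (This is roughly the idea behind what the framework later shipped as "transformers". A hypothetical sketch of such a middle step; the function name and generator wiring are illustrative, not the shipped API:)

    # Hypothetical middle step between source and sink: rewrite dataset URNs in
    # each MetadataChangeEvent as it streams by.
    from typing import Iterable

    from datahub.metadata.schema_classes import (
        DatasetSnapshotClass,
        MetadataChangeEventClass,
    )

    def rename_datasets(
        events: Iterable[MetadataChangeEventClass], prefix: str
    ) -> Iterable[MetadataChangeEventClass]:
        for mce in events:
            snapshot = mce.proposedSnapshot
            if isinstance(snapshot, DatasetSnapshotClass):
                # urn:li:dataset:(urn:li:dataPlatform:postgres,NAME,PROD)
                # -> insert the prefix in front of NAME (after the first comma).
                snapshot.urn = snapshot.urn.replace(",", f",{prefix}", 1)
            yield mce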
  • witty-florist-25216 (04/09/2021, 9:24 AM)
    Hi, we're currently looking into an MLflow ingestion connection, especially the DataHub modelling side. Do you have any updates on model-management metadata models? We map MLflow objects to datasets to start developing the ingestion chain, but I guess a custom model would be nice.
  • faint-hair-91313 (04/09/2021, 11:55 AM)
    Hey guys, do you support Spark in your ingestion framework? It could go via JDBC/ODBC.
  • better-orange-49102 (04/11/2021, 3:58 PM)
    I modified the sample_recipe.yml in docker/ingestion/quickstart.sh to point to a PostgreSQL docker instance to try it out. When the sink is set to console the output looks OK, but when I set the sink to datahub-rest and tried to start the ingestion container, it threw the following error; how might I troubleshoot this?
    ingestion    | [2021-04-11 15:51:17,986] ERROR    {datahub.ingestion.run.pipeline:41} - failed to write record with workunit public.mytable with ('Unable to emit metadata to DataHub GMS', {'message': "HTTPConnectionPool(host='localhost', port=8080): Max retries exceeded with url: /datasets?action=ingest (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f14a3261790>: Failed to establish a new connection: [Errno 111] Connection refused'))"}) and info {'message': "HTTPConnectionPool(host='localhost', port=8080): Max retries exceeded with url: /datasets?action=ingest (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f14a3261790>: Failed to establish a new connection: [Errno 111] Connection refused'))"}
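
    (A common cause when the recipe runs inside the ingestion container: localhost there is the container itself, not GMS. A sketch of the same pipeline driven from Python with the sink pointed at the GMS service name on the quickstart compose network; credentials and the postgres host are placeholders:)

    # Sketch: same postgres -> REST pipeline, but the sink targets `datahub-gms`
    # (the GMS hostname on the quickstart docker network) instead of `localhost`.
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(
        {
            "source": {
                "type": "postgres",
                "config": {
                    "username": "postgres",           # placeholder
                    "password": "example",            # placeholder
                    "host_port": "my-postgres:5432",  # hypothetical container name
                    "database": "postgres",
                },
            },
            "sink": {
                "type": "datahub-rest",
                "config": {"server": "http://datahub-gms:8080"},
            },
        }
    )
    pipeline.run()
    pipeline.raise_from_status()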
  • busy-accountant-26554 (04/13/2021, 1:53 PM)
    Hi all! When I look at DashboardInfo.pdl (https://github.com/linkedin/datahub/tree/master/metadata-models/src/main/pegasus/com/linkedin/dashboard) I see coverage for the metadata elements Title, Description, Last modified, and Last reloaded, plus a few more. And I guess owner is a common element for all entities in DataHub. But there seems to be no Dashboard aspect for miscellaneous properties, and there can be quite a few of those, in my experience with Qlik apps. Would the ability to add misc properties (name-value pairs) for Dashboards be useful to anyone else? Or did I miss something?
  • broad-flag-97458 (04/14/2021, 8:59 PM)
    Hi everyone, I’ve been messing about with the postgres ingest and so far so good! One question I have: we have 100+ databases on one postgres cluster. Is it possible to prefix the database to the WorkUnit? For example, looking at the datasets I have dozens of tables in the ‘public’ schema - but they’re actually in several different databases.
  • calm-sunset-28996 (04/15/2021, 8:52 AM)
    Did some rules get stricter yesterday? We have a bunch of our ingestion jobs failing now because of missing properties.
  • icy-easter-2378 (04/15/2021, 3:14 PM)
    Question on installing the Python plugin dependencies: I tried pip install 'acryl-datahub[mysql]' but I get this error:
  • icy-easter-2378 (04/15/2021, 3:15 PM)
    Collecting acryl-datahub[mysql]
      Could not find a version that satisfies the requirement acryl-datahub[mysql] (from versions: )
    No matching distribution found for acryl-datahub[mysql]
  • icy-easter-2378 (04/15/2021, 3:22 PM)
    Okay, pip3 got me a step further I think. I'm going to push a little further there.
  • red-journalist-15118 (04/19/2021, 4:52 PM)
    Hi, I am new to DataHub and plan on using it, but I had a few questions: 1. How is the metadata brought into DataHub? I see there are ingestion scripts, but is there a way for each of the data sources to push the metadata to Kafka topics (a push-based architecture) instead of periodically calling the ingestion scripts? 2. Are owners added manually, or should there be an "owners" field in the JSON metadata? 3. How are table descriptions and column descriptions added? Are they created manually through the UI, or should there be a "description" field in the JSON metadata for both the tables and the columns?
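
    (On questions 2 and 3: owners and descriptions are just additional aspects on the snapshot you push, and column descriptions live on the fields inside SchemaMetadata. A sketch with placeholder URNs, assuming the Python REST emitter:)

    # Sketch: one dataset snapshot carrying both a description and an owner.
    # URNs and names are placeholders.
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import (
        DatasetPropertiesClass,
        DatasetSnapshotClass,
        MetadataChangeEventClass,
        OwnerClass,
        OwnershipClass,
        OwnershipTypeClass,
    )

    mce = MetadataChangeEventClass(
        proposedSnapshot=DatasetSnapshotClass(
            urn="urn:li:dataset:(urn:li:dataPlatform:kafka,orders,PROD)",
            aspects=[
                DatasetPropertiesClass(description="One event per checkout"),
                OwnershipClass(
                    owners=[
                        OwnerClass(
                            owner="urn:li:corpuser:jdoe",
                            type=OwnershipTypeClass.DATAOWNER,
                        )
                    ]
                ),
            ],
        )
    )
    DatahubRestEmitter(gms_server="http://localhost:8080").emit_mce(mce)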
  • calm-addition-66352 (04/19/2021, 11:17 PM)
    Hi All, I am trying to ingest metadata from an MSSQL database. The command I am using is "datahub ingest -c ./mssql-recipe.yml", and my recipe file looks as below:
    source:
      type: mssql
      config:
        username: <user_name>
        password: <password>
        host_port: <ip>:1433
        database: <db_name>
        table_pattern:
          #deny:
          #  - "^.*\\.sys_.*" # deny all tables that start with sys_
          allow:
            - "schema_name_1.*"
    
    sink:
      type: "datahub-rest"
      config:
        server: "http://localhost:8080"
    But when I try it, I get the below error:
    sqlalchemy.exc.OperationalError: (pytds.tds_base.OperationalError) Database '<organization_name>\<dba_name>' does not exist. Make sure that the name is entered correctly.
    [SQL: use [<organization_name>\<dba_name>]]
    • Is there a way to print or get the underlying queries submitted to the database, so I can figure out why it tries to query an object with the DBA's name (probably a previously deleted item / user account, etc.)?
    • Is there a list of permissions that need to be assigned to the DB user we use for the crawler / ingestion? The current user I am trying with has some additional server-level access, and I was wondering whether that surfaces additional metadata of the server that is not expected (e.g. DBA user names, etc.).
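
    (On the first question: the SQL sources run on SQLAlchemy, so enabling SQLAlchemy's engine logger before the pipeline runs should print every statement issued. This is plain SQLAlchemy logging, not a DataHub-specific flag; a sketch, with the same placeholder config as the recipe:)

    # Sketch: log every SQL statement SQLAlchemy issues during ingestion by
    # driving the recipe from Python with the engine logger turned up.
    import logging

    from datahub.ingestion.run.pipeline import Pipeline

    logging.basicConfig()
    logging.getLogger("sqlalchemy.engine").setLevel(logging.INFO)

    Pipeline.create(
        {
            "source": {
                "type": "mssql",
                "config": {
                    "username": "user",            # placeholders, as in the recipe
                    "password": "password",
                    "host_port": "10.0.0.1:1433",
                    "database": "db_name",
                },
            },
            "sink": {
                "type": "datahub-rest",
                "config": {"server": "http://localhost:8080"},
            },
        }
    ).run()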
  • calm-addition-66352 (04/20/2021, 6:32 AM)
    Hi All, just wondering, are there any examples of ingesting dashboard/chart metadata that I can refer to?
  • calm-addition-66352 (04/21/2021, 8:17 AM)
    Hi All, just wondering whether I can use an MCE JSON file to ingest dashboard/chart metadata using the file-to-REST method. mce.json:
    {
      "auditHeader": null,
      "proposedSnapshot": {
        "com.linkedin.pegasus2avro.metadata.snapshot.DashboardSnapshot": {
          "urn": "urn:li:dashboard:sample",
          "aspects": [
            {
              "com.linkedin.pegasus2avro.dataset.DashboardInfo": {
                "title": "Sample Dashboard",
                "description": "This is a sample dashboard to test mce events"
              }
            },
            {
              "com.linkedin.pegasus2avro.common.Ownership": {
                "owners": [
                  {
                    "owner": "urn:li:corpuser:bi-analyst",
                    "type": "DEVELOPER"
                  }
                ]
              }
            }
          ]
        }
      },
      "proposedDelta": null
    }
    metadata.yml
    source:
      type: "file"
      config:
        filename: ./mce.json
    
    sink:
      type: "datahub-rest"
      config:
        server: "http://localhost:8080"
    I get an error saying "__root__, MetadataFileSourceConfig expected dict not str (type=type_error)". So I am wondering whether this method doesn't support Dashboard/Chart metadata, or whether I am missing a property or a value in my MCE JSON 🙂
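
    (One thing that stands out in the mce.json above: DashboardInfo lives in the dashboard namespace, i.e. com.linkedin.pegasus2avro.dashboard.DashboardInfo rather than ...dataset..., and the aspect also wants a lastModified stamp. A sketch of the equivalent snapshot built with the typed Python classes, which validate aspect names and required fields up front; the URN, actor, and timestamp are placeholders:)

    # Sketch: the same dashboard + ownership, built with the typed classes.
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import (
        AuditStampClass,
        ChangeAuditStampsClass,
        DashboardInfoClass,
        DashboardSnapshotClass,
        MetadataChangeEventClass,
        OwnerClass,
        OwnershipClass,
    )

    stamp = AuditStampClass(time=1619000000000, actor="urn:li:corpuser:bi-analyst")
    mce = MetadataChangeEventClass(
        proposedSnapshot=DashboardSnapshotClass(
            urn="urn:li:dashboard:(sample_tool,sample)",
            aspects=[
                DashboardInfoClass(
                    title="Sample Dashboard",
                    description="This is a sample dashboard to test MCE events",
                    lastModified=ChangeAuditStampsClass(
                        created=stamp, lastModified=stamp
                    ),
                ),
                OwnershipClass(
                    owners=[
                        OwnerClass(
                            owner="urn:li:corpuser:bi-analyst", type="DEVELOPER"
                        )
                    ]
                ),
            ],
        )
    )
    DatahubRestEmitter(gms_server="http://localhost:8080").emit_mce(mce)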
  • calm-sunset-28996 (04/21/2021, 8:59 AM)
    I'm adding groups to the LDAP ingestion; is this something of general use? Then I can make a PR for it.
  • modern-nest-69826 (04/21/2021, 12:07 PM)
    Hi, I am trying to load the example data. Loading this dataset fails after a few entries, depending on the command used for ingestion. I tried both "./scripts/datahub_docker.sh ingest -c ./examples/recipes/example_to_datahub_rest.yml" and "datahub ingest -c ./examples/recipes/mssql_to_datahub.yml". What happens seems to differ slightly (where it goes wrong, error info provided). One reports "Pipeline finished with failures", the other "Pipeline finished successfully", though both are incomplete in their processing. All docker containers are up and running, and all plugins have been installed. Both "datahub check local-docker" and "datahub check plugins" show no issues. Am I doing something wrong, or is there an error in the content of the examples, or...?
  • modern-nest-69826 (04/21/2021, 12:18 PM)
    Hi, I also tried to apply a recipe to access a Postgres database's metadata. The PG database has the ssl=require option set, and I could not find documentation on how to set this option. Without it, the ingestion process reports: "SSL connection is required. Please specify SSL options and retry." What needs to be set in the recipe to make this work?
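
    (The SQLAlchemy-based sources take an options block that is passed through to create_engine, so SSL can likely be supplied as psycopg2 connect_args. A sketch; credentials and host are placeholders, and the exact options support may vary by version:)

    # Sketch: pass sslmode through to psycopg2 via the source's `options` block,
    # which SQLAlchemy-based sources forward to create_engine as kwargs.
    from datahub.ingestion.run.pipeline import Pipeline

    Pipeline.create(
        {
            "source": {
                "type": "postgres",
                "config": {
                    "username": "user",              # placeholder
                    "password": "password",          # placeholder
                    "host_port": "pg.example.com:5432",
                    "database": "mydb",
                    "options": {"connect_args": {"sslmode": "require"}},
                },
            },
            "sink": {
                "type": "datahub-rest",
                "config": {"server": "http://localhost:8080"},
            },
        }
    ).run()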
  • calm-addition-66352 (04/22/2021, 8:54 PM)
    Hi All, I am using the simple acryl-datahub[mssql] to ingest metadata from an MSSQL database into DataHub. Is there a way to pass upstream and ownership data at the same time, or do I need to pass that separately? 🙂
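
    (The SQL crawler itself only emits schema metadata; lineage and ownership can be pushed separately. For table-level lineage there is a builder helper; a sketch with placeholder dataset names:)

    # Sketch: emit table-level lineage separately from the mssql crawl,
    # using the mce_builder helper. Dataset names are placeholders.
    import datahub.emitter.mce_builder as builder
    from datahub.emitter.rest_emitter import DatahubRestEmitter

    lineage_mce = builder.make_lineage_mce(
        [builder.make_dataset_urn("mssql", "db.schema.source_table")],  # upstreams
        builder.make_dataset_urn("mssql", "db.schema.derived_table"),   # downstream
    )
    DatahubRestEmitter(gms_server="http://localhost:8080").emit_mce(lineage_mce)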
  • busy-accountant-26554 (04/23/2021, 11:16 AM)
    Hi all, does anyone know why there is no fabric type for Dashboard entities, as opposed to Dataset entities?
  • steep-pizza-15641 (04/23/2021, 2:03 PM)
    Hi all, newbie question: I'd like to test the DataHub ingestion backend described here: https://datahubproject.io/docs/metadata-ingestion. What is the quickest way to get the DataHub lineage module packaged in Airflow? We are using the default image here: https://airflow.apache.org/docs/apache-airflow/stable/start/docker.html. I have tried adding the acryl libraries using ADDITIONAL_PYTHON_DEPS and rebuilding the images, but ran into a lot of version conflicts. Any pointers on how other teams get started?
  • handsome-airplane-62628 (04/23/2021, 3:05 PM)
    Hello All! We have a Snowflake/dbt/Looker tech stack. We've ingested Snowflake data as well as dbt JSON files; however, DataHub created 2 separate datasets rather than linking the Snowflake assets to the dbt models. Has anyone tried this? Any thoughts/ideas on how to de-duplicate / clean it up?
  • delightful-plumber-77060 (04/23/2021, 5:04 PM)
    Hello everyone! We are ingesting metadata from the Avro schema registry (through Kafka ingestion). A lot of our schemas have default values in Avro; is there a way to show them in the UI?
  • steep-pizza-15641 (04/26/2021, 1:01 PM)
    Hi, another newbie question re the Airflow integration. The example here shows how to ingest a MySQL schema: https://github.com/linkedin/datahub/blob/master/metadata-ingestion/examples/airflow/mysql_sample_dag.py. Let's say I wanted to tag the ingested database as "System of Record" and one column as "PII"; would it be possible to get examples of how to do that? I'd like to specify the tags in code rather than edit them manually.
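
    (A sketch of attaching a dataset-level tag from code, using the typed Python classes; the dataset URN and tag name are placeholders. A column-level tag like "PII" would go on the globalTags of the corresponding schema field rather than on the dataset itself:)

    # Sketch: tag a dataset as "System of Record" from code.
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import (
        DatasetSnapshotClass,
        GlobalTagsClass,
        MetadataChangeEventClass,
        TagAssociationClass,
    )

    mce = MetadataChangeEventClass(
        proposedSnapshot=DatasetSnapshotClass(
            urn="urn:li:dataset:(urn:li:dataPlatform:mysql,db.my_table,PROD)",
            aspects=[
                GlobalTagsClass(
                    tags=[TagAssociationClass(tag="urn:li:tag:SystemOfRecord")]
                )
            ],
        )
    )
    DatahubRestEmitter(gms_server="http://localhost:8080").emit_mce(mce)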
  • calm-addition-66352 (04/26/2021, 11:36 PM)
    Hey Team, a few noob questions on the Airflow integration. I have an Amazon managed Airflow environment (v1.10.12). I am planning to try the https://datahubproject.io/docs/metadata-ingestion#emitting-lineage-via-a-separate-operator example. So my questions are: 1. In order to use the datahub library in my DAG, should I install the acryl-datahub Python package, or is there a different one for Airflow? 2. I am trying to add the DataHub REST connection through the UI; if I have installed the acryl-datahub package, should I see a "datahub" option in the drop-down, or do I need to select a different option? [screenshot attached]
  • calm-sunset-28996 (04/27/2021, 11:57 AM)
    Is there a specific reason not to include all the connectors by default in the Pipfile? They are only a few Python modules anyway, right? So it should be pretty light. I'm refactoring our build to use the Pipfile instead of the Docker image, but it seems a bit cumbersome to define every source/sink manually as extras. (We are using pipenv.)
  • salmon-translator-27951 (04/27/2021, 1:55 PM)
    I am new to DataHub and currently trying to get familiar with the architecture. When reading about the metadata service, I couldn't find BaseRemoteDAO.java. It seems that this file has been removed, but it's still referred to in RestliRemoteDAO.java. Can somebody tell me which file com.linkedin.metadata.dao refers to?
  • curved-magazine-23582 (04/27/2021, 2:28 PM)
    Hello team, I am playing with ingestion of charts. I noticed there is a Properties section in the chart UI. What data corresponds to this section? For the dataset entity, a similar section corresponds to the dataset aspect com.linkedin.dataset.DatasetProperties, but I don't see such an aspect for charts. 🤔