# ingestion
  • b

    brainy-crayon-53549

    07/04/2022, 6:13 AM
    I was trying to connect to Airflow using this doc https://datahubproject.io/docs/lineage/airflow, but when running step 3 I'm getting this error
  • p

    proud-baker-56489

    07/04/2022, 6:55 AM
    hello, I used the command “datahub ingest -c xxx.yaml” to ingest a Hive table, but the COMMENT on the table did not show up in DataHub. Is it a bug, or does DataHub not support importing Hive comments?
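    (For context, a minimal Hive recipe sketch; the host and credentials here are placeholders. To my understanding the Hive source maps table and column COMMENTs to descriptions by default, so no extra flag should be needed.)
    source:
        type: hive
        config:
            host_port: "hive-server:10000"
            username: hive_user
            password: hive_pass
    sink:
        type: "datahub-rest"
        config:
            server: "http://localhost:8080"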
  • b

    bland-smartphone-67838

    07/04/2022, 6:55 AM
    Hi! I want to create a custom source for Dremio. Maybe someone has already integrated a connection, or knows some guides on how to integrate it (beyond the official docs)?
  • p

    proud-baker-56489

    07/04/2022, 6:56 AM
    [attached: image.png, image.png]
  • m

    mysterious-nail-70388

    07/04/2022, 8:08 AM
    Hi, do we support using an existing ES cluster as a DataHub component?
  • m

    mysterious-nail-70388

    07/04/2022, 9:52 AM
    I built the DataHub metadata ingestion client into an image and started the container to ingest the metadata. It took about 5 minutes before each ingestion started. How should I optimize this?
  • g

    gray-hair-27030

    07/04/2022, 9:39 PM
    hello, I'm trying to make the Airflow connection with DataHub, but when importing the DAG it throws a library error. I already installed the acryl-datahub[airflow]==0.8.40.2 library in the worker and webserver containers, but it keeps throwing the same problem. Am I missing another library?
  • m

    magnificent-camera-71872

    07/05/2022, 7:37 AM
    The docs suggest downloading the raw OpenAPI yaml/json for the DataHub API and using a code generator to auto-generate client-side code. If anyone has attempted codegen'ing a client for Python I'd be very interested to hear about your experiences - and in particular which code generator you used. Cheers -- Simon
  • b

    billions-twilight-48559

    07/05/2022, 12:41 PM
    Hi, when crawling Hive objects, are we able to ingest the column descriptions in Hive as column descriptions in DataHub? At the moment it is not picking them up.
  • a

    astonishing-byte-5433

    07/05/2022, 1:06 PM
    Hey, I tried to find anything in the docs and in this Slack channel: is there any documentation of the Python metadata classes? This link doesn't seem to work: https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/metadata/schema_classes.py
  • b

    bland-balloon-48379

    07/05/2022, 3:56 PM
    Hey everyone, I have a use case where I want to set some dataset properties and owners based on elements parsed from the description. I want to create a custom transformer to do this and read through the documentation on how to do so, but the example provided is for setting an aspect value from an outside source. What I want to know is: is it possible to set an aspect's value using the value of another aspect at ingestion time? Based on the documentation, since the aspect_name() function returns just a single string, it seems like the answer is no, unless I can somehow pull and set that information during the __init__() or create() functions. I found the following threads which seem to be related as well: 1. https://datahubspace.slack.com/archives/CUMUWQU66/p1655191762153639 2. https://datahubspace.slack.com/archives/CUMUWQU66/p1655796024377539 I think my use case is closest to sibren's (2), but Thomas Larsson (1) talks about the issue I have of a transformer needing multiple aspects to perform its job. I saw Chris suggested making API calls to datahub (or I presume to datahub), but if the dataset doesn't already exist in datahub, would that be possible?
  • g

    gray-architect-29447

    07/06/2022, 1:37 AM
    hi all, I'm having an error while ingesting MSSQL data into DataHub. It says the DB connection URL couldn't be parsed. Below is the error returned, and I've also shared my ingestion YAML file here. Am I missing something?
    source:
        type: mssql
        config:
            # Coordinates
            host_port: '192.168.1.1:1433'
            database: 'prod-db-2'
            scheme: 'PROD-DB2'
            # Credentials
            username: db2admin
            password: "password*"
    sink:
        type: "datahub-rest"
        config:
            server: "<http://localhost:6080>"
    transformers:
        - type: "simple_add_dataset_tags"
          config:
              tag_urns:
                  - "urn:li:tag:db2"
    ArgumentError: Could not parse rfc1738 URL from string '<PROD-DB2://db2admin:password%2A@192.168.1.1:1433/prod-db-2>'
    [2022-07-06 01:26:39,723] INFO     {datahub.entrypoints:176} - DataHub CLI version: 0.8.34.2 at /usr/local/lib/python3.8/dist-packages/datahub/__init__.py
    [2022-07-06 01:26:39,723] INFO     {datahub.entrypoints:179} - Python version: 3.8.10 (default, Mar 15 2022, 12:22:08)
    [GCC 9.4.0] at /usr/bin/python3 on Linux-5.4.0-104-generic-x86_64-with-glibc2.29
    [2022-07-06 01:26:39,723] INFO     {datahub.entrypoints:182} - GMS config {'models': {}, 'versions': {'linkedin/datahub': {'version': 'v0.8.34', 'commit': 'f847fa31c9010bbb9df0d13ae7660e59083ea03e'}}, 'managedIngestion': {'defaultCliVersion': '0.8.34.1', 'enabled': True}, 'statefulIngestionCapable': True, 'supportsImpactAnalysis': True, 'telemetry': {'enabledCli': True, 'enabledIngestion': False}, 'datasetUrnNameCasing': False, 'retention': 'true', 'noCode': 'true'}
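    (The parse failure above is consistent with scheme being set to the database name. In the MSSQL source, scheme is expected to be a SQLAlchemy dialect - typically something like mssql+pytds - while the database name belongs under database. A hedged sketch of the source section, keeping the same coordinates:)
    source:
        type: mssql
        config:
            host_port: '192.168.1.1:1433'
            database: 'prod-db-2'
            # scheme is the SQLAlchemy dialect, not the database name
            scheme: 'mssql+pytds'
            username: db2admin
            password: "password*"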
  • b

    brash-sundown-77702

    07/06/2022, 5:07 AM
    Hi team, using recipe.yml I need to do metadata ingestion and sink to the DataHub server's Kafka. I don't see a sink recipe config for Kafka in the DataHub project docs. Can someone please help me? Please note that I am using a remote DataHub client to execute the recipe via the CLI, so I need to specify the IP address and port number of the DataHub server's Kafka.
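    (For reference, a minimal datahub-kafka sink sketch pointing at a remote DataHub server's Kafka; the hostname and ports below are placeholders for your own deployment.)
    sink:
        type: "datahub-kafka"
        config:
            connection:
                bootstrap: "<datahub-server-host>:9092"
                schema_registry_url: "http://<datahub-server-host>:8081"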
  • b

    bright-cpu-56427

    07/06/2022, 5:41 AM
    Hi team, is it possible to replace the logo URL with the path of a local logo file to create a custom data platform?
  • b

    brash-sundown-77702

    07/06/2022, 5:51 AM
    Hi team, I am using a DataHub client to do metadata ingestion from MySQL to Kafka, but I am getting the following error when I run the recipe via the CLI:
    "Connect to ipv4#127.0.0.1:9092 failed: Connection refused (after 0ms in state CONNECT, 4 identical error(s) suppressed)."
    Looks like it is trying to connect to localhost:9092 instead of the remote server's Kafka. Here is my recipe .yaml:
    source:
        type: "mysql"
        config:
            env: "DEV"
            username: datahub
            password: datahub
            host_port: <RomoteIPAddr>:3306
    sink:
        type: "datahub-kafka"
        config:
            connection:
                bootstrap: "<RomoteIPAddr>:9092"
                schema_registry_url: "http//<RomoteIPAddr>8081"
  • b

    brash-sundown-77702

    07/06/2022, 5:53 AM
    I expect it to talk to <RomoteIPAddr>:9092, but it is talking to the local machine's (DataHub client) 9092.
  • b

    brash-sundown-77702

    07/06/2022, 5:53 AM
    Please help
  • l

    lemon-zoo-63387

    07/06/2022, 6:08 AM
    Hello everyone, how can I avoid this kind of warning? Although the UI is still showing warnings, the metadata has been ingested. Thanks in advance for your help!
    "              'CRM.crmqas_FE.systemuser': ['unable to map type UNIQUEIDENTIFIER() to metadata schema',\n"
               "                                           'unable to map type BIT() to metadata schema',\n"
               "                                           'unable to map type BIT() to metadata schema',\n"
               "                                           'unable to map type BIT() to metadata schema',\n"
               "                                           'unable to map type BIT() to metadata schema',\n"
               "                                           'unable to map type BIT() to metadata schema',\n"
               "                                           'unable to map type BIT() to metadata schema',\n"
               "                                           'unable to map type BIT() to metadata schema',\n"
               "                                           'unable to map type BIT() to metadata schema',\n"
               "                                           'unable to map type BIT() to metadata schema',\n"
               "                                           'unable to map type BIT() to metadata schema',\n"
               "                                           'unable to map type BIT() to metadata schema',\n"
               "                                           'unable to map type UNIQUEIDENTIFIER() to metadata schema',\n"
               "                                           'unable to map type UNIQUEIDENTIFIER() to metadata schema',\n"
               "                                           'unable to map type UNIQUEIDENTIFIER() to metadata schema',\n"
               "                                           'unable to map type UNIQUEIDENTIFIER() to metadata schema',\n"
               "                                           'unable to map type UNIQUEIDENTIFIER() to metadata schema',\n"
               "                                           'unable to map type UNIQUEIDENTIFIER() to metadata schema',\n"
               "                                           'unable to map type UNIQUEIDENTIFIER() to metadata schema',\n"
               "                                           'unable to map type UNIQUEIDENTIFIER() to metadata schema',\n"
               "                                           'unable to map type UNIQUEIDENTIFIER() to metadata schema',\n"
               "                                           'unable to map type UNIQUEIDENTIFIER() to metadata schema',\n"
               "                                           'unable to map type UNIQUEIDENTIFIER() to metadata schema',\n"
               "                                           'unable to map type UNIQUEIDENTIFIER() to metadata schema',\n"
               "                                           'unable to map type UNIQUEIDENTIFIER() to metadata schema',\n"
               "                                           'unable to map type UNIQUEIDENTIFIER() to metadata schema',\n"
               "                                           'unable to map type UNIQUEIDENTIFIER() to metadata schema',\n"
               "                                           'unable to map type UNIQUEIDENTIFIER() to metadata schema',\n"
               "                                           'unable to map type UNIQUEIDENTIFIER() to metadata schema',\n"
               "                                           'unable to map type UNIQUEIDENTIFIER() to metadata schema',\n"
               "                                           'unable to map type UNIQUEIDENTIFIER() to metadata schema',\n"
               "                                           'unable to map type UNIQUEIDENTIFIER() to metadata schema',\n"
               "                                           'unable to map type UNIQUEIDENTIFIER() to metadata schema',\n"
               "                                           'unable to map type UNIQUEIDENTIFIER() to metadata schema',\n"
               "                                           'unable to map type UNIQUEIDENTIFIER() to metadata schema',\n"
               "                                           'unable to map type UNIQUEIDENTIFIER() to metadata schema',\n"
               "                                           'unable to map type UNIQUEIDENTIFIER() to metadata schema',\n"
               "                                           'unable to map type UNIQUEIDENTIFIER() to metadata schema',\n"
               "                                           'unable to map type UNIQUEIDENTIFIER() to metadata schema',\n"
               "                                           'unable to map type UNIQUEIDENTIFIER() to metadata schema',\n"
               "                                           'unable to map type UNIQUEIDENTIFIER() to metadata schema',\n"
               "                                           'unable to map type UNIQUEIDENTIFIER() to metadata schema',\n"
               "                                           'unable to map type UNIQUEIDENTIFIER() to metadata schema',\n"
               "                                           'unable to map type UNIQUEIDENTIFIER() to metadata schema'],\n"
               "              'CRM.crmqas_FE.transactioncurrency': ['unable to map type UNIQUEIDENTIFIER() to metadata schema',\n"
               "                                                    'unable to map type UNIQUEIDENTIFIER() to metadata schema',\n"
               "                                                    'unable to map type UNIQUEIDENTIFIER() to metadata schema',\n"
               "                                                    'unable to map type UNIQUEIDENTIFIER() to metadata schema',\n"
               "                                                    'unable to map type UNIQUEIDENTIFIER() to metadata schema',\n"
               "                                                    'unable to map type UNIQUEIDENTIFIER() to metadata schema',\n"
               "                                                    'unable to map type UNIQUEIDENTIFIER() to metadata schema',\n"
               "                                                    'unable to map type UNIQUEIDENTIFIER() to metadata schema',\n"
               "                                                    'unable to map type UNIQUEIDENTIFIER() to metadata schema']},\n"
               ' \'failures\': {\'DELTA\\\\EDITH.CE.CHANG\': ["Tables error: (pytds.tds_base.OperationalError) Database \'DELTA\\\\EDITH.CE\' does not '
               'exist. Make sure that "\n'
               "                                        'the name is entered correctly.\\n'\n"
               "                                        '[SQL: use [DELTA\\\\EDITH.CE]]\\n'\n"
               "                                        '(Background on this error at: <http://sqlalche.me/e/13/e3q8)>',\n"
               '                                        "Views error: (pytds.tds_base.OperationalError) Database \'DELTA\\\\EDITH.CE\' does not exist. '
               'Make sure that "\n'
               "                                        'the name is entered correctly.\\n'\n"
               "                                        '[SQL: use [DELTA\\\\EDITH.CE]]\\n'\n"
               "                                        '(Background on this error at: <http://sqlalche.me/e/13/e3q8)']>},\n"
               " 'cli_version': '0.8.38',\n"
               " 'cli_entry_location': '/tmp/datahub/ingest/venv-73071ee2-6365-4acf-b3e5-8fcaa08684dd/lib/python3.9/site-packages/datahub/__init__.py',\n"
               " 'py_version': '3.9.9 (main, Dec 21 2021, 10:03:34) \\n[GCC 10.2.1 20210110]',\n"
               " 'py_exec_path': '/tmp/datahub/ingest/venv-73071ee2-6365-4acf-b3e5-8fcaa08684dd/bin/python3',\n"
               " 'os_details': 'Linux-3.10.0-1160.62.1.el7.x86_64-x86_64-with-glibc2.31',\n"
               " 'tables_scanned': 4,\n"
               " 'views_scanned': 167,\n"
               " 'entities_profiled': 0,\n"
               " 'filtered': [],\n"
               " 'soft_deleted_stale_entities': [],\n"
               " 'query_combiner': None}\n"
               'Sink (datahub-rest) report:\n'
               "{'records_written': 755,\n"
               " 'warnings': [],\n"
               " 'failures': [],\n"
               " 'downstream_start_time': datetime.datetime(2022, 6, 25, 4, 0, 7, 991097),\n"
               " 'downstream_end_time': datetime.datetime(2022, 6, 25, 4, 0, 32, 45487),\n"
               " 'downstream_total_latency_in_seconds': 24.05439,\n"
               " 'gms_version': 'v0.8.38'}\n"
               '\n'
               'Pipeline finished with 2 failures in source producing 755 workunits\n',
               "2022-06-25 04:00:55.833939 [exec_id=73071ee2-6365-4acf-b3e5-8fcaa08684dd] INFO: Failed to execute 'datahub ingest'",
               '2022-06-25 04:00:55.834347 [exec_id=73071ee2-6365-4acf-b3e5-8fcaa08684dd] INFO: Caught exception EXECUTING '
               'task_id=73071ee2-6365-4acf-b3e5-8fcaa08684dd, name=RUN_INGEST, stacktrace=Traceback (most recent call last):\n'
               '  File "/usr/local/lib/python3.9/site-packages/acryl/executor/execution/default_executor.py", line 121, in execute_task\n'
               '    self.event_loop.run_until_complete(task_future)\n'
               '  File "/usr/local/lib/python3.9/site-packages/nest_asyncio.py", line 89, in run_until_complete\n'
               '    return f.result()\n'
               '  File "/usr/local/lib/python3.9/asyncio/futures.py", line 201, in result\n'
               '    raise self._exception\n'
               '  File "/usr/local/lib/python3.9/asyncio/tasks.py", line 256, in __step\n'
               '    result = coro.send(None)\n'
               '  File "/usr/local/lib/python3.9/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 115, in execute\n'
               '    raise TaskError("Failed to execute \'datahub ingest\'")\n'
               "acryl.executor.execution.task.TaskError: Failed to execute 'datahub ingest'\n"]}
    Execution finished with errors.
  • b

    busy-waiter-6669

    07/06/2022, 6:12 AM
    Hi guys, is there any way to add something to the documentation of DataHub? I would like to help and add some content on how to ingest data via JSON files (e.g. MlModels, etc.). I would also like to add example calls for GraphQL. Would that be helpful?
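    (As a concrete sketch of file-based ingestion: a recipe that loads metadata events from a local JSON file, assuming a hypothetical file named mlmodel_mces.json and a local GMS.)
    source:
        type: file
        config:
            filename: ./mlmodel_mces.json
    sink:
        type: "datahub-rest"
        config:
            server: "http://localhost:8080"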
  • m

    magnificent-camera-71872

    07/06/2022, 6:42 AM
    Hi all.... I'm trying to "save" the definition of a dataset such that I can easily restore it using the saved definition - even after a hard delete of the dataset from datahub. I was hoping to use the swagger-ui to GET the dataset and dump it to JSON, and then use the POST action to recreate it. However, it seems the format of the JSON delivered by GET is considerably different from that required by PUT??? Does anyone have a simple method of saving an entity and recreating it?
  • w

    wonderful-egg-79350

    07/06/2022, 7:49 AM
    Hi all! I am wondering what the meaning of "platform_instance" is in relation to file-based lineage. I actually saw 'platform_instance' in the DataHub docs, but I didn't understand it. In the docs, 'platform_instance' means 'The instance of the platform that all assets produced by this recipe belong to'. Could anybody explain the meaning of 'platform_instance' more simply, and is there any example?
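    (A hedged example of platform_instance: if you run two separate MySQL servers that each contain a database with the same name, giving each recipe its own platform_instance keeps their datasets from colliding in DataHub. The host and instance names below are made up.)
    source:
        type: mysql
        config:
            host_port: "mysql-emea.internal:3306"
            platform_instance: "emea"
            username: datahub
            password: datahub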
  • l

    late-bear-87552

    07/06/2022, 9:30 AM
    I wanted to onboard Metabase to DataHub, but in our organisation we use an org email ID to log in to Metabase, while the recipe asks for a username and password. Is there a way to use the email ID, or maybe some other way to onboard Metabase??
  • b

    bitter-dusk-52400

    07/06/2022, 9:36 AM
    Hi @big-carpet-38439 and @better-orange-49102 and datahub team, @tall-butcher-30509 I have tried to ingest the string below into a dataset description but couldn't do it, for an unknown reason. But if I try modifying it from the DataHub UI, I am able to do that. Can you guys help me with this issue? I have faced a similar issue while ingesting the descriptions of dataset columns. Failed description:
    {"tbl_lgcl_name_eng":"Stock Header","tbl_lgcl_name_lcl":"ストックヘッダ","tbl_desc":"entity name:Stock Header\\nlayer:cleansed\\nnote:Cleansed table of loaded.t_r_oiv_stk_hdr_jp\\n\\n#interface_logical_name_english:Stock Information Raw Data"}}
    MetadataChangeProposalWrapper as a string:
    MetadataChangeProposalWrapper(entityType=dataset, entityUrn=urn:li:dataset:(urn:li:dataPlatform:bigquery,dataset.bq.datalake.stg_dataset.table,STG), changeType=UPSERT, aspect={externalUrl="", customProperties={created_time=2022-02-24 10:47:08.296, created_by=event_driven}, description={"tbl_lgcl_name_eng":"Stock Header","tbl_lgcl_name_lcl":"ストックヘッダ","tbl_desc":"entity name:Stock Header\\nlayer:cleansed\\nnote:Cleansed table of loaded.t_r_oiv_stk_hdr_jp\\n\\n#interface_logical_name_english:Stock Information Raw Data"}}, aspectName=datasetProperties)
  • b

    better-orange-49102

    07/06/2022, 11:07 AM
    For the OpenAPI implementation, can I check whether edit ACL policies are enforced for update operations, similar to the GraphQL API? I know the REST endpoint and the Kafka sink do not.
  • l

    late-bear-87552

    07/06/2022, 11:33 AM
    I am trying to profile tables using allow_deny_patterns, but somehow it is profiling on the basis of table_pattern.
    source:
        type: mysql
        config:
            host_port: 'X.X.X.X:3306'
            username: x
            password: x
            platform: test-pattern
            include_tables: true
            include_views: true
            schema_pattern:
                ignoreCase: true
                allow:
                    - dp_datahub
            table_pattern:
                ignoreCase: true
                allow:
                    - 'dp_datahub.stocks_bse*'
            profiling:
                enabled: true
                bigquery_temp_table_schema: abc.datahub
                turn_off_expensive_profiling_metrics: false
                query_combiner_enabled: false
                max_number_of_fields_to_profile: 2
                profile_table_level_only: false
                include_field_null_count: true
                include_field_min_value: true
                include_field_max_value: true
                include_field_mean_value: true
                include_field_median_value: true
                include_field_stddev_value: false
                include_field_quantiles: false
                include_field_distinct_value_frequencies: false
                include_field_histogram: false
                include_field_sample_values: false
                allow_deny_patterns:
                  allow:
                      - dp_datahub.stocks_bse_ci_test.date
    sink:
        type: datahub-rest
        config:
            server: '<http://X.X.X.X:8080>'
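    (If the goal is to limit which tables get profiled, the SQL sources expose, to my knowledge, a source-level profile_pattern that is separate from table_pattern; a hedged sketch reusing the coordinates above:)
    source:
        type: mysql
        config:
            host_port: 'X.X.X.X:3306'
            username: x
            password: x
            table_pattern:
                allow:
                    - 'dp_datahub.stocks_bse*'
            profile_pattern:
                allow:
                    - 'dp_datahub.stocks_bse_ci_test.*'
            profiling:
                enabled: true
    sink:
        type: datahub-rest
        config:
            server: '<http://X.X.X.X:8080>'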
  • s

    square-solstice-69079

    07/06/2022, 11:49 AM
    Hello, what will be the difference between the new Delta Lake ingestion that you are working on and using Glue to crawl Delta Lake tables to make them available to DataHub?
  • f

    few-air-56117

    07/06/2022, 11:52 AM
    Hello folks, I see that DataHub versions the table/view schemas (columns). Does it also version the view definitions?
  • h

    handsome-alarm-6227

    07/06/2022, 4:15 PM
    Hi! I'm trying to ingest a new entity type I created; it doesn't seem to complain when I ingest it, but I can't see it in the UI. Is that normal? (more info in the thread)
  • c

    chilly-gpu-46080

    07/07/2022, 4:02 AM
    Hello, are there any plans to add support for ingestion from Dagster? It would be extremely useful!
  • b

    best-umbrella-24804

    07/07/2022, 4:09 AM
    Hello, I ran an ingestion and it picked up a database, with hundreds of tables, that we don't want to show. Is there a way I can remove all the datasets in that database? (It's a Snowflake database.) Thank you