# getting-started
  • late-notebook-97260

    09/05/2023, 9:17 AM
    I want to fetch the downstream lineage using GraphQL, where the condition is to filter for datasets having some tag value, e.g. only fetch those datasets which have the “critical” tag attached to them. I tried the orFilters option below, but it's not fetching the right datasets. Any suggestions?
    Copy code
    query getDatasetUpstreams($urn: String!) {
      downstream: searchAcrossLineage(
        input: {
          urn: $urn
          direction: DOWNSTREAM
          count: 1000
          orFilters: [
            {
              and: [
                {
                  field: "filter_tags"
                  values: ["critical"]
                  condition: CONTAIN
                }
              ]
            }
          ]
        }
      ) {
        total
        searchResults {
          degree
          entity {
            type
            urn
          }
        }
      }
    }
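    For comparison: in DataHub's search filters the tag facet is commonly named tags and is matched against full tag URNs rather than bare tag names; both details are assumptions worth verifying against your instance. A sketch of the adjusted query, sent through Python purely for illustration:
    Copy code
    import requests

    # Hypothetical GMS endpoint and token; adjust for your deployment.
    GRAPHQL_URL = "http://localhost:8080/api/graphql"
    TOKEN = "<personal-access-token>"

    # Same lineage query, but filtering on the "tags" facet with a full tag URN.
    QUERY = """
    query getDatasetDownstreams($urn: String!) {
      downstream: searchAcrossLineage(
        input: {
          urn: $urn
          direction: DOWNSTREAM
          count: 1000
          orFilters: [{and: [{field: "tags", values: ["urn:li:tag:critical"], condition: EQUAL}]}]
        }
      ) {
        total
        searchResults { degree entity { type urn } }
      }
    }
    """

    resp = requests.post(
        GRAPHQL_URL,
        headers={"Authorization": f"Bearer {TOKEN}"},
        json={
            "query": QUERY,
            # Example dataset URN borrowed from elsewhere in this thread.
            "variables": {"urn": "urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_created,PROD)"},
        },
    )
    resp.raise_for_status()
    print(resp.json())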
  • shy-diamond-99510

    09/05/2023, 12:39 PM
    Hey guys, I'm new to this Slack and to DataHub. I have to install a custom ingestion source in a Docker setup. I have followed all the tutorials related to what I'm trying to achieve, but it still doesn't work. Does anybody have experience with installing a custom ingestion source? I really need help.
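    A minimal sketch of what a custom source looks like, assuming the datahub.ingestion.api interfaces; method signatures vary slightly across CLI versions, so verify against the version in your image. The module and class names are hypothetical.
    Copy code
    # my_source.py: must be importable (on PYTHONPATH) inside the container
    # that actually runs the ingestion.
    from typing import Iterable

    from datahub.ingestion.api.common import PipelineContext
    from datahub.ingestion.api.source import Source, SourceReport
    from datahub.ingestion.api.workunit import MetadataWorkUnit


    class MyCustomSource(Source):
        def __init__(self, config: dict, ctx: PipelineContext):
            super().__init__(ctx)
            self.config = config
            self.report = SourceReport()

        @classmethod
        def create(cls, config_dict: dict, ctx: PipelineContext) -> "MyCustomSource":
            return cls(config_dict or {}, ctx)

        def get_workunits(self) -> Iterable[MetadataWorkUnit]:
            # Yield MetadataWorkUnit objects here; empty in this skeleton.
            return []

        def get_report(self) -> SourceReport:
            return self.report
    The recipe can then reference the class by its fully-qualified name (type: my_source.MyCustomSource under source:), which sidesteps plugin registration; the module just has to be importable inside the container that runs the ingestion.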
  • melodic-match-91677

    09/06/2023, 3:34 AM
    Hello everyone, I recently installed DataHub locally using Docker, followed the documentation to set it up successfully, and imported the ingest-sample-data. However, I've run into several issues and haven't been able to figure out how to achieve my desired results. I'd appreciate your help with the following:
    1. In Permissions -> Roles, I assigned the "Admin" role to the datahub account, which should allow me to perform any operation. However, I'm unable to modify my own password; the "Reset Password" option is grayed out.
    2. In the imported ingest-sample-data, under Permissions -> Policies, there are many settings that I cannot edit or modify, even though the datahub account has Admin privileges.
    3. I've tried various Policies configurations, but I'm unable to make a user see only their own data. In other words, I want a user to see only data where they are the owner or it belongs to their group; other resources shouldn't be searchable. I'm not sure how to configure this.
    I'd greatly appreciate any assistance or guidance on how to achieve these objectives. Thank you!
  • eager-monitor-4683

    09/06/2023, 6:39 AM
    Hey team, is there any info around Acryl pricing? I cannot find it on https://www.acryldata.io/. Thanks
  • wonderful-library-51057

    09/06/2023, 10:24 PM
    Hi all. I'm trying to evaluate DataHub for my use case and I'm not certain if it's a fit... I have a data lake with a bunch of immutable Parquet files in a (theoretical) hdfs://my-orders folder. A new file is uploaded each day, but I want my logical dataset to be “orders.” I have an Airflow job that runs every week on the last week's files. For example, the Airflow job build-weekly-summary/__scheduled_2023-09-09T01:00:00 reads [hdfs://my-orders/2023-09-03.parquet, hdfs://my-orders/2023-09-04.parquet, etc.] and writes hdfs://order-summary/2023-09-09.parquet. I want to track lineage that shows which files were accessed and written by a specific run of a job, but I can't find a way to register a file like s3://my-orders/2023-09-06.parquet to the “orders” dataset in DataHub. Effectively I want to:
    1. Go to DataHub and click on the "Orders" logical dataset.
    2. See that this dataset is composed of 24 files in the data lake, including hdfs://my-orders/2023-09-03.parquet.
    3. Click on hdfs://my-orders/2023-09-03.parquet and see (via lineage) that it was read by the build-weekly-summary/__scheduled_2023-09-09T01:00:00 job.
    4. See that this job passed all the validation checks.
    5. See that this job also wrote out a file to hdfs://order-summary/2023-09-09.parquet.
    Is that possible with DataHub? It seems like the S3 data lake tooling supports something similar, but HDFS is tied to Hive? If it's not natively supported, would it be realistic to implement a custom source? (edited: replaced S3 path examples with HDFS, as that's the store we're actually using in our environment)
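    For what it's worth, job-to-file lineage like steps 2-5 can be emitted directly with the Python SDK, independent of any built-in source. A rough sketch against a local GMS; the "hdfs" platform name and all dataset/job names are illustrative assumptions. Per-run granularity would additionally use DataProcessInstance events, which the SDK also supports.
    Copy code
    from datahub.emitter.mce_builder import make_data_job_urn, make_dataset_urn
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import DataJobInputOutputClass

    emitter = DatahubRestEmitter("http://localhost:8080")

    # Hypothetical URNs: one dataset per file, plus the weekly Airflow task.
    inputs = [
        make_dataset_urn("hdfs", "my-orders/2023-09-03.parquet", "PROD"),
        make_dataset_urn("hdfs", "my-orders/2023-09-04.parquet", "PROD"),
    ]
    outputs = [make_dataset_urn("hdfs", "order-summary/2023-09-09.parquet", "PROD")]
    job_urn = make_data_job_urn("airflow", "build-weekly-summary", "build_weekly_summary_task")

    # Attach the files read and written by the job as input/output lineage.
    emitter.emit(
        MetadataChangeProposalWrapper(
            entityUrn=job_urn,
            aspect=DataJobInputOutputClass(inputDatasets=inputs, outputDatasets=outputs),
        )
    )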
  • dazzling-rainbow-96194

    09/07/2023, 4:50 PM
    Hi, is there a way to get stats on how many datasets have documentation, how many users are using DataHub on a daily basis, etc.?
  • wonderful-library-51057

    09/07/2023, 8:56 PM
    Another question... our overall project uses PostgreSQL in a lot of places. The quickstart guide mentions MySQL as a required service dependency, but I also saw this: https://github.com/datahub-project/datahub/blob/4ffad4d9b91c25d9f8380fba7d81f65fed[…]d188c/docker/docker-compose-without-neo4j.postgres.override.yml. Are they interchangeable?
  • stale-guitar-30481

    09/07/2023, 10:53 PM
    Hi, I have a noob question: what does li stand for in a URN expression? e.g. urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_created,PROD)
  • shy-kangaroo-51257

    09/08/2023, 6:37 AM
    Hi @witty-plumber-82249, I tried connecting MinIO and DataHub. The connection succeeded, but nothing is actually getting ingested. Any suggestions?
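    To help rule out recipe issues, a minimal sketch of an s3-type source pointed at MinIO, run through the Python API; the endpoint, credentials, and path spec are all placeholder assumptions about the bucket layout.
    Copy code
    from datahub.ingestion.run.pipeline import Pipeline

    # All endpoints, credentials, and paths below are hypothetical placeholders.
    pipeline = Pipeline.create(
        {
            "source": {
                "type": "s3",
                "config": {
                    "path_specs": [{"include": "s3://my-bucket/data/*.parquet"}],
                    "aws_config": {
                        "aws_access_key_id": "minioadmin",
                        "aws_secret_access_key": "minioadmin",
                        "aws_endpoint_url": "http://localhost:9000",
                        "aws_region": "us-east-1",
                    },
                },
            },
            "sink": {"type": "datahub-rest", "config": {"server": "http://localhost:8080"}},
        }
    )
    pipeline.run()
    pipeline.raise_from_status()  # surfaces ingestion errors instead of failing silently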
  • best-monitor-90704

    09/08/2023, 8:32 AM
    Hi, how do I establish a connection between DataHub and Grafana?
  • bumpy-computer-90932

    09/08/2023, 12:44 PM
    Hi, I am trying to install the Helm chart datahub/datahub-0.2.182 without the prerequisites chart, since we already have PostgreSQL, Elasticsearch, and Kafka in our infrastructure. I'm having issues where the datahub-gms pod never becomes ready; the Java process is not listening on any ports. Could this be related to the INTERNAL schema registry implementation?
  • orange-gpu-90973

    09/08/2023, 2:34 PM
    Hi, why does DataHub use Gradle instead of Maven to build and publish jar files? Is there any way to use Maven to build the metadata-service or the frontend and use that jar/war file in the datahub-gms or frontend image?
  • elegant-machine-46829

    09/08/2023, 6:18 PM
    Hi everyone, I just set up the quickstart install for Redshift and added my first source, but the ingest failed. It seems like one or more of the containers comes with Python 3.10, which isn't compatible with acryl-datahub.
    Copy code
    ERROR: Could not find a version that satisfies the requirement acryl-datahub[datahub-kafka,datahub-rest,redshift]==@cliMajorVersion@ 
    
    Execution finished with errors.
    {'exec_id': 'f7a40783-a3ea-4c35-8161-47f449c22e4b',
     'infos': ['2023-09-08 16:58:42.877515 INFO: Starting execution for task with name=RUN_INGEST',
               "2023-09-08 16:58:55.261806 INFO: Failed to execute 'datahub ingest'",
               '2023-09-08 16:58:55.264998 INFO: Caught exception EXECUTING task_id=f7a40783-a3ea-4c35-8161-47f449c22e4b, name=RUN_INGEST, '
               'stacktrace=Traceback (most recent call last):\n'
               '  File "/usr/local/lib/python3.10/site-packages/acryl/executor/execution/default_executor.py", line 122, in execute_task\n'
               '    task_event_loop.run_until_complete(task_future)\n'
               '  File "/usr/local/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete\n'
               '    return future.result()\n'
               '  File "/usr/local/lib/python3.10/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 231, in execute\n'
               '    raise TaskError("Failed to execute \'datahub ingest\'")\n'
               "acryl.executor.execution.task.TaskError: Failed to execute 'datahub ingest'\n"],
     'errors': []}
    Is there an easy way this can be fixed? I'm not even sure which container this is coming from. Thanks for any help.
  • icy-umbrella-3214

    09/08/2023, 7:50 PM
    I am trying to run the basic datahub docker start, but I use Colima instead of Docker Desktop since I am on a MacBook M1, and it fails.
  • icy-umbrella-3214

    09/08/2023, 7:50 PM
    Copy code
    ❯ datahub docker quickstart
    Detected M1 machine
    [2023-09-08 12:50:31,809] INFO     {datahub.cli.quickstart_versioning:144} - Saved quickstart config to /Users/edmondoporcu/.datahub/quickstart/quickstart_version_mapping.yaml.
    [2023-09-08 12:50:31,810] INFO     {datahub.cli.docker_cli:645} - Using quickstart plan: composefile_git_ref='master' docker_tag='head'
    Docker doesn't seem to be running. Did you start it?
    ❯ docker ps
    CONTAINER ID   IMAGE                      COMMAND                  CREATED       STATUS      PORTS                                       NAMES
    7cadfc70b5d2   postgres:15.4-alpine3.17   "docker-entrypoint.s…"   11 days ago   Up 7 days   0.0.0.0:5432->5432/tcp, :::5432->5432/tcp   postgres-arroyo
  • icy-umbrella-3214

    09/08/2023, 7:50 PM
    Any suggestions?
  • best-monitor-90704

    09/11/2023, 1:02 AM
    Hi, I am trying to change the DataHub default password. I have followed the second method mentioned in the doc below; what is the meaning of <absolute_path_to_your_custom_user_props_file>? https://datahubproject.io/docs/authentication/changing-default-credentials/
  • best-monitor-90704

    09/11/2023, 1:03 AM
    image.png
  • microscopic-spring-39376

    09/11/2023, 4:27 AM
    Hi, I need to ingest from a Postgres database that is behind a jump box. What is the best way to connect to such a database in DataHub? Thank you.
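    One common pattern, not DataHub-specific: open an SSH tunnel through the jump box and point the recipe's host_port at the local end of the tunnel. A sketch using the third-party sshtunnel package; every host, user, and port below is a hypothetical placeholder.
    Copy code
    from sshtunnel import SSHTunnelForwarder

    # Forward local port 15432 through the jump box to the private Postgres host.
    with SSHTunnelForwarder(
        ("jumpbox.example.com", 22),
        ssh_username="tunnel-user",
        ssh_pkey="~/.ssh/id_rsa",
        remote_bind_address=("postgres.internal", 5432),
        local_bind_address=("127.0.0.1", 15432),
    ):
        # While the tunnel is up, a postgres recipe can use
        # host_port: 127.0.0.1:15432 as if the database were local.
        ...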
  • best-monitor-90704

    09/11/2023, 5:28 AM
    Hi, I am trying to build lineage using S3 as the source with the help of Python, but I am unable to access the S3 tables. Please see the URN code I am using:
    Copy code
    def datasetUrn(tb1):
        return builder.make_dataset_urn("s3", tb1)

    upstreams = [
        fldUrn("datahub.country.country_sample.csv", "country_id"),
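    For reference, a self-contained sketch of table-level S3 lineage with the Python emitter. The bucket and key names are hypothetical, and it assumes the S3 dataset name in the URN is the bucket/key path rather than a dotted name like datahub.country.country_sample.csv; worth verifying against the URNs your ingestion actually produced.
    Copy code
    import datahub.emitter.mce_builder as builder
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import (
        DatasetLineageTypeClass,
        UpstreamClass,
        UpstreamLineageClass,
    )

    emitter = DatahubRestEmitter("http://localhost:8080")

    # Hypothetical URNs; S3 dataset names are generally the bucket/key path.
    upstream_urn = builder.make_dataset_urn("s3", "my-bucket/country/country_sample.csv", "PROD")
    downstream_urn = builder.make_dataset_urn("s3", "my-bucket/country/country_enriched.csv", "PROD")

    # Declare the upstream file as a TRANSFORMED input of the downstream file.
    lineage = UpstreamLineageClass(
        upstreams=[UpstreamClass(dataset=upstream_urn, type=DatasetLineageTypeClass.TRANSFORMED)]
    )
    emitter.emit(MetadataChangeProposalWrapper(entityUrn=downstream_urn, aspect=lineage))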
  • shy-kangaroo-51257

    09/11/2023, 8:48 AM
    Can we connect DataHub to AWS QuickSight?
  • shy-kangaroo-51257

    09/11/2023, 9:18 AM
    How do I execute queries that are created in DataHub?
  • wonderful-library-51057

    09/11/2023, 4:28 PM
    I'm interested in an experience share on the challenges of administering Kafka for DataHub. We're working on a spike, and that's one of the biggest support concerns for our team, as we don't already have Kafka in our stack (or an in-house team with the expertise to maintain it). How heavy is DataHub's Kafka usage? Are most people deploying single-broker instances, or is there normally a more complex setup involved?
  • alert-angle-39401

    09/11/2023, 8:19 PM
    Hi @witty-plumber-82249, I'm trying to send quality insights from Great Expectations into DataHub and I'm running into the issue below:
  • alert-angle-39401

    09/11/2023, 8:19 PM
    Unable to emit metadata into DataHub.
  • alert-angle-39401

    09/11/2023, 8:19 PM
    I'm using the DataHub action config like below:
  • alert-angle-39401

    09/11/2023, 8:19 PM
    name: datahub_action
    action:
      module_name: datahub.integrations.great_expectations.action
      class_name: DataHubValidationAction
      server_url: http://localhost:9092
  • alert-angle-39401

    09/11/2023, 8:20 PM
    I've also tried using a token (personal access token); still the same issue.
  • alert-angle-39401

    09/11/2023, 8:20 PM
    Can someone please tell me how to fix this issue?
  • alert-angle-39401

    09/11/2023, 8:27 PM
    ('Unable to emit metadata to DataHub GMS', {'message': "HTTPConnectionPool(host='localhost', port=9092): Max retries exceeded with url: /aspects?action=ingestProposal (Caused by ProtocolError('Connection aborted.', RemoteDisconnected('Remote end closed connection without response')))"})
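    One detail that stands out in the config above: port 9092 is conventionally the Kafka broker, while DataHubValidationAction's server_url should point at the GMS REST endpoint, which defaults to 8080 in the quickstart (that default is an assumption worth checking for this deployment). A quick Python sanity check of the endpoint:
    Copy code
    from datahub.emitter.rest_emitter import DatahubRestEmitter

    # 8080 is the quickstart default for GMS REST; adjust if your deployment differs.
    emitter = DatahubRestEmitter("http://localhost:8080", token="<personal-access-token>")
    emitter.test_connection()  # raises if GMS is unreachable or the token is rejected
    print("GMS is reachable")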