# ingestion
  • wooden-football-7175 (02/08/2022, 3:54 PM)
    Hello all. I posted an issue in the troubleshoot channel: https://datahubspace.slack.com/archives/C029A3M079U/p1644330772754709
  • mysterious-portugal-30527 (02/09/2022, 12:43 AM)
    I don’t get it. Version 0.8.25, running the docker quickstart on Linux and connecting through Chrome on a MBP, adding an ingestion through the web application. Choosing `Execute` fails. Why is this failing:
    Copy code
    sink:
        type: datahub-rest
        config:
            server: 'http://localhost:8080'
    Log shows:
    Copy code
    ConnectionError: HTTPConnectionPool(host='localhost', port=8080): Max retries exceeded with url: /config (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fae83a81a30>: Failed to establish a new connection: [Errno 111] Connection refused'))
    2022-02-09 00:27:27.263935 [exec_id=e989b898-fb4d-4eec-9d9c-965a78650cb9] INFO: Failed to execute 'datahub ingest'
    2022-02-09 00:27:27.269727 [exec_id=e989b898-fb4d-4eec-9d9c-965a78650cb9] INFO: Caught exception EXECUTING task_id=e989b898-fb4d-4eec-9d9c-965a78650cb9, name=RUN_INGEST, stacktrace=Traceback (most recent call last):
      File "/usr/local/lib/python3.9/site-packages/acryl/executor/execution/default_executor.py", line 119, in execute_task
        self.event_loop.run_until_complete(task_future)
      File "/usr/local/lib/python3.9/site-packages/nest_asyncio.py", line 81, in run_until_complete
        return f.result()
      File "/usr/local/lib/python3.9/asyncio/futures.py", line 201, in result
        raise self._exception
      File "/usr/local/lib/python3.9/asyncio/tasks.py", line 256, in __step
        result = coro.send(None)
      File "/usr/local/lib/python3.9/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 115, in execute
        raise TaskError("Failed to execute 'datahub ingest'")
    acryl.executor.execution.task.TaskError: Failed to execute 'datahub ingest'
    Curl shows:
    Copy code
    curl http://localhost:8080/config
    {
      "models" : { },
      "versions" : {
        "linkedin/datahub" : {
          "version" : "v0.8.25",
          "commit" : "306fe0b5ffe3e59857ca5643136c8b29d80d4d60"
        }
      },
      "statefulIngestionCapable" : true,
      "retention" : "true",
      "noCode" : "true"
    }
    What am I missing??
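    A likely cause: with quickstart, the UI ingestion executor runs inside a Docker container, so localhost:8080 points at the executor container itself rather than the host where GMS answers the curl. A minimal sketch of a sink that targets GMS by its Docker network name instead (assuming the default quickstart compose service name datahub-gms):
    Copy code
    sink:
        type: datahub-rest
        config:
            server: 'http://datahub-gms:8080'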
  • shy-island-99768 (02/09/2022, 7:35 AM)
    Hello all, I have a question about the ingestion of documentation that we have written in YAML for (BigQuery) tables. What would be the best way to enrich the out-of-the-box BigQuery ingestion metadata with documentation that we keep in version-controlled yml files? Example below:
    Copy code
    full_name: project-p-p:stats.active_stats
    name: active_stats
    owners:
      - email: abel@vanmoof.com
    notes:
    description: Collect stats...
    usage:
      - department_name:
        example_usage:
          - hello
    bigquery_link: https://bigquery.googleapis.com/bigquery/v2/projects/blabla/datasets/bla/tables/active_stats
    columns:
      - name: frame_number
        description:
        is_primary_key:
        aliases: []
        unit:
        relations: []
      - name: created_at
        description:
        is_primary_key:
        aliases: []
        unit:
        relations: []
      - name: product_id
        description:
        is_primary_key:
        aliases: []
        unit:
        relations: []
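    One possible approach (a sketch, not an official recipe): keep these YAML files in version control and attach their contents during ingestion through a custom transformer referenced from the recipe. The recipe's transformers section is a real mechanism, but the module path and the docs_dir option below are hypothetical placeholders for code you would write yourself:
    Copy code
    source:
        type: bigquery
        config:
            project_id: project-p-p
    transformers:
        - type: 'my_company.ingestion.AddDocsFromYamlTransformer'  # hypothetical custom transformer class
          config:
              docs_dir: './table_docs'                             # hypothetical option: folder of yml docs
    sink:
        type: datahub-rest
        config:
            server: 'http://localhost:8080'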
  • plain-farmer-27314 (02/09/2022, 2:24 PM)
    Also:
    Copy code
    We now support the ability to ignore specific users when calculating Top Users of a Dataset/Column — this is useful when you want to exclude users designated for maintenance/automated execution.
    So we can yeet our airflow user out of datahub 🙂
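    The knob for this appears to be the usage sources' user email allow/deny pattern. A minimal sketch, assuming the user_email_pattern option on a usage source and a made-up Airflow service-account address:
    Copy code
    source:
        type: bigquery-usage
        config:
            projects:
                - my-project
            user_email_pattern:
                deny:
                    - 'airflow@my-project.iam.gserviceaccount.com'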
  • lively-fall-12210 (02/09/2022, 4:03 PM)
    Hello! In the Kafka metadata source, I am not sure how the config values `domain.domain_key.allow` and `domain.domain_key.deny` are used. Are they intended to extract domain names from the topic name by a capturing group in the regex? Or are they used to only keep topics that belong to a certain domain? Does somebody have an example? The documentation is a bit short here. Thanks a lot!
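    My reading (worth confirming against the source code): the key names the domain, and the allow/deny regexes select which topics get assigned to that domain; they are not capture groups. A sketch, with a made-up domain URN and topic pattern:
    Copy code
    source:
        type: kafka
        config:
            connection:
                bootstrap: 'broker:9092'
            domain:
                'urn:li:domain:sales':
                    allow:
                        - 'sales_.*'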
  • wooden-football-7175 (02/09/2022, 6:25 PM)
    Hello channel. I have a silly question. I want to create lineage with the `glue` pipelines that I imported from the aws source. I managed to use the Airflow backend for lineage, but I cannot find documentation on how to configure a `glue` job to connect two different `datasets` (also glue). Anyone have any reference? Thanks in advance!!
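    If directly connecting the two Glue datasets is enough, one option might be the file-based lineage source, assuming it is available in your CLI version; the dataset names below are made up:
    Copy code
    # lineage.yml
    version: 1
    lineage:
        - entity:
              name: db.table_b
              type: dataset
              platform: glue
              env: PROD
          upstream:
              - entity:
                    name: db.table_a
                    type: dataset
                    platform: glue
                    env: PROD
    This file would then be referenced from a recipe whose source type is datahub-lineage-file, with config.file pointing at it.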
  • handsome-football-66174 (02/09/2022, 7:50 PM)
    Hi everyone, are we able to add descriptions to datasets during ingestion? Are there any transformations or configurations for this?
  • rich-policeman-92383 (02/09/2022, 8:13 PM)
    Hello, how do we configure DataHub to pull metadata from an SSL-enabled ES cluster? Configuring host: "https://ip:9200" results in an assertion error: host contains bad character. If we omit the scheme, we get a connection error caused by ProtocolError. DataHub version v0.8.26, ES version 7.5.x.
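    A sketch of what the elasticsearch source config might look like with SSL, keeping the scheme out of the host; the use_ssl, verify_certs and ca_certs option names are my assumption, so check them against the elasticsearch source docs for your version:
    Copy code
    source:
        type: elasticsearch
        config:
            host: 'ip:9200'
            use_ssl: true
            verify_certs: true
            ca_certs: '/path/to/ca.pem'
            username: es_user
            password: es_password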
  • glamorous-house-64036 (02/09/2022, 10:18 PM)
    Good day, I'm trying to do a simple ingestion from PostgreSQL but facing some error messages that I struggle to understand. (DataHub is running locally via `datahub docker quickstart`.) My yaml file:
    Copy code
    source:
      type: postgres
      config:
        # Coordinates
        host_port: URL:5432
        database: DATABASENAME
        # Credentials
        username: user
        password: password
        #Options
        include_tables: True
        include_views: True
    sink:
      type: "datahub-rest"
      config:
        server: "http://localhost:9002/api/gms"   # this path is what the UI ingestion tool suggests; I also tried the default http://localhost:8080 with the same result
    Both postgres and datahub-rest plugins look enabled. Upd: Error log moved into thread.
  • rich-winter-40155 (02/10/2022, 4:22 AM)
    Hi all, we are setting up Hive metadata ingestion on the Airflow cluster. We are trying to follow the architecture proposed here: https://github.com/linkedin/datahub/blob/master/docs/architecture/metadata-ingestion.md . How do we publish Hive metadata events to Kafka and on to DataHub? If there is any example config, can you please point me to it? Thanks.
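    A minimal sketch of a recipe that pushes Hive metadata through Kafka instead of the REST sink, assuming made-up hostnames and a default schema registry:
    Copy code
    source:
        type: hive
        config:
            host_port: 'hive-server:10000'
            username: hive_user
    sink:
        type: datahub-kafka
        config:
            connection:
                bootstrap: 'broker:9092'
                schema_registry_url: 'http://schema-registry:8081'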
  • broad-tomato-45373 (02/10/2022, 6:31 AM)
    Hi, I am trying to add users for login via the user.props file. Steps I followed: 1. Created a configmap user-props (with the added users). 2. Created a file (named overide_chart_values.yml) to override the chart values:
    Copy code
    extraVolumes:
      - name: user-props
        configMap:
          name: user-props
    extraVolumeMounts:
      - name: user-props
        mountPath: /datahub-frontend/conf/user.props
    3. Upgraded the helm chart with the new values:
    Copy code
    helm upgrade --install -f values.yaml -f overide_chart_values.yml datahub datahub/datahub
    But I didn't succeed in getting the new users to log in. I am very new to K8s and Helm charts. Any help would be much appreciated.
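    Two things that commonly trip this up (a sketch against the standard datahub helm chart, so double-check the key names for your chart version): the volumes need to be declared under the datahub-frontend section of the values file, and mounting a single file from a ConfigMap needs subPath so the whole conf/ directory is not shadowed:
    Copy code
    datahub-frontend:
        extraVolumes:
            - name: user-props
              configMap:
                  name: user-props
        extraVolumeMounts:
            - name: user-props
              mountPath: /datahub-frontend/conf/user.props
              subPath: user.props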
  • gray-spoon-5206 (02/10/2022, 6:36 AM)
    Good day everyone, I'm trying to ingest data from Snowflake; however, I got an error (posted in the thread). Can someone help me with it? Thanks very much.
  • adorable-flower-19656 (02/10/2022, 6:57 AM)
    Hi guys, I'm using BigQuery ingestion. Is there a difference in performance when using the 'use_exported_bigquery_audit_metadata' option? What are the pros and cons? https://github.com/linkedin/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/source/sql/bigquery.py#L309
  • square-machine-96318 (02/10/2022, 6:58 AM)
    Hi datahub! I'm having some trouble with my DataHub system; can I get help? I want to ingest datasets from PostgreSQL, so I created an ingestion source in the web UI as in the picture below and set the schedule to '00 03 * * *'. I first ran it manually to check that ingestion works, but it doesn't sync normally. What could be the reason and how can I solve it? Ingesting manually through the CLI on my EKS server works.
  • few-air-56117 (02/10/2022, 7:33 AM)
    Hi guys, I have a question: is it possible to see the ingestion UI logs while the ingestion is running?
  • great-dusk-47152 (02/10/2022, 8:10 AM)
    Following this: https://datahubproject.io/docs/docker/airflow/local_airflow, I can't run the examples; only Airflow comes up, no DataHub.
  • rhythmic-kitchen-64860 (02/10/2022, 8:12 AM)
    Hi all, I want to ask how to add a scheduler if I'm ingesting data using the `datahub.ingestion.run.pipeline` package. The whole code is:
    Copy code
    from datahub.ingestion.run.pipeline import Pipeline

    # The pipeline configuration is similar to the recipe YAML files provided to the CLI tool.
    pipeline = Pipeline.create(
        {
            "source": {
                "type": "postgres",
                "config": {
                    "username": "postgres",
                    "password": "strongpass",
                    "database": "northwind",
                    "host_port": "localhost:5432",
                    "database_alias": "test",
                    "schema_pattern": {"allow": ["public"]},
                    "table_pattern": {
                        "allow": [
                            "test.public.region",
                            "test.public.suppliers",
                        ]
                    },
                },
            },
            "sink": {
                "type": "datahub-rest",
                "config": {"server": "http://localhost:8080"},
            },
        }
    )

    # Run the pipeline and report the results.
    pipeline.run()
    pipeline.pretty_print_summary()
    I want to schedule runs of that config; is it possible to do that? Thank you.
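    One way to schedule this (a sketch, not the only option): save the same configuration as a recipe YAML and have cron, Airflow, or the UI ingestion scheduler run `datahub ingest -c recipe.yml` on the desired cadence, rather than scheduling the Python script yourself:
    Copy code
    # recipe.yml -- same configuration as the Python snippet above
    source:
        type: postgres
        config:
            username: postgres
            password: strongpass
            database: northwind
            host_port: 'localhost:5432'
            database_alias: test
            schema_pattern:
                allow:
                    - public
            table_pattern:
                allow:
                    - test.public.region
                    - test.public.suppliers
    sink:
        type: datahub-rest
        config:
            server: 'http://localhost:8080'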
  • few-air-56117 (02/10/2022, 8:54 AM)
    Hi guys, I'm trying to ingest bigquery-usage via the UI. This is the recipe:
    Copy code
    source:
        type: bigquery-usage
        config:
            projects:
                - p1
                - p2
            credential:
                project_id: 
                private_key_id: 
                private_key: '${PRIVATE_KEY}'
                client_email: 
                client_id: 
    sink:
        type: datahub-rest
        config:
            server:
    I got this error
    Copy code
    1 validation error for BigQueryUsageConfig
    credential
      extra fields not permitted (type=value_error.extra)
    so it looks like I can't add credentials on bigquery-usage (on bigquery it works)
  • silly-beach-19296 (02/10/2022, 12:19 PM)
    Hello, I deployed DataHub on EKS and I want to switch the authentication connector to Okta. How could I do it? The documentation only mentions modifying the .env file inside the React frontend.
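    For a Helm deployment, the OIDC settings can be supplied as frontend environment variables instead of a .env file. A sketch against the standard datahub chart (the extraEnvs key and the placeholder values are assumptions to verify against your chart version and the OIDC docs):
    Copy code
    datahub-frontend:
        extraEnvs:
            - name: AUTH_OIDC_ENABLED
              value: 'true'
            - name: AUTH_OIDC_CLIENT_ID
              value: '<okta-client-id>'
            - name: AUTH_OIDC_CLIENT_SECRET
              value: '<okta-client-secret>'
            - name: AUTH_OIDC_DISCOVERY_URI
              value: 'https://<okta-domain>/.well-known/openid-configuration'
            - name: AUTH_OIDC_BASE_URL
              value: 'https://<your-datahub-url>'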
  • crooked-van-51704 (02/10/2022, 2:15 PM)
    Does anyone here have some experience with `dbt`? I am seeing an issue when I try to ingest a dbt project: it causes a `DuplicateKeyException`. When I disable the dbt node creation using `disable_dbt_node_creation: True` it works fine, so it must be related to the dbt-specific metadata. Oddly, I can disable the dbt nodes, do the ingestion successfully, re-enable the dbt nodes, and then ingestion works without any errors. The specific error I see in the stack trace is this:
    Copy code
    Caused by: java.sql.BatchUpdateException: Duplicate entry 'urn:li:dataset:(urn:li:dataPlatform:snowflake,citibike_tripdata.' for key 'metadata_aspect_v2.PRIMARY'
  • limited-cricket-18852 (02/10/2022, 4:29 PM)
    Hello! I can see in the Q3 2021 roadmap that DataHub could show a preview of the data; has it already been released? Is there any entity in the demo instance that showcases this feature? Thanksss!
  • ambitious-guitar-89068 (02/11/2022, 5:01 AM)
    Faced an issue with Tableau Ingestion: https://github.com/linkedin/datahub/issues/4119
  • curved-truck-53235 (02/11/2022, 1:47 PM)
    Hi, everyone! I'm trying to ingest schemas from Kafka but it fails
  • narrow-bird-99605 (02/11/2022, 2:51 PM)
    Hello, is there a way to set domain during ingestion?
  • handsome-football-66174 (02/11/2022, 4:21 PM)
    Hi everyone, quick question regarding lineage: how do we add lineage between an existing data task (in a data pipeline) and a dataset? I see this example - https://github.com/linkedin/datahub/blob/master/metadata-ingestion/examples/library/lineage_dataset_job_dataset.py - but it seems to be adding lineage to a data_job entity:
    Copy code
    entityUrn=builder.make_data_job_urn(
            orchestrator="airflow", flow_id="flow1", job_id="job1", cluster="PROD"
        ),
  • modern-monitor-81461 (02/11/2022, 6:09 PM)
    MySQL databases and schemas vs. containers and platform instances: I have installed 0.8.26 to play with domains and containers and I have ingested metadata from a MySQL database. I can see that every schema now has its own container, and that's all good. When I look at the parent of those containers (schemas), I see a container representing the database with a name of `none`. In MySQL, a database is pretty much the same as a schema, so it is unclear to me what the database should be... Seeing `none` is not what I was expecting. Is it to represent the fact that it is absent from MySQL? If so, would it be better to simply not create that container? Now, if I want to document the hostname of the MySQL server in DataHub, is that when I need to use platform instances? I thought platform instances were used to differentiate different instances of MySQL servers, am I right? Looking for guidance here on how to use those concepts, since I want to apply them to data lakes in an Iceberg source I am currently working on.
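    On the last point, this is roughly what platform instances are for; a sketch, assuming the mysql source supports the platform_instance option in your version and using a made-up hostname:
    Copy code
    source:
        type: mysql
        config:
            host_port: 'prod-mysql-01.internal:3306'
            platform_instance: prod-mysql-01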
  • cool-painting-92220 (02/12/2022, 12:13 AM)
    Hi there! I've taken a look around the community Slack and the DataHub documentation and didn't seem to find anything on it, so I wanted to check here - is there a good way of connecting to AzureML for metadata ingestion, or is that something projected to be added to the selection of compatible sources in the future?
  • mysterious-nail-70388 (02/14/2022, 8:25 AM)
    Hello team, I used pip to install DataHub, but it failed to run after installing 0.8.14 or 0.8.12, while it runs properly after installing 0.8.24 on the same server. Why? It works normally from 0.8.16 onwards...
  • proud-accountant-49377 (02/14/2022, 9:58 AM)
    Hi team! I've been trying for several days to remove terms associated with a field of my dataset via the API, and an unknown error appears that does not allow me to do so. Is it a bug? Does anyone know anything about this? Thank you!
  • red-napkin-59945 (02/14/2022, 6:20 PM)
    Hey team, I want to check if there is an existing Python lib to parse URNs? I did not find any in the datahub repo. I wrote some on my own and am wondering if I should contribute it back.