# ingestion

  • rapid-crowd-46218

    04/26/2023, 7:16 AM
    Hi, I'm trying to ingest a Glue data source. In my recipe file I set
    emit_s3_lineage=true
    and
    glue_s3_lineage_direction=upstream
    , but the lineage does not appear in the UI. However, if I specify
    glue_s3_lineage_direction=downstream
    , the lineage is visible in the UI. What could be the reason for this? There is no error in the CLI ingest report, and after ingesting, 'upstreamLineage' does appear in the source (glue) report.
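    For context, a minimal sketch of a Glue recipe with both lineage options set; the region and sink address are placeholder values, not taken from this thread:
    source:
      type: glue
      config:
        aws_region: us-east-1                  # placeholder
        emit_s3_lineage: true
        glue_s3_lineage_direction: upstream    # flip to downstream to compare behaviour
    sink:
      type: datahub-rest
      config:
        server: 'http://localhost:8080'        # placeholder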

  • thousands-yacht-8284

    04/26/2023, 7:23 AM
    Hi all, not sure this is the correct channel to ask my question, but I'll give it a try. I want to use GraphQL to create a glossary term with custom properties. I found the mutation to create the term, but I can't find how to add custom properties. Is it possible to do that in GraphQL, or is it only doable in a YAML file?
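    For anyone landing here later, a hedged sketch of the YAML route via the datahub-business-glossary source; the custom_properties key on a term is an assumption to verify against the business glossary file documentation, and all names are made up:
    # business_glossary.yml
    version: 1
    source: DataHub
    owners:
      users:
        - datahub
    nodes:
      - name: Classification
        description: Data classification terms
        terms:
          - name: Confidential
            description: Restricted to internal use only
            custom_properties:          # assumed field, check the glossary source docs
              review_cycle: annual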

  • late-furniture-56629

    04/26/2023, 10:17 AM
    Hi. I have a very general question: how can I back up the ingestion and secrets configuration so that it can be easily recreated in another environment? 🙂

  • gifted-market-81341

    04/26/2023, 11:54 AM
    Hello, I have a question regarding ingestion from MSSQL. We have an ingestion schedule for one of our databases in SQL Server, and I noticed that lineage is not being generated between the views and the tables that they rely on. Is that something that is supported, or is that something I need to handle myself?

  • witty-butcher-82399

    04/26/2023, 12:12 PM
    Hi DataHubers! This
    ASYNC_INGEST_DEFAULT
    feature caught our attention. https://datahubspace.slack.com/archives/CV2UXSE9L/p1681587580923939?thread_ts=1681215034.637799&cid=CV2UXSE9L https://datahubspace.slack.com/archives/CV2UXSE9L/p1681588160609639?thread_ts=1681215034.637799&cid=CV2UXSE9L I have a couple of questions: • Is this flag exposed in the GMS API? As a user of the GMS API, I would like to process some of my requests in async mode. • Assuming the async scenario, and in case authorization is enabled for the system, is the event authorized before being sent to the async queues, or will those events be unauthorized? Thanks!

  • billions-baker-82097

    04/26/2023, 3:21 PM
    I have to specify a driver that DataHub should use for my custom source. How can we do that? For example, the MySQL type uses the pymysql driver; similarly, I need to configure a driver for my own custom type. Can you tell me how to do so?
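    If the custom source speaks SQLAlchemy, one option is the generic sqlalchemy source, where the driver is named inside the connect_uri (dialect+driver). A hedged sketch with hypothetical dialect and driver names; the driver package must also be pip-installed in the environment that runs the ingestion:
    source:
      type: sqlalchemy
      config:
        platform: mycustomplatform        # hypothetical platform name
        connect_uri: 'mydialect+mydriver://user:pass@host:3306/db'   # hypothetical dialect+driver
        include_tables: true
        include_views: true
    sink:
      type: datahub-rest
      config:
        server: 'http://localhost:8080'   # placeholder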

  • lively-dusk-19162

    04/26/2023, 3:33 PM
    Hi team, I have forked DataHub and am trying to get it up and running using the command ./gradlew quickstartDebug. When I do so, I am getting the following error inside the elasticsearch-setup container. I have added CA certificates inside the Dockerfiles. Can anyone please help me resolve the issue?

  • helpful-tent-87247

    04/26/2023, 4:41 PM
    hey all - I have a use case where I want to ingest Looker data from 2 separate LookML projects that reference each other - is this possible? Essentially one of the Looker instances, our external-facing instance, references views and explores from our internal instance. Is there a way to ingest these 2 instances so that lineage is tracked between them, such that we can see dependencies from the external instance to views in the internal instance?

  • fierce-restaurant-41034

    04/27/2023, 8:36 AM
    Hi all, I ingest Snowflake tables into DataHub. Is there a way to see all DML commands, instead of just SELECT, in the Queries tab or somewhere else? For example, say the table name is x and the last DML command was:
    insert into x values('foo')
    I want to see that insert command. Thanks
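    A hedged pointer, to be verified against your connector version: the Snowflake source exposes usage-related flags, and include_operational_stats is meant to capture non-SELECT operations (INSERT/UPDATE/etc.) as dataset operations rather than entries in the Queries tab. A minimal sketch; the account value is a placeholder:
    source:
      type: snowflake
      config:
        account_id: my_account            # placeholder
        include_usage_stats: true
        include_operational_stats: true   # assumption: surfaces DML as operations, not queries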

  • numerous-refrigerator-15664

    04/27/2023, 10:03 AM
    Hi team, sorry for the newbie question. I'm trying to ingest my Hive metastore. Since my Hive metastore is on MySQL and it's reachable, I'm considering using the presto-on-hive recipe. The problem is that I'm getting an error saying
    ERROR {datahub.entrypoints:192} - Command failed: Cannot open config file presto-on-hive.dhub.yaml
    when I try
    datahub ingest -c presto-on-hive.dhub.yaml
    . According to some threads in Slack, the reason seems to be that my DataHub docker container cannot read the YAML file in the host directory, but I still haven't found the answer. So my questions are: 1. Which container should be able to read my YAML file? datahub-gms? 2. Should I mount my host directory into the docker container? It says "For docker, we set docker-compose to mount
    ${HOME}/.datahub
    directory to
    /etc/datahub
    directory within the GMS containers." on this page: https://datahubproject.io/docs/plugins/#plugin-installation but it seems those changes are not reflected. Thank you in advance!
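    On the mounting question, a hedged docker-compose override sketch (file name and paths are hypothetical): a mount is only needed if the recipe is executed inside a container, e.g. UI-scheduled ingestion running in datahub-actions; if you run datahub ingest from a CLI installed on the host, the YAML path is resolved on the host and no mount is required.
    # docker-compose.override.yml (hypothetical)
    services:
      datahub-actions:
        volumes:
          - ./recipes:/etc/datahub/recipes   # host directory -> path visible inside the container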

  • incalculable-processor-75603

    04/27/2023, 11:03 AM
    Hi all, I created a PR to
    add ability to preserve dbt table identifier casing
    , but the Vercel bot reports that the deployment has failed. PR here: https://github.com/datahub-project/datahub/pull/7854 I have also tested the new code in my local environment and it works, so I don't know why this is happening, or when the PR will be ready to merge. What should I do next? Thanks for your advice!

  • fresh-dusk-60832

    04/27/2023, 12:44 PM
    Hey guys, I'm trying to ingest Athena metadata, but it's not working. This is my recipe:
    source:
        type: athena
        config:
            aws_region: us-east-1
            work_group: primary
            include_views: true
            include_tables: true
            catalog_name: dynamodb
            database: default
            query_result_location: 's3://xxx/xxx/'
    This catalog and database read data from my DynamoDB using the Athena Connector (Lambda). If I configure my recipe to grab the metadata from the default catalog (awsdatacatalog), it works perfectly. Any clue? Maybe the Athena connector only works with Data Source Type = AWS Glue Data Catalog?

  • rich-policeman-92383

    04/27/2023, 1:08 PM
    # DataHub version: v0.9.6.1
    # Source: dbt
    # dbt core version: 1.3.3
    Hello, we are using dbt to do some transformations with a Hive table as a source. After the transformation and tests are successfully executed, we use the DataHub CLI to emit dbt metadata + lineage into DataHub. The lineage presented in DataHub does not add the "Hive" dataset as a source; instead it creates and adds a new "DBT & Hive" dataset as the source. The problem is that the "Hive" dataset already has all the business metadata added, and we want it to be shown as the source instead of this new "DBT & Hive" dataset.
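    For reference, a hedged sketch of the dbt recipe setting that controls which warehouse platform the dbt nodes are mapped onto; target_platform should match the platform of the datasets you already ingested (hive here), and the artifact paths are placeholders:
    source:
      type: dbt
      config:
        manifest_path: ./target/manifest.json   # placeholder
        catalog_path: ./target/catalog.json     # placeholder
        target_platform: hive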

  • adamant-honey-44884

    04/27/2023, 4:54 PM
    I should have posted here to start with, so cross-posting now: https://datahubspace.slack.com/archives/CV2KB471C/p1682435333026279

  • clever-magician-79463

    04/27/2023, 5:05 PM
    Hi DataHub team, I'm getting the following error - datahub.ingestion.run.pipeline.PipelineInitError: Failed to find a registered source for type redshift: 'str' object is not callable. Attaching the full error log for your reference.
    exec-urn_li_dataHubExecutionRequest_626c2c02-d44e-4c50-99a1-6ebe0f06acf3.log

  • able-evening-90828

    04/27/2023, 5:52 PM
    We like the new Postgres improvement that can ingest from multiple databases in one Postgres instance. However, we found it a bit cumbersome to use, because by default the Postgres connection tries to connect to the database with the same name as the username. If such a database doesn't exist, the ingestion fails. In our case, we don't want to create a new database just to match the username we use. We think this can be easily addressed by connecting to the
    postgres
    database when listing the databases. So we would change the following line:
    engine = create_engine(url, **self.config.options)
    to something like below:
    engine = create_engine(self.config.get_sql_alchemy_url(database="postgres"), **self.config.options)
    If there is no objection, we will send a PR out to address this. @hundreds-photographer-13496 @gray-shoe-75895 @famous-waitress-64616

  • flat-painter-78331

    04/28/2023, 12:54 AM
    Hi team, I'm trying to integrate Airflow with DataHub. I'm running both DataHub and Airflow on my Kubernetes cluster and I've followed the exact steps mentioned in https://datahubproject.io/docs/lineage/airflow#using-datahubs-airflow-lineage-plugin, but none of the DAGs I've deployed are shown in DataHub, and the task logs of the DAGs do not show any DataHub logs. I'm on DataHub version 0.10.2. I've been struggling with this for days and I cannot figure out what I'm missing... Could you please help me resolve this? It would be much appreciated!

  • elegant-salesmen-99143

    04/28/2023, 12:30 PM
    Hi, can anyone please help me understand what
    platform_instance
    is in the Kafka connect docs? We have a working Kafka connection, but I want to enable stateful ingestion on it, and I can't without specifying a platform instance, and I'm not sure what that is. We're using Confluent Kafka - is that it? Should I write something like
    platform_instance: confluent
    ?
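    A hedged sketch of where platform_instance sits in a recipe, shown here for the plain kafka source (broker and registry addresses are placeholders). The value is simply a stable label you choose for this particular cluster - e.g. your Confluent cluster name - and stateful ingestion uses it to tell clusters apart:
    source:
      type: kafka
      config:
        connection:
          bootstrap: 'broker:9092'                             # placeholder
          schema_registry_url: 'http://schema-registry:8081'   # placeholder
        platform_instance: my-confluent-cluster                # any stable name you pick
        stateful_ingestion:
          enabled: true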

  • important-bear-9390

    04/28/2023, 4:18 PM
    Hello team! I'm trying to ingest Spark jobs (run in k8s) into DataHub. So far, I can only see downstream lineage, not upstream. Searching for problems like this, I found more people having the same issue. DataHub: 0.9.2, datahub-spark-lineage: 0.10.2 (other versions return errors like:
    ERROR DatasetExtractor: class org.apache.spark.sql.catalyst.plans.logical.Aggregate is not supported yet.
    ) Any tips on what I could do to solve this?

  • bright-waitress-5179

    04/28/2023, 6:39 PM
    Hello, I am trying to set up DataHub locally following the quickstart guide. I am able to navigate to http://localhost:9002 and set up the ingestions for snowflake, looker and lookml. However, I am getting errors for all three ingestions. For snowflake, I am seeing this error in the logs. Version:
    acryl-datahub, version 0.10.2.2
    'failures': [{'error': 'Unable to emit metadata to DataHub GMS',
                   'info': {'exceptionClass': 'com.linkedin.restli.server.RestLiServiceException',
                            'stackTrace': 'com.linkedin.restli.server.RestLiServiceException [HTTP Status:400]: Cannot parse request entity\n'
                                          '\tat com.linkedin.restli.server.RestLiServiceException.fromThrowable(RestLiServiceException.java:315)\n'
                                          '\tat com.linkedin.restli.server.BaseRestLiServer.buildPreRoutingError(BaseRestLiServer.java:202)',
                            'message': 'Cannot parse request entity',
                            'status': 400,
                            'id': 'urn:li:dataset:(urn:li:dataPlatform:snowflake,segment_prod.core_mobile_production.appointment_save,PROD)'}}]

  • bright-waitress-5179

    04/28/2023, 6:57 PM
    Hello, I am trying to set up DataHub locally following the quickstart guide. I am able to navigate to http://localhost:9002 and set up the ingestions for snowflake, looker and lookml. Both the looker and lookml ingestions are failing with this error. Version:
    acryl-datahub, version 0.10.2.2
    File "/tmp/datahub/ingest/venv-looker-0.10.2/lib/python3.10/site-packages/sqllineage/__init__.py", line 24, in _patch_updating_lateral_view_lexeme
        if regex("LATERAL VIEW EXPLODE(col)"):
    TypeError: 'str' object is not callable

  • purple-salesmen-12745

    04/29/2023, 6:59 PM
    Do you know a way to connect a thesaurus like https://agrovoc.fao.org/browse/agrovoc/en/ to the business glossary and keep it fresh? There is also a SPARQL endpoint available: https://agrovoc.fao.org/sparql

  • rich-policeman-92383

    05/01/2023, 8:09 AM
    # DataHub version: v0.9.6.1
    # DataHub CLI: 0.9.6.4
    Hello, is there any way to specify query_max_execution_time while using the trino source? I need to set it to 14400 seconds or more; right now the query gets timed out after 10 minutes. The Trino admins said that this property is configurable on the client side. Error:
    [2023-04-30 18:24:05,254] ERROR    {datahub.utilities.sqlalchemy_query_combiner:403} - Failed to execute queue using combiner: (trino.exceptions.TrinoQueryError) TrinoQueryError(type=INSUFFICIENT_RESOURCES, name=EXCEEDED_TIME_LIMIT, message="Query exceeded the maximum execution time limit of 10.00m"
    
    ["Profiling exception (trino.exceptions.OperationalError) error 404: b'Query not found'\n(Background on this error at: <https://sqlalche.me/e/14/e3q8>)"]
    Recipe yaml:
    source:
      type: "trino"
      config:
        host_port: ip:port
        database: hive_2
    
        username: tr
        password:
    
        schema_pattern:
          deny:
            - .*information_schema.*
          allow:
            - B
            - A
    
        table_pattern:
          allow:
            - hive_2.A.table1
            - hive_2.B.table2
       
    
        profiling:
          enabled: True
    
        profile_pattern:
          allow:
            - hive_2.A.table1
            - hive_2.B.table2
    
    transformers:
      - type: "simple_add_dataset_tags"
        config:
          tag_urns:
            - "urn:li:tag:1_0_prod_datalake"
    
    pipeline_name: "trino_hive_prod_to_datahub_prod"
    
    datahub_api:
      server: "<https://gms:8080>"
      token: 
      
      
    sink:
      type: "datahub-rest"
      config:
        server: "<https://gms:8080>"
        token:
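    A hedged sketch of one way to push the limit through to the client, assuming the source's options block is forwarded to SQLAlchemy's create_engine and that connect_args reaches the Trino DBAPI connection, which accepts session_properties; verify against your trino client version:
    source:
      type: "trino"
      config:
        host_port: ip:port
        database: hive_2
        username: tr
        options:
          connect_args:
            session_properties:
              query_max_execution_time: 4h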

  • bitter-evening-61050

    05/01/2023, 9:47 AM
    Hi team, I have Airflow integrated with DataHub. I have a DAG where a procedural query is called from Snowflake, with inlets and outlets. The lineage for this DAG is shown in DataHub, but the inlets and outlets are not pointing to the datasets that exist under the Snowflake platform; instead it creates its own datasets with no schema and no data. Can anyone help me with this issue?

  • elegant-nightfall-29115

    05/01/2023, 11:24 PM
    Hey team, I want to run an ingestion of a policies file to replace the default policies of DataHub at
    /datahub/datahub-gms/resources/policies.json
    ; however, the ingestion-cron pod can't find that path. The recipe file looks like this:
    source:
      type: file
      config:
        # Coordinates
        filename: ../policies.json
    
    sink:
      type: file
      config:
        filename: /datahub/datahub-gms/resources/policies.json
    I am trying to remove the
    MANAGE_INGESTION
    permission from all users so as to totally disable UI ingestion.

  • billions-baker-82097

    05/02/2023, 11:11 AM
    I was trying to ingest through the UI, using OTHERS as the type, and here's the recipe I have used:
    source:
      type: sqlalchemy
      config:
        env: DEV
        connect_uri: 'mysql+pymysql://datahub:datahub@host.docker.internal:3306'
        platform: mysql
        platform_instance: ""
        include_tables: true
        include_views: true
    sink:
      type: datahub-rest
      config:
        server: 'http://host.docker.internal:8080'

  • purple-printer-15193

    05/02/2023, 3:34 PM
    Hi all, I've granted all the Snowflake permissions as stated here: https://datahubproject.io/docs/generated/ingestion/sources/snowflake#prerequisites. Does the Snowflake database show up as one of the nodes in the lineage? Or is it because it's a data share that it wouldn't show up? I ask because I noticed that one of our tables queries the
    snowflake.account_usage.tag_references
    table, but I don't see this table in the lineage. The
    snowflake.account_usage.tag_references
    table also never gets ingested by our Snowflake ingestion recipe. Lastly, when I try to just ingest the
    SNOWFLAKE
    database I get an error like the one below:
    "source": {
        "type": "snowflake",
        "report": {
          "events_produced": 0,
          "events_produced_per_sec": 0,
          "entities": {},
          "aspects": {},
          "warnings": {},
          "failures": {
            "permission-error": [
              "No tables/views found. Please check permissions."
            ]
          },
    I can definitely see and query the
    snowflake.account_usage.tag_references
    table using the Snowflake UI, though, so I'm not sure it's really a permission error at all. Thanks.
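    On the last point, a hedged sketch of scoping a run to the SNOWFLAKE shared database via database_pattern (account and credentials elided); note that shared databases are granted differently from regular ones (e.g. via IMPORTED PRIVILEGES), so the role used by the recipe may lack access even when your own role can query it in the UI:
    source:
      type: snowflake
      config:
        account_id: my_account       # placeholder
        database_pattern:
          allow:
            - "SNOWFLAKE"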

  • fierce-animal-98957

    05/02/2023, 4:26 PM
    Hi team, we are using "DataHubValidationAction" to send assertion metadata to DataHub. We are running this from inside Databricks using Great Expectations, which uses the Spark engine. According to the documentation, this currently works only with "SqlAlchemyExecutionEngine". Does anyone know when this class will be enhanced to add Spark engine support? Is anything on the roadmap? https://datahubproject.io/docs/metadata-ingestion/integration_docs/great-expectations/#capabilities https://docs.greatexpectations.io/docs/integrations/integration_datahub/