# ingestion

  • rapid-crowd-46218

    04/26/2023, 7:16 AM
    Hi, I'm trying to ingest a Glue data source. In my recipe file I set
    emit_s3_lineage=true
    and
    glue_s3_lineage_direction=upstream
    , but the lineage does not appear in the UI. However, if I specify
    glue_s3_lineage_direction=downstream
    , the lineage is visible in the UI. What could be the reason for this? There is no error in the CLI ingest report, and after ingesting, 'upstreamLineage' does appear in the source (glue) report.
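    For context, a minimal sketch of a Glue recipe with both lineage options set; the region and sink address are placeholder values, not taken from this thread:
    source:
      type: glue
      config:
        aws_region: us-east-1                  # placeholder
        emit_s3_lineage: true
        glue_s3_lineage_direction: upstream    # flip to downstream to compare behaviour
    sink:
      type: datahub-rest
      config:
        server: 'http://localhost:8080'        # placeholder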

  • thousands-yacht-8284

    04/26/2023, 7:23 AM
    Hi all, not sure this is the correct channel to ask my question, but I'll give it a try. I want to use GraphQL to create a glossary term with custom properties. I found the mutation to create the term, but I can't find how to add custom properties. Is it possible to do that in GraphQL, or is it only doable in a YAML file?
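    For anyone landing here later, a hedged sketch of the YAML route via the datahub-business-glossary source; the custom_properties key on a term is an assumption to verify against the business glossary file documentation, and all names are made up:
    # business_glossary.yml
    version: 1
    source: DataHub
    owners:
      users:
        - datahub
    nodes:
      - name: Classification
        description: Data classification terms
        terms:
          - name: Confidential
            description: Restricted to internal use only
            custom_properties:          # assumed field, check the glossary source docs
              review_cycle: annual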

  • late-furniture-56629

    04/26/2023, 10:17 AM
    Hi. I have a very general question: how can I back up the ingestion and secrets configuration so that it can be easily recreated in another environment? 🙂

  • gifted-market-81341

    04/26/2023, 11:54 AM
    Hello, I have a question regarding ingestion from MSSQL. We have an ingestion schedule for one of our databases in SQL Server, and I noticed that lineage is not being generated between the views and the tables that they rely on. Is that something that is supported, or is that something I need to handle myself?

  • witty-butcher-82399

    04/26/2023, 12:12 PM
    Hi DataHubers! This
    ASYNC_INGEST_DEFAULT
    feature caught our attention. https://datahubspace.slack.com/archives/CV2UXSE9L/p1681587580923939?thread_ts=1681215034.637799&cid=CV2UXSE9L https://datahubspace.slack.com/archives/CV2UXSE9L/p1681588160609639?thread_ts=1681215034.637799&cid=CV2UXSE9L I have a couple of questions: • Is this flag exposed in the GMS API? As a user of the GMS API, I would like to process some of my requests in async mode. • Assuming the async scenario, and in case authorization is enabled for the system, is the event authorized before being sent to the async queues, or will those events be unauthorized? Thanks!

  • billions-baker-82097

    04/26/2023, 3:21 PM
    I have to specify a driver that DataHub should use for my custom source. How can we do that? For example, the MySQL type uses the pymysql driver; similarly, I need to configure a driver for my own custom type. Can you tell me how to do so?
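    If the custom source speaks SQLAlchemy, one option is the generic sqlalchemy source, where the driver is named inside the connect_uri (dialect+driver). A hedged sketch with hypothetical dialect and driver names; the driver package must also be pip-installed in the environment that runs the ingestion:
    source:
      type: sqlalchemy
      config:
        platform: mycustomplatform        # hypothetical platform name
        connect_uri: 'mydialect+mydriver://user:pass@host:3306/db'   # hypothetical dialect+driver
        include_tables: true
        include_views: true
    sink:
      type: datahub-rest
      config:
        server: 'http://localhost:8080'   # placeholder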

  • lively-dusk-19162

    04/26/2023, 3:33 PM
    Hi team, I have forked DataHub and am trying to get it up and running using the command ./gradlew quickstartDebug. When I do so, I am getting the following error inside the elasticsearch-setup container. I have added CA certificates inside the Dockerfiles. Can anyone please help me resolve the issue?

  • helpful-tent-87247

    04/26/2023, 4:41 PM
    hey all - I have a use case where I want to ingest Looker data from 2 separate LookML projects that reference each other - is this possible? Essentially one of the Looker instances, our external-facing instance, references views and explores from our internal instance. Is there a way to ingest these 2 instances so that lineage is tracked between them, such that we can see dependencies from the external instance to views in the internal instance?

  • fierce-restaurant-41034

    04/27/2023, 8:36 AM
    Hi all, I ingest Snowflake tables into DataHub. Is there a way to see all DML commands, instead of just SELECT, in the Queries tab or somewhere else? For example, say the table name is x and the last DML command was:
    insert into x values('foo')
    I want to see that insert command. Thanks
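    A hedged pointer, to be verified against your connector version: the Snowflake source exposes usage-related flags, and include_operational_stats is meant to capture non-SELECT operations (INSERT/UPDATE/etc.) as dataset operations rather than entries in the Queries tab. A minimal sketch; the account value is a placeholder:
    source:
      type: snowflake
      config:
        account_id: my_account            # placeholder
        include_usage_stats: true
        include_operational_stats: true   # assumption: surfaces DML as operations, not queries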

  • numerous-refrigerator-15664

    04/27/2023, 10:03 AM
    Hi team, sorry for the newbie question. I'm trying to ingest my Hive metastore. Since my Hive metastore is on MySQL and it's reachable, I'm considering using the presto-on-hive recipe. The problem is that I'm getting an error saying
    ERROR {datahub.entrypoints:192} - Command failed: Cannot open config file presto-on-hive.dhub.yaml
    when I try
    datahub ingest -c presto-on-hive.dhub.yaml
    . According to some threads in Slack, the reason seems to be that my DataHub docker container cannot read the YAML file in the host directory, but I still haven't found the answer. So my questions are: 1. Which container should be able to read my YAML file? datahub-gms? 2. Should I mount my host directory into the docker container? It says "For docker, we set docker-compose to mount
    ${HOME}/.datahub
    directory to
    /etc/datahub
    directory within the GMS containers." on this page: https://datahubproject.io/docs/plugins/#plugin-installation but it seems those changes are not reflected. Thank you in advance!
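    On the mounting question, a hedged docker-compose override sketch (file name and paths are hypothetical): a mount is only needed if the recipe is executed inside a container, e.g. UI-scheduled ingestion running in datahub-actions; if you run datahub ingest from a CLI installed on the host, the YAML path is resolved on the host and no mount is required.
    # docker-compose.override.yml (hypothetical)
    services:
      datahub-actions:
        volumes:
          - ./recipes:/etc/datahub/recipes   # host directory -> path visible inside the container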

  • incalculable-processor-75603

    04/27/2023, 11:03 AM
    Hi all, I created a PR to
    add ability to preserve dbt table identifier casing
    , but the Vercel bot reports that the deployment has failed. PR here: https://github.com/datahub-project/datahub/pull/7854 I have also tested the new code in my local environment and it works, so I don't know why this is happening, or when the PR will be ready to merge. What should I do next? Thanks for your advice!

  • fresh-dusk-60832

    04/27/2023, 12:44 PM
    Hey guys, I'm trying to ingest Athena metadata, but it's not working. This is my recipe:
    source:
        type: athena
        config:
            aws_region: us-east-1
            work_group: primary
            include_views: true
            include_tables: true
            catalog_name: dynamodb
            database: default
            query_result_location: 's3://xxx/xxx/'
    This catalog and database read data from my DynamoDB using the Athena Connector (Lambda). If I configure my recipe to grab the metadata from the default catalog (awsdatacatalog), it works perfectly. Any clue? Maybe the Athena connector only works with Data Source Type = AWS Glue Data Catalog?

  • rich-policeman-92383

    04/27/2023, 1:08 PM
    # DataHub version: v0.9.6.1
    # Source: dbt
    # dbt core version: 1.3.3
    Hello, we are using dbt to do some transformations with a Hive table as a source. After the transformation and tests are successfully executed, we use the DataHub CLI to emit dbt metadata + lineage into DataHub. The lineage presented in DataHub does not add the "Hive" dataset as a source; instead it creates and adds a new "DBT & Hive" dataset as the source. The problem is that the "Hive" dataset already has all the business metadata added, and we want it to be shown as the source instead of this new "DBT & Hive" dataset.
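    For reference, a hedged sketch of the dbt recipe setting that controls which warehouse platform the dbt nodes are mapped onto; target_platform should match the platform of the datasets you already ingested (hive here), and the artifact paths are placeholders:
    source:
      type: dbt
      config:
        manifest_path: ./target/manifest.json   # placeholder
        catalog_path: ./target/catalog.json     # placeholder
        target_platform: hive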

  • adamant-honey-44884

    04/27/2023, 4:54 PM
    I should have posted here to start with, so cross-posting now: https://datahubspace.slack.com/archives/CV2KB471C/p1682435333026279

  • clever-magician-79463

    04/27/2023, 5:05 PM
    Hi DataHub team, I'm getting the following error - datahub.ingestion.run.pipeline.PipelineInitError: Failed to find a registered source for type redshift: 'str' object is not callable. Attaching the full error log for your reference.
    exec-urn_li_dataHubExecutionRequest_626c2c02-d44e-4c50-99a1-6ebe0f06acf3.log

  • able-evening-90828

    04/27/2023, 5:52 PM
    We like the new Postgres improvement that can ingest from multiple databases in one Postgres instance. However, we found it a bit cumbersome to use, because by default the Postgres connection tries to connect to the database with the same name as the username. If such a database doesn't exist, the ingestion fails. In our case, we don't want to create a new database just to match the username we use. We think this can be easily addressed by connecting to the
    postgres
    database when listing the databases. So we would change the following line:
    engine = create_engine(url, **self.config.options)
    to something like below:
    engine = create_engine(self.config.get_sql_alchemy_url(database="postgres"), **self.config.options)
    If there is no objection, we will send a PR out to address this. @hundreds-photographer-13496 @gray-shoe-75895 @famous-waitress-64616

  • flat-painter-78331

    04/28/2023, 12:54 AM
    Hi team, I'm trying to integrate Airflow with DataHub. I'm running both DataHub and Airflow on my Kubernetes cluster and I've followed the exact steps mentioned in https://datahubproject.io/docs/lineage/airflow#using-datahubs-airflow-lineage-plugin, but none of the DAGs I've deployed are shown in DataHub, and the task logs of the DAGs do not show any DataHub logs. I'm on DataHub version 0.10.2. I've been struggling with this for days and I cannot figure out what I'm missing... Could you please help me resolve this? It would be much appreciated!

  • elegant-salesmen-99143

    04/28/2023, 12:30 PM
    Hi, can anyone please help me understand what
    platform_instance
    is in the Kafka connect docs? We have a working Kafka connection, but I want to enable stateful ingestion on it, and I can't without specifying a platform instance, and I'm not sure what that is. We're using Confluent Kafka - is that it? Should I write something like
    platform_instance: confluent
    ?
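    A hedged sketch of where platform_instance sits in a recipe, shown here for the plain kafka source (broker and registry addresses are placeholders). The value is simply a stable label you choose for this particular cluster - e.g. your Confluent cluster name - and stateful ingestion uses it to tell clusters apart:
    source:
      type: kafka
      config:
        connection:
          bootstrap: 'broker:9092'                             # placeholder
          schema_registry_url: 'http://schema-registry:8081'   # placeholder
        platform_instance: my-confluent-cluster                # any stable name you pick
        stateful_ingestion:
          enabled: true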

  • important-bear-9390

    04/28/2023, 4:18 PM
    Hello team! I'm trying to ingest Spark jobs (run in k8s) into DataHub. So far, I can only see downstream lineage, not upstream. Searching for problems like this, I found more people having the same issue. DataHub: 0.9.2, datahub-spark-lineage: 0.10.2 (other versions return errors like:
    ERROR DatasetExtractor: class org.apache.spark.sql.catalyst.plans.logical.Aggregate is not supported yet.
    ) Any tips on what I could do to solve this?

  • bright-waitress-5179

    04/28/2023, 6:39 PM
    Hello, I am trying to set up DataHub locally following the quickstart guide. I am able to navigate to http://localhost:9002 and set up the ingestions for snowflake, looker and lookml. However, I am getting errors for all three ingestions. For snowflake, I am seeing this error in the logs. Version:
    acryl-datahub, version 0.10.2.2
    'failures': [{'error': 'Unable to emit metadata to DataHub GMS',
                   'info': {'exceptionClass': 'com.linkedin.restli.server.RestLiServiceException',
                            'stackTrace': 'com.linkedin.restli.server.RestLiServiceException [HTTP Status:400]: Cannot parse request entity\n'
                                          '\tat com.linkedin.restli.server.RestLiServiceException.fromThrowable(RestLiServiceException.java:315)\n'
                                          '\tat com.linkedin.restli.server.BaseRestLiServer.buildPreRoutingError(BaseRestLiServer.java:202)',
                            'message': 'Cannot parse request entity',
                            'status': 400,
                            'id': 'urn:li:dataset:(urn:li:dataPlatform:snowflake,segment_prod.core_mobile_production.appointment_save,PROD)'}}]

  • bright-waitress-5179

    04/28/2023, 6:57 PM
    Hello, I am trying to set up DataHub locally following the quickstart guide. I am able to navigate to http://localhost:9002 and set up the ingestions for snowflake, looker and lookml. Both the looker and lookml ingestions are failing with this error. Version:
    acryl-datahub, version 0.10.2.2
    File "/tmp/datahub/ingest/venv-looker-0.10.2/lib/python3.10/site-packages/sqllineage/__init__.py", line 24, in _patch_updating_lateral_view_lexeme
        if regex("LATERAL VIEW EXPLODE(col)"):
    TypeError: 'str' object is not callable

  • purple-salesmen-12745

    04/29/2023, 6:59 PM
    Do you know a way to connect a thesaurus like https://agrovoc.fao.org/browse/agrovoc/en/ to the business glossary and keep it fresh? There is also a SPARQL endpoint available: https://agrovoc.fao.org/sparql

  • rich-policeman-92383

    05/01/2023, 8:09 AM
    # DataHub version: v0.9.6.1
    # DataHub CLI: 0.9.6.4
    Hello, is there any way to specify query_max_execution_time while using the trino source? I need to set it to 14400 seconds or more; right now the query gets timed out after 10 minutes. The Trino admins said that this property is configurable on the client side. Error:
    [2023-04-30 18:24:05,254] ERROR    {datahub.utilities.sqlalchemy_query_combiner:403} - Failed to execute queue using combiner: (trino.exceptions.TrinoQueryError) TrinoQueryError(type=INSUFFICIENT_RESOURCES, name=EXCEEDED_TIME_LIMIT, message="Query exceeded the maximum execution time limit of 10.00m"
    
    ["Profiling exception (trino.exceptions.OperationalError) error 404: b'Query not found'\n(Background on this error at: <https://sqlalche.me/e/14/e3q8>)"]
    Recipe yaml:
    source:
      type: "trino"
      config:
        host_port: ip:port
        database: hive_2
    
        username: tr
        password:
    
        schema_pattern:
          deny:
            - .*information_schema.*
          allow:
            - B
            - A
    
        table_pattern:
          allow:
            - hive_2.A.table1
            - hive_2.B.table2
       
    
        profiling:
          enabled: True
    
        profile_pattern:
          allow:
            - hive_2.A.table1
            - hive_2.B.table2
    
    transformers:
      - type: "simple_add_dataset_tags"
        config:
          tag_urns:
            - "urn:li:tag:1_0_prod_datalake"
    
    pipeline_name: "trino_hive_prod_to_datahub_prod"
    
    datahub_api:
      server: "<https://gms:8080>"
      token: 
      
      
    sink:
      type: "datahub-rest"
      config:
        server: "<https://gms:8080>"
        token:
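    A hedged sketch of one way to push the limit through to the client, assuming the source's options block is forwarded to SQLAlchemy's create_engine and that connect_args reaches the Trino DBAPI connection, which accepts session_properties; verify against your trino client version:
    source:
      type: "trino"
      config:
        host_port: ip:port
        database: hive_2
        username: tr
        options:
          connect_args:
            session_properties:
              query_max_execution_time: 4h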

  • bitter-evening-61050

    05/01/2023, 9:47 AM
    Hi team, I have Airflow integrated with DataHub. I have a DAG where a procedural query is called from Snowflake, with inlets and outlets. The lineage for this DAG is shown in DataHub, but the inlets and outlets are not pointing to the datasets that exist under the Snowflake platform; instead it creates its own datasets with no schema and no data. Can anyone help me with this issue?

  • elegant-nightfall-29115

    05/01/2023, 11:24 PM
    Hey team, I want to run an ingestion of a policies file to replace the default policies of DataHub at
    /datahub/datahub-gms/resources/policies.json
    ; however, the ingestion-cron pod can't find that path. The recipe file looks like this:
    source:
      type: file
      config:
        # Coordinates
        filename: ../policies.json
    
    sink:
      type: file
      config:
        filename: /datahub/datahub-gms/resources/policies.json
    I am trying to remove the
    MANAGE_INGESTION
    permission from all users so as to totally disable UI ingestion.

  • billions-baker-82097

    05/02/2023, 11:11 AM
    I was trying to ingest through the UI, using OTHERS as the type, and here's the recipe I have used:
    source:
      type: sqlalchemy
      config:
        env: DEV
        connect_uri: 'mysql+pymysql://datahub:datahub@host.docker.internal:3306'
        platform: mysql
        platform_instance: ""
        include_tables: true
        include_views: true
    sink:
      type: datahub-rest
      config:
        server: 'http://host.docker.internal:8080'

  • purple-printer-15193

    05/02/2023, 3:34 PM
    Hi all, I've granted all the Snowflake permissions as stated here: https://datahubproject.io/docs/generated/ingestion/sources/snowflake#prerequisites. Does the Snowflake database show up as one of the nodes in the lineage? Or is it because it's a data share that it wouldn't show up? I ask because I noticed that one of our tables queries the
    snowflake.account_usage.tag_references
    table, but I don't see this table in the lineage. The
    snowflake.account_usage.tag_references
    table also never gets ingested by our Snowflake ingestion recipe. Lastly, when I try to just ingest the
    SNOWFLAKE
    database I get an error like the one below:
    "source": {
        "type": "snowflake",
        "report": {
          "events_produced": 0,
          "events_produced_per_sec": 0,
          "entities": {},
          "aspects": {},
          "warnings": {},
          "failures": {
            "permission-error": [
              "No tables/views found. Please check permissions."
            ]
          },
    I can definitely see and query the
    snowflake.account_usage.tag_references
    table using the Snowflake UI, though, so I'm not sure it's really a permission error at all. Thanks.
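    On the last point, a hedged sketch of scoping a run to the SNOWFLAKE shared database via database_pattern (account and credentials elided); note that shared databases are granted differently from regular ones (e.g. via IMPORTED PRIVILEGES), so the role used by the recipe may lack access even when your own role can query it in the UI:
    source:
      type: snowflake
      config:
        account_id: my_account       # placeholder
        database_pattern:
          allow:
            - "SNOWFLAKE"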

  • fierce-animal-98957

    05/02/2023, 4:26 PM
    Hi team, we are using "DataHubValidationAction" to send assertion metadata to DataHub. We are running this from inside Databricks using Great Expectations, which uses the Spark engine. According to the documentation, this currently works only with "SqlAlchemyExecutionEngine". Does anyone know when this class will be enhanced to add Spark engine support? Is anything on the roadmap? https://datahubproject.io/docs/metadata-ingestion/integration_docs/great-expectations/#capabilities https://docs.greatexpectations.io/docs/integrations/integration_datahub/