# ingestion
  • f

    few-air-56117

    12/17/2021, 7:57 AM
Hi all, I tried to ingest data from BigQuery and generate lineage automatically; this is the config
    Copy code
    source:
      type: bigquery
      config:
        project_id: <project_id>
        include_table_lineage: True
    sink:
      type: "datahub-rest"
      config:
        server: "<http://localhost:8080>"
    The table/views are in datahub but the lineage button si not available. Am i missing something? Thx a lot 😄
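A quick way to verify whether lineage was actually written is to read the upstreamLineage aspect straight from GMS; a minimal sketch, assuming the quickstart GMS on localhost:8080 and a placeholder dataset URN:

# Sketch: read the upstreamLineage aspect for one dataset from GMS
# (the URN below is a placeholder; substitute a real BigQuery dataset URN).
import urllib.parse

import requests

GMS = "http://localhost:8080"
urn = "urn:li:dataset:(urn:li:dataPlatform:bigquery,my_project.my_dataset.my_table,PROD)"

resp = requests.get(
    f"{GMS}/aspects/{urllib.parse.quote(urn, safe='')}",
    params={"aspect": "upstreamLineage", "version": 0},
)
# A 404 here usually means no upstreamLineage aspect was written for that dataset.
print(resp.status_code, resp.text)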
  • n

    nice-planet-17111

    12/17/2021, 8:29 AM
Hi all, does DataHub support ingestion of BigQuery UDFs? I tried to do it, but it returns nothing (even if I set include_views: true)
  • f

    few-air-56117

    12/17/2021, 8:58 AM
Hi everyone, I have a question about BigQuery lineage: if I create a table (C) based on a view (B) that is based on another table (A), the lineage is not A -> B -> C but A -> C; the view is excluded. Is this normal?
  • g

    green-football-48146

    12/17/2021, 10:03 AM
Hi all, when we ingest metadata from hive, if it encounters an abnormality in some tables, the ingestion is interrupted. Is there any way to skip these abnormal tables when errors occur?
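One common workaround is to deny-list the known-problematic tables in the recipe so the rest of the run completes; a minimal sketch using Pipeline.create, where the host, credentials and regexes are placeholders:

# Sketch: skip known-bad Hive tables via table_pattern.deny (placeholders throughout).
from datahub.ingestion.run.pipeline import Pipeline

pipeline = Pipeline.create(
    {
        "source": {
            "type": "hive",
            "config": {
                "host_port": "localhost:10000",
                "table_pattern": {
                    # Regexes are matched against "<schema>.<table>".
                    "deny": ["my_db\\.broken_table_1", "my_db\\.broken_table_2"],
                },
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "http://localhost:8080"},
        },
    }
)
pipeline.run()
pipeline.raise_from_status()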
  • p

    proud-accountant-49377

    12/17/2021, 11:42 AM
    Hi everyone!😊 When I add a term to a field in my dataset’s schema, this term only appears in the editableSchema object ... is there any way for it to directly modify the schemaMetadata object and appear there? Thanks!
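If the goal is for the term to live in schemaMetadata rather than editableSchemaMetadata, one option is to emit it from the ingestion side; a minimal sketch with the Python emitter, assuming a reasonably recent acryl-datahub version, where the dataset, field and term names are all placeholders:

# Sketch: emit schemaMetadata with a glossary term attached to one field
# (all URNs, field names and types below are placeholders).
import time

from datahub.emitter.mce_builder import make_dataset_urn, make_term_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    AuditStampClass,
    ChangeTypeClass,
    GlossaryTermAssociationClass,
    GlossaryTermsClass,
    OtherSchemaClass,
    SchemaFieldClass,
    SchemaFieldDataTypeClass,
    SchemaMetadataClass,
    StringTypeClass,
)

now = AuditStampClass(time=int(time.time() * 1000), actor="urn:li:corpuser:ingestion")

schema = SchemaMetadataClass(
    schemaName="customers",
    platform="urn:li:dataPlatform:postgres",
    version=0,
    hash="",
    platformSchema=OtherSchemaClass(rawSchema=""),
    created=now,
    lastModified=now,
    fields=[
        SchemaFieldClass(
            fieldPath="customer_email",
            type=SchemaFieldDataTypeClass(type=StringTypeClass()),
            nativeDataType="VARCHAR",
            glossaryTerms=GlossaryTermsClass(
                terms=[GlossaryTermAssociationClass(urn=make_term_urn("Classification.PII"))],
                auditStamp=now,
            ),
        )
    ],
)

DatahubRestEmitter("http://localhost:8080").emit_mcp(
    MetadataChangeProposalWrapper(
        entityType="dataset",
        changeType=ChangeTypeClass.UPSERT,
        entityUrn=make_dataset_urn("postgres", "public.customers", "PROD"),
        aspectName="schemaMetadata",
        aspect=schema,
    )
)

Note that re-emitting schemaMetadata from outside the ingestion source will be overwritten on the next ingestion run, so in practice this usually belongs in a transformer or in the source itself.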
  • b

    best-planet-6756

    12/17/2021, 7:32 PM
    Hi all, I have ingested an Oracle DB and added a database alias to the recipe. Is there a way to query on the alias in graphql?
  • m

    millions-fall-80793

    12/20/2021, 4:17 AM
Hey guys, I am using this business glossary to ingest into DataHub (v0.8.18) via this recipe. All works fine. My question is: how do I purge/delete the glossary terms?
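If the CLI version in use includes the datahub delete command, that is probably the simplest route; otherwise, a minimal sketch of soft-deleting a single term by writing a status aspect against its URN. The term name is a placeholder, and it is an assumption that the glossaryTerm entity honours the status aspect in this version:

# Sketch: soft-delete one glossary term by writing Status(removed=True)
# (the term name in the URN below is a placeholder).
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import ChangeTypeClass, StatusClass

term_urn = "urn:li:glossaryTerm:Classification.Sensitive"

emitter = DatahubRestEmitter("http://localhost:8080")
emitter.emit_mcp(
    MetadataChangeProposalWrapper(
        entityType="glossaryTerm",
        changeType=ChangeTypeClass.UPSERT,
        entityUrn=term_urn,
        aspectName="status",
        aspect=StatusClass(removed=True),
    )
)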
  • b

    busy-zebra-64439

    12/20/2021, 9:01 AM
Hi team, I have an issue while setting up the ingestor. I have prepared the connection yml and I ran the ingestor command datahub ingest -c mysql_ingestor.yml, but it throws the error: mysql is disabled; try running: pip install 'acryl-datahub[mysql]'. When I run pip install 'acryl-datahub[mysql]', the error shows as: ERROR: Could not find a version that satisfies the requirement acryl-datahub[mysql] (from versions: none) ERROR: No matching distribution found for acryl-datahub[mysql]. How do I activate the mysql source for ingestion?
  • w

    witty-butcher-82399

    12/20/2021, 9:28 AM
Is there any chance that acryldata's pyhive fork could include the SparkSQL dialect? We have been testing the DataHub Hive connector with Spark Thrift Server, and pyhive requires those updates. From what I've been reading, the authors of pyhive (Dropbox) don't want to include the new dialect; instead they want Spark to be Hive compatible, as it claims to be. This can be noted here or here. Thanks! 🧵
  • f

    few-air-56117

    12/20/2021, 2:37 PM
Hi all, did anyone have speed problems when trying to ingest data from BigQuery with a one- to two-year gap between start date and end date?
  • r

    red-pizza-28006

    12/20/2021, 2:51 PM
I am noticing an issue with the Snowflake lineage. Here is an example - I created a temp table with a CTE to build the actual dataset, like this
    Copy code
    CREATE TEMP TABLE temp.temp_stone AS
    WITH upd_stone_response_messages AS
              (
                 SELECT DISTINCT srm.id AS stone_response_message_id
                 FROM src_payment.stone_response_messages srm
                 WHERE updated_at BETWEEN $date_start AND $date_end
                   AND request_type = 'authorize'
                   AND transaction_id IS NOT NULL
              )
Here you can see I have a dependency on src_payment.stone_response_messages, but when I look at the lineage UI, I only see that the dataset is built using temp_stone and nothing more than that. The SQL inside the CTE is not captured in the lineage.
  • m

    modern-monitor-81461

    12/20/2021, 3:09 PM
Hi all, I am writing a custom source (Iceberg in this case. I know it's on the roadmap, but I need it now and I'm using this to understand the DataHub internals) and I am having problems adding a MetadataChangeEvent with a SchemaMetadata aspect. It looks like something is rejected by the Avro validator, but it doesn't tell me what. Is there a trick to figure out what exactly is incompatible with the schema?
    Copy code
    File "/datahub/metadata-ingestion/src/datahub/cli/ingest_cli.py", line 82, in run
        pipeline.run()
    File "/datahub/metadata-ingestion/src/datahub/ingestion/run/pipeline.py", line 157, in run
        for record_envelope in self.transform(record_envelopes):
    File "/datahub/metadata-ingestion/src/datahub/ingestion/extractor/mce_extractor.py", line 46, in get_records
        raise ValueError(
    
    ValueError: source produced an invalid metadata work unit: MetadataChangeEventClass(...
  • m

    microscopic-elephant-47912

    12/20/2021, 8:40 PM
Hi all, I'm trying to ingest LookML files but I get an error. I looked around but could not find a solution or a bug report. Could you please check?
    looker-dwh-master.zip
  • m

    mysterious-lamp-91034

    12/23/2021, 5:07 AM
Hi, I have ingested 116,554 tables into DataHub; the web UI began to crash (waiting forever) once 10k tables were ingested. I am not sure what is going on. datahub docker check shows no issues. For context, I am running docker-compose.quickstart.yml on my dev machine.
  • a

    abundant-photographer-45796

    12/24/2021, 6:25 AM
I performed a Superset ingestion; the YAML is shown below
    Copy code
    source:
      type: superset
      config:
        # Coordinates
    connect_uri: http://localhost:8088
    
        # Credentials
        username: xxx
        password: xxx
        provider: db
    
    sink:
      # sink configs
      type: "datahub-rest"
      config:
        server: "<http://192.168.229.4:8080>"
Then I carried out the ingestion command:
Copy code
datahub ingest -c superset.yml
I get this hint, but on my DataHub homepage I can't see the charts. Can someone tell me why? Thank you
  • b

    busy-zebra-64439

    12/27/2021, 11:26 AM
Hi team, I am facing the below issue while trying to ingest Oracle data using the docker image. Kindly provide some help to resolve this issue.
Docker command: docker run 3d271c19a693 ingest --config /data/oracle_ingestor.yml
Error: DatabaseError: (cx_Oracle.DatabaseError) DPI-1047: Cannot locate a 64-bit Oracle Client library: "libclntsh.so: cannot open shared object file: No such file or directory". See https://cx-oracle.readthedocs.io/en/latest/user_guide/installation.html for help (Background on this error at: http://sqlalche.me/e/13/4xp6)
  • r

    rich-policeman-92383

    12/27/2021, 12:32 PM
Hello, can we specify a Hive queue name while ingesting metadata from a Hive source? In beeline we can do something like:
    Copy code
beeline -e "set tez.queue.name='myqueue'; describe formatted myschem.mytable;"
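For the SQLAlchemy-based Hive source, one possible route is to pass session configuration through connect_args, which PyHive forwards to the HiveServer2 session; a minimal, untested sketch where the host, user and queue name are placeholders, and the pass-through of options/connect_args is an assumption:

# Sketch: set a Tez queue for the Hive connection via SQLAlchemy connect_args
# (host, username and queue name below are placeholders).
from datahub.ingestion.run.pipeline import Pipeline

pipeline = Pipeline.create(
    {
        "source": {
            "type": "hive",
            "config": {
                "host_port": "hive-server:10000",
                "username": "etl_user",
                "options": {
                    "connect_args": {
                        "configuration": {"tez.queue.name": "myqueue"},
                    },
                },
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "http://localhost:8080"},
        },
    }
)
pipeline.run()
pipeline.raise_from_status()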
  • a

    agreeable-river-32119

    12/28/2021, 6:49 AM
Hello team, we currently run scheduler tasks with Apache DolphinScheduler. I found that you provide acryl-datahub[airflow] as a lineage component. How can I develop acryl-datahub[dolphinscheduler] for us? As a contributor to Apache DolphinScheduler, I want to participate in DataHub. 😊
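A minimal sketch of the kind of metadata such an integration could emit, mirroring how the Airflow plugin models a workflow as a DataFlow and each task as a DataJob; all names, URNs and datasets below are placeholders:

# Sketch: emit a DataFlow, a DataJob and its input/output lineage for one
# DolphinScheduler workflow run (placeholder names throughout).
from datahub.emitter.mce_builder import (
    make_data_flow_urn,
    make_data_job_urn,
    make_dataset_urn,
)
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    ChangeTypeClass,
    DataFlowInfoClass,
    DataJobInfoClass,
    DataJobInputOutputClass,
)

emitter = DatahubRestEmitter("http://localhost:8080")

flow_urn = make_data_flow_urn("dolphinscheduler", "daily_etl", "PROD")
job_urn = make_data_job_urn("dolphinscheduler", "daily_etl", "load_orders", "PROD")

mcps = [
    MetadataChangeProposalWrapper(
        entityType="dataFlow",
        changeType=ChangeTypeClass.UPSERT,
        entityUrn=flow_urn,
        aspectName="dataFlowInfo",
        aspect=DataFlowInfoClass(name="daily_etl"),
    ),
    MetadataChangeProposalWrapper(
        entityType="dataJob",
        changeType=ChangeTypeClass.UPSERT,
        entityUrn=job_urn,
        aspectName="dataJobInfo",
        aspect=DataJobInfoClass(name="load_orders", type="COMMAND"),
    ),
    MetadataChangeProposalWrapper(
        entityType="dataJob",
        changeType=ChangeTypeClass.UPSERT,
        entityUrn=job_urn,
        aspectName="dataJobInputOutput",
        aspect=DataJobInputOutputClass(
            inputDatasets=[make_dataset_urn("mysql", "shop.orders", "PROD")],
            outputDatasets=[make_dataset_urn("hive", "dw.orders", "PROD")],
        ),
    ),
]
for mcp in mcps:
    emitter.emit_mcp(mcp)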
  • b

    busy-zebra-64439

    12/28/2021, 1:55 PM
Hi team, I have some queries on the Oracle ingestion.
Issue 1: I faced the error "column at array position 0 fetched with error"; using include_views: False, this issue got fixed. Was this issue fixed in any version of the ingestion? Currently used docker image: linkedin/datahub-ingestion:head
Issue 2: we see that very few tables are cataloged - only about 1% of tables were catalogued during the ingestion process. Could someone please provide an update on this issue?
Sample yaml:
source:
  type: oracle
  config:
    # Coordinates
    host_port: localhost:2115
    # Credentials
    username: sampleuser
    password: sample
    service_name: SAMPLE
    include_views: False
    table_pattern:
      ignoreCase: False
    schema_pattern:
      ignoreCase: False
    view_pattern:
      ignoreCase: False
sink:
  type: "datahub-rest"
  config:
    server: "http://localhost:8080"
  • n

    nice-autumn-10105

    12/28/2021, 5:35 PM
Is anyone using the mssql ingestor with integrated auth rather than uid and pwd? Our database environment here only allows integrated auth.
  • c

    curved-magazine-23582

    12/28/2021, 8:38 PM
Hello team, where can I find more info about the current implementation of PK/FK support - info such as the UI, supported platforms / data stores, etc.?
  • l

    lemon-cartoon-14299

    12/29/2021, 12:13 AM
Hello all, I am pretty new to the DataHub tool and started off with Docker installed on my laptop. I was able to import some metadata from a Trino data source into DataHub. I have a couple of issues and would appreciate it if someone could help me here. 1. I turned on profiling for the Trino data source but I still don't see any stats around it. The stats and lineage tabs are always disabled. 2. Is there a way to set up lineage manually between data sources?
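On question 2, a minimal sketch of setting lineage by hand with the Python REST emitter; the dataset names below are placeholders:

# Sketch: declare that one dataset is derived from another (placeholder names).
from datahub.emitter.mce_builder import make_dataset_urn, make_lineage_mce
from datahub.emitter.rest_emitter import DatahubRestEmitter

lineage_mce = make_lineage_mce(
    upstream_urns=[make_dataset_urn("trino", "warehouse.raw.orders", "PROD")],
    downstream_urn=make_dataset_urn("trino", "warehouse.mart.orders_daily", "PROD"),
)

emitter = DatahubRestEmitter("http://localhost:8080")
emitter.emit_mce(lineage_mce)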
  • b

    better-orange-49102

    12/30/2021, 7:34 AM
For the command datahub docker quickstart, there is an option to build locally (i.e. --build-locally). Why does it still do a docker-compose pull if we specify building locally?
  • n

    nice-country-99675

    12/30/2021, 12:13 PM
👋 Hi team! I have a pretty vague question, and I would like to make it as concise as possible... I have a Redshift ingestion coded as an Airflow DAG, which runs a pipeline that looks like this
    Copy code
    pipeline = Pipeline.create(
            {
                "source": {
                    "type": source,
                    "config": {
                        "username": f"{conn.login}",
                        "password": f"{conn.password}",
                        "database": f"{conn.schema}",
                        "host_port": f"{conn.host}:{conn.port}",
                        "database_alias": alias,
                        "env": "PROD",
                        "schema_pattern": {
                            "deny": deny_schemas
                        },
                    },
                },
                "transformers": transformers,
                "sink": {
                    "type": "datahub-rest",
                    "config": {"server": f"{datahub.host}"},
                },
            }
        )
        pipeline.run()
        pipeline.raise_from_status()
    The thing is the DAG ends with
    Copy code
    {local_task_job.py:154} INFO - Task exited with return code Negsignal.SIGKILL
It's the only thing that is actually logged... it seems the task fails as soon as the process starts. At first I thought it was a memory issue; we increased the pod's memory, and now we are pretty far from the memory limit. It even fails when I run a dry_run. Locally it's working fine. Locally I'm using Airflow 2.2.2 while in production I'm using Airflow 2.2.1. I would really appreciate any suggestions...
  • g

    gentle-florist-49869

    12/30/2021, 2:42 PM
Hi team, does anyone here have an introductory tutorial for viewing DataHub logs/data in Elasticsearch or Kibana? Both are already up and working.
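As a starting point, a minimal sketch that lists the indices DataHub has created in the quickstart Elasticsearch; the localhost:9200 address is an assumption:

# Sketch: list Elasticsearch indices to see what DataHub has created.
import requests

resp = requests.get("http://localhost:9200/_cat/indices", params={"v": "true"})
print(resp.text)  # DataHub search indices typically end in "index_v2"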
  • d

    damp-ambulance-34232

    01/03/2022, 4:25 AM
Does DataHub support ingesting Hive tables stored in the Kudu format?
  • r

    red-pizza-28006

    01/03/2022, 12:52 PM
Hi team - I started ingesting data from Confluent Cloud Kafka but it seems to be super slow to ingest the data. Here is an example - you can see it takes about 2 minutes per topic, which is not scalable for us
    Copy code
    [2022-01-03 13:44:47,679] INFO     {datahub.cli.ingest_cli:81} - Starting metadata ingestion
    [2022-01-03 13:47:19,428] INFO     {datahub.ingestion.run.pipeline:77} - sink wrote workunit kafka-<topic1>
    [2022-01-03 13:49:50,867] INFO     {datahub.ingestion.run.pipeline:77} - sink wrote workunit kafka-<topic2>
  • b

    better-orange-49102

    01/03/2022, 1:27 PM
For the access tokens, do they work with the edit policies? I.e., if person A does not have edit rights to a dataset and passes in an MCE/MCP about that dataset to :9002/api/gms with his access token, will he be rejected? Also, if I do not allow users to generate their own tokens, can I still query for users' tokens in the backend (via custom UI code) and use them to ingest metadata on their behalf?
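For reference, a minimal sketch of ingesting through GMS with a personal access token so the request is evaluated as that user; the token value is a placeholder, and token support in the emitter assumes a recent acryl-datahub version:

# Sketch: REST emitter authenticated with a personal access token (placeholder value).
from datahub.emitter.rest_emitter import DatahubRestEmitter

emitter = DatahubRestEmitter(
    gms_server="http://localhost:8080",
    token="<personal-access-token>",
)
emitter.test_connection()  # raises if GMS is unreachable or rejects the token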
  • g

    gentle-florist-49869

    01/03/2022, 6:14 PM
Hi Team, happy new year - I'm trying to create a datahub-mae-consumer docker container via yml - https://github.com/linkedin/datahub/blob/master/docker/docker-compose.consumers.yml - but received the error:
***************************
APPLICATION FAILED TO START
***************************
Description:
Field systemAuthentication in com.linkedin.metadata.kafka.config.EntityHydratorConfig required a bean of type 'com.datahub.authentication.Authentication' that could not be found.
The injection point has the following annotations:
- @org.springframework.beans.factory.annotation.Autowired(required=true)
- @org.springframework.beans.factory.annotation.Qualifier(value=systemAuthentication)
Action:
Consider defining a bean of type 'com.datahub.authentication.Authentication' in your configuration.
2022/01/03 163547 Command exited with error: exit status 1
  • a

    adventurous-apple-98365

    01/04/2022, 2:14 AM
Hey all - wondering if anyone has any ideas about creating new tags when they are ingested as part of a dataset. When we ingest a dataset we are adding custom tags that don't yet exist in the GlobalTags aspect. The tag itself isn't ingested (it's not in the Elastic tag index, so we can't search!) but it does properly appear on the dataset and in the list of 'filter checkboxes' when viewing datasets. Is there any way to have the tag also created, other than ingesting the tag separately before ingesting the dataset? Not sure if it makes more sense to solve this in one place (within DataHub itself) versus in each of our ingestion plugins.
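A minimal sketch of the second option, creating the tag entity explicitly so it lands in the tag search index; the tag name and description are placeholders:

# Sketch: emit the tag entity itself, in addition to referencing it from datasets
# (tag name and description below are placeholders).
from datahub.emitter.mce_builder import make_tag_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import ChangeTypeClass, TagPropertiesClass

tag_urn = make_tag_urn("pii")

emitter = DatahubRestEmitter("http://localhost:8080")
emitter.emit_mcp(
    MetadataChangeProposalWrapper(
        entityType="tag",
        changeType=ChangeTypeClass.UPSERT,
        entityUrn=tag_urn,
        aspectName="tagProperties",
        aspect=TagPropertiesClass(name="pii", description="Contains personal data"),
    )
)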