# ingestion
  • r

    rapid-room-3050

    09/08/2021, 8:42 PM
    I am new to DataHub... looking to ingest metadata using the hive plugin; we currently have Presto + a Hive metastore. What is the best way to ingest the metadata in this case? Appreciate any help!
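    For reference, a minimal hive recipe for this kind of setup might look like the sketch below (host, port, and credentials are placeholders, and it assumes HiveServer2 is running in front of the metastore):
    Copy code
    source:
      type: hive
      config:
        # placeholder HiveServer2 endpoint and credentials -- adjust to your environment
        host_port: my-hiveserver2:10000
        username: user
        password: password
    sink:
      type: datahub-rest
      config:
        server: http://localhost:8080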
    l
    • 2
    • 7
  • a

    aloof-gigabyte-305

    09/09/2021, 5:21 PM
    Hi all, I'm new to using DataHub and trying to ingest some files as part of a POC. I'm trying to modify the example/demo_data using my own files, but I'm having trouble with it: "generate_demo_data.sh" depends on the file "all_covid19_datasets.json", which is not available. If I try to run "bigquery_covid19_to_file.yml", I don't have credentials for BigQuery. I assume that if I want to prepare my files using "enrich.py" I should follow the same (JSON) layout as the "all_covid19_datasets.json" file, but because that file is not available I can't figure out the layout. To recap, I'm trying to run "enrich.py" using my files as the input but can't figure out what the layout of the input file should be. I understand it has to be JSON, but I also understand there are different JSON layouts like "split", "table", etc. Any help or direction is much appreciated. Thanks!
    m
    • 2
    • 10
  • c

    curved-daybreak-29035

    09/10/2021, 10:35 AM
    Hi! I'm new to DataHub and trying some ingestion before implementing it for our company. How do I manage schemas and tables that are deprecated or renamed? I tried to follow some suggestions here, but it seems the links are gone, and I'm not exactly sure how to delete or soft-delete tables that are deprecated. Thanks in advance!
    w
    m
    • 3
    • 7
  • b

    brave-market-65632

    09/13/2021, 6:43 AM
    On the topic of lineage, our workloads are mostly Snowflake to Snowflake, the use cases being CTAS and merges. These are orchestrated by non-Airflow tools, including a custom-designed one. For such cases, I think the preferred way to capture lineage is via the REST API (https://github.com/linkedin/datahub/blob/master/metadata-ingestion/examples/library/lineage_emitter_rest.py)? Is there a way for the metadata-ingestion framework to parse SQL queries and pick up table-to-table lineage? Thanks!
    l
    • 2
    • 1
  • c

    calm-river-44367

    09/13/2021, 8:49 AM
    Hey everyone, I noticed an HDFS dataset in the demo, even though Hadoop is not among DataHub's supported sources. Does anyone have any idea how it was ingested, and can this be done for other unsupported sources like MinIO? I would be grateful for your help.
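    For what it's worth, the demo's sample data (including the HDFS dataset) appears to be loaded from hand-written MetadataChangeEvent JSON via the generic file source, so the same approach could work for MinIO; a rough sketch, with the filename as a placeholder:
    Copy code
    source:
      type: file
      config:
        # hypothetical path to a JSON file of hand-written MetadataChangeEvents
        filename: ./minio_datasets.json
    sink:
      type: datahub-rest
      config:
        server: http://localhost:8080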
    w
    f
    • 3
    • 7
  • f

    faint-hair-91313

    09/13/2021, 9:36 AM
    Guys, I've ingested a business glossary recipe and it does not show up in the output of
    datahub ingest list-runs
    , so I cannot roll back if needed. Does this happen on your side, too? The other recipes show up.
    g
    • 2
    • 6
  • r

    red-smartphone-15526

    09/13/2021, 11:32 AM
    Hey! In BigQuery we have a couple of sharded tables, where one new table is created each day, i.e. project.dataset.table_20210901, project.dataset.table_20210902, etc. Does anyone know how to avoid ingesting metadata for all of these tables? We would like to ingest metadata only for the latest shard (the one with today's date).
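    The allow/deny patterns are plain regexes and are evaluated statically, so a recipe can either deny the date-suffixed shards entirely, or be regenerated each day with an allow rule for today's shard; a hedged sketch of the deny variant, assuming the pattern is matched against dataset.table-style names:
    Copy code
    source:
      type: bigquery
      config:
        project_id: my-project          # placeholder
        table_pattern:
          deny:
            # drop every date-suffixed shard; to keep only today's shard you would
            # instead template an allow rule like 'db1\.table_<today>$' into the
            # recipe when it is generated
            - '.*_20\d{6}$'
    sink:
      type: datahub-rest
      config:
        server: http://localhost:8080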
    b
    g
    w
    • 4
    • 7
  • f

    faint-hair-91313

    09/13/2021, 1:47 PM
    Hi again, I am testing Business Glossary ingestion and lineage towards a dataset. I am able to see the dataset pointing to the business glossary, but not the other way around: Related Entities looks empty. I am running 0.8.12 and have also restored indices.
    p
    w
    m
    • 4
    • 14
  • c

    curved-jordan-15657

    09/13/2021, 2:08 PM
    Hello guys! I'm trying to use Airflow with a DataHub Kafka-based connection. I've configured the connection from the Airflow UI and set the Kafka broker to <my-broker>:9092. Since I'm using a schema registry, I put that in the "extra" field in JSON format. It's something like:
    Copy code
    {
      "sink": {
        "config": {
          "connection": {
            "schema_registry_url": "<my-schema-registry>:8081"
          }
        }
      }
    }
    But it gives me an error like:
    Copy code
    sink
      extra fields not permitted (type=value_error.extra)
    Also, above those lines in the log, I see:
    Copy code
    Host: <my-broker>:9092, Port: None, Schema: , Login: , Password: None, extra: {'sink': {'config': {'connection': {'schema_registry_url': '<my-schema-registry>:8081'}}}}
    So it gets my host but cannot get the port? I don't get it. Also, how can I resolve the "extra fields not permitted" problem? Thanks!
    w
    • 2
    • 4
  • c

    calm-river-44367

    09/14/2021, 8:31 AM
    Please, somebody help me out: how can I create users in DataHub in order to set policies?
    b
    • 2
    • 2
  • b

    blue-megabyte-68048

    09/15/2021, 3:20 PM
    Is there an ingester for Collibra, by chance?
    m
    c
    • 3
    • 7
  • t

    thousands-tailor-5575

    09/15/2021, 3:31 PM
    Hi, we are testing out the Redash metadata ingestion library. Is there a possibility to extract chart -> dataset lineage? It looks like only chart -> data platform lineage is being retrieved.
    m
    a
    s
    • 4
    • 15
  • l

    lively-laptop-26966

    09/15/2021, 11:06 PM
    Any news for Elasticsearch???
    m
    • 2
    • 3
  • m

    microscopic-musician-99632

    09/16/2021, 1:28 PM
    SAP HANA could not be found on the data sources page. Can you please let me know if it is supported or planned?
    w
    l
    • 3
    • 2
  • c

    curved-sandwich-81699

    09/16/2021, 2:45 PM
    Hi all! Is it possible to ingest Glossary Terms at the root level, without using nodes/folders as was done here for AccountBalance/CustomerAccount/SavingAccount? https://demo.datahubproject.io/browse/glossary
    m
    • 2
    • 2
  • b

    best-toddler-40650

    09/16/2021, 3:37 PM
    Hi, I put together a recipe to connect to our Oracle server, as below:
    Copy code
    source:
      type: oracle
      config:
        username: user
        password: password
        host_port: myserver:1522
        
        service_name: servicename
    which translates into the following DSN passed to the SQLAlchemy Oracle dialect:
    Copy code
    'dsn': '(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=myhost)(PORT=1522))(CONNECT_DATA=(SERVICE_NAME=myservicename)))'
    However, the Oracle server requires further parameters in the DSN in order to authenticate, as below:
    Copy code
    'dsn': '(description= (retry_count=20)(retry_delay=3)(address=(protocol=tcps)(port=1522)(host=myhost))(connect_data=(service_name=myservicename))(security=(ssl_server_cert_dn="CN=certsever, OU=Oracle, O=Oracle Corporation, L=City, ST=State, C=Country")))'
    Any idea how to insert these extra parameters in the Oracle recipe?
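    One possible workaround, sketched below on the assumption that the SQLAlchemy-based sources honour a sqlalchemy_uri override: define the full descriptor (retry, ssl_server_cert_dn, etc.) as an alias in tnsnames.ora and point the URI at that alias instead of host_port/service_name:
    Copy code
    source:
      type: oracle
      config:
        # assumption: sqlalchemy_uri takes precedence over host_port/service_name
        # and is handed straight to SQLAlchemy; "mydb_secure" would be a
        # tnsnames.ora alias carrying the full (description=...) string above
        sqlalchemy_uri: oracle+cx_oracle://user:password@mydb_secure
    sink:
      type: datahub-rest
      config:
        server: http://localhost:8080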
    w
    a
    • 3
    • 2
  • a

    aloof-gigabyte-305

    09/17/2021, 1:09 AM
    Hi! I'm ingesting a BusinessGlossary using yml files, as shown in "DataHub Product Updates: Aug 27 2021 Community Town Hall" (

    https://www.youtube.com/watch?v=dVqSZdN64sQ&t=310s

    ). I used the samples examples/bootstrap_data/business_glossary.yml and examples/recipes/business_glossary_to_datahub.yml. The ingestion seems to work fine; I get the green message "Pipeline finished successfully" with 0 failures, 0 warnings, and 68 records_written. But when I go to the UI, I cannot find any Glossary anywhere. Any idea?
    l
    m
    • 3
    • 7
  • e

    eager-answer-71364

    09/17/2021, 4:52 AM
    How do I get an MCE entity by specific aspects using the API? Currently I use the format <host>:8080/entities/<urn>, but it returns all aspects. Are there any API params to get a specific aspect?
    b
    m
    • 3
    • 4
  • h

    happy-orange-34018

    09/20/2021, 9:19 AM
    Hi team, I am trying to catalog MySQL in DataHub, and I want the datasets in DataHub to have some properties such as the table's creation time and a property indicating whether a dataset is a view or a table. How can I enable this in DataHub?
    l
    • 2
    • 3
  • s

    stocky-noon-61140

    09/20/2021, 9:47 AM
    Hi everyone - is it possible to restrict / manage ingestion privileges? In other words: with my locally installed version of DataHub, I don't need to authenticate to ingest data into DataHub. Is there a way to require username/password authentication before metadata can be ingested?
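    Not sure whether it applies to this deployment, but for setups where the metadata service (or a proxy in front of it) enforces authentication, the datahub-rest sink appears to accept a token field; a sketch with placeholder values:
    Copy code
    sink:
      type: datahub-rest
      config:
        server: http://datahub-gms:8080   # placeholder
        # assumption: an access token that GMS (with authentication enabled)
        # or a gateway in front of it will validate
        token: <your-access-token>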
    b
    b
    b
    • 4
    • 10
  • a

    average-bear-318

    09/20/2021, 10:33 AM
    Hi team, I store some metadata about our ETL jobs in Postgres. Is it possible to ingest the data, and not the metadata, for the Postgres table via DataHub?
    👍 1
    s
    m
    • 3
    • 2
  • q

    quiet-kilobyte-82304

    09/20/2021, 8:02 PM
    Does anyone have an example MCE for
    KeyValueSchema
    ? I'm looking into ingesting key/value pairs from Couchbase into DataHub. I'm more interested in seeing how fields that have
    isPartOfKey
    set to
    true
    are exposed on the UI https://github.com/linkedin/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/schema/SchemaField.pdl#L84
    l
    g
    m
    • 4
    • 11
  • p

    polite-flower-25924

    09/20/2021, 8:56 PM
    I'm facing the following error when running Hive ingestion with the latest linkedin/datahub-ingestion image
    (97bed71)
    . It works with older versions like
    c9c1ba4
    . I guess some libraries have changed and this needs to be fixed.
    l
    m
    g
    • 4
    • 12
  • m

    microscopic-musician-99632

    09/21/2021, 5:45 AM
    From two threads here, https://datahubspace.slack.com/archives/CUMUWQU66/p1629897623476100 and https://datahubspace.slack.com/archives/CUMUWQU66/p1620136722311400, I understand that DataHub supports OpenAPI-based metadata ingestion. Could you please give some guidance on how this is supported and how API metadata can be ingested? Also, is OData supported?
    s
    b
    • 3
    • 11
  • h

    high-hospital-85984

    09/21/2021, 10:00 AM
    Hi! I was playing around with
    datahub ingest list-runs
    and got presented with something unexpected. Most IDs are random GUIDs, and one suspiciously large run (by row count) is called
    no-run-id-provided
    . We primarily use the Kafka sink; is there a way of providing a human-readable name for the runs, for easier rollback?
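    For what it's worth, the recipe format appears to accept a top-level run_id field, which would give the run a readable name; a sketch, assuming the id survives the Kafka sink path end to end:
    Copy code
    run_id: nightly-warehouse-ingest      # hypothetical human-readable name
    source:
      type: bigquery                      # whichever source the run already uses
      config:
        project_id: my-project            # placeholder
    sink:
      type: datahub-kafka
      config:
        connection:
          bootstrap: broker:9092          # placeholder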
    m
    l
    • 3
    • 9
  • c

    calm-river-44367

    09/22/2021, 7:01 AM
    I want to delete a specific dataset. I tried doing so with the command in https://datahubproject.io/docs/how/delete-metadata#delete-by-urn, but it only deletes the rows that my dataset contains and not the dataset itself. Is there any way to hard delete in DataHub? Please let me know if you have any ideas.
    s
    g
    • 3
    • 4
  • c

    calm-sunset-28996

    09/22/2021, 3:20 PM
    Hey, I'm currently building an API gateway in front of the GMS endpoint. However, the RestEmitter does not accept custom/additional headers, so to get it to work I've monkey-patched it with my own addition. Would it be of interest to allow a param
    extra_headers
    or something similar, taking key/value pairs to add to the emitter? Then for future use cases people could add whatever they need.
    m
    b
    • 3
    • 16
  • l

    loud-vase-59377

    09/23/2021, 12:43 AM
    Has anyone tried ingesting from BigQuery? I was testing it out and can't make the patterns work. For starters, I am trying to ingest and profile one specific table. I tried this, but it ingests all tables in db1.
    Copy code
    source:
      type: bigquery
      config:
        project_id: "su-project1"
        schema_pattern: 
            allow: 
               - "db1"
        table_pattern: 
            allow:
               - "db1.table1"   
        profiling: {
          enabled: true,
        }   
        profile_pattern:
            allow:
               - "db1.table1" 
          
    sink:
      type: "datahub-rest"
      config:
        server: "<http://localhost:8080>"
    I tried removing the schema pattern and keeping the table pattern, but then it ingests all databases in the project. Are there other options I need to set?
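    In case it helps anyone landing here: the patterns are regular expressions, so an unescaped dot and a missing end anchor make them match more than intended; a hedged rewrite of the recipe above, assuming the table and profile patterns are matched against dataset.table-style names:
    Copy code
    source:
      type: bigquery
      config:
        project_id: "su-project1"
        schema_pattern:
          allow:
            - '^db1$'
        table_pattern:
          allow:
            - '^db1\.table1$'
        profiling:
          enabled: true
        profile_pattern:
          allow:
            - '^db1\.table1$'
    sink:
      type: "datahub-rest"
      config:
        server: "http://localhost:8080"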
    w
    a
    • 3
    • 4
  • r

    ripe-furniture-93265

    09/23/2021, 8:33 AM
    Hi all! Is there a way to configure a custom "cluster" name for Airflow metadata ingestion? I've successfully set up Airflow to send metadata and lineage information following the docs here with the default config, https://datahubproject.io/docs/metadata-ingestion#setting-up-airflow-to-use-datahub-as-lineage-backend, but it looks like the "cluster" part of the URN defaults to "prod", and we would like to configure this to be unique per Airflow environment.
    h
    b
    l
    • 4
    • 7
  • e

    eager-answer-71364

    09/23/2021, 11:03 AM
    I got an error when accessing the UI at <host>/browse/dataset. Looking at the logs in Docker, I see the error in the screenshot. Does anyone know which error this is and how to resolve it? Thanks.
    b
    • 2
    • 2