# ingestion
  • calm-river-44367

    10/27/2021, 8:54 AM
    I want to insert profiles for a dataset in the Stats section of DataHub. Is there a JSON file or another example I can use to see how to do this? I will appreciate your help.
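    For reference, a minimal sketch of emitting a DatasetProfile aspect with the Python REST emitter, assuming a GMS at http://localhost:8080 (the URN, counts and field stats below are illustrative):
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import (
        ChangeTypeClass,
        DatasetFieldProfileClass,
        DatasetProfileClass,
    )

    # Profile statistics for one table; timestampMillis is required.
    profile = DatasetProfileClass(
        timestampMillis=1635321600000,
        rowCount=4500,
        columnCount=2,
        fieldProfiles=[
            DatasetFieldProfileClass(
                fieldPath="name",
                nullCount=0,
                uniqueCount=4480,
                sampleValues=["alice", "bob"],
            )
        ],
    )

    emitter = DatahubRestEmitter("http://localhost:8080")
    emitter.emit_mcp(
        MetadataChangeProposalWrapper(
            entityType="dataset",
            changeType=ChangeTypeClass.UPSERT,
            entityUrn="urn:li:dataset:(urn:li:dataPlatform:hive,mydb.mytable,PROD)",
            aspectName="datasetProfile",
            aspect=profile,
        )
    )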
  • bland-teacher-17190

    10/27/2021, 1:21 PM
    Hi team, are there any more extensive examples of emitters? I am trying to get some SQL schemas into DataHub, but I can't find good examples of how to construct an MCE with schemas. All that's in examples/library are MCEs with upstream and downstream info.
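    A minimal sketch of an MCE that carries a SchemaMetadata aspect, emitted with the Python REST emitter (assumes a GMS at http://localhost:8080; the platform, dataset and fields are illustrative):
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import (
        AuditStampClass,
        DatasetSnapshotClass,
        MetadataChangeEventClass,
        NumberTypeClass,
        OtherSchemaClass,
        SchemaFieldClass,
        SchemaFieldDataTypeClass,
        SchemaMetadataClass,
        StringTypeClass,
    )

    audit = AuditStampClass(time=1635321600000, actor="urn:li:corpuser:ingestion")

    # Schema aspect: one numeric and one string column on a MySQL table.
    schema = SchemaMetadataClass(
        schemaName="mydb.customers",
        platform="urn:li:dataPlatform:mysql",
        version=0,
        hash="",
        platformSchema=OtherSchemaClass(rawSchema="CREATE TABLE customers (...)"),
        created=audit,
        lastModified=audit,
        fields=[
            SchemaFieldClass(
                fieldPath="id",
                type=SchemaFieldDataTypeClass(type=NumberTypeClass()),
                nativeDataType="BIGINT",
            ),
            SchemaFieldClass(
                fieldPath="email",
                type=SchemaFieldDataTypeClass(type=StringTypeClass()),
                nativeDataType="VARCHAR(255)",
                nullable=True,
            ),
        ],
    )

    mce = MetadataChangeEventClass(
        proposedSnapshot=DatasetSnapshotClass(
            urn="urn:li:dataset:(urn:li:dataPlatform:mysql,mydb.customers,PROD)",
            aspects=[schema],
        )
    )

    DatahubRestEmitter("http://localhost:8080").emit_mce(mce)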
  • blue-holiday-20644

    10/27/2021, 1:39 PM
    Hi- has anyone managed to get lineage ingestion working with AWS Managed Airflow? MWAA configuration options seem quite limited.
  • curved-jordan-15657

    10/27/2021, 3:38 PM
    Hello team, do you have any Airflow DAG code to roll back or delete all datasets, or any other suggestions? I gather you are planning to support deleting entities via GraphQL; how is that going?
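    A hedged sketch of one option in the meantime: soft-delete entities by emitting the Status aspect with removed=True for each URN (assumes a GMS at http://localhost:8080; the URN list is illustrative):
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import ChangeTypeClass, StatusClass

    emitter = DatahubRestEmitter("http://localhost:8080")

    # Soft-delete: the entities disappear from the UI, but their rows remain in MySQL.
    urns_to_remove = [
        "urn:li:dataset:(urn:li:dataPlatform:hive,mydb.old_table,PROD)",
    ]

    for urn in urns_to_remove:
        emitter.emit_mcp(
            MetadataChangeProposalWrapper(
                entityType="dataset",
                changeType=ChangeTypeClass.UPSERT,
                entityUrn=urn,
                aspectName="status",
                aspect=StatusClass(removed=True),
            )
        )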
  • victorious-dream-46349

    10/27/2021, 4:27 PM
    Hi team, for our use case we are writing a terraform provider for DataHub (and yes, we will open-source it once it matures). Something like the below will add a dataset for BigQuery with its schema defined.
    resource "datahub_dataset" "test" {
      platform = "bigquery"
      name = "testdataset"
      origin = "DEV"
      owner = "dinesh"
      schema_name = "test"
      
      field {
        field_path = "name"
        native_datatype = "String()"
        recursive = false
        nullable = true
      }
    
      field {
        field_path = "address"
        native_datatype = "String()"
        recursive = true
        nullable = false
      }
    
      tags = ["tag1", "tag2"]
      upstreams = [
        datahub_dataset.gcs.id,
        "urn:li:dataset:(urn:li:dataPlatform:gcs,testgcs,DEV)"
      ]
    }
    Question: for this Terraform provider we are using the REST-API-based endpoints. Will that scale? Should we use the GraphQL-based APIs instead?
  • calm-morning-92759

    10/27/2021, 4:42 PM
    Hello everyone, in Google Cloud it is possible to define labels for resources. So far we have not been able to query these labels via the API and save them in DataHub. Is there a suitable solution to this problem? Any information is very welcome. Thank you very much.
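    A hedged sketch of one possible approach: read a BigQuery table's labels with the google-cloud-bigquery client and attach them to the matching DataHub dataset as tags (the GMS address, project/dataset/table names and the tag-naming scheme are illustrative):
    from google.cloud import bigquery

    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import (
        ChangeTypeClass,
        GlobalTagsClass,
        TagAssociationClass,
    )

    client = bigquery.Client(project="my-project")
    table = client.get_table("my-project.my_dataset.my_table")

    # table.labels is a dict of the labels defined on the BigQuery table.
    tags = GlobalTagsClass(
        tags=[
            TagAssociationClass(tag=f"urn:li:tag:{key}_{value}")
            for key, value in (table.labels or {}).items()
        ]
    )

    DatahubRestEmitter("http://localhost:8080").emit_mcp(
        MetadataChangeProposalWrapper(
            entityType="dataset",
            changeType=ChangeTypeClass.UPSERT,
            entityUrn="urn:li:dataset:(urn:li:dataPlatform:bigquery,my-project.my_dataset.my_table,PROD)",
            aspectName="globalTags",
            aspect=tags,
        )
    )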
  • acceptable-vr-75043

    10/27/2021, 10:31 PM
    If we have multiple Snowflake instances and MySQL clusters, what's the best way to handle naming collisions? The dataset URN consists of platform (snowflake or mysql), name (db.schema.table for Snowflake, db.table for MySQL) and fabric, none of which includes the Snowflake/MySQL cluster name. (https://datahubproject.io/docs/what/urn/)
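    One possible workaround is to fold the instance or cluster name into the name part of the URN; a small illustrative sketch (the instance prefixes are made up):
    from datahub.emitter.mce_builder import make_dataset_urn

    # Prefix the dataset name with the cluster/instance it came from so that
    # URNs stay unique across multiple instances of the same platform.
    urn_a = make_dataset_urn("snowflake", "acct_us_east.db.schema.table", "PROD")
    urn_b = make_dataset_urn("mysql", "orders_cluster.db.table", "PROD")

    print(urn_a)  # urn:li:dataset:(urn:li:dataPlatform:snowflake,acct_us_east.db.schema.table,PROD)
    print(urn_b)  # urn:li:dataset:(urn:li:dataPlatform:mysql,orders_cluster.db.table,PROD)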
  • agreeable-hamburger-38305

    10/27/2021, 11:34 PM
    I am trying to get the rendered YAML with ingestion-cron enabled, but am getting this error. Not super familiar with Helm, can someone help? Thanks!!
    >> helm install datahub datahub/datahub --set-string datahub-ingestion-cron.enabled=true --dry-run
    >> dependencies.go:49: Warning: Condition path 'datahub-ingestion-cron.enabled' for chart datahub-ingestion-cron returned non-bool value
  • rhythmic-sundown-12093

    10/28/2021, 8:34 AM
    Hi, I have two tables related by foreign keys in MySQL. I imported them into DataHub, but their relationship is not shown on the page. The foreign key data is actually present in the metadata_aspect_v2 table in the datahub database. I have put the process I executed in the log.txt file, please check @loud-island-88694 @little-megabyte-1074
    log.txt
  • red-pizza-28006

    10/28/2021, 9:17 AM
    Trying to understand something: if I ingest all users from Azure AD using this recipe - https://datahubproject.io/docs/metadata-ingestion/source_docs/azure-ad - without enabling SSO, will those users be able to log in to DataHub as well?
  • damp-ambulance-34232

    10/28/2021, 11:26 AM
    How can I exclude a Kudu table in my hive.yml ingestion file?
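    A hedged sketch of excluding such tables with the source's table_pattern deny list, shown here programmatically via Pipeline.create; the same allow/deny block can go under source.config in hive.yml (the host, regex and sink below are illustrative):
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(
        {
            "source": {
                "type": "hive",
                "config": {
                    "host_port": "localhost:10000",
                    # Drop anything that looks like a Kudu-backed table.
                    "table_pattern": {"deny": [".*kudu.*"]},
                },
            },
            "sink": {
                "type": "datahub-rest",
                "config": {"server": "http://localhost:8080"},
            },
        }
    )
    pipeline.run()
    pipeline.raise_from_status()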
  • red-pizza-28006

    10/28/2021, 2:37 PM
    Another question regarding setting up OIDC using Azure AD: it asks for a discovery URL for Azure AD, any idea what it would be? This page (https://docs.microsoft.com/en-us/azure/active-directory/develop/v2-protocols-oidc) shows the path to be
    /.well-known/openid-configuration
    but what would be the full URL?
  • square-painting-93399

    10/28/2021, 4:50 PM
    Hello all, I am a bit confused about updating an entity. My organization is currently running our DataHub instance on a k8s cluster. I initially ingested our AWS Athena data via a recipe file. We had a table schema update and I would like to update it in DataHub. When I rerun the same recipe file via the "datahub ingest -c" command, I get errors. Is this the practice I should be following? Should I be deleting the entity beforehand? I searched the documentation for a while and was not able to find it. Apologies if it's there 🙂
  • acceptable-greece-56919

    10/28/2021, 5:19 PM
    Hello everyone, does DataHub support a plugin for Presto version 347? Does the Superset plugin support authentication using OpenID? Thank you very much!
  • damp-ambulance-34232

    10/29/2021, 3:59 AM
    Got this error when ingesting from Superset. I can ingest some dashboards but not all:
    File "/usr/local/lib/python3.6/dist-packages/datahub/entrypoints.py", line 91, in main
        sys.exit(datahub(standalone_mode=False, **kwargs))
    File "/usr/lib/python3/dist-packages/click/core.py", line 722, in __call__
        return self.main(*args, **kwargs)
    File "/usr/lib/python3/dist-packages/click/core.py", line 697, in main
        rv = self.invoke(ctx)
    File "/usr/lib/python3/dist-packages/click/core.py", line 1066, in invoke
        return _process_result(sub_ctx.command.invoke(sub_ctx))
    File "/usr/lib/python3/dist-packages/click/core.py", line 1066, in invoke
        return _process_result(sub_ctx.command.invoke(sub_ctx))
    File "/usr/lib/python3/dist-packages/click/core.py", line 895, in invoke
        return ctx.invoke(self.callback, **ctx.params)
    File "/usr/lib/python3/dist-packages/click/core.py", line 535, in invoke
        return callback(*args, **kwargs)
    File "/usr/local/lib/python3.6/dist-packages/datahub/cli/ingest_cli.py", line 58, in run
        pipeline.run()
    File "/usr/local/lib/python3.6/dist-packages/datahub/ingestion/run/pipeline.py", line 125, in run
        for wu in self.source.get_workunits():
    File "/usr/local/lib/python3.6/dist-packages/datahub/ingestion/source/superset.py", line 339, in get_workunits
        yield from self.emit_dashboard_mces()
    File "/usr/local/lib/python3.6/dist-packages/datahub/ingestion/source/superset.py", line 250, in emit_dashboard_mces
        dashboard_data
    File "/usr/local/lib/python3.6/dist-packages/datahub/ingestion/source/superset.py", line 213, in construct_dashboard_from_api_data
        position_data = json.loads(raw_position_data)
    File "/usr/lib/python3.6/json/__init__.py", line 348, in loads
        'not {!r}'.format(s.__class__.__name__))
    
    TypeError: the JSON object must be str, bytes or bytearray, not 'NoneType'
  • handsome-belgium-11927

    10/29/2021, 8:19 AM
    Hello everyone! Is there any information on how to construct a field URN? I'm trying to ingest foreign keys, and the error says that the field URN is invalid; I haven't seen any reference to it before.
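    For what it's worth, a schemaField URN nests the parent dataset URN plus the field path; a small illustrative sketch (the dataset and field names are made up, and the helper is defined locally rather than taken from the library):
    def make_schema_field_urn(dataset_urn: str, field_path: str) -> str:
        # A schemaField URN wraps the parent dataset URN and the field path.
        return f"urn:li:schemaField:({dataset_urn},{field_path})"

    dataset_urn = "urn:li:dataset:(urn:li:dataPlatform:mysql,mydb.orders,PROD)"
    print(make_schema_field_urn(dataset_urn, "customer_id"))
    # urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:mysql,mydb.orders,PROD),customer_id)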
  • tall-controller-60779

    10/29/2021, 11:32 AM
    Hello everyone. Is it possible to ingest a DataJobSnapshot via POST request? It works successfully for datasets and other entities, but with DataJobSnapshot I always receive a "union type is not backed by a DataMap or null" error.
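    A hedged sketch of what the POST payload might look like, with the snapshot and each aspect keyed by their fully-qualified class names (the usual cause of that union error); the GMS address, flow/job ids and aspect values are illustrative:
    import requests

    payload = {
        "entity": {
            "value": {
                # The snapshot union member must be keyed by its fully-qualified name.
                "com.linkedin.metadata.snapshot.DataJobSnapshot": {
                    "urn": "urn:li:dataJob:(urn:li:dataFlow:(airflow,my_dag,prod),my_task)",
                    "aspects": [
                        {
                            # Each aspect is likewise keyed by its fully-qualified name.
                            "com.linkedin.datajob.DataJobInfo": {
                                "name": "my_task",
                                "type": {"string": "SQL"},
                            }
                        }
                    ],
                }
            }
        }
    }

    resp = requests.post(
        "http://localhost:8080/entities?action=ingest",
        json=payload,
        headers={"X-RestLi-Protocol-Version": "2.0.0"},
    )
    resp.raise_for_status()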
  • nice-planet-17111

    10/29/2021, 12:51 PM
    Hi everyone, I want to know what queries are behind SQL profiling. Does anyone know where I can find the related logs, or what queries are run behind it? 🙂
  • better-orange-49102

    10/29/2021, 1:05 PM
    I noticed that inside the DatasetProfile class in schema_classes there are two attributes, eventGranularity and partitionSpec, but they don't seem to be populated when I run a data profiling job on a Postgres table. Are they actually in use, or are they "for future use"? I'm asking because I'm trying to implement my own profiler using pandas-profiling with a JDBC source.
  • damp-ambulance-34232

    11/01/2021, 8:05 AM
    Hi, when a new table shows up in my database, does DataHub automatically ingest the new table?
  • dazzling-notebook-2883

    11/01/2021, 9:50 AM
    Hello everyone, I was wondering if there is any community effort ongoing to import ontology in W3C format (Turtle, JSON-LD) to DataHub Business Glossary?
  • damp-minister-31834

    11/01/2021, 11:04 AM
    Hi all. I found that there is a "Queries" button on the dataset page, but it is always greyed out even after I run some queries. How can I get the button enabled? (I am using the Hive source as a test.)
  • red-pizza-28006

    11/01/2021, 3:54 PM
    I am looking to ingest data from Salesforce. I noticed that they have an API to describe all objects within Salesforce, and the response looks like this:
    {
      "size": 112,
      "totalSize": 112,
      "done": true,
      "queryLocator": null,
      "entityTypeName": "FieldDefinition",
      "records": [
        {
          "attributes": {
            "type": "FieldDefinition",
            "url": "/services/data/v53.0/tooling/sobjects/FieldDefinition/MessagingSession.Id"
          },
          "DataType": "Lookup()",
          "Description": null
        },
        {
          "attributes": {
            "type": "FieldDefinition",
            "url": "/services/data/v53.0/tooling/sobjects/FieldDefinition/MessagingSession.Owner"
          },
          "DataType": "Lookup(User,Group)",
          "Description": null
        },
        {
          "attributes": {
            "type": "FieldDefinition",
            "url": "/services/data/v53.0/tooling/sobjects/FieldDefinition/MessagingSession.IsDeleted"
          },
          "DataType": "Checkbox",
          "Description": null
        },
        {
          "attributes": {
            "type": "FieldDefinition",
            "url": "/services/data/v53.0/tooling/sobjects/FieldDefinition/MessagingSession.Name"
          },
          "DataType": "Auto Number",
          "Description": null
        }
      ]
    }
    As you can see, the DataType is not similar to what we have in other languages. Would DataHub be able to handle this if I manually ingest it by putting it in an ingestible file?
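    It likely can, since nativeDataType is free-form text and only the structured type needs a mapping you define yourself; a hedged sketch of turning such records into SchemaFields (the fallback mapping and field-path derivation are illustrative):
    from datahub.metadata.schema_classes import (
        BooleanTypeClass,
        NumberTypeClass,
        SchemaFieldClass,
        SchemaFieldDataTypeClass,
        StringTypeClass,
    )

    # Map Salesforce DataType strings onto DataHub's structured types,
    # falling back to string for anything unrecognised (e.g. "Lookup(User,Group)").
    TYPE_MAP = {
        "Checkbox": BooleanTypeClass,
        "Number": NumberTypeClass,
    }

    def to_schema_field(record: dict) -> SchemaFieldClass:
        native = record["DataType"]
        type_class = TYPE_MAP.get(native, StringTypeClass)
        # ".../FieldDefinition/MessagingSession.Id" -> "Id"
        field_path = record["attributes"]["url"].rsplit("/", 1)[-1].split(".")[-1]
        return SchemaFieldClass(
            fieldPath=field_path,
            type=SchemaFieldDataTypeClass(type=type_class()),
            nativeDataType=native,  # keep the raw Salesforce type for display
            description=record.get("Description"),
        )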
  • curved-jordan-15657

    11/01/2021, 5:03 PM
    Hi team, I have a question about how metadata is stored in MySQL. Why does DataHub keep every version of pipeline metadata instead of overwriting it? We have many DAGs, some of them running every 5 or 10 minutes, and when I checked MySQL I saw 1500 versions for a specific task created within 4 days.
  • important-camera-38424

    11/01/2021, 7:46 PM
    Hello everyone. I'm excited about the potential of DataHub. I successfully ran ingestion from Glue but find no lineage for most of our jobs. The curious thing is that under Platforms I see two boxes: Glue with 315 objects and S3 with 4. Only the ones listed under S3 give me lineage; none of the Glue jobs do. All our sources and targets are in S3. Is there a setting or config required to extract lineage from Glue jobs? Any hints would be appreciated.
  • victorious-dream-46349

    11/02/2021, 11:03 AM
    Please refer to this question related to ingestion using the REST API.
  • witty-butcher-82399

    11/02/2021, 5:31 PM
    Hi! I haven't found any mention of partition columns in the SchemaField metadata or any other schema-related PDL (https://github.com/linkedin/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/schema/SchemaField.pdl). Is there any plan to include such info in the model?
  • damp-minister-31834

    11/03/2021, 3:26 AM
    Hi all! I want to ask about how to ingest metadata automatically. I ingest the Hive source using datahub ingest -c hive_to_rest.yml, but when my data is updated in Hive, do I need to run the command again to update the metadata in DataHub? If I want continuous updates once ingested, what should I do?
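    Batch ingestion is pull-based, so it is usually just re-run on a schedule; a hedged sketch of wrapping the same recipe in an Airflow DAG (Airflow 2.x import paths; the recipe path and interval are illustrative):
    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    # Re-run the same recipe on a schedule so new or changed Hive tables get picked up.
    with DAG(
        dag_id="datahub_hive_ingestion",
        start_date=datetime(2021, 11, 1),
        schedule_interval=timedelta(hours=6),
        catchup=False,
    ) as dag:
        ingest = BashOperator(
            task_id="ingest_hive",
            bash_command="datahub ingest -c /opt/recipes/hive_to_rest.yml",
        )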
  • sparse-planet-56664

    11/03/2021, 8:36 AM
    Has anyone tried ingesting thoughtspot data?
  • acceptable-eye-63357

    11/03/2021, 8:38 AM
    I see iceberg table support is on the roadmap for this quarter. Any idea how it’s going and when it will be available? Also what is planned for Great Expectations?