https://datahubproject.io
# ingestion
  • a

    agreeable-hamburger-38305

    10/05/2021, 11:11 PM
    Hi, I’ve been ingesting from BigQuery with profiling enabled. Does anyone know why it’s successful for some of the tables? The run finishes with no failures or warnings, but only some of the tables in the UI have column-level stats, others just show “no data”
    l
    • 2
    • 3
  • a

    agreeable-hamburger-38305

    10/06/2021, 5:39 AM
    Is it possible to skip columns with a certain name while doing SQL profiling?
    b
    r
    • 3
    • 3
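Regarding the profiling question above: SQL-based sources of this era expose a `profile_pattern` allow/deny list that controls which tables get profiled; a column-level deny is not clearly available in this version, so the sketch below is table-level only (field names assumed from the BigQuery source config, regex values illustrative):

```yaml
source:
  type: bigquery
  config:
    profiling:
      enabled: true
    # Regexes controlling which tables are profiled at all;
    # skipping individual columns by name may require newer versions or code changes.
    profile_pattern:
      deny:
        - ".*\\.tmp_.*"
```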
  • b

    boundless-room-44377

    10/06/2021, 8:45 PM
    Hi, is it possible to create lineage MCEs between ML models and Datasets/FeatureTables? Looking at this code in the Python lib:
    Copy code
    from typing import List

    from datahub.metadata.schema_classes import (
        DatasetLineageTypeClass,
        DatasetSnapshotClass,
        MetadataChangeEventClass,
        UpstreamClass,
        UpstreamLineageClass,
    )

    def make_lineage_mce(
        upstream_urns: List[str],
        downstream_urn: str,
        lineage_type: str = DatasetLineageTypeClass.TRANSFORMED,
    ) -> MetadataChangeEventClass:
        mce = MetadataChangeEventClass(
            proposedSnapshot=DatasetSnapshotClass(
                urn=downstream_urn,
                aspects=[
                    UpstreamLineageClass(
                        upstreams=[
                            UpstreamClass(
                                dataset=upstream_urn,
                                type=lineage_type,
                            )
                            for upstream_urn in upstream_urns
                        ]
                    )
                ],
            )
        )
        return mce
    it looks like a
    DatasetSnapshotClass
    is required when defining the downstream URN, while I would like to be able to set that as an ML Model or even a FeatureTable
    g
    • 2
    • 8
  • r

    rough-eye-60206

    10/06/2021, 9:16 PM
    Hi, can I ingest metadata with just descriptions for tables, kept in a separate file (the tables' columns/datatypes have already been ingested from Hive)? Can I achieve this using APIs/emitters? Could someone please provide an example?
    g
    m
    • 3
    • 3
  • b

    better-orange-49102

    10/07/2021, 7:20 AM
    I noticed that the browse path for datasets is in the format
    /prod/source/dataset_name
    ; I'm just wondering if there are any implications of renaming it so that it becomes
    /prod/source/x
    , where x is a short string that is not necessarily unique. I don't see any issues navigating to the datasets if I use the same value of x for all datasets.
    g
    b
    • 3
    • 4
  • a

    agreeable-hamburger-38305

    10/07/2021, 5:43 PM
    Hi all! I’m having trouble ingesting sample queries from BigQuery. Monthly queries and Top Users are successfully shown in the UI, but “No Sample Queries” in the “Queries” tab. Does anyone know what the reason might be? I made sure the table was queried in the past day, and I have access to the Project History Query in the BigQuery GCP console. Here’s the yaml I used:
    Copy code
    source:
      type: bigquery-usage
      config:
        env: "DEV"
        projects:
          - <project-id>
        top_n_queries: 10
    
    sink:
      type: "datahub-rest"
      config:
        server: "http://localhost:8080"
    m
    m
    • 3
    • 12
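For reference on the sample-queries question above: the bigquery-usage source of this era also accepts an explicit usage time window, which can matter when queries fall just outside the default lookback. Exact field names may vary by version, so treat this as a sketch:

```yaml
source:
  type: bigquery-usage
  config:
    env: "DEV"
    projects:
      - <project-id>
    top_n_queries: 10
    # Explicit usage window (defaults cover roughly the last day).
    start_time: "2021-10-06T00:00:00Z"
    end_time: "2021-10-07T00:00:00Z"
```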
  • g

    gentle-father-80172

    10/07/2021, 10:40 PM
    Hi all - question - I see that
    Looker
    metadata ingestion has a nice set of permissions so that I can query the API. However,
    LookML
    ingestion via the API requires Admin-level access. Is there a specific reason for this? I'm trying to get Admin access from my boss, so any extra info would be useful! Thanks! 😄
    m
    • 2
    • 3
  • c

    calm-sunset-28996

    10/08/2021, 12:35 PM
    Hey, at the moment we have some Qlik dashboards ingested and we would like to add lineage for them. However we don't have charts yet. I saw that you can't add lineage to dashboards, only to charts. Would it be a big refactor to add this? I can have a look at this if it's limited in scope.
    plus1 2
    l
    m
    +2
    • 5
    • 7
  • c

    clean-piano-28976

    10/08/2021, 2:23 PM
    Hi all 👋🏾– I wonder if anyone can help as I’m having difficulties ingesting data from Looker (both
    Looker
    and
    LookML
    )?
    ✅ 1
    m
    • 2
    • 7
  • n

    numerous-yak-58823

    10/08/2021, 7:43 PM
    Any plans for integrating Spline in DataHub? It would unlock really powerful features like Spark automatic lineage even at column level.
    👍 1
    l
    s
    • 3
    • 8
  • m

    mammoth-lawyer-49919

    10/11/2021, 3:00 AM
    Hi all, I am ingesting metadata for LookML using version 0.8.14. It looks like there is an issue with the db_name value for the Athena platform. I am getting the URN urn:li:dataPlatform:athena,.db_name.table_name,PROD (note the . before db_name), when it should ideally come out as urn:li:dataPlatform:athena,db_name.table_name,PROD. I think the line below should be changed to include Athena as well: https://github.com/linkedin/datahub/blob/ef2182f1021c02a72733dfaa058add10f14de0c7/metadata-ingestion/src/datahub/ingestion/source/lookml.py#L735
    m
    • 2
    • 3
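The malformed URN above suggests an empty db_name being joined to the table name with an unconditional dot. A hypothetical helper (not the actual lookml.py code) showing the intended join:

```python
def make_dataset_urn(platform: str, db_name: str, table_name: str, env: str = "PROD") -> str:
    """Build a dataset URN, skipping the separator dot when db_name is empty."""
    name = f"{db_name}.{table_name}" if db_name else table_name
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"
```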
  • b

    bland-orange-13353

    10/11/2021, 7:56 AM
    This message was deleted.
    s
    w
    g
    • 4
    • 3
  • w

    witty-keyboard-20400

    10/11/2021, 12:57 PM
    While trying to capture collections metadata from Mongodb, I faced this error:
    Copy code
    OperationFailure: BSONObj size: 17365481 (0x108F9E9) is invalid. Size must be between 0 and 16793600(16MB)
    Is there any configuration that lets DataHub skip documents with larger field values?
    m
    • 2
    • 3
  • a

    agreeable-hamburger-38305

    10/11/2021, 8:20 PM
    A question I got from a coworker: how easy/hard would it be to add a new custom/unsupported data source?
    m
    b
    +3
    • 6
    • 6
  • f

    fresh-fish-73471

    10/11/2021, 9:09 PM
    Query: How do I change the 8080 port for datahub-gms in a Docker-based installation?
    Tried approaches:
    1. Changed the port for datahub-gms in docker-compose.yml before container creation. Container creation succeeded, but the error below occurred when trying to access the datahub-gms REST backend.
    2. Installed through quickstart.sh after making changes to the supposedly required files. Container creation succeeded, but the same error occurred again.
    3. Tried alternate open ports, with the same issue replicated.
    Error:
    ConnectionError: HTTPConnectionPool(host='X.Y.Z.A', port=8082): Max retries exceeded with url: /config (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f8537d00f50>: Failed to establish a new connection: [Errno 111] Connection refused'))
    l
    • 2
    • 1
  • r

    rough-eye-60206

    10/11/2021, 10:15 PM
    Hello team, can anyone guide me on how to ingest glossary terms, or provide sample code showing how to generate/ingest glossary terms into the UI? Please help me.
    g
    • 2
    • 2
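On the glossary question above: one route is the file-based business glossary source, which reads term definitions from a YAML file and ingests them like any other recipe. A sketch of both files, with the layout assumed from the datahub-business-glossary source of this era:

```yaml
# business_glossary.yml (term definitions; names here are illustrative)
version: 1
source: DataHub
owners:
  users:
    - datahub
nodes:
  - name: Classification
    description: Data classification terms
    terms:
      - name: Confidential
        description: Restricted-access data

# recipe.yml (ingests the file above)
source:
  type: datahub-business-glossary
  config:
    file: ./business_glossary.yml
sink:
  type: datahub-rest
  config:
    server: "http://localhost:8080"
```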
  • w

    witty-keyboard-20400

    10/12/2021, 5:33 AM
    When we ingest metadata from different source systems in an organization, it's possible that at the origin a field is called "score", while a downstream system names the same field "marks". How do we handle this in DataHub to convey to a user that both are the same field? Is there any way to capture semantics in DataHub?
    l
    b
    • 3
    • 4
  • n

    nice-planet-17111

    10/12/2021, 6:21 AM
    Hi team, is it possible to ingest metadata from
    CloudSQL
    (on GCP) ->
    datahub
    ? If it is, is there any quickstart guide or docs? Thanks in advance 🙂
    b
    s
    • 3
    • 4
  • b

    bumpy-activity-74405

    10/12/2021, 11:41 AM
    hey, is there a way to keep only
    n
    last versions of aspects? MySQL is growing rapidly in size and I don’t really have any use for old aspects
    b
    m
    • 3
    • 12
  • n

    nice-planet-17111

    10/12/2021, 12:35 PM
    Hi team 🙂 is there a way, or has anyone tried, to detect metadata changes in a data source in real time and ingest them automatically into DataHub (if possible, only the changed part)? I'm looking at the MCE builder and emitters, but I think I'm completely lost. 😢
    👀 1
    m
    b
    k
    • 4
    • 4
  • b

    boundless-room-44377

    10/12/2021, 6:16 PM
    hi, I have an ingestion troubleshooting question. I tried to execute code similar to this example: https://github.com/linkedin/datahub/blob/master/metadata-ingestion/examples/library/lineage_emitter_kafka.py and I don't get any errors, but I also don't see the lineage connection show up in the UI. If I run the same code but change the sink to the REST connector, it works. My question is how I might go about debugging this. I looked in the containers at the Kafka topics/messages and the GMS logs and didn't see anything that indicated an error or failure, but maybe I'm not looking in the right place. Note: I am running the "quickstart" DataHub on an EC2 instance, and I've been able to run ingestion recipes that use the Kafka sink to pull datasets in.
    l
    • 2
    • 8
  • a

    agreeable-hamburger-38305

    10/12/2021, 8:21 PM
    I’m running into this weird issue where, when I ingest table metadata from BigQuery with profiling enabled, only one or two of the numeric columns actually get
    Min
    ,
    Max
    ,
    Mean
    ,
    Median
    and
    Standard Deviation
    , while all the others just show “unknown”.
    Null
    and
    Distinct
    stats and
    sample values
    are all working fine. Anyone know what might be causing this? The columns with missing stats have a high % of null (99.9x%), but there are still some valid values in there.
    m
    l
    h
    • 4
    • 7
  • r

    rapid-piano-43271

    10/13/2021, 2:59 AM
    @here I have a question regarding pull-based metadata ingestion. Do we pull the entire schema of the data source? If so, how is it applied to the DataHub store: is the old schema completely replaced, or does it perform upsert (update or insert) operations?
    l
    m
    • 3
    • 9
  • c

    cuddly-family-62352

    10/13/2021, 8:12 AM
    I have a question: for MySQL, for example, are MCE events generated by monitoring logs or by polling on a schedule?
    b
    m
    • 3
    • 13
  • m

    melodic-helmet-78607

    10/13/2021, 8:55 AM
    Hi team, does the business glossary have versioning? As a user, I want to see previous definitions and what has been changed.
    m
    • 2
    • 1
  • m

    melodic-helmet-78607

    10/13/2021, 8:59 AM
    Also, any tips on differentiating entities/glossary terms that have been approved from those that haven't? My use case is that every ingested entity/glossary term can be approved/rejected by a DataHub user via the UI.
    b
    l
    • 3
    • 3
  • r

    rough-eye-60206

    10/13/2021, 4:48 PM
    Hello team, I want to know how to associate different glossary terms with different datasets. Currently I was able to achieve it in the UI by manually assigning them, but I would like to know whether there is an automated option (emitters or APIs) for achieving that. Can anyone help me or provide an example for reference? Thank you.
    m
    b
    • 3
    • 6
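On the automation question above: glossary terms can be attached to a dataset programmatically by emitting a GlossaryTerms aspect on the dataset's snapshot (in Python, via datahub's emitter and schema classes rather than raw dicts). The helper below only sketches the aspect's JSON shape; the function name is hypothetical and the fully qualified aspect key is an assumption:

```python
from typing import Dict, List

def glossary_terms_aspect(term_urns: List[str],
                          actor: str = "urn:li:corpuser:ingestion") -> Dict:
    """Illustrative shape of a GlossaryTerms aspect attaching terms to a dataset
    (hypothetical helper; the aspect key is an assumption)."""
    return {
        "com.linkedin.common.GlossaryTerms": {
            "terms": [{"urn": u} for u in term_urns],
            "auditStamp": {"time": 0, "actor": actor},
        }
    }
```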
  • w

    witty-keyboard-20400

    10/14/2021, 6:38 AM
    MongoDB ingestion failure:
    Copy code
    OperationFailure: not authorized on db_kg to execute command { aggregate: "system.views", pipeline: [ { $sample: { size: 100 } } ], allowDiskUse: true, cursor: {}, lsid: { id: UUID("305f9d4f-fd8b-4fbd-8cf6-9257c4399403") }, $clusterTime: { clusterTime: Timestamp(1634193294, 4), signature: { hash: BinData(0, 9B140107B447AC1BFBE704B411400CF7EEF4E04D), keyId: 7012684930027094017 } }, $db: "db_kg", $readPreference: { mode: "primaryPreferred" } }, full error: {'operationTime': Timestamp(1634193295, 1), 'ok': 0.0, 'errmsg': 'not authorized on db_kg to execute command { aggregate: "system.views", pipeline: [ { $sample: { size: 100 } } ], allowDiskUse: true, cursor: {}, lsid: { id: UUID("305f9d4f-fd8b-4fbd-8cf6-9257c4399403") }, $clusterTime: { clusterTime: Timestamp(1634193294, 4), signature: { hash: BinData(0, 9B140107B447AC1BFBE704B411400CF7EEF4E04D), keyId: 7012684930027094017 } }, $db: "db_kg", $readPreference: { mode: "primaryPreferred" } }', 'code': 13, 'codeName': 'Unauthorized', '$clusterTime': {'clusterTime': Timestamp(1634193295, 1), 'signature': {'hash': b'\x8f\x98\x0b\x97l\xbd\xab\x96\xcc\x91\x14QQ7\xc8)d\xd7W"', 'keyId': 7012684930027094017}}}
    sample size is just 100:
    schemaSamplingSize: 100
    Why does the ingestion's schema inference need to execute
    aggregate: "system.views"
    ?
    b
    • 2
    • 4
  • w

    witty-keyboard-20400

    10/14/2021, 11:50 AM
    For ingesting metadata from MongoDB, I'd like to see some examples of how
    collection_pattern.deny
    is specified in the config section, along with sample values.
    m
    • 2
    • 1
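A sketch answering the collection_pattern.deny question above (regex values are illustrative; patterns are matched against fully qualified db.collection names in the MongoDB source of this era, though exact matching semantics may vary by version):

```yaml
source:
  type: mongodb
  config:
    collection_pattern:
      deny:
        # skip system collections and anything in a staging database
        - ".*\\.system\\..*"
        - "staging_db\\..*"
```

A pattern like the first one would also sidestep the system.views authorization failure shown in the earlier message.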
  • c

    crooked-wolf-53758

    10/14/2021, 4:26 PM
    I have the JSON schema for a set of JSON documents. What would be the best way to ingest this metadata? @mammoth-bear-12532 I saw your reply to a similar question a few months ago, but the links no longer work.
    h
    m
    • 3
    • 5