# ingestion
  • w

    white-horse-97256

    02/13/2023, 7:14 PM
    Hi, I am getting the following error when trying to create a column-level lineage:
    Copy code
    The datum UpstreamLineageClass({'upstreams': [UpstreamClass({'auditStamp': AuditStampClass({'time': 0, 'actor': 'urn:li:corpuser:unknown', 'impersonator': None, 'message': None}), 'created': None, 'dataset': 'urn:li:dataset:(urn:li:dataPlatform:neo4j,labels.Asset,STG)', 'type': 'TRANSFORMED', 'properties': None})], 'fineGrainedLineages': [FineGrainedLineageClass({'upstreamType': 'FIELD_SET', 'upstreams': ['urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:neo4j,labels.Asset,STG),account_id)'], 'downstreamType': 'NONE', 'downstreams': [], 'transformOperation': None, 'confidenceScore': 1.0})]}) is not an example of the schema.
    ✅ 1
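    A minimal sketch, assuming a recent datahub Python SDK, of the documented fine-grained lineage pattern the message above is attempting; the downstream dataset, downstream field, and server URL are hypothetical placeholders, not a diagnosis of the error.

    import datahub.emitter.mce_builder as builder
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import (
        DatasetLineageTypeClass,
        FineGrainedLineageClass,
        FineGrainedLineageDownstreamTypeClass,
        FineGrainedLineageUpstreamTypeClass,
        UpstreamClass,
        UpstreamLineageClass,
    )

    upstream_ds = builder.make_dataset_urn("neo4j", "labels.Asset", "STG")
    downstream_ds = builder.make_dataset_urn("neo4j", "labels.Account", "STG")  # hypothetical downstream

    lineage = UpstreamLineageClass(
        upstreams=[UpstreamClass(dataset=upstream_ds, type=DatasetLineageTypeClass.TRANSFORMED)],
        fineGrainedLineages=[
            FineGrainedLineageClass(
                upstreamType=FineGrainedLineageUpstreamTypeClass.FIELD_SET,
                upstreams=[builder.make_schema_field_urn(upstream_ds, "account_id")],
                # Pair the upstream field with at least one downstream field URN.
                downstreamType=FineGrainedLineageDownstreamTypeClass.FIELD,
                downstreams=[builder.make_schema_field_urn(downstream_ds, "account_id")],
                confidenceScore=1.0,
            )
        ],
    )

    # Attach the lineage aspect to the downstream dataset (server URL is an assumption).
    DatahubRestEmitter("http://localhost:8080").emit_mcp(
        MetadataChangeProposalWrapper(entityUrn=downstream_ds, aspect=lineage)
    )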
  • p

    powerful-telephone-2424

    02/13/2023, 9:32 PM
    Hi folks, I’m trying to understand executors in DataHub Ingestion with the goal of writing my own executor. I couldn’t find any documentation on how to do something like this. Are there any pointers from the community on how I can get started?
  • c

    cold-airport-17919

    02/13/2023, 9:57 PM
    Hi, I have metadata details (field name, type, description) on a spreadsheet for a dataset. Can I read this data into DataHub? Is there an option? I was thinking of the csv-enricher module, but I believe it only enriches existing datasets. Thank you, Ballu
    ✅ 1
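    A minimal sketch, assuming a recent datahub Python SDK, of reading field name/type/description rows from a CSV export of such a spreadsheet and emitting them as a SchemaMetadata aspect; the dataset URN, file name, column headers, and server URL are hypothetical.

    import csv

    from datahub.emitter.mce_builder import make_data_platform_urn, make_dataset_urn
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import (
        OtherSchemaClass,
        SchemaFieldClass,
        SchemaFieldDataTypeClass,
        SchemaMetadataClass,
        StringTypeClass,
    )

    dataset_urn = make_dataset_urn("hive", "my_db.my_table", "PROD")  # hypothetical dataset

    fields = []
    with open("fields.csv") as f:  # columns assumed: name,type,description
        for row in csv.DictReader(f):
            fields.append(
                SchemaFieldClass(
                    fieldPath=row["name"],
                    nativeDataType=row["type"],
                    # Every native type mapped to StringType here just to keep the sketch short.
                    type=SchemaFieldDataTypeClass(type=StringTypeClass()),
                    description=row.get("description"),
                )
            )

    schema = SchemaMetadataClass(
        schemaName="my_table",
        platform=make_data_platform_urn("hive"),
        version=0,
        hash="",
        platformSchema=OtherSchemaClass(rawSchema=""),
        fields=fields,
    )

    DatahubRestEmitter("http://localhost:8080").emit_mcp(
        MetadataChangeProposalWrapper(entityUrn=dataset_urn, aspect=schema)
    )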
  • b

    bland-lighter-26751

    02/14/2023, 12:03 AM
    Hey everyone, I updated to v0.10.0, did a reingest, and it looks like lineage between Metabase and BigQuery doesn't work at all now? All the mappings are gone. Is anyone else who uses the two seeing this?
  • a

    ambitious-notebook-45027

    02/14/2023, 2:11 AM
    Hello, I want to ingest a Hive DB and get an error like this:
    Copy code
    FAILED: SemanticException [Error 10056]:
        Queries against partitioned tables without a partition filter are disabled for safety reasons.
        If you know what you are doing, please set hive.strict.checks.no.partition.filter to false and make sure that hive.mapred.mode is not set to 'strict' to proceed.
        Note that you may get errors or incorrect results if you make a mistake while using some of the unsafe features.
        No partition predicate for Alias "lubian" Table "lubian"
    How can I fix this? @Mayuri N
    ✅ 1
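    A minimal sketch, assuming the hive source forwards SQLAlchemy `options` through to PyHive (whose connection accepts a `configuration` dict of Hive session settings via connect_args); host, database, and server URL are hypothetical and this is not a verified recipe.

    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(
        {
            "source": {
                "type": "hive",
                "config": {
                    "host_port": "hive-server:10000",  # hypothetical
                    "database": "my_db",               # hypothetical
                    "options": {
                        "connect_args": {
                            # Relax the strict partition checks for the session (assumption:
                            # these keys are passed straight to the Hive connection).
                            "configuration": {
                                "hive.strict.checks.no.partition.filter": "false",
                                "hive.mapred.mode": "nonstrict",
                            }
                        }
                    },
                },
            },
            "sink": {"type": "datahub-rest", "config": {"server": "http://localhost:8080"}},
        }
    )
    pipeline.run()
    pipeline.raise_from_status()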
  • h

    hallowed-shampoo-52722

    02/14/2023, 5:10 AM
    https://datahubspace.slack.com/archives/C033H1QJ28Y/p1676323639995389
  • p

    plain-cricket-83456

    02/14/2023, 7:26 AM
    @hundreds-photographer-13496 Hello, Keycloak is used to implement single sign-on (SSO). If an application performs a Keycloak logout (this application shares the Keycloak service with DataHub), how does DataHub implement the corresponding linked (single) logout?
  • s

    shy-hairdresser-85182

    02/14/2023, 9:18 AM
    Hi guys, I have made code changes to the mssql plugin to read the platform from the recipe file and set the platform to synapse to ingest Synapse data. It is able to ingest as expected in the UI, but it is not reflecting the default browse path derived from the dataset URN. How can I enable that default behaviour without using a transformer at the recipe level?!
  • c

    colossal-smartphone-90274

    02/14/2023, 12:24 PM
    Hi all, one of the data sources I am using is powerbi-report-server; however, due to a recent upgrade of our on-premise Power BI system, the ingest appears to no longer work. The issue is with this function in the report_server.py file (https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/source/powerbi_report_server/report_server.py). I am using this version -> Version 1.15.8377.1837 (September 2022)
    Copy code
    def get_all_reports(self) -> List[Any]:
            """
            Fetch all Reports from PowerBI Report Server
            """
            report_types_mapping: Dict[str, Any] = {
                Constant.REPORTS: Report,
                Constant.MOBILE_REPORTS: MobileReport,
                Constant.LINKED_REPORTS: LinkedReport,
                Constant.POWERBI_REPORTS: PowerBiReport,
            }
    On the PoC version of DataHub, I removed the MOBILE_REPORTS line of the code snippet and the ingest worked again; however, I will need a different strategy for my OpenShift deployment. Has anyone else had this issue with the ingest?
  • r

    rich-pager-68736

    02/14/2023, 2:43 PM
    Hi guys, during ingestion, we also extracted usage stats including the top users for our assets. However, due to some internal regulations we have to remove those. I already changed the recipe to not ingest that information anymore, but how can I delete those already ingested top users? Rolling back everything seems a bit crude... I have not found any way to do this - any advice?
    ✅ 1
  • d

    dazzling-microphone-98929

    02/14/2023, 2:47 PM
    Hi everyone, I have a question about dataset type mapping. My Power BI data source is Redshift; can I ingest the data?
  • l

    lemon-scooter-69730

    02/14/2023, 5:22 PM
    Suddenly the BigQuery ingest is failing with this error:
    Copy code
    ('Failed to load service account credentials from /tmp/tmpuvp2cqms', ValueError('Could not deserialize key data. The data may be in an incorrect format, it may be encrypted with an unsupported algorithm, or it may be an unsupported key type (e.g. EC curves with explicit parameters).', [_OpenSSLErrorWithText(code=503841036, lib=60, reason=524556, reason_text=b'error:1E08010C:DECODER routines::unsupported')]))
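    The ValueError above comes from the key material itself failing to parse, so a quick local check of the same service-account JSON with google-auth can confirm whether the key the recipe writes to the temp file is intact (a mangled private_key, e.g. with escaped newlines, is a common culprit). The file path below is a placeholder.

    from google.oauth2 import service_account

    # Will raise the same kind of deserialization error if the private_key block is corrupted.
    creds = service_account.Credentials.from_service_account_file(
        "/path/to/service-account.json",  # the same key material used in the recipe/secret
        scopes=["https://www.googleapis.com/auth/cloud-platform"],
    )
    print(creds.service_account_email)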
  • t

    tall-caravan-42586

    02/14/2023, 5:53 PM
    Hi Team
  • t

    tall-caravan-42586

    02/14/2023, 5:54 PM
    The metadata-ingestion build is failing with this error, please help me.
  • f

    fancy-crayon-39356

    02/14/2023, 6:57 PM
    Hello team! Here at my company we recently rolled out DataHub to production and we are becoming heavy users of it ❤️ However, I've recently noticed a problem when ingesting both dbt and Snowflake sources. The bundled resource (dbt+snowflake) has duplicated columns. This is due to a lowercase URN being ingested alongside an uppercase URN. I'm running the datahub cli on version v0.10.0. Digging into this problem I've found this PR: https://github.com/datahub-project/datahub/pull/7063/files that changed the DBTColumn name from catalog_column["name"].lower() to catalog_column["name"], essentially making the column URN the same as in the catalog (which comes from Snowflake and, in that case, is uppercase). The problem is that in the Snowflake recipe we are lowercasing URNs by default (convert_urns_to_lowercase=True), causing the mismatch. What is the standard going forward here? Are we sticking to lowercase URNs to ensure cross-platform compatibility, or will dbt use whatever is defined in the catalog? I'm happy to submit a PR to maybe introduce a convert_urns_to_lowercase flag to the dbt recipe as well, if that's the standard going forward.
  • b

    bland-barista-59197

    02/14/2023, 7:00 PM
    Hi Team, I have the following questions: 1. Can DataHub integrate with a third-party secret manager? 2. Is there any way to trigger a third-party web API / process after ingestion, like Google DLP or Microsoft Purview?
  • e

    enough-lamp-79907

    02/14/2023, 7:21 PM
    Hello Team, I have multiple files in an S3 bucket which contain the same kind of data but are partitioned by date folder. I created an s3 ingestion and it creates a dataset for each parquet file (like 1000s of them). Is it possible to have a single metadata entry for all those files, since the metadata would be the same, while also keeping the count and other info about the files?
    source:
      type: s3
      config:
        path_specs:
          - include: "s3://test/dumps/kafka/order/daily/{partition_key[0]}={partition[0]}/*.parquet"
        aws_config:
          aws_profile: dev
          aws_region: eu-central-1
        env: "dev"
        profiling:
          enabled: false
    ✅ 1
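    One approach, assuming the s3 source's {table} path token (which groups everything below that folder level into a single dataset), is a path_spec along these lines; the bucket layout, env, and server URL are placeholders based on the recipe above. A programmatic sketch:

    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(
        {
            "source": {
                "type": "s3",
                "config": {
                    "path_specs": [
                        {
                            # {table} marks the folder that becomes the dataset; the daily
                            # partition files underneath are grouped into that one dataset.
                            "include": "s3://test/dumps/kafka/{table}/daily/{partition_key[0]}={partition[0]}/*.parquet"
                        }
                    ],
                    "aws_config": {"aws_profile": "dev", "aws_region": "eu-central-1"},
                    "env": "DEV",
                    "profiling": {"enabled": False},
                },
            },
            "sink": {"type": "datahub-rest", "config": {"server": "http://localhost:8080"}},
        }
    )
    pipeline.run()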
  • w

    white-horse-97256

    02/14/2023, 9:39 PM
    Hi again, I have created a mysql recipe file and executed it through the datahub cli. I later created and ran a new recipe file with another host name and credentials... I see that it replaced my first file's config in the UI. The question is: is there a way to not override existing recipe files and instead create a new entry/job for the new recipe file in the UI?
  • p

    polite-actor-701

    02/15/2023, 1:55 AM
    Hi all. I have a question about ingestion. When I ingest metadata from Tableau, some entities (Workbooks/Dashboards/Charts) are missing. If I ingest the same Project again, some of the missing entities are ingested, and others are still missing. But there is no error in the gms or ingest logs. What's the problem? Is this a bug?
  • c

    calm-jewelry-98911

    02/15/2023, 3:21 AM
    Hey guys, I seem to have (soft) broken my gms service locally. Essentially I am trying to mark a user as inactive/suspended. I set the corpUser's CorpUserStatus to SUSPENDED using the MCP below (and subsequently emitted it) ->
    Copy code
    import time

    from datahub.emitter.mce_builder import make_user_urn
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import (
        AuditStampClass,
        ChangeTypeClass,
        CorpUserStatusClass,
    )

    mcp2 = MetadataChangeProposalWrapper(
        entityType="corpuser",
        changeType=ChangeTypeClass.UPSERT,
        entityUrn=make_user_urn('apanwar'),
        aspectName=CorpUserStatusClass.get_aspect_name(),
        aspect=CorpUserStatusClass(
            status='SUSPENDED',
            lastModified=AuditStampClass(
                time=int(time.time() * 1000),
                actor='urn:li:corpuser:datahub',
            ),
        ),
    )
    # Subsequently emitted (server URL is an assumption):
    DatahubRestEmitter("http://localhost:8080").emit_mcp(mcp2)
    The GMS logs have captured an error related to this -
    Copy code
    Caused by: java.lang.IllegalArgumentException: No enum constant com.linkedin.datahub.graphql.generated.CorpUserStatus.SUSPENDED
            at java.base/java.lang.Enum.valueOf(Enum.java:240)
            at com.linkedin.datahub.graphql.generated.CorpUserStatus.valueOf(CorpUserStatus.java:6)
            at com.linkedin.datahub.graphql.types.corpuser.mappers.CorpUserStatusMapper.apply(CorpUserStatusMapper.java:19)
            at com.linkedin.datahub.graphql.types.corpuser.mappers.CorpUserStatusMapper.map(CorpUserStatusMapper.java:13)
            at com.linkedin.datahub.graphql.types.corpuser.mappers.CorpUserMapper.lambda$apply$3(CorpUserMapper.java:64)
            at com.linkedin.datahub.graphql.types.common.mappers.util.MappingHelper.mapToResult(MappingHelper.java:22)
            at com.linkedin.datahub.graphql.types.corpuser.mappers.CorpUserMapper.apply(CorpUserMapper.java:63)
            at com.linkedin.datahub.graphql.types.corpuser.mappers.CorpUserMapper.map(CorpUserMapper.java:46)
            at com.linkedin.datahub.graphql.types.corpuser.CorpUserType.lambda$batchLoad$0(CorpUserType.java:95)
            at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
            at java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1655)
            at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)
            at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474)
            at java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:913)
            at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
            at java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:578)
            at com.linkedin.datahub.graphql.types.corpuser.CorpUserType.batchLoad(CorpUserType.java:96)
            ... 18 common frames omitted
    I was wondering if I missed something here? I thought SUSPENDED would be a valid value for CorpUserStatus.status, as mentioned in the schema class's getter and setter ->
  • p

    plain-nest-12882

    02/15/2023, 5:29 AM
    Howdy, does DataHub support a custom action defined by Great Expectations (GX)? If so, can I have the DataHub POST API that is used to push the results to DataHub when using non-SQLAlchemy engines?
    ✅ 1
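    For reference, DataHub ships a Great Expectations checkpoint action (DataHubValidationAction); a minimal sketch of the action_list entry in Python dict form, assuming the documented module/class names and a placeholder server_url. Note the documented integration targets SQLAlchemy-backed datasources, which is relevant to the non-SQLAlchemy part of the question.

    # Hypothetical checkpoint action_list entry pointing GE validation results at DataHub.
    action_list = [
        {
            "name": "datahub_action",
            "action": {
                "module_name": "datahub.integrations.great_expectations.action",
                "class_name": "DataHubValidationAction",
                "server_url": "http://localhost:8080",  # placeholder GMS URL
            },
        }
    ]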
  • n

    numerous-account-62719

    02/15/2023, 8:46 AM
    Hi Team, is InfluxDB supported in DataHub? If yes, then how do I ingest the data from InfluxDB?
  • r

    rich-policeman-92383

    02/15/2023, 10:05 AM
    Hello, the LDAP source creates users based on the sAMAccountName AD attribute https://github.com/datahub-project/datahub/blob/v0.9.5/metadata-ingestion/src/datahub/ingestion/source/ldap.py#L47 Is there a way to use a filter like the one we use in datahub-frontend: "AUTH_OIDC_USER_NAME_CLAIM=email; AUTH_OIDC_USER_NAME_CLAIM_REGEX=([^@]+)"? The problem is that the user created by the LDAP source is different from the one created by the frontend. datahub version: v0.9.5
  • b

    broad-wire-76841

    02/15/2023, 10:49 AM
    Hello team, I am creating a pipeline (emitter) which tags an owner to an entity. Now I wanted to know what to do in scenarios where the said user does not exist in DataHub yet. Is there a way to identify if a user exists and, if not, create it first programmatically?
    ✅ 1
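    A minimal sketch, assuming a reasonably recent datahub SDK, of checking for an existing corpuser over the REST API and creating a bare-bones one if the lookup comes back empty; the server URL, username, and profile fields are hypothetical.

    from datahub.emitter.mce_builder import make_user_urn
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.ingestion.graph.client import DataHubGraph, DatahubClientConfig
    from datahub.metadata.schema_classes import CorpUserInfoClass

    graph = DataHubGraph(DatahubClientConfig(server="http://localhost:8080"))
    user_urn = make_user_urn("jdoe")  # hypothetical username

    # Returns None when the aspect (and effectively the user) is not present.
    existing = graph.get_aspect(entity_urn=user_urn, aspect_type=CorpUserInfoClass)

    if existing is None:
        # Create a minimal corpuser record before assigning ownership to it.
        graph.emit_mcp(
            MetadataChangeProposalWrapper(
                entityUrn=user_urn,
                aspect=CorpUserInfoClass(
                    active=True,
                    displayName="J. Doe",
                    email="jdoe@example.com",
                ),
            )
        )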
  • r

    ripe-eye-60209

    02/15/2023, 6:31 PM
    Hello Team, regarding the powerbi ingestion: the source is connected successfully but no events are generated. What could be the issue here? Any idea?
  • l

    limited-forest-73733

    02/15/2023, 6:50 PM
    Hey team, I ingested my Snowflake tables and enabled profiling, but it's showing unknown in each column. I am attaching my recipe and UI table stats. Can anyone please help me out?
    ✅ 1
  • w

    white-horse-97256

    02/15/2023, 7:34 PM
    Hi Team, a question regarding ingestion: is there any difference between scheduling a Python function to run the mysql ingestion via the Python emitter vs scheduling a recipe YAML script in the DataHub tool for ingestion?
  • c

    calm-jewelry-98911

    02/15/2023, 7:55 PM
    Hey team, still looking for answers to my question. Basically, what is the best way to mark a CorpUser as SUSPENDED/inactive? (I used CorpUserStatus but faced issues as described in the original question) - https://datahubspace.slack.com/archives/CUMUWQU66/p1676431303025729
  • s

    silly-dog-87292

    02/15/2023, 7:57 PM
    Hello Team, I am trying to ingest my Spark application lineage into DataHub (running on a separate EC2 machine with the docker quickstart). I followed the steps in the documentation; however, on running the Spark application, I still don't see any metadata/lineage getting captured in my DataHub system (EC2). Any thoughts on what could be the problem?
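    A minimal PySpark sketch of the documented spark-lineage listener settings, pointing spark.datahub.rest.server at the quickstart GMS on the EC2 host; the version placeholder and host are assumptions to fill in (the same settings can be passed as --conf flags to spark-submit).

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("my-job")
        # Pull in the DataHub Spark lineage agent (replace <version> with the release you use).
        .config("spark.jars.packages", "io.acryl:datahub-spark-lineage:<version>")
        .config("spark.extraListeners", "datahub.spark.DatahubSparkListener")
        # GMS endpoint on the EC2 quickstart host (placeholder).
        .config("spark.datahub.rest.server", "http://<ec2-host>:8080")
        .getOrCreate()
    )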
  • w

    white-horse-97256

    02/15/2023, 10:28 PM
    Hi Team, is there a Java package to create a pipeline to ingest from a mysql source, similar to this Python script: https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/examples/library/programatic_pipeline.py
    ✅ 1