# ingestion
  • s

    salmon-cricket-21860

    01/20/2022, 5:22 AM
    Hi, I am ingesting Hive table lists and columns from the Hive Metastore RDS (MySQL). However, I would also like to set owners and tags for the ingested Hive tables. Can I set owners and tags for dataset entities after ingestion?
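    (Editor's note: a minimal recipe sketch showing how owners and tags can be attached at ingestion time via transformers, assuming the simple_add_dataset_ownership and simple_add_dataset_tags transformers of metadata-ingestion; the URNs below are placeholders. Owners and tags added this way are re-applied on every run, and they can also be edited directly in the UI after ingestion.)
    Copy code
    source:
      type: hive
      config:
        # ... Hive Metastore connection details ...
    transformers:
      - type: "simple_add_dataset_ownership"
        config:
          owner_urns:
            - "urn:li:corpuser:data_platform_team"  # placeholder owner
      - type: "simple_add_dataset_tags"
        config:
          tag_urns:
            - "urn:li:tag:hive"  # placeholder tag
    sink:
      type: "datahub-rest"
      config:
        server: "<http://localhost:8080>"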
  • c

    careful-engine-38533

    01/20/2022, 7:41 AM
    I tried to delete the dataset using the following command `datahub delete --env PROD --entity_type dataset --platform mongodb` but it fails with this error:
    java.lang.NullPointerException
        at com.linkedin.metadata.entity.ebean.EbeanEntityService.deleteUrn(EbeanEntityService.java:577)
        at com.linkedin.metadata.resources.entity.EntityResource.lambda$deleteEntity$13(EntityResource.java:313)
        at com.linkedin.metadata.restli.RestliUtil.toTask(RestliUtil.java:30)
        ... 81 more
    'message': 'java.lang.NullPointerException', 'status': 500}
    Any help?
  • g

    gentle-country-27659

    01/20/2022, 7:42 AM
    Hello, I'm using the Java emitter, and I got an error when I put Chinese text in the aspect data: [HTTP Status 400] Parameters of method 'ingestProposal' failed validation with error 'ERROR :: /proposal/aspect/value :: "{"description":"你好19960916yshantest男"}" is not a valid string representation of bytes' at com.linkedin.restli.server.RestLiServiceException. When I used the Python emitter, everything worked. With `datahub ingest -c` I can convert the JSON from Chinese characters to Unicode escapes, but the Java emitter doesn't do that.
  • h

    handsome-belgium-11927

    01/20/2022, 9:59 AM
    Hi, team! What are the best practices for deleting dataset-to-dataset lineage? Also, is there any way to delete datasets other than the DataHub CLI?
  • f

    few-air-56117

    01/20/2022, 11:20 AM
    Hi guys, I tried to ingest some data from BigQuery and I got this:
    Copy code
    /usr/local/lib/python3.9/site-packages/google/cloud/bigquery/client.py:535: UserWarning: Cannot create BigQuery Storage client, the dependency google-cloud-bigquery-storage is not installed.
    Any idea? 😄
  • c

    careful-engine-38533

    01/20/2022, 1:31 PM
    Can I have a single recipe file with multiple source configs, like 'kafka' and 'mongo'? I have attached a screenshot of a sample. Any help?
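    (Editor's note: a recipe takes a single source and a single sink, so 'kafka' and 'mongodb' would normally live in two separate recipe files run as two ingestion jobs; a sketch with placeholder connection details.)
    Copy code
    # kafka_recipe.yml -- run with: datahub ingest -c kafka_recipe.yml
    source:
      type: kafka
      config:
        connection:
          bootstrap: "<broker:9092>"
    sink:
      type: "datahub-rest"
      config:
        server: "<http://localhost:8080>"

    # mongodb_recipe.yml -- a separate file with its own `datahub ingest -c` run
    # source:
    #   type: mongodb
    #   config:
    #     connect_uri: "<mongodb://localhost:27017>"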
  • b

    broad-battery-31188

    01/20/2022, 4:09 PM
    Hello, I was wondering if Snowflake ingestion can get away with the REFERENCES privilege instead of SELECT, since it only requires access to metadata. https://docs.snowflake.com/en/user-guide/security-access-control-privileges.html#:~:text=enables%20viewing%20the%20structure%20of%20a%20table%20(but%20not%20the%20data)%20via%20the%20DESCRIBE%20or%20SHOW%20command%20or%20by%20querying%20the%20Information%20Schema.
  • m

    modern-monitor-81461

    01/21/2022, 1:44 AM
    On-demand AzureAD users ingestion. Sorry if this was already answered, but I couldn't find it. Here it goes: I have DataHub deployed in Azure using OIDC. When a user logs in, their profile is read from the ID token and a CorpUser is created inside DataHub. That's all good. But then I have users who have never logged in to DataHub and are running pipelines that create tables, and their usernames are tagged as owners of those tables. When I ingest such a table, an ownership aspect is created and it tries to link to a CorpUser. But since the user has never logged in, that link points to nothing... I understand that's why there is an AzureAD source to ingest all users (or members of a group) proactively, to avoid the situation I just described.
    For security reasons I won't go into, my organization is reluctant to have a copy (or a partial copy) of the AD users stored in an external database. We understand that when you use a tool, your identity will be exposed, and we are OK with this. What we are not fans of is storing identities of users that:
    1. Never logged in to DataHub
    2. Never created a pipeline/dataset/dashboard/etc.
    What we would like is to ingest only the users that are relevant to DataHub and the things it keeps track of, on demand. Has this been done already? One way I thought this could be solved is by creating a transformer that looks for user identities and fetches user info from AD when the CorpUser is missing from DataHub. Obviously this transformer would have to appear in pretty much every recipe we run, but we are OK with this.
    • Is there a better way to solve this?
    • Would there be bad side-effects of operating like this?
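    (Editor's note: one way to wire the idea above into recipes is a custom transformer; the module path and config keys below are hypothetical, not a built-in DataHub transformer, and are shown only to illustrate how such a transformer would be referenced.)
    Copy code
    transformers:
      - type: "my_company.transformers.ResolveMissingCorpUsers"  # hypothetical custom transformer class
        config:
          # hypothetical settings: look up an owner in Azure AD only when the
          # referenced CorpUser does not already exist in DataHub
          azure_tenant_id: "<tenant-id>"
          create_missing_users: true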
  • m

    miniature-television-17996

    01/21/2022, 8:42 AM
    Hello! Does anybody know why, after ingesting from BigQuery, I only have datasets without lineage (DatasetUpstreamLineage or UpstreamLineage)? Thank you for your help!
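    (Editor's note: BigQuery lineage is derived from GCP audit logs by the bigquery source rather than emitted automatically; a sketch, assuming the connector version in use supports the include_table_lineage option and the service account can read the audit logs.)
    Copy code
    source:
      type: bigquery
      config:
        project_id: "<my-project>"
        include_table_lineage: true  # build UpstreamLineage from BigQuery audit logs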
  • b

    better-orange-49102

    01/21/2022, 10:23 AM
    Is there another way to delete specific dataset profiles other than going to ES and deleting the document? I'm thinking of doing profiling on datasets but on specific day partitions (so that each profile run covers a day's worth of data), so I may need to delete a specific profile when things go wrong.
  • h

    hallowed-apple-72040

    01/21/2022, 12:14 PM
    Hello, I'm using the docker-compose.quickstart.yml file from the GitHub repo. I am able to bring up all the components on an AWS EC2 instance. Later I ingested the Athena tables. Can someone please tell me where the data is stored on the EC2 instance? I would like to take a backup of the data and push it to S3.
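    (Editor's note: in the quickstart compose setup the authoritative metadata store is the MySQL container, so backing up its data directory or taking a mysqldump is the usual approach; a sketch assuming a mysql service roughly like the quickstart's — the exact service and volume names in your compose file may differ.)
    Copy code
    services:
      mysql:
        image: mysql:5.7
        volumes:
          # mount the data directory on the host so it can be snapshotted
          # and pushed to S3 (e.g. with `aws s3 sync`)
          - /data/datahub/mysql:/var/lib/mysql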
  • f

    few-air-56117

    01/21/2022, 12:57 PM
    Hi guys, I tried to ingest some metadata from BigQuery and I got this error:
    Copy code
    UserWarning: Cannot create BigQuery Storage client, the dependency google-cloud-bigquery-storage is not installed
    A colleague tried it as well and he doesn't get this error. My setup:
    Copy code
    DataHub CLI version: 0.8.19.1
    Python version: 3.8.2 (default, Nov  4 2020, 21:23:28) 
    [Clang 12.0.0 (clang-1200.0.32.28)]
    google-api-core==2.3.2
    google-api-python-client==2.33.0
    google-auth==2.3.3
    google-auth-httplib2==0.1.0
    google-cloud==0.34.0
    google-cloud-appengine-logging==1.1.0
    google-cloud-audit-log==0.2.0
    google-cloud-bigquery==2.31.0
    google-cloud-bigquery-storage==2.10.1
    google-cloud-core==2.2.1
    google-cloud-logging==2.7.0
    google-cloud-pubsub==2.9.0
    google-cloud-secret-manager==2.8.0
    google-cloud-storage==1.27.0
    google-crc32c==1.3.0
    google-resumable-media==2.1.0
    googleapis-common-protos==1.54.0
    grpc-google-iam-v1==0.12.3
    His setup
    Copy code
    DataHub CLI version: 0.8.19.1
    Python version: 3.9.9 (main, Nov 21 2021, 03:16:13) 
    [Clang 13.0.0 (clang-1300.0.29.3)]
    google-api-core==2.3.2
    google-api-python-client==2.33.0
    google-auth==2.3.3
    google-auth-httplib2==0.1.0
    google-cloud==0.34.0
    google-cloud-appengine-logging==1.1.0
    google-cloud-audit-log==0.2.0
    google-cloud-bigquery==2.31.0
    google-cloud-bigquery-storage==2.10.1
    google-cloud-core==2.2.1
    google-cloud-logging==2.7.0
    google-crc32c==1.3.0
    google-resumable-media==2.1.0
    googleapis-common-protos==1.54.0
    grpc-google-iam-v1==0.12.3
    It looks similar; I don't know why this is happening. Thanks a lot 🙏
  • b

    broad-tomato-45373

    01/21/2022, 1:52 PM
    Hi, I am trying to ingest data from Superset into DataHub. Charts, dashboards, and their internal lineage were loaded successfully, but I am not able to find any inputs or source datasets to get the end-to-end lineage. Any help would be much appreciated. For reference, my config file:
    Copy code
    source:
      type: superset
      config:
        # Coordinates
        connect_uri: <>
    
        # Credentials
        username: admin
        password: admin
    
    sink:
      type: "datahub-rest"
      config:
        # TODO : Parameterizing the sink url from single config
        server: <>
    The underlying source for the Superset datasets is Redshift.
  • f

    few-air-56117

    01/21/2022, 2:21 PM
    How can I hard delete a dataset?
  • s

    salmon-cricket-21860

    01/21/2022, 3:22 PM
    Hi, how can I delete this invalid dataJob URN? I tried deleting it with the command below, but the delete failed. (attached a screenshot)
    Copy code
    datahub delete --urn "urn:li:dataJob:(urn:li:dataFlow:(airflow,commercial_commercial_type_r0,PROD),notebook_affiliate)" --hard
    This is the invalid dataJob URN:
    Copy code
    https://.../tasks/urn:li:dataJob:(urn:li:dataFlow:(airflow,commercial_commercial_type_r0,PROD),notebook_affiliate)/Documentation?is_lineage_mode=false
  • b

    better-orange-49102

    01/22/2022, 1:47 PM
    Is there a way to delete a specific dataset profile? Say I manually create a dataset MCE and some dataset profiles and ingest them via REST. While I can roll back the dataset, the profiles cannot be deleted.
  • s

    salmon-rose-54694

    01/24/2022, 12:35 PM
    I built DataHub from source code and deployed it on Kubernetes, but the mae-consumer component has the errors below. Is there any config I missed? Thanks in advance.
  • b

    better-orange-49102

    01/25/2022, 2:01 AM
    Could anyone confirm that we can point the ingestion recipe at the frontend instead of GMS? i.e.
    Copy code
    sink:
      type: "datahub-rest"
      config: 
        server: "<http://localhost:9002/api/gms>"
    I thought I read somewhere that we can proxy to GMS via the frontend, but I can't seem to find it anymore. I was trying to remove the need for a GMS ingress. However, a curl to http://localhost:9002/api/gms/config doesn't work.
  • m

    mysterious-nail-70388

    01/25/2022, 6:49 AM
    Hi, if I want to manually add lineage between two tables from the MySQL data source, what should I send, and to which aspect of which DataHub interface?
  • e

    eager-oxygen-76249

    01/25/2022, 6:50 AM
    Hi team, I would like to know if it's possible to create custom pipelines in DataHub and ingest data against them. We have a lot of schedulers and systems that execute applications/queries that modify tables. We would like to capture all of those jobs in DataHub lineage.
  • m

    mysterious-nail-70388

    01/25/2022, 8:07 AM
    Hi, how were these Hive data source queries added here? I found them on the demo.
  • b

    bland-orange-95847

    01/25/2022, 9:04 AM
    Hi, I am using the “newly” introduced way to ingest lineage data via REST with the MetadataChangeProposalWrapper. I want to track the ingestion of every lineage change via custom runIds, but setting the systemMetadata does not seem to have any impact on this, and I cannot track the changes (or revert them in case of failure). Is this a known issue or by design? Is anybody else facing this or have an idea? Thanks 🙂
  • b

    breezy-guitar-97226

    01/25/2022, 9:20 AM
    Hi channel, I have a couple of questions about ingestion, and more specifically the Kafka connector. The first one is about any plans to implement statefulness in the short/medium term for this connector, in a similar way to what many of the sql sources already do. This would be of great help for us, allowing us to remove the custom code that is currently in charge of cleaning up stale topic metadata. The second question is actually a feature request; however, I wanted to gauge how useful such a feature could be for the DataHub community. This is our use case in a few words: in our Kafka deployments we model application access via internal users; these users only exist in Kafka, and it would be very useful for us if such information (internal user entities) could also be ingested (maybe optionally) by the Kafka connector. This would simplify governance and give us a more complete view of the state of the deployments. Please share your thoughts! Thanks!
  • f

    flaky-airplane-82352

    01/25/2022, 1:13 PM
    Hi, is there a way to automatically ingest table and column descriptions from a SQL Server database into DataHub? I'm trying to do so, but the description field always comes back null...
  • b

    brave-forest-5974

    01/25/2022, 1:41 PM
    ❓ I'm looking at the transformers and documentation in metadata-ingestion. One use case I have for dbt ingestion is that I want to completely skip tests and seed data, until we can ensure they're not just overwhelming noise in the UI. Is there a transformer (or can a custom one be written) to prevent those MCEs from continuing on to ingestion?
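    (Editor's note: rather than a transformer, the dbt source itself can usually filter node types; a sketch, assuming the version in use supports node_type_pattern — paths are placeholders.)
    Copy code
    source:
      type: dbt
      config:
        manifest_path: "<path/to/manifest.json>"
        catalog_path: "<path/to/catalog.json>"
        node_type_pattern:
          deny:
            - "test"  # skip dbt tests
            - "seed"  # skip seed data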
  • g

    gentle-florist-49869

    01/25/2022, 3:23 PM
    Hello guys, could you help me with a little question about profiling, please? I'm running the demo YAML against my local SQL DB and the column stats (min, max, mean, median) didn't work. Do you know what I'm doing wrong, please? Thank you.
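    (Editor's note: for reference, this is roughly what the profiling block of a SQL-source recipe looks like when column-level stats are wanted; the field-level flags are spelled out for clarity and assume a connector version that supports them.)
    Copy code
    source:
      type: mssql  # or whichever SQL source the demo recipe uses
      config:
        # ... connection details ...
        profiling:
          enabled: true
          profile_table_level_only: false  # keep false so column stats are computed
          include_field_min_value: true
          include_field_max_value: true
          include_field_mean_value: true
          include_field_median_value: true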
  • m

    mysterious-lamp-91034

    01/25/2022, 10:31 PM
    Do we have a REST API to ingest table properties? Something like https://datahubproject.io/docs/how/add-user-data/#using-restli-api Thanks!
  • b

    breezy-controller-54597

    01/26/2022, 5:22 AM
    I created a chart based on Hive data (default.table) in Superset, and when I ingested from Hive and Superset into DataHub, lineage was generated. (Great! 😆) However, since the URN of the original dataset is different between the dataset ingested directly from Hive and the one referenced by the chart ingested from Superset, the dataset was registered as a duplicate. 🤔
    • dataset generated by Hive ingestion: urn:li:dataset:(urn:li:dataPlatform:hive,default.table,PROD)
    • dataset generated by Superset ingestion: urn:li:dataset:(urn:li:dataPlatform:hive,Apache Hive.default.table,PROD)
    The URN of the dataset ingested from Superset does not match because the database name configured in Superset is added to the URN. Is it possible to make the URNs match, for example by adding "Apache Hive" to the URN ingested from Hive?
  • a

    adorable-flower-19656

    01/26/2022, 7:15 AM
    Hi guys, I'm using DataHub as the lineage backend of GCP Composer (Airflow). I found something strange:
    1. Run a DAG which has 3 tasks, so DataHub shows 3 tasks for the DAG (Pipeline) and its lineage.
    2. Remove 2 tasks from the DAG and re-run.
    3. Expected: DataHub shows 1 task for the DAG and updated lineage.
    4. Actual: DataHub still shows 3 tasks and their lineage.
    Which component or configuration should I check? Is this normal?
  • w

    witty-butcher-82399

    01/26/2022, 11:27 AM
    Hi! How can I enable the `statefulIngestionCapable` mentioned here? I only found this reference in a Java class here. Do I need to rebuild the project just to enable it?
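    (Editor's note: if this refers to the stateful ingestion feature of the Python connectors, it is enabled per recipe rather than by rebuilding anything; a sketch, assuming a source that supports stateful ingestion and a datahub-rest sink to hold the state.)
    Copy code
    pipeline_name: "my_stateful_pipeline"  # required so the state is keyed per pipeline
    source:
      type: snowflake  # must be a connector with stateful ingestion support
      config:
        # ... connection details ...
        stateful_ingestion:
          enabled: true
          remove_stale_metadata: true  # soft-delete entities missing from the latest run
    sink:
      type: "datahub-rest"
      config:
        server: "<http://localhost:8080>"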