Greetings Team: We have been testing a bit Glossary Terms ingestion... | DataHub #troubleshoot

hallowed-dog-79615

07/20/2022, 8:00 AM

Greetings Team: We have been testing Glossary Terms ingestion a bit and have found some unexpected behavior. Let's go through the steps:
1.- It does not matter whether we create a Glossary Term in the UI before adding it in bulk through a CSV ingestion, but let's say we do. We create a term called "Active_users".
2.- We add some documentation to the newly created term. Again, this is not mandatory, but it helps identify the issue later.
3.- We add the term to several dataset objects using the CSV ingestion feature. We prepare our CSV following the guidelines in the documentation and ingest it. The term "Active_users" is added to our datasets! It seems it worked.
4.- But if we then go to a dataset's entity page and click on the "Active_users" term badge to open the term's own entity page, we see that the documentation we added is missing.
5.- Then we start playing around and realize that the term "Active_users" is duplicated. There are two different entity pages: the one for the term we created manually (urn:li:glossaryTerm:82a86728-087a-4232-bfbe-5a9a2790f6ce), and the one for the term we added through CSV (urn:li:glossaryTerm:Active_users). As you can see, their ids are quite different.
6.- Not only that: in the Glossary terms menu, the ingested term is not even visible; we can only reach its page through other entities' badges. The manually created one is of course in the list, but nothing appears under "Related entities".
7.- Furthermore, we are not able to delete the ingested term. We cannot even remove it from datasets. If we try to remove it, the UI says "Successfully removed", but the term is still there after a refresh.
We understand this is a bug. Even if we missed some step to associate the ingested term with an already existing one of the same name, not being able to delete or access a term does not seem like desired behavior.
I apologize if this has been reported elsewhere. I found Glossary term bugs, but none of them covered the "not being able to delete" part. Thanks!! Dani

better-orange-49102

07/20/2022, 9:30 AM

5.- Then we start playing around and realize that the term "Active_users" is duplicated. There are two different entity pages: the one of the term we created manually (urn:li:glossaryTerm:82a86728-087a-4232-bfbe-5a9a2790f6ce), and the one of the term we added through CSV (urn:li:glossaryTerm:Active_users). As you see, their ids are quite different.

yup, if you create the term in UI, the URN is going to look like some UUID. you should refer to that URN in the csv ingestion. Does replacing the invalid term in the csv spreadsheet with another valid term, then reingesting help to remove it?

hallowed-dog-79615

07/20/2022, 9:44 AM

I may try ingesting with the URN id instead of the text, but the thing is that it does not matter whether we created a glossary term through the UI (that's what I stated in point 1). If we add the term by ingesting from CSV, even if the term did not exist previously, we cannot delete it anyway.

hallowed-dog-79615

07/20/2022, 3:49 PM

Adding a few more details. We just created a Glossary node (folder) through the UI and saw this in "Recently viewed". We don't know whether it is a bug or intended behavior, but it does not seem very comfortable for a user to see the folder listed this way!!

echoing-airport-49548

07/21/2022, 12:47 AM

Hi @hallowed-dog-79615 I apologize for the bad experience. It looks like we don’t currently check if a term exists in DataHub before applying it when running CSV ingestion. I’m going to add a fix for this that should be available in the next release

echoing-airport-49548

07/21/2022, 12:47 AM

As far as current solutions, I would recommend rolling back that ingestion run

echoing-airport-49548

07/21/2022, 12:48 AM

https://datahubproject.io/docs/how/delete-metadata/#rollback-ingestion-batch-run

echoing-airport-49548

07/21/2022, 12:48 AM

Take a look at this guide for how to do that, and it should delete the bad entities that were ingested
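A minimal sketch of the rollback steps from the linked guide, written as Python helpers that only build the command lines. It assumes the `datahub` CLI subcommands `ingest list-runs`, `ingest show`, and `ingest rollback` as described in that guide (check `datahub ingest --help` on your version), and the run id shown is hypothetical.

```python
from __future__ import annotations

import shlex


def list_runs_cmd() -> list[str]:
    """Command that lists recent ingestion runs, so the run id can be found."""
    return ["datahub", "ingest", "list-runs"]


def show_run_cmd(run_id: str) -> list[str]:
    """Command that shows what a given run ingested, before rolling it back."""
    return ["datahub", "ingest", "show", "--run-id", run_id]


def rollback_cmd(run_id: str) -> list[str]:
    """Command that rolls back everything ingested by the given run."""
    return ["datahub", "ingest", "rollback", "--run-id", run_id]


# Hypothetical run id; take the real one from the list-runs output.
RUN_ID = "csv-enricher-2022_07_20-08_00_00"

for cmd in (list_runs_cmd(), show_run_cmd(RUN_ID), rollback_cmd(RUN_ID)):
    # Printed rather than executed; pass cmd to subprocess.run(cmd, check=True)
    # to actually run it against your instance.
    print(shlex.join(cmd))
```

Inspecting the run with `show` before `rollback` is a cautious ordering, not a requirement stated in this thread.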

echoing-airport-49548

07/21/2022, 12:49 AM

I think if you rerun the CSV ingestion and make sure the term exists when adding it to the CSV (for now, before the guardrails are added on the next version), you shouldn’t run into this problem

echoing-airport-49548

07/21/2022, 12:49 AM

Please let me know if that makes sense and if you have any questions

hallowed-dog-79615

07/21/2022, 7:45 AM

Thanks @echoing-airport-49548, nice to hear that you are implementing a fix! 😄 I do have one question, though, about your comment "I think if you rerun the CSV ingestion and make sure the term exists when adding it to the CSV". Some of the terms I applied in the CSV already existed, with exactly the same spelling and case, but they became duplicated anyway. How should I specify the already existing terms in the CSV? Should I give the id like this: urn:li:glossaryTerm:82a86728-087a-4232-bfbe-5a9a2790f6ce, instead of like this: urn:li:glossaryTerm:Active_users? Thanks again!

echoing-airport-49548

07/21/2022, 3:26 PM

Ah yes so you would have to specify the exact urn for the glossary term

echoing-airport-49548

07/21/2022, 3:28 PM

check it out on this page https://demo.datahubproject.io/glossaryTerm/urn:li:glossaryTerm:11bf8c56-87a8-4127-945a-9632649a325c/Documentation?is_lineage_mode=false

echoing-airport-49548

07/21/2022, 3:28 PM

you can get that by clicking this copy icon on the glossary term page
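To illustrate the advice above, here is a minimal sketch that writes a CSV row referencing the UI-created term by its UUID URN rather than by its display name. The column names and the `[a|b]` list syntax are assumptions taken from the csv-enricher documentation and should be checked against your DataHub version; the dataset URN is hypothetical.

```python
from __future__ import annotations

import csv
import io

# Assumed csv-enricher header; verify against the docs for your version.
COLUMNS = ["resource", "subresource", "glossary_terms", "tags",
           "owners", "ownership_type", "description", "domain"]


def enricher_row(dataset_urn: str, term_urns: list[str]) -> dict[str, str]:
    """Build one CSV row that attaches glossary terms to a dataset."""
    row = dict.fromkeys(COLUMNS, "")
    row["resource"] = dataset_urn
    # Assumed list syntax: values wrapped in [] and separated by '|'.
    row["glossary_terms"] = "[" + "|".join(term_urns) + "]"
    return row


def to_csv(rows: list[dict[str, str]]) -> str:
    """Serialize rows, header first, into CSV text."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Hypothetical dataset URN; the term URN is the UUID one quoted in this thread.
DATASET = "urn:li:dataset:(urn:li:dataPlatform:hive,example_db.users,PROD)"
TERM = "urn:li:glossaryTerm:82a86728-087a-4232-bfbe-5a9a2790f6ce"

csv_text = to_csv([enricher_row(DATASET, [TERM])])
print(csv_text)
```

Using urn:li:glossaryTerm:Active_users here instead would point at a different entity from the one created in the UI, which is the duplication described in step 5.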

famous-florist-7218

08/29/2022, 3:19 AM

@nutritious-printer-9873 number 5 seems like the same problem as we have.

better-orange-49102

08/29/2022, 3:22 AM

you need to be aware that if you created the glossary term in the UI, the URN will be a UUID, and that's the URN to specify in the csv enricher

famous-florist-7218

08/29/2022, 3:32 AM

Hi @better-orange-49102. In my case, we have a transformer like the one below. It works fine, but these glossary terms [Email, PII] aren’t listed on the Glossary UI page. If I instead create a glossary term with the same name manually, it is displayed correctly.

transformers:
    -
        type: simple_add_dataset_terms
        config:
            term_urns:
                - 'urn:li:glossaryTerm:Email'
                - 'urn:li:glossaryTerm:PII'
    -
        type: pattern_add_dataset_schema_terms
        config:
            term_pattern:
                rules:
                    email: ['urn:li:glossaryTerm:PII', 'urn:li:glossaryTerm:Email']
                    first_name: ['urn:li:glossaryTerm:PII']
                    last_name: ['urn:li:glossaryTerm:PII']

famous-florist-7218

08/29/2022, 3:35 AM

It would be great if the transformers could create glossary terms as well as append them, so that these terms can be managed and listed properly in the UI.

better-orange-49102

08/29/2022, 3:36 AM

the problem is because the system assumes that the entities exist, even if they don't. the same behavior can be observed for adding tags or fictitious dataset urns in lineage mcps. perhaps a modification to the existing transformers to check for existing entity is required

famous-florist-7218

08/29/2022, 3:43 AM

That makes sense. The transformer should check whether each entity exists and create a new one when it does not. That would be more convenient.
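The check discussed above could be sketched roughly as follows. This is not how the existing transformers work; it only illustrates splitting term URNs into those that exist and those that do not before applying them. The `exists` callable is injected so the sketch runs without a server; against a live instance it might be backed by something like `DataHubGraph.exists` from the `acryl-datahub` client, which is an assumption to verify for your version.

```python
from __future__ import annotations

from typing import Callable, Iterable


def split_by_existence(
    urns: Iterable[str],
    exists: Callable[[str], bool],
) -> tuple[list[str], list[str]]:
    """Return (found, missing) URNs according to the supplied lookup."""
    found: list[str] = []
    missing: list[str] = []
    for urn in urns:
        (found if exists(urn) else missing).append(urn)
    return found, missing


# Stand-in for the catalog: only the UI-created term is known.
KNOWN_TERMS = {"urn:li:glossaryTerm:82a86728-087a-4232-bfbe-5a9a2790f6ce"}

found, missing = split_by_existence(
    [
        "urn:li:glossaryTerm:82a86728-087a-4232-bfbe-5a9a2790f6ce",
        "urn:li:glossaryTerm:Active_users",
    ],
    KNOWN_TERMS.__contains__,
)
print("apply:", found)
print("skip or create first:", missing)
```

Whether missing terms should be skipped, rejected, or created automatically is the design question raised in this thread; the sketch leaves that choice to the caller.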
