# advice-metadata-modeling
s
Hi all, I've just locally deployed and started playing around with DataHub. I have successfully ingested the metadata from BigQuery, but now I want to push the tags/descriptions that we add in the DataHub UI back to BigQuery. How can we access the newly added metadata?
b
DataHub doesn't send the tags back to the source system, if that's what you're asking
s
Thanks for the reply. Yes, DataHub doesn't send tags to the source system, but we can have a separate pipeline which extracts the tags from the DataHub metadata DB and then sends them to the source system. My question is: where can I access those tags in the DataHub metadata DB?
b
you can query GMS for the tags currently applied on each dataset (shown in the code here) https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/examples/library/dataset_add_tag.py if you want to act on tags as they are added to the dataset, I guess the Actions Framework is possible, but I'm not too familiar with it
s
Understood, will try this approach. Thank you @better-orange-49102 for the suggestion!
l
Actions framework is definitely recommended here
You can listen for tag change events
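For reference, a minimal sketch of a custom action that reacts to tag changes, modeled on the hello_world example in the datahub-actions package (the EntityChangeEvent field access below is an assumption based on the Actions docs; adjust to your datahub-actions version):

```python
# A minimal sketch of a tag-sync action, modeled on the hello_world example
# in datahub-actions. The EntityChangeEvent fields (category, operation,
# modifier, entityUrn) are assumptions from the Actions docs.
from datahub_actions.action.action import Action
from datahub_actions.event.event_envelope import EventEnvelope
from datahub_actions.pipeline.pipeline_context import PipelineContext


class TagSyncAction(Action):
    @classmethod
    def create(cls, config_dict: dict, ctx: PipelineContext) -> "Action":
        return cls(ctx)

    def __init__(self, ctx: PipelineContext):
        self.ctx = ctx

    def act(self, event: EventEnvelope) -> None:
        # Only react to entity change events whose category is TAG.
        if event.event_type != "EntityChangeEvent_v1":
            return
        semantic_event = event.event
        if semantic_event.category == "TAG":
            # semantic_event.modifier holds the tag urn; push it to BigQuery here.
            print(
                f"{semantic_event.operation} tag {semantic_event.modifier} "
                f"on {semantic_event.entityUrn}"
            )

    def close(self) -> None:
        pass
```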
s
Hi @loud-island-88694, instead of having a listener which syncs tag change events: per our requirements I don't need dynamic behaviour here, i.e. instead of syncing via the Actions Framework, I can sync daily using a pipeline which extracts the changes and sends them to the source system (I can handle a lag of T-1 day). Let me know your thoughts!
l
Extracting only the changes in the daily pipeline will be tricky, but it is doable
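One way to sidestep change extraction is to snapshot all tags daily and diff against the previous snapshot. A rough sketch, assuming a recent SDK version that provides DataHubGraph.get_urns_by_filter and get_aspect; the snapshot file path is a hypothetical placeholder:

```python
# Rough sketch of a daily tag-snapshot diff. Assumes a recent datahub SDK
# with DataHubGraph.get_urns_by_filter/get_aspect; snapshot path is hypothetical.
import json
from pathlib import Path
from typing import Optional

from datahub.ingestion.graph.client import DatahubClientConfig, DataHubGraph
from datahub.metadata.schema_classes import GlobalTagsClass

graph = DataHubGraph(DatahubClientConfig(server="http://localhost:8080"))
snapshot_path = Path("tag_snapshot.json")  # hypothetical location for yesterday's state

# Build today's {dataset urn -> [tag urns]} map.
today: dict = {}
for urn in graph.get_urns_by_filter(entity_types=["dataset"], platform="bigquery"):
    tags: Optional[GlobalTagsClass] = graph.get_aspect(urn, GlobalTagsClass)
    today[urn] = sorted(t.tag for t in tags.tags) if tags else []

# Compare with yesterday's snapshot and report only the datasets whose tags changed.
yesterday = json.loads(snapshot_path.read_text()) if snapshot_path.exists() else {}
for urn, tag_list in today.items():
    if tag_list != yesterday.get(urn, []):
        print(f"changed: {urn} -> {tag_list}")  # push these to BigQuery

snapshot_path.write_text(json.dumps(today))
```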
s
@loud-island-88694 @better-orange-49102 while trying to extract the tags using the sample below, I can't extract the list of tags available (not the changes, but the complete list of tags). Any suggestions to resolve this?
```python
import logging
from typing import Optional

from datahub.emitter.mce_builder import make_dataset_urn, make_tag_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper

# read-modify-write requires access to the DataHubGraph (RestEmitter is not enough)
from datahub.ingestion.graph.client import DatahubClientConfig, DataHubGraph

# Imports for metadata model classes
from datahub.metadata.schema_classes import (
    ChangeTypeClass,
    GlobalTagsClass,
    TagAssociationClass,
)

log = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO)


# First we get the current tags
gms_endpoint = "http://localhost:8080"
graph = DataHubGraph(DatahubClientConfig(server=gms_endpoint))

dataset_urn = make_dataset_urn(platform="bigquery", name="test_dataset.test_table", env="PROD")
print(dataset_urn)

current_tags: Optional[GlobalTagsClass] = graph.get_aspect_v2(
    entity_urn=dataset_urn,
    aspect="globalTags",
    aspect_type=GlobalTagsClass,
)

print(current_tags)
```
b
you mean the list of tags retrieved is not all the tags that are applied on the dataset?
are those tags applied at the field level?
s
no, the tags are applied on the dataset, through the DataHub frontend UI
b
i don't understand. So you ran the script above, and it doesn't return you any tags for the dataset, even though you have added the tags in the UI?
is the dataset_urn targeting the correct dataset?
s
1. "i don't understand. So you ran the script above, and it doesnt return you any tags of the dataset, even though you have added the tags in UI?" - yes , It doesn't return any tags 2. yes, it is pointing to correct dataset. 3. Also, can we verify the tags added from backend . As in the tags added must be stored in a db. Can we access that db?
b
3. It's inside the MySQL metadata_aspect_v2 table. Not sure why 1 doesn't work. @big-carpet-38439
The only other possible reason is that the tags are applied on dataset fields, which means that the information is kept in another aspect and not in GlobalTagsClass. But since you say they're applied on the dataset...🤷‍♂️
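For point 3, a minimal sketch of reading the latest globalTags aspects straight from that MySQL table; host, credentials, and database name below are assumptions for a default local deployment, so adjust to your setup:

```python
# Minimal sketch: read the latest globalTags aspect rows from MySQL.
# Host/user/password/database are assumptions for a default local deployment.
import json

import pymysql

conn = pymysql.connect(
    host="localhost", port=3306, user="datahub", password="datahub", database="datahub"
)
with conn.cursor() as cur:
    # In metadata_aspect_v2, version = 0 always holds the latest aspect value.
    cur.execute(
        "SELECT urn, metadata FROM metadata_aspect_v2 "
        "WHERE aspect = 'globalTags' AND version = 0"
    )
    for urn, metadata in cur.fetchall():
        tags = [t["tag"] for t in json.loads(metadata).get("tags", [])]
        print(urn, tags)
conn.close()
```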
s
@better-orange-49102 @loud-island-88694 I need your help with the below situation: I extracted the data from BQ and loaded it into DataHub, then I deleted the table from BQ and ingested again. But in DataHub the table is still present. How can I delete it from DataHub as well? Or maybe a soft delete might also work.
l
@steep-furniture-57251 please set `stateful_ingestion.enabled` to true in your recipe
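For reference, a sketch of where that flag sits, expressed as a programmatic pipeline run; the project_id and server values are placeholders, and stateful ingestion also needs a stable top-level pipeline_name so consecutive runs can be compared:

```python
# Sketch of a recipe with stateful ingestion enabled, run programmatically.
# project_id and server are placeholders; pipeline_name must stay constant
# across runs so stale entities can be detected and soft-deleted.
from datahub.ingestion.run.pipeline import Pipeline

pipeline = Pipeline.create(
    {
        "pipeline_name": "bigquery_prod",  # must stay constant across runs
        "source": {
            "type": "bigquery",
            "config": {
                "project_id": "my-gcp-project",  # placeholder
                "stateful_ingestion": {
                    "enabled": True,
                    "remove_stale_metadata": True,  # soft-deletes tables gone from BQ
                },
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "http://localhost:8080"},
        },
    }
)
pipeline.run()
pipeline.raise_from_status()
```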
s
Thank you @loud-island-88694