# advice-metadata-modeling
  • quick-continent-74418
    10/31/2023, 4:24 PM
    Hello, question for #advice-data-governance: how do people manage `Enum` lists of values with DataHub? The list of supported values (and sometimes their explanations) is very valuable, and I feel I'm missing something. Are people documenting the values in the description? Not documenting them at all? Some other way?
  • high-hospital-85984
    11/01/2023, 7:43 AM
    👋 (I tried to search for the word `kappa` and got ZERO hits 😅) Has anyone tried to model a kappa architecture, including the different versions of processors and target datasets, in DataHub? It might not be super user-friendly for humans, but we have some programmatic use cases in mind that would benefit from the graph data.
  • few-receptionist-67008
    11/03/2023, 4:24 PM
    Hi! Will the newly presented entity verification process (via structured properties) also be available for glossary terms? The last town hall demo included an example on dataset entities.
  • most-scientist-56654
    11/07/2023, 3:00 PM
    Hello, I need advice on how to manage metadata for a multi-tenant environment. For example, we have N GCP projects with similarly named and structured datasets/tables, so we ingest all of them by introducing N ingestion sources. However, this creates a problem: we now have N similar entities in DataHub, so updating a description (or any part of the metadata in general) requires some repetition. From what I understand, one way to overcome this is the `siblings` aspect. However, after looking into the source code and commit history, I realized that it was introduced specifically for associating database entities with dbt models. So, is `siblings` the only way to actually group similar entities and share metadata between them? Thanks.
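    For what it's worth, the aspect's shape is not dbt-specific: it is just a list of sibling URNs plus a primary flag on each side. A minimal sketch of writing it by hand with the Python SDK (the URNs and GMS address are made up, and whether the UI merges non-dbt siblings as cleanly is worth testing):
    ```python
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import SiblingsClass

    primary_urn = "urn:li:dataset:(urn:li:dataPlatform:bigquery,tenant-a.sales.orders,PROD)"
    sibling_urn = "urn:li:dataset:(urn:li:dataPlatform:bigquery,tenant-b.sales.orders,PROD)"

    emitter = DatahubRestEmitter("http://localhost:8080")
    # One side is marked primary; its sibling points back with primary=False.
    emitter.emit(MetadataChangeProposalWrapper(
        entityUrn=primary_urn, aspect=SiblingsClass(siblings=[sibling_urn], primary=True)
    ))
    emitter.emit(MetadataChangeProposalWrapper(
        entityUrn=sibling_urn, aspect=SiblingsClass(siblings=[primary_urn], primary=False)
    ))
    ```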
  • high-air-78476
    11/08/2023, 2:35 PM
    Team, please review the RFC: https://github.com/datahub-project/rfcs/pull/6
  • glamorous-ambulance-32929
    11/09/2023, 10:23 AM
    Hi team, I have created a custom action that runs when the tag "pii" is added to a dataset. I would like to access the metadata of each dataset when this tag is added, in order to then upload that dataset with its metadata to Dataverse. How can I access the dataset's metadata?
    ```python
    import json
    from typing import Optional

    from pydantic import BaseModel

    from datahub_actions.action.action import Action
    from datahub_actions.event.event_envelope import EventEnvelope
    from datahub_actions.pipeline.pipeline_context import PipelineContext


    class CustomActionConfig(BaseModel):
        # Whether to print the message in upper case.
        to_upper: Optional[bool]


    # A basic example of a DataHub action that prints all
    # events received to the console.
    class CustomAction(Action):
        @classmethod
        def create(cls, config_dict: dict, ctx: PipelineContext) -> "Action":
            action_config = CustomActionConfig.parse_obj(config_dict or {})
            return cls(action_config, ctx)

        def __init__(self, config: CustomActionConfig, ctx: PipelineContext):
            self.config = config

        def act(self, event: EventEnvelope) -> None:
            print("Custom Action! Received event:")
            metadata = event.event
            print(json.dumps(metadata.as_dict(), indent=4))

        def close(self) -> None:
            pass
    ```
    Thank you for your help
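    One way to look up the dataset's metadata from inside the action is the Python SDK's DataHubGraph client. A sketch (the GMS URL is an assumption, and the URN would come from the received event rather than being hard-coded):
    ```python
    from datahub.ingestion.graph.client import DataHubGraph, DatahubClientConfig
    from datahub.metadata.schema_classes import DatasetPropertiesClass, SchemaMetadataClass

    # Connect to GMS; adjust the server URL for your deployment.
    graph = DataHubGraph(DatahubClientConfig(server="http://localhost:8080"))

    dataset_urn = "urn:li:dataset:(urn:li:dataPlatform:hive,db.table,PROD)"  # from the event
    properties = graph.get_aspect(entity_urn=dataset_urn, aspect_type=DatasetPropertiesClass)
    schema = graph.get_aspect(entity_urn=dataset_urn, aspect_type=SchemaMetadataClass)
    print(properties, schema)
    ```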
  • important-train-83364
    11/09/2023, 11:52 AM
    Hello, I'm running experiments to customize the metadata model. I followed this guide https://github.com/datahub-project/datahub/tree/master/metadata-models-custom and basically succeeded in adding the aspect to a Hive dataset; using the script in the example, the aspect got populated and shown in the UI as a tab. Now I'm trying to take a step forward and customize a data product (the example is on a dataset). I've managed to change the model, insert data on the data product, and see it in the entity's JSON payload via the CLI. What seems to be missing is the UI part for the data product: I cannot see this data the way I can for a dataset. I can't tell whether this is a missing feature or something wrong in my model. Any advice here? (Using version 0.11.0)
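    In case it helps narrow things down: in the metadata-models-custom guide, aspects are attached to entities in registry/entity-registry.yaml. A sketch of declaring the guide's example aspect on dataProduct as well (the id and aspect name are copied from the guide; the dataProduct entry is the assumption to verify):
    ```yaml
    id: mycompany-dq-model
    entities:
      - name: dataset
        aspects:
          - customDataQualityRules
      - name: dataProduct
        aspects:
          - customDataQualityRules
    ```
    If the aspect shows up in the JSON payload but not in the UI, the auto-rendered tab may simply not be wired up for every entity type in 0.11.0; comparing how the dataset page renders the aspect versus the data product page would confirm whether it's a UI gap rather than a modeling mistake.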
  • purple-refrigerator-27989
    11/16/2023, 4:20 AM
    Hi all, I found that all ingested data stored in Neo4j has two edges (e.g. ingestionSource and r_ingestionSource). I wonder why there are always two edges; what do they mean specifically?
  • fierce-guitar-16421
    11/16/2023, 3:51 PM
    Dear DataHub team, we are now developing custom DataHub actions based on our custom metadata models. But we noticed it is quite tricky to put them together, since the datahub-actions package requires the standard acryl-datahub package as a dependency, and we don't want to fork it but still want to use our own build of the acryl-datahub package, which contains our own models. It seems reasonable to decouple datahub-actions from the acryl-datahub package: it is a framework and should not rely too heavily on the model details in acryl-datahub. Does that make sense?
  • powerful-dawn-10711
    11/21/2023, 12:23 PM
    Hi all, I'd like advice on how to document source-to-target mappings in DataHub. Currently we're using an XLS sheet template to manage the data mappings. Is there a way I can automate getting the mappings into DataHub? I am using Spark for ETL and Airflow as the scheduler.
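    One possible automation (a sketch, not a vetted pipeline): parse the XLS and emit column-level lineage with the Python SDK, one fine-grained edge per mapping row. The dataset names and GMS address below are made up:
    ```python
    import datahub.emitter.mce_builder as builder
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import (
        DatasetLineageTypeClass,
        FineGrainedLineageClass,
        FineGrainedLineageDownstreamTypeClass,
        FineGrainedLineageUpstreamTypeClass,
        UpstreamClass,
        UpstreamLineageClass,
    )

    src = builder.make_dataset_urn("hive", "staging.orders")
    dst = builder.make_dataset_urn("hive", "mart.orders")

    # One fine-grained edge per source-column -> target-column row in the sheet.
    lineage = UpstreamLineageClass(
        upstreams=[UpstreamClass(dataset=src, type=DatasetLineageTypeClass.TRANSFORMED)],
        fineGrainedLineages=[
            FineGrainedLineageClass(
                upstreamType=FineGrainedLineageUpstreamTypeClass.FIELD_SET,
                upstreams=[builder.make_schema_field_urn(src, "order_id")],
                downstreamType=FineGrainedLineageDownstreamTypeClass.FIELD,
                downstreams=[builder.make_schema_field_urn(dst, "order_id")],
            )
        ],
    )

    DatahubRestEmitter("http://localhost:8080").emit(
        MetadataChangeProposalWrapper(entityUrn=dst, aspect=lineage)
    )
    ```
    Since Airflow drives the ETL, the same emit could run as a task after each load, keeping the catalog in sync with the sheet.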
  • lively-terabyte-64366
    12/12/2023, 2:42 PM
    How to deal with multiple dbt projects that define the same sources? Hey guys! Newbie here! :) I have a question regarding the dbt core and DataHub integration. For example, we have two dbt projects, and both define the very same source table (example: a common logs stream). In one project all columns have descriptions, but the second project has none. As a result we have a race condition: the source table's column descriptions depend on when each project's metadata is ingested. How should I deal with this situation? I don't want to disallow sources in the recipe (snippet below), because that can easily be removed or forgotten, and it also prevents the other team from defining sources.
    ```yaml
    entities_enabled:
      sources: 'NO'
    ```
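    For context (this is the option being rejected above), that flag sits under the dbt source config in the ingestion recipe; a sketch with placeholder paths:
    ```yaml
    source:
      type: dbt
      config:
        manifest_path: ./target/manifest.json
        catalog_path: ./target/catalog.json
        target_platform: postgres
        entities_enabled:
          sources: 'NO'
    ```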
  • hallowed-artist-55761
    12/26/2023, 7:02 AM
    Does DataHub have the capability to connect with Kubeflow (Kubernetes for ML models)? Our team wants to use DataHub with our Kubeflow setup. Any help regarding this would be highly appreciated.
  • powerful-dawn-10711
    12/28/2023, 9:02 AM
    Is there a way I can propagate tags across lineage? Suppose I have a table (Customer) with attributes such as (name, age), and I add a PII tag to the "age" column. Will any view with an age column that reads from that table get the same tag? I want something like classification propagation, to be exact. Is it possible in DataHub using tags, or some other way?
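    I can't say whether stock DataHub propagates tags across lineage on its own, but a script that walks the lineage graph and re-applies the tag to downstream columns is doable; applying a tag to one column is a single GraphQL mutation (URNs below are made up):
    ```graphql
    mutation {
      addTag(
        input: {
          tagUrn: "urn:li:tag:PII",
          resourceUrn: "urn:li:dataset:(urn:li:dataPlatform:redshift,analytics.customer_view,PROD)",
          subResourceType: DATASET_FIELD,
          subResource: "age"
        }
      )
    }
    ```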
  • adorable-salesclerk-90917
    01/04/2024, 6:05 PM
    Tried searching but no luck... has anyone created tags for fields denoting which ones have data quality checks on them? We have a set of baseline checks and business-critical ones, and I was looking to see what patterns others have used.
  • miniature-intern-36799
    01/10/2024, 7:19 PM
    Hi there, how is it that a table can be "Composed of" two different tables, and how do I unset this? I was trying to manually identify a downstream dependency via the DataHub UI but ended up somehow merging the two tables, and can't seem to unwind it. I've looked around, but there doesn't seem to be any documentation on how these entities get nested through setting upstream/downstream. This is on v0.2.14.1, in case that's helpful.
  • limited-monitor-26855
    01/19/2024, 4:58 AM
    Hi there, any recommendations for metadata modeling? Should I start with the integration sources (data touch points) and later try to group them into data products, or should I start with data-product-first thinking?
  • millions-art-55322
    01/25/2024, 3:56 PM
    What kind of metadata?
  • salmon-quill-44349
    02/06/2024, 1:39 PM
    Hi, I am trying to connect Elasticsearch indexes in DataHub v0.10.5 and am getting the error below. Can you please help me out with this?
    ```
    'Unable to emit metadata to DataHub GMS:
    ```
  • witty-butcher-82399
    02/06/2024, 4:11 PM
    The domain entity is missing the `status` aspect. Is there any reason for that? We are planning a contribution to add it 🙂
    ```yaml
    - name: domain
      doc: A data domain within an organization.
      category: core
      keyAspect: domainKey
      aspects:
        - domainProperties
        - institutionalMemory
        - ownership
    ```
    In most (all?) other entities, deletion is implemented as soft-deletion, whereas for domains a hard-deletion is performed (well... that's my guess, given the missing `status` aspect 😅). Also, the `status` aspect is the one supporting the stateful ingestion feature in the connectors. Actually, we have a custom connector for domains and we are thinking of implementing stateful ingestion for it, and that's when we noticed the missing aspect.
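    For reference, on entities that do declare `status`, soft-deletion amounts to writing that aspect; a sketch (the URN is made up, and per the above this would not work for domains until the aspect is added to the registry):
    ```python
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import StatusClass

    # Soft-delete: removed=true hides the entity without erasing its aspects.
    DatahubRestEmitter("http://localhost:8080").emit(
        MetadataChangeProposalWrapper(
            entityUrn="urn:li:dataset:(urn:li:dataPlatform:hive,db.table,PROD)",
            aspect=StatusClass(removed=True),
        )
    )
    ```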
  • gifted-diamond-19544
    02/19/2024, 11:59 AM
    Hi there! I hope I am asking this in the right place. How could we model services in DataHub? Say, for example, we have a certain dataset on our lake, and there is a service that takes data from that dataset and turns it into other data, or sometimes just consumes it and does not produce any data. We would like to model which services are consuming data on our lake. Does anyone have an idea of how to model that? Thank you 🙂
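    One pattern that might fit (my assumption, not an established convention): represent each service as a dataFlow/dataJob and record what it reads and writes via the dataJobInputOutput aspect; a pure consumer just leaves the outputs empty. A sketch with invented names:
    ```python
    import datahub.emitter.mce_builder as builder
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import DataJobInputOutputClass

    # The service is modeled as a job; platform/flow/job ids are made up.
    job_urn = builder.make_data_job_urn(
        orchestrator="services", flow_id="enrichment-service", job_id="main", cluster="PROD"
    )
    consumed = builder.make_dataset_urn("s3", "lake.raw.events")
    produced = builder.make_dataset_urn("s3", "lake.curated.events")

    io = DataJobInputOutputClass(inputDatasets=[consumed], outputDatasets=[produced])
    DatahubRestEmitter("http://localhost:8080").emit(
        MetadataChangeProposalWrapper(entityUrn=job_urn, aspect=io)
    )
    ```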
  • calm-alligator-12692
    02/20/2024, 9:26 AM
    Hi community 👋 Want some advice. We're using a mixture of Databricks and Redshift in our team. Redshift is connected as an external catalog in Databricks, which means one Redshift table can show up both as a Redshift entity and as a Databricks entity. Can I merge them somehow? My other option is to ingest the Redshift tables via only one source plugin, but: • If I ingest via the Databricks plugin, I lose a lot of the usage stats and SQL-generated lineage within Redshift. • If I ingest via the Redshift plugin, the downstream Databricks tables that use the Redshift tables don't show any lineage to them.
  • calm-alligator-12692
    02/20/2024, 9:28 AM
    I don't mind writing a custom script to do this, just need to know where to look
  • abundant-journalist-31750
    02/21/2024, 6:30 PM
    Hi! Is it possible to connect to Elasticsearch behind the scenes and extract DataHub data, or do we need to go through GraphQL and the API? My aim is to extract a dump and import it into a Cosmos graph database.
  • chilly-processor-85299
    02/29/2024, 5:33 PM
    Hi team! I am updating dataset descriptions via GraphQL, and I can see them update correctly in the UI. But when I query the description via GraphQL, the description is not updated. How can I get the most up-to-date descriptions?
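    One thing worth ruling out: UI edits are stored on the editable aspect, so a query that reads properties.description can look stale while editableProperties.description holds the edit. A query that returns both (the URN is a placeholder):
    ```graphql
    query {
      dataset(urn: "urn:li:dataset:(urn:li:dataPlatform:hive,db.table,PROD)") {
        properties {
          description # as ingested from the source
        }
        editableProperties {
          description # as edited via the UI or mutations
        }
      }
    }
    ```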
  • limited-motherboard-51317
    03/02/2024, 2:13 PM
    Hi! I was testing a new DataHub feature, data contracts. I learned about it from a video, but I can't find a specification for data contracts in the documentation. Can someone point me to where the data contract syntax is documented?
  • limited-motherboard-51317
    03/05/2024, 12:45 PM
    Hi! Can someone provide an example of how to add properties to a glossary term using the REST API?
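    While this waits for a raw REST payload: the properties live in the glossaryTermInfo aspect's customProperties map, which can be written over REST with the Python emitter. A sketch (the URN and property names are invented; note that emitting the aspect replaces it wholesale, so read-modify-write if the term already has info):
    ```python
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import GlossaryTermInfoClass

    term_urn = "urn:li:glossaryTerm:Classification.Confidential"  # hypothetical
    info = GlossaryTermInfoClass(
        definition="Data restricted to need-to-know access.",
        termSource="INTERNAL",
        customProperties={"retention": "7y", "steward": "governance-team"},
    )
    DatahubRestEmitter("http://localhost:8080").emit(
        MetadataChangeProposalWrapper(entityUrn=term_urn, aspect=info)
    )
    ```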
  • incalculable-park-61483
    03/06/2024, 6:50 PM
    Hi! Is there a way to programmatically add owners to a glossary term via GraphQL or the SDK? I tried the docs but couldn't find anything. Thanks!
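    A sketch of one SDK route (URNs invented): ownership is an ordinary aspect on glossary terms, so emitting an ownership aspect should do it:
    ```python
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import OwnerClass, OwnershipClass, OwnershipTypeClass

    ownership = OwnershipClass(
        owners=[OwnerClass(owner="urn:li:corpuser:jdoe", type=OwnershipTypeClass.DATAOWNER)]
    )
    DatahubRestEmitter("http://localhost:8080").emit(
        MetadataChangeProposalWrapper(
            entityUrn="urn:li:glossaryTerm:Classification.Confidential", aspect=ownership
        )
    )
    ```
    As with any direct aspect emit, this overwrites the existing owner list, so fetch and merge first if the term already has owners.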
  • red-actor-28068
    03/12/2024, 3:03 PM
    Looking for a way to use something like glossary terms to store data mappings. For example, let's say I'm a car dealership that has a bunch of datasets about cars, but internally the cars themselves are stored in columns with some inscrutable code like H28938 or C20193. I'd like to tag the columns with mappings such as Honda: H28938 or H1091982, Chrysler: C20193, etc., so that users could more easily find which datasets contain data that has something to do with Hondas or Chryslers. Can DataHub accommodate something like this, or am I building out a separate tool?
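    Column-level glossary terms sound close to this: create a term per manufacturer (Honda, Chrysler, ...) and attach it to each coded column, and users can then search or filter datasets by term. A sketch of the GraphQL mutation (URNs and the column name are invented):
    ```graphql
    mutation {
      addTerms(
        input: {
          termUrns: ["urn:li:glossaryTerm:Manufacturer.Honda"],
          resourceUrn: "urn:li:dataset:(urn:li:dataPlatform:hive,dealership.sales,PROD)",
          subResourceType: DATASET_FIELD,
          subResource: "H28938"
        }
      )
    }
    ```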
  • little-painter-30105
    03/12/2024, 11:33 PM
    Hi team, we have DataHub integrated with Airflow + Snowflake + dbt + Tableau. I am trying to do some custom metadata updates using the GraphQL API. Currently the Airflow DAG owner name flows from the DAG to DataHub. We want to keep a different owner naming convention in Airflow, but it should be updated to the `ldap` user name (the DataHub signed-in user) in DataHub for these Airflow entities (DAGs and tasks). How can I update the Airflow owner name in DataHub (while keeping the different Airflow owner in the Airflow UI)? Is there a way to update it using API calls or ingestion in DataHub?
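    One route (a sketch with invented names, not a confirmed recipe): after ingestion, overwrite the ownership aspect on the DataFlow/DataJob URNs with the desired LDAP users. Be aware the Airflow plugin may re-emit owners on the next DAG run, so this may need to re-run on a schedule:
    ```python
    import datahub.emitter.mce_builder as builder
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import OwnerClass, OwnershipClass, OwnershipTypeClass

    # Airflow DataFlow URN convention: (orchestrator, dag_id, cluster).
    flow_urn = builder.make_data_flow_urn(orchestrator="airflow", flow_id="my_dag", cluster="prod")
    ownership = OwnershipClass(
        owners=[OwnerClass(owner=builder.make_user_urn("ldap_user"), type=OwnershipTypeClass.DATAOWNER)]
    )
    DatahubRestEmitter("http://localhost:8080").emit(
        MetadataChangeProposalWrapper(entityUrn=flow_urn, aspect=ownership)
    )
    ```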
  • strong-father-14840
    03/13/2024, 2:33 PM
    Hi, we would like to share our DataHub instance with our customers, but without sharing the dbt model definitions and Power BI dataset definitions. It appears that view access policies for datasets are rather limited; is it possible to hide dataset definitions, or even delete them?