# design-data-product-entity

    great-toddler-2251

    08/23/2022, 9:50 PM
    I will share details when we have them, but at a high level here is what we’ve been thinking/considering:
    • add ‘*data mesh*’ as a new data platform type, to organize new data product entities; it seemed better than using the existing container entity to organize them
    • a new DataProduct entity
    • a new inputOutputPort aspect
    • new data product versions of the dataset aspects datasetProperties, datasetUsageStatistics, *datasetProfile*; alternatively, maybe make those aspects not tied to dataset
    • schemaMetadata on the data product entity; this is where things get a little interesting, since in my view the data product has a schema, and all datasets associated with the data product have the same schema, so how do we address the fact that dataset currently has a schema as well?
    • a somewhat similar issue has to do with lineage, since I believe the lineage is between data products, not datasets
    So the general question is how to limit/constrain the datasets that are part of the data product without duplicating what’s already there. I was originally thinking to have the new data product entity reference 0..n dataset entities. But maybe it would be better not to try and shoehorn the existing dataset entity, and instead just have what the data product entity needs. Curious what people’s thoughts are on this, particularly given I’m very new to DH and not familiar with the standards, guidelines, and best practices.
    • I would also like domains to be singular for a data product, i.e. a data product belongs to only one domain, and similarly I was ideally thinking that owners would be singular for a data product, though on that point I’m less concerned
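To make the proposed shape concrete, here is a rough sketch using plain Python dataclasses. This is not DataHub's actual metadata model; the names (InputOutputPort, transport, schema_fields, upstream_product_urns, etc.) are illustrative stand-ins for the entities/aspects proposed in the message above.

```python
# Illustrative only: plain dataclasses mirroring the proposal above, not DataHub's model.
from dataclasses import dataclass, field
from typing import List


@dataclass
class InputOutputPort:
    """Proposed inputOutputPort aspect: how a data product exposes (or consumes) data."""
    name: str
    transport: str  # e.g. "kafka", "http", "file"


@dataclass
class DataProduct:
    """Proposed DataProduct entity."""
    urn: str
    domain: str                  # singular: a data product belongs to exactly one domain
    owner: str                   # ideally singular as well
    schema_fields: List[str] = field(default_factory=list)          # product-level schema
    ports: List[InputOutputPort] = field(default_factory=list)
    dataset_urns: List[str] = field(default_factory=list)           # 0..n referenced datasets
    upstream_product_urns: List[str] = field(default_factory=list)  # lineage between products
```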

    elegant-state-4

    12/20/2022, 2:56 PM
    @great-toddler-2251 Thank you for sharing your thoughts on introducing a Data Product entity to datahub. I am new to this community and I read what you posted with great interest. I think the way you proposed to go about adding this feature makes sense to me and is in line with our expectations out in the field. I am currently working on building Data Mesh platforms for our clients and what you described matches what we are looking for. I would however make a couple of suggestions:
    • A data product has to have at least one dataset on its output ports. One of the key characteristics of a data product is that it has to be "valuable on its own". A DP that produces nothing cannot be that.
    • I would definitely have dataProductProperties, dataProductUsageStatistics, and dataProductProfile for data products

    elegant-state-4

    01/11/2023, 9:16 PM
    Hey folks! I am new to datahub and I took an initial stab at adding the Data Product entity as per @great-toddler-2251's diagram above. My code is on a fork. I'm having some build issues and need help from folks more experienced than me. Frontend is not my forte and some of the issues I am facing are frontend-related. FYI this is very much a work in progress, so more still needs to be done.

    limited-refrigerator-50812

    01/13/2023, 3:01 PM
    Hi guys, I am pretty new to DataHub. I have an academic background, but my industrial partner (KPN, a major Dutch telecom provider) is working with DataHub, so I am very interested in seeing what's happening here and have enjoyed reading the discussion so far. Since we are working on adding data products to our datahub data catalog, I figured I would contribute my own thoughts and work to the channel (it might be a bit much, I apologise in advance). Am very interested to hear what you all think.

    limited-refrigerator-50812

    01/13/2023, 3:04 PM
    So if you want to accurately describe data products with their underlying output ports and data, you need a good definition of what a data product is. Literature and industry are still discussing this, but I reckon that for a good description, and to make use of what datahub already has, we want to describe data products at three levels:
    Data Product: As has been noted in the discussion here, data products can be pretty abstract or vague. Different people have different ideas about what data products are. In fact, the whole data mesh design emphasises that data products can have many different manifestations. IMO their flexibility is one of their greatest strengths: if I am a data provider I can define my data product in a way that makes sense in my particular context. Of course the flipside of this is that it is hard to come up with a standardised way to describe data products. More on this later; for now I focus on the relations between data products, output ports, and data sets.
    Data (Set): We could give any number of definitions for this; @great-toddler-2251 and @swift-controller-36449 have made excellent points (imo) on the ambiguity of this. If we look at the way data sets are defined in datahub, I don't think this meets @great-toddler-2251's definition of a set, because they can definitely change. What I think is important for the metadata modelling in this context is that we consider the data to be the thing we actually want to exchange. What I mean by that is that output ports exist to expose the data, and data products exist to logically group, govern, describe, exchange, etc. the data. At the end of the day, the data is what actually goes from data providers to data consumers, and the output ports and data product metadata exist to support this transaction. So when I talk about data in this context, I mean what is already well-defined/supported in datahub, such as the datahub definition of a dataset, but it could be any collection of files that we want to exchange with our data products.
    Output Ports: The output ports exist to expose the data. Output ports can expose different representations of data or even different data altogether. @great-toddler-2251 rightfully calls them inputOutputPorts because the output ports of some data products can be the input ports of other data products. As an example of the first category, consider exposing inventory data: I could expose inventory data by streaming with a kafka topic output port; this is useful for data consumers who need to know if an item is available right now (e.g., salespersons). Alternatively, I could also expose daily/weekly updates of the inventory, with manual corrections made by the inventory manager, as a set of excel files available through an API or some other portal. This would be valuable to data consumers who value correctness over speed (such as people in my finance department). As an example of the second category, consider a data product with personally identifiable information (PII). I might want to expose an anonymised version of the data through a separate output port, which people can browse/use without the extra hassle that comes with dealing with PII.
    Now, the reason I propose data at these three levels is exactly because they have 1..* and 0..* relations. Each data product can have multiple output ports and multiple data sets/files/sources underlying them. Each data set/file/source can have associated metadata such as a technical schema with technical lineage. Each output port can describe different instructions for gaining access or different requirements/policies for consuming the underlying data. Finally, the data product describes all information that exists at an aggregated level. This definitely includes an owner and a description, but can also include a logical schema as well as lineage/relations with other data products.
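As a purely illustrative instance of the three levels described above (using the inventory example from that message), the same product might look like the sketch below. The dict keys are made up for illustration, not a DataHub structure.

```python
# Purely illustrative: the inventory example above, expressed at the three levels.
inventory_data_product = {
    "name": "inventory",
    "owner": "inventory-team",                   # product-level, aggregated metadata
    "description": "Current and historical stock levels",
    "output_ports": [                            # 1..* ports, each with its own access/policies
        {"name": "realtime-stock", "transport": "kafka topic",
         "audience": "salespersons who need availability right now"},
        {"name": "weekly-snapshot", "transport": "excel files via an API",
         "audience": "finance, who value manual corrections over speed"},
    ],
    "datasets": ["warehouse.inventory_levels"],  # 0..* underlying data sets/files/sources
}
```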

    limited-refrigerator-50812

    01/13/2023, 3:09 PM
    So now, what fields do we actually want to describe? I am showing here a (smaller) version of my own big formal model, which I hope to publish in an academic setting over the next couple of months. For convenience I have included some (if not all) aspects with dashed lines. Output ports could be implemented as aspects in DataHub or they could be entities, I am not sure. A lot of these fields are already contained in datahub, some of them aren't. I don't want to spam even more and discuss every single field, so I'm just going to point out one thing: based on my own experiences with data meshes and data products, one of the greatest obstacles is the organisational change that is required to effectively cultivate data providers. It's currently no one's job to build and maintain data products, even though it's a lot of work to do so. My own thoughts on the matter are that, at the very minimum, we have to be able to establish that data products are valuable (and which data products are valuable). Only then can we effectively motivate the effort that data providers have to put into creating and maintaining (and describing in metadata) data products. There is a lot of literature on how to theoretically price data products, but no one seems to have effectively put this into practice. Personally, I believe that the value of data products can only be established through how they are used by consumers, and so I have added a use case entity into the metadata model and encourage data consumers to document their use case.
    DataMeshStandard_Simplified.pdf
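As a rough sketch of that last idea (not part of the attached model or of DataHub itself), use cases could be recorded as their own small entity and counted per product as a crude value signal. All names here are hypothetical.

```python
# Hypothetical sketch: document consumer use cases and count them per data product.
from dataclasses import dataclass
from typing import List
from collections import Counter


@dataclass
class UseCase:
    name: str
    consumer: str                  # the team or person consuming the data
    data_product_urns: List[str]   # the data products this use case depends on


def use_cases_per_product(use_cases: List[UseCase]) -> Counter:
    """A crude value indicator: how many documented use cases each product supports."""
    counts: Counter = Counter()
    for uc in use_cases:
        counts.update(uc.data_product_urns)
    return counts
```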

    limited-refrigerator-50812

    01/13/2023, 3:12 PM
    We have implemented (a subset of) this data model through mockups in the datahub business glossary. Of course, ideally data products would exist as queryable entities with hierarchical links to output ports and data sets, which is why my interest is piqued by @elegant-state-4's work. If you'd like I can share some screenshots of those as well.

    great-toddler-2251

    01/13/2023, 4:17 PM
    @limited-refrigerator-50812 thanks for the detailed input and feedback. Very helpful and interesting. At this stage of data mesh, all ideas are good ideas I think. I wanted to share a diagram I put together a few months back to describe to folks in my company a logical view of a data product. fwiw …

    great-toddler-2251

    01/13/2023, 4:24 PM
    by multimodal I mean different transports, e.g. files, streams, HTTP, etc., and potentially different wire formats, e.g. a Kafka port w/Protobuf vs an HTTP port w/JSON

    mammoth-bear-12532

    01/25/2023, 6:22 AM
    I've started an RFC to capture the discussion on this channel. https://github.com/datahub-project/rfcs/pull/1

    mammoth-bear-12532

    01/25/2023, 6:23 AM
    Haven't captured all the threads yet, would be great to get some help there 🙂

    narrow-bear-42430

    02/13/2023, 3:39 PM
    Hi All (and @mammoth-bear-12532) - sorry for being rather late to the party, but I wanted to add my thoughts to this - partly as we're a prospective customer, but also because I've been doing some work on this across my last 2-3 companies over the last few years, and maybe my thoughts are useful here... I've attached a copy of my thinking (I thought a pdf would be easier than a massive post! I will get round to posting this as a blog at some point!) This has been used at a couple of companies, at least in the early stages of talking about Data Products at each company, so it does have at least some real-world production-level feedback.
    The essence is splitting the documentation into build-time and run-time data. Much of the build-time info should be known at conception of the product and could perhaps be populated from info held in the original Jira (or thing-tracker) ticket and then enriched subsequently through build/deploy pipelines. The Data Product itself is multi-modal - you should be able to get the same result from interrogating the data product across different delivery platforms (e.g. Tables, Streams or APIs) - but the documentation should be consistent. Differences between the Platforms should be encapsulated in a Data _Contract_. And if a picture is worth a thousand words - here's the image:
    Data_Product_Documentation_Artefacts.pdf
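A minimal sketch of that split, assuming nothing beyond what the message describes; the keys and values are placeholders, not the attached model or a DataHub schema.

```python
# Illustrative only: build-time vs run-time documentation, plus per-platform contracts.
data_product_doc = {
    "build_time": {                    # known at conception, e.g. seeded from the original ticket
        "name": "example-product",     # placeholder
        "owner": "example-team",       # placeholder
        "source_ticket": "PROJ-123",   # placeholder Jira / issue-tracker reference
    },
    "run_time": {                      # enriched through build/deploy pipelines and operations
        "last_deployed": "2023-02-01",
        "freshness_sla_hours": 24,
    },
    # the product is multi-modal: the same result across delivery platforms,
    # with platform differences captured in a data contract per platform
    "data_contracts": {
        "tables": {"location": "warehouse.example_product"},
        "streams": {"topic": "example-product-events"},
        "api": {"endpoint": "/example-product"},
    },
}
```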

    proud-dusk-671

    03/31/2023, 10:29 AM
    https://datahubspace.slack.com/archives/CV2KB471C/p1680161427852299 For your perusal ^

    quiet-dusk-83720

    04/28/2023, 7:46 AM
    I just saw the demo of the data product support in datahub, good job. Something I thought about when the difference between ‘internal’ vs ‘shareable’ assets/datasets came on screen: is an ‘output port’ not what is meant by a shareable asset/dataset (using the terminology of Zhamak)? So instead, marking assets/datasets as ‘internal’ vs ‘output port’ could be a good idea to align with that terminology?

    fancy-toddler-89669

    05/01/2023, 8:13 AM
    I loved the townhall, @mammoth-bear-12532! Very nice to see the progress towards a real Data Product entity 🥳 Currently within our company we're misusing the glossary to create data product entities. This covers most practical use cases. However, what would be really valuable for us in a separate entity is the concept of output ports. Currently within the glossary there's no way to define separate output ports of a product. Having this feature in the entity would provide a definite starting benefit over the glossary itself!

    limited-refrigerator-50812

    05/01/2023, 10:44 AM
    Is the code shown in the demo available somewhere? I was slowly building my own data product entity and it would be interesting to compare / start merging the two.

    mammoth-bear-12532

    05/15/2023, 2:30 PM
    @quiet-dusk-83720 @fancy-toddler-89669 @limited-refrigerator-50812 thanks for your feedback so far! The PR that covers the features shown during the demo is now up and will be merged soon: https://github.com/datahub-project/datahub/pull/8039. We’re definitely interested in hearing your feedback on output port terminology (as proposed by Zhamak) versus public / private (as implemented by most software systems, including most recently dbt: https://docs.getdbt.com/docs/collaborate/govern/model-access).

    limited-refrigerator-50812

    05/16/2023, 8:42 AM
    Let's consider a data product for a company that offers digital subscriptions to its customers. The image shows a simple logical model with three tables that I think is fairly self-explanatory. Now we consider what kind of output ports could be constructed to make this data available to consumers. I will give three examples. For each example there is the same logical model; they all belong to the same data product, with the same owner, but we are making the data available in a different manner. This includes the access rights captured by public/private/protected that can be found in the dbt terminology, but it goes beyond that, as we will see.
    Scenario 1: There exists an output port that simply makes all the data available through a single SQL-based API. The corresponding output port describes how you can get access rights, under what conditions, who to contact with questions, etc. This information can also be captured in a data contract for this output port.
    Scenario 2: Our company wants to do marketing research to predict patterns in customer behaviour. However, customer data contains personally identifiable information (PII) protected by regulations such as GDPR, ToS, California's data protection regulation, etc. Therefore, an output port is created that exposes only the subscription and product tables. Moreover, an anonymisation script is run to ensure that information in the subscription table cannot lead back to individuals.
    Scenario 3: The company runs a service that allows customers to view their subscription details, which consumes data from this data product. The data needs to be accessible in a more timely manner than our SQL-based back-end supports, so we expose it through streaming. An output port is created that exposes the data in a kafka stream, with its own instructions, SLA, etc.
    So all three output ports expose the same data, just different distributions of it. I would like to describe the data on one level (e.g., the dataset or the data product level) and the output ports/ways it can be consumed on the output port level. Output ports 1 and 2 expose data from the same backend/platform, but have different access rights and permitted usage (which could be captured with public/private terminology). However, output port 3 is different in a way that cannot be captured by public/private. You could address this by saying output port 3 is a different data set and/or a different data product, but that leads to duplicated effort, as we would then need to describe a lot of information twice.
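To make the argument concrete, here are the three scenarios sketched as plain data. Everything below is illustrative (the port names and access descriptions are made up), but it shows that ports 1 and 2 differ only in access and permitted usage, while port 3 differs in delivery mechanism, which a single public/private flag cannot express.

```python
# Illustrative only: the three output ports of the subscription data product above.
# Ports 1 and 2 differ in access rights / permitted usage; port 3 differs in how the
# data is delivered, which a single public/private flag on the product cannot capture.
subscription_output_ports = [
    {"name": "full-sql-api",        "transport": "sql api",
     "tables": ["customer", "subscription", "product"],
     "access": "granted on request, per the data contract"},
    {"name": "anonymised-research", "transport": "sql api",
     "tables": ["subscription", "product"],          # PII table excluded, rows anonymised
     "access": "open to internal marketing research"},
    {"name": "subscription-stream", "transport": "kafka",
     "tables": ["subscription"],
     "access": "restricted to the customer-facing service"},
]
```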

    proud-dusk-671

    05/24/2023, 4:50 PM
    What is the meaning of a data product being private? cc: @mammoth-bear-12532

    aloof-dentist-85908

    05/30/2023, 10:54 AM
    Hi, I just watched the recording of the last townhall. I am really looking forward to Data Products! Something was not completely clear to me: Will the possibility to have the same asset (e.g. dataset) in many Data products be SaaS/Acryl only? Thanks! 🙂 @mammoth-bear-12532

    quiet-dusk-83720

    06/02/2023, 8:34 AM
    On reading the docs I’ve seen this planned update to the data product feature: “Support for semantic versioning of the Data Product entity”. Thinking about this, is it not at the level of the output ports that we need versioning? Think of a scenario where you evolve a data product and only break the data contract of one output port but not the other output ports; then the version information is needed at the port level, no?

    nutritious-orange-23137

    06/20/2023, 9:45 AM
    Hi team! Thanks for supporting such a great concept as data products. I'm just wondering if it's possible to create/update data product properties in the UI? I created a new data product in the web UI, but I couldn't add any properties to it :( In general I want to grant some users (business owners) rights to edit this data, but it seems that I can only ingest properties from a yml file. NB: I have all the (admin) rights on my datahub instance; in addition I granted myself all privileges for the "data product" type. Thanks!

    nutritious-lighter-88459

    07/03/2023, 3:27 PM
    Hi Datahub Team, I am trying to create a Data Product via the OpenAPI's POST /entities/v1 endpoint with the request body below:
    [
      {
        "entityType": "dataproduct",
        "entityKeyAspect": {
          "__type": "DataProductKey",
          "id": "ikhatri-dp-openapi"
        },
        "aspect": {
          "__type": "DataProductProperties",
          "name": "OpenAPI Test DP",
          "description": "Data Product Created Via OpenAPI",
          "customProperties": {
            "creation source": "openapi",
            "created date": "3rd July, 2023"
          },
          "assets": [
            {
              "destinationUrn": "urn:li:dataset:(urn:li:dataPlatform:mysql,datahub.employee,PROD)",
              "created": {
                "time": 1688134320000,
                "actor": "urn:li:corpuser:etl",
                "impersonator": "urn:li:corpuser:jdoe"
              },
              "lastModified": {
                "time": 1688134320000,
                "actor": "urn:li:corpuser:etl",
                "impersonator": "urn:li:corpuser:jdoe"
              }
            },
            {
              "destinationUrn": "urn:li:dataset:(urn:li:dataPlatform:mysql,datahub.department,PROD)",
              "created": {
                "time": 1688134320000,
                "actor": "urn:li:corpuser:etl",
                "impersonator": "urn:li:corpuser:jdoe"
              },
              "lastModified": {
                "time": 1688134320000,
                "actor": "urn:li:corpuser:etl",
                "impersonator": "urn:li:corpuser:jdoe"
              }
            }
          ]
        }
      }
    ]
    This returns a 201 response status along with the unique urn of the created data product. However, after this operation I am no longer able to list all data products (including the ones I created via the GraphQL API). I keep getting an "unknown error occurred" message (PFA). Even searching doesn't work. However, when I enter the urn directly into the url as {hostname:port}/dataProduct/urn:li:dataproduct:ikhatri-dp-openapi then I am able to view the data product. 🤔 Is there something missing in my request body, or is there an issue with the listing API used by the UI?
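For what it's worth, here is a minimal sketch of the same call made from Python with `requests`, assuming a reachable GMS and token auth. The host, the token, and the `/openapi` path prefix are assumptions (the message only names `POST /entities/v1`); the body is the one above, trimmed of the customProperties and assets for brevity.

```python
# Sketch only: POST the request body above to the entities endpoint with `requests`.
import requests

GMS = "http://localhost:8080"        # assumption: default GMS host/port
TOKEN = "<personal access token>"    # assumption: token auth is enabled

body = [
    {
        "entityType": "dataproduct",
        "entityKeyAspect": {"__type": "DataProductKey", "id": "ikhatri-dp-openapi"},
        "aspect": {
            "__type": "DataProductProperties",
            "name": "OpenAPI Test DP",
            "description": "Data Product Created Via OpenAPI",
        },
    }
]

resp = requests.post(
    f"{GMS}/openapi/entities/v1/",   # the message calls it POST /entities/v1; /openapi prefix assumed
    json=body,
    headers={"Authorization": f"Bearer {TOKEN}"},
)
print(resp.status_code, resp.text)   # expect 201 with the created urn
```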

    blue-mechanic-1369

    07/25/2023, 12:33 PM
    Hi, I have a conceptual question. For Data Product properties, would you put these as properties of the Data Product entity, the Asset entity (e.g. table), or both? For example: Data Product Owner, Security Classification, Lineage, Retention Policy, Version. Maybe it's decided on a property-by-property basis? Interested in your thoughts!

    handsome-train-99822

    08/17/2023, 9:21 PM
    Hi @little-megabyte-1074 cc: @cuddly-knife-31265 I wanted to follow up on the shared entity behavior we saw in our datahub env. Curious if this is a feature or a configuration on our end. Summary:
    • Our entities consist of GCP Datasets & Looker Projects
    • We have a Domain 'US' in our env
    • Under Domain 'US', we have Data Product A & Data Product B
    • When adding a BigQuery table, we can assign the entity to BOTH Data Product A & Data Product B
    • HOWEVER, when we assign a Looker dashboard entity to both, it ends up in only ONE of Data Product A & Data Product B
    Want to confirm whether this is a deliberate feature or if it's possible to assign the entity to multiple data products? Any suggestions would be much appreciated. Thank you!

    mammoth-bear-12532

    08/31/2023, 6:30 AM
    Hi @handsome-train-99822: sorry for seeing this late, it is possible to assign an entity to multiple data products.

    quiet-dusk-83720

    09/05/2023, 2:10 PM
    Is there any support planned to have the workflow of requesting data access to an output port of a data product (or to a data product as a whole) supported in DataHub, or integrateable with datahub out of the box? (cc @mammoth-bear-12532)

    gifted-bird-57147

    12/12/2023, 11:01 AM
    Hi, I'm trying to implement some data products using the CLI. I want to specify a group as owner:
    owners:
      - id: urn:li:corpGroup:512344f8-3107-4194-bbb9-0c1f127a57f6
        type: BUSINESS_OWNER
    This is an existing group in our Datahub environment. But when I ingest the recipe, the corpGroup URN is not recognized and a new user URN is created: urn:li:corpuser:urn:li:corpGroup:512344f8-3107-4194-bbb9-0c1f127a57f6. Via the UI I am able to associate a group as owner and set an ownership type. If I then use the datahub dataproduct diff function to sync the UI changes back to the yml, the ownership type is set to NONE... Is this a known issue? I'm using cli version 0.12.0.3 on managed acryl v0.2.13.3.

    brash-crayon-20992

    02/26/2024, 3:57 PM
    Hi team! General question about Data Products. I've read the docs and played with these in the UI and it's a great feature. I had a general question around "Data Products": in my mental model, a Data Product could be the combination of datasets coming from different teams and domains to create new datasets that power a business case (the Data Product). In that model, a dataset can be linked to one or more data products. I see that the definition in the documentation mentions that a Data Product belongs to a single domain. Is my mental model wrong? If so, how would you go about logically grouping datasets from different domains for a new purpose?

    limited-refrigerator-50812

    03/05/2024, 3:32 PM
    During the last town hall @mammoth-bear-12532 mentioned a PR that included defined output ports. Does anyone have a link to that?