# advice-metadata-modeling
w
Hi folks! Has anyone modelled dataset, dashboard, chart, etc. cost? As a user, I would like to check the cost of my dataset so I can prioritise actions for the most expensive ones. E.g. if a dataset is costly and has no usage, then plan its deletion. Cost can be complex and depend on multiple dimensions, one of them being storage size. I haven't seen any reference to storage size in the entity aspects. Is this something that you have considered? Thanks!
e
I know @big-carpet-38439 was looking into this some time ago!
w
About cost, I was thinking of something like this:
```
MODEL

DatasetCostStatistics:
- costItem: array[CostItem] 

CostItem:
- concept: string      ; describes the concept item; use this to explain measurement units
- count: double 		; number of items for the given concept
- costFactor: double 	; cost of a unit

EXAMPLES

- urn: 'urn:li:dataset:(urn:li:dataPlatform:kafka,xxx,PROD)'
  costItem:
  - concept: Topic size (GBs)
    count: 15
    costFactor: 0.001

- urn: 'urn:li:dataset:(urn:li:dataPlatform:s3,yyy,PROD)'
  costItem:
  - concept: Storage (GBs)
    count: 1500
    costFactor: 0.00001
  - concept: GDPR Deletions
    count: 8500000
    costFactor: 0.001
  - concept: GDPR Extracts
    count: 4500000
    costFactor: 0.001

- urn: 'urn:li:dataset:(urn:li:dataPlatform:ZZZ,zzz,PROD)'
  costItem:
  - concept: Storage (GBs)
    count: 1500
    costFactor: 0.00001
  - concept: Licensing (€)
    count: 10000
    costFactor: 1
```
The overall idea is that cost needs to be explained: where does it come from? So cost is split into multiple concepts, and each concept is specified in its original measurement unit plus a cost factor to convert it into money. WDYT?
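For illustration, a minimal sketch of the aggregation this model enables, using the S3 example above (the numbers are the illustrative ones from the example, not real prices):

```python
# Minimal sketch: aggregate heterogeneous cost items into money.
from dataclasses import dataclass

@dataclass
class CostItem:
    concept: str        # what is measured, e.g. "Storage (GBs)"
    count: float        # quantity in the original measurement unit
    cost_factor: float  # money per unit

items = [
    CostItem("Storage (GBs)", 1_500, 0.00001),
    CostItem("GDPR Deletions", 8_500_000, 0.001),
    CostItem("GDPR Extracts", 4_500_000, 0.001),
]

# Each concept keeps its own unit, but count * cost_factor is always
# money, so items with different units can be summed together.
total = sum(i.count * i.cost_factor for i in items)
print(total)  # ≈ 13000.015 (0.015 + 8500 + 4500)
```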
b
We have been thinking about this internally. The challenge I see is that the cost factor is subject to change outside of our control.
I almost think that capturing "cost proxy units" (e.g. storage GBs) is a more bulletproof approach
And then allowing users to find things by that
w
We are having the same debate internally: the cost factor can be tricky. The main benefit of measuring cost in terms of money is twofold: it enables comparing datasets across platforms, and all the metrics can easily be summed up. • Following the storage size (GBs) example: the cost of storage can differ a lot from one platform to another, so it is hard to compare. • About summing up: you cannot sum up those cost proxy units directly. Instead, once you translate all of them into money, you can also accumulate many other costs such as CPU, network, or even licensing and the amortized cost of the support team; they all sum up easily.
Still thinking on this over the weekend 🙂 This is an example that illustrates the need for a cost metric: 1TB of storage in Oracle is much more expensive than 1TB in AWS S3. This may go unnoticed if you look exclusively at the storage metric; however, it is clearly visible when shown as cost.
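To make that concrete, a toy comparison with made-up per-GB prices (the cost factors are illustrative, not real):

```python
# Same storage metric, very different cost: the gap only shows up once
# the metric is converted into money. Cost factors are made up.
TB_IN_GB = 1024
cost_per_gb_month = {"oracle": 0.10, "s3": 0.002}  # hypothetical EUR/GB/month

for platform, factor in cost_per_gb_month.items():
    print(f"{platform}: {TB_IN_GB} GB -> {TB_IN_GB * factor:.2f} EUR/month")
# oracle: 1024 GB -> 102.40 EUR/month
# s3: 1024 GB -> 2.05 EUR/month
```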
b
Hey! Sorry to surface this thread after a couple of months, but I stumbled upon it while searching for relevant "cost" discussions in Slack. I am currently thinking through capturing end-to-end cost attribution of data assets via our data lineage in datahub. The idea of having a comprehensive lineage of a data asset and then being able to ask "how much did it cost for the data asset to get here", or rather "how much does it cost to maintain it", is becoming increasingly desirable to our company. Obviously there are nuances around attributable costs for multi-tenant components (e.g. data pipelines), and they should be thought through. Building on the ideas presented in this thread, I think it would make sense to lean into the open-ended metadata modeling of datahub and expose not only properties that could be attributable to cost (e.g. storage, CPU cores, etc.) but also a way to publish the last observed cost for some quantum (e.g. March 2023 cost, trailing 30 days cost, etc.). I think this is desirable since we may have infrastructure and services that already help us calculate costs, but we want to use datahub to surface and visualize these costs in an e2e fashion. I want this feature to exist, but I'm looking for some help in figuring out the right direction to generalize this information. I have some time carved out this quarter to work on this and would love to see if we could get some traction. I noticed that ML models already publish cost categories and cost codes: https://github.com/datahub-project/datahub/pull/2166/files I think this is a good start and perhaps something that could be generalized across asset types, in addition to what I described above. Would love to hear what folks in this thread think! Also, if there is a better place to chat about this, please let me know!
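To sketch that "how much did it cost to get here" question, here is a toy walk over upstream lineage summing per-asset costs; the graph and numbers are made up, and it deliberately ignores the multi-tenant attribution nuances mentioned above:

```python
# Toy sketch: end-to-end cost = sum of per-asset costs over the upstream
# lineage closure. Graph and cost figures are made up; in DataHub the
# upstream edges would come from lineage metadata.
from collections import deque

upstreams = {
    "dashboard": ["tableA"],
    "tableA": ["tableB", "tableC"],
    "tableB": [],
    "tableC": [],
}
monthly_cost = {"dashboard": 5.0, "tableA": 120.0, "tableB": 800.0, "tableC": 40.0}

def end_to_end_cost(asset: str) -> float:
    seen, queue, total = set(), deque([asset]), 0.0
    while queue:
        node = queue.popleft()
        if node in seen:
            continue
        seen.add(node)
        total += monthly_cost.get(node, 0.0)
        queue.extend(upstreams.get(node, []))
    return total

print(end_to_end_cost("dashboard"))  # 5 + 120 + 800 + 40 = 965.0
```

Note this naive version charges a shared upstream fully to every consumer; real attribution would need to split such costs.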
w
In my organization, we have rolled out a new custom timeseries aspect for datasets.
```
// As shared, lightly annotated. The namespace, import, and @Aspect
// registration block are assumptions added for completeness; the names
// are illustrative.
namespace com.mycompany.metadata.cost

import com.linkedin.timeseries.TimeseriesAspectBase

record CostItem {
  concept: string          // what is being charged, e.g. "hot storage"
  amount: double           // quantity, in the original measurement unit
  measurementUnit: MeasurementUnit
  costPerUnit: double      // factor converting one unit into money
}

@Aspect = {
  "name": "costTimeseriesStatistics",
  "type": "timeseries",
}
record CostTimeseriesStatistics includes TimeseriesAspectBase {
  costItems: optional array[CostItem]
}

enum MeasurementUnit {
  BYTES
  GIGABYTES
  SECONDS
  DAYS
  EUROS
  // ...
}
```
This aspect is usually populated by the data platform owners; the shared cost of the platform is allocated to the individual datasets. As a timeseries, we can track cost over time. Cost is an array because there may be multiple items: hot/cold storage, computing resources, network, support, licensing, etc. And for each item we keep the original measurement unit and a cost factor to convert that amount into money, so we can aggregate all the items.
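For illustration, this is roughly how a payload for such a custom aspect can be emitted with the DataHub Python emitter as a generic JSON aspect (the aspect name, URN, and payload shape here are assumptions following the PDL above):

```python
import json
import time

from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    ChangeTypeClass,
    GenericAspectClass,
    MetadataChangeProposalClass,
)

# Payload shaped after the PDL above; timestampMillis comes from
# TimeseriesAspectBase. Values and aspect name are illustrative.
payload = {
    "timestampMillis": int(time.time() * 1000),
    "costItems": [
        {
            "concept": "hot storage",
            "amount": 1500.0,
            "measurementUnit": "GIGABYTES",
            "costPerUnit": 0.00001,
        }
    ],
}

mcp = MetadataChangeProposalClass(
    entityType="dataset",
    entityUrn="urn:li:dataset:(urn:li:dataPlatform:s3,yyy,PROD)",
    changeType=ChangeTypeClass.UPSERT,
    aspectName="costTimeseriesStatistics",  # must match the registered name
    aspect=GenericAspectClass(
        contentType="application/json",
        value=json.dumps(payload).encode("utf-8"),
    ),
)

DatahubRestEmitter("http://localhost:8080").emit(mcp)
```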
b
Thanks for sharing @witty-butcher-82399! This is super interesting. How have you been liking this modeling so far? Has it been working for all of your use-cases? Also, and excuse my ignorance, but how are you adding this custom timeseries aspect to datahub?
(mostly trying to understand if you are forking datahub)
w
> How have you been liking this modeling so far?
> how are you adding this custom timeseries aspect to datahub?
For the moment, we link this aspect to datasets only. It could make sense for other entities too, such as Charts and Dashboards... or Data Platform Instances. But we are not there yet. We haven't forked to add the custom aspect. https://datahubproject.io/docs/metadata-modeling/extending-the-metadata-model/#to-fork-or-not-to-fork There you have a good starting point for defining custom aspects without forking. Hope this helps.
> Has it been working for all of your use-cases?
Not sure about all of them... but it has worked so far 😅 At the moment we just want to show the cost of a dataset; quite simple.
b
@witty-butcher-82399 thank you so much! One more question, how are you visualizing the cost in datahub?
w
We have a custom UI
b
That’s really cool. Did you fork datahub or is the custom UI outside of datahub, @witty-butcher-82399?
w
Custom UI outside of DataHub
b
Super interesting! Would you be open to talking to me and someone on my team about the work you've done around cost visualization and its use-cases, @witty-butcher-82399? I could set up 15-20 minutes over Zoom sometime in the next week or two. I'd love to learn how you are using datahub to support your use-case and any learnings we could apply in my org and potentially contribute back to datahub.
w
Happy to chat, and <30 mins is 👌. Next week would be fine. What's your timezone? CEST for me.
l
I'll join as well; we are in PST. One interesting problem we are also trying to solve is pulling storage usage data, as a large share of costs can be attributed to users running queries. It is also very interesting to be able to compare ingestion vs usage, as there might be cases of dead weight. While it is not very relevant for DataHub itself, the issue with usage tracking (or even calculating the proper cost of dataset creation) for folks who use Trino is that it is not possible to attribute queries to the underlying storage use. Details in: https://trinodb.slack.com/archives/CP1MUNEUX/p1681249046392269
b
Hey! Sorry. We had some things come up last week. We are in PST. I'll send a DM to work out a time and a Zoom link.