# getting-started
n
Hello #general, I'd like to raise a question: can we use DataHub to store metadata other than the metadata structure itself? Some context to better frame the question: my application processes a high daily volume of inbound feeds for several customers. Each feed is transformed into a data model, processed to clean/compute/enrich some information, and then stored as a Parquet file in a given location (so far we are working on Hadoop, but this may change at some point). For each of those feeds I'd be interested in storing information such as:
• the version of the parser/processor/etc. which generated it
• the version of the data model used (and whether it's deprecated)
• the inbound feed which generated it
• the date when it was generated
• the location where I can find the output/input feed
• the customer owning the inbound feed
Those are just a few examples of metadata I would like to attach to a given dataset, mainly so that I can search through them later on. At the same time, I would need to implement an ACL to restrict access to those metadata. I'm currently analyzing DataHub to assess whether it could satisfy those requirements. I therefore cloned the repository and tried some data ingestion; I've played with Rest.li to create some datasets. My first impression is that DataHub is mainly meant to store, manage, and search through the metadata structure only, but maybe I'm approaching this tool from the wrong point of view. Given the use case described above, can you tell me whether DataHub could fit my needs once I've implemented the appropriate extensions to the current model?
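To make the use case above concrete, here is a minimal sketch of the per-feed metadata record being described. The class and field names are hypothetical illustrations, not DataHub types:

```python
from dataclasses import dataclass

@dataclass
class FeedProvenance:
    """Illustrative record of the custom metadata listed above.
    All names here are hypothetical, not part of DataHub."""
    parser_version: str        # version of the parser/processor that generated the output
    datamodel_version: str     # version of the data model used
    datamodel_deprecated: bool # whether that data model version is deprecated
    source_feed: str           # identifier of the inbound feed
    generated_at: str          # ISO-8601 date of generation
    input_location: str        # where the inbound feed lives
    output_location: str       # where the produced Parquet file lives
    owning_customer: str       # customer owning the inbound feed

# Example instance for one daily feed of a fictional customer.
meta = FeedProvenance(
    parser_version="2.3.1",
    datamodel_version="v5",
    datamodel_deprecated=False,
    source_feed="customer-a/daily",
    generated_at="2021-03-15",
    input_location="hdfs:///data/in/customer-a/2021-03-15.csv",
    output_location="hdfs:///data/out/customer-a/2021-03-15.parquet",
    owning_customer="customer-a",
)
```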
o
A lot of this information seems like it would be appropriate to add through lineage and relationships. We've found the schema to be very flexible and easy to update, so I don't think any of this would preclude you from using DataHub šŸ™‚ some of it is just not out-of-the-box.
b
It sounds like you'd like to attach a variety of custom metadata to a dataset. If so, this is exactly what DataHub is designed for. However, you need to create your own custom metadata aspects (https://github.com/linkedin/datahub/blob/master/docs/what/aspect.md) which will then automatically extend the MCE event for you to include the additional information. Once the metadata is ingested, it's a matter of indexing them properly for search (see https://github.com/linkedin/datahub/blob/master/docs/how/search-onboarding.md).
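As an illustration of the Rest.li ingestion path mentioned above, here is a rough Python sketch of an MCE-style payload carrying a custom aspect. The `com.example.FeedProvenance` aspect name and its fields are hypothetical (you would define your own aspect schema first), and the exact snapshot shape depends on your DataHub version:

```python
import json

def build_mce(dataset_urn: str, custom_aspect: dict) -> dict:
    """Assemble an MCE-like snapshot for a dataset, attaching one
    custom aspect. The aspect key "com.example.FeedProvenance" is a
    hypothetical custom aspect, not a built-in DataHub one."""
    return {
        "proposedSnapshot": {
            "com.linkedin.pegasus2avro.metadata.snapshot.DatasetSnapshot": {
                "urn": dataset_urn,
                "aspects": [
                    {"com.example.FeedProvenance": custom_aspect},
                ],
            }
        }
    }

mce = build_mce(
    "urn:li:dataset:(urn:li:dataPlatform:hdfs,customer_a.daily_feed,PROD)",
    {"parserVersion": "2.3.1", "owningCustomer": "customer-a"},
)
print(json.dumps(mce, indent=2))
```

Once such events are ingested, search onboarding (per the second link above) is what makes the new fields queryable.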
n
Thank you for your suggestions. Now I can see how to use DataHub to model my use case. I still have two points for which I'd need your support: • How would you model a discontinuous time series? An aspect holding an array of timespans?
• I've tried to create a new aspect and extend the DatasetAspect ref list. When I build the project, the gms module fails. I tried to extend the file com.linkedin.dataset.datasets.snapshot.json, adding the new aspect, and added the related import in gms/impl/src/main/java/com/linkedin/dataset/rest/resources/Datasets.java, but I still get a failure:
Task gmsimpl:checkRestModel FAILED
Exception in thread "main" java.lang.IllegalArgumentException: 1,234: "com.linkedin.dataset.SupportedAnonymizationZone" cannot be resolved.
Did I miss a step?
b
Please take a look at this doc on how to add a new aspect. You shouldn't need to modify com.linkedin.dataset.datasets.snapshot.json manually, as it's generated as part of the build.
As for modeling discontinuous time series, an array is certainly one option, assuming there isn't a large amount of data.
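A minimal sketch of that array-of-timespans idea, assuming an aspect whose only field is a list of [start, end) intervals (the class names are hypothetical, not DataHub types):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class TimeSpan:
    start: int  # epoch millis, inclusive
    end: int    # epoch millis, exclusive

@dataclass
class AvailabilityAspect:
    """Hypothetical aspect: the intervals during which data exists,
    modeling a discontinuous time series as an array of timespans."""
    spans: List[TimeSpan]

    def covers(self, t: int) -> bool:
        """True if instant t falls inside any recorded span."""
        return any(s.start <= t < s.end for s in self.spans)

# Two disjoint intervals with a gap in between.
avail = AvailabilityAspect(spans=[TimeSpan(0, 100), TimeSpan(200, 300)])
```

If the number of spans grows large, splitting them across multiple versioned aspects or coarser summary records may work better than one unbounded array.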