Currently a schema is just an aspect of a dataset...
# advice-metadata-modeling
g
Currently a schema is just an aspect of a dataset. We’d like a schema to be a top-level, standalone entity, independent of datasets, so that DataHub effectively acts as a schema registry alongside all its other great capabilities. And if a dataset did get created based on the same schema, we’d want it to reference that schema. Is there a way to achieve this today? Or is this a feature request I should submit? As a short-term workaround we could just expose Confluent Schema Registry to those users who need standalone schema support, but really we’d like to see this integrated into DataHub so we can get all the other nice stuff like lineage, glossary, tags, owners, etc.
m
Hey @great-toddler-2251 would be great to have an RFC around this.
I just raised a draft PR to get the discussion started: https://github.com/datahub-project/datahub/pull/5976
Please add in suggestions for requirements that we should be taking into account
g
@mammoth-bear-12532 thanks for creating the RFC.
what is the status of this? where does it go from here?
m
Planning to raise an implementation PR next week that shows an example model
g
@mammoth-bear-12532 any update on having Schema as an Entity?
m
Hi Ray, it's been on our backlog for some time now. With the new energy around Data Products, I think we can get both done together. We will move on this in Feb for sure.
g
Thanks @mammoth-bear-12532 for the update. Right now my priority would be the Schema first, partly because I think it will be easier to deliver. The reason I pinged you is that we’re planning to start work on saving schemas into DataHub in late Feb or early March, so I’m trying to figure out the best approach: until schema as an entity is in DataHub, do I create it as an empty dataset, or do I wait until schemas are available?
m
Yes, you can definitely get started by registering it as a Dataset with subType: schema.
Quick question: what schema format are you planning to register into DataHub?
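For anyone following along, here is a rough sketch of what that workaround might look like as a metadata change proposal: a Dataset URN plus a `subTypes` aspect. It uses only the standard library. The platform `kafka`, the dataset name `orders-value`, and the helper names `dataset_urn` and `subtype_proposal` are made up for illustration, and the payload shape is an assumption that should be checked against your DataHub version. In practice the DataHub Python SDK would build and emit this for you.

```python
import json


def dataset_urn(platform: str, name: str, env: str = "PROD") -> str:
    """Build a DataHub-style dataset URN (hypothetical local helper)."""
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


def subtype_proposal(urn: str, subtype: str = "schema") -> dict:
    """Describe an upsert of the subTypes aspect for one dataset.

    The dict layout is an assumed proposal shape, not a verified API contract.
    """
    return {
        "entityType": "dataset",
        "entityUrn": urn,
        "changeType": "UPSERT",
        "aspectName": "subTypes",
        "aspect": {
            "contentType": "application/json",
            # The aspect body is carried as a JSON string.
            "value": json.dumps({"typeNames": [subtype]}),
        },
    }


# Hypothetical platform and dataset name, for illustration only.
urn = dataset_urn("kafka", "orders-value")
proposal = subtype_proposal(urn)
print(json.dumps(proposal, indent=2))
```

The dataset itself can stay otherwise empty. The subtype is what marks it as a standalone schema until a dedicated entity exists.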
g
the format depends on what you might support 🙂 We’re thinking OpenAPI 3.x because it imposes some structure like the ‘info’ block and a schemas section. But if there was a way to get/put any of OpenAPI 3.x, JSON Schema, Protobuf, Avro (last 3 obviously supported by Confluent Schema Registry) that would also work for me. Happy to discuss details directly
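To make the OpenAPI idea concrete, here is a minimal sketch of the structure being described: an `info` block plus a schemas section, which in OpenAPI 3.x sits under `components.schemas`. The `Order` schema, its fields, and the helper `list_schema_fields` are invented for illustration. Nothing here is DataHub-specific.

```python
from typing import Any

# A minimal OpenAPI 3.x document held as a plain dict (example content is made up).
openapi_doc: dict[str, Any] = {
    "openapi": "3.0.3",
    "info": {"title": "Order Events", "version": "1.0.0"},
    "paths": {},
    "components": {
        "schemas": {
            "Order": {
                "type": "object",
                "required": ["id"],
                "properties": {
                    "id": {"type": "string"},
                    "total": {"type": "number"},
                },
            }
        }
    },
}


def list_schema_fields(doc: dict[str, Any]) -> dict[str, list[str]]:
    """Map each named schema to its sorted top-level property names."""
    schemas = doc.get("components", {}).get("schemas", {})
    return {name: sorted(body.get("properties", {})) for name, body in schemas.items()}


fields = list_schema_fields(openapi_doc)
print(openapi_doc["info"]["title"], fields)
```

The `info` block gives each registered schema a title and version for free, which is the structure I was getting at.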
@mammoth-bear-12532 is there any update on this? I’m trying to find the feature request to track but can’t find it. If you can point me to it that would be great. Thanks!