# getting-started
colossal-sandwich-50049
Hi everyone, beginner question: can DataHub function as a schema registry in a production environment? I.e., does it have feature parity with something like Confluent Schema Registry?
square-activity-64562
It is not intended to be used as just a schema registry. Please see https://datahubproject.io/docs/features for details
colossal-sandwich-50049
Thanks @square-activity-64562; I meant more in terms of parity/functionality with something like Confluent Schema Registry. One can store schemas in DataHub, but Confluent Schema Registry does that and also provides many utilities on top (e.g., easy integration with a Kafka producer and/or serializer). Do any packages exist within the DataHub ecosystem to assist when producing/consuming data?
i
No such feature currently exists. That does not mean one could not exist in the future, but it would be extreme overkill. DataHub is much, much more than a registry of schemas; its scope is much larger and targets a different use case than Confluent's Schema Registry. It is theoretically possible, but the system was not designed for low-latency use cases like Kafka's schema registry. Do you really want to deploy a multi-component, distributed system, including a relational database, Elasticsearch, Kafka, and all the DataHub components, just to serve as a schema registry for producers/consumers?
That said, DataHub is a great component to have in your data stack if it becomes a central part of your data ecosystem, where data producers emit metadata to DataHub and administrative tools consume metadata from it, e.g. notifications when pipelines fail, datasets don't meet certain expectations, etc.
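To make the producer side concrete, here is a minimal sketch using the acryl-datahub Python emitter; the server URL, platform, topic name, and description are placeholders, and the API surface varies slightly by version:

```python
# Minimal sketch: a data producer emitting metadata to DataHub, assuming a GMS
# instance at http://localhost:8080 and the acryl-datahub package installed.
from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import DatasetPropertiesClass

emitter = DatahubRestEmitter(gms_server="http://localhost:8080")

# Hypothetical Kafka-backed dataset; the URN parts are placeholders.
mcp = MetadataChangeProposalWrapper(
    entityUrn=make_dataset_urn(platform="kafka", name="my_topic", env="PROD"),
    aspect=DatasetPropertiesClass(description="Orders topic owned by team X"),
)
emitter.emit(mcp)
```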
colossal-sandwich-50049
I don't want DataHub as just a schema registry, but I was curious whether it could fulfill this use case in addition to the other needs it meets
cc: @great-toddler-2251
mammoth-bear-12532
Hey @colossal-sandwich-50049, this could work, but it would of course need wrapper libraries that perform schema registration and retrieval in the codecs.
We don't have any currently, but would welcome contributions in this area
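For illustration, the retrieval side of such a wrapper could look roughly like this: a hypothetical sketch that fetches a topic's schemaMetadata aspect from DataHub GMS and Avro-encodes records with it. The URN is a placeholder, and the response shape (which assumes the ingested platform schema is a KafkaSchema whose documentSchema field holds the raw Avro schema) varies by DataHub version:

```python
# Hypothetical sketch of a schema-retrieval codec. Assumptions: GMS at
# http://localhost:8080, a Kafka dataset whose KafkaSchema.documentSchema
# carries the raw Avro schema, and the requests + fastavro libraries.
import io
import json
import urllib.parse

import fastavro
import requests

GMS = "http://localhost:8080"
URN = "urn:li:dataset:(urn:li:dataPlatform:kafka,my_topic,PROD)"  # placeholder

def fetch_raw_avro_schema(urn: str) -> dict:
    """Read the latest schemaMetadata aspect and return the parsed Avro schema."""
    resp = requests.get(
        f"{GMS}/aspects/{urllib.parse.quote(urn, safe='')}",
        params={"aspect": "schemaMetadata", "version": 0},
    )
    resp.raise_for_status()
    aspect = resp.json()["aspect"]["com.linkedin.schema.SchemaMetadata"]
    raw = aspect["platformSchema"]["com.linkedin.schema.KafkaSchema"]["documentSchema"]
    return json.loads(raw)

def encode(record: dict, schema: dict) -> bytes:
    """Avro-encode one record; the bytes can then go to a Kafka producer."""
    buf = io.BytesIO()
    fastavro.schemaless_writer(buf, fastavro.parse_schema(schema), record)
    return buf.getvalue()
```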
colossal-sandwich-50049
Thanks @mammoth-bear-12532, would be keen to contribute if we build something like that in-house
great-toddler-2251
Thanks all; to provide a little context, we're looking at DataHub to provide a key piece of a Data Mesh self-service platform, covering the discoverability of Data Products. Data Products have both a semantic schema and one or more syntactic schemas. The former is general and not (necessarily) expressed in a specific format like Protobuf, Avro, GraphQL, or SQL; the latter is one or more of those formats. A single Data Product can have multi-modal outputs, so both Avro and GraphQL, for example.

We're exploring options at this point. DataHub could hold the semantic schema, and we would build support for generating the syntactic schemas from it. This reduces the burden on Data Product developers, since they only need to know about and define one schema (format): the semantic one. Alternatively, we could push it to the Data Products, which would then need to define and provide the schemas for each output type they support. But the key is that it's not tied to a data*set* but to a Data Product, which could have many datasets (or none, or an unbounded one in a streaming case). Hope that helps. 🙂
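To sketch the first option, here is a minimal, hypothetical example of rendering a format-neutral semantic schema into one syntactic output (Avro); the SemanticField shape and the type mapping are assumptions for illustration, not a DataHub API:

```python
# Hypothetical sketch: render a format-neutral "semantic" field list into an
# Avro schema, one of several syntactic outputs a Data Product might expose.
import json
from dataclasses import dataclass

# Illustrative semantic type -> Avro primitive mapping (an assumption).
AVRO_TYPES = {"text": "string", "integer": "long", "decimal": "double", "flag": "boolean"}

@dataclass
class SemanticField:
    name: str
    semantic_type: str   # e.g. "text", "integer"
    required: bool = True

def to_avro_schema(record_name: str, fields: list[SemanticField]) -> str:
    """Generate an Avro record schema; optional fields become nullable unions."""
    avro_fields = []
    for f in fields:
        t = AVRO_TYPES[f.semantic_type]
        avro_fields.append({
            "name": f.name,
            "type": t if f.required else ["null", t],
            **({} if f.required else {"default": None}),
        })
    return json.dumps({"type": "record", "name": record_name, "fields": avro_fields})

# e.g. to_avro_schema("Order", [SemanticField("id", "text"),
#                               SemanticField("amount", "decimal", required=False)])
```

A GraphQL or SQL generator would follow the same pattern, which is what lets developers define only the semantic schema.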