# getting-started
colossal-sandwich-50049
Hi everyone, beginner question: can DataHub function as a schema registry in a production environment? I.e., does it have feature parity with something like Confluent Schema Registry?
square-activity-64562
It is not intended to be used as just a schema registry. Please see https://datahubproject.io/docs/features for details
colossal-sandwich-50049
Thanks @square-activity-64562; I meant more in terms of parity/functionality with something like Confluent Schema Registry. One can store schemas in DataHub, but Confluent Schema Registry does that and also provides many utilities on top (e.g., easy integration with a Kafka producer and/or serializer). Do any packages exist within the DataHub ecosystem to assist when producing/consuming data?
i
No such feature currently exists. That does not mean one could not exist in the future, but it would be extreme overkill. DataHub is much, much more than a registry of schemas; its scope is much larger and targets a different use case than Confluent's Schema Registry. It is theoretically possible, but the system was not designed for low-latency use cases like Kafka's schema registry. Do you really want to deploy a multi-component, distributed system, including a relational database, Elasticsearch, Kafka, and all the DataHub components, just to serve as a schema registry for producers/consumers?
That said, DataHub is a great component to have in your data stack if it becomes a central part of your data ecosystem, where data producers emit metadata to DataHub and administrative tools consume metadata from it, e.g. notifications when pipelines fail, datasets don't meet certain expectations, etc.
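To make the producer side concrete, here is a minimal sketch using the acryl-datahub Python emitter; the server URL, platform, topic name, and description are placeholders, and the API surface varies slightly by version:

```python
# Minimal sketch: a data producer emitting metadata to DataHub, assuming a GMS
# instance at http://localhost:8080 and the acryl-datahub package installed.
from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import DatasetPropertiesClass

emitter = DatahubRestEmitter(gms_server="http://localhost:8080")

# Hypothetical Kafka-backed dataset; the URN parts are placeholders.
mcp = MetadataChangeProposalWrapper(
    entityUrn=make_dataset_urn(platform="kafka", name="my_topic", env="PROD"),
    aspect=DatasetPropertiesClass(description="Orders topic owned by team X"),
)
emitter.emit(mcp)
```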
colossal-sandwich-50049
I don't want DataHub as just a schema registry, but I was curious whether it could fulfill this use case in addition to the other needs it meets
cc: @great-toddler-2251
mammoth-bear-12532
Hey @colossal-sandwich-50049, this could work, but it would of course need wrapper libraries that perform schema registration and retrieval in the codecs.
We don't have any currently, but would welcome contributions in this area
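For illustration, the retrieval side of such a wrapper could look roughly like this: a hypothetical sketch that fetches a topic's schemaMetadata aspect from DataHub GMS and Avro-encodes records with it. The URN is a placeholder, and the response shape (which assumes the ingested platform schema is a KafkaSchema whose documentSchema field holds the raw Avro schema) varies by DataHub version:

```python
# Hypothetical sketch of a schema-retrieval codec. Assumptions: GMS at
# http://localhost:8080, a Kafka dataset whose KafkaSchema.documentSchema
# carries the raw Avro schema, and the requests + fastavro libraries.
import io
import json
import urllib.parse

import fastavro
import requests

GMS = "http://localhost:8080"
URN = "urn:li:dataset:(urn:li:dataPlatform:kafka,my_topic,PROD)"  # placeholder

def fetch_raw_avro_schema(urn: str) -> dict:
    """Read the latest schemaMetadata aspect and return the parsed Avro schema."""
    resp = requests.get(
        f"{GMS}/aspects/{urllib.parse.quote(urn, safe='')}",
        params={"aspect": "schemaMetadata", "version": 0},
    )
    resp.raise_for_status()
    aspect = resp.json()["aspect"]["com.linkedin.schema.SchemaMetadata"]
    raw = aspect["platformSchema"]["com.linkedin.schema.KafkaSchema"]["documentSchema"]
    return json.loads(raw)

def encode(record: dict, schema: dict) -> bytes:
    """Avro-encode one record; the bytes can then go to a Kafka producer."""
    buf = io.BytesIO()
    fastavro.schemaless_writer(buf, fastavro.parse_schema(schema), record)
    return buf.getvalue()
```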
colossal-sandwich-50049
Thanks @mammoth-bear-12532, would be keen to contribute if we build something like that in-house
great-toddler-2251
Thanks all; to provide a little context, we're looking at DataHub to provide a key piece of a Data Mesh self-service platform, covering the discoverability of Data Products. Data Products have both a semantic schema and one or more syntactic schemas. The former is general and not (necessarily) expressed in a specific format like Protobuf, Avro, GraphQL, or SQL; the latter is one or more of those formats. A single Data Product can have multi-modal outputs, so both Avro and GraphQL, for example.

We're exploring options at this point. DataHub could hold the semantic schema, and we would build support for generating the syntactic schemas from it. This reduces the burden on Data Product developers, since they only need to know about and define one schema (format): the semantic one. Alternatively, we could push it to the Data Products, which would then need to define and provide the schemas for each output type they support. But the key is that it's not tied to a data*set* but to a Data Product, which could have many datasets (or none, or an unbounded one in a streaming case). Hope that helps. 🙂
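To sketch the first option, here is a minimal, hypothetical example of rendering a format-neutral semantic schema into one syntactic output (Avro); the SemanticField shape and the type mapping are assumptions for illustration, not a DataHub API:

```python
# Hypothetical sketch: render a format-neutral "semantic" field list into an
# Avro schema, one of several syntactic outputs a Data Product might expose.
import json
from dataclasses import dataclass

# Illustrative semantic type -> Avro primitive mapping (an assumption).
AVRO_TYPES = {"text": "string", "integer": "long", "decimal": "double", "flag": "boolean"}

@dataclass
class SemanticField:
    name: str
    semantic_type: str   # e.g. "text", "integer"
    required: bool = True

def to_avro_schema(record_name: str, fields: list[SemanticField]) -> str:
    """Generate an Avro record schema; optional fields become nullable unions."""
    avro_fields = []
    for f in fields:
        t = AVRO_TYPES[f.semantic_type]
        avro_fields.append({
            "name": f.name,
            "type": t if f.required else ["null", t],
            **({} if f.required else {"default": None}),
        })
    return json.dumps({"type": "record", "name": record_name, "fields": avro_fields})

# e.g. to_avro_schema("Order", [SemanticField("id", "text"),
#                               SemanticField("amount", "decimal", required=False)])
```

A GraphQL or SQL generator would follow the same pattern, which is what lets developers define only the semantic schema.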