Hi, is it possible to create an "empty" dataset, a...
# getting-started
d
Hi, is it possible to create an "empty" dataset, and then afterwards use Kafka schema-registry ingestion to add a schema to this dataset?
My use case is that I want to "create" a data product before there is any actual data or schema in the schema registry. In this creation process I will add the topic and ACL rules, create certificates for the users and, most importantly, add business and technical metadata to the data product. I'm trying to avoid this becoming a multi-stage process that requires human intervention.
b
You can pre-create an empty dataset, but you need to know which URN the ingestion will map to (you can't force the ingestion process to use a custom URN, unless you want to tweak the code). There is a particular pattern to the URN, though, once you understand how it's formed.
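To illustrate the pattern, here is a minimal sketch of how a Kafka dataset URN is usually laid out: platform, then the topic name (prefixed with the platform instance if the ingestion recipe sets one), then the environment. The topic name `orders` and the instance name `cluster-a` are hypothetical, and it is worth comparing the result with a URN that your own ingestion run actually produced. The `datahub` Python package also ships a `make_dataset_urn` helper in `datahub.emitter.mce_builder` for the same purpose.

```python
from typing import Optional


def kafka_dataset_urn(
    topic: str,
    env: str = "PROD",
    platform_instance: Optional[str] = None,
) -> str:
    """Build the dataset URN that Kafka ingestion is expected to map a topic to.

    Assumption: the recipe uses the default environment (PROD) unless told
    otherwise, and a platform instance, if set, is prefixed to the topic name.
    """
    name = f"{platform_instance}.{topic}" if platform_instance else topic
    return f"urn:li:dataset:(urn:li:dataPlatform:kafka,{name},{env})"


# Hypothetical topic and instance names, for illustration only.
plain_urn = kafka_dataset_urn("orders")
instance_urn = kafka_dataset_urn("orders", platform_instance="cluster-a")
```

If the pre-created URN and the one generated by ingestion differ in any part (instance prefix, environment, casing), you end up with two separate datasets instead of one enriched dataset.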
d
@better-orange-49102 ok thanks 🙂
f
I can confirm that it probably works. Like @better-orange-49102 said above, I can push any "empty" entities, even the whole pipeline, from the emitter. Then the ingestion job will enrich them with the metadata.
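As a rough sketch of what pushing an "empty" dataset could look like without any schema, the snippet below builds a metadata change proposal carrying only a `datasetProperties` aspect and posts it to the GMS `ingestProposal` endpoint. The payload shape and the endpoint path are written from memory of the Rest.li API docs, so treat them as assumptions and verify against your DataHub version. The URN, the GMS address, the description and the custom properties are all hypothetical. The Python SDK's REST emitter does the same job with typed classes.

```python
import json
import urllib.request


def build_dataset_properties_proposal(dataset_urn, description, custom_properties):
    """Build an UPSERT proposal for the datasetProperties aspect (no schema)."""
    aspect = {
        "description": description,
        # customProperties is a string-to-string map.
        "customProperties": custom_properties,
    }
    return {
        "proposal": {
            "entityType": "dataset",
            "entityUrn": dataset_urn,
            "changeType": "UPSERT",
            "aspectName": "datasetProperties",
            # The aspect itself is sent as a JSON-encoded string.
            "aspect": {
                "value": json.dumps(aspect),
                "contentType": "application/json",
            },
        }
    }


def send_proposal(gms_url, payload):
    """POST the proposal to GMS. Assumes an unauthenticated GMS endpoint."""
    request = urllib.request.Request(
        f"{gms_url}/aspects?action=ingestProposal",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "X-RestLi-Protocol-Version": "2.0.0",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status


# Hypothetical values, for illustration only.
payload = build_dataset_properties_proposal(
    "urn:li:dataset:(urn:li:dataPlatform:kafka,orders,PROD)",
    "Order events, registered before any data exists",
    {"owner_team": "sales", "data_product": "orders"},
)
# send_proposal("http://localhost:8080", payload)  # uncomment against a real GMS
```

Because the proposal only touches `datasetProperties`, a later Kafka ingestion run can add the schema aspect to the same URN.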
d
I'm struggling a bit to decide what the best course of action is regarding metadata ingestion. My first thought was to use our existing CI/CD pipelines and just keep a YAML file in git. At first this felt like a very good IaC-style solution, but this weekend I realized that it does not make the metadata dynamic or flexible: a metadata change would require a git commit. To mitigate this I thought that users should be able to enrich or change metadata in DataHub afterwards, but that would leave us in a situation where the git metadata goes out of date very fast. Bleh.
And it would make the ingestion process very complex, as it would need to merge content between the git YAML file and the existing metadata in DataHub, since users would probably make changes on both ends ..... 🙂
a
For the empty dataset creation, do you need to define a custom source for this, or can you just use an existing source with an empty table?
Is there a way to create a dataset with an API?