Hello, Is there any easy way of ingesting a protob...
# ingestion
m
Hello, is there an easy way of ingesting a protobuf schema? There is code in the datahub repo to read and ingest Avro schemas (https://github.com/linkedin/datahub/blob/c8a3e6820204bf8c59ce4afae3d2b5d9dfdc0b71/[…]tadata-ingestion/src/datahub/ingestion/extractor/schema_util.py), so I'm wondering if there is something similar for protobuf out there ...
g
Hey there @many-guitar-67205! One community member @gentle-night-56466 is actually working on implementing this 🙂
(but it does not yet exist)
g
Yeah, I am working on it, Java-based though. The module itself can be used independently of the main repo; see the PR and README. LMK if this might be useful to you and if there is something that might be missing for your protobuf use cases.
m
Let's see if I understand this correctly:
• the idea is to extract the dataset metadata from both the `.proto` file and the `FileDescriptorSet` (generated as described here)
• this is a step you would typically add to a deployment pipeline

A few things come to mind:
• the schema registry only has the textual representation (.proto), so this would not work in a self-discovery way.
• generating the .protoc files is an extra step in the build process (not all projects use protoc, e.g. in Scala you would typically use ScalaPB)
• although I like the idea of adding metadata in the .proto, I'm a bit worried about extending the message itself for these, as this means (part of) the metadata ends up in the message itself. When using pb in high-volume/high-throughput scenarios, this is not desired.

For my current use case though, the work you have done looks solid. I'm investigating what the possibilities are with DataHub. We have Kafka topics that use pb, so I'll start experimenting with your branch.
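For reference, the extra build step mentioned above could look roughly like the sketch below. It only assembles a `protoc` command line that writes a binary `FileDescriptorSet`; the flags are standard `protoc` options, but the helper name `build_protoc_command` and all file paths are made up for illustration and are not taken from the PR or its README.

```python
import shlex
from typing import List, Sequence


def build_protoc_command(
    proto_file: str,
    descriptor_out: str,
    proto_paths: Sequence[str] = (".",),
) -> List[str]:
    """Assemble a protoc call that writes a binary FileDescriptorSet.

    Hypothetical helper: the name and arguments are illustrative only.
    """
    cmd = ["protoc"]
    # Directories that protoc searches when resolving imports.
    for path in proto_paths:
        cmd.append(f"--proto_path={path}")
    cmd += [
        # Bundle imported .proto files into the same descriptor set.
        "--include_imports",
        # Keep SourceCodeInfo, which is where comments from the .proto live.
        "--include_source_info",
        f"--descriptor_set_out={descriptor_out}",
        proto_file,
    ]
    return cmd


if __name__ == "__main__":
    # Paths below are placeholders, not real project files.
    cmd = build_protoc_command(
        "events/order.proto", "build/order.dsc", ["src/main/proto"]
    )
    print(shlex.join(cmd))
    # To actually run it (requires protoc on PATH):
    # import subprocess; subprocess.run(cmd, check=True)
```

In a deployment pipeline the resulting descriptor file would then be handed to whatever ingestion step is used, alongside the original `.proto`.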
g
Yes, auto-discovery would require extracting the textual proto schema from a registry and compiling it before DataHub ingest. If the CI/CD pipeline is JVM-based, then the protoc file could be generated with this Maven plugin. Of course it does add an additional step outside of ScalaPB. ScalaPB mentions maintaining compatibility with the Google compiler, so it may actually generate the binary protoc file as a temporary file. I certainly see the same kinds of options, like `retain_source_code_info`. The messages do not contain the metadata fields/values. Protobuf uses protobuf to describe the schema syntax, which is why it looks odd.
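A rough sketch of the "extract from a registry" half of that auto-discovery idea, assuming a Confluent-style Schema Registry (the thread does not say which registry is in use). It relies on the `GET /subjects/{subject}/versions/latest` endpoint, whose JSON response carries the schema text in `schema` and its kind in `schemaType`. The function names, registry URL and subject name are hypothetical.

```python
import json
import urllib.parse
import urllib.request
from pathlib import Path


def latest_schema_url(registry_url: str, subject: str) -> str:
    """URL of the latest registered version for a subject."""
    quoted = urllib.parse.quote(subject, safe="")
    return f"{registry_url.rstrip('/')}/subjects/{quoted}/versions/latest"


def extract_proto_source(response_body: str) -> str:
    """Pull the textual .proto out of a registry response body."""
    payload = json.loads(response_body)
    # The registry omits schemaType for Avro, so treat a missing value as AVRO.
    schema_type = payload.get("schemaType", "AVRO")
    if schema_type != "PROTOBUF":
        raise ValueError(f"expected a PROTOBUF schema, got {schema_type}")
    return payload["schema"]


def save_latest_proto(registry_url: str, subject: str, out_dir: str) -> Path:
    """Download the latest schema and write it as <subject>.proto."""
    with urllib.request.urlopen(latest_schema_url(registry_url, subject)) as resp:
        proto_source = extract_proto_source(resp.read().decode("utf-8"))
    out_path = Path(out_dir) / f"{subject}.proto"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(proto_source, encoding="utf-8")
    return out_path


if __name__ == "__main__":
    # Offline demo with a hand-written response; no registry is contacted.
    sample = json.dumps({
        "subject": "orders-value",
        "version": 1,
        "id": 7,
        "schemaType": "PROTOBUF",
        "schema": 'syntax = "proto3"; message Order { string id = 1; }',
    })
    print(extract_proto_source(sample))
```

The saved `.proto` would still have to be compiled with protoc before ingest, which is the additional step discussed above.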