# ingestion
p
Hey team, I’m able to ingest metadata from several data sources (Kafka, Hive, etc.). They are ingested through a pull-based approach, and I define the ingestion pod as a Kubernetes Job. If I want to run this operation periodically, I can easily convert the Job to a CronJob and set a proper schedule (e.g., every day). I’m just wondering whether we can use a push-based approach instead of pull-based. If so, how can we do that with the current architecture? I think this question is answered here. As far as I understand, there are no out-of-the-box solutions except Airflow. If we want to ingest metadata from Hive, Kafka, Superset, or Looker, do we need to pull the metadata periodically? Please correct me if I’m wrong.
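For reference, the Job-to-CronJob conversion mentioned above is mostly a matter of wrapping the Job's pod template in a schedule. A minimal sketch (the image name and recipe path are placeholders, and the `apiVersion` may be `batch/v1beta1` on older clusters):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: metadata-ingestion
spec:
  schedule: "0 0 * * *"              # every day at midnight
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: ingest
              image: my-ingestion-image:latest            # placeholder image
              args: ["ingest", "-c", "/etc/recipes/kafka.yml"]  # placeholder recipe path
```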
l
Hello Seref. Each of these systems will need integration work to enable push. @green-football-43791 has looked into push-based integration for Superset - we just haven't had time to get to it.
For Kafka, we're investigating what it would take to do push-based ingestion from the schema registry.
For Hive, Hive hooks allow you to do push-based emission. We have yet to schedule this work.
In all cases, it's a matter of finding the right integration point to drop in the DataHub emitter. Would love contributions from you as well 🙂
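A minimal sketch of what "dropping in the DataHub emitter" at an integration point could look like. Everything here is a hypothetical stub for illustration — `Emitter`, `MetadataChangeEvent`, and the hook signature are not DataHub's actual API; a real integration would use DataHub's own Python emitter to send the event to the metadata service:

```python
# Hypothetical sketch: pushing a metadata change event from a source system's hook.
# All names below are illustrative stubs, not DataHub's real classes.

from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class MetadataChangeEvent:
    """Simplified stand-in for an MCE payload."""
    entity_urn: str
    aspects: Dict[str, object] = field(default_factory=dict)


class Emitter:
    """Stub emitter; a real one would POST the event to DataHub (or produce to Kafka)."""
    def __init__(self) -> None:
        self.sent: List[MetadataChangeEvent] = []

    def emit(self, event: MetadataChangeEvent) -> None:
        # Real implementation: HTTP POST / Kafka produce instead of appending.
        self.sent.append(event)


def on_table_created(emitter: Emitter, platform: str, table: str,
                     schema: Dict[str, str]) -> None:
    """Integration point: called by the source system's hook when a table is created."""
    urn = f"urn:li:dataset:(urn:li:dataPlatform:{platform},{table},PROD)"
    emitter.emit(MetadataChangeEvent(entity_urn=urn,
                                     aspects={"schemaMetadata": schema}))


emitter = Emitter()
on_table_created(emitter, "hive", "events.page_view",
                 {"user_id": "string", "ts": "long"})
```

The idea is the same for each source system: find the hook that fires on a metadata change (table created, schema registered, dashboard saved) and call the emitter from it.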
p
Thank you very much for the explanation, @loud-island-88694! We are still investigating DataHub, its architecture, and its ecosystem. We (the event-tracking team @ Udemy) would love to contribute to this awesome community as much as possible once the ramp-up period is over, as I said at our first meeting 🙂
l
👍
p
Based on this information, maybe we can emit MetadataChangeEvents (MCEs) while registering our event schemas to the Confluent Schema Registry through event-schema-manager if we prefer a push-based approach. @brainy-battery-94512 @some-lighter-73922
l
yes that would work. cc @big-carpet-38439
publishing after registering would work
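The "publish after registering" flow could be sketched roughly as below. Both the registry and the emitter here are in-memory stubs with made-up names, not Confluent's or DataHub's actual APIs; the point is only the ordering (emit the metadata event only after registration succeeds):

```python
# Hypothetical sketch of the register-then-publish flow. Names are illustrative.

from typing import Dict, List, Tuple


class StubSchemaRegistry:
    """In-memory stand-in for the Confluent Schema Registry."""
    def __init__(self) -> None:
        self._schemas: Dict[Tuple[str, str], int] = {}

    def register(self, subject: str, schema: str) -> int:
        key = (subject, schema)
        if key not in self._schemas:
            self._schemas[key] = len(self._schemas) + 1
        return self._schemas[key]


class StubEmitter:
    """Stand-in for a DataHub emitter; records events instead of sending them."""
    def __init__(self) -> None:
        self.events: List[dict] = []

    def emit(self, event: dict) -> None:
        self.events.append(event)


def register_and_publish(registry: StubSchemaRegistry, emitter: StubEmitter,
                         subject: str, schema: str) -> int:
    """Register the schema first; only emit metadata once registration succeeded."""
    schema_id = registry.register(subject, schema)
    emitter.emit({"subject": subject, "schemaId": schema_id})
    return schema_id


registry, emitter = StubSchemaRegistry(), StubEmitter()
sid = register_and_publish(registry, emitter, "page_view-value", '{"type": "record"}')
```

In a tool like event-schema-manager, this would run as the final step of the GitHub-command handler, so every successful registration pushes fresh metadata without any periodic pull.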
p
I just want to clarify that: I left this message for my teammates at Udemy. `event-schema-manager` is an internal tool of ours used for registering Avro schemas through GitHub commands. 🙂
And the webhook/notification requirement is an open issue on the schema-registry repo 🙂 https://github.com/confluentinc/schema-registry/issues/1007. A blog post is suggested in that issue.