# ingestion
h
Hi Team, I am trying to ingest datasets/pipelines from our custom Kafka events into DataHub's Kafka using the Kafka ingestion recipes mentioned in the official docs. Can anyone suggest whether this is the right approach, or do I need to rely on Java/Python emitter code to ingest the metadata by consuming our custom Kafka topics? Or is it possible to ingest the data without implementing any code, just by creating the ingestion recipes?
m
Could you drop in a link to the docs that you are referring to?
h
@mammoth-bear-12532 https://datahubproject.io/docs/generated/ingestion/sources/kafka - This is the doc which I am referring
m
This doc refers to ingesting Kafka metadata (e.g. the names of Kafka topics, their schemas, etc.)
is that what you are trying to do?
t
@mammoth-bear-12532 What we actually want to do is capture live updates to datasets and reflect the timing in the dataset entities in our DataHub instance. Our in-house platform processes custom DAGs and updates datasets. The platform communicates these events internally over Kafka. We want to capture those update messages and apply the timing to the entities in near real time. So we were thinking that we could use a Kafka recipe to grab the messages and then use a transformer to update the entities. (Aware now that ingestion is pull, not push.) FYI, the dataset entities in our DataHub instance are emitted to DataHub with a custom emitter which reads our DAG and dataset schema information (from a separate source). Since the dataset updates are potentially high volume, we wanted to avoid having another custom emitter to manage.
m
Ah got it. Does your company have a favored stream processing framework (Flink, Kafka Streams, Spark, etc.)? The approach that would work best here is to write a streaming job that reads your custom events and transforms them into DataHub events. You can use either the Python classes or the Java classes to accomplish this.
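For reference, here's a minimal sketch of the per-event transform such a streaming job would perform. The custom event field names, the platform name, and the payload shape are all assumptions for illustration; in a real job the resulting payload would be sent with DataHub's Python or Java emitter classes rather than returned as a plain dict.

```python
import json

# Hypothetical custom event shape -- field names are assumptions,
# not anything defined by DataHub.
SAMPLE_EVENT = json.dumps({
    "dataset": "warehouse.orders",
    "updated_at": 1700000000000,  # epoch millis from the in-house platform
})

def make_dataset_urn(platform: str, name: str, env: str = "PROD") -> str:
    """Build a DataHub dataset URN using the documented URN format."""
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"

def transform(raw: str) -> dict:
    """Map one custom update event to a DataHub-shaped payload.

    In a real streaming job (Flink / Kafka Streams / a plain consumer
    loop), this is the per-record transform; the result would then be
    emitted with DataHub's emitter classes, not returned as a dict.
    """
    event = json.loads(raw)
    return {
        "entityUrn": make_dataset_urn("ourplatform", event["dataset"]),
        "lastUpdatedTimestamp": event["updated_at"],
    }

print(transform(SAMPLE_EVENT)["entityUrn"])
# → urn:li:dataset:(urn:li:dataPlatform:ourplatform,warehouse.orders,PROD)
```

The Kafka consume loop and the actual emit call are left out on purpose; the point is that the only custom code you own is this small transform, which keeps the high-volume path simple.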
t
Got it. Thanks very much.
@hallowed-lawyer-5424 We need to change the design. Let’s talk offline.