# getting-started
p
Hi team, I have a use case that I can't find any information on. In our system, data moves from Kafka topics to S3 and then to Snowflake. Can somebody tell us how to draw lineage from Kafka to S3 and then to Snowflake tables? FYI, I was able to ingest metadata from each of these sources independently.
l
Hey there 👋 I'm the DataHub Community Support bot. I'm here to help make sure the community can best support you with your request. Let's double check a few things first: ✅ There's a lot of good information on our docs site: www.datahubproject.io/docs. Have you searched there for a solution? ✅ It's not uncommon that someone has run into your exact problem before in the community. Have you searched Slack for similar issues? Did you find a solution to your issue? ❌ Sorry you weren't able to find a solution. I'm sending you some tips on info you can provide to help the community troubleshoot. Whenever you feel your issue is solved, please react ✅ to your original message to let us know!
m
You will probably have to emit lineage to DataHub programmatically, either by reading the Kafka Connect config to see how each topic is sunk to S3, or by relying on a naming convention (if you have one). For S3 to Snowflake, you can read the Snowflake stages and map buckets to tables.
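To make the naming-convention route concrete, here is a minimal Python sketch. It assumes the Kafka topic, the S3 path and the Snowflake table share one name. The bucket `my-bucket`, the `db.landing` prefix and the `PROD` environment are hypothetical placeholders. The URNs have to match exactly what your ingestion runs produced (Snowflake names are often lowercased), so check them in the UI first. The optional `emit` helper uses `make_lineage_mce` and `DatahubRestEmitter` from the `acryl-datahub` Python package and only imports them when called.

```python
from typing import List, Tuple


def dataset_urn(platform: str, name: str, env: str = "PROD") -> str:
    """Build a DataHub dataset URN string by hand."""
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


def lineage_edges(name: str, bucket: str, db_schema: str) -> List[Tuple[str, str]]:
    """Return (upstream, downstream) URN pairs for one shared name.

    Assumes the topic, the S3 path and the table are all called `name`;
    `bucket` and `db_schema` are hypothetical placeholders.
    """
    kafka = dataset_urn("kafka", name)
    s3 = dataset_urn("s3", f"{bucket}/{name}")
    snowflake = dataset_urn("snowflake", f"{db_schema}.{name}".lower())
    return [(kafka, s3), (s3, snowflake)]


def emit(edges: List[Tuple[str, str]], gms_server: str) -> None:
    """Send each edge to DataHub; requires the acryl-datahub package."""
    from datahub.emitter.mce_builder import make_lineage_mce
    from datahub.emitter.rest_emitter import DatahubRestEmitter

    emitter = DatahubRestEmitter(gms_server=gms_server)
    for upstream, downstream in edges:
        emitter.emit_mce(make_lineage_mce([upstream], downstream))


if __name__ == "__main__":
    for upstream, downstream in lineage_edges("orders", "my-bucket", "db.landing"):
        print(upstream, "->", downstream)
    # emit(edges, "http://localhost:8080")  # hypothetical GMS address
```

If the names differ between systems, the mapping would have to come from the Kafka Connect config or the Snowflake stage definitions rather than from string formatting.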
m
What software are you using to move data between these systems?
p
@modern-artist-55754 we do have a strict naming convention, which ensures that the topic on Kafka, the file name on S3 and the landing table on Snowflake have exactly the same name. Could you guide me with this point in mind?
@mammoth-bear-12532 we use Secor to move data from Kafka topics to S3, and then Snowpipe to move data from S3 to Snowflake landing tables.
m
If you have a strict naming convention, you can create a YAML file that describes the lineage and use file-based lineage ingestion: https://datahubproject.io/docs/generated/ingestion/sources/file-based-lineage/ Or you can use an emitter to emit the lineage independently.
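As a rough sketch of the file-based route, the Python below generates the lineage document from a list of shared names. The `version` / `lineage` / `entity` / `upstream` layout follows the example in the linked docs as I remember it, so please verify the field names there before relying on it. The bucket and `db.landing` prefix are hypothetical. The script writes JSON, which is also valid YAML, to stay within the standard library.

```python
import json
from typing import Dict, List


def entity(platform: str, name: str, env: str = "PROD") -> Dict[str, str]:
    """One dataset reference in the lineage file."""
    return {"name": name, "type": "dataset", "env": env, "platform": platform}


def lineage_document(names: List[str], bucket: str, db_schema: str) -> dict:
    """Build Kafka -> S3 -> Snowflake entries for every shared name.

    `bucket` and `db_schema` are hypothetical placeholders; dataset names
    must match what your ingestion runs registered in DataHub.
    """
    entries = []
    for name in names:
        kafka = entity("kafka", name)
        s3 = entity("s3", f"{bucket}/{name}")
        snowflake = entity("snowflake", f"{db_schema}.{name}")
        # Each entry names a downstream entity and lists its upstreams.
        entries.append({"entity": s3, "upstream": [{"entity": kafka}]})
        entries.append({"entity": snowflake, "upstream": [{"entity": s3}]})
    return {"version": 1, "lineage": entries}


def render(doc: dict) -> str:
    """Serialize as JSON, which a YAML parser should also accept."""
    return json.dumps(doc, indent=2)


if __name__ == "__main__":
    doc = lineage_document(["orders", "payments"], "my-bucket", "db.landing")
    with open("lineage.yml", "w") as fh:
        fh.write(render(doc))
```

The resulting file would then be referenced from a file-based lineage ingestion recipe, and regenerated whenever topics are added.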
m
Cool, you should instrument Secor to emit a DataHub lineage event when it publishes data to S3. That's what we did at LinkedIn with a similar system (Gobblin). Typically these systems have a final publish step, where you can add a hook.