# getting-started
a
I'd like to use DataHub to track downstream consumers of data, as well as upstream producers. Say, for example, I have an Airflow job that writes to an S3 bucket. Later, a cron job reads from that S3 bucket and takes some action (e.g. emails a customer). What's the best way to represent this cron job as a consumer of data? Should it be tracked as a "dataset", even though it doesn't really store data anywhere? Or is it better to use metadata enrichment to write a set of tags to the data source for the S3 bucket saying how the data is used? Thanks for any help. I'm sure this is a common problem, but I think I lack the proper nouns to search for it; I haven't had much luck so far. -Eli
b
@adamant-postman-92176 I would strongly recommend modeling this as a "DataFlow" with a single child "DataJob" for your cron job!
You can then track individual runs of the cron job using the "DataProcessInstance" entity!
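To make the suggestion above concrete, here is a minimal sketch of how the three entities relate through their URNs. It uses only the Python standard library and builds the URN strings by hand, so it is not the DataHub SDK itself. The orchestrator name `cron`, the flow and job ids, and the bucket path are all hypothetical, and the `lineage` dict only mirrors the shape of a DataJob's input/output aspect for illustration.

```python
# Sketch only: builds DataHub-style URN strings with the standard library.
# Every concrete name below (cron, customer_emailer, send_emails,
# my-bucket/exports) is a hypothetical placeholder, not from the thread.

def dataset_urn(platform: str, name: str, env: str = "PROD") -> str:
    """URN for a dataset on a given platform, e.g. an S3 path."""
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"

def data_flow_urn(orchestrator: str, flow_id: str, cluster: str = "prod") -> str:
    """URN for a DataFlow, the parent container of one or more jobs."""
    return f"urn:li:dataFlow:({orchestrator},{flow_id},{cluster})"

def data_job_urn(flow_urn: str, job_id: str) -> str:
    """URN for a DataJob; it embeds the URN of its parent DataFlow."""
    return f"urn:li:dataJob:({flow_urn},{job_id})"

flow = data_flow_urn("cron", "customer_emailer")
job = data_job_urn(flow, "send_emails")
s3_input = dataset_urn("s3", "my-bucket/exports")

# The cron job consumes the S3 dataset and writes no dataset of its own,
# so it appears as a downstream consumer with an empty output list.
lineage = {"dataJob": job, "inputDatasets": [s3_input], "outputDatasets": []}
print(lineage)
```

In this shape the cron job is never a dataset: the S3 bucket stays the dataset, and the job is attached to it as a reader. Actually sending this to DataHub would go through an emitter or connector, which this sketch deliberately leaves out.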
a
Thanks John, I will look into this. Do you know if there is any documentation on these topics? I found an API reference, but if there is something that covers the concepts or a sample to work from, that would be very helpful. I appreciate your help!
b
Typically this understanding lives inside the connectors, but this overview should help a bit: https://datahubproject.io/docs/metadata-modeling/metadata-model/