I'd like to use DataHub to track downstream consumers of data, as well as upstream producers.
Say, for example, I have an Airflow job that writes to an S3 bucket. Later, a cron job reads from that bucket and takes some action (e.g. emails a customer).
What's the best way to represent this cron job as a consumer of the data? Should it be tracked as a "dataset", even though it doesn't really store data anywhere? Or is it better to write a set of tags onto the S3 bucket's data source describing how the data is used?
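In case it helps clarify what I'm after: here's roughly the shape I'm imagining, assuming DataHub's DataJob/DataFlow entities are the right fit for a non-storing consumer. This just builds URN strings by hand to illustrate the model (the job and bucket names are made up); a real integration would presumably emit this via DataHub's Python SDK rather than constructing strings directly.

```python
# Sketch: model the cron job as a DataHub DataJob (a processing entity),
# not a Dataset, with the S3 dataset as its declared input.
# URN shapes follow DataHub's documented conventions as I understand them;
# all names below (bucket path, flow/job ids) are hypothetical placeholders.

def dataset_urn(platform: str, name: str, env: str = "PROD") -> str:
    """URN for a dataset, e.g. an S3 location."""
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"

def datajob_urn(orchestrator: str, flow: str, job: str, env: str = "PROD") -> str:
    """URN for a job (task) inside a flow (pipeline)."""
    flow_urn = f"urn:li:dataFlow:({orchestrator},{flow},{env})"
    return f"urn:li:dataJob:({flow_urn},{job})"

# The S3 bucket the Airflow job writes to and the cron job reads from.
s3_input = dataset_urn("s3", "my-bucket/exports/customers")

# The cron job as a DataJob under a "cron" orchestrator.
cron_job = datajob_urn("cron", "customer_emailer", "send_emails")

# Lineage intent: the cron job consumes the S3 dataset and produces
# no dataset (it just sends email). In the SDK this would be a
# DataJobInputOutput aspect with inputDatasets set and outputs empty.
lineage = {"job": cron_job, "inputDatasets": [s3_input], "outputDatasets": []}
```

Does that match how people usually model side-effecting consumers, or is there a more idiomatic entity for this?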
Thanks for any help. I'm sure this is a common problem, but I think I lack the right vocabulary to search for it effectively; I haven't had much luck so far.
-Eli