Could anyone help me understand stateful ingestion...
# ingestion
a
Could anyone help me understand stateful ingestion and the bucket_duration variable? An initial run of a Snowflake ingestion I'm trying takes about 3 minutes. If I use stateful ingestion and remove the ignore_start_time_lineage: true line, then a rerun takes about 30s. That seems great, but what I understood from the docs is that only lineage changes from the past day will be picked up this way. It would be nice if the past few days were checked in case DataHub went down for a few days. Is there a way to configure checking, for example, the past 3 days? I see there's a bucket_duration variable that's an enum, but what are the accepted values for it? I can't see any documentation for that.
h
Hey @ancient-queen-15575, bucket_duration is not related to lineage; it is only relevant for usage, as described in the docs. Unfortunately, there is no provision at this point to specify a window in terms of "past x days", although that is definitely being considered. If you are missing intermediate lineage for any reason (e.g. DataHub went down for a few days), you can set an absolute start time, e.g. start_time: "2023-04-21T00:00:00Z", and ingest the missing lineage as a one-off activity.
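For reference, a minimal sketch of where that setting would sit in a Snowflake recipe for the one-off backfill run. Only start_time and its value come from the message above. The surrounding layout follows the usual recipe structure, connection settings are left out, and the note about ignore_start_time_lineage is an assumption based on the option's name and the behavior described earlier in this thread.

```yaml
# Source section of a one-off backfill recipe (sketch, not a complete recipe).
source:
  type: snowflake
  config:
    # ... connection settings (account, credentials, warehouse) omitted ...

    # Absolute start of the lineage window, as suggested above.
    start_time: "2023-04-21T00:00:00Z"

    # Assumption: ignore_start_time_lineage must stay unset (or false),
    # otherwise start_time would not be applied to lineage extraction.
```

After the backfill succeeds, the start_time line can be removed again so that regular runs go back to the default window.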
a
Oh sorry, I misunderstood bucket_duration. So does lineage ingestion only ever check the last day when using stateful ingestion? Or would it check back to the time of the last checkpoint?
h
"does lineage ingestion only ever check for the last day when using stateful ingestion?"
That's correct.
Hey @ancient-queen-15575, relative start time support was recently added: https://github.com/datahub-project/datahub/commit/ddcd5109dcbe01aac28347cf34221d65cb5faa30. It has been merged into master and should be available in the next release. Stay tuned.
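Once a release with that change is installed, the "past 3 days" case from the original question could probably be written as below. This is a hedged sketch: the relative value syntax ("-3 days") and the bucket_duration values (DAY or HOUR) are recalled from the time-window config documentation, not stated in this thread, so please verify them against the docs for your installed version.

```yaml
# Source section of a recipe using a relative lineage window (sketch).
source:
  type: snowflake
  config:
    # ... connection settings omitted ...

    # Assumption: relative values are resolved with respect to end_time
    # (which defaults to the current time), so this covers the past 3 days.
    start_time: "-3 days"

    # Assumption: only affects usage aggregation, not lineage;
    # the accepted enum values are believed to be DAY and HOUR.
    bucket_duration: DAY

    stateful_ingestion:
      enabled: true
```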