# contributing-to-airbyte
d
Chris and I feel we are configured to retry excessively today - we do 3 attempts with 3 runs in each attempt - that's 9 'tries' in total. This is not good for rate limits. Errors also take a while to surface. Thinking of pushing this down to 3 attempts of 1 run each. Thoughts?
u
davin for president
u
maybe some users would love the ability to tweak that themselves too!
d
actually.. a user contributed this a while back 😄
u
shoutout to @Vladimir Remar!
u
Having the separate concepts of attempts/retries-per-attempt is a bit confusing.
u
Fully support 1 retry per attempt and tuning the number of attempts
u
☝️ I was about to ask the same thing. is there any difference between e.g. 1 attempt with 3 retries and 3 attempts with 1 retry each?
u
I'm going to rename retries -> runs per attempt to make this clearer. Mauro, the waiting mechanism is different. Runs within an attempt are meant to deal with transient failure, so there is almost no waiting. Attempts are meant to deal with actual failure so there is some naive wait time now.
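To make the distinction concrete, here's a rough sketch of how the two knobs nest today (illustrative Python, not the actual scheduler code):

```python
import time


def run_job(job, max_attempts=3, runs_per_attempt=3, wait_between_attempts_s=30):
    """Sketch of the current behaviour: 3 attempts x 3 runs = 9 tries total.

    Runs within an attempt handle transient failure, so there is almost no
    waiting between them. Attempts handle actual failure, so there is a
    naive wait between them. Names and numbers are illustrative only.
    """
    for attempt in range(max_attempts):
        for _run in range(runs_per_attempt):
            try:
                return job()  # first success ends the whole thing
            except Exception:
                pass  # immediate re-run within the same attempt, no waiting
        if attempt < max_attempts - 1:
            time.sleep(wait_between_attempts_s)  # naive wait before the next attempt
    raise RuntimeError("job failed after all attempts")
```

The proposal above is effectively `runs_per_attempt=1`, which leaves only the attempt loop.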
d
Most transient failures should be retried within connectors anyway.
u
Although I don't think it's very useful today - I'm playing around with the idea of getting rid of runs within an attempt entirely. Wdyt @Jared Rhizor (Airbyte) @charles (original folks on this)?
u
I think transient retry logic should live within connectors (and generally does today, at least for API connectors). I think removing the concept would be a nice simplification generally.
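For API connectors that usually looks like plain exponential backoff around the HTTP call, e.g. (generic sketch, not the actual CDK implementation):

```python
import time

import requests

TRANSIENT_STATUS_CODES = {429, 500, 502, 503, 504}


def get_with_backoff(url, max_retries=5, base_delay_s=1.0):
    """Generic connector-side retry for transient HTTP failures."""
    for retry in range(max_retries + 1):
        response = requests.get(url)
        if response.status_code not in TRANSIENT_STATUS_CODES:
            response.raise_for_status()  # surface non-transient errors immediately
            return response.json()
        if retry < max_retries:
            time.sleep(base_delay_s * 2 ** retry)  # 1s, 2s, 4s, ...
    raise RuntimeError(f"gave up after {max_retries} retries ({response.status_code})")
```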
u
Great. I'll wait for Charles to chime in. This can be the last thing I do before PTO
u
@s do you agree with the statement that transient retries are something connectors should be responsible for?
u
Are there any sources that have a transient failure but work if you retry within 1 second? That seems like a theoretical scenario that rarely turns up in actual practice. Immediate retries have not shown much value in my past production systems. With Airflow DAGs, I usually shut off automatic retries entirely and only turn them on very selectively for jobs we know get value from them. This definitely should have some user tuning available, globally and per-job. (Sidenote: I'm working on a JSON spec abstraction that would make it easier to do shared configuration options across connectors. This might be a second use case for that pattern.)
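For example, something like this in Airflow (task names are made up, just to show the pattern):

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    ...


def load():
    ...


# Retries off globally, opted into per task only where they pay off.
with DAG(
    dag_id="example_selective_retries",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    default_args={"retries": 0},  # no automatic retries by default
) as dag:
    # This task rarely benefits from retries, so it inherits retries=0.
    extract_task = PythonOperator(task_id="extract", python_callable=extract)

    # This task hits a flaky API, so retries are enabled just for it.
    load_task = PythonOperator(
        task_id="load",
        python_callable=load,
        retries=3,
        retry_delay=timedelta(minutes=5),
    )

    extract_task >> load_task
```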
d
A few sources have transient failures that resolve with retries within a short time frame (temporary errors on the API side, local networking issues, etc.). I don't imagine immediate retries at the job level have much chance of success (although @Davin Chia (Airbyte), this is something you might be able to check in Amplitude by looking at the distribution of attempts for failed jobs).
u
Jenny, are you saying that we can get rid of runs within an attempt, and as long as we have user-tunable attempts we are good?
u
I think so. And the user may want to control the timing between attempts separately per job, while having a global default set. Depending on what system they are interfacing with, the appropriate backoff/retry behavior can be wildly different.
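Something like a global default with per-job overrides (names and numbers made up):

```python
# Global default wait between attempts, overridable per job. Illustrative only.
DEFAULT_WAIT_BETWEEN_ATTEMPTS_S = 60

JOB_OVERRIDES_S = {
    "salesforce-sync": 300,    # strict API rate limits -> back off longer
    "local-postgres-sync": 5,  # local network blips -> retry quickly
}


def wait_between_attempts_s(job_name: str) -> int:
    return JOB_OVERRIDES_S.get(job_name, DEFAULT_WAIT_BETWEEN_ATTEMPTS_S)
```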
u
Makes sense. Probably won't tackle the configurable retry bits now. Will open a ticket so we can follow up on it
u
so we are going to have 3 attempts per job. but each attempt only tries to run once? that seems reasonable to me.
u
Yeap. Get rid of runs within an attempt and only have the attempt concept.
u
@Vladimir Remar and I have been talking a lot about this, and what you propose makes sense. The current set-up was messy, and that's why we put both limits as env vars, with our set-up being 3 attempts, 1 run.
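(For reference, roughly how the two env-var limits get picked up; the variable names below are placeholders, not necessarily the exact ones in the codebase:)

```python
import os

# Placeholder env var names - substitute whatever the deployment actually exposes.
max_attempts = int(os.environ.get("SYNC_JOB_MAX_ATTEMPTS", "3"))
runs_per_attempt = int(os.environ.get("SYNC_JOB_RUNS_PER_ATTEMPT", "1"))
```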