# contributing-to-airbyte
d
Chris and I feel we are configured to retry excessively today - we do 3 attempts with 3 runs in each attempt - that's 9 'tries' in total. This is not good for rate limits. Errors also take a while to surface. Thinking of pushing this down to 3 attempts of 1 run each. Thoughts?
u
davin for president
u
maybe some users would love the ability to tweak that themselves too!
d
actually.. a user contributed this a while back 😄
u
shoutout to @Vladimir Remar!
u
Having the separate concepts of attempts/retries-per-attempt is a bit confusing.
u
Fully support 1 retry per attempt and tuning the number of attempts
u
☝️ I was about to ask the same thing. is there any difference between e.g. 1 attempt with 3 retries and 3 attempts with 1 retry each?
u
I'm going to rename retries -> runs per attempt to make this clearer. Mauro, the waiting mechanism is different. Runs within an attempt are meant to deal with transient failure, so there is almost no waiting. Attempts are meant to deal with actual failure so there is some naive wait time now.
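To make the distinction concrete, here's a rough sketch of how the two knobs nest today (illustrative Python, not the actual scheduler code):

```python
import time


def run_job(job, max_attempts=3, runs_per_attempt=3, wait_between_attempts_s=30):
    """Sketch of the current behaviour: 3 attempts x 3 runs = 9 tries total.

    Runs within an attempt handle transient failure, so there is almost no
    waiting between them. Attempts handle actual failure, so there is a
    naive wait between them. Names and numbers are illustrative only.
    """
    for attempt in range(max_attempts):
        for _run in range(runs_per_attempt):
            try:
                return job()  # first success ends the whole thing
            except Exception:
                pass  # immediate re-run within the same attempt, no waiting
        if attempt < max_attempts - 1:
            time.sleep(wait_between_attempts_s)  # naive wait before the next attempt
    raise RuntimeError("job failed after all attempts")
```

The proposal above is effectively `runs_per_attempt=1`, which leaves only the attempt loop.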
d
Most transient failures should be retried within connectors anyway.
u
Although I don't think it's very useful today - I'm playing around with the idea of getting rid of runs within an attempt entirely. Wdyt @Jared Rhizor (Airbyte) @charles (original folks on this)?
u
I think transient retry logic should live within connectors (and generally does today, at least for API connectors). I think removing the concept would be a nice simplification generally.
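For API connectors that usually looks like plain exponential backoff around the HTTP call, e.g. (generic sketch, not the actual CDK implementation):

```python
import time

import requests

TRANSIENT_STATUS_CODES = {429, 500, 502, 503, 504}


def get_with_backoff(url, max_retries=5, base_delay_s=1.0):
    """Generic connector-side retry for transient HTTP failures."""
    for retry in range(max_retries + 1):
        response = requests.get(url)
        if response.status_code not in TRANSIENT_STATUS_CODES:
            response.raise_for_status()  # surface non-transient errors immediately
            return response.json()
        if retry < max_retries:
            time.sleep(base_delay_s * 2 ** retry)  # 1s, 2s, 4s, ...
    raise RuntimeError(f"gave up after {max_retries} retries ({response.status_code})")
```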
u
Great. I'll wait for Charles to chime in. This can be the last thing I do before PTO
u
@s do you agree with the statement that transient retries are something connectors should be responsible for?
u
Are there any sources that have a transient failure but work if you retry within 1 second? That seems like a theoretical scenario that rarely turns up in actual practice. Immediate retries have not shown much value in my past production systems. With Airflow DAGs, I usually shut off automatic retries entirely and only turn them on very selectively for jobs we know get value from them. This definitely should have some user tuning available, globally and per-job. (Sidenote: I'm working on a JSON spec abstraction that would make it easier to do shared configuration options across connectors. This might be a second use case for that pattern.)
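For example, something like this in Airflow (task names are made up, just to show the pattern):

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    ...


def load():
    ...


# Retries off globally, opted into per task only where they pay off.
with DAG(
    dag_id="example_selective_retries",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    default_args={"retries": 0},  # no automatic retries by default
) as dag:
    # This task rarely benefits from retries, so it inherits retries=0.
    extract_task = PythonOperator(task_id="extract", python_callable=extract)

    # This task hits a flaky API, so retries are enabled just for it.
    load_task = PythonOperator(
        task_id="load",
        python_callable=load,
        retries=3,
        retry_delay=timedelta(minutes=5),
    )

    extract_task >> load_task
```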
d
A few sources have transient failures that resolve with retries within a short time frame (temporary errors on the API side, local networking issues, etc.). I don't imagine immediate retries at the job level have much chance of success (although @Davin Chia (Airbyte), this is something you might be able to check in Amplitude by looking at the distribution of attempts for failed jobs).
u
Jenny, are you saying that we can get rid of runs within an attempt, and as long as we have user-tunable attempts we are good?
u
I think so. And the user may want to control the timing between attempts separately per job, while having a global default set. Depending on what system they are interfacing with, the appropriate backoff/retry behavior can be wildly different.
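Something like a global default with per-job overrides (names and numbers made up):

```python
# Global default wait between attempts, overridable per job. Illustrative only.
DEFAULT_WAIT_BETWEEN_ATTEMPTS_S = 60

JOB_OVERRIDES_S = {
    "salesforce-sync": 300,    # strict API rate limits -> back off longer
    "local-postgres-sync": 5,  # local network blips -> retry quickly
}


def wait_between_attempts_s(job_name: str) -> int:
    return JOB_OVERRIDES_S.get(job_name, DEFAULT_WAIT_BETWEEN_ATTEMPTS_S)
```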
u
Makes sense. Probably won't tackle the configurable retry bits now. Will open a ticket so we can follow up on it
u
so we are going to have 3 attempts per job. but each attempt only tries to run once? that seems reasonable to me.
u
Yeap. Get rid of runs within an attempt and only have the attempt concept.
u
@Vladimir Remar and I have been talking a lot about this, and what you propose makes sense. The current set-up was messy, and that's why we put both limits as env vars, with our set-up being 3 attempts, 1 run.
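(For reference, roughly how the two env-var limits get picked up; the variable names below are placeholders, not necessarily the exact ones in the codebase:)

```python
import os

# Placeholder env var names - substitute whatever the deployment actually exposes.
max_attempts = int(os.environ.get("SYNC_JOB_MAX_ATTEMPTS", "3"))
runs_per_attempt = int(os.environ.get("SYNC_JOB_RUNS_PER_ATTEMPT", "1"))
```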