Hi I m looking at the connection builder and trying to set u Airbyte #help-connector-development

Hi, I’m looking at the connection builder, and try...

Gergely Imreh

04/27/2023, 12:46 PM

Hi, I’m looking at the connection builder, and trying to set up an incremental sync. The API I’m talking to returns a time zone, but it seems somewhere that value is being lost in the process. The API in question returns times like

2023-04-27T06:29:50-07:00

which I parse according to the relevant

%Y-%m-%dT%H:%M:%S%z

format. If I put that example value as a starting time, and then look at the request sent by AirByte, the time zone data gets zeroed out (ie. parsed as the wrong time), here for example used as

2023-04-27T06:29:50+0000

in the request (added screenshot of the request info, the first datetime value) Am I missing something? Otherwise I would worry, that the cursor will keep/propagate the wrong times, potentially? Cheers!

Joe Reuter (Airbyte)

04/27/2023, 12:57 PM

Didn't test but I think

%z

is "UTC offset in the form `±HHMM[SS[.ffffff]]`"(https://docs.python.org/3/library/datetime.html#strftime-and-strptime-format-codes) So in your case the start time should be

2023-04-27T06:29:50+0700

(without the colon)

Joe Reuter (Airbyte)

04/27/2023, 12:57 PM

Could you try that?

Gergely Imreh

04/27/2023, 1:07 PM

Checking the docs it also says:

Changed in version 3.7: When the
%z
directive is provided to the
strptime()
method, the UTC offsets can have a colon as a separator between hours, minutes and seconds. For example,
'+01:00:00'
will be parsed as an offset of one hour. In addition, providing
'Z'
is identical to
'+00:00'
.

(and it works as per Python repl screenshot on Python 3.9) Time zone is parsed, but then seems to be reset to +00:00 in the request sent by AirByte. Also the API returns times with the colon inside, and would have to parse them for the cursor field, right? Or that is not really used here, only AirByte’s internal tracking of what time intervals were queried? (but then I don’t quite get why the cursor field is needed, but that’s another question, if the timezone is parsed and used correctly none of that would worry me)

Joe Reuter (Airbyte)

04/27/2023, 1:09 PM

Good catch, based on that it should indeed work. @Maxime Carbonneau-Leclerc (Airbyte) do you know what's going on here?

🙏 1

Gergely Imreh

04/27/2023, 1:27 PM

BTW tried the incremental sync and do end up with duplicates, due to the cursor having a time zone that is incorrectly parsed, and thus intervals are queried in an overlapping way, as expected above. Thanks for checking!

Maxime Carbonneau-Leclerc (Airbyte)

04/27/2023, 2:12 PM

Gergely asks good questions and provides good feedback. I like it! So, I’m not 100% sure about the datetime parsing we do but the problem comes from here. Basically, we blindly override the timezone while we should probably only assume UTC if the datetime is naive. @Alexandre Girard (Airbyte) is there some context I’m missing and do you agree with the proposed solution? If you confirm, I can implement the fix

👀 2

Maxime Carbonneau-Leclerc (Airbyte)

04/27/2023, 2:21 PM

Without the fix @Gergely Imreh, it means that start time will need to be UTC else you will have a gap between the start_datetime provided and the one sent to the API

Gergely Imreh

04/27/2023, 2:29 PM

Yeah, that one is easier to handle as it’s user editable indeed! 👍 However as mentioned above, “incremental - append” will behave unexpectedly, producing duplicated records - or missing some, if the API returns non-UTC time, which I am not sure how to work around (except not using incremental for the time being, if the API returns timezone offsets)

Maxime Carbonneau-Leclerc (Airbyte)

04/27/2023, 2:41 PM

This is probably linked to the same issue. The reason is that when we calculate the cursor datetime at the beginning of a stream, we will parse the state using the same datetime parser and therefore timezone is blindly replaced by UTC without adjusting the hours. The fix propose above should fix that as well. To be clear though, incremental does not mean that you won’t have duplicates. This is especially the case when the granularity of the datetime format is big. For example, let’s say our datetime format is

%Y-%m

so we perform a first sync today for the data between 2023-01 and 2023-04. From what datetime the next sync should start? We have two possibility with various drawbacks: • If we start from 2023-04, we will get duplicate records • If we start from 2023-05, then we might miss the last few days of April Since we prefer data completeness over data uniqueness, we choose to use 2023-04 as a start date for the next sync. Now, in your case where the granularity is in seconds, this is very unlikely (although not impossible) that you will have duplication. If you want to avoir duplication at all cost, you will have to specify primary key and have a destination that supports

append_dedup

👍 1

this 1

Alexandre Girard (Airbyte)

04/27/2023, 2:46 PM

you're right @Maxime Carbonneau-Leclerc (Airbyte). we shouldn't replace the timzeone when it's known

Maxime Carbonneau-Leclerc (Airbyte)

04/27/2023, 2:48 PM

Perfect! I’ll take action on this hopefully by the end of the week then and keep you posted @Gergely Imreh Your input is highly valuable since it helps make the product better for everyone so thanks again for that and I’ll keep you posted

Gergely Imreh

04/28/2023, 8:26 AM

Thanks, this makes all sense, and really appreciate the feedback! @Maxime Carbonneau-Leclerc (Airbyte) The duplicates comment makes sense for those buckets, though when the timestamp is seconds level and the volume is not crazy high, that duplication is not really expected in practice (month level: of course there will be:) The

append_dedup

you mention is the “incremental: deduped + history” (rather than the “incremental: append”? Just checking as not obvious which would it be. An append dedup (without history) would be nice, though definitely with incremental deduplication will need to be taken into account at some point in the data pipeline, so definitely 👍 I’m trying to reduce the “avoidable” duplication, so to say. Good discussion as well, definitely clarifies a few things for me too! Cheers.

Maxime Carbonneau-Leclerc (Airbyte)

04/28/2023, 1:02 PM

The append_dedup you mention is the “incremental: deduped + history” (rather than the “incremental: append”?

Exactly

An append dedup (without history) would be nice

This is not on our roadmap for now and I don’t think this is considered a priority for now. Though if we get more and more feedback like yours, we might change this view. For more information regarding the feature itself, see https://docs.airbyte.com/understanding-airbyte/connections/incremental-deduped-history Also, for reference, here is the PR: https://github.com/airbytehq/airbyte/pull/25665

Open in Slack

Previous Next