Gergely Imreh
04/27/2023, 12:46 PM
2023-04-27T06:29:50-07:00 which I parse according to the relevant %Y-%m-%dT%H:%M:%S%z format. If I put that example value as a starting time and then look at the request sent by Airbyte, the time zone data gets zeroed out (i.e. parsed as the wrong time); here, for example, it is used as 2023-04-27T06:29:50+0000 in the request (added a screenshot of the request info, the first datetime value).
Am I missing something? Otherwise I would worry that the cursor will keep and propagate the wrong times. Cheers!
Joe Reuter (Airbyte)
04/27/2023, 12:57 PM
%z is "UTC offset in the form `±HHMM[SS[.ffffff]]`" (https://docs.python.org/3/library/datetime.html#strftime-and-strptime-format-codes) So in your case the start time should be 2023-04-27T06:29:50+0700 (without the colon)
Joe Reuter (Airbyte)
04/27/2023, 12:57 PM
Gergely Imreh
04/27/2023, 1:07 PM
Changed in version 3.7: When the %z directive is provided to the strptime() method, the UTC offsets can have a colon as a separator between hours, minutes and seconds. For example, '+01:00:00' will be parsed as an offset of one hour. In addition, providing 'Z' is identical to '+00:00'.
(and it works, as per the Python REPL screenshot on Python 3.9) The time zone is parsed, but then seems to be reset to +00:00 in the request sent by Airbyte. Also, the API returns times with the colon inside, and Airbyte would have to parse them for the cursor field, right? Or is that not really used here, only Airbyte’s internal tracking of which time intervals were queried? (But then I don’t quite get why the cursor field is needed. That’s another question; if the time zone is parsed and used correctly, none of that would worry me.)
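A minimal sketch of the two points above, using only the Python standard library. It checks that `%z` accepts the offset with or without the colon on Python 3.7+, and contrasts a real conversion to UTC with simply overwriting the offset. The `replace(tzinfo=...)` line is only an assumption about how a value like the one seen in the request could arise; it is not a claim about what Airbyte does internally.

```python
from datetime import datetime, timezone

FMT = "%Y-%m-%dT%H:%M:%S%z"

# Both spellings of the offset parse to the same aware datetime (Python 3.7+).
with_colon = datetime.strptime("2023-04-27T06:29:50-07:00", FMT)
without_colon = datetime.strptime("2023-04-27T06:29:50-0700", FMT)
assert with_colon == without_colon

# A correct conversion to UTC shifts the clock time by the offset.
as_utc = with_colon.astimezone(timezone.utc)
print(as_utc.strftime(FMT))  # 2023-04-27T13:29:50+0000

# Overwriting the offset keeps the clock time and silently moves the instant.
# (Hypothetical illustration of a "zeroed out" offset, not Airbyte's code.)
zeroed = with_colon.replace(tzinfo=timezone.utc)
print(zeroed.strftime(FMT))  # 2023-04-27T06:29:50+0000
```

The second printed value matches the one observed in the request, which is seven hours earlier than the instant the original string describes.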
Joe Reuter (Airbyte)
04/27/2023, 1:09 PM
Gergely Imreh
04/27/2023, 1:27 PM
Maxime Carbonneau-Leclerc (Airbyte)
04/27/2023, 2:12 PM
Maxime Carbonneau-Leclerc (Airbyte)
04/27/2023, 2:21 PM
Gergely Imreh
04/27/2023, 2:29 PM
Maxime Carbonneau-Leclerc (Airbyte)
04/27/2023, 2:41 PM
%Y-%m so we perform a first sync today for the data between 2023-01 and 2023-04. From what datetime should the next sync start? We have two possibilities, each with a drawback:
• If we start from 2023-04, we will get duplicate records
• If we start from 2023-05, then we might miss the last few days of April
Since we prefer data completeness over data uniqueness, we choose to use 2023-04 as the start date for the next sync. In your case, where the granularity is in seconds, it is very unlikely (although not impossible) that you will have duplication. If you want to avoid duplication at all costs, you will have to specify a primary key and use a destination that supports append_dedup
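The trade-off described above can be sketched with a toy example. Everything here is hypothetical and for illustration only: `sync` stands in for an inclusive cursor filter, `dedupe` for primary-key deduplication in the destination, and the records are made up. None of it is Airbyte's actual implementation.

```python
def sync(records, start):
    """Return records whose cursor value is >= start (inclusive lower bound)."""
    # "%Y-%m" strings sort chronologically, so plain string comparison works.
    return [r for r in records if r["month"] >= start]


def dedupe(records, primary_key):
    """Keep the last record seen for each primary key value."""
    latest = {}
    for record in records:
        latest[record[primary_key]] = record
    return list(latest.values())


# State of the source at the time of the first sync.
source = [
    {"id": 1, "month": "2023-01"},
    {"id": 2, "month": "2023-04"},
]
first = sync(source, "2023-01")
cursor = max(r["month"] for r in first)  # "2023-04"

# A new record shows up later in April, after the first sync.
source.append({"id": 3, "month": "2023-04"})

inclusive = sync(source, cursor)     # re-reads id 2, picks up id 3
exclusive = sync(source, "2023-05")  # misses id 3 entirely

appended = first + inclusive         # id 2 appears twice
deduped = dedupe(appended, "id")     # one row per primary key
```

Starting again from the last cursor value gives complete data at the cost of a duplicate, and the primary key is what lets the destination collapse that duplicate afterwards.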
Alexandre Girard (Airbyte)
04/27/2023, 2:46 PM
Maxime Carbonneau-Leclerc (Airbyte)
04/27/2023, 2:48 PM
Gergely Imreh
04/28/2023, 8:26 AM
The append_dedup you mention is the “incremental: deduped + history” (rather than the “incremental: append”)? Just checking, as it is not obvious which it would be. An append dedup (without history) would be nice, though with incremental syncs deduplication will definitely need to be taken into account at some point in the data pipeline, so definitely 👍 I’m trying to reduce the “avoidable” duplication, so to say.
Good discussion as well, definitely clarifies a few things for me too! Cheers.
Maxime Carbonneau-Leclerc (Airbyte)
04/28/2023, 1:02 PM
> The append_dedup you mention is the “incremental: deduped + history” (rather than the “incremental: append”)?
Exactly
> An append dedup (without history) would be nice
This is not on our roadmap and I don’t think it is considered a priority for now, though if we get more feedback like yours, we might change this view. For more information regarding the feature itself, see https://docs.airbyte.com/understanding-airbyte/connections/incremental-deduped-history Also, for reference, here is the PR: https://github.com/airbytehq/airbyte/pull/25665