# troubleshooting
g
Is this your first time deploying Airbyte: no
OS Version / Instance: Linux EC2 m5.4xlarge
Deployment: Docker
Airbyte Version: 0.35.50-alpha
Source: Survey Monkey 0.1.7
Destination: Snowflake 0.4.17
Description: Incremental + Dedupe results in duplicate rows for the responses stream
h
Hey, what are the primary keys you are using?
g
It's preconfigured in the Survey Monkey connector.
“Source defined”
m
Is this the only record that's duplicated?
g
No, there were a handful, and this was the first subsequent sync. I imagine if I do it again, it will happen again
are there related tests on this connector for the incremental responses stream?
do you need any additional information?
can you also please point me to where the incremental gets applied in the code base? specifically the “source defined” columns
m
are there related tests on this connector for the incremental responses stream?
should have a test validating that an incremental sync or full refresh produces the correct output, maybe not specific to duplicate records
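For context, here's a minimal sketch of the kind of duplicate check such a test could run; `assert_no_duplicate_ids` is a hypothetical helper, not an actual test in the Airbyte repo:

```python
# Hypothetical sketch of a duplicate check a connector acceptance test
# could run over the records emitted by incremental syncs. Names are
# made up for illustration; this is not a real test from the repo.
from collections import Counter

def assert_no_duplicate_ids(records, primary_key="id"):
    """Fail if any primary key appears more than once in the emitted records."""
    counts = Counter(r[primary_key] for r in records)
    dupes = {k: n for k, n in counts.items() if n > 1}
    assert not dupes, f"duplicate primary keys emitted: {dupes}"

# usage: run over the combined output of two consecutive incremental syncs
# assert_no_duplicate_ids(first_sync_records + second_sync_records)
```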
g
okay. so next steps?
which stream is this?
Do you think the primary key is right here? `id` is not the primary key, right?
@Harshith (Airbyte) redirecting you back to this thread as it contains much of what you're asking. I'm not sure what primary key is defined; it is source defined, and I'm asking where this might be defined so I can try to investigate further myself.
h
g
cheers, i was looking for a link to where this is defined. confirming the `id` is the same in the duplicated rows. however i do not see `start_modified_at` in the full response record
how does incremental work for this stream? i do not see `start_modified_at` in the full response records
also confirming that `"date_modified": "2022-01-19T13:18:13+00:00"` is the same for both records
does deduplication not occur if there is no normalisation? i.e. if basic normalisation is turned off?
h
yeah, it doesn't. Basic normalisation is what takes care of the deduplication
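As an aside, this is roughly what the dedupe step of basic normalisation achieves, sketched in Python. The real implementation is SQL generated by normalisation over the raw tables, so this is illustrative only:

```python
# Illustrative only: the effect of "incremental + dedupe" normalisation,
# expressed in Python. Airbyte actually does this in generated SQL over
# the _AIRBYTE_RAW tables, not in code like this.
def dedupe(raw_records, primary_key="id", cursor_field="date_modified"):
    """Collapse raw records to the latest version per primary key."""
    latest = {}
    for rec in raw_records:
        key = rec[primary_key]
        if key not in latest or rec[cursor_field] > latest[key][cursor_field]:
            latest[key] = rec
    return list(latest.values())
```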
g
okay, so that’s the issue here? i have no normalisation process (only `_AIRBYTE_RAW` tables) and thus the deduplication process (in incremental + dedupe) will not work. however, as for incremental, this duplicate record should still not appear as it is the same data, i.e. only records after `date_modified` should be inserted. correct?
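To pin down the incremental contract being described here: only records strictly newer than the saved cursor should come through, even with normalisation off. A hypothetical sketch, not the connector's actual code:

```python
# Hypothetical sketch of the expected incremental behaviour: given the
# saved state, only records strictly newer than the cursor are emitted.
import pendulum

def records_to_emit(records, state, cursor_field="date_modified"):
    cutoff = pendulum.parse(state[cursor_field])
    return [r for r in records if pendulum.parse(r[cursor_field]) > cutoff]
```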
h
Yeah, what you said is what should ideally happen. I will take a look at it today
Do you mind creating an issue for this?
g
can you see what the issue is? i’m happy to create a PR if there’s a quick fix
h
Sure, will do it and ping back here
g
any luck?
h
Hey, I went through the code:
```python
params = super().request_params(stream_state=stream_state, **kwargs)
params["sort_order"] = "ASC"
params["sort_by"] = "date_modified"
params["per_page"] = 1000  # maybe as user input or bigger value
# resume from the saved state's cursor, falling back to the configured start date
since_value = pendulum.parse(stream_state.get(self.cursor_field)) if stream_state.get(self.cursor_field) else self._start_date
since_value = max(since_value, self._start_date)
params["start_modified_at"] = since_value.strftime("%Y-%m-%dT%H:%M:%S")
return params
```
It doesn't look off to me. Can you help me understand whether there is a pattern to the duplicate records, e.g. are they just happening at the boundaries?
g
i just created a separate ticket as there appears to be a broader issue occurring, possibly to do with the state of the job in the DB. see the new issue here
It doesn't look off to me. Can you help me understand whether there is a pattern to the duplicate records, e.g. are they just happening at the boundaries?
is there a test to ensure this isn’t happening on airbyte’s end/configuration?
@Harshith (Airbyte) have you managed to make any progress on this?
h
Hey, as of now the credentials we have don't have access to responses, so it would be great if you could help me with info on the duplicates: is it just happening on the start or end dates, or is there any pattern to them?
But yeah, we are also working towards getting the responses stream up with our credentials
g
Aren't responses the main piece of data for the Survey Monkey connector? that's like having a GitHub connector with no commits or PRs, Google Ads with no actual visits, or databases with empty tables. I would think having responses is fundamental to the Airbyte test connector. I'll try to investigate on my end, however this is tricky as I'm hitting our daily API limits without being able to properly test.
@Harshith (Airbyte) quick question. is it possible that the `cursor_field` is not specific to the survey_id? i.e. when a list of survey IDs is provided and it goes through each one to get the responses, is the `cursor_field` leaking into the next survey id?
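To make the hypothesis concrete, here are the two state shapes in question, with made-up survey IDs; purely an illustration, not the connector's actual state format:

```python
# Made-up illustration of the hypothesis above. If a single cursor is
# shared across survey IDs, a recent response in one survey can advance
# the cursor past unread responses in the next survey.
shared_state = {"date_modified": "2022-04-04T14:51:18"}

# Keeping a cursor per survey_id would avoid the leak (hypothetical shape):
state_per_survey = {
    "survey_123": {"date_modified": "2022-04-04T14:51:18"},
    "survey_456": {"date_modified": "2022-03-30T09:10:00"},
}
```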
h
Checking it
g
@Harshith (Airbyte) i think i may have found the issue: https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/connectors/source-surveymonkey/source_surveymonkey/streams.py#L263-L275 can you confirm whether this API pulls responses after this timestamp, or on or after this timestamp? i.e. are responses being pulled `WHERE timestamp > cursor_field` or `WHERE timestamp >= cursor_field`?
i’ve just confirmed this: it's greater than or equal to (`WHERE timestamp >= cursor_field`). e.g. my last response has `"date_modified": "2022-04-04T14:51:18+00:00"` and this gets set as the cursor field. when i apply the updated state `start_modified_at: 2022-04-04T14:51:18` it returns the same response
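One possible mitigation for the inclusive (>=) API filter, sketched as a client-side guard; this is an assumption about a viable fix, not necessarily what the PR does:

```python
# Sketch of a client-side guard for the inclusive (>=) API filter: drop a
# record when its cursor is not strictly newer than the saved state, since
# it was already emitted by the previous sync. One possible fix only.
import pendulum

def should_emit(record, stream_state, cursor_field="date_modified"):
    prior = stream_state.get(cursor_field)
    if prior is None:
        return True
    return pendulum.parse(record[cursor_field]) > pendulum.parse(prior)
```

Caveat: if several unseen records legitimately share the boundary `date_modified`, a strict > guard could drop them, which is part of why the fix needs test coverage.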
@Harshith (Airbyte) are you able to review the PR? I’m not sure if it's the right solution and would appreciate help extending the tests to confirm this works as intended.
Is there somewhere more appropriate to post this to get feedback?