# troubleshooting
g
Is this your first time deploying Airbyte: no
OS Version / Instance: Linux EC2 m5.4xlarge
Deployment: Docker
Airbyte Version: 0.35.50-alpha
Source: Survey Monkey 0.1.7
Destination: Snowflake 0.4.17
Description: Incremental + Dedupe results in duplicate rows for the responses stream
h
Hey, what are the primary keys you are using?
g
It's preconfigured in the Survey Monkey connector.
“Source defined”
m
Is this the only record that's duplicated?
g
No, there were a handful, and this was the first subsequent sync. I imagine if I do it again, it will happen again
are there related tests on this connector for the incremental responses stream?
do you need any additional information?
can you also please point me to where the incremental gets applied in the code base? specifically the “source defined” columns
m
are there related tests on this connector for the incremental responses stream?
should have a test validating that an incremental sync or full refresh produces the correct output, maybe not specific to duplicate records
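For context, here's a minimal sketch of the kind of duplicate check such a test could run; `assert_no_duplicate_ids` is a hypothetical helper, not an actual test in the Airbyte repo:

```python
# Hypothetical sketch of a duplicate check a connector acceptance test
# could run over the records emitted by incremental syncs. Names are
# made up for illustration; this is not a real test from the repo.
from collections import Counter

def assert_no_duplicate_ids(records, primary_key="id"):
    """Fail if any primary key appears more than once in the emitted records."""
    counts = Counter(r[primary_key] for r in records)
    dupes = {k: n for k, n in counts.items() if n > 1}
    assert not dupes, f"duplicate primary keys emitted: {dupes}"

# usage: run over the combined output of two consecutive incremental syncs
# assert_no_duplicate_ids(first_sync_records + second_sync_records)
```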
g
okay. so next steps?
which stream is this?
Do you think the primary key is right here? `id` is not the primary key, right?
@Harshith (Airbyte) redirecting you back to this thread as it contains much of what you're asking. I'm not sure what primary key is defined; it is source defined, and I'm asking where this might be defined so I can try to investigate further myself.
h
g
cheers, i was looking for a link to where this is defined. confirming the `id` is the same in the duplicated rows. however i do not see `start_modified_at` in the full response record
how does incremental work for this stream? i do not see `start_modified_at` in the full response records
also confirming that `"date_modified": "2022-01-19T13:18:13+00:00"` is the same for both records
does deduplication not occur if there is no normalisation? i.e. if basic normalisation is turned off?
h
yeah, it doesn't. Basic normalisation is what takes care of the deduplication
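As an aside, this is roughly what the dedupe step of basic normalisation achieves, sketched in Python. The real implementation is SQL generated by normalisation over the raw tables, so this is illustrative only:

```python
# Illustrative only: the effect of "incremental + dedupe" normalisation,
# expressed in Python. Airbyte actually does this in generated SQL over
# the _AIRBYTE_RAW tables, not in code like this.
def dedupe(raw_records, primary_key="id", cursor_field="date_modified"):
    """Collapse raw records to the latest version per primary key."""
    latest = {}
    for rec in raw_records:
        key = rec[primary_key]
        if key not in latest or rec[cursor_field] > latest[key][cursor_field]:
            latest[key] = rec
    return list(latest.values())
```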
g
okay, so that’s the issue here? i have no normalisation process (only `_AIRBYTE_RAW` tables) and thus the deduplication process (in incremental + dedupe) will not work. however, as for incremental, this duplicate record should still not appear as it is the same data, i.e. only records after `date_modified` should be inserted. correct?
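To pin down the incremental contract being described here: only records strictly newer than the saved cursor should come through, even with normalisation off. A hypothetical sketch, not the connector's actual code:

```python
# Hypothetical sketch of the expected incremental behaviour: given the
# saved state, only records strictly newer than the cursor are emitted.
import pendulum

def records_to_emit(records, state, cursor_field="date_modified"):
    cutoff = pendulum.parse(state[cursor_field])
    return [r for r in records if pendulum.parse(r[cursor_field]) > cutoff]
```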
h
Yeah, what you said is what should ideally happen. I will take a look at it today
Do you mind creating an issue for this?
g
can you see what the issue is? i’m happy to create a PR if there’s a quick fix
h
Sure, will do it and ping back here
g
any luck?
h
Hey, I went through the code:
```python
params = super().request_params(stream_state=stream_state, **kwargs)
params["sort_order"] = "ASC"
params["sort_by"] = "date_modified"
params["per_page"] = 1000  # maybe as user input or bigger value
# resume from the saved state's cursor, falling back to the configured start date
since_value = pendulum.parse(stream_state.get(self.cursor_field)) if stream_state.get(self.cursor_field) else self._start_date
since_value = max(since_value, self._start_date)
params["start_modified_at"] = since_value.strftime("%Y-%m-%dT%H:%M:%S")
return params
```
It doesn't look off to me. Can you help me understand whether there is a pattern to the duplicate records, e.g. are they just happening at the boundaries?
g
i just created a separate ticket as there appears to be a broader issue occurring, possibly to do with the state of the job in the DB. see the new issue here
It doesn't look off to me. Can you help me understand whether there is a pattern to the duplicate records, e.g. are they just happening at the boundaries?
is there a test to ensure this isn’t happening on airbyte’s end/configuration?
@Harshith (Airbyte) have you managed to make any progress on this?
h
Hey, as of now the credentials we have don't have access to responses, so it would be great if you could help me with info on the duplicates: is it just happening on the start or end dates, or is there any pattern to them?
But yeah, we are also working towards getting the responses stream up with our credentials
g
Aren't responses the main piece of data for the Survey Monkey connector? that's like having a GitHub connector with no commits or PRs, Google Ads with no actual visits, or databases with empty tables. I would think having responses is fundamental to the Airbyte test connector. I'll try to investigate on my end, however this is tricky as I'm hitting our daily API limits without being able to properly test.
@Harshith (Airbyte) quick question. is it possible that the `cursor_field` is not specific to the survey_id? i.e. when a list of survey IDs is provided and it goes through each one to get the responses, is the `cursor_field` leaking into the next survey id?
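To make the hypothesis concrete, here are the two state shapes in question, with made-up survey IDs; purely an illustration, not the connector's actual state format:

```python
# Made-up illustration of the hypothesis above. If a single cursor is
# shared across survey IDs, a recent response in one survey can advance
# the cursor past unread responses in the next survey.
shared_state = {"date_modified": "2022-04-04T14:51:18"}

# Keeping a cursor per survey_id would avoid the leak (hypothetical shape):
state_per_survey = {
    "survey_123": {"date_modified": "2022-04-04T14:51:18"},
    "survey_456": {"date_modified": "2022-03-30T09:10:00"},
}
```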
h
Checking it
g
@Harshith (Airbyte) i think i may have found the issue: https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/connectors/source-surveymonkey/source_surveymonkey/streams.py#L263-L275 can you confirm whether this API pulls responses after this timestamp, or on or after this timestamp? i.e. are responses being pulled `WHERE timestamp > cursor_field` or `WHERE timestamp >= cursor_field`?
i’ve just confirmed this: it's greater than or equal to (`WHERE timestamp >= cursor_field`). e.g. my last response has `"date_modified": "2022-04-04T14:51:18+00:00"` and this gets set as the cursor field. when i apply the updated state `start_modified_at: 2022-04-04T14:51:18` it returns the same response
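One possible mitigation for the inclusive (>=) API filter, sketched as a client-side guard; this is an assumption about a viable fix, not necessarily what the PR does:

```python
# Sketch of a client-side guard for the inclusive (>=) API filter: drop a
# record when its cursor is not strictly newer than the saved state, since
# it was already emitted by the previous sync. One possible fix only.
import pendulum

def should_emit(record, stream_state, cursor_field="date_modified"):
    prior = stream_state.get(cursor_field)
    if prior is None:
        return True
    return pendulum.parse(record[cursor_field]) > pendulum.parse(prior)
```

Caveat: if several unseen records legitimately share the boundary `date_modified`, a strict > guard could drop them, which is part of why the fix needs test coverage.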
@Harshith (Airbyte) are you able to review the PR? I’m not sure if it's the right solution and would appreciate help extending the tests to confirm this works as intended.
Is there somewhere more appropriate to post this to get feedback?