j

Jared Rhizor (Airbyte)

11/11/2020, 3:04 AM
@charles I'm running into some problems with google adwords -> postgres
u

user

11/11/2020, 3:04 AM
2020-11-11 03:00:31 ERROR (/tmp/workspace/32/0) LineGobbler(voidCall):69 - Caused by: org.postgresql.util.PSQLException: ERROR: relation "click_performance_report_1605063619055" does not exist
u

user

11/11/2020, 3:04 AM
however, CLICK_PERFORMANCE_REPORT_1605063619055 does exist
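For context: Postgres folds unquoted identifiers to lowercase and only treats double-quoted identifiers as case sensitive, which would explain why the lowercase lookup misses the all-caps table. A minimal Python sketch of that folding rule, purely for illustration (not the destination's actual code):

def fold_identifier(name: str, quoted: bool) -> str:
    # Postgres name resolution: unquoted identifiers are folded to lowercase,
    # quoted identifiers are matched exactly as written.
    return name if quoted else name.lower()

existing_table = "CLICK_PERFORMANCE_REPORT_1605063619055"
print(fold_identifier(existing_table, quoted=False))  # click_performance_report_1605063619055 -> not found
print(fold_identifier(existing_table, quoted=True))   # CLICK_PERFORMANCE_REPORT_1605063619055 -> matches the table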
u

user

11/11/2020, 3:04 AM
all of the performance reports are all caps
u

user

11/11/2020, 3:05 AM
campaigns/ad_groups/ads/accounts are all snake case
u

user

11/11/2020, 3:05 AM
this is without normalization
u

user

11/11/2020, 3:05 AM
maybe this could be related to the capitalization problems @s is seeing?
u

user

11/11/2020, 3:07 AM
I think this might be a casing issue
u

user

11/11/2020, 3:07 AM
capitalization
u

user

11/11/2020, 3:07 AM
do you think this is in the same area you're already looking into?
u

user

11/11/2020, 3:07 AM
damn
u

user

11/11/2020, 3:08 AM
or want me to look into it
u

user

11/11/2020, 3:08 AM
i think it’s related.
u

user

11/11/2020, 3:08 AM
but different.
u

user

11/11/2020, 3:08 AM
sources need to be responsible for passing valid stream and field names.
u

user

11/11/2020, 3:09 AM
but not all valid stream and field names are going to be valid in a destination
u

user

11/11/2020, 3:09 AM
and a destination should be responsible for making sure that all valid stream and field names can be handled
u

user

11/11/2020, 3:09 AM
which it sounds like they don’t right now.
u

user

11/11/2020, 3:10 AM
it seems like the postgres destination needs to be adjusted to handle this capitalization case.
u

user

11/11/2020, 3:10 AM
unless we want to take a stance that stream names and field names are always all caps 🤷
u

user

11/11/2020, 3:10 AM
i think that’s probably a bad idea
u

user

11/11/2020, 3:11 AM
if everything for postgres is properly quoted, it should support upper case names
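Roughly what proper quoting buys us — a sketch using a hypothetical quote_ident helper, not the actual jOOQ-based destination code:

def quote_ident(name: str) -> str:
    # Double-quote the identifier (and escape embedded quotes) so Postgres
    # preserves the exact case instead of folding it to lowercase.
    return '"' + name.replace('"', '""') + '"'

table = "CLICK_PERFORMANCE_REPORT_1605063619055"
print(f"SELECT COUNT(*) FROM {quote_ident(table)};")
# SELECT COUNT(*) FROM "CLICK_PERFORMANCE_REPORT_1605063619055";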
u

user

11/11/2020, 3:14 AM
maybe i’m not understanding the problem.
u

user

11/11/2020, 3:14 AM
I guess it's a question of if we want to normalize names or support case sensitive names
u

user

11/11/2020, 3:14 AM
why is it capitalized sometimes and not others in this postgres case?
u

user

11/11/2020, 3:15 AM
I'm not sure, it might just be that the source streams happen to be that way
u

user

11/11/2020, 3:15 AM
i’m leaning towards not supporting case sensitive names.
u

user

11/11/2020, 3:15 AM
the source is outputting capitalized streams in my case:

https://user-images.githubusercontent.com/6246757/98760749-c5716700-2388-11eb-9b1d-d58cfde9575f.png

u

user

11/11/2020, 3:15 AM
these are dumb issues to have.
u

user

11/11/2020, 3:15 AM
woof. that’s so whacky.
u

user

11/11/2020, 3:16 AM
I think this means we aren't using jooq escaping properly in the destination for names
u

user

11/11/2020, 3:19 AM
i vote case insensitive.
u

user

11/11/2020, 3:19 AM
meaning we auto lower case everything?
u

user

11/11/2020, 3:19 AM
or that we reject uppercase?
u

user

11/11/2020, 3:19 AM
• sources need to only return, and need to accept, either all caps or all lower case. don’t care which one.
• destinations need to be able to convert from all lower case or all upper case to whatever case they need.
u

user

11/11/2020, 3:20 AM
tu so destination needs to be able to accept
u

user

11/11/2020, 3:20 AM
what do we want uppercase or lower case?
u

user

11/11/2020, 3:21 AM
lower case?
u

user

11/11/2020, 3:21 AM
yes
u

user

11/11/2020, 3:21 AM
so then sources can only emit lower case stream names and field names
u

user

11/11/2020, 3:21 AM
why either all caps or all lower?
u

user

11/11/2020, 3:21 AM
I think it has to happen all in the destination though
u

user

11/11/2020, 3:21 AM
plus whatever restrictions we put on them around whitespace and whatever.
u

user

11/11/2020, 3:21 AM
destinations assume all inputs are all lower case and adjust them to whatever they need them to be.
u

user

11/11/2020, 3:22 AM
I think it's safer if the destination is fully responsible for interpreting the stream name
u

user

11/11/2020, 3:22 AM
^
u

user

11/11/2020, 3:22 AM
I think it’s simpler if we just say no matter what a source outputs, the destination will be responsible for handling it correctly
u

user

11/11/2020, 3:22 AM
although I'm worried that the schema selection won't match then
u

user

11/11/2020, 3:23 AM
yeah. i’m worried what the airbyte catalog is going to look like if we don’t enforce sanity in the sources.
u

user

11/11/2020, 3:23 AM
i do agree destinations should be cautious and assume the worst. no disagreement there
u

user

11/11/2020, 3:23 AM
but i think we also need to enforce what the source is allowed to emit
u

user

11/11/2020, 3:23 AM
and we should be strict so that we don’t have to handle stupid edge cases in our configuration model.
u

user

11/11/2020, 3:23 AM
yeah, my most recent statement was just w.r.t casing, not special characters
u

user

11/11/2020, 3:24 AM
i want to enforce casing in airbyte catalog too is what i’m saying
u

user

11/11/2020, 3:24 AM
i don’t want source to emit any non lowercase characters.
u

user

11/11/2020, 3:24 AM
is that cool with you?
u

user

11/11/2020, 3:25 AM
so that means at the python entrypoint.py level we'd want to standardize naming?
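Something along these lines is what's being floated — a hypothetical standardize_name helper, not anything that exists in entrypoint.py today:

import re

def standardize_name(name: str) -> str:
    # Hypothetical normalizer: lowercase everything and collapse any character
    # outside [a-z0-9_] into an underscore.
    return re.sub(r"[^a-z0-9_]", "_", name.lower())

print(standardize_name("Click Performance Report"))  # click_performance_report
print(standardize_name("fooBar"))                    # foobar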
u

user

11/11/2020, 3:25 AM
u

user

11/11/2020, 3:25 AM
that is cool with me. I like the idea of enforcing naming
u

user

11/11/2020, 3:25 AM
that would probably tide us over on the destination side for release too
u

user

11/11/2020, 3:26 AM
what do we do if a source has aBc and ABc streams?
u

user

11/11/2020, 3:27 AM
for the release I mean
u

user

11/11/2020, 3:27 AM
deterministically pick one.
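One way to make "pick one" deterministic — a hypothetical sketch, assuming we lowercase names and suffix collisions in sorted order:

def dedupe_lowercased(names):
    # Lowercase every stream name; when two names collide, deterministically
    # suffix the later ones (iterating in sorted order keeps the result stable).
    mapping, seen = {}, {}
    for original in sorted(names):
        lowered = original.lower()
        count = seen.get(lowered, 0)
        mapping[original] = lowered if count == 0 else f"{lowered}_{count}"
        seen[lowered] = count + 1
    return mapping

print(dedupe_lowercased(["aBc", "ABc"]))  # {'ABc': 'abc', 'aBc': 'abc_1'}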
u

user

11/11/2020, 3:27 AM
so i actually don’t think stringcase works.
u

user

11/11/2020, 3:27 AM
for this
u

user

11/11/2020, 3:28 AM
i don’t think any of our infrastructure can normalize for the integration
u

user

11/11/2020, 3:28 AM
the integration needs to do it right.
u

user

11/11/2020, 3:28 AM
we can't do that for the release though
u

user

11/11/2020, 3:29 AM
if the field name that the source emits in discover is fooBar and then entrypoint.py sneakily normalizes it to foobar, then when foobar is passed back the source won’t know that it’s the same thing as fooBar.
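Roughly the failure mode being described (a toy sketch; the stream and field names are made up):

# The source discovers "fooBar", a sneaky normalizer rewrites it to "foobar",
# and the source can no longer match the configured stream on the next sync.
discovered_streams = {"fooBar": ["ad_id", "clicks"]}

def sneaky_normalize(name):
    return name.lower()

configured = sneaky_normalize("fooBar")   # "foobar"
print(configured in discovered_streams)   # False -- the source doesn't recognize it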
u

user

11/11/2020, 3:29 AM
ah. i see what you’re saying.
u

user

11/11/2020, 3:29 AM
just as a stop gap.
u

user

11/11/2020, 3:29 AM
yeah
u

user

11/11/2020, 3:29 AM
i mean as far as i know there’s only one source that causes problems right now. google sheets.
u

user

11/11/2020, 3:29 AM
i can fix that one tomorrow
u

user

11/11/2020, 3:30 AM
sounds like maybe adwords too.
u

user

11/11/2020, 3:30 AM
can we just fix those two and not mess with entrypoint.py?
u

user

11/11/2020, 3:30 AM
and salesforce? or was that something else
u

user

11/11/2020, 3:30 AM
I think it can also happen in sql sources too
u

user

11/11/2020, 3:30 AM
depending on the data
u

user

11/11/2020, 3:30 AM
yeah…. i think if we put a stop gap thing in entrypoint.py we will cause new bugs
u

user

11/11/2020, 3:31 AM
yeah...
u

user

11/11/2020, 3:31 AM
i think we try our best to get by with what we have now. fix the sources we know are really bad and then gun for doing the right thing after release as a top priority.
u

user

11/11/2020, 3:31 AM
is it actually hard to support all casings in the jdbc destinations?
u

user

11/11/2020, 3:32 AM
is there a reason we can't just quote all table names and call it a day?
u

user

11/11/2020, 3:32 AM
you’re just talking for sources?
u

user

11/11/2020, 3:32 AM
no in the destinations
u

user

11/11/2020, 3:32 AM
basically support all cases
u

user

11/11/2020, 3:32 AM
🤦‍♀️ right. dumb question.
u

user

11/11/2020, 3:33 AM
i don’t know enough off the top of my head to know if that will work.
u

user

11/11/2020, 3:33 AM
yeah I don't want to make any decisions on this at this stage
u

user

11/11/2020, 3:33 AM
going to sleep on it and revisit in the morning
u

user

11/11/2020, 3:33 AM
🤝
u

user

11/11/2020, 3:33 AM
maybe poke around a bit to see how we're quoting
u

user

11/11/2020, 3:33 AM
🥨 🥨 🥨 🥨
u

user

11/11/2020, 3:33 AM
agreed.
u

user

11/11/2020, 4:28 PM
thought about this some more this morning. i think a lot of my thinking was wrong.
u

user

11/11/2020, 4:28 PM
i think we should avoid touching sources at all pre-release if we can.
u

user

11/11/2020, 4:29 PM
i think investing in getting destinations to normalize stuff is generally worthwhile.
u

user

11/11/2020, 4:31 PM
priority of work from my point of view now is as follows:
• (pre-release) add to standard tests a data set that includes streams / fields with spaces and weird casing in the name (something like the sketch below). get all destinations (with normalization) to succeed given these cases
• (pre-release or right after release) add to sync worker something that applies a normalization to stream and field names before they get to destinations (remove white space, strip out special characters)
• (post-release) at least highly encourage that sources conform to a certain set of rules but maybe don’t fail on them.
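A hypothetical sketch of the kinds of names that data set would need to cover (illustrative only, not the actual test fixtures):

TEST_STREAM_NAMES = [
    "click performance report",   # spaces
    "ClickPerformanceReport",     # camel case
    "CLICK_PERFORMANCE_REPORT",   # all caps
    "ad_groups",                  # already snake case
]
TEST_FIELD_NAMES = ["Campaign Name", "adGroupId", "COST_MICROS", "clicks"]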
u

user

11/11/2020, 4:32 PM
i could be convinced that i have the order of steps 1 and 2 wrong. what do you think.
u

user

11/11/2020, 4:33 PM
Pre release we can add quoting to identifiers in the Postgres destination too
u

user

11/11/2020, 4:34 PM
I think it’s easier to support case sensitivity short term than it is to consider name normalization
u

user

11/11/2020, 4:34 PM
i came to this ordering because destinations will need to do some name normalization on what comes in no matter what: either they receive any possible string, or they receive a string that is already normalized to whatever standard airbyte sets (but not necessarily what the db supports). it will also be easy to test this and give us high confidence it’s working in the standard tests.
u

user

11/11/2020, 4:35 PM
i have meeting with zazmic and need to review their code now, but i can work on adding the extra data set to the standard tests once i’m done unless someone beats me to it.
u

user

11/11/2020, 4:35 PM
jared, sounds like you’re focusing on fixing the pg stuff?
u

user

11/11/2020, 4:36 PM
Yeah I think it’s a short fix, I’ll start on it in 10-15min
u

user

11/11/2020, 4:36 PM
kk
u

user

11/11/2020, 4:52 PM
Pre release we can add quoting to identifiers in the Postgres destination too
That's what I end up doing in the normalization pieces with DBT, with the default quoting parameters set to true in the sources.yml file, but we can actually play with those settings to enable or disable quoting...
u

user

11/11/2020, 5:59 PM
going to publish postgres with this change
u

user

11/11/2020, 5:59 PM
hotfixing the current version
u

user

11/11/2020, 6:00 PM
tu
u

user

11/11/2020, 6:05 PM
done
u

user

11/11/2020, 6:12 PM
Exception in thread "main" org.jooq.exception.DataAccessException: SQL [insert into streamWithCamelCase_1605118306833 (ab_id, data, emitted_at) values (?, cast(? as jsonb), cast(? as timestamp with time zone))]; ERROR: relation "streamwithcamelcase_1605118306833" does not exist
u

user

11/11/2020, 6:12 PM
i’m getting this error. should your change have fixed this?
u

user

11/11/2020, 6:17 PM
or was that just for whitespace?
u

user

11/11/2020, 6:27 PM
@Jared Rhizor (Airbyte)
u

user

11/11/2020, 6:27 PM
might fix this
u

user

11/11/2020, 6:28 PM
did you pull the latest version?
u

user

11/11/2020, 6:28 PM
yes
u

user

11/11/2020, 6:28 PM
I would expect the logged sql to include the quotes then
u

user

11/11/2020, 6:28 PM
→ docker image ls airbyte/destination-postgres:0.1.1
REPOSITORY                     TAG                 IMAGE ID            CREATED             SIZE
airbyte/destination-postgres   0.1.1               2d00d190bca9        28 minutes ago      440MB
u

user

11/11/2020, 6:29 PM
Can you verify you're on that image?
u

user

11/11/2020, 6:29 PM
And also, is the test you're running based off of a dev image or 0.1.1? If it's off of dev, have you merged master into your test branch?
u

user

11/11/2020, 6:33 PM
yes to all of those things.
u

user

11/11/2020, 6:33 PM
i’m using :dev tag
u

user

11/11/2020, 6:34 PM
and your change is included
u

user

11/11/2020, 6:35 PM
oh this is the insert
u

user

11/11/2020, 6:35 PM
so not impacted by my change
u

user

11/11/2020, 6:35 PM
and this is postgres?
u

user

11/11/2020, 6:35 PM
I thought insert would quote
u

user

11/11/2020, 6:35 PM
maybe there's some jooq setting
u

user

11/11/2020, 6:36 PM
By default, jOOQ will always generate quoted names for all identifiers
u

user

11/11/2020, 6:37 PM
i’m going to get the new standard tests in a working state and then we can just divide up destinations and get them passing.
u

user

11/11/2020, 7:07 PM
do you want me to include field names that include special characters, e.g. field*name*with*asterisk*, or is that overkill for right now?
u

user

11/11/2020, 7:07 PM
I think case sensitivity with _ is all we've seen and all we reasonably need to support
u

user

11/11/2020, 7:08 PM
google sheets regularly has spaces too.
u

user

11/11/2020, 7:08 PM
I'm fine failing if anyone does something super$nonstandard!@
u

user

11/11/2020, 7:08 PM
oh
u

user

11/11/2020, 7:08 PM
hmm
u

user

11/11/2020, 7:08 PM
i’ll do spaces but not asterisk.
u

user

11/11/2020, 7:08 PM
google sheets probably has special characters too
u

user

11/11/2020, 7:08 PM
cool?
u

user

11/11/2020, 7:08 PM
oh yeah. i guess $ could be a thing
u

user

11/11/2020, 7:09 PM
i mean how much extra work is it to normalize everything that’s not a normal character to an underscore for our existing destinations?
u

user

11/11/2020, 7:09 PM
if it’s little extra work then we should just do it now.
u

user

11/11/2020, 7:09 PM
maybe?
u

user

11/11/2020, 7:09 PM
I mean we can replace special chars with an _
u

user

11/11/2020, 7:10 PM
but then it’s ambiguous because x$*y == x*$y
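Concretely (a toy example of the collision; to_underscores is a hypothetical helper):

import re

def to_underscores(name: str) -> str:
    # Replace every character outside [A-Za-z0-9] with an underscore.
    return re.sub(r"[^A-Za-z0-9]", "_", name)

print(to_underscores("x$*y"))  # x__y
print(to_underscores("x*$y"))  # x__y -- two different source fields, one destination column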
u

user

11/11/2020, 7:10 PM
are there any destinations that don't support special characters?
u

user

11/11/2020, 7:10 PM
snowflake/postgres do?
u

user

11/11/2020, 7:10 PM
i don’t know.
u

user

11/11/2020, 7:10 PM
bigquery does too
u

user

11/11/2020, 7:11 PM
and i guess the filesystem is pretty tolerant for the csv stuff.
u

user

11/11/2020, 7:11 PM
yeah
u

user

11/11/2020, 7:11 PM
I guess we should just add some special character tests then
u

user

11/11/2020, 7:12 PM
agreed
u

user

11/11/2020, 7:15 PM
ok
u

user

11/11/2020, 7:15 PM
i’ll do this.
u

user

11/11/2020, 7:23 PM
u

user

11/11/2020, 7:23 PM
@s re-requested your review since it changed kinda substantially from when you looked at it an hour ago.
u

user

11/11/2020, 7:23 PM
plan is to merge this. it will break integration tests for bq, snowflake, and pg, and then we split them up and fix them.