j

Jared Rhizor (Airbyte)

11/11/2020, 3:04 AM
@charles I'm running into some problems with google adwords -> postgres
u

user

11/11/2020, 3:04 AM
2020-11-11 03:00:31 ERROR (/tmp/workspace/32/0) LineGobbler(voidCall):69 - Caused by: org.postgresql.util.PSQLException: ERROR: relation "click_performance_report_1605063619055" does not exist
u

user

11/11/2020, 3:04 AM
however, CLICK_PERFORMANCE_REPORT_1605063619055 does exist
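For context: Postgres folds unquoted identifiers to lowercase and only treats double-quoted identifiers as case sensitive, which would explain why the lowercase lookup misses the all-caps table. A minimal Python sketch of that folding rule, purely for illustration (not the destination's actual code):

def fold_identifier(name: str, quoted: bool) -> str:
    # Postgres name resolution: unquoted identifiers are folded to lowercase,
    # quoted identifiers are matched exactly as written.
    return name if quoted else name.lower()

existing_table = "CLICK_PERFORMANCE_REPORT_1605063619055"
print(fold_identifier(existing_table, quoted=False))  # click_performance_report_1605063619055 -> not found
print(fold_identifier(existing_table, quoted=True))   # CLICK_PERFORMANCE_REPORT_1605063619055 -> matches the table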
u

user

11/11/2020, 3:04 AM
all of the performance reports are all caps
u

user

11/11/2020, 3:05 AM
campaigns/ad_groups/ads/accounts are all snake case
u

user

11/11/2020, 3:05 AM
this is without normalization
u

user

11/11/2020, 3:05 AM
maybe this could be related to the capitalization problems @s is seeing?
u

user

11/11/2020, 3:07 AM
I think this might be a casing issue
u

user

11/11/2020, 3:07 AM
capitalization
u

user

11/11/2020, 3:07 AM
do you think this is in the same area you're already looking into?
u

user

11/11/2020, 3:07 AM
damn
u

user

11/11/2020, 3:08 AM
or want me to look into it
u

user

11/11/2020, 3:08 AM
i think it’s related.
u

user

11/11/2020, 3:08 AM
but different.
u

user

11/11/2020, 3:08 AM
sources need to be responsible for passing valid stream and field names.
u

user

11/11/2020, 3:09 AM
but not all valid stream and field names are going to be valid in a destination
u

user

11/11/2020, 3:09 AM
and a destination should be responsible for making sure that all valid stream and field names can be handled
u

user

11/11/2020, 3:09 AM
which it sounds like they don’t right now.
u

user

11/11/2020, 3:10 AM
it seems like the postgres destination needs to be adjusted to handle this capitalization case.
u

user

11/11/2020, 3:10 AM
unless we want to take a stance that stream names and field names are always all caps 🤷
u

user

11/11/2020, 3:10 AM
i think that’s probably a bad idea
u

user

11/11/2020, 3:11 AM
if everything for postgres is properly quoted, it should support upper case names
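Roughly what proper quoting buys us — a sketch using a hypothetical quote_ident helper, not the actual jOOQ-based destination code:

def quote_ident(name: str) -> str:
    # Double-quote the identifier (and escape embedded quotes) so Postgres
    # preserves the exact case instead of folding it to lowercase.
    return '"' + name.replace('"', '""') + '"'

table = "CLICK_PERFORMANCE_REPORT_1605063619055"
print(f"SELECT COUNT(*) FROM {quote_ident(table)};")
# SELECT COUNT(*) FROM "CLICK_PERFORMANCE_REPORT_1605063619055";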
u

user

11/11/2020, 3:14 AM
maybe i’m not understanding the problem.
u

user

11/11/2020, 3:14 AM
I guess it's a question of if we want to normalize names or support case sensitive names
u

user

11/11/2020, 3:14 AM
why is it capitalized sometimes and not others in this postgres case?
u

user

11/11/2020, 3:15 AM
I'm not sure, it might just be that the source streams happen to be that way
u

user

11/11/2020, 3:15 AM
i’m leaning towards not supporting case sensitive names.
u

user

11/11/2020, 3:15 AM
the source is outputting capitalized streams in my case:

https://user-images.githubusercontent.com/6246757/98760749-c5716700-2388-11eb-9b1d-d58cfde9575f.png

u

user

11/11/2020, 3:15 AM
these are dumb issues to have.
u

user

11/11/2020, 3:15 AM
woof. that’s so whacky.
u

user

11/11/2020, 3:16 AM
I think this means we aren't using jooq escaping properly in the destination for names
u

user

11/11/2020, 3:19 AM
i vote case insensitive.
u

user

11/11/2020, 3:19 AM
meaning we auto lower case everything?
u

user

11/11/2020, 3:19 AM
or that we reject uppercase?
u

user

11/11/2020, 3:19 AM
• sources need to only return, and need to accept, either all caps or all lower case. don’t care which one.
• destinations need to be able to convert from all lower case or all upper case to whatever case they need.
u

user

11/11/2020, 3:20 AM
tu so destination needs to be able to accept
u

user

11/11/2020, 3:20 AM
what do we want uppercase or lower case?
u

user

11/11/2020, 3:21 AM
lower case?
u

user

11/11/2020, 3:21 AM
yes
u

user

11/11/2020, 3:21 AM
so then sources can only emit lower case stream names and field names
u

user

11/11/2020, 3:21 AM
why either all caps or all lower?
u

user

11/11/2020, 3:21 AM
I think it has to happen all in the destination though
u

user

11/11/2020, 3:21 AM
plus whatever restrictions we put on them around whitespace and whatever.
u

user

11/11/2020, 3:21 AM
destinations assume all inputs are all lower case and adjust them to whatever they need them to be.
u

user

11/11/2020, 3:22 AM
I think it's safer if the destination is fully responsible for interpreting the stream name
u

user

11/11/2020, 3:22 AM
^
u

user

11/11/2020, 3:22 AM
I think it’s simpler if we just say no matter what a source outputs, the destination will be responsible for handling it correctly
u

user

11/11/2020, 3:22 AM
although I'm worried that the schema selection won't match then
u

user

11/11/2020, 3:23 AM
yeah. i’m worried what the airbyte catalog is going to look like if we don’t enforce sanity in the sources.
u

user

11/11/2020, 3:23 AM
i do agree destinations should be cautious and assume the worst. no disagreement there
u

user

11/11/2020, 3:23 AM
but i think we also need to enforce what the source is allowed to emit
u

user

11/11/2020, 3:23 AM
and we should be strict so that we don’t have to handle stupid edge cases in our configuration model.
u

user

11/11/2020, 3:23 AM
yeah, my most recent statement was just w.r.t casing, not special characters
u

user

11/11/2020, 3:24 AM
i want to enforce casing in airbyte catalog too is what i’m saying
u

user

11/11/2020, 3:24 AM
i don’t want source to emit any non lowercase characters.
u

user

11/11/2020, 3:24 AM
is that cool with you?
u

user

11/11/2020, 3:25 AM
so that means at the python entrypoint.py level we'd want to standardize naming?
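Something along these lines is what's being floated — a hypothetical standardize_name helper, not anything that exists in entrypoint.py today:

import re

def standardize_name(name: str) -> str:
    # Hypothetical normalizer: lowercase everything and collapse any character
    # outside [a-z0-9_] into an underscore.
    return re.sub(r"[^a-z0-9_]", "_", name.lower())

print(standardize_name("Click Performance Report"))  # click_performance_report
print(standardize_name("fooBar"))                    # foobar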
u

user

11/11/2020, 3:25 AM
u

user

11/11/2020, 3:25 AM
that is cool with me. I like the idea of enforcing naming
u

user

11/11/2020, 3:25 AM
that would probably tide us over on the destination side for release too
u

user

11/11/2020, 3:26 AM
what do we do if a source has aBc and ABc streams?
u

user

11/11/2020, 3:27 AM
for the release I mean
u

user

11/11/2020, 3:27 AM
deterministically pick one.
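One way to make "pick one" deterministic — a hypothetical sketch, assuming we lowercase names and suffix collisions in sorted order:

def dedupe_lowercased(names):
    # Lowercase every stream name; when two names collide, deterministically
    # suffix the later ones (iterating in sorted order keeps the result stable).
    mapping, seen = {}, {}
    for original in sorted(names):
        lowered = original.lower()
        count = seen.get(lowered, 0)
        mapping[original] = lowered if count == 0 else f"{lowered}_{count}"
        seen[lowered] = count + 1
    return mapping

print(dedupe_lowercased(["aBc", "ABc"]))  # {'ABc': 'abc', 'aBc': 'abc_1'}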
u

user

11/11/2020, 3:27 AM
so i actually don’t think stringcase works.
u

user

11/11/2020, 3:27 AM
for this
u

user

11/11/2020, 3:28 AM
i don’t think any of our infrastructure can normalize for the integration
u

user

11/11/2020, 3:28 AM
the integration needs to do it right.
u

user

11/11/2020, 3:28 AM
we can't do that for the release though
u

user

11/11/2020, 3:29 AM
if the field name that the source emits in discover is fooBar and then entrypoint.py sneakily normalizes it to foobar, then when foobar is passed back the source won’t know that it’s the same thing as fooBar.
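Roughly the failure mode being described (a toy sketch; the stream and field names are made up):

# The source discovers "fooBar", a sneaky normalizer rewrites it to "foobar",
# and the source can no longer match the configured stream on the next sync.
discovered_streams = {"fooBar": ["ad_id", "clicks"]}

def sneaky_normalize(name):
    return name.lower()

configured = sneaky_normalize("fooBar")   # "foobar"
print(configured in discovered_streams)   # False -- the source doesn't recognize it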
u

user

11/11/2020, 3:29 AM
ah. i see what you’re saying.
u

user

11/11/2020, 3:29 AM
just as a stop gap.
u

user

11/11/2020, 3:29 AM
yeah
u

user

11/11/2020, 3:29 AM
i mean as far as i know there’s only one source that causes problems right now. google sheets.
u

user

11/11/2020, 3:29 AM
i can fix that one tomorrow
u

user

11/11/2020, 3:30 AM
sounds like maybe adwords too.
u

user

11/11/2020, 3:30 AM
can we just fix those two and not mess with entrypoint.py?
u

user

11/11/2020, 3:30 AM
and salesforce? or was that something else
u

user

11/11/2020, 3:30 AM
I think it can also happen in sql sources too
u

user

11/11/2020, 3:30 AM
depending on the data
u

user

11/11/2020, 3:30 AM
yeah…. i think if we put a stop gap thing in entrypoint.py we will cause new bugs
u

user

11/11/2020, 3:31 AM
yeah...
u

user

11/11/2020, 3:31 AM
i think we try our best to get by with what we have now. fix the sources we know are really bad and then gun for doing the right thing after release as a top priority.
u

user

11/11/2020, 3:31 AM
is it actually hard to support all casings in the jdbc destinations?
u

user

11/11/2020, 3:32 AM
is there a reason we can't just quote all table names and call it a day?
u

user

11/11/2020, 3:32 AM
you’re just talking for sources?
u

user

11/11/2020, 3:32 AM
no in the destinations
u

user

11/11/2020, 3:32 AM
basically support all cases
u

user

11/11/2020, 3:32 AM
🤦‍♀️ right. dumb question.
u

user

11/11/2020, 3:33 AM
i don’t know enough off the top of my head to know if that will work.
u

user

11/11/2020, 3:33 AM
yeah I don't want to make any decisions on this at this stage
u

user

11/11/2020, 3:33 AM
going to sleep on it and revisit in the morning
u

user

11/11/2020, 3:33 AM
🤝
u

user

11/11/2020, 3:33 AM
maybe poke around a bit to see how we're quoting
u

user

11/11/2020, 3:33 AM
🥨 🥨 🥨 🥨
u

user

11/11/2020, 3:33 AM
agreed.
u

user

11/11/2020, 4:28 PM
thought about this some more this morning. i think a lot of my thinking was wrong.
u

user

11/11/2020, 4:28 PM
i think we should avoid touching sources at all pre-release if we can.
u

user

11/11/2020, 4:29 PM
i think investing in getting destinations to normalize stuff is generally worthwhile.
u

user

11/11/2020, 4:31 PM
priority of work from my point of view now is as follows:
• (pre-release) add to standard tests a data set that includes streams / fields with spaces and weird casing in the name (something like the sketch below). get all destinations (with normalization) to succeed given these cases
• (pre-release or right after release) add to sync worker something that applies a normalization to stream and field names before they get to destinations (remove white space, strip out special characters)
• (post-release) at least highly encourage that sources conform to a certain set of rules but maybe don’t fail on them.
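A hypothetical sketch of the kinds of names that data set would need to cover (illustrative only, not the actual test fixtures):

TEST_STREAM_NAMES = [
    "click performance report",   # spaces
    "ClickPerformanceReport",     # camel case
    "CLICK_PERFORMANCE_REPORT",   # all caps
    "ad_groups",                  # already snake case
]
TEST_FIELD_NAMES = ["Campaign Name", "adGroupId", "COST_MICROS", "clicks"]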
u

user

11/11/2020, 4:32 PM
i could be convinced that i have the order of steps 1 and 2 wrong. what do you think.
u

user

11/11/2020, 4:33 PM
Pre release we can add quoting to identifiers in the Postgres destination too
u

user

11/11/2020, 4:34 PM
I think it’s easier to support case sensitivity short term than it is to consider name normalization
u

user

11/11/2020, 4:34 PM
i came to this ordering because destinations will need to do some name normalization on what comes in no matter what: either they receive any possible string, or they receive a string that is already normalized to whatever standard airbyte sets (but not necessarily what the db supports). it will also be easy to test this and give us high confidence it’s working in the standard tests.
u

user

11/11/2020, 4:35 PM
i have meeting with zazmic and need to review their code now, but i can work on adding the extra data set to the standard tests once i’m done unless someone beats me to it.
u

user

11/11/2020, 4:35 PM
jared, sounds like you’re focusing on fixing the pg stuff?
u

user

11/11/2020, 4:36 PM
Yeah I think it’s a short fix, I’ll start on it in 10-15min
u

user

11/11/2020, 4:36 PM
kk
u

user

11/11/2020, 4:52 PM
Pre release we can add quoting to identifiers in the Postgres destination too
That's what I end up doing in the normalization pieces with DBT, with the default quoting parameters set to true in the sources.yml file, but we can actually play with those settings to enable or disable quoting...
u

user

11/11/2020, 5:59 PM
going to publish postgres with this change
u

user

11/11/2020, 5:59 PM
hotfixing the current version
u

user

11/11/2020, 6:00 PM
tu
u

user

11/11/2020, 6:05 PM
done
u

user

11/11/2020, 6:12 PM
Exception in thread "main" org.jooq.exception.DataAccessException: SQL [insert into streamWithCamelCase_1605118306833 (ab_id, data, emitted_at) values (?, cast(? as jsonb), cast(? as timestamp with time zone))]; ERROR: relation "streamwithcamelcase_1605118306833" does not exist
u

user

11/11/2020, 6:12 PM
i’m getting this error. should your change have fixed this?
u

user

11/11/2020, 6:17 PM
or was that just for whitespace?
u

user

11/11/2020, 6:27 PM
@Jared Rhizor (Airbyte)
u

user

11/11/2020, 6:27 PM
might fix this
u

user

11/11/2020, 6:28 PM
did you pull the latest version?
u

user

11/11/2020, 6:28 PM
yes
u

user

11/11/2020, 6:28 PM
I would expect the logged sql to include the quotes then
u

user

11/11/2020, 6:28 PM
→ docker image ls airbyte/destination-postgres:0.1.1
REPOSITORY                     TAG                 IMAGE ID            CREATED             SIZE
airbyte/destination-postgres   0.1.1               2d00d190bca9        28 minutes ago      440MB
u

user

11/11/2020, 6:29 PM
Can you verify you're on that image?
u

user

11/11/2020, 6:29 PM
And also, is the test you're running based off of a dev image or 0.1.1? If it's off of dev, have you merged master into your test branch?
u

user

11/11/2020, 6:33 PM
yes to all of those things.
u

user

11/11/2020, 6:33 PM
i’m using :dev tag
u

user

11/11/2020, 6:34 PM
and your change is included
u

user

11/11/2020, 6:35 PM
oh this is the insert
u

user

11/11/2020, 6:35 PM
so not impacted by my change
u

user

11/11/2020, 6:35 PM
and this is postgres?
u

user

11/11/2020, 6:35 PM
I thought insert would quote
u

user

11/11/2020, 6:35 PM
maybe there's some jooq setting
u

user

11/11/2020, 6:36 PM
By default, jOOQ will always generate quoted names for all identifiers
u

user

11/11/2020, 6:37 PM
i’m going to get the new standard tests in a working state and then we can just divide up destinations and get them passing.
u

user

11/11/2020, 7:07 PM
do you want me to include field names that include special characters, e.g. field*name*with*asterisk*, or is that overkill for right now?
u

user

11/11/2020, 7:07 PM
I think case sensitivity with _ is all we've seen and all we reasonably need to support
u

user

11/11/2020, 7:08 PM
google sheets regularly has spaces too.
u

user

11/11/2020, 7:08 PM
I'm fine failing if anyone does something super$nonstandard!@
u

user

11/11/2020, 7:08 PM
oh
u

user

11/11/2020, 7:08 PM
hmm
u

user

11/11/2020, 7:08 PM
i’ll do spaces but not asterisk.
u

user

11/11/2020, 7:08 PM
google sheets probably has special characters too
u

user

11/11/2020, 7:08 PM
cool?
u

user

11/11/2020, 7:08 PM
oh yeah. i guess $ could be a thing
u

user

11/11/2020, 7:09 PM
i mean how much extra work is it to normalize everything that’s not a normal character to an underscore for our existing destinations?
u

user

11/11/2020, 7:09 PM
if it’s little extra work then we should just do it now.
u

user

11/11/2020, 7:09 PM
maybe?
u

user

11/11/2020, 7:09 PM
I mean we can replace special chars with an _
u

user

11/11/2020, 7:10 PM
but then it’s ambiguous because x$*y == x*$y
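Concretely (a toy example of the collision; to_underscores is a hypothetical helper):

import re

def to_underscores(name: str) -> str:
    # Replace every character outside [A-Za-z0-9] with an underscore.
    return re.sub(r"[^A-Za-z0-9]", "_", name)

print(to_underscores("x$*y"))  # x__y
print(to_underscores("x*$y"))  # x__y -- two different source fields, one destination column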
u

user

11/11/2020, 7:10 PM
are there any destinations that don't support special characters?
u

user

11/11/2020, 7:10 PM
snowflake/postgres do?
u

user

11/11/2020, 7:10 PM
i don’t know.
u

user

11/11/2020, 7:10 PM
bigquery does too
u

user

11/11/2020, 7:11 PM
and i guess the filesystem is pretty tolerant for the csv stuff.
u

user

11/11/2020, 7:11 PM
yeah
u

user

11/11/2020, 7:11 PM
I guess we should just add some special character tests then
u

user

11/11/2020, 7:12 PM
agreed
u

user

11/11/2020, 7:15 PM
ok
u

user

11/11/2020, 7:15 PM
i’ll do this.
u

user

11/11/2020, 7:23 PM
u

user

11/11/2020, 7:23 PM
@s re-requested your review since it changed kinda substantially from when you looked at it an hour ago.
u

user

11/11/2020, 7:23 PM
plan is to merge this. it will break integration tests for bq, snowflake, and pg, and then we split them up and fix them.