# ask-community-for-troubleshooting
p
Hey everybody, I have a question regarding the data types of the parquet format in the S3 destination. As source I want to sync data from a PostgreSQL database. The PostgreSQL DDL looks like this:
CREATE TABLE public.task_events (
    id serial4 NOT NULL,
    task_id int4 NULL,
    event_name varchar(20) NULL,
    ts timestamp NOT NULL,
    CONSTRAINT task_events_pkey PRIMARY KEY (id)
);
CREATE INDEX ix_task_events_task_id ON public.task_events USING btree (task_id);
CREATE INDEX ix_task_events_ts ON public.task_events USING btree (ts);
When I obtain the JSON schema from the Airbyte API, I receive these types (they look good to me):
print(json_schema['properties']['id'])
{'type': 'number', 'airbyte_type': 'integer'}
print(json_schema['properties']['ts'])
{'type': 'string', 'format': 'date-time', 'airbyte_type': 'timestamp_without_timezone'}
print(json_schema['properties']['task_id'])
{'type': 'number', 'airbyte_type': 'integer'}
print(json_schema['properties']['event_name'])
{'type': 'string'}
Now I check the .parquet files written to S3 by Airbyte in AWS Glue and it has these types:
id double
ts struct
task_id double
event_name string
So I guess the S3/Parquet destination converted id and task_id to double instead of int. This leads to IDs like "4.2108168E7" instead of "42108168". I hope someone can help with this. Thanks in advance! 🙂
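As a stopgap until the destination is fixed, the Airbyte JSON schema itself can be used to coerce the doubles back to integers after reading the rows. This is a minimal sketch, not part of Airbyte; it assumes the `airbyte_type: integer` annotation shown above is the source of truth for which fields should be whole numbers:

```python
import json

# The JSON schema Airbyte reports for this stream (copied from above):
# fields tagged airbyte_type=integer should be ints, even if a writer
# stored them as doubles.
json_schema = {
    "properties": {
        "id": {"type": "number", "airbyte_type": "integer"},
        "task_id": {"type": "number", "airbyte_type": "integer"},
        "event_name": {"type": "string"},
    }
}

int_fields = {
    name for name, spec in json_schema["properties"].items()
    if spec.get("airbyte_type") == "integer"
}

def coerce(record):
    """Cast float values back to int for fields Airbyte marks as integer."""
    return {
        k: int(v) if k in int_fields and v is not None else v
        for k, v in record.items()
    }

row = json.loads('{"id": 4.2108168e7, "task_id": 7.0, "event_name": "done"}')
print(coerce(row))  # {'id': 42108168, 'task_id': 7, 'event_name': 'done'}
```

This only repairs values after the fact; it doesn't change what Airbyte writes to S3.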
πŸ’― 1
πŸ‘ 2
f
I’d love to see this issue resolved. Currently we have to build our own JSON-to-Parquet conversion for Airbyte-based data sources, which works, but I’d prefer to rely on your solution!
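The heart of a custom converter like the one Floyd describes is mapping Airbyte's `(type, airbyte_type, format)` triples to explicit Parquet/Arrow types, so integers don't silently widen to double. The mapping below is a sketch based only on the types appearing in this thread, not Airbyte's actual implementation:

```python
def arrow_type_for(spec):
    """Pick an Arrow/Parquet type name for an Airbyte JSON-schema field.
    Assumption: only the type combinations seen in this thread are handled."""
    t = spec.get("type")
    if t == "number" and spec.get("airbyte_type") == "integer":
        return "int64"          # instead of letting it widen to double
    if t == "number":
        return "double"
    if t == "string" and spec.get("format") == "date-time":
        return "timestamp[us]"  # a plain timestamp column, not a struct
    if t == "string":
        return "string"
    raise ValueError(f"unmapped Airbyte type: {spec!r}")

# The schema fields from the question above:
schema = {
    "id": {"type": "number", "airbyte_type": "integer"},
    "ts": {"type": "string", "format": "date-time",
           "airbyte_type": "timestamp_without_timezone"},
    "event_name": {"type": "string"},
}
print({name: arrow_type_for(spec) for name, spec in schema.items()})
# {'id': 'int64', 'ts': 'timestamp[us]', 'event_name': 'string'}
```

In a real converter, these type names would be fed to a Parquet writer (e.g. pyarrow) when building the file schema.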
u
Hi, Floyd! Sorry to hear you're running into this issue. It looks like the types are not being converted correctly in Parquet for other users as well; I've found these discussions/issues: https://github.com/airbytehq/airbyte/issues/17564 https://github.com/airbytehq/airbyte/pull/14362#discussion_r928033781 In your case it looks like int4 should be converting correctly to int32. I'm looking into why that's happening, but while I do, I'd say keep using the converter you've built!
πŸ™ 1
f
Hey Nataly, thank you! I’m glad to hear that it’s being discussed, I’ll watch the issues to see if we could migrate our Parquet conversion to Airbyte in the future πŸ™‚
p
Thank you for answering, Nataly! Do you think this issue is covered by the GitHub links you provided, or should I create a separate issue?
u
Create a separate one and tag me in it as @natalyjazzviolin ! I'll triage and bring it to the attention of the team!
πŸ™ 1
p
Thank you again Nataly, here is the issue https://github.com/airbytehq/airbyte/issues/18922
u
Got it, thanks!
j
Hi, I'm having the same issue for timezone-aware PostgreSQL datetimes: 1. they are converted to a struct field {member0: datetime, member1: string} in Parquet; 2. there's a loss of information about the timezone, which I expected to find in 'member1', but all values are null.
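Until the struct encoding is fixed, one workaround is to unwrap `member0` and reattach a timezone known from the source database, since `member1` comes back null. A minimal sketch; the `member0`/`member1` field names come from the message above, and the UTC fallback is an assumption you'd adjust to your database's actual timezone:

```python
from datetime import datetime, timezone

def unwrap_ts(struct_value, fallback_tz=timezone.utc):
    """Recover a timezone-aware datetime from the {member0, member1} struct.
    member1 (the timezone string) is observed to be null, so we fall back
    to an assumed timezone instead."""
    dt = struct_value["member0"]
    if struct_value["member1"] is None:
        return dt.replace(tzinfo=fallback_tz)
    return dt  # if member1 were populated, parse its offset here instead

row = {"member0": datetime(2022, 11, 4, 12, 30), "member1": None}
print(unwrap_ts(row).isoformat())  # 2022-11-04T12:30:00+00:00
```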