Raghunadh Nittala
06/08/2023, 2:32 PM
CREATE TABLE sink_table_s3 (
event_id STRING NOT NULL,
event_type STRING NOT NULL,
event_name STRING NOT NULL,
eventId STRING NOT NULL,
eventName STRING NOT NULL,
`date` STRING
) PARTITIONED BY (eventId, eventName, `date`) WITH (
'connector' = 'filesystem',
'path' = '<path>',
'format' = 'parquet',
'auto-compaction' = 'true'
);
Insert query:
INSERT INTO sink_table_s3
SELECT event_id, event_type, event_name,
event_id AS eventId, event_name AS eventName, DATE_FORMAT(proc_time, 'yyyy-MM-dd') AS `date`
FROM source_table;
I’m adding eventId, eventName just to make sure those columns are also available in the Parquet file in S3. How can we avoid small files being created?
Martijn Visser
06/08/2023, 3:44 PM
Martijn Visser
06/08/2023, 3:45 PM
sap1ens
06/08/2023, 3:51 PM
We are creating 1 parquet file per day in S3.
PARTITIONED BY (eventId, eventName, `date`)
So it actually looks like a file per eventId, eventName and date? This seems like a lot of files.
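One way to act on this observation is to partition by `date` only, so each day maps to a single directory instead of one directory per eventId/eventName/date combination. The following is a minimal sketch, not a confirmed fix for this job. It assumes downstream readers do not rely on eventId or eventName directories for pruning, and the table name `sink_table_s3_by_date` is hypothetical. Because event_id and event_name are no longer partition columns here, they are written into the Parquet files as ordinary columns, which should make the duplicated eventId and eventName columns unnecessary.

```sql
-- Hypothetical variant of the sink table: only `date` is a partition column.
-- event_id and event_name stay as regular columns inside the Parquet files.
CREATE TABLE sink_table_s3_by_date (
  event_id STRING NOT NULL,
  event_type STRING NOT NULL,
  event_name STRING NOT NULL,
  `date` STRING
) PARTITIONED BY (`date`) WITH (
  'connector' = 'filesystem',
  'path' = '<path>',
  'format' = 'parquet',
  'auto-compaction' = 'true'
);

-- The SELECT list follows the column order of the table definition,
-- since INSERT INTO ... SELECT maps columns by position.
INSERT INTO sink_table_s3_by_date
SELECT event_id, event_type, event_name,
  DATE_FORMAT(proc_time, 'yyyy-MM-dd') AS `date`
FROM source_table;
```

The trade-off is that queries filtering on event_id or event_name can no longer skip directories and have to rely on Parquet column statistics instead.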
Raghunadh Nittala
06/08/2023, 4:09 PM
Martijn Visser
06/08/2023, 4:10 PM
Raghunadh Nittala
06/08/2023, 4:10 PM
Raghunadh Nittala
06/08/2023, 4:11 PM
Martijn Visser
06/08/2023, 4:14 PM
Martijn Visser
06/08/2023, 4:15 PM
Only files in a single checkpoint are compacted, that is, at least the same number of files as the number of checkpoints is generated.
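Since compaction only merges files within one checkpoint, the checkpoint interval effectively sets a floor on the file count. A rough sketch of the knobs involved, under stated assumptions: the 10-minute interval and the 128MB target are illustrative values rather than recommendations from this thread, and the table name `sink_table_s3_tuned` is hypothetical. With a 10-minute interval there are 6 checkpoints per hour, so a partition that receives data in every checkpoint would end up with at least 6 × 24 = 144 files per day.

```sql
-- Illustrative value: a longer interval means fewer checkpoints and fewer files,
-- at the cost of data becoming visible in S3 later.
SET 'execution.checkpointing.interval' = '10min';

-- Same schema and partitioning as sink_table_s3, with an explicit compaction target size.
CREATE TABLE sink_table_s3_tuned (
  event_id STRING NOT NULL,
  event_type STRING NOT NULL,
  event_name STRING NOT NULL,
  eventId STRING NOT NULL,
  eventName STRING NOT NULL,
  `date` STRING
) PARTITIONED BY (eventId, eventName, `date`) WITH (
  'connector' = 'filesystem',
  'path' = '<path>',
  'format' = 'parquet',
  'auto-compaction' = 'true',
  -- Target size of a compacted file (assumed value).
  'compaction.file-size' = '128MB'
);
```

Getting down to a single file per day would likely need a separate batch job that rewrites each day's partition once it is complete, because the streaming sink cannot merge files across checkpoints.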
Raghunadh Nittala
06/09/2023, 12:02 AM
sap1ens
06/09/2023, 3:06 AM
sap1ens
06/09/2023, 3:07 AM
Raghunadh Nittala
06/09/2023, 12:15 PM