# troubleshooting
r
Hey Everyone, We’re using Flink SQL to save data to S3 using the ‘filesystem’ connector. We’ve added the partitions to the table as well. The issue we see is that so many small files are being created in S3. We are creating 1 parquet file per day in S3. Sink table query:
CREATE TABLE sink_table_s3 (
  event_id STRING NOT NULL,
  event_type STRING NOT NULL,
  event_name STRING NOT NULL,
  eventId STRING NOT NULL,
  eventName STRING NOT NULL,
  `date` STRING
) PARTITIONED BY (eventId, eventName, `date`) WITH (
  'connector' = 'filesystem', 
  'path' = '<path>', 
  'format' = 'parquet',
  'auto-compaction' = 'true'
);
Insert query:
INSERT INTO sink_table_s3
SELECT event_id, event_type, event_name,
       event_id AS eventId, event_name AS eventName,
       DATE_FORMAT(proc_time, 'yyyy-MM-dd') AS `date`
FROM source_table;
I’m adding eventId and eventName just to make sure those columns are also available in the Parquet files in S3. How can we avoid all these small files being created?
m
Don’t forget to enable checkpointing in case you haven’t done that
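For example, a minimal sketch assuming the job is submitted through the SQL client (the 5 min interval is only an illustration):
-- part files written by the filesystem sink are only finalized on checkpoint completion,
-- so without checkpointing they stay in-progress/pending forever
SET 'execution.checkpointing.interval' = '5 min';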
s
> We are creating 1 parquet file per day in S3.
> PARTITIONED BY (eventId, eventName, `date`)
So it actually looks like a file per eventId, eventName and date? This seems like a lot of files.
r
I already enabled compaction by setting ‘auto-compaction’ to true. Checkpointing is also enabled, with a 5-minute interval.
m
So then it will take 5 mins before compaction happens. Are you checking after that time or in between?
r
@sap1ens - In our case, one eventId and one eventName can have a lot of data, and that’s why multiple files are being created. But the files are very small, around 5 KB each.
I’m checking the files that get created on the next day, since I’m partitioning by date.
m
Partitioning by date doesn’t mean that files will be compacted by date
Only files within a single checkpoint are compacted, which means at least as many files as checkpoints will be generated.
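If you stay on the filesystem connector, these are roughly the knobs to look at, as a sketch of the relevant WITH options for the sink table (the 128MB / 15 min values are only illustrations):
WITH (
  'connector' = 'filesystem',
  'path' = '<path>',
  'format' = 'parquet',
  'auto-compaction' = 'true',
  -- target size for compacted files, defaults to the rolling file size
  'compaction.file-size' = '128MB',
  -- bulk formats always roll on checkpoint; these only add size/age caps on top of that
  'sink.rolling-policy.file-size' = '128MB',
  'sink.rolling-policy.rollover-interval' = '15 min'
)
Since compaction only merges files produced within one checkpoint, widening the checkpoint interval also reduces the number of output files per partition.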
r
Okay, then how do I fix this? Since it’s a bulk format, does the rolling policy help?
s
It’s pretty hard to “fix” without using a lakehouse format like Iceberg, Hudi or Delta.
Theoretically you can compact older data that’s “finalized” by running a batch job and changing the partition location at the metastore layer (if you use one)
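Something along these lines, as a rough sketch: compacted_table_s3 is a hypothetical table with the same DDL as sink_table_s3 but a different path, and the partition values are made up.
SET 'execution.runtime-mode' = 'batch';

-- rewrite one finished day's partition into fewer, larger files under the new path;
-- afterwards the metastore partition location would be repointed outside of Flink
INSERT OVERWRITE compacted_table_s3
  PARTITION (eventId = 'some_event', eventName = 'some_name', `date` = '2024-01-01')
SELECT event_id, event_type, event_name
FROM sink_table_s3
WHERE eventId = 'some_event' AND eventName = 'some_name' AND `date` = '2024-01-01';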
r
Thanks for the info, I’ll check this