How can I use rewriteDataFiles method with sort te...
# random
a
How can I use rewriteDataFiles method with sort technique, I have column name timestamp in iceberg table by which i need to sort
Copy code
RewriteDataFilesActionResult result = Actions.forTable(table)
        .rewriteDataFiles()
        .execute();
Also can i delete orphan files using Flink?
p
I think you need to specify
.option
Copy code
RewriteDataFilesActionResult result = Actions.forTable(table)
        .option("strategy", "sort")
        .option("sort_order", "id DESC NULLS LAST,name ASC NULLS FIRST")
        .rewriteDataFiles()
        .execute();
Read this guide about which options can be passed into this action: https://iceberg.apache.org/docs/latest/spark-procedures/#rewrite_data_files It's for spark, but flink should support the same I think, iceberg docs are behind what's implemented. Regarding deleting orphan files using Flink, you can't delete them using Flink, it's not implemented, so you will need to calculate the difference between what is in your table folder like so: 1. get all the files from all your snapshots (https://iceberg.apache.org/docs/latest/flink-queries/#all-data-files)
Copy code
sql> SELECT * FROM prod.db.table$all_data_files;
2) list the files from S3/HDFS in your table's folder 3) calculate the difference. This difference will be the files that you will need to delete. Then you take these files, iterate over them one by one (I guess you can parallelize this as well) and remove them using the S3/HDFS API. I think you may get more help if you use iceberg workspace instead of Flink.
a
can i rewrite file for specific day? like i want to rewrite file file for yesterday, and i'll write scheduler for the same
p
In spark actions there's this:
Copy code
Table table = ...
SparkActions
    .get()
    .rewriteDataFiles(table)
    .filter(Expressions.equal("date", "2020-08-18"))
    .option("target-file-size-bytes", Long.toString(500 * 1024 * 1024)) // 500 MB
    .execute();
See if Flink also has the same thing