# troubleshooting
m
Does anybody know if Flink SQL Parquet tables are compatible with Spark parquet tables?
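(For context, I mean a table declared roughly like this. This is only a sketch, and the table, column, and path names are made up for illustration.)
```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class ParquetTableSketch {
    public static void main(String[] args) throws Exception {
        TableEnvironment tEnv = TableEnvironment.create(EnvironmentSettings.inBatchMode());

        // A filesystem table backed by the Parquet format, partitioned by lastName.
        tEnv.executeSql(
            "CREATE TABLE customers (" +
            "  id BIGINT," +
            "  lastName STRING," +
            "  amount DOUBLE" +
            ") PARTITIONED BY (lastName) WITH (" +
            "  'connector' = 'filesystem'," +
            "  'path' = 's3://my-bucket/customers'," +   // hypothetical path
            "  'format' = 'parquet'" +
            ")");

        tEnv.executeSql("INSERT INTO customers VALUES (1, 'Villalobos', 9.99)").await();
    }
}
```
The question is whether Spark can read the files that a table like this writes out.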
m
Thank you Martijn. However, what is not clear to me is that it says "compatible with Apache Hive, but different with Apache Spark", which does not mean incompatible. I read that table to mean that there is compatibility, but that the types are slightly different.
m
From what I’ve heard, the Parquet files that are generated by Flink can’t be read by Spark, only by Hive
m
I think it would be a good thing if they were fully compatible with each other.
It would increase the interoperability of Flink.
m
Definitely, but it still requires someone to open the PR, and we need to find someone who can review it. It would also need to be configurable, so that there is compatibility with both.
n
The only 2 differences we saw when replacing Spark with Flink were:
• Flink doesn't add the partition fields and values to the Parquet files, while Spark does.
• (probably unrelated to your query, but nonetheless) Spark adds a `_symlink_format_manifest` while Flink doesn't. This is used in most cases when importing into something like Redshift.
In one of our end-to-end tests we use Spark to assert that the Flink-produced data is correct, and it works fine; a rough sketch of that read path is below.
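(Sketch only; the paths and app name are made up, not our actual setup.) Spark's partition discovery recovers the partition column from the Hive-style directory names even though Flink didn't write it into the Parquet files themselves:
```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ReadFlinkOutputWithSpark {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("assert-flink-parquet")   // hypothetical app name
            .getOrCreate();

        // Partition discovery turns directories like lastName=Villalobos/ back into
        // a lastName column, since the value is not stored inside the Parquet files.
        Dataset<Row> df = spark.read()
            .option("basePath", "s3://my-bucket/customers")   // hypothetical path
            .parquet("s3://my-bucket/customers");

        df.printSchema();   // lastName shows up as a discovered partition column
        df.show();
    }
}
```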
m
Hi @Nicholas Erasmus. One odd thing about Flink's Parquet interface is that the SQL API does not require Avro, whereas the DataStream API does. Which Flink Parquet API did you use? Also, what do you mean by "Flink doesn't add the partition fields and values to the Parquet files"? I don't understand what that means. What I noticed when I created Parquet files with the Flink SQL API is that the partition name became part of the file path. For example, if the partition was by lastName, then the path might be `lastName=Villalobos/part-586e1629-8de8-4071-989f-c763657fad3b-0-107`. So in that sense, the partition and partition value are added to the file. Is that what you mean?
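(By the DataStream route I mean roughly the following, assuming a recent Flink where `AvroParquetWriters` is available; the POJO, path, and job name are made up. The bulk writer there needs an Avro schema, whereas the SQL `parquet` format has no Avro dependency in the DDL.)
```java
import org.apache.flink.connector.file.sink.FileSink;
import org.apache.flink.core.fs.Path;
import org.apache.flink.formats.parquet.avro.AvroParquetWriters;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class DataStreamParquetSketch {

    // Hypothetical POJO; AvroParquetWriters derives an Avro schema from it by reflection.
    public static class Customer {
        public long id;
        public String lastName;
        public double amount;
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Placeholder source, just to have something to write.
        DataStream<Customer> customers = env.fromElements(new Customer());

        // The DataStream route goes through Avro: the writer factory is built from an
        // Avro schema (derived by reflection here), unlike the SQL 'parquet' format.
        FileSink<Customer> sink = FileSink
                .forBulkFormat(new Path("s3://my-bucket/customers"),   // hypothetical path
                        AvroParquetWriters.forReflectRecord(Customer.class))
                .build();

        customers.sinkTo(sink);
        env.execute("datastream-parquet-sketch");
    }
}
```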
n
@Marco Villalobos I should’ve probably mentioned that we are using the Flink Delta Connector.
> So in that sense, the partition and partition value are added to the file. Is that what you mean?
Yes, exactly. So it is added in the sense you mention, but the file itself isn't identical to what Spark produces. I've only worked a tiny bit with Spark, while migrating our service over to Flink, but these differences did cause issues for our downstream customers.