# troubleshooting
m
Does anybody know if Flink SQL Parquet tables are compatible with Spark parquet tables?
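(For context, I mean a table declared roughly like this. This is only a sketch, and the table, column, and path names are made up for illustration.)
```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class ParquetTableSketch {
    public static void main(String[] args) throws Exception {
        TableEnvironment tEnv = TableEnvironment.create(EnvironmentSettings.inBatchMode());

        // A filesystem table backed by the Parquet format, partitioned by lastName.
        tEnv.executeSql(
            "CREATE TABLE customers (" +
            "  id BIGINT," +
            "  lastName STRING," +
            "  amount DOUBLE" +
            ") PARTITIONED BY (lastName) WITH (" +
            "  'connector' = 'filesystem'," +
            "  'path' = 's3://my-bucket/customers'," +   // hypothetical path
            "  'format' = 'parquet'" +
            ")");

        tEnv.executeSql("INSERT INTO customers VALUES (1, 'Villalobos', 9.99)").await();
    }
}
```
The question is whether Spark can read the files that a table like this writes out.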
m
Thank you Martijn. However, what is not clear to me is that it says "compatible with Apache Hive, but different with Apache Spark", which does not mean incompatible. I read that table to mean that there is compatibility, but that the types are slightly different.
m
From what I’ve heard, the Parquet files that are generated by Flink can’t be read by Spark, only by Hive
m
I think it would be a good thing if they were fully compatible with each other.
It would increase the interoperability of Flink.
m
Definitely, but it still requires someone to open the PR, and we need to find someone who can review it. It would also need to be configurable, so that there is compatibility with both.
n
The only 2 differences we saw when replacing Spark with Flink were:
• Flink doesn't add the partition fields and values to the Parquet files, while Spark does.
• (probably unrelated to your query, but nonetheless) Spark adds a `_symlink_format_manifest` while Flink doesn't. This is used in most cases when importing into something like Redshift.
In one of our end-to-end tests we use Spark to assert that the Flink-produced data is correct, and it works fine; a rough sketch of that read path is below.
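(Sketch only; the paths and app name are made up, not our actual setup.) Spark's partition discovery recovers the partition column from the Hive-style directory names even though Flink didn't write it into the Parquet files themselves:
```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ReadFlinkOutputWithSpark {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("assert-flink-parquet")   // hypothetical app name
            .getOrCreate();

        // Partition discovery turns directories like lastName=Villalobos/ back into
        // a lastName column, since the value is not stored inside the Parquet files.
        Dataset<Row> df = spark.read()
            .option("basePath", "s3://my-bucket/customers")   // hypothetical path
            .parquet("s3://my-bucket/customers");

        df.printSchema();   // lastName shows up as a discovered partition column
        df.show();
    }
}
```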
m
Hi @Nicholas Erasmus. One odd thing about Flink's Parquet interface is that the SQL API does not require Avro, whereas the DataStream API does. Which Flink Parquet API did you use? Also, what do you mean by "Flink doesn't add the partition fields and values to the Parquet files"? I don't understand what that means. What I noticed when I created Parquet files with the Flink SQL API is that the partition name became part of the file path. For example, if the partition was by lastName, then the path might be `lastName=Villalobos/part-586e1629-8de8-4071-989f-c763657fad3b-0-107`. So in that sense, the partition and partition value are added to the file. Is that what you mean?
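(By the DataStream route I mean roughly the following, assuming a recent Flink where `AvroParquetWriters` is available; the POJO, path, and job name are made up. The bulk writer there needs an Avro schema, whereas the SQL `parquet` format has no Avro dependency in the DDL.)
```java
import org.apache.flink.connector.file.sink.FileSink;
import org.apache.flink.core.fs.Path;
import org.apache.flink.formats.parquet.avro.AvroParquetWriters;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class DataStreamParquetSketch {

    // Hypothetical POJO; AvroParquetWriters derives an Avro schema from it by reflection.
    public static class Customer {
        public long id;
        public String lastName;
        public double amount;
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Placeholder source, just to have something to write.
        DataStream<Customer> customers = env.fromElements(new Customer());

        // The DataStream route goes through Avro: the writer factory is built from an
        // Avro schema (derived by reflection here), unlike the SQL 'parquet' format.
        FileSink<Customer> sink = FileSink
                .forBulkFormat(new Path("s3://my-bucket/customers"),   // hypothetical path
                        AvroParquetWriters.forReflectRecord(Customer.class))
                .build();

        customers.sinkTo(sink);
        env.execute("datastream-parquet-sketch");
    }
}
```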
n
@Marco Villalobos I should’ve probably mentioned that we are using the Flink Delta Connector.
> So in that sense, the partition and partition value are added to the file. Is that what you mean?
Yes, exactly. So it is added in the sense you mention, but the file itself isn't identical to what Spark produces. I've only worked a tiny bit with Spark, while migrating our service over to Flink, but these differences did cause issues for our downstream customers.