I want to write data in Parquet format from a stream job and then read it with the SQL API through the SQL Gateway or SQL Client (in a different job / session). The SQL API documentation doesn't mention Avro or Protobuf when dealing with Parquet, but the DataStream API requires one of them for the File Sink connector. Does this mean I must convert the stream to a table and sink it with the SQL API? Please advise.
Martijn Visser
05/22/2023, 12:42 PM
Is there any reason why you don't immediately read the Parquet file in your SQL application?
Marco Villalobos
05/23/2023, 5:09 PM
Yes, the reason is that they don't exist yet.
Thus, I want a stream job to create the Parquet-formatted data.
And then I want to use an SQL Remote Gateway to read the data saved in Parquet.
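Something like this is what I have in mind for the write side (a rough sketch, assuming Flink 1.15+ with flink-parquet and its Avro/Hadoop dependencies on the classpath; OldReading and the paths are made up for illustration):

```java
import org.apache.flink.connector.file.sink.FileSink;
import org.apache.flink.core.fs.Path;
import org.apache.flink.formats.parquet.avro.AvroParquetWriters;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.filesystem.bucketassigners.BasePathBucketAssigner;

public class OldDataToParquetJob {

    // Hypothetical POJO for the "old" records; Avro reflection derives the Parquet schema from it.
    public static class OldReading {
        public String sensorId;
        public long timestampMillis;
        public double value;
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(60_000); // bulk formats roll part files on checkpoint

        DataStream<OldReading> oldData = readOldData(env); // the actual source is out of scope here

        // The Avro dependency is only the writer-side object model; the files on disk are Parquet.
        FileSink<OldReading> parquetSink = FileSink
                .forBulkFormat(
                        new Path("s3://my-bucket/old-data-parquet"),
                        AvroParquetWriters.forReflectRecord(OldReading.class))
                // keep a flat output directory instead of the default date-based bucketing
                .withBucketAssigner(new BasePathBucketAssigner<>())
                .build();

        oldData.sinkTo(parquetSink);
        env.execute("old-data-to-parquet");
    }

    private static DataStream<OldReading> readOldData(StreamExecutionEnvironment env) {
        // Placeholder for whatever source actually holds the old data.
        return env.fromElements(new OldReading());
    }
}
```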
Martijn Visser
05/23/2023, 7:13 PM
I see that I indeed wrote “read” instead of “write”. Basically I was wondering why you wouldn’t build your entire pipeline in SQL. I don’t know what your source is or what business logic you need to apply, but I could imagine that would make things easier
Marco Villalobos
05/23/2023, 8:28 PM
Oh, I can answer that too. This idea to read our "old" data with SQL is a new initiative. So, first, we need to transform the old data into Parquet (that's the stream job I am talking about above). Then, there is another task to change our current real-time data stream pipeline to also save the new data in Parquet format.
Marco Villalobos
05/23/2023, 8:29 PM
I am just trying to find out if Parquet written by the Flink DataStream API (which depends on Avro or Protocol Buffers) can be read by the Flink SQL API (which has no such dependency), and vice versa.
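On the read side, this is roughly what I would try through the SQL Gateway / SQL Client (a sketch; the column names and types are assumptions mirroring the hypothetical OldReading POJO above, and it assumes flink-sql-parquet is available to the session):

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class ReadOldParquet {

    public static void main(String[] args) {
        // Batch mode, since the old data is a bounded set of Parquet files.
        TableEnvironment tEnv = TableEnvironment.create(EnvironmentSettings.inBatchMode());

        // The same DDL could be submitted through the SQL Client / SQL Gateway instead.
        // Columns here are assumed to match what the stream job wrote.
        tEnv.executeSql(
                "CREATE TABLE old_data (" +
                "  sensorId STRING," +
                "  timestampMillis BIGINT," +
                "  `value` DOUBLE" +
                ") WITH (" +
                "  'connector' = 'filesystem'," +
                "  'path' = 's3://my-bucket/old-data-parquet'," +
                "  'format' = 'parquet'" +
                ")");

        tEnv.executeSql("SELECT * FROM old_data LIMIT 10").print();
    }
}
```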