# random
f
hey folks, another flink question. for simpler data pipelines in an organisation, has anyone found a preference between Flink SQL and something like PyFlink? we're leaning towards Flink SQL since everyone is familiar with it, it exposes less 'power' than Python, and it has no performance overhead compared to PyFlink. sadly, flink-java isn't really an option here, beyond maybe abstracting away some common functionality in a configuration-driven approach.
the goal we have is to unlock stateful stream processing as a more feasible and easily accessible thing for people in the tech and data orgs
g
My personal suggestion is to always use Flink SQL, unless there is something quite sophisticated that requires lower-level primitives
f
interesting 😄 can i ask why?
g
It's simple, easy to use, and understood by many different teams. Unless there is a use case that requires more fine-grained control over time or state management, SQL should be the way to go
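To make the "stateful processing for free" point concrete: an aggregation that would need keyed state and timers in the DataStream API is a few lines of Flink SQL. A sketch using the windowing table-valued function syntax, assuming a hypothetical `clicks` table with an event-time column `ts`:

```sql
-- Hypothetical example: 5-minute tumbling-window counts per user.
-- Flink maintains the window state and cleanup automatically; no
-- explicit state descriptors or timer callbacks are needed.
SELECT window_start, window_end, user_id, COUNT(*) AS click_count
FROM TABLE(
  TUMBLE(TABLE clicks, DESCRIPTOR(ts), INTERVAL '5' MINUTES))
GROUP BY window_start, window_end, user_id;
```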
CDC and lakehouse use cases also become extremely simple.
f
that's a fair point! i'll do some research into how feasible this is.
CDC and lakehouse use cases also become extremely simple.
can i ask why is this the case again sorry?
g
Flink CDC 2.0+ has undergone lots of work to provide the best CDC support out there. Combine it with lakehouse technologies like Apache Paimon (prev. Flink Table Store) and you can easily implement data synchronisation and other use cases. Assume you have a database with 100, 1,000 or more tables and you want to synchronise across different databases and cheap storage. You just write a single CREATE DATABASE AS statement and the framework handles everything automatically for you. For data engineers, analysts, scientists etc. that are familiar with SQL, this is powerful.
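As a sketch of what the SQL-only CDC side looks like for a single table (the table, column names, and connection details below are made up; `mysql-cdc` is the Flink CDC MySQL connector):

```sql
-- Hypothetical single-table CDC source. CREATE DATABASE AS extends
-- this idea to whole databases on platforms that support it.
CREATE TABLE orders_cdc (
  order_id INT,
  amount DECIMAL(10, 2),
  PRIMARY KEY (order_id) NOT ENFORCED
) WITH (
  'connector' = 'mysql-cdc',
  'hostname' = 'mysql-host',
  'port' = '3306',
  'username' = 'flink',
  'password' = '***',       -- placeholder
  'database-name' = 'shop',
  'table-name' = 'orders'
);
```

From there, `INSERT INTO` a Paimon (or other lakehouse) table is all it takes to keep the downstream copy in sync.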
🙌 2
f
thank you Giannis this is really insightful. i'll do some more reading/research!
🙏 1
g
It's on my backlog, will be sharing more around these ~end of September / early October. 😄
f
i'll keep my eyes peeled 👀
😁 1
is it difficult to use Flink SQL when we're sourcing/sinking to Kafka using Protobuf, do you know?
g
Tbh personally I have only used JSON and Avro, but overall I haven't seen anything that indicates it's hard. A quick PoC should verify this, though
f
👍 i'll give this a look! thanks again
s
PyFlink also supports SQL, and I have sourced/sunk protobuf data using it.
f
i've been looking into it briefly and it seems a `protobuf` format for a table connector is only available in 1.17 😕
i was thinking i could do something like table env <-> data stream and then re-serialize from bytes, at the cost of larger messages being shared across operators?
s
I'm not following. Protobuf has been supported with table connectors for a while now.

```sql
CREATE TABLE my_table (
  -- columns elided
) WITH (
  'connector' = 'kafka',
  'property-version' = 'universal',
  'topic' = 'mytopic',
  'properties.bootstrap.servers' = 'kafka:9092',
  'scan.startup.mode' = 'earliest-offset',
  'format' = 'protobuf',
  'protobuf.message-class-name' = 'com.foo.Message',
  'protobuf.ignore-parse-errors' = 'false'
);
```
f
huh ok, the jira ticket for adding protobuf support says 1.16
s
Yes.. 1.16.0, but the current release is 1.17.1. I am not sure why people would not choose PyFlink/SQL for most basic use cases nowadays. Everyone knows Python and SQL.
f
ah right sorry i forgot to clarify. we're using AWS managed flink which is pinned to version 1.15.x
👍 1
personally i'd rather opt for PyFlink, but only because i'm more familiar with those concepts than the table api + the testing support is useful when building a pipeline 😬
s
AWS definitely doesn't support pyflink very well/at all. I was never able to get that to work using their managed services.
f
oh believe me i've been through that one 😆 our biggest flink pipeline in the company runs on PyFlink and it took a while. that one was pinned to version 1.13.2
i think Flink SQL might be the option here i just need to figure out how the protobuf serde is going to go
Hey Giannis, what do you think is the ideal way to abstract the deployment mechanism for Flink SQL apps? is there any guidance around this?