# random
f
hey folks, another flink question. for simpler data pipelines in an organisation, has anyone found a preference between Flink SQL and something like PyFlink? we're leaning towards Flink SQL since everyone is familiar with it, it exposes less 'power' than Python, and it has no performance overhead compared to PyFlink. sadly, flink-java isn't really an option here, beyond maybe abstracting away some common functionality in a configuration-driven approach.
the goal we have is to unlock stateful stream processing as a more feasible and easily accessible thing for people in the tech and data orgs
g
My personal suggestion is to always use Flink SQL, unless there is something quite sophisticated that requires lower-level primitives
f
interesting 😄 can i ask why?
g
It's simple, easy to use, and understood by many different teams. Unless there is a use case that requires more fine-grained control over time or state management, SQL should be the way to go
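To make the "stateful processing for free" point concrete: an aggregation that would need keyed state and timers in the DataStream API is a few lines of Flink SQL. A sketch using the windowing table-valued function syntax, assuming a hypothetical `clicks` table with an event-time column `ts`:

```sql
-- Hypothetical example: 5-minute tumbling-window counts per user.
-- Flink maintains the window state and cleanup automatically; no
-- explicit state descriptors or timer callbacks are needed.
SELECT window_start, window_end, user_id, COUNT(*) AS click_count
FROM TABLE(
  TUMBLE(TABLE clicks, DESCRIPTOR(ts), INTERVAL '5' MINUTES))
GROUP BY window_start, window_end, user_id;
```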
CDC and lakehouse use cases also become extremely simple.
f
that's a fair point! i'll do some research into how feasible this is.
CDC and lakehouse use cases also become extremely simple.
can i ask why is this the case again sorry?
g
Flink CDC 2.0+ has undergone lots of work to provide the best CDC support out there. Combine it with lakehouse technologies like Apache Paimon (prev. Flink Table Store) and you can easily implement data synchronisation and other use cases. Assume you have a database with 100, 1,000 or more tables and you want to synchronise across different databases and cheap storage. You just write a single CREATE DATABASE AS statement and the framework handles everything automatically for you. For data engineers, analysts, scientists etc. that are familiar with SQL, this is powerful.
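As a sketch of what the SQL-only CDC side looks like for a single table (the table, column names, and connection details below are made up; `mysql-cdc` is the Flink CDC MySQL connector):

```sql
-- Hypothetical single-table CDC source. CREATE DATABASE AS extends
-- this idea to whole databases on platforms that support it.
CREATE TABLE orders_cdc (
  order_id INT,
  amount DECIMAL(10, 2),
  PRIMARY KEY (order_id) NOT ENFORCED
) WITH (
  'connector' = 'mysql-cdc',
  'hostname' = 'mysql-host',
  'port' = '3306',
  'username' = 'flink',
  'password' = '***',       -- placeholder
  'database-name' = 'shop',
  'table-name' = 'orders'
);
```

From there, `INSERT INTO` a Paimon (or other lakehouse) table is all it takes to keep the downstream copy in sync.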
🙌 2
f
thank you Giannis this is really insightful. i'll do some more reading/research!
🙏 1
g
It's on my backlog, will be sharing more around these ~end of September / early October. 😄
f
i'll keep my eyes peeled 👀
😁 1
is it difficult to use Flink SQL when we're sourcing/sinking to Kafka using Protobuf, do you know?
g
Tbh personally I have only used JSON and Avro, but overall I haven't seen anything that indicates it's hard. A quick PoC should verify this, though
f
👍 i'll give this a look! thanks again
s
PyFlink also supports SQL, and I have sourced/sunk protobuf data using it.
f
i've been looking into it briefly and it seems a `protobuf` format for a table connector is only available in 1.17 😕
i was thinking i could do something like table env <-> data stream and then re-serialize from bytes, at the cost of larger messages being shared across operators?
s
I'm not following. Protobuf has been supported with table connectors for a while now.

```sql
CREATE TABLE my_table (
  -- columns elided
) WITH (
  'connector' = 'kafka',
  'property-version' = 'universal',
  'topic' = 'mytopic',
  'properties.bootstrap.servers' = 'kafka:9092',
  'scan.startup.mode' = 'earliest-offset',
  'format' = 'protobuf',
  'protobuf.message-class-name' = 'com.foo.Message',
  'protobuf.ignore-parse-errors' = 'false'
);
```
f
huh ok, the jira ticket for adding protobuf support says 1.16
s
Yes.. 1.16.0, but the current release is 1.17.1. I am not sure why people would not choose PyFlink/SQL for most basic use cases nowadays. Everyone knows Python and SQL.
f
ah right sorry i forgot to clarify. we're using AWS managed flink which is pinned to version 1.15.x
👍 1
personally i'd rather opt for PyFlink, but only because i'm more familiar with those concepts than the table api + the testing support is useful when building a pipeline 😬
s
AWS definitely doesn't support pyflink very well/at all. I was never able to get that to work using their managed services.
f
oh believe me i've been through that one 😆 our biggest flink pipeline in the company runs on PyFlink and it took a while. that one was pinned to version 1.13.2
i think Flink SQL might be the option here i just need to figure out how the protobuf serde is going to go
Hey Giannis, what do you think is the ideal way to abstract the deployment mechanism for Flink SQL apps? is there any guidance around this?