# troubleshooting
j
Hi whatever is wrong with my sql insert statement?
INSERT INTO etdr_kafka
SELECT *
from etdr_mongo
DISTRIBUTE BY `pi`
SORT BY ts ASC;
I get this error:
Flink SQL> [ERROR] Could not execute SQL statement. Reason:
org.apache.flink.sql.parser.impl.ParseException: Encountered "BY" at line 5, column 12.
Was expecting one of:
    <EOF>
    "EXCEPT" ...
    "FETCH" ...
    "GROUP" ...
    "HAVING" ...
    "INTERSECT" ...
    "LIMIT" ...
    "OFFSET" ...
    "ORDER" ...
    "MINUS" ...
    "TABLESAMPLE" ...
    "UNION" ...
    "WHERE" ...
    "WINDOW" ...
    "(" ...
    ";" ...
    "," ...
    "NATURAL" ...
    "JOIN" ...
    "INNER" ...
    "LEFT" ...
    "RIGHT" ...
    "FULL" ...
    "CROSS" ...
    "OUTER" ...
d
It’s because DISTRIBUTE BY is Hive syntax, not standard SQL, and it is not supported by Flink SQL. Without the distribution and per-partition sort, the equivalent standard SQL would be something like this:
INSERT INTO etdr_kafka
SELECT * FROM etdr_mongo;
So I believe DISTRIBUTE BY does not exist in Flink SQL at all.
d
So you could do something similar in the Flink Table API.
Although you can’t use DISTRIBUTE BY directly, you can influence data distribution when defining the sink, especially if the sink connector supports partitioning keys. Here’s an example where you might choose a field to partition by when defining the sink table:
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

public class FlinkTableIndirectDistributionExample {

    public static void main(String[] args) throws Exception {

        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);

        // Assume you've defined and registered the source table "etdr_mongo"

        // Define the sink table with partitioning keys implicitly influencing distribution
        // This is hypothetical and depends on the sink connector supporting such configuration
        tableEnv.executeSql(
            "CREATE TABLE etdr_kafka ("
            + "column1 STRING, "
            + "column2 INT, "
            + "pi STRING, "
            + "ts TIMESTAMP"
            + ") WITH ("
            + "'connector' = 'kafka', "
            + "'topic' = 'your-topic', "
            + "'properties.bootstrap.servers' = 'localhost:9092', "
            + "'key.format' = '...', " // Key format for partitioning if applicable
            + "'key.fields' = 'pi', " // Implicitly distributes data by 'pi'
            + "'format' = 'json'"
            + ")"
        );

        // Insert into the sink table. executeSql() submits the job on its
        // own, so a separate env.execute() call is not needed here (it
        // would fail because no DataStream operators were defined).
        tableEnv.executeSql(
            "INSERT INTO etdr_kafka SELECT * FROM etdr_mongo"
        ).await();
    }
}
This would need to be customized to your specific use case. The same thing can also be done with the DataStream API, which sits at a lower level.
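Relatedly, the Kafka SQL connector also exposes a 'sink.partitioner' option, so instead of relying on key hashing you could plug in your own FlinkKafkaPartitioner subclass. A sketch reusing the columns from the table above (the partitioner class name is hypothetical):

```sql
CREATE TABLE etdr_kafka (
  column1 STRING,
  column2 INT,
  pi STRING,
  ts TIMESTAMP
) WITH (
  'connector' = 'kafka',
  'topic' = 'your-topic',
  'properties.bootstrap.servers' = 'localhost:9092',
  'format' = 'json',
  -- accepts 'fixed', 'round-robin', or the class name of a custom
  -- FlinkKafkaPartitioner subclass on the job classpath
  'sink.partitioner' = 'com.example.PiPartitioner'
);
```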
j
I am loading data from mongo to kafka. I want to sort it by timestamp. The bottleneck is sorting if I use normal order by, because it sorts globally and cannot be parallel. I only need the data to be sorted per partition in kafka. So I was hoping to achieve that with the sql statement I tried initially. I guess I have to use datastream API and even then I have to derive the partition index myself, and already keyBy the partition index in flink.
d
Yes, I think it’s either that or an intermediate Kafka topic during processing, but that would not be as efficient.
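For the keyBy-on-partition-index idea, the derivation step itself is small. Here is a minimal plain-Java sketch (no Flink dependencies); note the `hashCode()`-based hashing is just an illustrative assumption — Kafka’s default partitioner hashes the serialized key with murmur2, so to guarantee Flink’s key groups line up with the actual Kafka partitions you would reuse this same function in a custom partitioner on the sink:

```java
public class PartitionIndexExample {

    // Derive a partition index from a key. Illustrative only: uses
    // String.hashCode(), whereas Kafka's default partitioner applies
    // murmur2 to the serialized key bytes. Whatever function you pick,
    // use it both for keyBy() and for the sink's custom partitioner.
    static int derivePartition(String key, int numPartitions) {
        // floorMod keeps the result non-negative even for negative hashes
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    public static void main(String[] args) {
        int numPartitions = 4;
        String[] keys = {"pi-a", "pi-b", "pi-c"};
        for (String key : keys) {
            int p = derivePartition(key, numPartitions);
            // Records with the same key always map to the same index,
            // so sorting by ts within each key group gives you
            // per-partition ordering in Kafka.
            System.out.println(key + " -> partition " + p);
        }
    }
}
```

In the DataStream job you would then keyBy the derived index, buffer/sort each group by ts, and hand the same function to a custom FlinkKafkaPartitioner so the writes land on the matching partition.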