# troubleshooting
g
hi all... trying to write from Apache Flink 1.18.1 to Paimon on Hadoop HDFS (3.3.5). the reference/metadata objects (schemas) are created, and the bucket-0 directories are created, but never any data... and if I look at the docker compose logs, I see the following:
```
flink-taskmanager-2      | 2024-08-05 18:50:07,618 INFO  org.apache.flink.runtime.taskmanager.Task                    [] - Source: t_f_avro_salescompleted_x[16] (1/1)#251 (e2c7bb3077c1d5c22e0bccba65bfc32b_bc764cd8ddf7a0cff126f51c16239658_0_251) switched from CREATED to DEPLOYING.
flink-taskmanager-2      | 2024-08-05 18:50:07,618 INFO  org.apache.flink.runtime.taskmanager.Task                    [] - Loading JAR files for task Source: t_f_avro_salescompleted_x[16] (1/1)#251 (e2c7bb3077c1d5c22e0bccba65bfc32b_bc764cd8ddf7a0cff126f51c16239658_0_251) [DEPLOYING].
flink-taskmanager-2      | 2024-08-05 18:50:07,620 INFO  org.apache.flink.runtime.taskmanager.Task                    [] - Source: t_f_avro_salescompleted_x[16] (1/1)#251 (e2c7bb3077c1d5c22e0bccba65bfc32b_bc764cd8ddf7a0cff126f51c16239658_0_251) switched from DEPLOYING to INITIALIZING.
devlab-flink-jobmanager  | 2024-08-05 18:50:07,621 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source: t_f_avro_salescompleted_x[16] (1/1) (e2c7bb3077c1d5c22e0bccba65bfc32b_bc764cd8ddf7a0cff126f51c16239658_0_251) switched from DEPLOYING to INITIALIZING.
devlab-flink-jobmanager  | 2024-08-05 18:50:07,633 INFO  org.apache.flink.runtime.source.coordinator.SourceCoordinator [] - Source Source: t_f_avro_salescompleted_x[16] registering reader for parallel task 0 (#251) @ 172.19.0.12
flink-taskmanager-2      | 2024-08-05 18:50:07,648 INFO  org.apache.flink.runtime.taskmanager.Task                    [] - Source: t_f_avro_salescompleted_x[16] (1/1)#251 (e2c7bb3077c1d5c22e0bccba65bfc32b_bc764cd8ddf7a0cff126f51c16239658_0_251) switched from INITIALIZING to RUNNING.
devlab-flink-jobmanager  | 2024-08-05 18:50:07,648 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source: t_f_avro_salescompleted_x[16] (1/1) (e2c7bb3077c1d5c22e0bccba65bfc32b_bc764cd8ddf7a0cff126f51c16239658_0_251) switched from INITIALIZING to RUNNING.
devlab-flink-jobmanager  | 2024-08-05 18:50:07,997 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source: t_f_avro_salescompleted_x[16] (1/1) (e2c7bb3077c1d5c22e0bccba65bfc32b_bc764cd8ddf7a0cff126f51c16239658_0_251) switched from RUNNING to CANCELING.
devlab-flink-jobmanager  | 2024-08-05 18:50:07,997 INFO  org.apache.flink.runtime.source.coordinator.SourceCoordinator [] - Removing registered reader after failure for subtask 0 (#251) of source Source: t_f_avro_salescompleted_x[16].
flink-taskmanager-2      | 2024-08-05 18:50:07,997 INFO  org.apache.flink.runtime.taskmanager.Task                    [] - Attempting to cancel task Source: t_f_avro_salescompleted_x[16] (1/1)#251 (e2c7bb3077c1d5c22e0bccba65bfc32b_bc764cd8ddf7a0cff126f51c16239658_0_251).
flink-taskmanager-2      | 2024-08-05 18:50:07,997 INFO  org.apache.flink.runtime.taskmanager.Task                    [] - Source: t_f_avro_salescompleted_x[16] (1/1)#251 (e2c7bb3077c1d5c22e0bccba65bfc32b_bc764cd8ddf7a0cff126f51c16239658_0_251) switched from RUNNING to CANCELING.
flink-taskmanager-2      | 2024-08-05 18:50:07,997 INFO  org.apache.flink.runtime.taskmanager.Task                    [] - Triggering cancellation of task code Source: t_f_avro_salescompleted_x[16] (1/1)#251 (e2c7bb3077c1d5c22e0bccba65bfc32b_bc764cd8ddf7a0cff126f51c16239658_0_251).
flink-taskmanager-2      | 2024-08-05 18:50:07,998 INFO  org.apache.flink.runtime.taskmanager.Task                    [] - Source: t_f_avro_salescompleted_x[16] (1/1)#251 (e2c7bb3077c1d5c22e0bccba65bfc32b_bc764cd8ddf7a0cff126f51c16239658_0_251) switched from CANCELING to CANCELED.
flink-taskmanager-2      | 2024-08-05 18:50:07,998 INFO  org.apache.flink.runtime.taskmanager.Task                    [] - Freeing task resources for Source: t_f_avro_salescompleted_x[16] (1/1)#251 (e2c7bb3077c1d5c22e0bccba65bfc32b_bc764cd8ddf7a0cff126f51c16239658_0_251).
flink-taskmanager-2      | 2024-08-05 18:50:08,001 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - Un-registering task and sending final execution state CANCELED to JobManager for task Source: t_f_avro_salescompleted_x[16] (1/1)#251 e2c7bb3077c1d5c22e0bccba65bfc32b_bc764cd8ddf7a0cff126f51c16239658_0_251.
devlab-flink-jobmanager  | 2024-08-05 18:50:08,002 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source: t_f_avro_salescompleted_x[16] (1/1) (e2c7bb3077c1d5c22e0bccba65bfc32b_bc764cd8ddf7a0cff126f51c16239658_0_251) switched from CANCELING to CANCELED.
```
@D. Draco O'Brien.
d
Ok, so the logs indicate that the task responsible for reading from the source t_f_avro_salescompleted_x (and presumably writing to Paimon) is being cancelled right away after it starts running.
it switches from RUNNING to CANCELING and then CANCELED almost immediately
I don’t see an error in this log excerpt that caused it, but you might check the preceding log entries for errors or exceptions
I would check the Flink job configuration for timeouts, to make sure they are long enough to prevent early cancellation.
Also check the job's restart strategy and failure threshold settings.
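For example, something like this from the SQL client (just a sketch; the fixed-delay values are illustrative, not recommendations):
```sql
-- raise the restart tolerance so one transient failure doesn't cancel the job
SET 'restart-strategy' = 'fixed-delay';
SET 'restart-strategy.fixed-delay.attempts' = '10';
SET 'restart-strategy.fixed-delay.delay' = '10 s';
```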
g
ok, will google. these are the same Flink job and task managers I was using to write to Iceberg on S3/MinIO. I can't imagine it being permissions, as the various table definition objects are created..
d
How about the availability of t_f_avro_salescompleted_x - are you sure data can be read ok from it?
beyond these factors, you should do the standard checks on resource constraints (CPU, memory, etc.), as insufficient resources could also cause this type of failure
g
will go look... the difference between this one and the previous is the output stream/leg; the inputs, aka the topics and the flink virtual tables, are all the same.
d
I will take a quick look at Flink 1.18.1 to see if there are any known issues with HDFS 3.3.5, but I think that’s supposed to be compatible.
g
that seems to be the sweet combination of versions.
d
Probably worth a quick check whether there are any known issues. Resource constraints are a bit unlikely given it's more or less the same setup as before
g
the diff of this build is pushing out to a hadoop dfs cluster, so that's extra resources; the previous version was to a single node/container MinIO S3 service. but can't see it being resources, this MBP is not exactly slow.
d
source is Kafka, right?
g
yes, Kafka topic => Flink virtual table => paimon on hdfs
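for context, the virtual table is roughly this shape (a sketch, not my actual DDL; columns, topic, hosts, and the schema-registry option are all placeholders, assuming Avro via a schema registry):
```sql
CREATE TABLE t_f_avro_salescompleted_x (
  invoice_id STRING,        -- placeholder columns
  store_id   STRING,
  total      DOUBLE,
  sale_ts    TIMESTAMP(3)
) WITH (
  'connector' = 'kafka',
  'topic' = 'salescompleted',                      -- placeholder topic
  'properties.bootstrap.servers' = 'broker:9092',  -- placeholder broker
  'properties.group.id' = 'flink-paimon-test',
  'scan.startup.mode' = 'earliest-offset',
  'value.format' = 'avro-confluent',
  'value.avro-confluent.url' = 'http://schema-registry:8081'  -- placeholder registry
);
```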
d
So you're quite sure that the sales_completed_source is available and producing data, right?
if not, you could write a job just to test this aspect out
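e.g. a minimal smoke test from the SQL client:
```sql
-- if this prints rows, the source side is fine and the problem is on the Paimon leg
SELECT * FROM t_f_avro_salescompleted_x LIMIT 10;
```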
g
interesting, originally i thought something was wrong with the CTAS, so tried the CTAS with WHERE 1=2, followed by the INSERT INTO <>
but now thinking nothing wrong with CTAS... it's something else
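i.e. roughly these two variants (sink table name is a placeholder):
```sql
-- variant 1: plain CTAS, create the Paimon table and populate it in one go
CREATE TABLE t_p_salescompleted AS
SELECT * FROM t_f_avro_salescompleted_x;

-- variant 2: WHERE 1=2 copies only the schema (no rows), then an explicit insert
CREATE TABLE t_p_salescompleted AS
SELECT * FROM t_f_avro_salescompleted_x WHERE 1=2;

INSERT INTO t_p_salescompleted
SELECT * FROM t_f_avro_salescompleted_x;
```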
busy rebuilding the stack... docker had an issue... the resource saver is not our friend. will try and break the steps down a bit and see... this is all a 1000-foot view into a project/PoV
d
Btw, do you have Paimon logging set up as well?
Maybe it’s worth a quick look at the Paimon configuration as well, to double-check this setup …
g
paimon is just an open table format, stored on hdfs, not much to enable.
d
well … are you using the Paimon connector?
I guess it’s the Hadoop connector, or what?
g
no... using the jar file in the flink lib, then the catalog definition as per the above sql file, and then the table creates as per above.
d
I think you might need the Flink connector for Hadoop…
```xml
<dependencies>
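    <!-- note: Paimon also ships Flink-version-specific artifacts
         (e.g. paimon-flink-1.18); check which one matches your Flink -->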
    <dependency>
        <groupId>org.apache.paimon</groupId>
        <artifactId>paimon-flink</artifactId>
        <version>${paimon.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-hadoop-fs</artifactId>
        <version>${flink.version}</version>
    </dependency>
</dependencies>
```
when using Paimon
g
will need to dig; the paimon quick start for flink did not imply anything like that.
some serious google time...
prob have to "recognise" i've been scanning over the paimon bits and haven't read them in detail...
d
ok, are you using Paimon also as the Catalog store for Flink?
g
yes..
as the first try...
for paimon i'm using the hdfs catalog
for flink i'm using the previously proven hms
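the paimon side is roughly this (a sketch; the warehouse path/host are placeholders):
```sql
-- filesystem-backed Paimon catalog: metadata and data both live on HDFS
CREATE CATALOG c_paimon WITH (
  'type' = 'paimon',
  'warehouse' = 'hdfs://namenode:8020/paimon'
);
USE CATALOG c_paimon;
```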
d
I see. According to https://www.alibabacloud.com/help/en/flink/developer-reference/apache-paimon-connector it would appear that you only need the connector in the case where you are using another catalog.
g
this could be the cause, but it's a far-fetched one.
d
Yeah, well, breaking it up a bit into separate jobs will show what’s happening
g
figured; get it to work like this first before i try and move the catalog into hms.
been building up the stack slowly, making sure one bit works before it's used in a 2nd bit
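the later hms variant would be something like this (a sketch; the thrift uri and warehouse path are placeholders):
```sql
-- same Paimon catalog, but with the Hive metastore holding the metadata
CREATE CATALOG c_paimon_hms WITH (
  'type' = 'paimon',
  'metastore' = 'hive',
  'uri' = 'thrift://hive-metastore:9083',
  'warehouse' = 'hdfs://namenode:8020/paimon'
);
```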
d
Yep, I am curious to find out how well it supports catalogs.
g
i've got about 90% of the work done to create a 2-node hive environment: a hive server and a meta server backed by postgres. once i get the simple hdfs catalog working i will try the standalone hms, and if that works, then try my hive environment.
that way i know if it's me (most prob) or the tech...
especially as a lot of this is brand new to me.
d
Looking forward to seeing the results on that.
g
heheheheh, 😉 my small little blog has turned into a multipart monster.
right now trying to automate more of the deployment/building steps...
will post a link to the project git here shortly, if anyone is bored.