windwheel
08/30/2024, 1:12 AM
flink-connector-clickhouse
Due to historical reasons in our company's framework, we have to use PyFlink. In batch mode, PyFlink 1.16.1 only supports writing UDFs as pandas UDFs. I wrote a multi-column conversion function that ran stably in SQL, and I did two rounds of memory tuning. Unfortunately, the parameters from my first round were lost, but I roughly remember adjusting:
taskmanager.memory.process.size: 4gb
taskmanager.memory.network.fraction: 0.1
taskmanager.memory.managed.fraction: 0.4
and those settings worked.
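For context, Flink derives the managed and network pools from these knobs roughly as follows. This is a back-of-the-envelope sketch assuming Flink's defaults for `taskmanager.memory.jvm-metaspace.size` (256mb) and the JVM overhead min/max bounds (192mb/1gb); the exact sizing is done by Flink itself and depends on the full configuration:

```python
# Rough TaskManager memory breakdown for the first tuning attempt.
# Assumes Flink defaults for jvm-metaspace (256mb) and jvm-overhead
# bounds (min 192mb, max 1gb); values are in MB.
process_mb = 4 * 1024                                    # taskmanager.memory.process.size: 4gb
jvm_overhead_mb = min(max(process_mb * 0.1, 192), 1024)  # jvm-overhead.fraction default 0.1
metaspace_mb = 256                                       # jvm-metaspace default
total_flink_mb = process_mb - jvm_overhead_mb - metaspace_mb
managed_mb = total_flink_mb * 0.4                        # taskmanager.memory.managed.fraction: 0.4
network_mb = total_flink_mb * 0.1                        # taskmanager.memory.network.fraction: 0.1
print(round(managed_mb), round(network_mb))              # → 1372 343
```

So under the first configuration roughly 1.3 GB of managed memory was available, part of which PyFlink carves out for the Python workers.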
But during my second round of tuning, when PyFlink executed the over-window aggregation operator, the operator stayed in the INITIALIZING state and no data flowed in. The parameters were as follows:
taskmanager.memory.process.size: 4gb
taskmanager.memory.network.fraction: 0.3
taskmanager.memory.managed.fraction: 0.45
taskmanager.memory.jvm-overhead.fraction: 0.1
taskmanager.memory.framework.off-heap.size: 128mb
taskmanager.memory.managed.consumer-weights: OPERATOR:60,STATE_BACKEND:60,PYTHON:40
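Repeating the same back-of-the-envelope arithmetic for this second configuration, and applying the consumer weights to estimate the Python workers' slice of managed memory (a sketch only; Flink counts only the consumers actually active in the job, so the real split can differ):

```python
# Rough estimate of the managed-memory slice the Python workers may claim
# under the second configuration. Assumes the Flink default jvm-metaspace
# of 256mb; values are in MB.
process_mb = 4 * 1024
jvm_overhead_mb = min(max(process_mb * 0.1, 192), 1024)  # jvm-overhead.fraction: 0.1
total_flink_mb = process_mb - jvm_overhead_mb - 256
managed_mb = total_flink_mb * 0.45                       # managed.fraction: 0.45
weights = {"OPERATOR": 60, "STATE_BACKEND": 60, "PYTHON": 40}
python_mb = managed_mb * weights["PYTHON"] / sum(weights.values())
print(round(managed_mb), round(python_mb))               # → 1544 386
```

For comparison, the 536870920 bytes in the log below is almost exactly 512 MiB (536870920 / 2**20 ≈ 512.0), which is larger than the ~386 MB this naive split suggests.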
Since there is very little information about PyFlink online, I read the source code and then judged from the logs.
Log: Obtained shared Python process of size 536870920 bytes
It may be that the Python interpreter process memory PyFlink estimates from managed memory is too large, and the machine does not have enough memory to start the Python interpreter process.
What puzzles me is that the TaskManager's total process memory is configured as only 4 GB, so why is the estimated memory so large?
What configuration do I need so that the operator receives data properly and sends it downstream?
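One avenue worth trying (a sketch, not a verified fix for this job): take the Python workers out of managed memory entirely, so their footprint becomes an explicit, bounded number rather than a size derived from the managed fraction and consumer weights. Flink exposes this via `python.fn-execution.memory.managed`; when it is `false`, the Python processes draw from task off-heap memory, which then has to be sized explicitly (the 512mb below is an illustrative placeholder, not a recommendation):

```
# Sketch only: decouple Python worker memory from managed memory.
python.fn-execution.memory.managed: false
# Explicit off-heap reservation for the Python processes;
# 512mb is an illustrative value, tune to the machine.
taskmanager.memory.task.off-heap.size: 512mb
```

This at least makes the Python workers' memory demand visible and independent of the OPERATOR/STATE_BACKEND/PYTHON weight split.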
Slack Conversation