# troubleshooting
a
I have 4 task managers, each on a separate machine, and one job manager. All task managers report metrics to the job manager, as I can see in the web UI, except one. So if I have a custom metric, like an event metric, it is shown in the graph for all running tasks except the ones on this task manager. All the configs are the same; we usually copy and paste the config.
a
Have you confirmed tasks are being sent to slots on that taskmanager and it is actually doing work?
a
Yes, this machine does most of the work.
I am using Flink v1.12.
a
One thing we can say is that’s an old version, but that’s probably not the source of your problem. I wish you luck 🙇‍♂️
k
Seems very unlikely this is a Flink bug, but I assume you checked the logs on that one server, as well as the JM logs for any unusual messages.
d
Make sure everything is on the same version, Flink v1.12. How are CPU, memory, and disk space doing?
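Since the configs are copied by hand between machines, a quick diff can confirm that they really match. Below is a minimal sketch, assuming each machine's `flink-conf.yaml` uses flat `key: value` lines; the helper names and the sample values are hypothetical and only illustrate the comparison.

```python
def parse_flink_conf(text):
    """Parse flat 'key: value' lines; text after '#' is treated as a comment."""
    conf = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        key, sep, value = line.partition(":")
        if sep and key.strip():
            conf[key.strip()] = value.strip()
    return conf


def diff_confs(reference, other):
    """Return {key: (reference_value, other_value)} for every key that differs."""
    keys = sorted(set(reference) | set(other))
    return {
        k: (reference.get(k), other.get(k))
        for k in keys
        if reference.get(k) != other.get(k)
    }


# Illustrative contents only; in practice, read the file from each machine.
working = parse_flink_conf("jobmanager.rpc.address: jm-host\ntaskmanager.numberOfTaskSlots: 4\n")
silent = parse_flink_conf("jobmanager.rpc.address: jm-host\ntaskmanager.numberOfTaskSlots: 8\n")

for key, (ref_value, other_value) in diff_confs(working, silent).items():
    print(f"{key}: working={ref_value!r} silent={other_value!r}")
```

Any key that prints here differs between the working task manager and the silent one, so it is a candidate to look at first.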
a
It doesn't even report memory metrics. And yes, there is nothing out of the ordinary in the logs.
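One way to double-check what the web UI shows is to ask the JobManager's REST API which metrics it knows for each task manager. The sketch below assumes the REST endpoints `/taskmanagers` and `/taskmanagers/<id>/metrics` behave as described in the Flink REST documentation; the base URL in the usage note is a placeholder, and the function names are hypothetical.

```python
import json
from urllib.request import urlopen


def taskmanagers_without_metrics(metrics_by_tm):
    """Given {taskmanager_id: [metric entries]}, return the IDs with an empty list."""
    return sorted(tm for tm, metrics in metrics_by_tm.items() if not metrics)


def fetch_metrics_by_tm(base_url):
    """Query the JobManager REST API for the metric list of every task manager."""
    with urlopen(f"{base_url}/taskmanagers") as resp:
        taskmanagers = json.load(resp)["taskmanagers"]
    result = {}
    for tm in taskmanagers:
        with urlopen(f"{base_url}/taskmanagers/{tm['id']}/metrics") as resp:
            result[tm["id"]] = json.load(resp)
    return result


# Offline illustration with made-up IDs; no network call happens here.
sample = {
    "tm-1": [{"id": "Status.JVM.Memory.Heap.Used"}],
    "tm-2": [],
}
print(taskmanagers_without_metrics(sample))
```

Against a live cluster this would be called as `taskmanagers_without_metrics(fetch_metrics_by_tm("http://jobmanager-host:8081"))`, where the host and port are assumptions to adjust. A task manager that shows up in the result is registered with the job manager but has no metrics available there, which narrows the problem to the metrics path rather than the task manager itself.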
d
OK, can you restart the task manager service and show the logs from that point onward?
Also, what's your deployment again? Are you on k8s or Docker?
a
Standalone cluster
I restarted it many times and checked the logs. Nothing out of the ordinary.
d
OK, but do you see anything related to metrics? sending_metrics or reporting_metrics?
With the proper log levels, it should at least be logging the attempt to send the metrics, right?
If the task manager is set up to send metrics, then it will log the attempt to send even if it never reaches the job manager. At least I think that's how it would work.
Or just anything in the task manager logs that contains the words "metrics" or "metric"?
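Searching every task manager's log for "metric" and comparing the counts side by side makes it easier to see whether the silent machine logs anything different. This is a minimal sketch; the host names and log lines below are made-up placeholders, not real Flink output, and in practice the text would come from each machine's log file.

```python
import re

# Matches both "metric" and "metrics", in any letter case.
METRIC_RE = re.compile(r"metric", re.IGNORECASE)


def metric_lines(log_text):
    """Return the log lines that mention 'metric' or 'metrics'."""
    return [line for line in log_text.splitlines() if METRIC_RE.search(line)]


def count_by_host(logs):
    """Map each host name to the number of metric-related lines in its log."""
    return {host: len(metric_lines(text)) for host, text in logs.items()}


# Placeholder log contents for illustration only.
logs = {
    "tm-1": "INFO example - starting\nINFO example - no metrics will be reported\n",
    "tm-4": "INFO example - starting\n",
}

for host, count in sorted(count_by_host(logs).items()):
    print(f"{host}: {count} metric-related line(s)")
```

If the counts and the matching lines are identical across all four machines, the logs at the current level are unlikely to explain the difference, and raising the log level is the next step.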
a
If you mean “no metrics will be reported”, that line is the same in the logs of all the task managers.
d
ok .. well …
from those logs it does not look like metrics are properly configured.
a
Yes, but it is the same for the task managers that work correctly.
d
Basically, if it attempts to send metrics, it will look like:
DEBUG org.apache.flink.runtime.metrics.MetricRegistryImpl       - Reporting metrics to 'akka.tcp://flink@jobmanager-host:port/user/taskmanager_0'.
And that's even if it never makes it to the JobManager.
You should be seeing this if it's properly configured.
There might also be some metric-gathering activity, like `INFO org.apache.flink.metrics.groups.TaskManagerMetricGroup - Metric 'metric.name' updated to 'value'.`
You might not see acks, but you should see something like:
DEBUG org.apache.flink.runtime.jobmaster.JobMaster                - Received metric report from TaskManager with ID=taskmanager-id.
Do you see these in the logs for all except one task manager? It's a debug-level message.