# troubleshooting
Hello all! I’m currently dealing with an issue where the time spent on garbage collection on a single TM is increasing at a much higher rate than on all the other task managers. I believe this is causing the TM to be occupied with garbage collection, leading to latency in my application. Does anyone have any helpful pointers on where the issue might be? Why does garbage collection on this specific TM take so much longer than on the other TMs?
These issues are common for distributed Java apps; here are some things you can do:
1. Ensure that the workload distribution among Task Managers (TMs) is balanced. If one TM is processing a significantly larger amount of data or running more complex tasks, it will generate more garbage and therefore require more frequent and longer GC cycles.
2. Check whether there is a memory leak in the code executed by that specific TM. A memory leak continuously allocates memory without releasing it, leading to excessive garbage over time and longer GC times. Use a memory profiler for this (there is a hypothetical example of the typical pattern at the end of this message).
3. The JVM settings for each TM, especially those related to garbage collection, can greatly influence GC behavior. Check whether the problematic TM has different or suboptimal JVM flags compared to the others. For instance, the GC algorithm used (e.g., G1GC, CMS, ParallelGC) and the heap size settings can have a large impact on GC duration.
4. Use tools like VisualVM, JConsole, or YourKit to monitor the JVM of the affected TM. These provide insight into memory usage patterns and GC activity and help identify spikes or anomalies (a minimal sketch of reading the GC counters programmatically follows below).
5. Enable detailed GC logging for the JVM of the troublesome TM. Analyzing these logs can reveal patterns such as frequent full GCs or long pause times that indicate potential issues.
6. Review recent code changes or the specific tasks executed by this TM. New code or libraries might have different memory allocation patterns that affect GC.
7. Take a heap dump during peak activity and analyze it using tools like Eclipse MAT or VisualVM. This can help identify large objects or object-retention issues that contribute to longer GC cycles.
8. Experiment with the JVM settings. For example, if you are using G1GC, you might tune the heap region size (-XX:G1HeapRegionSize), the maximum pause time goal (-XX:MaxGCPauseMillis), or the heap size itself to better suit your workload.

Tuning garbage collection is a balance between throughput, latency, and memory utilization, and what works best can vary widely based on the specific application, its workload characteristics, and the available hardware resources.
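To compare GC load across TMs without attaching a full profiler, you can read the JVM's own GC counters, which are the same numbers Flink's JVM GC metrics are based on. Here's a minimal sketch, assuming you just want to log the cumulative counts and times periodically; the class name and sampling interval are placeholders:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Minimal sketch: periodically print cumulative GC counts and times for the
// current JVM. Logging the same values on every TM makes it easy to spot the
// one whose GC time grows much faster than the rest at a similar input rate.
public class GcSampler {
    public static void main(String[] args) throws InterruptedException {
        while (true) {
            for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                System.out.printf("%s: collections=%d, totalGcTimeMs=%d%n",
                        gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
            }
            Thread.sleep(10_000); // hypothetical 10-second sampling interval
        }
    }
}
```

If one TM's total GC time climbs much faster than the others while processing a similar amount of data, that points at skew or a leak on that TM rather than a global tuning problem.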
It's a bit of trial and error, really, but you should be able to optimize your settings. You might get more specific feedback if you provide more details about your Flink deployment (K8s, local deployment, docker-compose, etc.) and your available memory and memory settings.
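And just to illustrate the kind of leak I mean in point 2: it is often a static or otherwise long-lived collection that only ever grows. This is a made-up example (the class, field, and method names are hypothetical), showing the pattern a heap dump would surface as one huge map retaining most of the heap on whichever TM happens to see the hot keys:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import org.apache.flink.api.common.functions.RichMapFunction;

// Hypothetical leak pattern: a static cache shared by all subtasks on the TM
// that is populated per key but never evicted, so it grows without bound.
public class EnrichEvents extends RichMapFunction<String, String> {

    private static final Map<String, String> CACHE = new ConcurrentHashMap<>();

    @Override
    public String map(String event) {
        // Entries accumulate forever; the old generation fills up and full GCs
        // get longer on the TM that receives the widest range of keys.
        return CACHE.computeIfAbsent(event, key -> expensiveLookup(key));
    }

    private String expensiveLookup(String key) {
        // Placeholder for a real enrichment call.
        return key.toUpperCase();
    }
}
```

In a heap dump, something like this typically shows up in Eclipse MAT's dominator tree as a single map retaining a large share of the old generation.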
Thank you for your recommendations. Profiling in our AWS environments has proven to be difficult, so I’m currently working on deploying the app locally. Hopefully this will help expose the issues we’re seeing. We are deploying using K8s on EKS, currently running with 40 TMs and 3 tasks per TM, for a parallelism of 120. Each TM has 4 CPUs and 30 GB of memory.
Yes, perhaps you can use LocalStack.