
Pedro Silva

05/24/2021, 1:31 PM
Hello, I have a Pinot deployment done through K8s where I'm progressively adding fields to a realtime table. The deployment is very basic (a single server instance with a 6GB Java heap, a pod memory limit of 7GB, 100GB persistent storage, and deep storage enabled for segments), but I'm getting repeated server restarts because the pod keeps getting killed with OutOfMemory errors while ingesting data and creating segments. It seems the cause is not the JVM itself but off-heap memory maps; please see the following image for more details.
My question is how can I size the deployment of the server adequately, particularly how can I manage this off-heap usage?

Daniel Lavoie

05/24/2021, 1:35 PM
Server naturally uses off-heap
My rule of thumb is to leave 50% of the container memory to off-heap,
then adjust based on your metrics. You'll see with the metrics you have whether you have room for more heap and whether it would benefit, given the GC patterns.
Table configs will have an impact on your off-heap usage, so I would start from the 50% rule of thumb, then optimize based on the specifics of the tables running on the system.
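For example, on a 7GB pod that rule of thumb might look something like the sketch below (the values, and wiring the flags in through JAVA_OPTS, are illustrative assumptions, not the actual settings of this deployment):
# Hypothetical sizing for a 7GB server pod: ~50% heap, the rest left for
# off-heap (mmapped segments, direct buffers, metaspace, thread stacks).
export JAVA_OPTS="-Xms3G -Xmx3G -XX:+PrintGCDetails"
# Adjust up or down later based on the observed GC patterns and container usage.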

Pedro Silva

05/24/2021, 1:39 PM
Thank you for the feedback Daniel. What metrics in particular should I take a look at?

Daniel Lavoie

05/24/2021, 1:41 PM
Standard container and JVM metrics.
Compare the used memory from the K8s metrics with the JVM heap usage from the JMX exporter.
These two will tell you how much is available for non-heap.
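For example, something like the commands below can line the two numbers up (pod name, namespace, exporter port, and metric name are assumptions based on a typical JMX/Prometheus exporter setup):
# Container-level memory usage as seen by Kubernetes (requires metrics-server):
kubectl top pod pinot-server-0 -n pinot
# JVM heap usage scraped from the JMX/Prometheus exporter inside the pod:
kubectl exec -n pinot pinot-server-0 -- curl -s localhost:8008/metrics | grep 'jvm_memory_bytes_used{area="heap"'
# container usage minus JVM heap usage ~= what is actually going to non-heap.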

Pedro Silva

05/24/2021, 1:42 PM
From the grafana chart I put above, it seems Pinot is using more than 2x the memory for off-heap compared to the JVM heap, rather than ~100%. With such a high amount of memory-mapped usage, does it make sense to add more servers with the 50%-for-off-heap rule?

Daniel Lavoie

05/24/2021, 1:43 PM
This is not the metric I meant
This represents the size of the data mapped on disk.
What we want is the size of the in-memory pointers
A server can have TB of memory mapped data.
Also
What usually causes an OOM from K8s is that your JVM settings are too high for the actual K8s resource request.
Having too many elements off-heap will not cause an OOM, but disk swapping.
The OOM happens because the heap is using memory it thinks is available to the pod, but isn't, hence the OOM kill from K8s.
Tuning down the heap size usually fixes that issue.
You could confirm that by monitoring K8s resource request vs physical usage
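For example (pod name and namespace are placeholders):
# What the container requested / is limited to:
kubectl get pod pinot-server-0 -n pinot -o jsonpath='{.spec.containers[0].resources}'
# What it is physically using right now:
kubectl top pod pinot-server-0 -n pinot
# If usage approaches the limit while the heap is sized close to that limit,
# heap plus off-heap overhead simply cannot fit, and K8s kills the pod.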

Pedro Silva

05/24/2021, 1:51 PM
Ok, so if I understand correctly: the processes running in my server pod are trying to use more memory than the pod has available (limit: 7GB)?

Daniel Lavoie

05/24/2021, 1:51 PM
Exactly

Pedro Silva

05/24/2021, 1:51 PM
The JVM usage is as follows:
The memory of the pod (reddish line) goes to 175%, meaning my pod's limit should be at ~12.25GB (7 * 1.75).

Daniel Lavoie

05/24/2021, 1:53 PM
Just reduce the heap

Pedro Silva

05/24/2021, 1:53 PM
Does that sound about right Daniel?
To half of the limit of the pod?

Daniel Lavoie

05/24/2021, 1:53 PM
Yes

Pedro Silva

05/24/2021, 1:53 PM
Isn't there a concern that such a small heap won't be enough to hold the segments in memory for fast querying?

Daniel Lavoie

05/24/2021, 1:54 PM
offheap is fast

Pedro Silva

05/24/2021, 1:54 PM
Thank you for the assistance, I will try out this config.
If I may ask one more question: in what scenarios do you want to increase the pod memory? When does it make sense to scale the servers vertically (more memory) vs horizontally (more servers)?

Daniel Lavoie

05/24/2021, 1:58 PM
Having more servers means you reduce the impact of redistributing or redownloading segments

Mayank

05/24/2021, 1:58 PM
What is the event rate from your event stream, and how many partitions do you have?
Parts of consuming segments are on direct memory, which can OOM if there is not enough of it (unlike mmap).
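The ceiling for that direct memory is a standard JVM flag; the values below and the JAVA_OPTS wiring are illustrative assumptions:
# Consuming segments allocate direct (non-mmapped) off-heap buffers; if they
# outgrow the JVM's direct memory limit you get a direct-buffer OutOfMemoryError.
export JAVA_OPTS="-Xms3G -Xmx3G -XX:MaxDirectMemorySize=2G"
# Heap + direct memory + remaining off-heap must still fit under the pod limit.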

Pedro Silva

05/24/2021, 1:59 PM
16 partitions. It is a scheduled cron outputting ~50M entries daily. We are in the process of moving from the cron job to an event-based stream.

Daniel Lavoie

05/24/2021, 1:59 PM
Parts of consuming segments are on direct memory, which can OOM if there is not enough of it (unlike mmap).
That will cause a JVM OOM, not a K8s OOM.

Mayank

05/24/2021, 2:00 PM
Oh sorry, long thread, I assumed it was JVM OOM.

Pedro Silva

05/24/2021, 2:00 PM
How can I distinguish them, in K8s? Via the pod logs?

Daniel Lavoie

05/24/2021, 2:00 PM
Yes.
A heap OOM will be observed in the log as the usual dreaded heap exception.
The K8s OOM will just kill your container with no other mention than an OOM message in the pod's events.
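For example, the two can be told apart with the commands below (pod name and namespace are placeholders):
# K8s OOM: the container's last termination reason is OOMKilled.
kubectl get pod pinot-server-0 -n pinot -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
kubectl get events -n pinot | grep -i oom
# JVM OOM: the previous container's log ends with java.lang.OutOfMemoryError
# or points at an hs_err_pid*.log fatal error report.
kubectl logs --previous pinot-server-0 -n pinot | grep -iE 'OutOfMemoryError|fatal error'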

Mayank

05/24/2021, 2:05 PM
Could we confirm if it is JVM or pod OOM? Also if it is JVM, is it heap or direct memory OOM?

Pedro Silva

05/24/2021, 2:06 PM
#
# A fatal error has been detected by the Java Runtime Environment:
#
[thread 140238581155584 also had an error]
#  SIGBUS (0x7) at pc=0x00007f8c70a013c2, pid=9, tid=0x00007f8bd167c700
#
# JRE version: OpenJDK Runtime Environment (8.0_292-b10) (build 1.8.0_292-b10)
# Java VM: OpenJDK 64-Bit Server VM (25.292-b10 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# v  ~StubRoutines::jbyte_disjoint_arraycopy
#
# Core dump written. Default location: /opt/pinot/core or core.9
#
[thread 140238499997440 also had an error]
# An error report file with more information is saved as:
# /opt/pinot/hs_err_pid9.log
[thread 140237039417088 also had an error]
[thread 140238498944768 also had an error]
[thread 140238586418944 also had an error]
[thread 140238396491520 also had an error]
[thread 140238582208256 also had an error]
[thread 140238587471616 also had an error]
#
# If you would like to submit a bug report, please visit:
#   <http://bugreport.java.com/bugreport/crash.jsp>
#
Aborted (core dumped)
Got this with a 3GB java heap and a pod memory request of 6GB (limit 7GB)

Daniel Lavoie

05/24/2021, 2:06 PM
Oh!
That’s a JVM OOM
Good call Mayank

Pedro Silva

05/24/2021, 2:06 PM
Exit code 134, yeah, no logs though.

Mayank

05/24/2021, 2:07 PM
What does the hs_err.log say?
Likely it is because all 16 partitions are consuming in a burst and trying to allocate memory at the same time. If it is direct memory, then it is during consumption. If it is heap, then it is segment generation happening in parallel.
For direct memory, you could reduce the number of partitions and increase the JVM footprint. For heap, you can limit the number of segments generated in parallel.
You can also throttle the event rate instead of pumping 50M records per burst.

Pedro Silva

05/24/2021, 2:11 PM
Do you mean reducing the number of partitions in Kafka? Or the segmentPartitionConfig of the Pinot table?

Mayank

05/24/2021, 2:15 PM
Kafka. But let’s first find out if heap or direct memory

Pedro Silva

05/24/2021, 2:15 PM
I'm trying to access the log file, but the particular filesystem path where it is stored is not on a PVC.

Mayank

05/24/2021, 2:16 PM
Anything in server log?

Pedro Silva

05/24/2021, 2:18 PM
Exception in thread "HitExecutionView__12__49__20210524T1141Z" java.lang.InternalError: a fault occurred in a recent unsafe memory access operation in compiled Java code
Off-heap memory I assume?
Full trace:
Exception in thread "HitExecutionView__12__49__20210524T1141Z" java.lang.InternalError: a fault occurred in a recent unsafe memory access operation in compiled Java code
	at org.apache.pinot.segment.local.function.InbuiltFunctionEvaluator$FunctionExecutionNode.execute(InbuiltFunctionEvaluator.java:116)
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGBUS (0x7) at pc=0x00007f51c5052416, pid=9, tid=0x00007f50a5553700
#
[thread 139984278378240 also had an error]
# JRE version: OpenJDK Runtime Environment (8.0_292-b10) (build 1.8.0_292-b10)
# Java VM: OpenJDK 64-Bit Server VM (25.292-b10 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# v  ~StubRoutines::jbyte_disjoint_arraycopy
#
# Core dump written. Default location: /opt/pinot/core or core.9
#
# An error report file with more information is saved as:
# /opt/pinot/hs_err_pid9.log
	at org.apache.pinot.segment.local.function.InbuiltFunctionEvaluator.evaluate(InbuiltFunctionEvaluator.java:87)
	at org.apache.pinot.segment.local.recordtransformer.ExpressionTransformer.transform(ExpressionTransformer.java:95)
	at org.apache.pinot.segment.local.recordtransformer.CompositeTransformer.transform(CompositeTransformer.java:82)
	at org.apache.pinot.core.data.manager.realtime.LLRealtimeSegmentDataManager.processStreamEvents(LLRealtimeSegmentDataManager.java:509)
	at org.apache.pinot.core.data.manager.realtime.LLRealtimeSegmentDataManager.consumeLoop(LLRealtimeSegmentDataManager.java:416)
	at org.apache.pinot.core.data.manager.realtime.LLRealtimeSegmentDataManager$PartitionConsumer.run(LLRealtimeSegmentDataManager.java:556)
	at java.lang.Thread.run(Thread.java:748)
[thread 139984280483584 also had an error]
#
# If you would like to submit a bug report, please visit:
#   <http://bugreport.java.com/bugreport/crash.jsp>
#
Aborted (core dumped)
The failure appears to be in the partition consumer (LLRealtimeSegmentDataManager$PartitionConsumer.run).

Mayank

05/24/2021, 2:20 PM
Are you using any groovy functions?

Pedro Silva

05/24/2021, 2:21 PM
Yes

Mayank

05/24/2021, 2:21 PM
What does it do?

Pedro Silva

05/24/2021, 2:22 PM
It parses a time string in a weird format from a JSON field into the Java standard format and returns the milliseconds since epoch of that time.

Mayank

05/24/2021, 2:22 PM
Ok. My guess is a heap OOM.
Do you have logs from the server?
In the server config there is a way to limit the number of segments being flushed in parallel.
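A hedged sketch of that knob; the property name below is from memory and should be verified against the docs for your Pinot version, and the config file path is a placeholder:
# Caps how many consuming segments are built/flushed concurrently, which
# bounds the heap spike when many partitions complete at the same time.
echo "pinot.server.instance.realtime.max.parallel.segment.builds=2" >> /path/to/pinot-server.conf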

Pedro Silva

05/24/2021, 2:23 PM
That is the last line in the log from the server.

Mayank

05/24/2021, 2:24 PM
Yeah, but can you check if there was segment generation happening around that time?
At a high level, a 3GB heap for 16 partitions getting 50M events in a burst is low for heap as well as direct memory.

Pedro Silva

05/24/2021, 2:26 PM
I can't see anything referring to segment generation: https://pastebin.com/D8bxdZeA

Daniel Lavoie

05/24/2021, 2:26 PM
I think the relevant logs are only available inside the pod (stdout only shows WARN). The INFO file is lost on restart with the default configs.
m

Mayank

05/24/2021, 2:27 PM
If you want to optimize cost, then you can throttle the burst so the consumption event rate is lower, reduce the number of partitions in Kafka once the max event rate is low, and limit the number of parallel segment generations.

Pedro Silva

05/24/2021, 2:27 PM
I see some INFO level logs in stdout

Mayank

05/24/2021, 2:27 PM
Or else just add more vms 😀

Pedro Silva

05/24/2021, 2:28 PM
At this point, simply having a formula that lets me know the memory requirements and lets me size the servers accordingly would be good enough.

Mayank

05/24/2021, 2:29 PM
Yeah, there is a realtime provisioning tool in the docs.
Have you tried it?

Mayank

05/24/2021, 2:29 PM
Yes
Also, there is a doc about it.
It might not take care of the bursty nature of your events, but let's see what it proposes.

Pedro Silva

05/24/2021, 2:30 PM
I'll take a look and get back to you soon. Thank you both so much for the help.
Is this tool available as an image, or within an image?

Daniel Lavoie

05/24/2021, 2:32 PM
I think you should find it within the pinot image as a standalone script in bin.

Pedro Silva

05/24/2021, 2:55 PM
Is the tool meant to take a long time? It has been running for 5 minutes.
docker run --rm -v /home/pedro/dev/Pinot:/tmp/volume apachepinot/pinot:release-0.7.1 RealtimeProvisioningHelper \
-ingestionRate 1000 \
-numPartitions 16 \
-retentionHours 720 \
-numRows 50000000 \
-tableConfigFile /tmp/volume/specs/tables/HitExecutionView_REALTIME.json \
-schemaWithMetadataFile /tmp/volume/specs/schemas/HitExecutionView.json
Executing command: RealtimeProvisioningHelper -tableConfigFile /tmp/volume/specs/tables/HitExecutionView_REALTIME.json -numPartitions 16 -pushFrequency null -numHosts 2,4,6,8,10,12,14,16 -numHours 2,3,4,5,6,7,8,9,10,11,12 -schemaWithMetadataFile /tmp/volume/specs/schemas/HitExecutionView.json -numRows 50000000 -ingestionRate 1000 -maxUsableHostMemory 48G -retentionHours 720