
Pedro Silva

05/24/2021, 1:31 PM
Hello, I have a Pinot deployment done through K8s where I'm progressively adding fields to a realtime table. The deployment is very basic (a single server instance with a 6GB Java heap, a pod memory limit of 7GB, 100GB persistent storage, and deep storage enabled for segments), but I'm getting repeated server restarts because the pod keeps getting killed with OutOfMemory errors while ingesting data and creating segments. It seems the cause is not the JVM itself but off-heap memory maps; please see the following image for more details.
My question is how can I size the deployment of the server adequately, particularly how can I manage this off-heap usage?

Daniel Lavoie

05/24/2021, 1:35 PM
Server naturally uses off-heap
My rule of thumb is to leave 50% of the container memory to off-heap,
then adjust based on your metrics. You'll see with the metrics you have whether you have room for more heap and whether it would benefit, given the GC patterns.
Table configs will have an impact on your off-heap usage, so I would start from the 50% rule of thumb, then optimize based on the specifics of the tables running on the system.
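For example, on a 7GB pod that rule of thumb might look something like the sketch below (the values, and wiring the flags in through JAVA_OPTS, are illustrative assumptions, not the actual settings of this deployment):
# Hypothetical sizing for a 7GB server pod: ~50% heap, the rest left for
# off-heap (mmapped segments, direct buffers, metaspace, thread stacks).
export JAVA_OPTS="-Xms3G -Xmx3G -XX:+PrintGCDetails"
# Adjust up or down later based on the observed GC patterns and container usage.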

Pedro Silva

05/24/2021, 1:39 PM
Thank you for the feedback Daniel. What metrics in particular should I take a look at?

Daniel Lavoie

05/24/2021, 1:41 PM
Standard container and JVM metrics.
Compare the used memory from the K8s metrics with the JVM heap usage from the JMX exporter.
These two will tell you how much is available for non-heap.
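For example, something like the commands below can line the two numbers up (pod name, namespace, exporter port, and metric name are assumptions based on a typical JMX/Prometheus exporter setup):
# Container-level memory usage as seen by Kubernetes (requires metrics-server):
kubectl top pod pinot-server-0 -n pinot
# JVM heap usage scraped from the JMX/Prometheus exporter inside the pod:
kubectl exec -n pinot pinot-server-0 -- curl -s localhost:8008/metrics | grep 'jvm_memory_bytes_used{area="heap"'
# container usage minus JVM heap usage ~= what is actually going to non-heap.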

Pedro Silva

05/24/2021, 1:42 PM
From the grafana chart I put above, it seems Pinot is using more than 2x the memory for off-heap compared to the JVM heap, rather than ~100%. With such a high amount of memory-mapped usage, does it make sense to add more servers with the 50%-for-off-heap rule?

Daniel Lavoie

05/24/2021, 1:43 PM
This is not the metric I meant
This represents the size of the data mapped on disk.
What we want is the size of the in-memory pointers
A server can have TB of memory mapped data.
Also
What usually causes an OOM from K8s is that your JVM settings are too high for the actual K8s resource request.
Having too many elements off-heap will not cause an OOM, but disk swapping.
The OOM happens because the heap is using memory it thinks is available to the pod, but isn't, hence the OOM kill from K8s.
Tuning down the heap size usually fixes that issue.
You could confirm that by monitoring K8s resource request vs physical usage
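For example (pod name and namespace are placeholders):
# What the container requested / is limited to:
kubectl get pod pinot-server-0 -n pinot -o jsonpath='{.spec.containers[0].resources}'
# What it is physically using right now:
kubectl top pod pinot-server-0 -n pinot
# If usage approaches the limit while the heap is sized close to that limit,
# heap plus off-heap overhead simply cannot fit, and K8s kills the pod.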

Pedro Silva

05/24/2021, 1:51 PM
Ok, so if I understand correctly: the processes running in my server pod are trying to use more memory than the pod has available (limit: 7GB)?

Daniel Lavoie

05/24/2021, 1:51 PM
Exactly

Pedro Silva

05/24/2021, 1:51 PM
The JVM usage is as follows:
The memory of the pod (reddish line) goes to 175%, meaning my pod's limit should be at ~12.25GB (7 * 1.75).

Daniel Lavoie

05/24/2021, 1:53 PM
Just reduce the heap

Pedro Silva

05/24/2021, 1:53 PM
Does that sound about right Daniel?
To half of the limit of the pod?

Daniel Lavoie

05/24/2021, 1:53 PM
Yes

Pedro Silva

05/24/2021, 1:53 PM
Isn't there a concern that such a small heap won't be enough to hold the segments in memory for fast querying?

Daniel Lavoie

05/24/2021, 1:54 PM
offheap is fast

Pedro Silva

05/24/2021, 1:54 PM
Thank you for the assistance, I will try out this config.
If I may ask one more question: in what scenarios do you want to increase the pod memory? When does it make sense to scale the servers vertically (more memory) vs horizontally (more servers)?

Daniel Lavoie

05/24/2021, 1:58 PM
Having more servers means you reduce the impact of redistributing or redownloading segments

Mayank

05/24/2021, 1:58 PM
What is the event rate from your event stream, and how many partitions do you have?
Parts of consuming segments are on direct memory, which can OOM if there is not enough of it (unlike mmap).
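The ceiling for that direct memory is a standard JVM flag; the values below and the JAVA_OPTS wiring are illustrative assumptions:
# Consuming segments allocate direct (non-mmapped) off-heap buffers; if they
# outgrow the JVM's direct memory limit you get a direct-buffer OutOfMemoryError.
export JAVA_OPTS="-Xms3G -Xmx3G -XX:MaxDirectMemorySize=2G"
# Heap + direct memory + remaining off-heap must still fit under the pod limit.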

Pedro Silva

05/24/2021, 1:59 PM
16 partitions. It is a scheduled cron outputting ~50M entries daily. We are in the process of moving from the cron job to an event-based stream.

Daniel Lavoie

05/24/2021, 1:59 PM
Parts of consuming segments are on direct memory, which can OOM if there is not enough of it (unlike mmap).
That will cause a JVM OOM, not a K8s OOM.

Mayank

05/24/2021, 2:00 PM
Oh sorry, long thread, I assumed it was JVM OOM.

Pedro Silva

05/24/2021, 2:00 PM
How can I distinguish them, in K8s? Via the pod logs?

Daniel Lavoie

05/24/2021, 2:00 PM
Yes.
A heap OOM will be observed in the log as the usual dreaded heap exception.
The K8s OOM will just kill your container with no other mention than an OOM message in the pod's events.
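For example, the two can be told apart with the commands below (pod name and namespace are placeholders):
# K8s OOM: the container's last termination reason is OOMKilled.
kubectl get pod pinot-server-0 -n pinot -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
kubectl get events -n pinot | grep -i oom
# JVM OOM: the previous container's log ends with java.lang.OutOfMemoryError
# or points at an hs_err_pid*.log fatal error report.
kubectl logs --previous pinot-server-0 -n pinot | grep -iE 'OutOfMemoryError|fatal error'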

Mayank

05/24/2021, 2:05 PM
Could we confirm if it is JVM or pod OOM? Also if it is JVM, is it heap or direct memory OOM?

Pedro Silva

05/24/2021, 2:06 PM
#
# A fatal error has been detected by the Java Runtime Environment:
#
[thread 140238581155584 also had an error]
#  SIGBUS (0x7) at pc=0x00007f8c70a013c2, pid=9, tid=0x00007f8bd167c700
#
# JRE version: OpenJDK Runtime Environment (8.0_292-b10) (build 1.8.0_292-b10)
# Java VM: OpenJDK 64-Bit Server VM (25.292-b10 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# v  ~StubRoutines::jbyte_disjoint_arraycopy
#
# Core dump written. Default location: /opt/pinot/core or core.9
#
[thread 140238499997440 also had an error]
# An error report file with more information is saved as:
# /opt/pinot/hs_err_pid9.log
[thread 140237039417088 also had an error]
[thread 140238498944768 also had an error]
[thread 140238586418944 also had an error]
[thread 140238396491520 also had an error]
[thread 140238582208256 also had an error]
[thread 140238587471616 also had an error]
#
# If you would like to submit a bug report, please visit:
#   <http://bugreport.java.com/bugreport/crash.jsp>
#
Aborted (core dumped)
Got this with a 3GB java heap and a pod memory request of 6GB (limit 7GB)

Daniel Lavoie

05/24/2021, 2:06 PM
Oh!
That’s a JVM OOM
Good call Mayank

Pedro Silva

05/24/2021, 2:06 PM
Exit code 134, yeah, no logs though.

Mayank

05/24/2021, 2:07 PM
What does the hs_err.log say?
Likely it is because all 16 partitions are consuming in a burst and trying to allocate memory at the same time. If it is direct memory, then it is during consumption. If it is heap, then it is segment generation happening in parallel.
For direct memory, you could reduce the number of partitions and increase the JVM footprint. For heap, you can limit the number of segments generated in parallel.
You can also throttle the event rate instead of pumping 50M records per burst.

Pedro Silva

05/24/2021, 2:11 PM
Do you mean reducing the number of partitions in Kafka? Or the segmentPartitionConfig of the Pinot table?

Mayank

05/24/2021, 2:15 PM
Kafka. But let’s first find out if heap or direct memory

Pedro Silva

05/24/2021, 2:15 PM
I'm trying to access the log file, but the particular filesystem path where it is stored is not on a PVC.

Mayank

05/24/2021, 2:16 PM
Anything in server log?

Pedro Silva

05/24/2021, 2:18 PM
Exception in thread "HitExecutionView__12__49__20210524T1141Z" java.lang.InternalError: a fault occurred in a recent unsafe memory access operation in compiled Java code
Off-heap memory I assume?
Full trace:
Exception in thread "HitExecutionView__12__49__20210524T1141Z" java.lang.InternalError: a fault occurred in a recent unsafe memory access operation in compiled Java code
	at org.apache.pinot.segment.local.function.InbuiltFunctionEvaluator$FunctionExecutionNode.execute(InbuiltFunctionEvaluator.java:116)
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGBUS (0x7) at pc=0x00007f51c5052416, pid=9, tid=0x00007f50a5553700
#
[thread 139984278378240 also had an error]
# JRE version: OpenJDK Runtime Environment (8.0_292-b10) (build 1.8.0_292-b10)
# Java VM: OpenJDK 64-Bit Server VM (25.292-b10 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# v  ~StubRoutines::jbyte_disjoint_arraycopy
#
# Core dump written. Default location: /opt/pinot/core or core.9
#
# An error report file with more information is saved as:
# /opt/pinot/hs_err_pid9.log
	at org.apache.pinot.segment.local.function.InbuiltFunctionEvaluator.evaluate(InbuiltFunctionEvaluator.java:87)
	at org.apache.pinot.segment.local.recordtransformer.ExpressionTransformer.transform(ExpressionTransformer.java:95)
	at org.apache.pinot.segment.local.recordtransformer.CompositeTransformer.transform(CompositeTransformer.java:82)
	at org.apache.pinot.core.data.manager.realtime.LLRealtimeSegmentDataManager.processStreamEvents(LLRealtimeSegmentDataManager.java:509)
	at org.apache.pinot.core.data.manager.realtime.LLRealtimeSegmentDataManager.consumeLoop(LLRealtimeSegmentDataManager.java:416)
	at org.apache.pinot.core.data.manager.realtime.LLRealtimeSegmentDataManager$PartitionConsumer.run(LLRealtimeSegmentDataManager.java:556)
	at java.lang.Thread.run(Thread.java:748)
[thread 139984280483584 also had an error]
#
# If you would like to submit a bug report, please visit:
#   <http://bugreport.java.com/bugreport/crash.jsp>
#
Aborted (core dumped)
The failure appears to be in the partition consumer (LLRealtimeSegmentDataManager$PartitionConsumer.run).

Mayank

05/24/2021, 2:20 PM
Are you using any groovy functions?

Pedro Silva

05/24/2021, 2:21 PM
Yes

Mayank

05/24/2021, 2:21 PM
What does it do?

Pedro Silva

05/24/2021, 2:22 PM
It parses a time string in a weird format from a JSON field into the Java standard format and returns the milliseconds since epoch of that time.

Mayank

05/24/2021, 2:22 PM
Ok. My guess is a heap OOM.
Do you have logs from the server?
In the server config there is a way to limit the number of segments being flushed in parallel.
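A hedged sketch of that knob; the property name below is from memory and should be verified against the docs for your Pinot version, and the config file path is a placeholder:
# Caps how many consuming segments are built/flushed concurrently, which
# bounds the heap spike when many partitions complete at the same time.
echo "pinot.server.instance.realtime.max.parallel.segment.builds=2" >> /path/to/pinot-server.conf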

Pedro Silva

05/24/2021, 2:23 PM
That is the last line in the log from the server.

Mayank

05/24/2021, 2:24 PM
Yeah, but can you check if there was segment generation happening around that time?
At a high level, a 3GB heap for 16 partitions getting 50M events in a burst is low for heap as well as direct memory.

Pedro Silva

05/24/2021, 2:26 PM
I can't see anything referring to segment generation: https://pastebin.com/D8bxdZeA

Daniel Lavoie

05/24/2021, 2:26 PM
I think the relevant logs are only available inside the pod (stdout only shows WARN). The INFO file is lost on restart with the default configs.
m

Mayank

05/24/2021, 2:27 PM
If you want to optimize cost, then you can throttle the burst so the consumption event rate is lower, reduce the number of partitions in Kafka once the max event rate is low, and limit the number of parallel segment generations.

Pedro Silva

05/24/2021, 2:27 PM
I see some INFO level logs in stdout

Mayank

05/24/2021, 2:27 PM
Or else just add more vms 😀

Pedro Silva

05/24/2021, 2:28 PM
At this point, simply having a formula that lets me know the memory requirements and lets me size the servers accordingly would be good enough.

Mayank

05/24/2021, 2:29 PM
Yeah, there is a realtime provisioning tool in the docs.
Have you tried it?

Mayank

05/24/2021, 2:29 PM
Yes
Also, there is a doc about it.
It might not take care of the bursty nature of your events, but let's see what it proposes.

Pedro Silva

05/24/2021, 2:30 PM
I'll take a look and get back to you soon. Thank you both so much for the help.
Is this tool available as an image, or within an image?

Daniel Lavoie

05/24/2021, 2:32 PM
I think you should find it within the pinot image as a standalone script in bin.

Pedro Silva

05/24/2021, 2:55 PM
Is the tool meant to take a long time? It has been running for 5 minutes.
docker run --rm -v /home/pedro/dev/Pinot:/tmp/volume apachepinot/pinot:release-0.7.1 RealtimeProvisioningHelper \
-ingestionRate 1000 \
-numPartitions 16 \
-retentionHours 720 \
-numRows 50000000 \
-tableConfigFile /tmp/volume/specs/tables/HitExecutionView_REALTIME.json \
-schemaWithMetadataFile /tmp/volume/specs/schemas/HitExecutionView.json
Executing command: RealtimeProvisioningHelper -tableConfigFile /tmp/volume/specs/tables/HitExecutionView_REALTIME.json -numPartitions 16 -pushFrequency null -numHosts 2,4,6,8,10,12,14,16 -numHours 2,3,4,5,6,7,8,9,10,11,12 -schemaWithMetadataFile /tmp/volume/specs/schemas/HitExecutionView.json -numRows 50000000 -ingestionRate 1000 -maxUsableHostMemory 48G -retentionHours 720