# troubleshooting
s
@srisudha it will be useful to know a few things about your use case. How many partitions are you ingesting? What is a rough ingestion rate? How many segments did any one partition (say, 0) make before you reached OOM? Realtime completed segments respect the `loadMode` setting, so if you have set that to `HEAP`, I suggest you move it to `MMAP` and restart your servers. Realtime servers have a setting `pinot.server.instance.realtime.alloc.offheap`. Setting this to `true` makes sure that we use as little heap as possible during consumption, and memory-map files for the rest. If you do not want memory map (and want to use direct memory instead), you can set `pinot.server.instance.realtime.alloc.offheap.direct` to `true`, but I don't think you have set this config. If you have, then please remove it.
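For reference, a minimal sketch of the server-side settings referenced above — the property names come from this thread, while the values shown are illustrative assumptions, not the user's actual files:

```
# pinot-server.conf (sketch; values are illustrative)
# Use off-heap, memory-mapped buffers for CONSUMING segments instead of heap.
pinot.server.instance.realtime.alloc.offheap=true
# Leave this false (or unset) unless you explicitly want direct memory for consuming segments.
pinot.server.instance.realtime.alloc.offheap.direct=false
```

Completed segments honor the table config's `loadMode` (e.g. `"loadMode": "MMAP"`, typically under `tableIndexConfig`), which is a separate setting from the two server properties above.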
s
Hi @Subbu Subramaniam, sorry I didn't see this message earlier. We have 3 partitions, the ingestion rate is 5k, and replicas per partition is 3. Load mode is MMAP, and we want to stay with MMAP; we don't want to go with off-heap.
And the first set of segments, which is 6 of them, got created successfully on all three servers. During segment creation the second time, OOM happened on one server; the other servers were fine and went ahead with their segment creation.
And one more thing: setting the flag you mentioned, `pinot.server.instance.realtime.alloc.offheap`, to true makes sure we use off-heap, right? Wouldn't it be more memory efficient if we go with just MMAP and leave this as false?
@Subbu Subramaniam
s
@srisudha If you have 3 replicas and 3 partitions, you should have created 9 segments in the first round. Not sure how you got only 6. The offheap setting I mentioned is for consuming segments only. Once the segments are committed (completed segments), they move to honor your loadMode setting. Your OOM stack (if I recollect right) seems to be from offheap direct memory. You can take another look at the stack to confirm it. It looks like you need to either increase the number of servers or increase memory per server regardless. If you can give the following information I may be able to help further:
1. Your table config
2. The command line and the output of the realtime provisioning tool
3. The OOM stack
4. The JVM arguments
s
For #2, here is the output from the command line tool:

Memory used per host
numHosts --> 2 | 3
numHours
12 --------> 4.81G | 2.88G
24 --------> 8.08G | 4.85G

Optimal segment size
numHosts --> 2 | 3
numHours
12 --------> 49.36M | 49.36M
24 --------> 98.72M | 98.72M

Consuming memory
numHosts --> 2 | 3
numHours
12 --------> 3.6G | 2.16G
24 --------> 7.12G | 4.27G
For #3 OOM stack
java.lang.reflect.InvocationTargetException: null
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_252]
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_252]
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_252]
    at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_252]
    at org.apache.helix.messaging.handling.HelixStateTransitionHandler.invoke(HelixStateTransitionHandler.java:404) ~[pinot-all-0.4.0-SNAPSHOT-jar-with-dependencies.jar:0.4.0-SNAPSHOT-ec03154343df4831e33092a247505ef0af3d9daf]
    at org.apache.helix.messaging.handling.HelixStateTransitionHandler.handleMessage(HelixStateTransitionHandler.java:331) [pinot-all-0.4.0-SNAPSHOT-jar-with-dependencies.jar:0.4.0-SNAPSHOT-ec03154343df4831e33092a247505ef0af3d9daf]
    at org.apache.helix.messaging.handling.HelixTask.call(HelixTask.java:97) [pinot-all-0.4.0-SNAPSHOT-jar-with-dependencies.jar:0.4.0-SNAPSHOT-ec03154343df4831e33092a247505ef0af3d9daf]
    at org.apache.helix.messaging.handling.HelixTask.call(HelixTask.java:49) [pinot-all-0.4.0-SNAPSHOT-jar-with-dependencies.jar:0.4.0-SNAPSHOT-ec03154343df4831e33092a247505ef0af3d9daf]
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_252]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_252]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_252]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_252]
Caused by: java.lang.OutOfMemoryError: Direct buffer memory
    at java.nio.Bits.reserveMemory(Bits.java:694) ~[?:1.8.0_252]
    at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123) ~[?:1.8.0_252]
    at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311) ~[?:1.8.0_252]
    at org.apache.pinot.core.segment.memory.PinotByteBuffer.allocateDirect(PinotByteBuffer.java:41) ~[pinot-all-0.4.0-SNAPSHOT-jar-with-dependencies.jar:0.4.0-SNAPSHOT-ec03154343df4831e33092a247505ef0af3d9daf]
    at org.apache.pinot.core.segment.memory.PinotDataBuffer.allocateDirect(PinotDataBuffer.java:116) ~[pinot-all-0.4.0-SNAPSHOT-jar-with-dependencies.jar:0.4.0-SNAPSHOT-ec03154343df4831e33092a247505ef0af3d9daf]
    at org.apache.pinot.core.io.writer.impl.DirectMemoryManager.allocateInternal(DirectMemoryManager.java:53) ~[pinot-all-0.4.0-SNAPSHOT-jar-with-dependencies.jar:0.4.0-SNAPSHOT-ec03154343df4831e33092a247505ef0af3d9daf]
    at org.apache.pinot.core.io.readerwriter.RealtimeIndexOffHeapMemoryManager.allocate(RealtimeIndexOffHeapMemoryManager.java:79) ~[pinot-all-0.4.0-SNAPSHOT-jar-with-dependencies.jar:0.4.0-SNAPSHOT-ec03154343df4831e33092a247505ef0af3d9daf]
    at org.apache.pinot.core.io.readerwriter.impl.FixedByteSingleColumnSingleValueReaderWriter.addBuffer(FixedByteSingleColumnSingleValueReaderWriter.java:179) ~[pinot-all-0.4.0-SNAPSHOT-jar-with-dependencies.jar:0.4.0-SNAPSHOT-ec03154343df4831e33092a247505ef0af3d9daf]
    at org.apache.pinot.core.io.readerwriter.impl.FixedByteSingleColumnSingleValueReaderWriter.<init>(FixedByteSingleColumnSingleValueReaderWriter.java:71) ~[pinot-all-0.4.0-SNAPSHOT-jar-with-dependencies.jar:0.4.0-SNAPSHOT-ec03154343df4831e33092a247505ef0af3d9daf]
    at org.apache.pinot.core.indexsegment.mutable.MutableSegmentImpl.<init>(MutableSegmentImpl.java:273) ~[pinot-all-0.4.0-SNAPSHOT-jar-with-dependencies.jar:0.4.0-SNAPSHOT-ec03154343df4831e33092a247505ef0af3d9daf]
    at org.apache.pinot.core.data.manager.realtime.LLRealtimeSegmentDataManager.<init>(LLRealtimeSegmentDataManager.java:1206) ~[pinot-all-0.4.0-SNAPSHOT-jar-with-dependencies.jar:0.4.0-SNAPSHOT-ec03154343df4831e33092a247505ef0af3d9daf]
    at org.apache.pinot.core.data.manager.realtime.RealtimeTableDataManager.addSegment(RealtimeTableDataManager.java:262) ~[pinot-all-0.4.0-SNAPSHOT-jar-with-dependencies.jar:0.4.0-SNAPSHOT-ec03154343df4831e33092a247505ef0af3d9daf]
    at org.apache.pinot.server.starter.helix.HelixInstanceDataManager.addRealtimeSegment(HelixInstanceDataManager.java:132) ~[pinot-all-0.4.0-SNAPSHOT-jar-with-dependencies.jar:0.4.0-SNAPSHOT-ec03154343df4831e33092a247505ef0af3d9daf]
    at org.apache.pinot.server.starter.helix.SegmentOnlineOfflineStateModelFactory$SegmentOnlineOfflineStateModel.onBecomeOnlineFromOffline(SegmentOnlineOfflineStateModelFactory.java:164) ~[pinot-all-0.4.0-SNAPSHOT-jar-with-dependencies.jar:0.4.0-SNAPSHOT-ec03154343df4831e33092a247505ef0af3d9daf]
    at org.apache.pinot.server.starter.helix.SegmentOnlineOfflineStateModelFactory$SegmentOnlineOfflineStateModel.onBecomeConsumingFromOffline(SegmentOnlineOfflineStateModelFactory.java:88) ~[pinot-all-0.4.0-SNAPSHOT-jar-with-dependencies.jar:0.4.0-SNAPSHOT-ec03154343df4831e33092a247505ef0af3d9daf]
    ... 12 more
2020/06/02 091007.649 ERROR [StateModel] [HelixTaskExecutor-message_handle_thread] Default rollback method invoked on error. Error Code: ERROR
2020/06/02 091008.049 ERROR [HelixTask] [HelixTaskExecutor-message_handle_thread] Message execution failed. msgId: 15ddbe9a-0b8c-4b07-a732-c24b4ab1d5bd, errorMsg: java.lang.reflect.InvocationTargetException
For #4, the JVM args:
jvmOpts: "-Xms4G -Xmx4G -XX:MaxDirectMemorySize=10g"
RAM: 26GB; 3 servers, 3 partitions, 3 replicas for each partition, and 7 cores per server
And yes, you are right. I should be seeing 9 segments after the first round of segment creation; I will double check.
s
Please include the arguments you gave to the realtime provisioning tool command (maybe it is a good improvement to just print out the arguments).
@srisudha is it possible that the sample segment you provided for the realtime tool is not a representative sample? Or was your table config changed after you ran the command? @Neha Pawar it looks like they have 90d retention (assuming a realtime-only table), and the tool points to using 10G of memory with 3 servers, but it looks like they run out of memory in the second segment build of a partition.
can you re-run the command with the segment that you have generated now, and the table config as you have it now? thanks.
s
Hi @Subbu Subramaniam, the 6 is per server. What it means is that the number of consuming plus completed segments after the first segment creation step would be 6 per server. In other words, a total of 9 segments completed.
s
@Subbu Subramaniam Provisioning tool cmd, output and table config file
s
Do you also have offline push for this table? Or is it a realtime-only table? Your table config indicates daily offline push, so it is assumed that we need to retain things in memory for a maximum of 72 hours. Also, the tool computes retained memory, not allocated memory (could be an improvement in the tool), so if you start to use the mmap setting, things will get better. Do you think the segment you provided is a valid sample? If not, can you run it with one of the more recent segments generated? If you don't have offline push at all, then the equation is way off. You need to specify the argument `-retentionHours 2160` since you have 90d retention. And then the number of hosts will be many more.
s
Oh! Which part of the config says to retain data for 72 hours? Ours is a realtime use case, no offline push. We are using load mode as MMAP and whatever other config is required to enable MMAP??
The segment is the most recent one.
We executed it today.
n
@srisudha if you do not provide the `-retentionHours` option, the default value is 72. So you need to provide the option `-retentionHours 2160` when you run the command.
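For reference, a sketch of what the re-run might look like — `-retentionHours` is the option named in this thread, while the tool name, the other flag names, and the paths below are assumptions; check the tool's help output for the exact options in your release:

```
# Hypothetical invocation; adjust paths, partitions, and host/hour lists to your setup.
bin/pinot-admin.sh RealtimeProvisioningHelper \
  -tableConfigFile /path/to/myTable_REALTIME_table_config.json \
  -numPartitions 3 \
  -numHosts 2,3 \
  -numHours 12,24 \
  -sampleCompletedSegmentDir /path/to/a/recent/completed/segment \
  -retentionHours 2160
```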
s
Our table is realtime only. I was not aware of the retentionHours option; I was assuming it would pick that from the table config. Will try with it 👍
s
@Neha Pawar we should print out all default values before using them
@srisudha I had mentioned the configuration `pinot.server.instance....` a while ago for consuming segments.
s
Removed the segment push configs and provided "-retentionHours 2160".
s
OK, so now you need 29 or 31G of memory per host.
This is if you use mmap for consuming segments: `pinot.server.instance.realtime.alloc.offheap=true`. The default is false, in which case you are using direct memory for CONSUMING segments (and then using MMAP for loading them when they are completed, as per your config).
Another point: I am not sure why you specify `realtime.segment.flush.autotune.initialRows`; please remove it, and let the default ramp-up run its course. Your flush threshold time is set to 240h. Did you intend to set it to `24h`? I recommend setting it to 24h.
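A sketch of how the relevant streamConfigs entries might look after that change — these are standard Pinot low-level stream config keys, the Kafka connection settings are omitted, and the values reflect the recommendations in this thread (24h flush time, ~100M segments) rather than the user's actual table config; setting the row threshold to 0 so that segment size drives the flush is an assumption about how these keys interact:

```
"streamConfigs": {
  "realtime.segment.flush.threshold.time": "24h",
  "realtime.segment.flush.threshold.size": "0",
  "realtime.segment.flush.desired.size": "100M"
}
```

Note that `realtime.segment.flush.autotune.initialRows` is intentionally absent here, per the advice above.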
You should also use the recently introduced rounding functions to transform your time in milliseconds to be rounded to (say) the nearest minute. That will decrease the dictionary size a lot. @Neha Pawar can help modify your schema to do that.
You should use https://github.com/apache/incubator-pinot/pull/5575 when you are ready to move to a new release. It will be available in 0.5.0. Until then, if you can change your input values to Kafka so that the timestamp is rounded to the nearest minute (or 5 minutes), that will work. Or, if that works for you, change the time column to be in minutesSinceEpoch and use the rounding function already available to do that (this is there in 0.4.0).
Or, use the Groovy function for now, and then move to use the rounding function.
n
"dateTimeFieldSpecs": [
        {
          "name": "roundedTimeSinceEpoch",
          "dataType": "LONG",
          "format": "1:MILLSECONDS:EPOCH",
          "granularity": "15:MINUTES",
          "transformFunction": "Groovy({(timeSinceEpoch/900000)*900000}, timeSinceEpoch)"
        }
      ]
s
@Subbu Subramaniam just to clarify: the retention time of 3 months is for all completed segments. Now, as I understand it, the tool is trying to say that to load all of the segments in memory it needs 30 GB, correct? But using MMAP this should decrease drastically, shouldn't it? We don't need all the segments in memory all the time. Can you clarify this please? And the configuration you gave, pinot.server.instance.realtime.alloc.offheap=true, makes sure we use off-heap for consuming segments; so in this case, as per the tool, we would be needing 5 GB per host. Am I reading it right? And if we use MMAP, then won't this reduce?
s
The tool calculates total resident memory, not mapped memory. In your case (since you don't have any offline segments) resident memory is the same as mapped memory. It computes that the memory needed per server is 30G with segment size set to 100M with 3 hosts (according to the output that Shounak posted).
You can choose to have this 30G as direct memory (which you have said you don't want) or mmapped, which is the suggested and recommended way.
Note that the tool is an approximation. It has no way of predicting what is needed, so it makes some assumptions based on the sample segment you provide.
s
@Subbu Subramaniam I set the flag `pinot.server.instance.realtime.alloc.offheap=true`, but when I created the table, in the consumer folder under the table directory there were files for all 9 consuming segments (3 partitions * 3 replicas), each of 512MB. A few doubts: 1. Why is it a 512MB file? 2. Why are all replica files present on each server even if that server doesn't have that consumer?
s
1. 512MB files are mapped, but not all of that is used. Try the command `du -sh` to see how much is actually used. We map 512MB at a time and allocate from that to minimize fragmentation and limit the number of file handles. 2. How did you get the impression that all replicas are in each server? Each consuming segment may have multiple files if your allocation exceeds 512MB. So you will see files with extensions `.0`, `.1`, `.2`, etc. created as needed.
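A quick way to see the difference between the mapped (apparent) size and the space actually used — the directory below is a hypothetical example of a server's consumers folder, not a path taken from this thread:

```
# Listing shows the pre-allocated 512MB buffer files (.0, .1, ... per consuming segment)
ls -lh /path/to/server/dataDir/myTable_REALTIME/consumers/

# du reports the blocks actually written, which is usually far less than the mapped 512MB
du -sh /path/to/server/dataDir/myTable_REALTIME/consumers/*
```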
s
@Subbu Subramaniam I shared an image
from the Azure storage of one server, which is a snapshot of the consumers folder. All of these are consuming segments, and it shows all 9 consuming segments on one server. The same is the case on the second server as well.
And we had 4GB of persistent volume.
At this point, a few of the consuming segments were in error state,
and the exception from the servers indicated that there is no space on disk.
One more thing: mmap was also 4GB.
s
Then you need more disk.
s
Yes, that is a given. But the point is, while debugging we found those consumer folders have all 9 consuming segments.
s
@Subbu Subramaniam yes, they were extensions of a single consumer. Got the clarity now. Thanks a lot for all the inputs and suggestions!
s
Thanks a lot! @Subbu Subramaniam and @Mayank
To conclude this thread: by default, consuming segments would be using direct buffers, and for our setup 10 GB of direct buffers is too little, hence the OOM was coming up. Changing consuming segments to use MMAP, and making sure the secondary storage has enough space, solves the memory issue.
👍 2
s
@srisudha and @Shounak Kulkarni looks like you got your use case up and running. That is great. I have created https://github.com/apache/incubator-pinot/issues/5588 which will hopefully make the next experience better. I have a request of you. Can you please (1) add other points to the issue that you think would have made the onboarding better/easier, (2) help us with fixing the issue (the code change is minimal), and (3) write a blog about using Pinot? Thanks
s
Yes Subbu, definitely. Already started with the blog; will publish it soon 😁
m
Nice! What's the best-case QPS and throughput you were able to achieve, @srisudha? Looking forward to the blog, @Shounak Kulkarni.
s
QPS and throughput are the same as before: 5000 ingestion plus 3000 queries on the 3-server setup described above. The 95th percentile is 4.5, 3.7, 7.5 ms. There are two brokers; their 95th percentile is 11.4, 10.4 ms. Now that the OOM issue is resolved, we still have a few more resiliency test failures to debug.
And we have a few more backlog items on Pinot to keep us going.
Just to add, this is over 200 million records and about 6 hours of PT.
👍 1
s
What is PT?
s
performance testing