# troubleshooting
s
@srisudha it will be useful to know a few things about your use case. How many partitions are you ingesting? What is a rough ingestion rate? How many segments did any one partition (say, 0) make before you reached OOM? Realtime completed segments respect the `loadMode` setting, so if you have set that to `HEAP`, I suggest you move it to `MMAP` and restart your servers. Realtime servers have a setting `pinot.server.instance.realtime.alloc.offheap`. Setting this to `true` makes sure that we use as little heap as possible during consumption, and memory-map files for the rest. If you do not want memory map (and want to use direct memory instead), you can set `pinot.server.instance.realtime.alloc.offheap.direct` to `true`, but I don't think you have set this config. If you have, then please remove it.
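For reference, a minimal sketch of the server-side settings referenced above — the property names come from this thread, while the values shown are illustrative assumptions, not the user's actual files:

```
# pinot-server.conf (sketch; values are illustrative)
# Use off-heap, memory-mapped buffers for CONSUMING segments instead of heap.
pinot.server.instance.realtime.alloc.offheap=true
# Leave this false (or unset) unless you explicitly want direct memory for consuming segments.
pinot.server.instance.realtime.alloc.offheap.direct=false
```

Completed segments honor the table config's `loadMode` (e.g. `"loadMode": "MMAP"`, typically under `tableIndexConfig`), which is a separate setting from the two server properties above.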
s
Hi @Subbu Subramaniam, sorry I didn't see this message earlier. We have 3 partitions, the ingestion rate is 5k, and replicas per partition is 3. Load mode is MMAP, and we want to stay with MMAP; we don't want to go with off-heap.
And the first set of segments, which is 6 of them, got created successfully on all three servers. During segment creation the second time, OOM happened on one server; the other servers were fine and went ahead with their segment creation.
And one more thing: setting the flag you mentioned, `pinot.server.instance.realtime.alloc.offheap`, to true makes sure we use off-heap, right? Wouldn't it be more memory efficient if we go with just MMAP and leave this as false?
@Subbu Subramaniam
s
@srisudha If you have 3 replicas and 3 partitions, you should have created 9 segments in the first round. Not sure how you got only 6. The offheap setting I mentioned is for consuming segments only. Once the segments are committed (completed segments), they move to honor your loadMode setting. Your OOM stack (if I recollect right) seems to be from offheap direct memory. You can take another look at the stack to confirm it. It looks like you need to either increase the number of servers or increase memory per server regardless. If you can give the following information I may be able to help further:
1. Your table config
2. The command line and the output of the realtime provisioning tool
3. The OOM stack
4. The JVM arguments
s
For #2, here is the output from the command line tool:

Memory used per host
numHosts --> 2 | 3
numHours
12 --------> 4.81G | 2.88G
24 --------> 8.08G | 4.85G

Optimal segment size
numHosts --> 2 | 3
numHours
12 --------> 49.36M | 49.36M
24 --------> 98.72M | 98.72M

Consuming memory
numHosts --> 2 | 3
numHours
12 --------> 3.6G | 2.16G
24 --------> 7.12G | 4.27G
For #3 OOM stack
java.lang.reflect.InvocationTargetException: null
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_252]
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_252]
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_252]
    at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_252]
    at org.apache.helix.messaging.handling.HelixStateTransitionHandler.invoke(HelixStateTransitionHandler.java:404) ~[pinot-all-0.4.0-SNAPSHOT-jar-with-dependencies.jar:0.4.0-SNAPSHOT-ec03154343df4831e33092a247505ef0af3d9daf]
    at org.apache.helix.messaging.handling.HelixStateTransitionHandler.handleMessage(HelixStateTransitionHandler.java:331) [pinot-all-0.4.0-SNAPSHOT-jar-with-dependencies.jar:0.4.0-SNAPSHOT-ec03154343df4831e33092a247505ef0af3d9daf]
    at org.apache.helix.messaging.handling.HelixTask.call(HelixTask.java:97) [pinot-all-0.4.0-SNAPSHOT-jar-with-dependencies.jar:0.4.0-SNAPSHOT-ec03154343df4831e33092a247505ef0af3d9daf]
    at org.apache.helix.messaging.handling.HelixTask.call(HelixTask.java:49) [pinot-all-0.4.0-SNAPSHOT-jar-with-dependencies.jar:0.4.0-SNAPSHOT-ec03154343df4831e33092a247505ef0af3d9daf]
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_252]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_252]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_252]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_252]
Caused by: java.lang.OutOfMemoryError: Direct buffer memory
    at java.nio.Bits.reserveMemory(Bits.java:694) ~[?:1.8.0_252]
    at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123) ~[?:1.8.0_252]
    at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311) ~[?:1.8.0_252]
    at org.apache.pinot.core.segment.memory.PinotByteBuffer.allocateDirect(PinotByteBuffer.java:41) ~[pinot-all-0.4.0-SNAPSHOT-jar-with-dependencies.jar:0.4.0-SNAPSHOT-ec03154343df4831e33092a247505ef0af3d9daf]
    at org.apache.pinot.core.segment.memory.PinotDataBuffer.allocateDirect(PinotDataBuffer.java:116) ~[pinot-all-0.4.0-SNAPSHOT-jar-with-dependencies.jar:0.4.0-SNAPSHOT-ec03154343df4831e33092a247505ef0af3d9daf]
    at org.apache.pinot.core.io.writer.impl.DirectMemoryManager.allocateInternal(DirectMemoryManager.java:53) ~[pinot-all-0.4.0-SNAPSHOT-jar-with-dependencies.jar:0.4.0-SNAPSHOT-ec03154343df4831e33092a247505ef0af3d9daf]
    at org.apache.pinot.core.io.readerwriter.RealtimeIndexOffHeapMemoryManager.allocate(RealtimeIndexOffHeapMemoryManager.java:79) ~[pinot-all-0.4.0-SNAPSHOT-jar-with-dependencies.jar:0.4.0-SNAPSHOT-ec03154343df4831e33092a247505ef0af3d9daf]
    at org.apache.pinot.core.io.readerwriter.impl.FixedByteSingleColumnSingleValueReaderWriter.addBuffer(FixedByteSingleColumnSingleValueReaderWriter.java:179) ~[pinot-all-0.4.0-SNAPSHOT-jar-with-dependencies.jar:0.4.0-SNAPSHOT-ec03154343df4831e33092a247505ef0af3d9daf]
    at org.apache.pinot.core.io.readerwriter.impl.FixedByteSingleColumnSingleValueReaderWriter.<init>(FixedByteSingleColumnSingleValueReaderWriter.java:71) ~[pinot-all-0.4.0-SNAPSHOT-jar-with-dependencies.jar:0.4.0-SNAPSHOT-ec03154343df4831e33092a247505ef0af3d9daf]
    at org.apache.pinot.core.indexsegment.mutable.MutableSegmentImpl.<init>(MutableSegmentImpl.java:273) ~[pinot-all-0.4.0-SNAPSHOT-jar-with-dependencies.jar:0.4.0-SNAPSHOT-ec03154343df4831e33092a247505ef0af3d9daf]
    at org.apache.pinot.core.data.manager.realtime.LLRealtimeSegmentDataManager.<init>(LLRealtimeSegmentDataManager.java:1206) ~[pinot-all-0.4.0-SNAPSHOT-jar-with-dependencies.jar:0.4.0-SNAPSHOT-ec03154343df4831e33092a247505ef0af3d9daf]
    at org.apache.pinot.core.data.manager.realtime.RealtimeTableDataManager.addSegment(RealtimeTableDataManager.java:262) ~[pinot-all-0.4.0-SNAPSHOT-jar-with-dependencies.jar:0.4.0-SNAPSHOT-ec03154343df4831e33092a247505ef0af3d9daf]
    at org.apache.pinot.server.starter.helix.HelixInstanceDataManager.addRealtimeSegment(HelixInstanceDataManager.java:132) ~[pinot-all-0.4.0-SNAPSHOT-jar-with-dependencies.jar:0.4.0-SNAPSHOT-ec03154343df4831e33092a247505ef0af3d9daf]
    at org.apache.pinot.server.starter.helix.SegmentOnlineOfflineStateModelFactory$SegmentOnlineOfflineStateModel.onBecomeOnlineFromOffline(SegmentOnlineOfflineStateModelFactory.java:164) ~[pinot-all-0.4.0-SNAPSHOT-jar-with-dependencies.jar:0.4.0-SNAPSHOT-ec03154343df4831e33092a247505ef0af3d9daf]
    at org.apache.pinot.server.starter.helix.SegmentOnlineOfflineStateModelFactory$SegmentOnlineOfflineStateModel.onBecomeConsumingFromOffline(SegmentOnlineOfflineStateModelFactory.java:88) ~[pinot-all-0.4.0-SNAPSHOT-jar-with-dependencies.jar:0.4.0-SNAPSHOT-ec03154343df4831e33092a247505ef0af3d9daf]
    ... 12 more
2020/06/02 091007.649 ERROR [StateModel] [HelixTaskExecutor-message_handle_thread] Default rollback method invoked on error. Error Code: ERROR
2020/06/02 091008.049 ERROR [HelixTask] [HelixTaskExecutor-message_handle_thread] Message execution failed. msgId: 15ddbe9a-0b8c-4b07-a732-c24b4ab1d5bd, errorMsg: java.lang.reflect.InvocationTargetException
For #4, the JVM args:
jvmOpts: "-Xms4G -Xmx4G -XX:MaxDirectMemorySize=10g"
RAM: 26GB; 3 servers, 3 partitions, 3 replicas for each partition, and 7 cores per server
And yes, you are right. I should be seeing 9 segments after the first round of segment creation; I will double check.
s
Please include the arguments you gave to the realtime provisioning tool command (maybe it is a good improvement to just print out the arguments).
@srisudha is it possible that the sample segment you provided for the realtime tool is not a representative sample? Or was your table config changed after you ran the command? @Neha Pawar it looks like they have 90d retention (assuming a realtime-only table), and the tool points to using 10G of memory with 3 servers, but it looks like they run out of memory in the second segment build of a partition.
can you re-run the command with the segment that you have generated now, and the table config as you have it now? thanks.
s
Hi @Subbu Subramaniam, the 6 is per server. What it means is that the number of consuming plus completed segments after the first segment creation step would be 6 per server. In other words, a total of 9 segments completed.
s
@Subbu Subramaniam Provisioning tool cmd, output and table config file
s
Do you also have offline push for this table? Or is it a realtime-only table? Your table config indicates daily offline push, so it is assumed that we need to retain things in memory for a maximum of 72 hours. Also, the tool computes retained memory, not allocated memory (could be an improvement in the tool), so if you start to use the mmap setting, things will get better. Do you think the segment you provided is a valid sample? If not, can you run it with one of the more recent segments generated? If you don't have offline push at all, then the equation is way off. You need to specify the argument `-retentionHours 2160` since you have 90d retention. And then the number of hosts will be many more.
s
Oh! Which part of the config says to retain data for 72 hours? Ours is a realtime use case, no offline push. We are using load mode as MMAP and whatever other config is required to enable MMAP??
The segment is the most recent one.
We executed it today.
n
@srisudha if you do not provide the `-retentionHours` option, the default value is 72. So you need to provide the option `-retentionHours 2160` when you run the command.
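For reference, a sketch of what the re-run might look like — `-retentionHours` is the option named in this thread, while the tool name, the other flag names, and the paths below are assumptions; check the tool's help output for the exact options in your release:

```
# Hypothetical invocation; adjust paths, partitions, and host/hour lists to your setup.
bin/pinot-admin.sh RealtimeProvisioningHelper \
  -tableConfigFile /path/to/myTable_REALTIME_table_config.json \
  -numPartitions 3 \
  -numHosts 2,3 \
  -numHours 12,24 \
  -sampleCompletedSegmentDir /path/to/a/recent/completed/segment \
  -retentionHours 2160
```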
s
Our table is realtime only. I was not aware of the retentionHours option; I was assuming it would pick that from the table config. Will try with it 👍
s
@Neha Pawar we should print out all default values before using them
@srisudha I had mentioned the configuration `pinot.server.instance....` a while ago for consuming segments.
s
Removed the segment push configs and provided "-retentionHours 2160".
s
OK, so now you need 29 or 31G of memory per host.
This is if you use mmap for consuming segments: `pinot.server.instance.realtime.alloc.offheap=true`. The default is false, in which case you are using direct memory for CONSUMING segments (and then using MMAP for loading them when they are completed, as per your config).
Another point: I am not sure why you specify `realtime.segment.flush.autotune.initialRows`; please remove it, and let the default ramp-up run its course. Your flush threshold time is set to 240h. Did you intend to set it to `24h`? I recommend setting it to 24h.
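A sketch of how the relevant streamConfigs entries might look after that change — these are standard Pinot low-level stream config keys, the Kafka connection settings are omitted, and the values reflect the recommendations in this thread (24h flush time, ~100M segments) rather than the user's actual table config; setting the row threshold to 0 so that segment size drives the flush is an assumption about how these keys interact:

```
"streamConfigs": {
  "realtime.segment.flush.threshold.time": "24h",
  "realtime.segment.flush.threshold.size": "0",
  "realtime.segment.flush.desired.size": "100M"
}
```

Note that `realtime.segment.flush.autotune.initialRows` is intentionally absent here, per the advice above.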
You should also use the recently introduced rounding functions to transform your time in milliseconds to be rounded to (say) the nearest minute. That will decrease the dictionary size a lot. @Neha Pawar can help modify your schema to do that.
You should use https://github.com/apache/incubator-pinot/pull/5575 when you are ready to move to a new release. It will be available in 0.5.0. Until then, if you can change your input values to Kafka so that the timestamp is rounded to the nearest minute (or 5 minutes), that will work. Or, if that works for you, change the time column to be in minutesSinceEpoch and use the rounding function already available to do that (this is there in 0.4.0).
Or, use the Groovy function for now, and then move to use the rounding function.
n
"dateTimeFieldSpecs": [
        {
          "name": "roundedTimeSinceEpoch",
          "dataType": "LONG",
          "format": "1:MILLSECONDS:EPOCH",
          "granularity": "15:MINUTES",
          "transformFunction": "Groovy({(timeSinceEpoch/900000)*900000}, timeSinceEpoch)"
        }
      ]
s
@Subbu Subramaniam just to clarify: the retention time of 3 months is for all completed segments. Now, as I understand it, the tool is trying to say that to load all of the segments in memory it needs 30 GB, correct? But using MMAP this should decrease drastically, shouldn't it? We don't need all the segments in memory all the time. Can you clarify this please? And the configuration you gave, pinot.server.instance.realtime.alloc.offheap=true, makes sure we use off-heap for consuming segments; so in this case, as per the tool, we would be needing 5 GB per host. Am I reading it right? And if we use MMAP, then won't this reduce?
s
The tool calculates total resident memory, not mapped memory. In your case (since you don't have any offline segments) resident memory is the same as mapped memory. It computes that the memory needed per server is 30G with segment size set to 100M with 3 hosts (according to the output that Shounak posted).
You can choose to have this 30G as direct memory (which you have said you don't want) or mmapped, which is the suggested and recommended way.
Note that the tool is an approximation. It has no way of predicting what is needed, so it makes some assumptions based on the sample segment you provide.
s
@Subbu Subramaniam I set the flag `pinot.server.instance.realtime.alloc.offheap=true`, but when I created the table, in the consumer folder under the table directory there were files for all 9 consuming segments (3 partitions * 3 replicas), each of 512MB. A few doubts: 1. Why is it a 512MB file? 2. Why are all replica files present on each server even if that server doesn't have that consumer?
s
1. 512MB files are mapped, but not all of that is used. Try the command `du -sh` to see how much is actually used. We map 512MB at a time and allocate from that to minimize fragmentation and limit the number of file handles. 2. How did you get the impression that all replicas are in each server? Each consuming segment may have multiple files if your allocation exceeds 512MB. So you will see files with extensions `.0`, `.1`, `.2`, etc. created as needed.
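A quick way to see the difference between the mapped (apparent) size and the space actually used — the directory below is a hypothetical example of a server's consumers folder, not a path taken from this thread:

```
# Listing shows the pre-allocated 512MB buffer files (.0, .1, ... per consuming segment)
ls -lh /path/to/server/dataDir/myTable_REALTIME/consumers/

# du reports the blocks actually written, which is usually far less than the mapped 512MB
du -sh /path/to/server/dataDir/myTable_REALTIME/consumers/*
```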
s
@Subbu Subramaniam I shared an image
from the Azure storage of one server, which is a snapshot of the consumers folder. All of these are consuming segments, and it shows all 9 consuming segments on one server. The same is the case on the second server as well.
And we had 4GB of persistent volume.
At this point, a few of the consuming segments were in error state,
and the exception from the servers indicated that there is no space on disk.
One more thing: mmap was also 4GB.
s
Then you need more disk.
s
Yes, that is a given. But the point is, while debugging we found those consumer folders have all 9 consuming segments.
s
@Subbu Subramaniam yes, they were extensions of a single consumer. Got the clarity now. Thanks a lot for all the inputs and suggestions!
s
Thanks a lot! @Subbu Subramaniam and @Mayank
To conclude this thread: by default, consuming segments would be using direct buffers, and for our setup 10 GB of direct buffers is too little, hence the OOM was coming up. Changing consuming segments to use MMAP, and making sure the secondary storage has enough space, solves the memory issue.
👍 2
s
@srisudha and @Shounak Kulkarni looks like you got your use case up and running. That is great. I have created https://github.com/apache/incubator-pinot/issues/5588 which will hopefully make the next experience better. I have a request of you. Can you please (1) add other points to the issue that you think would have made the onboarding better/easier, (2) help us with fixing the issue (the code change is minimal), and (3) write a blog about using Pinot? Thanks
s
Yes Subbu, definitely. Already started with the blog; will publish it soon 😁
m
Nice! What's the best-case QPS and throughput you were able to achieve, @srisudha? Looking forward to the blog, @Shounak Kulkarni.
s
QPS and throughput are the same as before: 5000 ingestion plus 3000 queries on the 3-server setup described above. The 95th percentile is 4.5, 3.7, 7.5 ms. There are two brokers; their 95th percentile is 11.4, 10.4 ms. Now that the OOM issue is resolved, we still have a few more resiliency test failures to debug.
And we have a few more backlog items on Pinot to keep us going.
Just to add, this is over 200 million records and about 6 hours of PT.
👍 1
s
What is PT?
s
performance testing