# troubleshooting
r
hello! we’re trying to batch ingest segments into our pinot instance, but we are finding that some segments are in a bad state. the stack trace we see from the debug/tables/{tablename} endpoint is like so:
java.lang.IllegalArgumentException: newLimit > capacity: (604 > 28)
    at java.base/java.nio.Buffer.createLimitException(Buffer.java:372)
    at java.base/java.nio.Buffer.limit(Buffer.java:346)
    at java.base/java.nio.ByteBuffer.limit(ByteBuffer.java:1107)
    at java.base/java.nio.MappedByteBuffer.limit(MappedByteBuffer.java:235)
    at java.base/java.nio.MappedByteBuffer.limit(MappedByteBuffer.java:67)
    at org.apache.pinot.segment.spi.memory.PinotByteBuffer.view(PinotByteBuffer.java:303)
    at org.apache.pinot.segment.spi.memory.PinotDataBuffer.view(PinotDataBuffer.java:379)
    at org.apache.pinot.segment.local.segment.index.readers.forward.BaseChunkSVForwardIndexReader.<init>(BaseChunkSVForwardIndexReader.java:97)
    at org.apache.pinot.segment.local.segment.index.readers.forward.FixedByteChunkSVForwardIndexReader.<init>(FixedByteChunkSVForwardIndexReader.java:37)
    at org.apache.pinot.segment.local.segment.index.readers.DefaultIndexReaderProvider.newForwardIndexReader(DefaultIndexReaderProvider.java:97)
    at org.apache.pinot.segment.spi.index.IndexingOverrides$Default.newForwardIndexReader(IndexingOverrides.java:184)
    at org.apache.pinot.segment.local.segment.index.column.PhysicalColumnIndexContainer.<init>(PhysicalColumnIndexContainer.java:166)
    at org.apache.pinot.segment.local.indexsegment.immutable.ImmutableSegmentLoader.load(ImmutableSegmentLoader.java:181)
    at org.apache.pinot.segment.local.indexsegment.immutable.ImmutableSegmentLoader.load(ImmutableSegmentLoader.java:121)
    at org.apache.pinot.segment.local.indexsegment.immutable.ImmutableSegmentLoader.load(ImmutableSegmentLoader.java:91)
    at org.apache.pinot.core.data.manager.offline.OfflineTableDataManager.addSegment(OfflineTableDataManager.java:52)
    at org.apache.pinot.core.data.manager.BaseTableDataManager.addOrReplaceSegment(BaseTableDataManager.java:373)
    at org.apache.pinot.server.starter.helix.HelixInstanceDataManager.addOrReplaceSegment(HelixInstanceDataManager.java:355)
    at org.apache.pinot.server.starter.helix.SegmentOnlineOfflineStateModelFactory$SegmentOnlineOfflineStateModel.onBecomeOnlineFromOffline(SegmentOnlineOfflineStateModelFactory.java:162)
    at jdk.internal.reflect.GeneratedMethodAccessor9.invoke(Unknown Source)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:566)
    at org.apache.helix.messaging.handling.HelixStateTransitionHandler.invoke(HelixStateTransitionHandler.java:404)
    at org.apache.helix.messaging.handling.HelixStateTransitionHandler.handleMessage(HelixStateTransitionHandler.java:331)
    at org.apache.helix.messaging.handling.HelixTask.call(HelixTask.java:97)
    at org.apache.helix.messaging.handling.HelixTask.call(HelixTask.java:49)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)
@Luis Fernandez and I were wondering what this capacity value (28, according to the trace) might be? thanks!
r
hi, looks like an integer overflow
the default raw forward index format is v2, which only supports 2GB per column
you can try v3 or v4, which support larger sizes
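(to illustrate both points, a small standalone java sketch, not pinot's actual code: the capacity in the exception is simply the capacity of the underlying java.nio buffer the reader tries to take a view over, and the 2GB-per-column ceiling is the kind of limit you hit when sizes/offsets are tracked in a signed 32-bit int:)

import java.nio.ByteBuffer;

public class LimitVsOverflowDemo {
    public static void main(String[] args) {
        // 1) The exception in the stack trace: asking a buffer for a view that
        //    extends past its capacity. Here capacity is 28 and newLimit is 604,
        //    mirroring "newLimit > capacity: (604 > 28)".
        ByteBuffer buf = ByteBuffer.allocate(28);
        try {
            buf.limit(604); // throws IllegalArgumentException: newLimit > capacity
        } catch (IllegalArgumentException e) {
            System.out.println(e);
        }

        // 2) The classic reason for moving to the v3/v4 raw writer: a size or offset
        //    kept in a signed 32-bit int wraps negative once a single column's raw
        //    data exceeds 2GB.
        long columnBytes = 3L * 1024 * 1024 * 1024; // 3GB of raw column data
        int overflowedOffset = (int) columnBytes;   // wraps to a negative value
        System.out.println("offset as int: " + overflowedOffset);
    }
}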
l
how do we do that 😄
and also, what does it mean? does it mean one of the values of the columns in noDictionaryColumns is just too big?
s
Is this column configured as a noDictionaryColumn?
You can configure v3 as follows:
"fieldConfigList": [
            {
                "encodingType": "RAW",
                "name": "columnName",
                "properties": {
                    "deriveNumDocsPerChunkForRawIndex": "true",
                    "rawIndexWriterVersion": "3"
                }
            }
        ]
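(note that rawIndexWriterVersion only changes how segments are written, so it applies to segments generated after the config change; previously built segments would need to be regenerated and re-pushed to pick up the new format.)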
l
yes, it's most of them actually
these columns are just counts
"noDictionaryColumns": [
        "click_count",
        "order_count",
        "impression_count",
        "cost",
        "revenue"
      ],
s
Also, make sure to add the column to the noDictionaryColumns list in the indexingConfig section of the table config
"noDictionaryColumns": [
                "columnName"
            ],
ideally it shouldn't be needed in both places, but yeah, config cleanup is needed
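(for reference, a rough sketch of how the two settings sit together in a single table config, assuming the usual layout where fieldConfigList is a top-level section and noDictionaryColumns lives under tableIndexConfig, i.e. the indexing config section; the table name, table type, and columnName are placeholders and other sections of the config are omitted:)

{
  "tableName": "myTable",
  "tableType": "OFFLINE",
  "tableIndexConfig": {
    "noDictionaryColumns": [
      "columnName"
    ]
  },
  "fieldConfigList": [
    {
      "name": "columnName",
      "encodingType": "RAW",
      "properties": {
        "deriveNumDocsPerChunkForRawIndex": "true",
        "rawIndexWriterVersion": "3"
      }
    }
  ]
}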
I think you just need to set up fieldConfigList then
What is the type of this column?
l
type is int
for all those columns
cool, thank you. we're just trying to understand what in particular caused that exception, since it's a new one to us
s
The v3 format especially was introduced because we were hitting the 2GB limit on STRING type columns. Since you are hitting this on an INT column, it possibly means you have ~500 million rows in a single segment?
which may not necessarily be optimal
btw, v3 will work for both fixed and variable width.. I am just curious why there would be a need to use it on INT / fixed-width columns
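(rough arithmetic behind the ~500 million figure, as a back-of-the-envelope sketch: a raw INT forward index holds 4 bytes per row, so a 2GB-per-column ceiling works out to about 2^31 / 4 ≈ 536M rows in one segment:)

public class RawIndexLimitEstimate {
    public static void main(String[] args) {
        long limitBytes = 1L << 31;        // ~2GB per-column limit of the v2 raw writer
        int bytesPerValue = Integer.BYTES; // 4 bytes per row for an INT column
        long maxRows = limitBytes / bytesPerValue;
        System.out.println("rows of INT data needed to hit 2GB: " + maxRows); // 536,870,912
    }
}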
cc @Richard Startin
r
👀 i think we're seeing that the number of rows in our segments is generally around 200k; i'd be pretty surprised if one segment had >500 million rows
r
are any of these multi-value?
l
none of them
s
seems like a different problem to me then
in fact, the problem is happening during read / segment load, which potentially implies there is no need to bump the version from v2 to v3: if it were the 2GB overflow, segment generation should have failed in the first place, since the overflow would have resulted in a negative capacity (at least that's what I have seen in the past whenever there was a need to go from v2 to v3)
r
I will look into this on Monday
r
hi! following up on this: we changed our ingested files to be Parquet files (instead of JSONL), and we are no longer experiencing those intermittent errors ingesting segments 👍
r
thanks for the feedback on this. I wasn't able to reproduce
did you have a JSON index by any chance?
r
nope!