# troubleshooting
l
hello friends, me again with a different issue: we are executing queries in pinot and are getting the following exception at query time, not for all of them, just a few:
QueryExecutionError:
java.lang.IndexOutOfBoundsException
	at java.base/java.nio.Buffer.checkBounds(Buffer.java:714)
	at java.base/java.nio.DirectByteBuffer.get(DirectByteBuffer.java:288)
	at org.apache.pinot.segment.local.segment.index.readers.forward.VarByteChunkSVForwardIndexReader.getStringCompressed(VarByteChunkSVForwardIndexReader.java:81)
	at org.apache.pinot.segment.local.segment.index.readers.forward.VarByteChunkSVForwardIndexReader.getString(VarByteChunkSVForwardIndexReader.java:61)
query looks like this:
SELECT SUM(impression_count) as imp_count, stemmed_query FROM query_metrics WHERE user_id = xxx AND product_id = xxx AND serve_time BETWEEN 1660622400 AND 1661227199 GROUP BY stemmed_query ORDER BY impression_count LIMIT 100000
stats:
"numServersQueried": 2,
    "numServersResponded": 2,
    "numSegmentsQueried": 11,
    "numSegmentsProcessed": 10,
    "numSegmentsMatched": 10,
    "numConsumingSegmentsQueried": 1,
    "numDocsScanned": 16241,
    "numEntriesScannedInFilter": 5862,
    "numEntriesScannedPostFilter": 64964,
    "numGroupsLimitReached": false,
    "totalDocs": 77203847,
    "timeUsedMs": 133,
    "offlineThreadCpuTimeNs": 0,
    "realtimeThreadCpuTimeNs": 0,
    "offlineSystemActivitiesCpuTimeNs": 0,
    "realtimeSystemActivitiesCpuTimeNs": 0,
    "offlineResponseSerializationCpuTimeNs": 0,
    "realtimeResponseSerializationCpuTimeNs": 0,
    "offlineTotalCpuTimeNs": 0,
    "realtimeTotalCpuTimeNs": 0,
    "segmentStatistics": [],
    "traceInfo": {},
    "minConsumingFreshnessTimeMs": 1661283161852,
    "numRowsResultSet": 100
m
Hmm, are you using a no-dictionary column?
l
in the select statement? yes
these 2 are not dictionary columns
SUM(impression_count) as imp_count, stemmed_query
m
Is this a real-time table?
l
it’s a real-time table, yes
m
And do you have upsert? Trying to triage what might be causing the issue
@Jackie for any thoughts
l
no upsert
j
@Luis Fernandez Which version are you running? And what compression type did you use for the no-dictionary column?
l
running
0.10.0
I just configured
"noDictionaryColumns":
i guess default compression? Snappy?
j
Is it a dimension?
l
it’s a dimension
should be a dictionary column in that case right?
j
That is fine. The default compression type for dimensions is Snappy
Could the index for
stemmed_query
be very large?
Might be related to this issue: https://github.com/apache/pinot/issues/8701
🌟 1
l
oh i’m asking should we make the stemmed_query a dictionary column or would it cause it to fail too
but this def looks like the issue
in noDictionaryColumns we usually just leave the metrics columns that we want to aggregate so was just wondering that
What would be your recommendation?
also just to check again the exception happens in the
VarByteChunkSVForwardIndexReader
but we should change the version of the Writer? just confused cause there are different versions of the Reader too.
j
If this is indeed the issue, that means the writer wrote the corrupted data, which caused the exception on the reader side
If the
stemmed_query
is almost all unique, no dictionary should have better performance
You may try the newer version of the writer and see if it solves the problem
l
how do we configure that?
and this wouldn’t fix the existing records right? or would it
j
You may refer to this part: https://docs.pinot.apache.org/configuration-reference/table#field-config-list, and change the
rawIndexWriterVersion
to 3 or 4 (4 might not be available in the version you are running)
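For reference, a minimal sketch of the field config being described here, using the column from this thread (treat the values as illustrative, not the exact config in use):
"fieldConfigList": [
  {
    "name": "stemmed_query",
    "encodingType": "RAW",
    "properties": {
      "rawIndexWriterVersion": "3"
    }
  }
]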
r
👋 hi! wanted to follow up on this thread. i’ve tried
rawIndexWriterVersion
both 3 and 4 for our text column, but we’re still receiving the same error as originally posted
j
I believe the issue is from the
stemmed_query
column. Can you try
select DISTINCTCOUNTHLL(stemmed_query) FROM query_metrics
and see if you get the same exception?
If sharing the data is not possible, you may also try directly creating the column index using
SingleValueVarByteRawIndexCreator
then read it with
VarByteChunkSVForwardIndexReader
and see if you can reproduce the issue
r
select DISTINCTCOUNTHLL(stemmed_query) FROM query_metrics
gave the same exception, yes
j
@Jackie 👋 we can reproduce this issue and
DISTINCTCOUNTHLL
doesn't fix the problem
When I check some shops that cause this issue, their
stemmed_query
doesn't look particularly problematic
For example, I was expecting them to be extremely long or have weird characters like (
%
or
!
) but they don't...
j
DISTINCTCOUNTHLL
won't fix the problem. I asked you to try it to validate if
stemmed_query
caused the issue.
The issue could be due to the size of it
j
I think
stemmed_query
is likely the cause because when I remove that column from the query, it works well.
j
Do you want to have a quick debug session?
j
Oh you mean through video chat or something?
j
Yeah, zoom
j
ah gotcha. Let me ask the team first!
cool! How about directly from slack? And thanks!
j
Do you have the raw data available? Want to see if we can reproduce it using the low level index creator and reader
j
Yeah. we have the raw data.
If you mean raw data being the data in pinot table?
j
Yeah, I'll need all the
stemmed_query
values
j
Yep. We have all the values
j
Cool. Do you have the Pinot source code available? Want to write a small java program to debug it
How many segments do you have? We want to find the segment that is having problem
j
there are 201 segments 👀
What kind of Pinot source code do you need exactly?
We can voice maybe? That might be easier.
j
Sharing screen will be easier if that is okay
j
Yeah that should be good!
How about like in an hour? at 4PM EST? I'd like to try something else before we dig more into it..
j
Sure
thankyou 1
Sorry I'll run a little late. Will ping you when I'm ready
j
Sounds good! Thank you!
j
@Jinny Cho I can talk now
j
Awesome!
hmm it looks like there's no video chat here 😅 I'll try to set up a quick zoom
Hmm, just want to give you a heads up that we created a new table with an updated
rawIndexWriterVersion
. But it still doesn't fix the problem. 🤔
j
What is the maximum length of the
stemmed_query
?
If possible, you can try with the latest release
0.11.0
which has more error handling and the V4 raw index. If that is not possible, you may also add
deriveNumDocsPerChunkForRawIndex: true
along with the
rawIndexWriterVersion
and see if it fixes the problem
l
i think it’s configured to the default (?)
{
  "name": "stemmed_query",
  "dataType": "STRING"
}
and we added this
"fieldConfigList": [
      {
        "name": "stemmed_query",
        "encodingType": "RAW",
        "indexType": "TEXT",
        "indexTypes": [
          "TEXT"
        ],
        "properties": {
          "rawIndexWriterVersion": "3"
        }
      }
    ],
so add that under properties yes?
do we need to create the table again for that?
j
Yeah. something like this?
"fieldConfigList": [
      {
        "name": "stemmed_query",
        "encodingType": "RAW",
        "indexType": "TEXT",
        "indexTypes": [
          "TEXT"
        ],
        "properties": {
          "rawIndexWriterVersion": "4",
          "deriveNumDocsPerChunkForRawIndex": true
        }
      }
    ],
j
Oh, you didn't change the
maxLength
of the field? then it means the maximum length is 512
l
yeah that’s correct @Jackie
j
yeah. Maybe we should do that.
But IIRC when I checked some specific problematic cases, the length of the stemmed queries was far below 512.
do we need to create the table again for that?
Yeah. Have the same question. Do we need to create the table again to add
deriveNumDocsPerChunkForRawIndex
or upgrade
rawIndexWriterVersion
to V4?
j
Yes, you'll need to recreate the table to add them
Can you upgrade to the latest release? There are several changes introduced in this release, and it is a little bit hard for me to track the old code
l
"deriveNumDocsPerChunkForRawIndex": true
this we can do with rawIndexWriterVersion V3 yes?
you can always do
git checkout release-0.10.0
to check the older code
j
Yes, but then we lose the newly introduced error handling code...
"deriveNumDocsPerChunkForRawIndex": true
is available in V3 yes
🌟 1
l
another question: what happens in pinot if we try to ingest a record that’s longer than the default? does it error out or does it ingest it anyway?
j
You mean the default
maxLength
? Pinot will truncate the value to the max length and ingest it
🍷 1
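For reference, raising that limit is done per column in the schema's dimensionFieldSpecs; a minimal sketch, extending the field spec shown earlier in this thread (the 2048 here is just an example value):
{
  "name": "stemmed_query",
  "dataType": "STRING",
  "maxLength": 2048
}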
j
cool
j
Then we can directly try V4. V4 doesn't require
deriveNumDocsPerChunkForRawIndex
j
Nice..! Then I'll try V4 now.
l
seems like we are getting the following error:
[
  {
    "message": "QueryExecutionError:\nnet.jpountz.lz4.LZ4Exception: Error decoding offset 588892 of input buffer\n\tat net.jpountz.lz4.LZ4JNIFastDecompressor.decompress(LZ4JNIFastDecompressor.java:70)\n\tat net.jpountz.lz4.LZ4DecompressorWithLength.decompress(LZ4DecompressorWithLength.java:145)\n\tat org.apache.pinot.segment.local.io.compression.LZ4WithLengthDecompressor.decompress(LZ4WithLengthDecompressor.java:44)\n\tat org.apache.pinot.segment.local.segment.index.readers.forward.VarByteChunkSVForwardIndexReaderV4$CompressedReaderContext.processChunkAndReadFirstValue(VarByteChunkSVForwardIndexReaderV4.java:241)",
    "errorCode": 200
  },
  {
    "message": "QueryExecutionError:\nnet.jpountz.lz4.LZ4Exception: Error decoding offset 529140 of input buffer\n\tat net.jpountz.lz4.LZ4JNIFastDecompressor.decompress(LZ4JNIFastDecompressor.java:70)\n\tat net.jpountz.lz4.LZ4DecompressorWithLength.decompress(LZ4DecompressorWithLength.java:145)\n\tat org.apache.pinot.segment.local.io.compression.LZ4WithLengthDecompressor.decompress(LZ4WithLengthDecompressor.java:44)\n\tat org.apache.pinot.segment.local.segment.index.readers.forward.VarByteChunkSVForwardIndexReaderV4$CompressedReaderContext.processChunkAndReadFirstValue(VarByteChunkSVForwardIndexReaderV4.java:241)",
    "errorCode": 200
  }
]
do you have any clue as to why this may be?
we are also getting this:
[
  {
    "message": "QueryExecutionError:\njava.lang.IllegalArgumentException: newPosition > limit: (540794 > 528446)\n\tat java.base/java.nio.Buffer.createPositionException(Buffer.java:318)\n\tat java.base/java.nio.Buffer.position(Buffer.java:293)\n\tat java.base/java.nio.ByteBuffer.position(ByteBuffer.java:1094)\n\tat java.base/java.nio.MappedByteBuffer.position(MappedByteBuffer.java:226)",
    "errorCode": 200
  }
]
j
Interesting... So the issue might be from the LZ4 compression
👀 1
Can you try switching it to SNAPPY?
j
ah instead of murmur?
l
instead of raw
m
LZ4
j
You may add
"compressionCodec": "SNAPPY"
to the field config
j
gotcha! yeah we can try that
just to be clear, it probably requires recreating the whole table?
j
You may either wipe the current table, or create a new one. For real-time table, dropping the existing one and creating a new one might be simpler
j
Hi team. It took some time to propagate the full data. When we tried V4 & Snappy encoding, we still got the same decoding problem (got the following error messages):
[
  {
    "message": "QueryExecutionError:\nnet.jpountz.lz4.LZ4Exception: Error decoding offset 123239 of input buffer\n\tat net.jpountz.lz4.LZ4JNIFastDecompressor.decompress(LZ4JNIFastDecompressor.java:70)\n\tat net.jpountz.lz4.LZ4DecompressorWithLength.decompress(LZ4DecompressorWithLength.java:145)\n\tat org.apache.pinot.segment.local.io.compression.LZ4WithLengthDecompressor.decompress(LZ4WithLengthDecompressor.java:44)\n\tat org.apache.pinot.segment.local.segment.index.readers.forward.VarByteChunkSVForwardIndexReaderV4$CompressedReaderContext.processChunkAndReadFirstValue(VarByteChunkSVForwardIndexReaderV4.java:241)",
    "errorCode": 200
  },
  {
    "message": "java.net.UnknownHostException: pinot-server-0.pinot-server-headless.pinot.svc.cluster.local: Name or service not known\n\tat java.base/java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method)\n\tat java.base/java.net.InetAddress$PlatformNameService.lookupAllHostAddr(InetAddress.java:929)\n\tat java.base/java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1519)\n\tat java.base/java.net.InetAddress$NameServiceAddresses.get(InetAddress.java:848)",
    "errorCode": 425
  },
  {
    "message": "1 servers [pinot-server-0_R] not responded",
    "errorCode": 427
  }
]
Should we consider SNAPPY decoding as well? At least snappy encoding didn't help this problem.
(We also tried V3 + deriveNumDocsPerChunkForRawIndex = true, and it didn't work)
l
hey friends, just bumping this thread, we are still facing this issue. it’s not many errors, but the clients that get this error cannot load the UI properly. we have tried a number of things but nothing has worked out so far. Anyone have an idea how we can fix this?
[
  {
    "message": "QueryExecutionError:\nnet.jpountz.lz4.LZ4Exception: Error decoding offset 123239 of input buffer\n\tat net.jpountz.lz4.LZ4JNIFastDecompressor.decompress(LZ4JNIFastDecompressor.java:70)\n\tat net.jpountz.lz4.LZ4DecompressorWithLength.decompress(LZ4DecompressorWithLength.java:145)\n\tat org.apache.pinot.segment.local.io.compression.LZ4WithLengthDecompressor.decompress(LZ4WithLengthDecompressor.java:44)\n\tat org.apache.pinot.segment.local.segment.index.readers.forward.VarByteChunkSVForwardIndexReaderV4$CompressedReaderContext.processChunkAndReadFirstValue(VarByteChunkSVForwardIndexReaderV4.java:241)",
    "errorCode": 200
  },
  {
    "message": "java.net.UnknownHostException: pinot-server-0.pinot-server-headless.pinot.svc.cluster.local: Name or service not known\n\tat java.base/java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method)\n\tat java.base/java.net.InetAddress$PlatformNameService.lookupAllHostAddr(InetAddress.java:929)\n\tat java.base/java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1519)\n\tat java.base/java.net.InetAddress$NameServiceAddresses.get(InetAddress.java:848)",
    "errorCode": 425
  },
  {
    "message": "1 servers [pinot-server-0_R] not responded",
    "errorCode": 427
  }
]
m
Did you try snappy?
l
we tried but apparently we still saw that exception ^ which is weird cause it still says LZ4
should we try Snappy with either of the versions? V3/V4?
m
If it says LZ4, then it is likely LZ4. Try v3
j
2 follow-up questions:
• Is there anything that we need to be aware of before using SNAPPY encoding? I read that it could be slower than LZ4?
• How do we check what encoding we use? I think it’s possible that V3 silently uses LZ4 even if we set it to use Snappy?
l
[
  {
    "message": "QueryExecutionError:\njava.lang.IndexOutOfBoundsException\n\tat java.base/java.nio.Buffer.checkBounds(Buffer.java:714)\n\tat java.base/java.nio.DirectByteBuffer.get(DirectByteBuffer.java:288)\n\tat org.apache.pinot.segment.local.segment.index.readers.forward.VarByteChunkSVForwardIndexReader.getStringCompressed(VarByteChunkSVForwardIndexReader.java:81)\n\tat org.apache.pinot.segment.local.segment.index.readers.forward.VarByteChunkSVForwardIndexReader.getString(VarByteChunkSVForwardIndexReader.java:61)",
    "errorCode": 200
  }
]
with v3 + SNAPPY
different exception
m
@Jackie ^^
l
w v4 and snappy I see this:
[
  {
    "message": "QueryExecutionError:\nProcessingException(errorCode:450, message:InternalError:\njava.lang.NullPointerException\n\tat org.apache.pinot.core.operator.combine.GroupByOrderByCombineOperator.mergeResults(GroupByOrderByCombineOperator.java:236)\n\tat org.apache.pinot.core.operator.combine.BaseCombineOperator.getNextBlock(BaseCombineOperator.java:119)\n\tat org.apache.pinot.core.operator.combine.BaseCombineOperator.getNextBlock(BaseCombineOperator.java:50)",
    "errorCode": 200
  }
]
and this for any query
[
  {
    "message": "QueryExecutionError:\njava.lang.IllegalArgumentException: newPosition > limit: (511477 > 505284)\n\tat java.base/java.nio.Buffer.createPositionException(Buffer.java:318)\n\tat java.base/java.nio.Buffer.position(Buffer.java:293)\n\tat java.base/java.nio.ByteBuffer.position(ByteBuffer.java:1094)\n\tat java.base/java.nio.MappedByteBuffer.position(MappedByteBuffer.java:226)",
    "errorCode": 200
  }
]
j
Can you check the server log and see if there are related
ERROR
logs?
Also, can you share some example value of
stemmed_query
if possible?
l
what do you want me to try with first
v3, v4 and SNAPPY (?)
which version
will try with v3 first.
"fieldConfigList": [{
        "name": "stemmed_query",
        "encodingType": "RAW",
        "indexType": "TEXT",
        "compressionCodec": "SNAPPY",
        "indexTypes": [
          "TEXT"
        ],
        "properties": {
          "rawIndexWriterVersion": "3"
        }
      }],
example of stemmed_queries:
glow in the dark starbucks tumbler
square foam block 3 inch
kemono fursuit eyes
minnie mouse 2nd birthday decoration
in memory gifts mom
custom signs for home
unicorn design bundle
polo shirt men vintage
hochzeits gastgeschenke
8x10 birthday cards
as shown above this is what we are seeing with v3 + SNAPPY
[
  {
    "message": "QueryExecutionError:\njava.lang.IndexOutOfBoundsException\n\tat java.base/java.nio.Buffer.checkBounds(Buffer.java:714)\n\tat java.base/java.nio.DirectByteBuffer.get(DirectByteBuffer.java:288)\n\tat org.apache.pinot.segment.local.segment.index.readers.forward.VarByteChunkSVForwardIndexReader.getStringCompressed(VarByteChunkSVForwardIndexReader.java:81)\n\tat org.apache.pinot.segment.local.segment.index.readers.forward.VarByteChunkSVForwardIndexReader.getString(VarByteChunkSVForwardIndexReader.java:61)",
    "errorCode": 200
  }
]
in the server we see the following:
Caught exception while processing query: QueryContext{_tableName='product_query_metrics_REALTIME', _selectExpressions=[sum(impression_count), sum(click_count), sum(order_count), stemmed_query], _aliasList=[impression_count, click_count, order_count, null], _filter=(user_id = '123123' AND product_id = '123123123' AND serve_time BETWEEN '1663214400' AND '1663819199'), _groupByExpressions=[stemmed_query], _havingFilter=null, _orderByExpressions=[sum(impression_count) ASC], _limit=100000, _offset=0, _queryOptions={responseFormat=sql, groupByMode=sql, timeoutMs=9999}, _debugOptions=null, _brokerRequest=BrokerRequest(querySource:QuerySource(tableName:product_query_metrics_REALTIME), pinotQuery:PinotQuery(dataSource:DataSource(tableName:product_query_metrics_REALTIME), selectList:[Expression(type:FUNCTION, functionCall:Function(operator:AS, operands:[Expression(type:FUNCTION, functionCall:Function(operator:SUM, operands:[Expression(type:IDENTIFIER, identifier:Identifier(name:impression_count))])), Expression(type:IDENTIFIER, identifier:Identifier(name:impression_count))])), Expression(type:FUNCTION, functionCall:Function(operator:AS, operands:[Expression(type:FUNCTION, functionCall:Function(operator:SUM, operands:[Expression(type:IDENTIFIER, identifier:Identifier(name:click_count))])), Expression(type:IDENTIFIER, identifier:Identifier(name:click_count))])), Expression(type:FUNCTION, functionCall:Function(operator:AS, operands:[Expression(type:FUNCTION, functionCall:Function(operator:SUM, operands:[Expression(type:IDENTIFIER, identifier:Identifier(name:order_count))])), Expression(type:IDENTIFIER, identifier:Identifier(name:order_count))])), Expression(type:IDENTIFIER, identifier:Identifier(name:stemmed_query))], filterExpression:Expression(type:FUNCTION, functionCall:Function(operator:AND, operands:[Expression(type:FUNCTION, functionCall:Function(operator:EQUALS, operands:[Expression(type:IDENTIFIER, identifier:Identifier(name:user_id)), Expression(type:LITERAL, literal:<Literal longValue:32466758>)])), Expression(type:FUNCTION, functionCall:Function(operator:EQUALS, operands:[Expression(type:IDENTIFIER, identifier:Identifier(name:product_id)), Expression(type:LITERAL, literal:<Literal longValue:1261981010>)])), Expression(type:FUNCTION, functionCall:Function(operator:BETWEEN, operands:[Expression(type:IDENTIFIER, identifier:Identifier(name:serve_time)), Expression(type:LITERAL, literal:<Literal longValue:1663214400>), Expression(type:LITERAL, literal:<Literal longValue:1663819199>)]))])), groupByList:[Expression(type:IDENTIFIER, identifier:Identifier(name:stemmed_query))], orderByList:[Expression(type:FUNCTION, functionCall:Function(operator:ASC, operands:[Expression(type:FUNCTION, functionCall:Function(operator:SUM, operands:[Expression(type:IDENTIFIER, identifier:Identifier(name:impression_count))]))]))], limit:100000, queryOptions:{responseFormat=sql, groupByMode=sql, timeoutMs=9999}))}
trace:
java.lang.IndexOutOfBoundsException: null
	at java.nio.Buffer.checkBounds(Buffer.java:714) ~[?:?]
	at java.nio.DirectByteBuffer.get(DirectByteBuffer.java:288) ~[?:?]
	at org.apache.pinot.segment.local.segment.index.readers.forward.VarByteChunkSVForwardIndexReader.getStringCompressed(VarByteChunkSVForwardIndexReader.java:81) ~[pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
	at org.apache.pinot.segment.local.segment.index.readers.forward.VarByteChunkSVForwardIndexReader.getString(VarByteChunkSVForwardIndexReader.java:61) ~[pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
	at org.apache.pinot.segment.local.segment.index.readers.forward.VarByteChunkSVForwardIndexReader.getString(VarByteChunkSVForwardIndexReader.java:35) ~[pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
	at org.apache.pinot.core.common.DataFetcher$ColumnValueReader.readStringValues(DataFetcher.java:515) ~[pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
	at org.apache.pinot.core.common.DataFetcher.fetchStringValues(DataFetcher.java:204) ~[pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
	at org.apache.pinot.core.common.DataBlockCache.getStringValuesForSVColumn(DataBlockCache.java:243) ~[pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
	at org.apache.pinot.core.operator.docvalsets.ProjectionBlockValSet.getStringValuesSV(ProjectionBlockValSet.java:94) ~[pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
	at org.apache.pinot.core.query.aggregation.groupby.NoDictionarySingleColumnGroupKeyGenerator.generateKeysForBlock(NoDictionarySingleColumnGroupKeyGenerator.java:100) ~[pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
	at org.apache.pinot.core.query.aggregation.groupby.DefaultGroupByExecutor.process(DefaultGroupByExecutor.java:123) ~[pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
	at org.apache.pinot.core.operator.query.AggregationGroupByOrderByOperator.getNextBlock(AggregationGroupByOrderByOperator.java:109) ~[pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
	at org.apache.pinot.core.operator.query.AggregationGroupByOrderByOperator.getNextBlock(AggregationGroupByOrderByOperator.java:46) ~[pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
	at org.apache.pinot.core.operator.BaseOperator.nextBlock(BaseOperator.java:49) ~[pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
	at org.apache.pinot.core.operator.combine.GroupByOrderByCombineOperator.processSegments(GroupByOrderByCombineOperator.java:137) ~[pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
	at org.apache.pinot.core.operator.combine.BaseCombineOperator$1.runJob(BaseCombineOperator.java:100) [pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
	at org.apache.pinot.core.util.trace.TraceRunnable.run(TraceRunnable.java:40) [pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
	at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
	at shaded.com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:111) [pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
	at shaded.com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:58) [pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
	at shaded.com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:75) [pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
	at java.lang.Thread.run(Thread.java:829) [?:?]
r
it looks like there might just be an off by one bug somewhere in the V4 index. I wrote it so I'll try to find some time to take a look.
l
just to clarify yes we are having issues with V4 Index as well, but this is V3 + SNAPPY
r
yes that's what I saw sent to the channel
🙏 1
I will add some more testing for V4 to isolate any issues in its implementation
l
do you have any idea as to what may be happening in general 😄 i just want to understand the issue
r
how large is the data? If it's >= 4GB it will overflow
which will cause weird stuff like this
l
like the index you mean? i can get that from the segment information if I ssh into the server yes?
r
yes, but i don't recall if there are any logs or metrics which will tell you how large the column is
l
there’s this index_map file but i don’t think that’s what you need is it? the size of the index?
r
the fact that you have problems either way makes me think it's not particular to any one index implementation. one thing V2 and V4 have in common is size limits, though V3 should be effectively unlimited
j
The index size is actually small (less than 100MB), so it shouldn't be an overflow
I suspect there is some special value in the stemmed_query column that triggers the bug
l
does that stacktrace tell you anything at all? on the server w v3
select DISTINCT stemmed_query from etsyads_listing_query_metrics limit 100000
yeah a query like this triggers it
r
you mean triggers a bug in both compression libraries?
l
we get different errors depending on the writer we use, right now that query triggers the bug for v3, with snappy compressioncodec
j
Might not be the compression library, but how pinot handles the value write/read
r
it's interesting that it's similar across several implementations too
j
We have tried a lot of combinations of version + compression, and all of them run into issues, which makes me think it might be some common code bug
r
I got involved thinking this was a V4 bug, which I would have been happy to diagnose and fix, but it feels like it's a bigger issue than that and might even need access to data under NDA or something... Please ping me if help is needed with V4
l
tryna think how can i catch the faulty stemmed_query that’s causing this, i guess i can find a way to query the topic itself with ksqlDB, cause all the data is in the topic…
🤔 1
j
Thanks Richard! Will let you know if we find something
l
thank you Richard 🙏
r
perhaps you can do a binary search on another column to narrow it down, e.g. if you have numeric identifiers add a filter
where x > mid_value
and if it doesn't error, check the lower quarters of the range and so on
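For reference, a sketch of that bisection using the table and columns from the query at the top of this thread (the mid_value of 50000000 is purely illustrative; if the query succeeds, the corrupted row is in the other half of the range, so halve again):
SELECT DISTINCT stemmed_query
FROM query_metrics
WHERE user_id > 50000000
LIMIT 100000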
j
Good point. To narrow it down, we can actually use the virtual column
$segmentName
and
$docId
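For example, something along these lines could narrow it down to a document range within the segments that fail; $segmentName and $docId are Pinot's built-in virtual columns, and the doc-id range here is just a placeholder to bisect over:
SELECT $segmentName, $docId, stemmed_query
FROM query_metrics
WHERE $docId BETWEEN 0 AND 50000
LIMIT 100000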
l
noob question… what’s the docId
j
The document id (row id) within the segment
So you can filter out any document
l
is that like the order in which it was ingested (?) and like an auto increment?
and can i order by it or use it in where clauses (?)
j
Let's have a zoom chat so we can run some queries together?
@Richard We found the root cause, which is actually fixed in 0.11.0 with this PR: https://github.com/apache/pinot/pull/9059
🌟 1
r
Pleased to hear it!
l
thank you so much @Jackie for all your support 🙏