# troubleshooting
a
Hi folks, I am facing an issue with the sorted index on an offline table. The sortedColumn is set to user_id (String), and the input CSV file is physically sorted on this column.
"sortedColumn": [
        "user_id"
      ]
The metadata shows the column as sorted when the segment has fewer than 2.5M rows (180 MB). When the row count is increased to 3M (220 MB), the metadata shows the isSorted value as false. I am using
org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner
and
jobType: SegmentCreationAndTarPush
for creating and pushing the segments. The segment creation job logs show that it is creating a dictionary (the default) for each column. Is there a limit on the number of rows or the segment size when using a sorted index? Could this be related to the cardinality of the user_id column?
m
Hello, for offline tables, the input is expected to be sorted outside of Pinot.
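Since the input has to be sorted before it reaches Pinot, it can help to verify the file first. Below is a minimal sketch of such a check, assuming that plain lexicographic string comparison is the ordering that matters for a String column. The helper name `first_unsorted_row` and the sample data are hypothetical, and Python compares strings by Unicode code point, which may differ from Java's ordering for characters outside the Basic Multilingual Plane.

```python
import csv
import io


def first_unsorted_row(rows, column):
    """Return the 1-based index of the first data row whose value in
    `column` is smaller than the previous row's value, or None if the
    rows are in non-decreasing (string) order."""
    previous = None
    for index, row in enumerate(rows, start=1):
        value = row[column]
        if previous is not None and value < previous:
            return index
        previous = value
    return None


# Hypothetical sample data: sorted as strings ("u10" < "u2"), not as numbers.
sorted_csv = io.StringIO("user_id,score\nu10,1\nu2,5\nu3,7\n")
# Hypothetical sample data: numerically ordered, but NOT sorted as strings.
unsorted_csv = io.StringIO("user_id,score\nu2,5\nu10,1\nu3,7\n")

sorted_result = first_unsorted_row(csv.DictReader(sorted_csv), "user_id")
unsorted_result = first_unsorted_row(csv.DictReader(unsorted_csv), "user_id")

print(sorted_result)
print(unsorted_result)
```

A file that looks sorted to a human (for example, IDs in numeric order) can fail a string comparison, so a check like this on the full input file may be worth running before generating segments.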
a
Thanks, got it! From the docs, a sorted index can be applied to only one column. Is there a way to indicate that the data is sorted on a group of columns? E.g., data sorted on columns A and B: first on A, then on B within the same value of A.
m
Pinot does not have support for secondary sorting right now. Typically, one sorted index combined with an inverted index can already improve performance a lot. We have also recently added a highly performant range index.
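As a rough illustration of combining one sorted column with other index types, here is a sketch that builds a `tableIndexConfig` fragment and prints it as JSON. The `user_id` column comes from this thread. The `country` and `amount` columns are hypothetical, and the exact set of keys supported should be checked against the table config documentation for your Pinot version.

```python
import json

# Sketch of a tableIndexConfig fragment: one sorted column, plus
# inverted and range indexes on other (hypothetical) columns.
table_index_config = {
    "sortedColumn": ["user_id"],          # only one sorted column is supported
    "invertedIndexColumns": ["country"],  # hypothetical column name
    "rangeIndexColumns": ["amount"],      # hypothetical column name
}

config_json = json.dumps({"tableIndexConfig": table_index_config}, indent=2)
print(config_json)
```

The idea is that the second sort key from the question (column B) would get an inverted or range index instead of a secondary sort.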
w
@Mayank just curious: the realtime table has a timeCol. Is it equivalent to the sortedCol here?
m
No, you should specify a column other than the time column as the sorted column. The time column can have a range index if needed. There is also time metadata that is used to prune data so that it is not scanned.