can we apply inverted indexing or range indexing to existing Apache Pinot #troubleshooting

Sadim Nadeem

08/20/2021, 6:18 AM

Can we apply inverted indexing or range indexing to existing columns of an old table with tens of millions of records? What impact will it have on table performance, and should I expect query latency to improve for queries on old data as well?

Xiang Fu

08/20/2021, 6:20 AM

it will improve queries that have predicates on the columns you index

Xiang Fu

08/20/2021, 6:20 AM

you can apply it on old tables then reload all the segments
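
A minimal sketch of the step described above, assuming the usual `tableIndexConfig` keys (`invertedIndexColumns`, `rangeIndexColumns`) and a controller at a hypothetical `http://localhost:9000`. The table and column names are invented. The helper only edits the config in memory and builds the request targets; sending them is left to whichever HTTP client you use.

```python
import copy

def add_indexes(table_config, inverted=(), ranged=()):
    """Return a copy of table_config with extra index columns added."""
    cfg = copy.deepcopy(table_config)
    index_cfg = cfg.setdefault("tableIndexConfig", {})
    for key, cols in (("invertedIndexColumns", inverted),
                      ("rangeIndexColumns", ranged)):
        existing = index_cfg.get(key) or []
        # Append only columns that are not already listed.
        index_cfg[key] = existing + [c for c in cols if c not in existing]
    return cfg

# Hypothetical table and columns, for illustration only.
old_config = {
    "tableName": "events",
    "tableIndexConfig": {"invertedIndexColumns": ["tenantId"]},
}
new_config = add_indexes(old_config, inverted=["deviceId"], ranged=["latencyMs"])

controller = "http://localhost:9000"  # assumption: local controller
table = new_config["tableName"]
# Step 1: push the updated table config. Step 2: reload all segments.
update_request = ("PUT", f"{controller}/tables/{table}")
reload_request = ("POST", f"{controller}/segments/{table}/reload")
```

Existing segments pick up the new indexes only after the reload; the config change on its own affects newly created segments.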

Sadim Nadeem

08/20/2021, 6:21 AM

what impact will it have on heap memory

Xiang Fu

08/20/2021, 6:21 AM

there will be extra disk space overhead

Sadim Nadeem

08/20/2021, 6:21 AM

will I need to increase heap memory, since the index will be added separately as a bitmap in heap memory

Sadim Nadeem

08/20/2021, 6:21 AM

will it be stored in ram or ssd

Xiang Fu

08/20/2021, 6:22 AM

indexes are on ssd; at runtime pinot will load the data/indexes

Sadim Nadeem

08/20/2021, 6:23 AM

ok once processed .. they will be written on ssd and gc will clear the heap eventually for hot segments which are being consumed?

Xiang Fu

08/20/2021, 6:23 AM

right

Xiang Fu

08/20/2021, 6:23 AM

it will be off-heap

👍 1

Sadim Nadeem

08/20/2021, 6:23 AM

how difficult is it to apply lucene fst indexing for regexp queries

Sadim Nadeem

08/20/2021, 6:24 AM

I mean, do I only need to apply the indexing on the string columns, or is anything else needed as well

Xiang Fu

08/20/2021, 6:24 AM

yes

Xiang Fu

08/20/2021, 6:24 AM

apply index, then it will improve the text search query

Xiang Fu

08/20/2021, 6:24 AM

but you pay extra disk space for index

✔️ 1

Sadim Nadeem

08/20/2021, 6:24 AM

what is the preferred or recommended indexing for substring while doing group by and filter .. lucene fst?

Xiang Fu

08/20/2021, 6:25 AM

lucene will help on filter not group by

Sadim Nadeem

08/20/2021, 6:25 AM

ok filter will prune a lot of segments / records

Sadim Nadeem

08/20/2021, 6:25 AM

that will also improve latency drastically i guess

Sadim Nadeem

08/20/2021, 6:26 AM

since we are also doing filter with substring on a string column

Sadim Nadeem

08/20/2021, 6:26 AM

so lucene fst on that string column where we are doing substring while filter will help?

Sadim Nadeem

08/20/2021, 6:28 AM

"apply index, then it will improve the text search query" => so regexp or substring query performance won't improve if I use lucene fst, and it will only improve if I use the text_match clause?

Sadim Nadeem

08/20/2021, 6:29 AM

cc: @Mayank @Jackie @Subbu Subramaniam @Kishore G @Neha Pawar @Ken Krugler @Daniel Lavoie

Sadim Nadeem

08/20/2021, 6:32 AM

cc: @Mohamed Kashifuddin @Shailesh Jha @Mohamed Hussain

Xiang Fu

08/20/2021, 6:54 AM

it will improve the text_match

Xiang Fu

08/20/2021, 6:54 AM

substring or regex won't benefit

✔️ 1
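
For reference, a sketch of how a text index is commonly declared and queried, assuming the `fieldConfigList` entry format (`encodingType` `RAW`, `indexType` `TEXT`) together with a `noDictionaryColumns` entry, and a hypothetical column named `message`. Going by the answer above, only a `TEXT_MATCH` filter would use this index; a substring or regex filter on the same column would not.

```python
import json

TEXT_COLUMN = "message"  # hypothetical column name

# Table-config fragment enabling a text index on TEXT_COLUMN (assumed format).
text_index_fragment = {
    "fieldConfigList": [
        {"name": TEXT_COLUMN, "encodingType": "RAW", "indexType": "TEXT"}
    ],
    "tableIndexConfig": {"noDictionaryColumns": [TEXT_COLUMN]},
}

def text_match_query(table, column, term):
    """Build a query whose filter is a TEXT_MATCH on the given column."""
    escaped = term.replace("'", "''")  # escape single quotes for SQL
    return f"SELECT COUNT(*) FROM {table} WHERE TEXT_MATCH({column}, '{escaped}')"

print(json.dumps(text_index_fragment, indent=2))
print(text_match_query("events", TEXT_COLUMN, "timeout"))
```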

Sadim Nadeem

08/20/2021, 7:01 AM

if the time column is the primary time column in the schema and is stored as a long (epoch milliseconds, with granularity in seconds), is it necessary to add range indexing on it when all queries filter on time in the where clause, say last 5 min, last 1 hour or last 30 days @Xiang Fu? I mean, I guess segments already have a start time and end time by default when it is the primary time column, so will segment pruning happen by default, or is range indexing on the time column needed?

Xiang Fu

08/20/2021, 7:10 AM

segment prune will take care of start/end time if it’s in your predicate

Sadim Nadeem

08/20/2021, 7:10 AM

so range indexing needed ?

Sadim Nadeem

08/20/2021, 7:10 AM

ok got it

Xiang Fu

08/20/2021, 7:11 AM

it will help, but since your segments are already ordered by time, I feel it's not necessary to add the index

✔️ 1
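
The pruning mentioned above can be pictured like this: each segment records the smallest and largest value of its time column, and a segment is skipped when that interval cannot overlap the query's time filter. This is a plain-Python illustration of the idea, not Pinot's actual pruner; the segment names and timestamps are invented.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SegmentMeta:
    name: str
    start_ms: int  # smallest time value in the segment
    end_ms: int    # largest time value in the segment

def prune_by_time(segments, query_start_ms, query_end_ms):
    """Keep only segments whose [start, end] overlaps the query range."""
    return [s for s in segments
            if s.end_ms >= query_start_ms and s.start_ms <= query_end_ms]

segments = [
    SegmentMeta("seg_0", 0, 999),
    SegmentMeta("seg_1", 1_000, 1_999),
    SegmentMeta("seg_2", 2_000, 2_999),
]
survivors = prune_by_time(segments, 1_500, 2_100)
print([s.name for s in survivors])  # ['seg_1', 'seg_2']
```

Pruning removes whole segments; inside a surviving segment the time filter still has to be evaluated per row, which is where a sorted or range index on the time column could help.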

Xiang Fu

08/20/2021, 7:11 AM

do you have a sorted index?

Xiang Fu

08/20/2021, 7:11 AM

you can sort time

Sadim Nadeem

08/20/2021, 7:12 AM

but the time coming on kafka topic may not be sorted .. I mean some unordered /unsorted time may come while ingestion

Sadim Nadeem

08/20/2021, 7:12 AM

so is a sorted index recommended in such a scenario?

Sadim Nadeem

08/20/2021, 7:12 AM

for time column

Xiang Fu

08/20/2021, 7:13 AM

pinot will sort data when it’s persisting the data to disk

Xiang Fu

08/20/2021, 7:13 AM

but you can only pick one sorted index
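
A small sketch of why a sorted column makes time filters cheap, under the assumption stated above that rows are sorted when the segment is persisted: once values are in order, a range filter reduces to two binary searches that yield one contiguous block of rows. The timestamps are invented; in a table config the sorted column is usually named through `sortedColumn` in `tableIndexConfig`.

```python
from bisect import bisect_left, bisect_right

def doc_range_for_filter(sorted_values, low, high):
    """Return (first_doc, end_doc_exclusive) for rows with low <= value <= high."""
    return bisect_left(sorted_values, low), bisect_right(sorted_values, high)

# Rows arrive out of order from the stream...
arrived = [1030, 1010, 1050, 1020, 1040, 1020]
# ...and are put in order on the chosen column when the segment is written.
persisted = sorted(arrived)

first, end = doc_range_for_filter(persisted, 1020, 1040)
print(first, end)             # 1 5
print(persisted[first:end])   # [1020, 1020, 1030, 1040]
```

Because the physical row order can follow only one column, a table can have only one sorted column.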

Sadim Nadeem

08/20/2021, 7:14 AM

ok then time column i will select as sorted indexing column

Sadim Nadeem

08/20/2021, 7:14 AM

then range indexing won't be needed on the time column, right?

Sadim Nadeem

08/20/2021, 7:14 AM

also sorted indexing will internally also be doing inverted indexing

Sadim Nadeem

08/20/2021, 7:18 AM

also one last doubt: will inverted indexing perform the same for string columns as for long/int columns, or will int/long columns still perform better even with the index applied? My use case is that I have tenantId and other ID columns stored as string, but they contain digits only. Is it necessary to make them long in the schema before applying inverted indexing, or can I keep the string data type with inverted indexing without much impact on query latency for filter or group by?

Xiang Fu

08/20/2021, 7:30 AM

you can use string, but you will pay more storage cost

Sadim Nadeem

08/20/2021, 7:30 AM

latency wise not much impact?

Xiang Fu

08/20/2021, 7:31 AM

if you can use long type to represent id, then suggest to use long

Xiang Fu

08/20/2021, 7:31 AM

latency wise, using long will definitely help

✔️ 1

Sadim Nadeem

08/20/2021, 7:31 AM

cool

Subbu Subramaniam

08/20/2021, 4:20 PM

@Sadim Nadeem please use the controller endpoint to recommend indexes. As for inverted indexes, the only time they are on heap is when a segment is in the consuming state. Once segments are completed (or they are offline segments), nothing is stored on heap

✔️ 1

Sadim Nadeem

08/20/2021, 5:01 PM

Ok @Subbu Subramaniam got it

Sadim Nadeem

08/23/2021, 7:37 AM

Hi @Subbu Subramaniam @Xiang Fu @Jackie @Mayank: will inverted indexing also help for IN clause queries, or only for equals?

Xiang Fu

08/23/2021, 7:38 AM

Yes it will
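
To see why the same index serves both predicates, here is a toy inverted index in plain Python: each value maps to the set of row ids that contain it, so an equality filter is one lookup and an IN filter is the union of several lookups. This is only an illustration with invented data; Pinot's own implementation uses compressed bitmaps rather than Python sets.

```python
from collections import defaultdict

def build_inverted_index(column_values):
    """Map each distinct value to the set of doc ids where it appears."""
    index = defaultdict(set)
    for doc_id, value in enumerate(column_values):
        index[value].add(doc_id)
    return index

def docs_equal(index, value):
    """Doc ids matching `column = value`."""
    return set(index.get(value, ()))

def docs_in(index, values):
    """Doc ids matching `column IN (values)`: union of per-value sets."""
    matched = set()
    for v in values:
        matched |= index.get(v, set())
    return matched

tenant_ids = [7, 3, 7, 9, 3, 5]
idx = build_inverted_index(tenant_ids)
print(sorted(docs_equal(idx, 7)))     # [0, 2]
print(sorted(docs_in(idx, [3, 9])))   # [1, 3, 4]
```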

Sadim Nadeem

08/23/2021, 7:39 AM

thanks @Xiang Fu..

Sadim Nadeem

08/23/2021, 7:49 AM

also is it possible to remove indexing from a particular column in the table and reload all segments? Suppose in future I don't want lucene fst indexing on one string column, say col1: can I edit the table config, remove that index, and reload all segments without causing much trouble?

Sadim Nadeem

08/23/2021, 7:50 AM

also if the table schema is changed by adding a column .. then what impact it will have on existing indexes and can i add indexing on the newly added column @Xiang Fu

Xiang Fu

08/23/2021, 7:53 AM

Indexing deletion feature is under development

Xiang Fu

08/23/2021, 7:53 AM

Right now, removing an index from the table config won't delete the physical index on disk

Sadim Nadeem

08/23/2021, 7:53 AM

but wont cause any trouble on the table

Xiang Fu

08/23/2021, 7:53 AM

Adding new column has no impact on existing index

Xiang Fu

08/23/2021, 7:54 AM

No trouble, just more disk usage

✅ 1

Sadim Nadeem

08/23/2021, 7:54 AM

ok, and can we apply an index on the newly added column in the existing table by editing the schema and then reloading all segments?

Sadim Nadeem

08/23/2021, 7:56 AM

I mean first step will be editing schema and adding a column > reload all segments > edit table > add inverted indexing on newly added column > again reload all segments => will it cause any issue?

Xiang Fu

08/23/2021, 8:50 AM

once you add the new column, then you can add index and reload segments

Xiang Fu

08/23/2021, 8:51 AM

your steps will work
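
The four steps confirmed above can be written down as an ordered plan of controller calls. This is a sketch under assumptions: the controller address, table name and schema name are hypothetical, and the function only builds the request list without sending anything.

```python
def add_column_then_index_plan(controller, table, schema):
    """Return the ordered (method, url, purpose) steps discussed above."""
    reload_url = f"{controller}/segments/{table}/reload"
    return [
        ("PUT", f"{controller}/schemas/{schema}", "upload schema with the new column"),
        ("POST", reload_url, "reload so existing segments get the new column"),
        ("PUT", f"{controller}/tables/{table}", "add the inverted index to the table config"),
        ("POST", reload_url, "reload again so the index is built"),
    ]

# Hypothetical names, for illustration only.
plan = add_column_then_index_plan("http://localhost:9000", "events", "events")
for method, url, purpose in plan:
    print(f"{method:4} {url}  # {purpose}")
```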

Sadim Nadeem

08/23/2021, 9:43 AM

thanks
