can we apply inverted indexing or range indexing t...
# troubleshooting
s
can we apply inverted indexing or range indexing to existing old table columns with tens of millions of records and what impact it will have on the table performance and should I expect query latency to improve for old data queried as well ..
x
it will improve the queries with predicates on the columns you put indexes
you can apply it on old tables then reload all the segments
s
what im,pact wil it have on heap memory
x
there will be extra disk space overhead
s
will i require to increase heap memory since indexing will be added separately as bitmap in heap memory
will it be stored in ram or ssd
x
index are on ssd, runtime pinot will load data/indexes
s
ok once processed .. they will be written on ssd and gc will clear the heap eventually for hot segments which are being consumed?
x
right
it will be off-heap
👍 1
s
how difficult is it to apply lucene fst indexing for regexp queries
I mean i onl;y need to apply the indexing on the string columns or anything else also needed
x
yes
apply index, then it will improve the text search query
but you pay extra disk space for index
✔️ 1
s
what is the preferred or recommended indexing for substring while doing group by and filter .. lucene fst?
x
lucene will help on filter not group by
s
ok filter will prune a lot of segments / records
that will also improve latency drastically i guess
since we are also doing filter with substring on a string column
so lucene fst on that string column where we are doing substring while filter will help?
apply index, then it will improve the text search query => will regexp query or substring query performance wont improve if i use lucene fst and only will improve if i use text_match clause ?
cc: @Mayank @Jackie @Subbu Subramaniam @Kishore G @Neha Pawar @Ken Krugler @Daniel Lavoie
cc: @Mohamed Kashifuddin @Shailesh Jha @Mohamed Hussain
x
it will improve the text_match
substring won’t get benefit or regex
✔️ 1
s
ok
if time column is primary time column in schema and stored as long data type in milliseconds epoch with granularity as seconds .. is it necessary to use range indexing on time column on my tables if all queries have filter(where clause) with respect to time say last 5 min or last 1 hour or last 30 days @Xiang Fu.. I mean by default segments do have start time and end time if time is primary time column I GUESS .. and segment pruning will happen by default or range indexing on time column is needed?
x
segment prune will take care of start/end time if it’s in your predicate
s
so range indexing needed ?
x
no
s
ok got it
x
it will help, but since your segments are ordered by time already, so I feel it’s not necessary to put index
✔️ 1
do you have sort index
you can sort time
s
ok
but the time coming on kafka topic may not be sorted .. I mean some unordered /unsorted time may come while ingestion
so is sorted index recommeded in such a scenario?
for time column
x
pinot will sort data when it’s persisting the data to disk
but you can only pick one sorted index
s
ok then time column i will select as sorted indexing column
then range indexing wont be needed on time column .. right
also sorted indexing will internally also be doing inverted indexing
also one last doubt .. inverted indexing performance will be same for string type columns and long/int columns or int/long columns are going to give better performance even after indexing applied .. my use case is I have some tenantId or other ID columns as string but they are digits only .. so is it necessary to keep them as long type in schema and then only apply inverted indexing or I can even use string data type with inverted indexing and it wont have much impact on query latency while filtering or group by ..
x
you can use string, but you will pay more storage cost
s
latency wise not much impact?
x
if you can use long type to represent id, then suggest to use long
s
ok
x
latency wise, using long will definitely help
✔️ 1
s
cool
s
@Sadim Nadeem please use the controller endpoint to recommend indexes. As for inverted indexes, the only time they are on heap is when a segment is consming state. Once segments are completed (or they are offline segments) there is nothing stored in heap
✔️ 1
s
Ok @Subbu Subramaniam got it
HI @Subbu Subramaniam @Xiang Fu @Jackie @Mayank: will inverted indexing also help for IN clause query or only for equals?
x
Yes it will
s
thanks @Xiang Fu..
also is it possible to remove indexing from any partiular column in the table and reload all segments like suppose in future i dont want lucene fst indexing on one string type column say col1.. so can i edit the table and remove that indexing and reload all segments and it wont cause much trouble?
also if the table schema is changed by adding a column .. then what impact it will have on existing indexes and can i add indexing on the newly added column @Xiang Fu
x
Indexing deletion feature is under development
Right now, delete indexing won't delete physical index on disk
s
but wont cause any trouble on the table
x
Adding new column has no impact on existing index
No trouble, just more disk usage
1
s
ok and can we apply index on the newly added column in the exisitng table by editing the schema and then reload all segments?
I mean first step will be editing schema and adding a column > reload all segments > edit table > add inverted indexing on newly added column > again reload all segments => will it cause any issue?
x
once you add the new column, then you can add index and reload segments
your steps will work
s
thanks