Does pinot stores min max values for dimensions in...
# general
o
Does pinot stores min max values for dimensions in segment metadata? Or does it just store min max values for date time fields? And can we create inverted or any other indices on date time column?
k
Its store min max for all columns.
๐Ÿ‘ 1
o
So, can we create indices on time column?
m
You can create Inv index for any column including time. However, if time column is sorted naturally, you donโ€™t need to.
What is your time unit
o
time column is not sorted. time unit is day in epoch timestamp format
m
If day, then you will have one inv index value point to all days worth of records, which might be same as metadata based pruning.
o
perfect actually, i can store this column as dimension as "2021-01-03 000000" format for fast filtering grouping operation. but i don't know if there is any difference to define it as dimension or time column
m
Mind sharing your query patterns?
General rule of thumb is to partition on primary key (if there is one). If not then sort on dimension that appears on most queries. and then add secondary in index as needed for columns that have higher selectivity first
o
i have two time column; one is eventDate and the other is represents orderDate. All of my queries based on sellerId and orderDate.
Copy code
select abc, xyz, sum(x).. 
from table 
group by abc, xyz
where sellerId = 123 and orderDate > X and orderDate < y
the other query example can be like that;
Copy code
select orderDate, abc, xyz, sum(x).. 
from table 
group by orderDate, abc, xyz
where sellerId = 123 and orderDate > X and orderDate < y
I think to partition data by sellerid(for broker side pruning), create inverted or sorted index on sellerId(i'm not sure which one is the better now), and create inverted index on other required columns
m
Sort and partition on seller and inv index on other
Sorted is always better than inv as it gives better data locality
o
ohh, thank you much!
๐Ÿ‘ 1
I've tried an example. I created realtime table, and set
orderDate
as time field. Totally 4 segments created. When i send below query;
select * from orders
query stats:
Copy code
numSegmentsQueried: 4
numSegmentsProcessed: 4
numSegmentsMatched: 4
And, when i send below query;
select * from orders where orderDate > 1619458466000
query stats:
Copy code
numSegmentsQueried: 4
numSegmentsProcessed: 1
numSegmentsMatched: 1
Seems pinot prune segments based on min/max orderDate of segment metadata, right? But i can't see min/max orderDate in realtime segment metadata. How can i see it?
m
It should be there in metadata.properties
o
is
/segments/{tableName}/{segmentName}/metadata
looking to metadata.properties?
m
Yes
For segments that are committed to disk
For ones that are still open it might be looking at segment zk metadata.
The fact that num segments matched is 1 you can be assured they pruning worked
o
yes segment pruning based on min/max values of a time column seems works. But i couldn't sure, because when i try to get metadata of completed segment;
Copy code
{
  "segment.realtime.endOffset": "3",
  "segment.time.unit": "MILLISECONDS",
  "segment.start.time": "1616250195000",
  "segment.flush.threshold.size": "50000",
  "segment.realtime.startOffset": "0",
  "segment.end.time": "1616250220000",
  "segment.total.docs": "3",
  "segment.table.name": "orders_REALTIME",
  "segment.realtime.numReplicas": "1",
  "segment.creation.time": "1616261054496",
  "segment.realtime.download.url": "<http://10.212.234.26:9000/segments/orders/orders__0__0__20210320T1724Z>",
  "segment.name": "orders__0__0__20210320T1724Z",
  "segment.index.version": "v3",
  "custom.map": null,
  "segment.flush.threshold.time": null,
  "segment.type": "REALTIME",
  "segment.crc": "1271635767",
  "segment.realtime.status": "DONE"
}
I can't see min/max orderDate values. Is that segment can be still located in memory? But it should be closed about 30 minutes ago
Also, I set `realtime.segment.flush.threshold.time`to 2M, totally 4 segments created. The creation time of the last segment is 20:30 but segment.realtime.status of it is still
"IN_PROGRESS"
m
Do you see any functional issues or you are just trying to verify?
o
When i see response stats of query, it seems works. I'm just trying to verify
Copy code
numSegmentsQueried: 5
numSegmentsProcessed: 2
numSegmentsMatched: 1
what is the difference between numSegmentsProcessed and numSegmentsMatched? I've looked document, but i did not understand it. Only one segment matched, but pinot processed 2 segments?
m
I think
numSegmentsProcessed
is the number of segments that were processed/looked-at. And
numSegmentsMatched
can be <=
numSegmentsProcessed
, because even if a segment is processed there may be 0 records matched in that segment.
Does that make sense?
o
yes make sense, thank you again!
๐Ÿ‘ 1