Hi Community, I am fairly new to pinot. Does the s...
# troubleshooting
l
Hi Community, I am fairly new to pinot. Does the star tree index help when we have aggregate queries with time range (and exclusion) filters ?
Here is the query:
Copy code
SELECT 
  created_at_1_hour_seconds as time_col, 
  SUM(id) as total

FROM table 

WHERE 
  (created_at_seconds BETWEEN 1615924566 AND 1617134166) AND 
  ((field1 <> 'null') AND 
  (entity_id NOT IN ('uuid1', 'uuid2', 'uuid3'))

GROUP BY time_col 
ORDER BY time_col ASC
table/index config:
Copy code
indexSpec:
  starTreeIndexConfigs:
    - dimensionsSplitOrder:
      - created_at_1_hour_seconds
      - created_at_seconds
      - field1
      - entity_id
      functionColumnPairs:
        - function: SUM
          column: id
      skipStarNodeCreationForDimensions:
        - created_at_seconds
  bloomFilterColumns:
    - field1
    - entity_id
j
It helps if the cardinality of the columns are relatively low
l
Not sure if created_at_seconds should be on the star tree index dimension split order as the cardinality is very high (timestamps are millisecs granularity).
j
In order to use star-tree to solve queries with filter on it, it needs to be included in the split order
created_at_seconds
should be second granularity right?
l
my bad, yeah its seconds.
j
Cardinality wise it should be fine. How about
entity_id
?
I feel it's cardinality is going to be high
l
should created_at_seconds be on the dimension order given the high Cardinality ?
entity_id has high cardinality, but total_records (in millions) >> total_entities (in 10s of thousands).
j
Ok, so both
entity_id
and
created_at
would have cardinality of 10s thousands per segment
I would recommend not including
created_at_seconds
in the split order in this case. Segments that are fully covered in the time range will use the star-tree index. Segments that are partially covered will fall back to the non-aggregated records.
👍 1
Then splitOrder will be
created_at_1_hour_seconds, field1, entity_id
l
Got it, we have this in the star tree index config as well:
Copy code
skipStarNodeCreationForDimensions:
        - created_at_seconds
thanks for the reply @Jackie!
j
Yeah, you may remove it from the skip list as well
👍 1
l
We got an another query with more complex aggregations:
Copy code
SELECT 
  dimension_uuid as dimension,
  AVG(total) AS avg_total,
  SUM(total)/DistinctCount(entity_id) AS total_per_entity,
  COUNT(order_id) AS order_count,
  SUM(total) AS total_amount,
  COUNT(order_id)/DistinctCount(entity_id) AS orders_per_entity,
  DISTINCTCOUNT(entity_id) AS entity_count 

FROM table

WHERE 
  (created_at_seconds BETWEEN 1617049575 AND 1617654375) AND 
  (field1 <> 'null') AND 
  (entity_id NOT IN ('uuid1', 'uuid2', 'uuid3'))

GROUP BY dimension 
ORDER BY order_count DESC 
LIMIT 50
this one has the same set of filters, but does a bunch of aggregations, including DISTINCTCOUNT. Is there a chance to improve performance using star tree index at all for the aggregations ?
j
Here you can find all the supported functions: https://docs.pinot.apache.org/basics/indexing/star-tree-index
If you need accurate distinct (
distinctcount
instead of
distinctcounthll
), then it cannot be supported by star-tree due to the risk of storage explosion
For the second query, you need to put both
dimension_uuid
and
entity_id
into the split list. The performance of star-tree comes from the pre-aggregation of the records, and I'm not sure if we can get much pre-aggregation with these 2 high cardinality dimensions
l
Was skeptical about DISTINCTCOUNT as it mentioned that is not supported due to storage explosion problem, will look into the possibility of using distinctcounthll and also measure the performance with and without index to see if there is a difference.
Any tool/command to understand the effect of the index when processing the query ?
j
There is no in-built tool for that. For experiment, you can set up 2 tables with the same data, one with index and one without, and query them separately to compare the throughput and latency
l
Got it, thats the pretty much the setup we got, along with checking the query stats (especially
numEntriesScannedPostFilter
in the response stats to see the impact of star tree filter on aggregations) for the same query with and without index, along with latency.
Is there an accuracy (standard error) expectation on the distinctcounthll algorithm used in pinot ?
j
We expect ~2% standard error for HyperLogLog. Reference: https://en.wikipedia.org/wiki/HyperLogLog
l
is it 2% for low value metrics as well ? Any plans to implement something like HLL+ ?
j
Based on my understanding, it will be quite accurate with few values
The HLL Sketch from
DataSketches
seems promising: https://datasketches.apache.org/docs/HLL/HLL.html
We don't have it supported yet. It should not be hard to add. Contributions are very welcome.
👍 1
l
Thank you so much for the responses @Jackie !
j
You're welcome 😉
l
Does pinot use HLL from datasketches that you pointed above ?
In star-tree, we use the default
log2m: 8
l
got it, which means per this formula, if we plug in log2m as 8, we get 1-1.04/sqrt(2^8) => 93.5% is the accuracy ?
j
Hmm, I think you are correct
You can easily test the error rate by comparing the
distinctcount
and
distinctcounthll
results
👍 2
🙏 1
l
Hi @Jackie, We have 3 servers in the tenant with replication factor of 3 (pretty much all the servers have the replica of all the segments). Like we saw above, all of our queries are filtered by timestamp. Is it possible to limit the no of segments queried by servers using segmentPartitionConfig ? Was wondering if the segment pruning for timeColumn filter works automatically or if we need to configure something.
j
It works automatically based on the min/max value of the segment
👍 1
No need to explicitly configure
l
Sounds good, thanks !