Lakshmanan Velusamy
04/05/2021, 11:01 PMLakshmanan Velusamy
04/05/2021, 11:09 PMSELECT
created_at_1_hour_seconds as time_col,
SUM(id) as total
FROM table
WHERE
(created_at_seconds BETWEEN 1615924566 AND 1617134166) AND
((field1 <> 'null') AND
(entity_id NOT IN ('uuid1', 'uuid2', 'uuid3'))
GROUP BY time_col
ORDER BY time_col ASC
table/index config:
indexSpec:
starTreeIndexConfigs:
- dimensionsSplitOrder:
- created_at_1_hour_seconds
- created_at_seconds
- field1
- entity_id
functionColumnPairs:
- function: SUM
column: id
skipStarNodeCreationForDimensions:
- created_at_seconds
bloomFilterColumns:
- field1
- entity_id
Jackie
04/05/2021, 11:11 PMLakshmanan Velusamy
04/05/2021, 11:11 PMJackie
04/05/2021, 11:12 PMJackie
04/05/2021, 11:12 PMcreated_at_seconds
should be second granularity right?Lakshmanan Velusamy
04/05/2021, 11:13 PMJackie
04/05/2021, 11:13 PMentity_id
?Jackie
04/05/2021, 11:13 PMLakshmanan Velusamy
04/05/2021, 11:13 PMLakshmanan Velusamy
04/05/2021, 11:15 PMJackie
04/05/2021, 11:16 PMentity_id
and created_at
would have cardinality of 10s thousands per segmentJackie
04/05/2021, 11:20 PMcreated_at_seconds
in the split order in this case. Segments that are fully covered in the time range will use the star-tree index. Segments that are partially covered will fall back to the non-aggregated records.Jackie
04/05/2021, 11:22 PMcreated_at_1_hour_seconds, field1, entity_id
Lakshmanan Velusamy
04/05/2021, 11:25 PMskipStarNodeCreationForDimensions:
- created_at_seconds
Lakshmanan Velusamy
04/05/2021, 11:26 PMJackie
04/05/2021, 11:26 PMLakshmanan Velusamy
04/05/2021, 11:26 PMSELECT
dimension_uuid as dimension,
AVG(total) AS avg_total,
SUM(total)/DistinctCount(entity_id) AS total_per_entity,
COUNT(order_id) AS order_count,
SUM(total) AS total_amount,
COUNT(order_id)/DistinctCount(entity_id) AS orders_per_entity,
DISTINCTCOUNT(entity_id) AS entity_count
FROM table
WHERE
(created_at_seconds BETWEEN 1617049575 AND 1617654375) AND
(field1 <> 'null') AND
(entity_id NOT IN ('uuid1', 'uuid2', 'uuid3'))
GROUP BY dimension
ORDER BY order_count DESC
LIMIT 50
Lakshmanan Velusamy
04/05/2021, 11:28 PMJackie
04/05/2021, 11:30 PMJackie
04/05/2021, 11:31 PMdistinctcount
instead of distinctcounthll
), then it cannot be supported by star-tree due to the risk of storage explosionJackie
04/05/2021, 11:33 PMdimension_uuid
and entity_id
into the split list. The performance of star-tree comes from the pre-aggregation of the records, and I'm not sure if we can get much pre-aggregation with these 2 high cardinality dimensionsLakshmanan Velusamy
04/05/2021, 11:36 PMLakshmanan Velusamy
04/05/2021, 11:37 PMJackie
04/05/2021, 11:41 PMLakshmanan Velusamy
04/05/2021, 11:43 PMnumEntriesScannedPostFilter
in the response stats to see the impact of star tree filter on aggregations) for the same query with and without index, along with latency.Lakshmanan Velusamy
04/05/2021, 11:50 PMJackie
04/06/2021, 12:12 AMLakshmanan Velusamy
04/06/2021, 12:16 AMJackie
04/06/2021, 12:23 AMJackie
04/06/2021, 12:25 AMDataSketches
seems promising: https://datasketches.apache.org/docs/HLL/HLL.htmlJackie
04/06/2021, 12:25 AMLakshmanan Velusamy
04/06/2021, 2:43 AMJackie
04/06/2021, 2:44 AMLakshmanan Velusamy
04/06/2021, 2:44 AMJackie
04/06/2021, 2:46 AMJackie
04/06/2021, 2:48 AMlog2m: 8
Lakshmanan Velusamy
04/06/2021, 2:59 AMJackie
04/06/2021, 6:31 AMJackie
04/06/2021, 6:32 AMdistinctcount
and distinctcounthll
resultsLakshmanan Velusamy
04/07/2021, 6:39 AMJackie
04/07/2021, 7:21 AMJackie
04/07/2021, 7:22 AMLakshmanan Velusamy
04/07/2021, 8:23 AM