James Mnatzaganian
02/02/2022, 9:00 PM
• The primary query pattern is exact matching, e.g. where foo = 'something'
• The secondary query pattern is the same as the primary, but with inexact matching, which could potentially include regex (both patterns are sketched just after this list)
• Data will be treated as immutable (exceptions for data deletions for GDPR)
• Data will be bulk loaded and already pre-aggregated
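(A rough sketch of the two query patterns, using a hypothetical table my_table and the foo column from above; REGEXP_LIKE is Pinot's regex filter function, and all names here are placeholders rather than anything from this thread.)
-- Primary pattern: exact match on a single column
SELECT COUNT(*) FROM my_table WHERE foo = 'something'

-- Secondary pattern: inexact match, potentially a regex
SELECT COUNT(*) FROM my_table WHERE REGEXP_LIKE(foo, 'some.*thing')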
Questions:
1. Can Pinot efficiently handle data with cardinalities of this magnitude?
2. Are sub-second response times feasible for exact matching?
3. Can Pinot efficiently handle aggregates across multiple fields with high cardinality or is it better to split them into smaller subsets?
4. Any gut feeling estimate as to how large of a cluster I would need? Even if Pinot can work, the next question is how much will it cost 🙂
Mayank
1. Yes high cardinality is not an issue.
2. Yes, with indexing, you can do a lot better than sub-second for matching equality predicates (a rough sketch follows below).
3. Is it just aggregates or group-by too? Both are fine, will need to know more concrete info to suggest optimizations.
4. For cluster size, will need more info like data size, read/write qps, latency requirements, query selectivity on average, etc.
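(Sketch for point 2, assuming a hypothetical table named events with a string column foo; none of these names come from the thread. The query itself stays a plain equality filter; the speed-up comes from the table config, where the column would typically be listed under invertedIndexColumns and optionally bloomFilterColumns.)
-- Hypothetical names; assumes foo is configured with an inverted index
-- (invertedIndexColumns) and optionally a bloom filter (bloomFilterColumns)
-- in the Pinot table config, so the equality filter resolves via the index
-- rather than a scan.
SELECT COUNT(*)
FROM events
WHERE foo = 'something'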
James Mnatzaganian
02/02/2022, 9:31 PM
A: Exact match
-- lowest-level granularity
SELECT
    a
  , b
  , c
  , SUM(num_events) AS num_events
FROM ...
WHERE event_date BETWEEN ... AND ...
  AND a = ... AND b = ... AND c = ...
GROUP BY 1, 2, 3
B: Count distinct
SELECT
    a
  , b
  , COUNT(DISTINCT c) AS num_uq_c
  , SUM(num_events) AS num_events
FROM ...
WHERE event_date BETWEEN ... AND ...
  AND a = ... AND b = ...
GROUP BY 1, 2
C: Complex rollup, producing a summarized view with multiple levels of aggregation
WITH data AS (
  -- Reduce dataset
  SELECT
      a
    , b
    , c
    , SUM(num_events) AS num_events
  FROM ...
  WHERE event_date BETWEEN ... AND ...
    AND c = ...
  GROUP BY 1, 2, 3
), low_level AS (
  -- Find specific low-level result
  SELECT
      a
    , b
    , c
    , num_events
  FROM data
  WHERE a = ... AND b = ... AND c = ...
  LIMIT 1
), mid_level AS (
  -- Do the next-level rollup
  SELECT
      a
    , c
    , COUNT(1) AS num_uq_b
    , SUM(num_events) AS num_events_r1
  FROM data
  WHERE a = ...
  GROUP BY 1, 2
), top_level AS (
  -- Top-level rollup
  SELECT
      c
    , COUNT(DISTINCT a) AS num_uq_a
    , SUM(num_events) AS num_tot_events
    , COUNT(1) AS num_uq_ab
  FROM data
  GROUP BY 1
)
SELECT
    ll.*
  , ml.*
  , tl.*
FROM low_level ll
JOIN mid_level ml ON ll.a = ml.a AND ll.c = ml.c
JOIN top_level tl ON ml.c = tl.c
(4) The expectation is to load data once / hr. So let's say 100M records / hr (in reality there'd be multiple loads, but I can do math from here). Data size is small, let's say 1KiB / record, resulting in <100 GiB / hr. Writes will only be on the hourly bulk load cadence. Reads are low volume - a few thousand / day, with most (let's say ~90%) doing exact searches.
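(A quick sanity check on those numbers, assuming exactly 100M records at 1 KiB each:
$10^{8}\ \text{records/hr} \times 1\ \text{KiB/record} = 10^{8}\ \text{KiB/hr} \approx 95.4\ \text{GiB/hr},$
which is indeed just under 100 GiB/hr.)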
edit - forgot to add the group by in some of the queries 🤦‍♂️
Mayank
James Mnatzaganian
02/15/2022, 10:33 PM
SELECT DISTINCTCOUNTHLL(...) FROM ... fails to run. It seems that a single core gets pegged and a 427 error code is returned.
A more representative query is much better. It takes ~3s for a DISTINCTCOUNT and ~300ms for a DISTINCTCOUNTHLL where the where clause matches on ~3M documents (again exact string matching on a column with a bloom index). The box seems to be asleep.
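(Roughly what that comparison looks like as queries; the table name, the counted column, and the filter are placeholders, not the actual schema from this thread.)
-- Exact distinct count: ~3s when the filter matches ~3M documents
SELECT DISTINCTCOUNT(some_col) FROM my_table WHERE foo = 'something'

-- Approximate (HLL-based) distinct count: ~300ms on the same filter
SELECT DISTINCTCOUNTHLL(some_col) FROM my_table WHERE foo = 'something'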
I'm not sure why the first query fails or if there are settings to tweak to make things faster, but overall the results are promising. It looks like the bounding isn't by cardinality but by the number of documents matching the predicate.
Mayank
James Mnatzaganian
02/16/2022, 2:30 PM
Mayank
James Mnatzaganian
02/16/2022, 2:57 PM
The psf files are the segments; those are ~1.3 GiB. I had a DISTINCTCOUNT on a field and this was really hurting the performance (even with the field being part of the index). Once I removed that, the problem went away. I'm assuming that if I did the same with DISTINCTCOUNTHLL it would similarly be very fast.
Overall, I was able to validate that Pinot does solve the query performance problem (to be frank, better than any technology that I have seen thus far), provided that your access pattern is rigid and you precompute all the indexes in advance. I still had issues where any query that didn't hit the index wouldn't return and seemingly didn't utilize the cluster well, but for this specific use case with established query patterns Pinot could be a good fit.
Thanks for the help!