James Mnatzaganian 02/02/2022, 9:00 PM
• The secondary query pattern is the same as the primary, but with inexact matching, which could potentially include regex
• Data will be treated as immutable (with exceptions for GDPR data deletions)
• Data will be bulk loaded and already pre-aggregated

Questions:
1. Can Pinot efficiently handle data with cardinalities of this magnitude?
2. Are sub-second response times feasible for exact matching?
3. Can Pinot efficiently handle aggregates across multiple fields with high cardinality, or is it better to split them into smaller subsets?
4. Any gut-feeling estimate as to how large a cluster I would need? Even if Pinot can work, the next question is how much it will cost 🙂
where foo = 'something'
1. Yes, high cardinality is not an issue.
2. Yes, with indexing you can do a lot better than sub-second for matching equality predicates.
3. Is it just aggregates, or group-by too? Both are fine; I'll need more concrete info to suggest optimizations.
4. For cluster size, I'll need more info: data size, read/write QPS, latency requirements, average query selectivity, etc.
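For reference on point 2: equality predicates are served from per-column indexes declared in the Pinot table config. A minimal sketch of the relevant fragment of `tableIndexConfig`; the column names `a`, `b`, and `c` are placeholders, not the actual schema:

```json
{
  "tableIndexConfig": {
    "invertedIndexColumns": ["a", "b"],
    "bloomFilterColumns": ["c"]
  }
}
```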
James Mnatzaganian 02/02/2022, 9:31 PM
B: Count distinct
-- lowest-level granularity
SELECT
    a
  , b
  , c
  , SUM(num_events) AS num_events
FROM ...
WHERE event_date BETWEEN ... AND ...
  AND a = ...
  AND b = ...
  AND c = ...
GROUP BY 1, 2, 3
C: Complex rollup, producing summarized view with multiple levels of aggregations
SELECT
    a
  , b
  , COUNT(DISTINCT c) AS num_uq_c
  , SUM(num_events) AS num_events
FROM ...
WHERE event_date BETWEEN ... AND ...
  AND a = ...
  AND b = ...
GROUP BY 1, 2
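As a sanity check of what pattern C returns, here's a minimal sketch against SQLite with a toy table; the columns `a`, `b`, `c`, and `num_events` mirror the query above, and the data is made up:

```python
import sqlite3

# Toy stand-in for the pre-aggregated events table (data is invented).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (a TEXT, b TEXT, c TEXT, num_events INTEGER)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?, ?)",
    [("a1", "b1", "c1", 5), ("a1", "b1", "c2", 3), ("a1", "b2", "c1", 2)],
)

# Pattern C: roll up to (a, b), counting distinct c and summing events
# (explicit column names instead of GROUP BY 1, 2, plus ORDER BY for
# deterministic output).
rows = conn.execute(
    """
    SELECT a, b, COUNT(DISTINCT c) AS num_uq_c, SUM(num_events) AS num_events
    FROM events
    WHERE a = 'a1'
    GROUP BY a, b
    ORDER BY a, b
    """
).fetchall()
print(rows)  # [('a1', 'b1', 2, 8), ('a1', 'b2', 1, 2)]
```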
(4) The expectation is to load data once / hr. So let's say 100M records / hr (in reality there'd be multiple loads, but I can do math from here). Data size is small, let's say 1KiB / record, resulting in <100 GiB / hr. Writes will only be on the hourly bulk load cadence. Reads are low volume - a few thousand / day, with most (let's say ~90%) doing exact searches. edit - forgot to add the
WITH data AS (
    -- Reduce dataset
    SELECT
        a
      , b
      , c
      , SUM(num_events) AS num_events
    FROM ...
    WHERE event_date BETWEEN ... AND ...
      AND c = ...
    GROUP BY 1, 2, 3
), low_level AS (
    -- Find specific low-level result
    SELECT a, b, c, num_events
    FROM data
    WHERE a = ... AND b = ... AND c = ...
    LIMIT 1
), mid_level AS (
    -- Do the next-level rollup
    SELECT
        a
      , c
      , COUNT(1) AS num_uq_b
      , SUM(num_events) AS num_events_r1
    FROM data
    WHERE a = ...
    GROUP BY 1, 2
), top_level AS (
    -- Top-level rollup
    SELECT
        c
      , COUNT(DISTINCT a) AS num_uq_a
      , SUM(num_events) AS num_tot_events
      , COUNT(1) AS num_uq_ab
    FROM data
    GROUP BY 1
)
SELECT ll.*, ml.*, tl.*
FROM low_level ll
JOIN mid_level ml ON ll.a = ml.a AND ll.c = ml.c
JOIN top_level tl ON ml.c = tl.c
in some of the queries 🤦♂️
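The sizing arithmetic in (4) above can be sketched out, assuming exactly 100M records/hr at exactly 1 KiB each:

```python
# Back-of-envelope check of the hourly load math: ~100M records/hr at
# ~1 KiB/record (both treated as exact for this sketch).
records_per_hour = 100_000_000
bytes_per_record = 1024  # 1 KiB

gib_per_hour = records_per_hour * bytes_per_record / 2**30
print(f"{gib_per_hour:.1f} GiB/hr")  # 95.4 GiB/hr, i.e. just under 100 GiB/hr
```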
James Mnatzaganian 02/15/2022, 10:33 PM
fails to run: a single core gets pegged and a 427 error code is returned. A more representative query does much better. It takes ~3s for a
SELECT DISTINCTCOUNTHLL(...) FROM ...
and ~300ms for a
clause matches on ~3M documents (again, exact string matching on a column with a bloom index). The box seems to be mostly idle. I'm not sure why the first query fails, or if there are settings to tweak to make things faster, but overall the results are promising. It looks like performance is bounded not by cardinality, but by the number of documents matching the predicate.
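For context on why the bloom index behaves this way: a Bloom filter per segment lets the server skip any segment that definitely does not contain the predicate value, so work scales with the documents that may match rather than with column cardinality. A toy sketch of the idea (illustrative only, not Pinot's implementation):

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter for illustration; not Pinot's implementation."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, value):
        # Derive k bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value):
        # False => value is definitely absent; True => value may be present.
        return all((self.bits >> pos) & 1 for pos in self._positions(value))

# One filter per segment: a query like `where foo = 'something'` can skip
# every segment whose filter reports the value as definitely absent, and
# only scan segments that may hold matching documents.
segment_filter = BloomFilter()
for value in ("foo", "bar"):
    segment_filter.add(value)

print(segment_filter.might_contain("foo"))   # True
print(segment_filter.might_contain("quux"))  # False (barring a false positive)
```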
James Mnatzaganian 02/16/2022, 2:30 PM
James Mnatzaganian 02/16/2022, 2:57 PM
files are the segments, those are ~1.3 GiB.
on, and this was really hurting the performance (even with the field being part of the index). Once I removed that, the problem went away. I'm assuming that if I made an inverted index for
it would similarly be very fast. Overall, I was able to validate that Pinot does solve the query performance problem (frankly, better than any technology I have seen thus far), provided that your access pattern is rigid and you precompute all the indexes in advance. I still had issues where any query that didn't hit an index would not return and seemingly did not utilize the cluster well, but for this specific use case with established query patterns, Pinot could be a good fit. Thanks for the help!