# getting-started

James Mnatzaganian

02/02/2022, 9:00 PM
👋 Hi all, I have a use case that I wanted to see if Pinot would potentially be a good fit for. Note that I haven't yet used Pinot, I'm merely testing the waters before I dive in deep 🙂

Context:
• Very high cardinality:
  ◦ 100s of millions / hour
  ◦ Billions / day
  ◦ Expected to be 10s to 100s of billions / month
  ◦ The cardinality eventually becomes asymptotic
  ◦ These big numbers are the "simple" case, and it's expected that a single field could be one or two orders of magnitude larger than this
• Low dimensionality: <10 fields
• The primary query pattern is as follows:
  ◦ Perform aggregations (count, count distinct, sum, etc.) across one or more fields
  ◦ Exact-match filtering on aggregated fields, e.g. `where foo = 'something'`
• The secondary query pattern is the same as the primary, but with inexact matching, which could potentially include regex (see the sketch just below)
• Data will be treated as immutable (with exceptions for GDPR data deletions)
• Data will be bulk loaded and already pre-aggregated

Questions:
1. Can Pinot efficiently handle data with cardinalities of this magnitude?
2. Are sub-second response times feasible for exact matching?
3. Can Pinot efficiently handle aggregates across multiple fields with high cardinality, or is it better to split them into smaller subsets?
4. Any gut-feeling estimate as to how large of a cluster I would need? Even if Pinot can work, the next question is how much it will cost 🙂
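For illustration, the secondary (inexact match) pattern might look like the following in Pinot SQL, using `REGEXP_LIKE` as the regex predicate; the table and column names are placeholders, not taken from the thread:

```sql
-- Sketch of the secondary query pattern: the same aggregation shape as the
-- primary pattern, but with a regex predicate instead of an equality match.
-- "events", "foo", and "num_events" are illustrative names.
SELECT foo, SUM(num_events) AS num_events
FROM events
WHERE REGEXP_LIKE(foo, '^something.*')
GROUP BY foo
```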
My current approach is to create pre-computed facets for each required access pattern over the various time frames. I can compute these with Spark, and Apache Hudi makes it possible to maintain the equivalent of a materialized view. Trino and Alluxio enable single-digit-second latencies. I'm interested in Pinot, since it would dramatically simplify things and be more flexible. My hope is that it can also address the sub-second aspect. Even if it's still expected to be single-digit seconds, the flexibility of Pinot is appealing.

Mayank

02/02/2022, 9:12 PM
@User Thanks for posting these details. Pinot is definitely the right choice for your case:
1. Yes, high cardinality is not an issue.
2. Yes, with indexing you can do a lot better than sub-second for matching equality predicates.
3. Is it just aggregates, or group-by too? Both are fine; will need more concrete info to suggest optimizations.
4. For cluster size, will need more info like data size, read/write QPS, latency requirements, average query selectivity, etc.

James Mnatzaganian

02/02/2022, 9:31 PM
Thanks! (3) Both; here are some examples:

A: Simple summation
```sql
-- lowest-level granularity
SELECT
   a
   , b
   , c
   , SUM(num_events) AS num_events
FROM ...
WHERE event_date BETWEEN ... AND ...
   AND a = ... AND b = ... AND c = ...
GROUP BY 1, 2, 3
```
B: Count distinct
```sql
SELECT
  a
  , b
  , COUNT(DISTINCT c) AS num_uq_c
  , SUM(num_events) AS num_events
FROM ...
WHERE event_date BETWEEN ... AND ...
   AND a = ... AND b = ... 
GROUP BY 1, 2
```
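For comparison, an approximate version of B using Pinot's `DISTINCTCOUNTHLL` aggregate (which comes up later in this thread) might look like the following sketch, with the same placeholder names as above:

```sql
-- Sketch: approximate distinct count via HyperLogLog instead of an exact
-- COUNT(DISTINCT ...); trades a small error for much lower cost at high cardinality.
SELECT
  a
  , b
  , DISTINCTCOUNTHLL(c) AS approx_uq_c
  , SUM(num_events) AS num_events
FROM ...
WHERE event_date BETWEEN ... AND ...
   AND a = ... AND b = ...
GROUP BY 1, 2
```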
C: Complex rollup, producing a summarized view with multiple levels of aggregation
```sql
WITH data AS (
    -- Reduce dataset
    SELECT
        a
        , b
        , c
        , SUM(num_events) AS num_events
    FROM ...
    WHERE event_date BETWEEN ... AND ...
        AND c = ...
    GROUP BY 1, 2, 3
), low_level AS (
    -- Find specific low-level result
    SELECT
        a
        , b
        , c
        , num_events
    FROM data
    WHERE a = ... AND b = ... AND c = ...
    LIMIT 1
), mid_level AS (
    -- Do the next-level rollup
    SELECT
        a
        , c
        , COUNT(1) AS num_uq_b
        , SUM(num_events) AS num_events_r1
    FROM data
    WHERE a = ...
    GROUP BY 1, 2
), top_level AS (
    -- Top-level rollup
    SELECT
        c
        , COUNT(DISTINCT a) AS num_uq_a
        , SUM(num_events) AS num_tot_events
        , COUNT(1) AS num_uq_ab
    FROM data
    GROUP BY 1
)
SELECT
    ll.*
    , ml.*
    , tl.*
FROM low_level ll
JOIN mid_level ml ON ll.a = ml.a AND ll.c = ml.c
JOIN top_level tl ON ml.c = tl.c
```
(4) The expectation is to load data once per hour. So let's say 100M records/hr (in reality there'd be multiple loads, but I can do the math from here). Data size is small, let's say 1KiB/record, resulting in <100 GiB/hr. Writes will only be on the hourly bulk-load cadence. Reads are low volume: a few thousand per day, with most (let's say ~90%) doing exact searches.

edit: forgot to add the `group by` in some of the queries 🤦‍♂️

Mayank

02/02/2022, 9:35 PM
Pinot doesn't do complex joins. For that, one option is to use the Presto or Trino connectors for Pinot.
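For illustration, a query like example C could be issued through Trino, which performs the joins itself and pushes filters and aggregations down to Pinot where it can. This is a hedged sketch; the catalog name `pinot`, the `default` schema, and the table name `events` are assumptions, not from the thread:

```sql
-- Sketch: joining two Pinot-backed relations in Trino.
-- Trino addresses Pinot tables as <catalog>.<schema>.<table>; the Pinot
-- connector conventionally exposes tables under the "default" schema.
SELECT ll.a, ll.c, ll.num_events, tl.num_uq_a
FROM (
    SELECT a, c, SUM(num_events) AS num_events
    FROM pinot.default.events
    WHERE a = 'x' AND c = 'y'
    GROUP BY a, c
) ll
JOIN (
    SELECT c, COUNT(DISTINCT a) AS num_uq_a
    FROM pinot.default.events
    GROUP BY c
) tl ON ll.c = tl.c
```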
Looking at the other queries.
Following up on DM. Will update here.

James Mnatzaganian

02/15/2022, 10:33 PM
I wanted to follow up on this and post some early results. I loaded in a day's worth of data, which was about 5.5B documents (this is already pre-aggregated to the hour level). I'm starting simple and using bloom indexes, since I'm primarily doing exact string matching. For hardware, I'm using a single m5.16xl (64 cores, 256 GiB RAM) with a 2 TiB attached EBS volume. I'm running one server, one broker, and one controller. I have 123 segments, each with around 50M records. The data in S3 is ~42 GB and on the box is ~165 GB.

The results are a bit mixed: a simple `SELECT DISTINCTCOUNTHLL(...) FROM ...` fails to run; it seems that a single core gets pegged and a 427 error code is returned. A more representative query is much better: it takes ~3s for a `DISTINCTCOUNT` and ~300ms for a `DISTINCTCOUNTHLL` where the `where` clause matches ~3M documents (again, exact string matching on a column with a bloom index). The box seems to be asleep. I'm not sure why the first query fails, or if there are settings to tweak to make things faster, but overall the results are promising. It looks like the bounding isn't by cardinality but by the number of documents matching the predicate.

Mayank

02/16/2022, 12:27 AM
@User for exact string match, you want to use an inverted index, not a bloom index.
Could you share the query (specifically the where clause) and the broker response metadata? Looking at the metadata, we can suggest some optimizations.
And yes, the bounding is essentially based on two components: a) the number of entries to be scanned to match the predicates, and b) the total number of docs matched, and the aggregation/grouping on them. a) can be reduced by using the right index. b) can be reduced by using advanced indexing such as the star-tree index, especially when queries select a really large number of rows.
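As a quick way to gauge component b) for a given predicate before tuning, one option is a plain count of matched docs; a minimal sketch with placeholder names:

```sql
-- Sketch: measure how many docs a predicate matches (component b above).
-- Table and column names are placeholders, as in the earlier examples.
SELECT COUNT(*) AS matched_docs
FROM events
WHERE a = 'something'
  AND event_date BETWEEN '2022-02-01' AND '2022-02-02'
```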

James Mnatzaganian

02/16/2022, 2:30 PM
Thanks @User, I'll try inverted and star-tree indexes and see how it goes. I can share the query; I'm not sure how to get the metadata (is that available in the query window within Pinot?). Right now I'm primarily interested in forecasting how Pinot will scale. Resource-wise, it seems to be doing great. What I'm trying to figure out is whether there's a way to introduce more parallelism (I think server + broker is analogous to scatter + gather?). I just have one server, but my box is asleep. My understanding was that a single thread processes a segment, and I assumed that in this scenario multiple servers vs. one shouldn't matter, since the server could spin up as many threads as needed. An upper bound on matched predicates could be 10M/hour, so for one month (the needed time range for this use case) I'm looking at potentially billions of matches. Assuming I can control the degree of parallelism (e.g. one server per day or something), I should be able to get near-constant-time performance. Does this make sense? It's possible I'm misunderstanding Pinot.

Mayank

02/16/2022, 2:39 PM
For query execution, the parallelism on the server side is based on the number of segments. The number of threads is fixed to 2x the number of cores by default. When you say scale, do you want to optimize for latency or throughput (what are the rough requirements for those)? For really low selectivity, star-tree will definitely help.
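For context, a star-tree index pre-aggregates along configured dimension combinations, so a query whose filter and group-by columns are all star-tree dimensions, and whose aggregate is a configured function-column pair, can be answered from pre-aggregated nodes instead of scanning every matched row. A hedged sketch of such a query shape, with placeholder names:

```sql
-- Sketch: a shape a star-tree index can serve, assuming a and c are star-tree
-- dimensions and SUM(num_events) is a configured function-column pair.
-- All names are illustrative.
SELECT a, SUM(num_events) AS num_events
FROM events
WHERE c = 'something'
GROUP BY a
```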

James Mnatzaganian

02/16/2022, 2:57 PM
Latency is the primary driver. I can tolerate <5s (for the full month), but was hoping to make it faster for a smooth UX (<1s). Throughput will be low: at most <100 simultaneous queries. So from what you said, it sounds like one server should be fine. I think the pain point here is that all segments have to be read. Looking at this doc, it seems like throughput can be addressed more easily. Aside from that, star-tree seems most promising. Partitioning could work, but then I'd only be able to optimize for some of the query patterns, so instead I'd probably opt to pre-aggregate further. Another thought I had was to reduce the number of records in a segment to push up the parallelism. (I'm looking at this.) Assuming the `psf` files are the segments, those are ~1.3 GiB.
I was able to get this to work with a star-tree index. It took a few iterations, but once I had it hitting the index properly, I was able to get 10ms response times. One of my issues was that I had a composite key, so I made a new field representing the composite, and I was then able to get that into the star-tree index. The other challenge was that it seems every field and operation in the query needs to be in the index. I have a low-cardinality field that I was doing a `DISTINCTCOUNT` on, and this was really hurting performance (even with the field being part of the index). Once I removed that, the problem went away. I'm assuming that if I did the same for `DISTINCTCOUNTHLL` it would similarly be very fast. Overall, I was able to validate that Pinot does solve the query performance problem (to be frank, better than any technology I have seen thus far), provided that your access pattern is rigid and you precompute all the indexes in advance. I still had issues with any query that didn't hit the index not returning and seemingly not utilizing the cluster well, but for this specific use case with established query patterns, Pinot could be a good fit. Thanks for the help!
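For illustration, the composite-key workaround described above might look like the following sketch; the column `ab` (a composite of `a` and `b` precomputed at ingestion time) and the other names are assumptions, not taken from the thread:

```sql
-- Sketch: filter on a single precomputed composite column so the predicate
-- lands on one star-tree dimension instead of a multi-column combination.
-- "ab" would be derived at ingestion time, e.g. as a concatenation of a and b.
SELECT DISTINCTCOUNTHLL(c) AS approx_uq_c
FROM events
WHERE ab = 'a_value|b_value'
  AND event_date BETWEEN '2022-02-01' AND '2022-02-28'
```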