Apache Pinot #pinot-perf-tuning

Leon Liu

06/30/2021, 12:43 PM

Hey, good morning. I read some articles about Pinot and feel it can be a great tool for our real-time analytics platform. We currently use Snowflake and Redshift. I tried it with a simple use case (63 million records with percentileest and avg aggregations) on a single EC2 instance, and the performance is amazing. I want to pursue it further and have a few questions about building star tree indexes for the aggregations. Mainly, we want to make sure that building the star tree indexes takes much less time than the full cubing. Hope you can help me out:

1. For our percentile aggregation, we only care about the values at 10, 25, 50, 75 and 90 percent. Is there any way to do the aggregation only for those percentiles?
2. How do I know if a star tree index has been built? On the UI “Reload Status” screen, I don’t see anything related to the star tree index.
3. Currently we do very intensive monthly cubing to support real-time analytics (percentile on 12 columns, avg on 12 columns, approx_cont_distinct on 5 columns). At the end of each month we batch-feed about 70 million records. Is it possible to build the star tree index in a couple of hours? If so, what are the recommended ways to speed up the index-building process?

Some context for our table:

1. 40 dimension columns, 1 time column and 15 metric columns
2. A monthly feed of about 70 million records
3. We need monthly, quarterly and yearly analytics

Thanks in advance.

Mayank

06/30/2021, 2:04 PM

Hello:

1. The TDigest-based percentile size remains the same regardless of which percentiles you want to query (unless there's a hidden feature that I am unaware of).
2. If you have access to the segment dir on the server, you can check the segment folder; there would be a star tree index in it. But if you file an issue, we can expose it in some fashion.
3. Build time depends on the data size and the configuration you specified, but it could be possible.
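For point 2, the check on the server can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: it assumes the star tree index is stored in files named `star_tree_index` and `star_tree_index_map` somewhere under the segment directory (the names used by the v3 segment layout, to the best of our knowledge), and the helper names are hypothetical. Verify the file names against the Pinot version in use.

```python
from pathlib import Path

# Assumed file names of the star tree index inside a segment directory.
# These are an assumption, not something stated in the thread.
STAR_TREE_FILES = ("star_tree_index", "star_tree_index_map")


def find_star_tree_files(segment_dir):
    """Return the star tree index files found under segment_dir, as relative paths."""
    root = Path(segment_dir)
    return sorted(
        str(path.relative_to(root))
        for path in root.rglob("*")
        if path.is_file() and path.name in STAR_TREE_FILES
    )


def has_star_tree(segment_dir):
    """True if at least one star tree index file exists under segment_dir."""
    return bool(find_star_tree_files(segment_dir))
```

Calling `has_star_tree("/path/to/segment")` on a server's local segment folder (the path here is a placeholder) gives a quick yes/no answer without going through the UI.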

Mayank

06/30/2021, 2:04 PM

What's your latency requirement? And have you checked whether the requirement can be met without a star tree index?

Ken Krugler

06/30/2021, 2:06 PM

For Q3 - You can use the Spark or Hadoop (MapReduce) job runner to build segments in parallel. If you configure your table to have indexes generated when the segment is being built, you avoid some potential CPU/memory bottlenecks when pushing these segments to the cluster. This all works well if your Pinot cluster has access to HDFS (or some other shared filesystem), which you can configure as your Pinot cluster’s deep store.

Mayank

06/30/2021, 2:16 PM

+1 to what @User said ^^

Leon Liu

06/30/2021, 2:17 PM

Sub-second is our requirement. We are also client-facing, with high concurrency.

Mayank

06/30/2021, 2:26 PM

what's the read qps?

Leon Liu

06/30/2021, 2:33 PM

200 qps is good enough for us

Leon Liu

06/30/2021, 2:34 PM

If we use a star tree index, the query performance will be better than what we are looking for. The only concern is how long it takes to build the index. Right now it takes 12 hours in AWS to do the full cubing with Spark.

Leon Liu

06/30/2021, 2:36 PM

We use a lot of percentile aggregations in our queries. I tried one without any index, and it returns in about 6s. I’m relatively new here, so I'm not sure if there is any other way to make it much faster and avoid heavy indexing.

Leon Liu

06/30/2021, 2:45 PM

@User our data is in AWS S3. For the Q3 suggestion, is there an example I can look at for reference? More detailed documentation would also help a lot.
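As a rough illustration of the Spark-based segment build suggested for Q3, the sketch below assembles a batch ingestion job spec as a Python dictionary. The field and class names follow the Pinot batch ingestion job spec as commonly documented, but they should be checked against the Pinot version in use. The bucket paths, table name, controller URI, region and the Parquet input format are all hypothetical placeholders, not details from this thread. The real job spec file is YAML; JSON is printed here only for inspection.

```python
import json

PLUGIN = "org.apache.pinot.plugin"


def spark_job_spec(input_uri, output_uri, table, controller_uri, region):
    """Build a Spark segment-generation job spec (all argument values are placeholders)."""
    return {
        "executionFrameworkSpec": {
            "name": "spark",
            "segmentGenerationJobRunnerClassName":
                f"{PLUGIN}.ingestion.batch.spark.SparkSegmentGenerationJobRunner",
            "segmentTarPushJobRunnerClassName":
                f"{PLUGIN}.ingestion.batch.spark.SparkSegmentTarPushJobRunner",
        },
        # Build segments (including configured indexes) and push them as tar files.
        "jobType": "SegmentCreationAndTarPush",
        "inputDirURI": input_uri,
        "includeFileNamePattern": "glob:**/*.parquet",  # assumes Parquet input
        "outputDirURI": output_uri,
        "pinotFSSpecs": [{
            "scheme": "s3",
            "className": f"{PLUGIN}.filesystem.S3PinotFS",
            "configs": {"region": region},
        }],
        "recordReaderSpec": {
            "dataFormat": "parquet",
            "className": f"{PLUGIN}.inputformat.parquet.ParquetRecordReader",
        },
        "tableSpec": {"tableName": table},
        "pinotClusterSpecs": [{"controllerURI": controller_uri}],
    }


spec = spark_job_spec(
    "s3://example-bucket/monthly-feed/",   # hypothetical input location
    "s3://example-bucket/pinot-segments/", # hypothetical output location
    "analytics",                           # hypothetical table name
    "http://localhost:9000",               # hypothetical controller URI
    "us-east-1",                           # hypothetical region
)
print(json.dumps(spec, indent=2))
```

Because the indexes declared in the table config are generated while each segment is built, the index-building cost is spread across the Spark executors rather than landing on the Pinot servers.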

Mayank

06/30/2021, 3:03 PM

How many docs did the 6s query scan? If too many (>100k), then star tree is the right index.
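To make the star tree option concrete, here is a minimal sketch of what one entry of `starTreeIndexConfigs` in the table config might look like for the workload described in this thread (percentile, avg and approximate distinct count). The column names are hypothetical, and `maxLeafRecords` is an illustrative value rather than a recommendation. As noted earlier in the thread, the TDigest size does not depend on which percentiles are queried, so a single `PERCENTILE_TDIGEST` pair per column can serve all five percentiles of interest. The query helper assumes the two-argument `PERCENTILETDIGEST(column, percentile)` form.

```python
import json


def star_tree_config(dimensions, percentile_cols, avg_cols, distinct_cols,
                     max_leaf_records=10000):
    """Build one starTreeIndexConfigs entry; all column names are caller-supplied."""
    pairs = (
        [f"PERCENTILE_TDIGEST__{c}" for c in percentile_cols]
        + [f"AVG__{c}" for c in avg_cols]
        + [f"DISTINCT_COUNT_HLL__{c}" for c in distinct_cols]
    )
    return {
        "dimensionsSplitOrder": list(dimensions),
        "skipStarNodeCreationForDimensions": [],
        "functionColumnPairs": pairs,
        "maxLeafRecords": max_leaf_records,  # illustrative value
    }


def percentile_query(table, column, percentiles=(10, 25, 50, 75, 90)):
    """Build a query that asks only for the percentiles of interest."""
    selects = ", ".join(
        f"PERCENTILETDIGEST({column}, {p}) AS p{p}" for p in percentiles
    )
    return f"SELECT {selects} FROM {table}"


# Hypothetical columns; the thread does not name the real ones.
config = star_tree_config(
    dimensions=["country", "device", "month"],
    percentile_cols=["latency_ms"],
    avg_cols=["latency_ms"],
    distinct_cols=["user_id"],
)
print(json.dumps({"starTreeIndexConfigs": [config]}, indent=2))
print(percentile_query("analytics", "latency_ms"))
```

Only dimensions that appear in filters or group-by clauses need to be in `dimensionsSplitOrder`; keeping that list shorter than the full 40 dimension columns is one way to limit build time.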

Leon Liu

06/30/2021, 3:26 PM

for the 6s query, it scans all of the docs i loaded (63 million)

Leon Liu

06/30/2021, 3:27 PM

if i load all of the data (36 months), the total docs will be above 1 billion
