#general

Aaron Wishnick

05/14/2021, 5:28 PM
I got some data ingested and am using a star-tree index, and I'm running a query like `select foo, percentiletdigest(bar, 0.5) from mytable group by foo`. I've got `foo` in my `dimensionsSplitOrder`, and I've got `PERCENTILE_TDIGEST__bar` as well as `AVG__bar` in my `functionColumnPairs`. My query takes about 700 ms, but if I switch it to `avg(bar)` it takes 15 ms. Is it expected that the t-digest would be that much slower? Anything I can do to speed it up?

Xiang Fu

05/14/2021, 5:31 PM
@User does pinot support percentile tdigest in startree?
in response stats, do you see same number of docs scanned for both queries?

Jackie

05/14/2021, 5:36 PM
Yes, startree supports TDigest. See https://docs.pinot.apache.org/basics/indexing/star-tree-index for more details
Is the query constantly taking 700ms?

Aaron Wishnick

05/14/2021, 5:42 PM
For avg and percentiletdigest, numDocsScanned is 969792.
Yeah, consistently in that range. It just took 1057 ms when I ran it

Mayank

05/14/2021, 7:35 PM
Yeah tdigest aggregation over 1M docs might take that long

Aaron Wishnick

05/14/2021, 7:36 PM
What does `numDocsScanned` mean in the context of a star-tree index?

Mayank

05/14/2021, 7:36 PM
Do you have query latency with just tdigest?

Aaron Wishnick

05/14/2021, 7:36 PM
What do you mean?

Mayank

05/14/2021, 7:37 PM
Query with percentile tdigest but without avg

Aaron Wishnick

05/14/2021, 7:37 PM
Oh sorry, that's what I meant

Mayank

05/14/2021, 7:37 PM
Oh ok

Aaron Wishnick

05/14/2021, 7:37 PM
`select foo, percentiletdigest(bar, 0.5) from mytable group by foo` is slow, `select foo, avg(bar) from mytable group by foo` is fast

Mayank

05/14/2021, 7:37 PM
Docs scanned should mean the same
Split order helps with filtering
@User does it help with group by or just filtering?

Aaron Wishnick

05/14/2021, 7:40 PM
If I have 969792 numDocsScanned and 8950109972 totalDocs, what does numDocsScanned mean? Is that the number of star tree nodes or something?

Jackie

05/14/2021, 7:45 PM
@User Mostly just filtering
@User Do you need 0.5 percentile or 50 percentile? The aggregation cost of `percentiletdigest` is expected to be much higher than `avg`
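
To illustrate the 0.5 versus 50 question: assuming the percentile argument is read on a 0 to 100 scale (so the median would be written `percentiletdigest(bar, 50)`), the two values pick very different points of the distribution. The sketch below uses an exact nearest-rank percentile on made-up data, not Pinot's t-digest implementation; `nearest_rank_percentile` and `bar_values` are hypothetical names.

```python
import math


def nearest_rank_percentile(values, percentile):
    """Exact nearest-rank percentile; `percentile` is on a 0-100 scale."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 <= percentile <= 100:
        raise ValueError("percentile must be between 0 and 100")
    ordered = sorted(values)
    # Rank is 1-based; clamp to 1 so that percentile 0 returns the minimum.
    rank = max(1, math.ceil(percentile * len(ordered) / 100))
    return ordered[rank - 1]


# Made-up data standing in for the `bar` column: the integers 1..1000.
bar_values = list(range(1, 1001))

p_half = nearest_rank_percentile(bar_values, 0.5)   # the 0.5th percentile
p_fifty = nearest_rank_percentile(bar_values, 50)   # the median

print("0.5th percentile:", p_half)
print("50th percentile:", p_fifty)
```

On this data the 0.5th percentile sits at the very bottom of the range while the 50th sits in the middle, so the choice matters for correctness even though it does not change the aggregation cost being discussed.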

Aaron Wishnick

05/14/2021, 8:04 PM
Eh I don't actually care about which percentile just yet -- just the performance
Is there anything I can do to speed it up? A lot of my users here prefer quantiles, I think performance there will really matter
The avg performance is... awesome

Mayank

05/14/2021, 8:05 PM
Your query does not have filters
Will it be the case always?

Aaron Wishnick

05/14/2021, 8:06 PM
Could be
Right now I only have a small subset of the data, but yeah people might be filtering by date at the very least
Do you expect filters to help a lot?

Mayank

05/14/2021, 8:06 PM
It will cut down numDocsScanned, right?

Aaron Wishnick

05/14/2021, 8:07 PM
Right
I'd expect people to be scanning a similar number of documents if not an order of magnitude more

Mayank

05/14/2021, 8:08 PM
@User Any ideas on using pre-aggregates within star tree here?
Also, @User, will you have the same cluster size in production as you have right now? If you have more servers, you'll get better perf

Jackie

05/14/2021, 8:11 PM
If `foo` is the first dimension in the split order, then it will always use the pre-aggregate doc
@User What's the cardinality of `foo`? How many segments do you have right now?

Aaron Wishnick

05/14/2021, 8:22 PM
Foo's cardinality is about 6
462 segments
5 servers
Foo is third in dimensionsSplitOrder, there are 7 fields total in there
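
For scale, the figures quoted in this thread can be put side by side. This is only arithmetic on the numbers above; it assumes the scanned docs are spread evenly across segments and servers, which the thread does not confirm.

```python
# Figures quoted in the thread.
num_docs_scanned = 969_792
total_docs = 8_950_109_972
num_segments = 462
num_servers = 5

# Share of the table that the query actually touches.
scanned_fraction = num_docs_scanned / total_docs

# Averages under the (unconfirmed) assumption of an even spread.
docs_per_segment = num_docs_scanned / num_segments
docs_per_server = num_docs_scanned / num_servers

print(f"fraction of total docs scanned: {scanned_fraction:.6%}")
print(f"docs scanned per segment (average): {docs_per_segment:.0f}")
print(f"docs scanned per server (if evenly spread): {docs_per_server:.0f}")
```

With `foo` having a cardinality of about 6, an average of roughly two thousand scanned docs per segment is far more than one doc per group, which is the gap a smaller leaf size would narrow.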

Jackie

05/14/2021, 9:05 PM
In that case, to further optimize performance, you can reduce the `maxLeafRecords` threshold, though this will increase the size of the star-tree.

Mayank

05/14/2021, 9:06 PM
Just to call out, a lot of the latency inherently comes from the TDigest library.
It is pretty good at providing accuracy in limited storage, but there's a latency cost.

Aaron Wishnick

05/14/2021, 9:09 PM
Is q-digest any better? My understanding was that t-digest is faster and more accurate
Do you have any approximate guidelines around how much faster performance will be and how much more space the star tree will take up as maxLeafRecords is decreased?

Mayank

05/14/2021, 9:14 PM
Yes, t-digest is definitely better than others. But it may not give you 10ms latency if you are aggregating 1M records.
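
A rough way to picture the cost gap, as a toy model only: an average can be merged from a (sum, count) pair with two additions, while a quantile sketch has to combine and re-compress a list of centroids for every merge. The code below is NOT the real t-digest algorithm or Pinot's code; `merge_avg`, `merge_digest`, and `max_centroids` are hypothetical names used for illustration.

```python
def merge_avg(left, right):
    """AVG state is a (sum, count) pair; merging is two additions."""
    return (left[0] + right[0], left[1] + right[1])


def merge_digest(left, right, max_centroids=50):
    """Toy digest merge (not the real t-digest algorithm).

    Each state is a list of (mean, weight) centroids. Merging sorts all
    centroids, then repeatedly fuses the closest adjacent pair until the
    list fits within `max_centroids`.
    """
    centroids = sorted(left + right)
    while len(centroids) > max_centroids:
        # Find the adjacent pair whose means are closest together.
        i = min(
            range(len(centroids) - 1),
            key=lambda k: centroids[k + 1][0] - centroids[k][0],
        )
        (m1, w1), (m2, w2) = centroids[i], centroids[i + 1]
        # Fuse the pair into one centroid with the weighted mean.
        fused = ((m1 * w1 + m2 * w2) / (w1 + w2), w1 + w2)
        centroids[i:i + 2] = [fused]
    return centroids


# Made-up partial states for two pre-aggregated records.
avg_state = merge_avg((10.0, 4), (20.0, 6))

left_digest = [(float(i), 1) for i in range(40)]
right_digest = [(i + 0.5, 1) for i in range(40)]
digest_state = merge_digest(left_digest, right_digest, max_centroids=50)

print("merged avg state:", avg_state)
print("merged digest size:", len(digest_state))
```

Repeating the digest merge across on the order of a million records is where the time goes in this picture, whereas the average stays constant-cost per record.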

Aaron Wishnick

05/14/2021, 9:16 PM
How can I get to, say, 200ms?

Mayank

05/14/2021, 9:17 PM
Tuning star tree (Jackie?), index size, server cores/jvm/params, etc

Jackie

05/14/2021, 9:18 PM
For star-tree, you can trade extra space for performance by reducing `maxLeafRecords`.
Reducing that to 1 will give you fully pre-cubed data
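
As a minimal sketch of what that setting might look like, the snippet below builds a star-tree config fragment and prints it as JSON. It assumes the config lives under `starTreeIndexConfigs` in the table index config, as described on the star-tree index page linked earlier in the thread. The dimension names other than `foo` are hypothetical placeholders, since the thread only says that `foo` is third of seven.

```python
import json

# Hypothetical dimension names: the thread only says `foo` is third of seven.
dimensions_split_order = [
    "dim_a", "dim_b", "foo", "dim_d", "dim_e", "dim_f", "dim_g",
]

star_tree_index_config = {
    "dimensionsSplitOrder": dimensions_split_order,
    "skipStarNodeCreationForDimensions": [],
    # The two pre-aggregations mentioned at the start of the thread.
    "functionColumnPairs": ["PERCENTILE_TDIGEST__bar", "AVG__bar"],
    # 1 means fully pre-cubed data: the largest tree and the least work per query.
    "maxLeafRecords": 1,
}

table_index_config_fragment = {"starTreeIndexConfigs": [star_tree_index_config]}

print(json.dumps(table_index_config_fragment, indent=2))
```

Values between 1 and the current threshold are a middle ground: each step down means a bigger star-tree and fewer docs left to aggregate at query time.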