# general
a
I got some data ingested and am using a star-tree index, and I'm running a query like
select foo, percentiletdigest(bar, 0.5) from mytable group by foo
I've got foo in my dimensionsSplitOrder, and I've got PERCENTILE_TDIGEST__bar as well as AVG__bar in my functionColumnPairs. My query takes about 700 ms, but if I switch it to avg(bar) it takes 15 ms. Is it expected that the t-digest would be that much slower? Is there anything I can do to speed it up?
x
@User does Pinot support percentile t-digest in star-tree?
In the response stats, do you see the same number of docs scanned for both queries?
j
Yes, star-tree supports TDigest. See https://docs.pinot.apache.org/basics/indexing/star-tree-index for more details
Is the query consistently taking 700 ms?
a
For avg and percentiletdigest, numDocsScanned is 969792.
Yeah, consistently in that range. It just took 1057 ms when I ran it
m
Yeah tdigest aggregation over 1M docs might take that long
a
What does numDocsScanned mean in the context of a star-tree index?
m
Do you have query latency with just tdigest?
a
What do you mean?
m
Query with percentile tdigest but without avg
a
Oh sorry, that's what I meant
m
Oh ok
a
select foo, percentiletdigest(bar, 0.5) from mytable group by foo is slow,
select foo, avg(bar) from mytable group by foo is fast
m
Docs scanned should mean the same
Split order helps with filtering
@User does it help with group by or just filtering?
a
If I have 969792 numDocsScanned and 8950109972 totalDocs, what does numDocsScanned mean? Is that the number of star tree nodes or something?
j
@User Most of the time, just filtering
@User Do you need the 0.5 percentile or the 50th percentile? The aggregation cost of percentiletdigest is expected to be much higher than that of avg
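To illustrate the 0.5 versus 50 question: Pinot's percentile functions take the percentile on a 0 to 100 scale, so 0.5 asks for a value near the minimum, not the median. A minimal sketch in plain Python, using a hypothetical helper `nearest_rank_percentile` and made-up data (this is not how Pinot or t-digest computes the result, only what the argument means):

```python
import math


def nearest_rank_percentile(values, percentile):
    """Return the value at `percentile` (0-100 scale) using the nearest-rank method."""
    ordered = sorted(values)
    # Rank is 1-based; clamp to 1 so very small percentiles pick the minimum.
    rank = max(1, math.ceil(percentile * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical stand-in for the `bar` column: the integers 1..1000.
bar_values = list(range(1, 1001))

p_half = nearest_rank_percentile(bar_values, 0.5)   # 0.5th percentile, close to the minimum
p_fifty = nearest_rank_percentile(bar_values, 50)   # 50th percentile, the median
```

With this data, `p_half` lands at the low end of the range while `p_fifty` lands in the middle, so a median query would pass 50 rather than 0.5.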
a
Eh I don't actually care about which percentile just yet -- just the performance
Is there anything I can do to speed it up? A lot of my users here prefer quantiles, I think performance there will really matter
The avg performance is... awesome
m
Your query does not have filters
Will it be the case always?
a
Could be
Right now I only have a small subset of the data, but yeah people might be filtering by date at the very least
Do you expect filters to help a lot?
m
It will cut down numDocsScanned right
a
Right
I'd expect people to be scanning a similar number of documents if not an order of magnitude more
m
@User Any ideas on using pre-aggregates within star-tree here?
Also, @User in production, will you have the same cluster size as right now? With more servers, you'll get better perf
j
If foo is the first dimension in the split order, then it will always use the pre-aggregated doc
@User What's the cardinality of foo? How many segments do you have right now?
a
Foo's cardinality is about 6
462 segments
5 servers
Foo is third in dimensionsSplitOrder, there are 7 fields total in there
j
In that case, to further optimize performance, you can reduce the maxLeafRecords threshold, although this will increase the size of the star-tree
m
Just to call out, a lot of the latency inherently comes from the TDigest library.
It is pretty good in providing accuracy in limited storage, but there's a latency cost.
a
Is q-digest any better? My understanding was that t-digest is faster and more accurate
Do you have any approximate guidelines around how much faster performance will be and how much more space the star tree will take up as maxLeafRecords is decreased?
m
Yes, t-digest is definitely better than others. But it may not give you 10ms latency if you are aggregating 1M records.
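A rough way to see why merging pre-aggregated digests costs more than merging averages: an AVG pre-aggregate is just a (sum, count) pair, while a quantile sketch carries many centroids that have to be combined and compressed on every merge. The sketch below is a toy stand-in, not the TDigest library's algorithm; `merge_avg`, `merge_sketch`, and `max_centroids` are hypothetical names used only for illustration.

```python
def merge_avg(a, b):
    """Merge two AVG pre-aggregates, each a (sum, count) pair: two additions."""
    return (a[0] + b[0], a[1] + b[1])


def merge_sketch(a, b, max_centroids=100):
    """Merge two toy quantile sketches, each a list of (mean, weight) centroids.

    Assumption: this is a simplified stand-in for a digest merge. It sorts all
    centroids, then repeatedly fuses the lightest adjacent pair until the
    sketch fits in `max_centroids`.
    """
    merged = sorted(a + b)
    while len(merged) > max_centroids:
        # Find the adjacent pair with the smallest combined weight.
        i = min(range(len(merged) - 1),
                key=lambda k: merged[k][1] + merged[k + 1][1])
        (m1, w1), (m2, w2) = merged[i], merged[i + 1]
        # Replace the pair with one centroid at their weighted mean.
        merged[i:i + 2] = [((m1 * w1 + m2 * w2) / (w1 + w2), w1 + w2)]
    return merged


avg_result = merge_avg((10.0, 2), (20.0, 3))
sketch_result = merge_sketch([(1.0, 1), (3.0, 1)], [(2.0, 1)], max_centroids=2)
```

The work per merge in the sketch case grows with the number of centroids, which is in line with the gap between the avg and percentiletdigest latencies reported above, even when both queries touch the same number of pre-aggregated docs.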
a
How can I get to, say, 200ms?
m
Tuning star tree (Jackie?), index size, server cores/jvm/params, etc
j
For star-tree, you can trade extra space for performance by reducing maxLeafRecords
Reducing that to 1 will give you fully pre-cubed data
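As a sketch of what that change might look like, the snippet below builds a star-tree index config with the fields mentioned in this thread and serializes it to JSON. The dimension names `dimA` and `dimB` are hypothetical placeholders (the real table has 7 dimensions with foo third), the helper `star_tree_config` is made up for illustration, and placing the entry under `tableIndexConfig.starTreeIndexConfigs` is an assumption that should be checked against the star-tree docs linked earlier.

```python
import json


def star_tree_config(max_leaf_records):
    """Build one star-tree index config entry (hypothetical helper)."""
    return {
        # Placeholder dimensions; only `foo` comes from this thread.
        "dimensionsSplitOrder": ["dimA", "dimB", "foo"],
        "skipStarNodeCreationForDimensions": [],
        # Pre-aggregated function/column pairs named in the thread.
        "functionColumnPairs": ["PERCENTILE_TDIGEST__bar", "AVG__bar"],
        # Lower values mean more pre-aggregation and a larger index;
        # 1 corresponds to fully pre-cubed data.
        "maxLeafRecords": max_leaf_records,
    }


# Assumed placement inside the table config.
config = {"tableIndexConfig": {"starTreeIndexConfigs": [star_tree_config(1)]}}
config_json = json.dumps(config, indent=2)
```

Since the space cost grows as maxLeafRecords shrinks, it is probably worth trying a few intermediate values on a subset of segments and comparing index size against query latency before going all the way to 1.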