Hey folks, does anybody know how to forcefully pus...
# troubleshooting
d
Hey folks, does anybody know how to forcefully push down Pinot queries to the broker when using Trino? I really need a solution to be able to do multi-table queries and have been trying to use Trino for that, but I'm absolutely disappointed at how hard it's been to make this work - if I just do normal queries in Trino it tries to query ALL of the segments, completely bypassing any partitioning (including the time column), which is insane.
m
@User ^^
Also, could you elaborate a bit on what you are trying to query?
x
what’s your join query look like?
d
Let me paste the query here:
Copy code
select distinct weight from weights
  inner join brands_metrics on weights.pmxid = brands_metrics.pmxid 
  inner join filters on weights.pmxid = filters.pmxid
where
  brand_id = 1000226 and
  filters.name = 'gender' and
  contains(filters."values", 2) and
  weights.date > 20130501 and
  weights.date < 20130515 and
  brands_metrics.date > 20130501 and
  brands_metrics.date < 20130515 and
  filters.date > 20130501 and
  filters.date < 20130515
and here's some explanation about what's going on: • I only want data between 2013-05-01 and 2013-05-15 • I want to join data from
weights
,
brands_metrics
and
filters
where they match on the
caseid
column (which is common to all the 3 tables) • I expected Trino to do the querying in the Pinot Broker, but it seems like it first gets all of the segments existing in those tables and does the queries inside each of them, so for example doing each separate query takes only a few ms in Pinot itself, but running the full query in Trino takes 6s which is unacceptable
Looking at the Trino logs, I see that it tries to read all of the Pinot segments, no matter the time boundaries for it
m
Which part would you like to push down to Pinot?
x
hmm, I think in both trino/presto the filters will be pushed down
d
I'd like to push down everything except the joins, so that I can make use of Pinot's partitioning of data (per time and other configured columns I have)
x
although it queries all the segments, but the push down query won’t return if there is no match
true, pinot partitioning metadata hasn’t been utilized so far for presto/trino
d
@User I tried, yes, but Trino is not happy about my subquery statements and is unable to parse them 😞
Even if I run a simple query like:
Copy code
select weight, caseid as weights_caseid from weights where "date" > 20130501 and "date" < 20130515 limit 100
if I run it in Pinot, it runs in 5ms, and in Trino it's taking more than 1s. And then looking at the logs, I see this for example:
Copy code
trino-server-xavier      | 2022-04-22T21:44:38.522Z	INFO	20220422_214437_00113_jhvv6.1.0.0-105-102	io.trino.plugin.pinot.PinotSegmentPageSource	Query 'SELECT "caseid", "weight" FROM weights_OFFLINE  WHERE (("date" > '20130501' AND "date" < '20130515')) LIMIT 100' on host 'Optional[Server_172.30.0.5_8098]' for segment splits: [weights_OFFLINE_20131224_20131224_201]
meaning Trino is trying to query the
weights_OFFLINE_20131224_20131224_201
segment, which is totally out of the period I want to get data from. Pinot matches 5 segments but Trino tries to run it over all the 209 segments I have.
x
yes, that’s true, trino/presto is not that smart to prune at the segment level
The pruning logic needs to be added
1
how about query like
Copy code
SELECT * FROM pinot.default."select weight, caseid as weights_caseid from weights where \"date\" > 20130501 and \"date\" < 20130515 limit 100"
d
Got it. Thanks a lot, @User, I think this actually helps on my decision. I'll just drop Trino and try to work with the data myself doing the joins.
Let me try that
x
yeah, if the underlying queries can be executed from broker, then you can leverage trino for it
presto will query all segments
so if trino is not an option, you can opt out presto as well
d
That query doesn't work:
Copy code
trino.exceptions.TrinoUserError: TrinoUserError(type=USER_ERROR, name=SYNTAX_ERROR, message="line 1:94: mismatched input '" > 20130501 and "'. Expecting: '(', ',', 'CROSS', 'EXCEPT', 'FETCH', 'FULL', 'GROUP', 'HAVING', 'INNER', 'INTERSECT', 'JOIN', 'LEFT', 'LIMIT', 'MATCH_RECOGNIZE', 'NATURAL', 'OFFSET', 'ORDER', 'RIGHT', 'TABLESAMPLE', 'UNION', 'WHERE', 'WINDOW', <EOF>", query_id=20220422_215155_00117_jhvv6)
so yeah, I guess Trino or Presto are a no-go for me. 😕
x
hmm
is this syntax error ?
for reserved keyword?
d
It is, from the Trino perspective - I think because of the quotes around
date
x
what about
Copy code
SELECT * FROM pinot.default."select weight, caseid as weights_caseid from weights where date > 20130501 and date < 20130515 limit 100"
👀 1
yeah, those quotes are really annoying
😁 1
also if date is a keyword in pinot, then generated pinot query has to double quotes it to mention it’s an identifier
d
Copy code
trino.exceptions.TrinoQueryError: TrinoQueryError(type=INTERNAL_ERROR, name=GENERIC_INTERNAL_ERROR, message="Caught exception while parsing query: select weight, caseid as weights_caseid from weights where date > 20130501 and date < 20130515 limit 100", query_id=20220422_215444_00133_jhvv6)
probably because Pinot itself doesn't accept just
date
(since it's reserved)
Yeah, you read my mind 😄
x
right, I guess it’s unfair to ask to create a new column 😛
d
Hmmm... to avoid this quoting issue, you mean?
x
I feel we should ban create column names on reserved keyword soon for new tables lol
yeah
I was thinking your query might be runnable with
Copy code
select distinct weight from (SELECT * FROM pinot.default."SELECT weight, pmxid FROM weights WHERE date > 20130501 and date < 20130515 and brand_id = 1000226") weights
  inner join (SELECT * FROM pinot.default."SELECT weight, pmxid FROM weights WHERE date > 20130501 and date < 20130515 and brand_id = 1000226") brands_metrics on weights.pmxid = brands_metrics.pmxid 
  inner join (SELECT * FROM pinot.default."SELECT weight, pmxid FROM weights WHERE date > 20130501 and date < 20130515 and name = 'gender' and contains(filters."values", 2) ") filters on weights.pmxid = filters.pmxid
but I don’t know the quote parts
d
Yeah, that could be a good idea I guess 😄 No worries, let me actually try that. I was really rooting for Trino to work fine for me, so let's see if pushing down works but with non-reserved keywords.
x
kind of pushing down 3 queries to pinot and just use trino for joins on brokerResponses
1
d
Yeah, that's the last thing I tried by myself, but maybe it was failing because of the quotes. I'll try to prefix all columns with the table name, this should also make joins easier to read, and avoid any kind of quoting issue.
x
sure, or you can change filter to some other non-reserved keywords
if we can narrow down the problem to quotes, then it could be much simpler fixes
d
Yeah, just made a quick fix actually - to
date_
and
values_
. I'm too lazy 😄
Alright, dynamic tables do work now, albeit not performing as I expected. I get queries running ~10x slower than straight queries to Pinot, ~80ms in Trino compared to ~7ms in Pinot, which is not good, but at least it's not absurdly slow anymore. A full query with joins now takes ~270ms, which is far better than 6s, but still not performing as I would expect it to. I'll do some more experiments and check how fast I can get by just fetching those results into a Python script and joining the data myself.
@User in case you're interested ^
For reference, this is my full query now:
Copy code
select distinct weight from 
    pinot.default."select weight, caseid as weights_caseid from weights where date_ > 20150501 and date_ < 20150515" as weights
    inner join pinot.default."select brand_id, caseid as brands_metrics_caseid from brands_metrics where brand_id = 1000226 and date_ > 20150501 and date_ < 20150515 limit 100" as brands_metrics 
        on weights_caseid = brands_metrics_caseid 
    inner join pinot.default."select name, caseid as filters_caseid from filters where name = 'gender' and values_ = 2 and date_ > 20150501 and date_ < 20150515" as filters 
        on weights_caseid = filters_caseid
m
Thanks for the update 👍
e
Definitely interested - this likely due to a few things: • The connector parses the passthrough query and then reconstructs it again for type checking, double quoting identifiers (common pattern in trino connectors) • Queuing - this could add some small latency • Security - if you have it enabled in trino this will add some time also (for us it's 20-50ms)
We have a hacky workaround (base32 encoded passthrough query, no parsing just send to pinot) but trino will soon have the ability to do table functions which does something similar but in a more user friendly syntax.
d
Ah, got it. Thanks @User!
👍 1