Hi guys, I'm looking into adopting Pinot in our or...
# general
w
Hi guys, I'm looking into adopting Pinot in our organisation and it looks like good fit! The problem I'm trying to solve is to move away from using BigQuery for daily Superset dashboards and make use of Pinot in user-facing apps. I have pre-cubed data coming from Spark/Snappy data in Kafka and want to use it as source in Pinot. The only problem is the data is "append-only" (comes from Debezium), so we have all create and update records in one stream (let's say user bet's on sport, his ticket is created, has a state "accepted", then his ticket changes the state to winning or non-winning and we have another record in Kafka). In BQ we use
row_number over (partition by ticket_number order by source.lsn DESC)
which numbers rows with the 1 as the newest and then we look for rows numbers = 1 in sub-query (BQ docs). How would you solve it in Pinot, I didn't find windowing/analytic functions. Thanks in advance!
k
Pinot does not support Window functions. You will have to use Presto Pinot connector to achieve this functionality.
w
thanks a lot
we'll give it a try
k
If the queries per second is low, this might be ok. If not, this will be inefficient in terms of cost since it has to scan a lot of data.
w
yeah, it will be rather couple of queries per minute, not continous
k
Window function are good from user point of view but costly from a database point of view.
w
is there another way of achieving that?
k
What’s the exact query
w
without window function
sec.
Copy code
SELECT
  SUM(total_bet) AS total_stake_today,
  COUNT(*) AS number_of_tickets
FROM (
  SELECT
    id_ticket_state,
    total_bet
  FROM (
    SELECT
      t.after.id_ticket_state,
      t.after.ticket_total_bet AS total_bet,
      ROW_NUMBER() OVER (PARTITION BY t.after.ticket_number ORDER BY t.after.id_ticket_phase, t.source.lsn DESC) AS row_no
    FROM
      `dwh.tickets` t
    WHERE
      DATE(t.after.ticket_time_bet) = CURRENT_DATE()
      AND CAST(t._PARTITIONTIME AS DATE) >= DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
   )
  WHERE row_no = 1
)
WHERE
  id_ticket_state NOT IN (1,4,16)
I've simplified it a bit, but should get a general idea
you can ignore partitiontime condition, as it applies to BQ only
k
whats the cardinality of id_ticket_state
w
~30 distinct values
k
Go ahead and try Pinot with Presto connector - https://docs.pinot.apache.org/integrations/presto
❤️ 1