Hi guys I m looking into adopting Pinot in our organisation Apache Pinot #general

Hi guys, I'm looking into adopting Pinot in our or...

Wojtek Sznapka

10/05/2020, 7:18 AM

Hi guys, I'm looking into adopting Pinot in our organisation and it looks like good fit! The problem I'm trying to solve is to move away from using BigQuery for daily Superset dashboards and make use of Pinot in user-facing apps. I have pre-cubed data coming from Spark/Snappy data in Kafka and want to use it as source in Pinot. The only problem is the data is "append-only" (comes from Debezium), so we have all create and update records in one stream (let's say user bet's on sport, his ticket is created, has a state "accepted", then his ticket changes the state to winning or non-winning and we have another record in Kafka). In BQ we use

row_number over (partition by ticket_number order by source.lsn DESC)

which numbers rows with the 1 as the newest and then we look for rows numbers = 1 in sub-query (BQ docs). How would you solve it in Pinot, I didn't find windowing/analytic functions. Thanks in advance!

Kishore G

10/05/2020, 7:20 AM

Pinot does not support Window functions. You will have to use Presto Pinot connector to achieve this functionality.

Wojtek Sznapka

10/05/2020, 7:22 AM

thanks a lot

Wojtek Sznapka

10/05/2020, 7:22 AM

we'll give it a try

Kishore G

10/05/2020, 7:23 AM

If the queries per second is low, this might be ok. If not, this will be inefficient in terms of cost since it has to scan a lot of data.

Wojtek Sznapka

10/05/2020, 7:23 AM

yeah, it will be rather couple of queries per minute, not continous

Kishore G

10/05/2020, 7:23 AM

Window function are good from user point of view but costly from a database point of view.

Wojtek Sznapka

10/05/2020, 7:23 AM

is there another way of achieving that?

Kishore G

10/05/2020, 7:24 AM

What’s the exact query

Wojtek Sznapka

10/05/2020, 7:24 AM

without window function

Wojtek Sznapka

10/05/2020, 7:24 AM

sec.

Wojtek Sznapka

10/05/2020, 7:27 AM

Copy code

SELECT
  SUM(total_bet) AS total_stake_today,
  COUNT(*) AS number_of_tickets
FROM (
  SELECT
    id_ticket_state,
    total_bet
  FROM (
    SELECT
      t.after.id_ticket_state,
      t.after.ticket_total_bet AS total_bet,
      ROW_NUMBER() OVER (PARTITION BY t.after.ticket_number ORDER BY t.after.id_ticket_phase, t.source.lsn DESC) AS row_no
    FROM
      `dwh.tickets` t
    WHERE
      DATE(t.after.ticket_time_bet) = CURRENT_DATE()
      AND CAST(t._PARTITIONTIME AS DATE) >= DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
   )
  WHERE row_no = 1
)
WHERE
  id_ticket_state NOT IN (1,4,16)

Wojtek Sznapka

10/05/2020, 7:27 AM

I've simplified it a bit, but should get a general idea

Wojtek Sznapka

10/05/2020, 7:28 AM

you can ignore partitiontime condition, as it applies to BQ only

Kishore G

10/05/2020, 7:31 AM

whats the cardinality of id_ticket_state

Wojtek Sznapka

10/05/2020, 7:31 AM

~30 distinct values

Kishore G

10/05/2020, 7:52 AM

Go ahead and try Pinot with Presto connector - https://docs.pinot.apache.org/integrations/presto

❤️ 1

Open in Slack

Previous Next