# general
r
I'm trying to evaluate Pinot within our bank (note, we already use an existing commercial OLAP product), ingesting streaming data from Flink + Kafka with pre-processed data into Pinot. This seems to work well, and the latency matches our requirement. We also have clients currently streaming realtime data from the cube, so I'm trying to see if there's any such streaming client API available for querying Pinot. I can see that Pinot integrates with Presto / Trino, so can someone please point me to a link showing how to implement a streaming client to query realtime data from Pinot (specifically, the initial query may yield an initial snapshot of the data, and thereafter delta updates on the underlying query)?
m
Could you elaborate on what you mean by streaming client? Do you just want to stream data out of Pinot (i.e. no aggregations)?
While there’s a gRPC endpoint in pinot-broker, simply streaming raw data out of Pinot may not be the best use of it. If you can provide more details on the end requirements, perhaps we can suggest a better approach. @Ram
r
Possibly with aggregation, say if we have a query to aggregate trades at book level and stream this from Pinot.
m
Is the result of the aggregation too big and needs to be streamed, as opposed to just making a SQL query?
r
Currently our traders have a lot of different queries against the cube, with drilldown to fact-level data (i.e. no aggregation, just raw data). The other use case is to look up desk / book level aggregated data (possibly a head trader looking at cross-region trades).
m
You can get raw data and aggregates via SQL query (rest-client). You can also try the new gRPC (streaming) endpoint if the data to be returned is huge (> MBs).
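For anyone following along, here is a minimal sketch of both query shapes through the Pinot Java client (pinot-java-client); the broker address and the trades table/columns are assumptions for illustration, not anything from this thread:

```java
import org.apache.pinot.client.Connection;
import org.apache.pinot.client.ConnectionFactory;
import org.apache.pinot.client.ResultSet;
import org.apache.pinot.client.ResultSetGroup;

public class PinotSqlExample {
  public static void main(String[] args) {
    // Assumed broker address for illustration.
    Connection connection = ConnectionFactory.fromHostList("localhost:8099");

    // Raw fact-level rows (hypothetical "trades" table and columns).
    ResultSetGroup raw = connection.execute(
        "SELECT tradeId, book, notional FROM trades LIMIT 100");
    System.out.println("raw rows: " + raw.getResultSet(0).getRowCount());

    // Book-level aggregate for the drilldown use case.
    ResultSetGroup agg = connection.execute(
        "SELECT book, SUM(notional) FROM trades GROUP BY book");

    ResultSet rs = agg.getResultSet(0);
    for (int row = 0; row < rs.getRowCount(); row++) {
      System.out.println(rs.getString(row, 0) + " -> " + rs.getDouble(row, 1));
    }
    connection.close();
  }
}
```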
r
The need for streaming is because we have live trades and market data which will be processed from Flink + Kafka every second / few milliseconds; e.g. we receive live market data from Bloomberg that needs to be processed at millisecond latency.
Ok, thank you Mayank. Is there a link to look at the gRPC endpoint for Pinot?
Is this via the Presto-Pinot connector?
m
No, it's a broker endpoint. cc: @Rong R
r
Yes, the broker does support a streaming endpoint. However, what I sense here is that you are not asking for a single query result to be streamed back to the client, but rather for a long-running query execution that materializes and produces data as Kafka messages arrive / are ingested into Pinot. Is that correct?
E.g. when running `SELECT * FROM tbl WHERE col > 0`, you are not asking for the OFFLINE + REALTIME segments (with data that has already been ingested) to be streamed back (batch by batch). Rather, you want to keep this query long-running / never-ending: when newly ingested data from the Flink + Kafka pipeline matches the filter `col > 0`, you want that data returned to the user as well.
r
That's exactly correct...
@Rong R, yes, you're spot on.. something like this is implemented in Apache Ignite: https://www.gridgain.com/docs/latest/developers-guide/key-value-api/continuous-queries.
Note that in Apache Ignite, continuous queries have many constraints, e.g. SQL queries can't be used, which means no aggregation; only the raw streaming data can be scanned for continuous streaming and event listeners.
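For context on the pattern being referenced, here is a condensed sketch of an Ignite continuous query (initial snapshot plus delta callbacks); the cache name and value type are illustrative only:

```java
import javax.cache.Cache;
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.query.ContinuousQuery;
import org.apache.ignite.cache.query.QueryCursor;
import org.apache.ignite.cache.query.ScanQuery;

public class IgniteContinuousQueryExample {
  public static void main(String[] args) {
    Ignite ignite = Ignition.start();
    IgniteCache<Integer, Double> trades = ignite.getOrCreateCache("trades");

    ContinuousQuery<Integer, Double> qry = new ContinuousQuery<>();
    // Initial snapshot: all entries matching the filter at subscription time.
    qry.setInitialQuery(new ScanQuery<Integer, Double>((key, notional) -> notional > 0));
    // Delta updates: invoked for every matching entry created/updated afterwards.
    qry.setLocalListener(events -> events.forEach(e ->
        System.out.println("delta: " + e.getKey() + " -> " + e.getValue())));

    try (QueryCursor<Cache.Entry<Integer, Double>> cursor = trades.query(qry)) {
      for (Cache.Entry<Integer, Double> entry : cursor) {
        System.out.println("snapshot: " + entry.getKey() + " -> " + entry.getValue());
      }
      // Closing the cursor unsubscribes the listener; a real client keeps it open.
    }
  }
}
```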
r
Do you have any specific latency / batch size requirements? I don't think we have such a long-running query endpoint / protocol. What I can think of is to periodically issue new queries and ask for another mini-batch for the time window from the last execution until the current timestamp. FYI: the one Mayank mentioned is for a static result with streaming return, e.g. data will be returned in mini-batches, but it will end eventually (when all data at the time of query submission has finished processing).
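A rough sketch of that periodic mini-batch approach, again via pinot-java-client; the trades table, the eventTimeMillis column, and the 500 ms poll interval are assumptions taken from the numbers in this thread:

```java
import org.apache.pinot.client.Connection;
import org.apache.pinot.client.ConnectionFactory;
import org.apache.pinot.client.ResultSet;
import org.apache.pinot.client.ResultSetGroup;

public class MiniBatchPoller {
  public static void main(String[] args) throws InterruptedException {
    Connection connection = ConnectionFactory.fromHostList("localhost:8099");
    long lastTs = System.currentTimeMillis();

    while (true) {
      long now = System.currentTimeMillis();
      // Half-open window (lastTs, now]: only rows ingested since the last poll.
      ResultSetGroup batch = connection.execute(
          "SELECT tradeId, book, notional FROM trades"
              + " WHERE eventTimeMillis > " + lastTs
              + " AND eventTimeMillis <= " + now
              + " LIMIT 1000");
      ResultSet rs = batch.getResultSet(0);
      for (int row = 0; row < rs.getRowCount(); row++) {
        // Hand each delta row to the UI / conflation layer here.
        System.out.println(rs.getString(row, 0));
      }
      lastTs = now;
      Thread.sleep(500); // assumed poll interval matching the latency budget
    }
  }
}
```

One caveat with this sketch: it filters on event time, so late-arriving rows whose timestamp falls before lastTs would be missed unless the windows overlap.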
r
Latency will be down to at least 500 millisecs, and possibly a 1K batch size which can be iterated and/or conflated. We have roughly 1000 queries running concurrently, so each query will reconnect, say, every few millisecs and fetch the result... Is there any concern if we go down this route, e.g. too many queries iterating over time and causing Pinot to melt down?
r
I am no expert, but I think you can maintain a JDBC driver connection with Pinot and use that as your connection manager for issuing mini-batch queries. CC @Kartik Khare ^
r
Ok, thank you very much.
m
The JDBC driver is not feature-complete. You can use pinot-client and issue batch queries with specific time filters.
p
@Ram have you thought about something like ksqlDB for the continuous queries?
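For reference, a hedged sketch of what that could look like with the ksqlDB Java client; it assumes a ksqlDB server on localhost:8088 and a trades stream already declared over the Kafka topic:

```java
import io.confluent.ksql.api.client.Client;
import io.confluent.ksql.api.client.ClientOptions;
import io.confluent.ksql.api.client.Row;
import io.confluent.ksql.api.client.StreamedQueryResult;

public class KsqlDbPushQueryExample {
  public static void main(String[] args) throws Exception {
    ClientOptions options = ClientOptions.create().setHost("localhost").setPort(8088);
    Client client = Client.create(options);

    // EMIT CHANGES makes this a never-ending push query: an aggregate per book,
    // re-emitted whenever a new trade arrives on the underlying Kafka topic.
    StreamedQueryResult result = client
        .streamQuery("SELECT book, SUM(notional) AS total FROM trades GROUP BY book EMIT CHANGES;")
        .get();

    while (true) {
      Row row = result.poll(); // blocks until the next delta is produced
      if (row != null) {
        System.out.println(row.values());
      }
    }
  }
}
```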
v
@Mayank, @Rong R, is raw data export something that is advocated with Pinot? Would a high QPS of such data be an issue? Also, I believe we may run into issues with deep pagination.
m
That is not the best use of Pinot, but I would need to know more details before I can suggest any solutions.
v
We have a use case where customers need to pull data (stored as JSON) out of the system at a regular cadence (10 min). The size of the JSON is small, 1-2 KB. The same data has a JSON index on the fields that we require to serve realtime queries.
m
If you are doing analytics queries on the same table and need to fetch raw data once in a while (1-2 KB), that should be fine.
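To make that concrete, here is a small sketch of both access patterns over one table; the events table, payloadJson column, and JSON path are invented for illustration, and JSON_MATCH is the Pinot function that can leverage the JSON index:

```java
import org.apache.pinot.client.Connection;
import org.apache.pinot.client.ConnectionFactory;
import org.apache.pinot.client.ResultSetGroup;

public class JsonAccessExample {
  public static void main(String[] args) {
    Connection connection = ConnectionFactory.fromHostList("localhost:8099");

    // Periodic raw pull of the small (1-2 KB) JSON payloads.
    ResultSetGroup pull = connection.execute(
        "SELECT payloadJson FROM events WHERE customerId = 'c42' LIMIT 100");
    System.out.println(pull.getResultSet(0).getRowCount() + " payloads fetched");

    // Realtime analytical filter on a JSON field, served by the JSON index.
    ResultSetGroup filtered = connection.execute(
        "SELECT COUNT(*) FROM events"
            + " WHERE JSON_MATCH(payloadJson, '\"$.status\"=''FILLED''')");
    System.out.println(filtered.getResultSet(0).getLong(0, 0));
    connection.close();
  }
}
```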
v
@Rahul Jain ^^
p
For the streaming-to-live-UI use case with conflation, you might want to consider a different technology like the 60East AMPS content-based message broker and its websocket client: https://www.crankuptheamps.com/
We find Pinot more useful where we want to keep a time-series history / look at trends / feed data into ML / backtesting.