# general
r
I'm trying to evaluate Pinot within our bank (note, we already use an existing commercial OLAP product), ingesting streaming data from Flink + Kafka with pre-processed data into Pinot. This seems to work well, and the latency matches our requirement. We also have clients currently streaming realtime data from the cube, so I'm trying to see if there's any such streaming client API available for querying Pinot. I can see that Pinot integrates with Presto / Trino, so can someone please point me to a link showing how to implement a streaming client to query realtime data from Pinot (specifically, the initial query may yield an initial snapshot of the data, and thereafter delta updates on the underlying query)?
m
Could you elaborate on what you mean by streaming client? Do you just want to stream data out of Pinot (i.e. no aggregations)?
While there’s a gRPC endpoint in pinot-broker, simply streaming raw data out of Pinot may not be the best use of it. If you can provide more details on the end requirements, perhaps we can suggest a better approach. @Ram
r
Possibly with aggregation, say if we have a query to aggregate trades at book level and stream this from Pinot.
m
Is the result of the aggregation too big and needs to be streamed, as opposed to just making a SQL query?
r
Currently our traders have a lot of different queries against the cube, with drilldown to fact-level data (i.e. no aggregation, just raw data). The other use case is to look up desk / book level aggregated data (possibly a head trader looking at cross-region trades).
m
You can get raw data and aggregates via SQL query (rest-client). You can also try the new gRPC (streaming) endpoint if the data to be returned is huge (> MBs).
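For anyone following along, here is a minimal sketch of both query shapes through the Pinot Java client (pinot-java-client); the broker address and the trades table/columns are assumptions for illustration, not anything from this thread:

```java
import org.apache.pinot.client.Connection;
import org.apache.pinot.client.ConnectionFactory;
import org.apache.pinot.client.ResultSet;
import org.apache.pinot.client.ResultSetGroup;

public class PinotSqlExample {
  public static void main(String[] args) {
    // Assumed broker address for illustration.
    Connection connection = ConnectionFactory.fromHostList("localhost:8099");

    // Raw fact-level rows (hypothetical "trades" table and columns).
    ResultSetGroup raw = connection.execute(
        "SELECT tradeId, book, notional FROM trades LIMIT 100");
    System.out.println("raw rows: " + raw.getResultSet(0).getRowCount());

    // Book-level aggregate for the drilldown use case.
    ResultSetGroup agg = connection.execute(
        "SELECT book, SUM(notional) FROM trades GROUP BY book");

    ResultSet rs = agg.getResultSet(0);
    for (int row = 0; row < rs.getRowCount(); row++) {
      System.out.println(rs.getString(row, 0) + " -> " + rs.getDouble(row, 1));
    }
    connection.close();
  }
}
```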
r
The need for streaming is because we have live trades and market data which will be processed from Flink + Kafka every second / few milliseconds; e.g. we receive live market data from Bloomberg that needs to be processed at millisecond latency.
Ok, thank you Mayank. Is there a link to look at the gRPC endpoint for Pinot?
Is this via the Presto-Pinot connector?
m
No, it's a broker endpoint. cc: @Rong R
r
Yes, the broker does support a streaming endpoint. However, what I sense here is that you are not asking for a single query result to be streamed back to the client, but rather for a long-running query execution that materializes and produces data as Kafka messages arrive / are ingested into Pinot. Is that correct?
E.g. when running `SELECT * FROM tbl WHERE col > 0`, you are not asking for the OFFLINE + REALTIME segments (with data that has already been ingested) to be streamed back (batch by batch). Rather, you want to keep this query long-running / never-ending: when newly ingested data from the Flink + Kafka pipeline matches the filter `col > 0`, you want that data returned to the user as well.
r
That's exactly correct...
@Rong R, yes, you're spot on.. something like this is implemented in Apache Ignite: https://www.gridgain.com/docs/latest/developers-guide/key-value-api/continuous-queries.
Note that in Apache Ignite, continuous queries have many constraints, e.g. SQL queries can't be used, which means no aggregation; only the raw streaming data can be scanned for continuous streaming and event listeners.
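For context on the pattern being referenced, here is a condensed sketch of an Ignite continuous query (initial snapshot plus delta callbacks); the cache name and value type are illustrative only:

```java
import javax.cache.Cache;
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.query.ContinuousQuery;
import org.apache.ignite.cache.query.QueryCursor;
import org.apache.ignite.cache.query.ScanQuery;

public class IgniteContinuousQueryExample {
  public static void main(String[] args) {
    Ignite ignite = Ignition.start();
    IgniteCache<Integer, Double> trades = ignite.getOrCreateCache("trades");

    ContinuousQuery<Integer, Double> qry = new ContinuousQuery<>();
    // Initial snapshot: all entries matching the filter at subscription time.
    qry.setInitialQuery(new ScanQuery<Integer, Double>((key, notional) -> notional > 0));
    // Delta updates: invoked for every matching entry created/updated afterwards.
    qry.setLocalListener(events -> events.forEach(e ->
        System.out.println("delta: " + e.getKey() + " -> " + e.getValue())));

    try (QueryCursor<Cache.Entry<Integer, Double>> cursor = trades.query(qry)) {
      for (Cache.Entry<Integer, Double> entry : cursor) {
        System.out.println("snapshot: " + entry.getKey() + " -> " + entry.getValue());
      }
      // Closing the cursor unsubscribes the listener; a real client keeps it open.
    }
  }
}
```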
r
Do you have any specific latency / batch size requirements? I don't think we have such a long-running query endpoint / protocol. What I can think of is to periodically issue new queries and ask for another mini-batch for the time window from the last execution until the current timestamp. FYI: the one Mayank mentioned is for a static result with streaming return, e.g. data will be returned in mini-batches, but it will end eventually (when all data at the time of query submission has finished processing).
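A rough sketch of that periodic mini-batch approach, again via pinot-java-client; the trades table, the eventTimeMillis column, and the 500 ms poll interval are assumptions taken from the numbers in this thread:

```java
import org.apache.pinot.client.Connection;
import org.apache.pinot.client.ConnectionFactory;
import org.apache.pinot.client.ResultSet;
import org.apache.pinot.client.ResultSetGroup;

public class MiniBatchPoller {
  public static void main(String[] args) throws InterruptedException {
    Connection connection = ConnectionFactory.fromHostList("localhost:8099");
    long lastTs = System.currentTimeMillis();

    while (true) {
      long now = System.currentTimeMillis();
      // Half-open window (lastTs, now]: only rows ingested since the last poll.
      ResultSetGroup batch = connection.execute(
          "SELECT tradeId, book, notional FROM trades"
              + " WHERE eventTimeMillis > " + lastTs
              + " AND eventTimeMillis <= " + now
              + " LIMIT 1000");
      ResultSet rs = batch.getResultSet(0);
      for (int row = 0; row < rs.getRowCount(); row++) {
        // Hand each delta row to the UI / conflation layer here.
        System.out.println(rs.getString(row, 0));
      }
      lastTs = now;
      Thread.sleep(500); // assumed poll interval matching the latency budget
    }
  }
}
```

One caveat with this sketch: it filters on event time, so late-arriving rows whose timestamp falls before lastTs would be missed unless the windows overlap.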
r
Latency will be down to at least 500 millisecs, and possibly a 1K batch size which can be iterated and/or conflated. We have roughly 1000 queries running concurrently, so each query will reconnect, say, every few millisecs and fetch the result... Is there any concern if we go down this route, e.g. too many queries iterating over time and causing Pinot to melt down?
r
I am no expert, but I think you can maintain a JDBC driver connection with Pinot and use that as your connection manager for issuing mini-batch queries. CC @Kartik Khare ^
r
Ok, thank you very much.
m
The JDBC driver is not feature-complete. You can use pinot-client and issue batch queries with specific time filters.
p
@Ram have you thought about something like ksqlDB for the continuous queries?
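For reference, a hedged sketch of what that could look like with the ksqlDB Java client; it assumes a ksqlDB server on localhost:8088 and a trades stream already declared over the Kafka topic:

```java
import io.confluent.ksql.api.client.Client;
import io.confluent.ksql.api.client.ClientOptions;
import io.confluent.ksql.api.client.Row;
import io.confluent.ksql.api.client.StreamedQueryResult;

public class KsqlDbPushQueryExample {
  public static void main(String[] args) throws Exception {
    ClientOptions options = ClientOptions.create().setHost("localhost").setPort(8088);
    Client client = Client.create(options);

    // EMIT CHANGES makes this a never-ending push query: an aggregate per book,
    // re-emitted whenever a new trade arrives on the underlying Kafka topic.
    StreamedQueryResult result = client
        .streamQuery("SELECT book, SUM(notional) AS total FROM trades GROUP BY book EMIT CHANGES;")
        .get();

    while (true) {
      Row row = result.poll(); // blocks until the next delta is produced
      if (row != null) {
        System.out.println(row.values());
      }
    }
  }
}
```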
v
@Mayank, @Rong R, is raw data export something that is advocated with Pinot? Would a high QPS of such data be an issue? Also, I believe we may run into issues with deep pagination.
m
That is not the best use of Pinot, but I would need to know more details before I can suggest any solutions.
v
We have a use case where customers need to pull data (stored as JSON) out of the system at a regular cadence (10 min). The size of the JSON is small, 1-2 KB. The same data has a JSON index on the fields that we require to serve realtime queries.
m
If you are doing analytics queries on the same table and need to fetch raw data once in a while (1-2 KB), that should be fine.
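To make that concrete, here is a small sketch of both access patterns over one table; the events table, payloadJson column, and JSON path are invented for illustration, and JSON_MATCH is the Pinot function that can leverage the JSON index:

```java
import org.apache.pinot.client.Connection;
import org.apache.pinot.client.ConnectionFactory;
import org.apache.pinot.client.ResultSetGroup;

public class JsonAccessExample {
  public static void main(String[] args) {
    Connection connection = ConnectionFactory.fromHostList("localhost:8099");

    // Periodic raw pull of the small (1-2 KB) JSON payloads.
    ResultSetGroup pull = connection.execute(
        "SELECT payloadJson FROM events WHERE customerId = 'c42' LIMIT 100");
    System.out.println(pull.getResultSet(0).getRowCount() + " payloads fetched");

    // Realtime analytical filter on a JSON field, served by the JSON index.
    ResultSetGroup filtered = connection.execute(
        "SELECT COUNT(*) FROM events"
            + " WHERE JSON_MATCH(payloadJson, '\"$.status\"=''FILLED''')");
    System.out.println(filtered.getResultSet(0).getLong(0, 0));
    connection.close();
  }
}
```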
v
@Rahul Jain ^^
p
For the streaming-to-live-UI use case with conflation, you might want to consider a different technology like the 60East AMPS content-based message broker and its websocket client: https://www.crankuptheamps.com/
We find Pinot more useful where we want to keep a time-series history / look at trends / feed data into ML / backtesting.