# general
j
I’m interested in building a segment fetcher that builds virtual segments on the fly from an OLTP transaction log.
m
You just need to implement the record reader interface for your format, and the rest of segment generation will happen using existing code
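Roughly the shape of it would be something like this sketch (class and field names here are made up for illustration; the real contract to implement is the RecordReader interface in pinot-spi, which has a similar hasNext/next/rewind/close shape, so check its javadocs for the exact signatures):
```java
import java.io.Closeable;
import java.io.IOException;
import java.util.Iterator;
import java.util.Map;

// Illustrative sketch only: a reader that walks transaction-log entries and
// exposes them one record at a time, the way segment generation expects.
public class TxnLogRecordReader implements Closeable {
  private final TransactionLog log;        // your durable transaction log
  private Iterator<TxnEntry> entries;

  public TxnLogRecordReader(TransactionLog log) {
    this.log = log;
    this.entries = log.iterator();
  }

  public boolean hasNext() {
    return entries.hasNext();
  }

  // In a real implementation this would populate Pinot's row type;
  // here we just return the column values as a map.
  public Map<String, Object> next() {
    return entries.next().toColumnValues();
  }

  // Segment generation may scan the data more than once, so the reader
  // must be able to restart from the beginning.
  public void rewind() {
    entries = log.iterator();
  }

  @Override
  public void close() throws IOException {
    log.close();
  }

  // Hypothetical stand-ins for whatever your transaction log exposes.
  public interface TransactionLog extends Closeable {
    Iterator<TxnEntry> iterator();
  }

  public interface TxnEntry {
    Map<String, Object> toColumnValues();
  }
}
```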
Although I didn’t fully get what you mean by virtual segment
k
@User I think @User is talking about creating on-demand segments that are backed by the transaction log. Though there’s the question of how to trigger the build: e.g. you could have some kind of smart web service fronting this dynamic builder, so that when an HTTP request is made to get a segment out of deep storage, it handles triggering the segment build (if needed).
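Very rough sketch of that idea, just to make it concrete (this is not an existing Pinot component, and the build/lookup logic is entirely imaginary):
```java
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

// Hypothetical "smart" segment service: if the requested segment already
// exists in deep storage it gets served as-is, otherwise it would be built
// on demand from the transaction log and then served.
public class OnDemandSegmentService {
  public static void main(String[] args) throws Exception {
    HttpServer server = HttpServer.create(new InetSocketAddress(8090), 0);
    server.createContext("/segments/", exchange -> {
      String segmentName =
          exchange.getRequestURI().getPath().substring("/segments/".length());
      // Imaginary step: check deep storage, trigger the dynamic builder if
      // the segment isn't there yet, block until it is ready, then stream it.
      byte[] body = ("would serve segment: " + segmentName)
          .getBytes(StandardCharsets.UTF_8);
      exchange.sendResponseHeaders(200, body.length);
      try (OutputStream os = exchange.getResponseBody()) {
        os.write(body);
      }
    });
    server.start();
  }
}
```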
m
I guess I am unfamiliar with the business use case. Why would we do segment builds on-demand at query time? That would be super slow
k
I can’t speak to Joe’s use case, but I did something similar a while ago for a client. They had a lot of data stored in Parquet format, and needed to be able to arbitrarily query a subset of it for analytics. So rather than spend the up-front time to turn everything into the proper back-end format (Lucene, as they were using Elasticsearch for the analytics queries), we wrapped the Parquet files with an interface that supported the Lucene calls used by ES. To do that we had to build some data structures (mostly bit sets) and cache those.
So there was latency on the first query for a particular subset of data, but after that it was fast.
m
Ah I see.
Thanks for the context @User
j
Fantastic question. Let me back up. I think I'm conflating the noun “segments” with what it means in my world. We have many databases, each of which has a durable transaction log. We want to support ad hoc queries of aggregated metrics derived from this data. We would like to avoid duplicating this data somewhere else just to do the aggregations (e.g. Spark). I think the way to accomplish this integration is through some kind of SPI plugin, but I'm unsure. What would you recommend?
k
If you want to use Pinot for this, the simplest approach would be to have some daemon process that converts transaction log entries into records that it pushes to a Kafka topic. Then set up Pinot to consume from that topic and update a realtime table.
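That Kafka route is basically a small loop like this (topic name, field names, and the log-tailing helper are all placeholders; you’d read new entries however your database exposes its log):
```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

// Hypothetical daemon: tails the transaction log and publishes each entry
// as a JSON record to a Kafka topic that a Pinot realtime table consumes.
public class TxnLogToKafka {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092");
    props.put("key.serializer",
        "org.apache.kafka.common.serialization.StringSerializer");
    props.put("value.serializer",
        "org.apache.kafka.common.serialization.StringSerializer");

    try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
      // tailTransactionLog() is imaginary: however you stream new entries
      // out of your database's durable log, converted to JSON.
      for (String jsonRecord : tailTransactionLog()) {
        producer.send(new ProducerRecord<>("txn-log-events", jsonRecord));
      }
    }
  }

  private static Iterable<String> tailTransactionLog() {
    // Placeholder: in reality this would block/stream as entries arrive.
    return java.util.List.of("{\"orderId\": 1, \"amount\": 9.99}");
  }
}
```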
If you want to avoid having to use Kafka, then you could have a daemon that periodically creates segments, saves them someplace reachable from your Pinot cluster, and does a “metadata push” to tell Pinot you have a new segment for that table.
For that second approach, you could generate CSV files that are then processed as-is by the Pinot admin tool to build segments. Or you could (as @User noted) write a record reader that is then used by the Pinot admin tool to directly read from the transaction log files.
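The CSV variant of that daemon is really just writing out batches like this and then pointing the Pinot segment-creation tooling at the output directory (column names, paths, and values here are made up):
```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical batch step: turn a window of transaction-log entries into a
// CSV file that the segment-creation tooling can consume as-is.
public class TxnLogToCsv {
  public static void main(String[] args) throws IOException {
    // Placeholder rows; in reality these come from the transaction log,
    // covering everything since the last segment was generated.
    List<String> rows = List.of(
        "1,2024-01-01T00:00:00Z,9.99",
        "2,2024-01-01T00:00:05Z,4.50");

    Path out = Path.of("/tmp/pinot-input/orders-batch-0001.csv");
    Files.createDirectories(out.getParent());
    // Header must match the column names in the Pinot table schema.
    List<String> lines = new ArrayList<>();
    lines.add("orderId,orderTime,amount");
    lines.addAll(rows);
    Files.write(out, lines);
    // Next steps (outside this sketch): run segment creation against
    // /tmp/pinot-input, then push the new segment's metadata to Pinot.
  }
}
```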
j
Could you point me to the API docs for the “metadata push”? I'll evaluate these approaches, thanks for the help!