# general
j
I’m interested in building a segment fetcher that builds virtual segments on the fly from an OLTP transaction log.
m
You just need to implement the record reader interface for your format, and the rest of segment generation will happen using existing code
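Roughly the shape of it would be something like this sketch (class and field names here are made up for illustration; the real contract to implement is the RecordReader interface in pinot-spi, which has a similar hasNext/next/rewind/close shape, so check its javadocs for the exact signatures):
```java
import java.io.Closeable;
import java.io.IOException;
import java.util.Iterator;
import java.util.Map;

// Illustrative sketch only: a reader that walks transaction-log entries and
// exposes them one record at a time, the way segment generation expects.
public class TxnLogRecordReader implements Closeable {
  private final TransactionLog log;        // your durable transaction log
  private Iterator<TxnEntry> entries;

  public TxnLogRecordReader(TransactionLog log) {
    this.log = log;
    this.entries = log.iterator();
  }

  public boolean hasNext() {
    return entries.hasNext();
  }

  // In a real implementation this would populate Pinot's row type;
  // here we just return the column values as a map.
  public Map<String, Object> next() {
    return entries.next().toColumnValues();
  }

  // Segment generation may scan the data more than once, so the reader
  // must be able to restart from the beginning.
  public void rewind() {
    entries = log.iterator();
  }

  @Override
  public void close() throws IOException {
    log.close();
  }

  // Hypothetical stand-ins for whatever your transaction log exposes.
  public interface TransactionLog extends Closeable {
    Iterator<TxnEntry> iterator();
  }

  public interface TxnEntry {
    Map<String, Object> toColumnValues();
  }
}
```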
Although I didn’t fully get what you mean by virtual segment
k
@User I think @User is talking about creating on-demand segments that are backed by the transaction log. Though there’s the question of how to trigger the build: e.g. you could have some kind of smart web service fronting this dynamic builder, so that when an HTTP request is made to get a segment out of deep storage, it handles triggering the segment build (if needed).
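Very rough sketch of that idea, just to make it concrete (this is not an existing Pinot component, and the build/lookup logic is entirely imaginary):
```java
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

// Hypothetical "smart" segment service: if the requested segment already
// exists in deep storage it gets served as-is, otherwise it would be built
// on demand from the transaction log and then served.
public class OnDemandSegmentService {
  public static void main(String[] args) throws Exception {
    HttpServer server = HttpServer.create(new InetSocketAddress(8090), 0);
    server.createContext("/segments/", exchange -> {
      String segmentName =
          exchange.getRequestURI().getPath().substring("/segments/".length());
      // Imaginary step: check deep storage, trigger the dynamic builder if
      // the segment isn't there yet, block until it is ready, then stream it.
      byte[] body = ("would serve segment: " + segmentName)
          .getBytes(StandardCharsets.UTF_8);
      exchange.sendResponseHeaders(200, body.length);
      try (OutputStream os = exchange.getResponseBody()) {
        os.write(body);
      }
    });
    server.start();
  }
}
```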
m
I guess I am unfamiliar with the business use case. Why would we do segment builds on-demand at query time? That would be super slow
k
I can’t speak to Joe’s use case, but I did something similar a while ago for a client. They had a lot of data stored in Parquet format, and needed to be able to arbitrarily query a subset of it for analytics. So rather than spend the up-front time to turn everything into the proper back-end format (Lucene, as they were using Elasticsearch for the analytics queries), we wrapped the Parquet files with an interface that supported the Lucene calls used by ES. To do that we had to build some data structures (mostly bit sets) and cache those.
So there was latency on the first query for a particular subset of data, but after that it was fast.
m
Ah I see.
Thanks for the context @User
j
Fantastic question. Let me back up. I think I'm conflating the noun “segments” with what it means in my world. We have many databases, each of which has a durable transaction log. We want to support ad hoc queries of aggregated metrics derived from this data. We would like to avoid duplicating this data somewhere else just to do the aggregations (e.g. Spark). I think the way to accomplish this integration is through some kind of SPI plugin, but I'm unsure. What would you recommend?
k
If you want to use Pinot for this, the simplest approach would be to have some daemon process that converts transaction log entries into records that it pushes to a Kafka topic. Then set up Pinot to consume from that topic and update a realtime table.
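That Kafka route is basically a small loop like this (topic name, field names, and the log-tailing helper are all placeholders; you’d read new entries however your database exposes its log):
```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

// Hypothetical daemon: tails the transaction log and publishes each entry
// as a JSON record to a Kafka topic that a Pinot realtime table consumes.
public class TxnLogToKafka {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092");
    props.put("key.serializer",
        "org.apache.kafka.common.serialization.StringSerializer");
    props.put("value.serializer",
        "org.apache.kafka.common.serialization.StringSerializer");

    try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
      // tailTransactionLog() is imaginary: however you stream new entries
      // out of your database's durable log, converted to JSON.
      for (String jsonRecord : tailTransactionLog()) {
        producer.send(new ProducerRecord<>("txn-log-events", jsonRecord));
      }
    }
  }

  private static Iterable<String> tailTransactionLog() {
    // Placeholder: in reality this would block/stream as entries arrive.
    return java.util.List.of("{\"orderId\": 1, \"amount\": 9.99}");
  }
}
```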
If you want to avoid having to use Kafka, then you could have a daemon that periodically creates segments, saves them someplace reachable from your Pinot cluster, and does a “metadata push” to tell Pinot you have a new segment for that table.
For that second approach, you could generate CSV files that are then processed as-is by the Pinot admin tool to build segments. Or you could (as @User noted) write a record reader that is then used by the Pinot admin tool to directly read from the transaction log files.
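The CSV variant of that daemon is really just writing out batches like this and then pointing the Pinot segment-creation tooling at the output directory (column names, paths, and values here are made up):
```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical batch step: turn a window of transaction-log entries into a
// CSV file that the segment-creation tooling can consume as-is.
public class TxnLogToCsv {
  public static void main(String[] args) throws IOException {
    // Placeholder rows; in reality these come from the transaction log,
    // covering everything since the last segment was generated.
    List<String> rows = List.of(
        "1,2024-01-01T00:00:00Z,9.99",
        "2,2024-01-01T00:00:05Z,4.50");

    Path out = Path.of("/tmp/pinot-input/orders-batch-0001.csv");
    Files.createDirectories(out.getParent());
    // Header must match the column names in the Pinot table schema.
    List<String> lines = new ArrayList<>();
    lines.add("orderId,orderTime,amount");
    lines.addAll(rows);
    Files.write(out, lines);
    // Next steps (outside this sketch): run segment creation against
    // /tmp/pinot-input, then push the new segment's metadata to Pinot.
  }
}
```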
j
Could you point me to the API docs for the “metadata push”? I'll evaluate these approaches, thanks for the help!