Hi, I want to ask question about pinot data struc...
# general
o
Hi, I want to ask a question about the Pinot data structure; I am a bit confused. First, some feedback about the Apache Pinot documentation: I think it can be improved. I could not find some necessary information in the docs. This could be my fault; let me know if I missed something.

Pinot segments are created incrementally based on time. For example, with daily granularity, a segment is created for each day. This is the default and a required feature of segments, as with Apache Druid segments, so we get the best performance with time-based queries. Can we define segment configurations such as a maximum segment size? For example, say we have 10 GB of data for Sunday. I want to create 5 segments of 2 GB each instead of one 10 GB segment. Is this supported?

Segments have metadata. We can get all segment names of a table from the controller API, then get the metadata (segment URI etc.) of each segment from the controller API.
• In the segment metadata, what location does "segment URI" represent? Deep storage or the offline server location?
• Can I get the metadata of all segments from the controller in one query? (I saw that the API accepts only one segment when fetching segment metadata.)

Also, I want to read a segment. Apache Pinot creates the segment file, then compresses it to `.tar.gz` format. Can I read a compressed segment file (e.g. segment_1.tar.gz) with `PinotSegmentRecordReader` directly? Or do I need to `unTar` the compressed file first? In short, can I read a compressed segment file from the offline server or deep storage directly, or do I have to download it locally first, then untar and read it? What would the flow to read a segment look like? Thank you so much!
k
Thanks for the feedback, improving the docs is definitely a high priority. Let us know if there are specific sections that can be improved.
Partitioning by time is not a necessary requirement in Pinot. We do leverage it if it's aligned on a day/hour boundary.
If you are ingesting in real time, you can control the segment size by setting the right threshold for completing the segment: by time, number of rows, or size in MB.
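As a rough illustration of those three thresholds, here is a sketch of the relevant part of a real-time table's `streamConfigs`, written as a Python dict. The key names are recalled from the Pinot documentation of that period and may differ between versions, so treat them as assumptions and check the docs for your release; the values are arbitrary examples.

```python
import json

# Sketch only: segment-completion thresholds for a real-time table.
# Key names are assumptions based on older Pinot docs; newer releases
# may use different names for the same settings.
stream_configs = {
    # Complete the segment after it has been consuming for this long.
    "realtime.segment.flush.threshold.time": "6h",
    # Row-count threshold; "0" is understood to hand control to the
    # size-based target below instead of a fixed row count.
    "realtime.segment.flush.threshold.size": "0",
    # Target size of a completed segment.
    "realtime.segment.flush.desired.size": "200M",
}

print(json.dumps(stream_configs, indent=2))
```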
In offline ingestion, Pinot creates one segment per input file.
If you really want to control the size, there is a prepartitioning job that you can run before running the segment generation job.
segment URI in metadata points to location in deep store
Don't know if there is an API to get segment metadata of everything in one shot.
You need to untar the segment to read it using PinotSegmentRecordReader.
by looking at your questions, you seem to have figured out a lot already 🙂
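Without a single bulk call, the two-step approach from the question (list the segment names, then fetch metadata per segment) can be sketched as a loop. This is a minimal sketch, not a confirmed client: the endpoint paths `/segments/{table}` and `/segments/{table}/{segment}/metadata` and the listing shape `[{"OFFLINE": [...]}]` are assumptions to verify against your controller's Swagger page, and `all_segment_metadata` is a hypothetical helper name.

```python
import json
from typing import Callable, Dict
from urllib.parse import quote
from urllib.request import urlopen


def http_get_json(url: str):
    """Fetch a URL and decode the JSON body."""
    with urlopen(url) as resp:
        return json.load(resp)


def all_segment_metadata(
    controller: str, table: str, get_json: Callable = http_get_json
) -> Dict[str, dict]:
    """Return {segment_name: metadata} with one metadata request per segment."""
    # Assumed listing shape: [{"OFFLINE": ["seg1", ...]}, {"REALTIME": [...]}]
    listing = get_json(f"{controller}/segments/{quote(table)}")
    names = []
    for entry in listing:
        for segment_names in entry.values():
            names.extend(segment_names)
    # Assumed per-segment metadata path; one call per segment name.
    return {
        name: get_json(f"{controller}/segments/{quote(table)}/{quote(name)}/metadata")
        for name in names
    }
```

Passing `get_json` as a parameter keeps the loop testable without a running controller.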
o
I have worked with Apache Druid a lot, so I can learn and imagine some things easily 🙂 So, segments can be partitioned by time, by id (or any column), or work as a key-value store? Am I wrong? For real-time segment creation, how are segments created? What are the possible scenarios based on time, size, etc.? I want to write a connector for Apache Pinot to use it in ETL processing. For this, I do not want to go through the broker for reads, and for writes I want to call the segment creation and push job. So I want to learn the detailed data structure of Apache Pinot. @Kishore G
k
There is already a presto-pinot connector that does all the things you need.
o
So, for the read operation, the flow can be like this:
• Download the compressed segment file (.tar.gz) from deep storage.
• Untar it.
• Read it using PinotSegmentRecordReader.
Okay, I will look at the presto connector for that. Thank you so much!
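The untar step of that flow can be sketched with Python's standard `tarfile` module, assuming the segment archive has already been downloaded locally. `untar_segment` is a hypothetical helper; reading the extracted directory with PinotSegmentRecordReader is a Java-side step and is not shown here.

```python
import tarfile
from pathlib import Path


def untar_segment(archive: str, dest_dir: str) -> Path:
    """Extract a .tar.gz segment archive and return the extracted segment directory."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        # Collect the top-level entries so the caller gets the segment folder back.
        roots = set()
        for member in tar.getmembers():
            parts = Path(member.name).parts
            if parts:
                roots.add(parts[0])
        # Only extract archives you trust; tar members can carry unsafe paths.
        tar.extractall(dest)
    # A segment archive is expected to hold one top-level directory;
    # fall back to the destination itself if it does not.
    if len(roots) == 1:
        return dest / roots.pop()
    return dest
```

The returned path is what would then be handed to the record reader on the Java side.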
k
out of curiosity, why do you want to use Pinot for ETL processing?
we plan to add a streaming API for presto-pinot connector
r
@Kishore G, what options are there to store segments other than the local file system?
k
mounted volume if you are on cloud
r
ok
I am trying to compare timeinmillis with the current datetime. What is the solution?
k
Use the standard > <
Time > x
r
pinot error: {'errorCode': 150, 'message': 'PQLParsingError:\n' 'org.apache.pinot.pql.parsers.Pql2CompilationException: Comparison ' 'between two columns is not supported.\n' '\tat ' 'org.apache.pinot.pql.parsers.pql2.ast.ComparisonPredicateAstNode.addChild(ComparisonPredicateAstNode.java:60)\n' '\tat '
Time>=EXPECTED_DATEInMillis is what I am trying in the query.
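A likely reading of the error above is that the unquoted name `EXPECTED_DATEInMillis` is parsed as a second column, so the right-hand side has to be a numeric literal computed on the client. A minimal sketch, assuming the column is called `Time` as in the thread; the table name `myTable` and the helper names are hypothetical:

```python
import time


def now_millis() -> int:
    """Current time as epoch milliseconds."""
    return int(time.time() * 1000)


def time_filter_query(table: str, column: str, cutoff_millis: int) -> str:
    """Build a query that compares a time column against a literal value."""
    # The cutoff is inlined as a number so the parser sees column >= literal,
    # not a comparison between two identifiers.
    return f"SELECT * FROM {table} WHERE {column} >= {int(cutoff_millis)} LIMIT 10"


print(time_filter_query("myTable", "Time", now_millis()))
```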