#general

Neil Teng

06/18/2021, 9:36 PM
Hey, can anyone recommend other materials related to the "Raw value forward index"? I am having a really difficult time understanding the raw value forward index example.

Mayank

06/18/2021, 9:37 PM
What are you looking for? It just stores the raw data compressed in chunks, as opposed to dictionary encoding.

Neil Teng

06/18/2021, 9:49 PM
Where does the chunk size come from? And the "chunkoffset = docId % chunkSize" in the example is hard to understand. If the chunk is compressed, how does that differ from the on-disk compression of a column-oriented DB? If the purpose is to improve large sequential scans, do you mean a scan on this column without any WHERE clause? If there is a WHERE clause, I think we still need to check each value.

Mayank

06/18/2021, 9:54 PM
The math (modulo etc) is on uncompressed chunk. Compression is for on-disk index.
Say you wanted to read docId 1 to 1000. In the case of a dictionary, the dict encoding may scatter these 1000 values all over the disk (in the worst case requiring 1000 disk seeks). In the case of a raw index, there is no dictionary, and all 1000 values would be contiguous on disk (minimizing disk seeks).
Typically, you want to use this for high cardinality string columns, where dictionary encoding does not provide much compression.
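
The chunk arithmetic described above can be sketched in a few lines of Python. This is only a toy model, not Pinot's actual on-disk format: `CHUNK_SIZE`, `build_raw_index`, and `read_value` are hypothetical names, the chunk size is an arbitrary assumption, and `zlib` plus JSON stand in for whatever compression and serialization the real index uses. The point it illustrates is that the division and modulo run on docIds, and the offset is applied only after the chunk has been decompressed.

```python
import json
import zlib

CHUNK_SIZE = 4  # hypothetical number of docs per chunk, chosen for illustration


def build_raw_index(values, chunk_size=CHUNK_SIZE):
    """Split values (already in docId order) into chunks and compress each one."""
    chunks = []
    for start in range(0, len(values), chunk_size):
        chunk = values[start:start + chunk_size]
        chunks.append(zlib.compress(json.dumps(chunk).encode("utf-8")))
    return chunks


def read_value(chunks, doc_id, chunk_size=CHUNK_SIZE):
    """Look up one docId: pick the chunk, decompress it, then index into it."""
    chunk_id = doc_id // chunk_size     # which compressed chunk holds this doc
    chunk_offset = doc_id % chunk_size  # position inside the uncompressed chunk
    chunk = json.loads(zlib.decompress(chunks[chunk_id]).decode("utf-8"))
    return chunk[chunk_offset]


values = [f"user-{i:03d}" for i in range(10)]  # docId 0..9
chunks = build_raw_index(values)
value_for_doc_6 = read_value(chunks, 6)  # chunk 1, offset 2
```

Because neighbouring docIds land in the same chunk, reading a contiguous docId range touches a small number of adjacent chunks.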

Neil Teng

06/18/2021, 10:17 PM
I think I missed a point here -- the indexed column is always sorted.

Mayank

06/18/2021, 10:18 PM
No, only the sorted column is sorted. And dictionaries are sorted.

Neil Teng

06/18/2021, 10:20 PM
OK. What are those pointers from colA to colB trying to say?

Mayank

06/18/2021, 10:21 PM
So consider this:
Your use case has queries mostly for a primary column (eg where customerId = xxx).

If you sort on customerId, then you will always pick contiguous docIds for a given query.

Now consider you have a high cardinality string column that you project in the query.

With dictionary, the fwd index will have dictionary ids, that may point to different disk blocks.

Without dictionary for this high cardinality column, the contiguous docIds will correspond to contiguous disk blocks.
Hopefully that makes sense?
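
The scenario above can be mimicked with plain Python lists, as a rough sketch rather than anything Pinot-specific. The column names, the sample data, and the `doc_id_range` helper are all made up for illustration. With rows sorted on `customer_id`, a filter on one customer maps to a single contiguous docId range, so the raw values of the projected column are one contiguous slice, while the dictionary ids for the same range jump around the dictionary.

```python
from bisect import bisect_left, bisect_right

# Rows are already sorted on customer_id; docId is simply the list position.
customer_id = [101, 101, 102, 102, 102, 103]
url = ["/z", "/b", "/q", "/a", "/m", "/c"]  # high-cardinality projected column


def doc_id_range(sorted_column, value):
    """Return the half-open docId range [start, end) matching value."""
    return bisect_left(sorted_column, value), bisect_right(sorted_column, value)


start, end = doc_id_range(customer_id, 102)

# Raw forward index: the matching values sit next to each other.
raw_slice = url[start:end]

# Dictionary-encoded forward index: sorted dictionary plus one dictId per doc.
dictionary = sorted(set(url))
dict_fwd_index = [dictionary.index(v) for v in url]
dict_ids = dict_fwd_index[start:end]  # ids point to scattered dictionary slots
```

Here `dict_ids` comes out as `[4, 0, 3]`: three adjacent docIds resolve to three non-adjacent dictionary entries, which is the scattering the raw index avoids.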

Neil Teng

06/18/2021, 10:32 PM
OK, I think I understand it. I have a question about "sort on customerId": do we mean all the columns are sorted in the same order as customerId? How do we configure the records to be sorted on disk according to one column?

Mayank

06/18/2021, 10:33 PM
Yes, that is implicit. A docId represents a row in the table and has to match across columns; nothing special needs to be done for that.

Neil Teng

06/18/2021, 10:38 PM
Is docId a theoretical auto-incrementing UUID in Pinot, or a primary key we actually specify? I don't see that Pinot has a concept of a primary key, though.
Because "A docId represents a row in the table and has to match across columns", I think for a column-oriented DB, every column is sorted by this docId and compressed by default. That is the way data is laid out on disk.

Mayank

06/18/2021, 10:42 PM
docId is just a contiguous integer (0, 1, 2, 3...) in the scope of a Pinot segment.
"I think for a column-oriented DB, every column is sorted by this docId and compressed by default. That is the way data is laid out on disk."
Hmm, then how do you identify a row across columns? If you sort each column independently, you will lose track of which value in colA corresponds to which value in colB. I am not sure what other column-oriented DBs do, but Pinot does not do this.
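
A small sketch of the difference, using made-up columns `col_a` and `col_b`: sorting the table on one column means computing a single row permutation and applying it to every column, so position (docId) still identifies a row. Sorting each column on its own breaks that pairing.

```python
# Original rows, docId = list position: ("c", 5), ("a", 9), ("b", 7)
col_a = ["c", "a", "b"]
col_b = [5, 9, 7]

# Sort the table on col_a: one permutation of docIds, applied to all columns.
order = sorted(range(len(col_a)), key=lambda doc_id: col_a[doc_id])
sorted_a = [col_a[i] for i in order]
sorted_b = [col_b[i] for i in order]
rows_kept = list(zip(sorted_a, sorted_b))  # rows stay intact, only reordered

# Sorting each column independently pairs values that were never in one row.
rows_broken = list(zip(sorted(col_a), sorted(col_b)))
```

After the shared permutation, `rows_kept` still contains the original rows in a new order, whereas `rows_broken` pairs `"a"` with `5`, a combination that never existed.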

Neil Teng

06/18/2021, 10:48 PM
By sorted, I just mean laid out in the same order as the docId.
I think we mean the same thing.

Mayank

06/18/2021, 10:48 PM
Yes, seems so

Neil Teng

06/18/2021, 10:50 PM
wait, "dictionaries are sorted", do you mean the docID is sorted according to the indexed column?

Mayank

06/18/2021, 10:51 PM
dictionary is separate from docId

Neil Teng

06/18/2021, 10:51 PM
I am sorry, it is not.
OK, back to the raw value forward index: if the data on disk is already in docId order, what is the point of it?

Mayank

06/18/2021, 10:53 PM
From docId you get dictionaryId
The actual data for that dictionary id can be anywhere on disk
Look at the forward index section to understand docId -> dictId -> rawData

Neil Teng

06/18/2021, 10:55 PM
Thanks, I understand the dictionary-encoded forward index.
I am just not sure why we need to specify the "raw value forward index", because the data on disk is already laid out that way.

Mayank

06/18/2021, 10:56 PM
It is not
The dictionary of a column is generated by "sorting values of the column". dictId = 0 is the first sorted value and so on.
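
As a minimal sketch of that rule (the sample values and the `lookup` helper are invented for illustration, and real dictionaries are stored in a binary format rather than Python lists): the dictionary is the sorted list of distinct values, a value's dictId is its position in that list, and the forward index stores one dictId per docId.

```python
values = ["pear", "apple", "fig", "apple", "pear"]  # column values, docId 0..4

# Dictionary: distinct values in sorted order; dictId 0 is the first sorted value.
dictionary = sorted(set(values))
dict_id_of = {value: dict_id for dict_id, value in enumerate(dictionary)}

# Forward index: docId -> dictId, kept in docId order.
forward_index = [dict_id_of[v] for v in values]


def lookup(doc_id):
    """Resolve docId -> dictId -> raw value."""
    return dictionary[forward_index[doc_id]]
```

So the forward index is in docId order, but the values it refers to live in the dictionary in sorted-value order, which is unrelated to docId order.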

Neil Teng

06/18/2021, 10:57 PM
Yes.
Do you mean that, in some cases, the data in the same column is not stored next to each other on disk?