# general
n
Hey, can anyone recommend other materials related to the "Raw value forward index"? I am having a really difficult time understanding the Raw value forward index example.
m
What are you looking for? It just stores the raw data, chunk-compressed, as opposed to using dictionary encoding.
n
Where does the chunk size come from? And the "chunkoffset = docId % chunkSize" is hard to understand in the example. If the chunk is compressed, how is that different from the on-disk compression in a column-oriented DB? If the purpose is to improve large sequential scans, do you mean a scan on this column without any WHERE clause? If there is a WHERE clause, I think we still need to check each value.
m
The math (modulo etc) is on uncompressed chunk. Compression is for on-disk index.
Say you wanted to read docId 1 to 1000. In case of dictionary, the dict encoding may scatter these 1000 values all over the disk (in the worst case requiring 1000 disk seeks). In case of raw index, there is no dictionary, and all 1000 values would be contiguous on disk (minimizing disk seeks)
Typically, you want to use this for high cardinality string columns, where dictionary encoding does not provide much compression.
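A minimal sketch of the chunk math discussed above, assuming fixed-width integer values, a made-up chunk size of 4 docs, and zlib as a stand-in compressor. None of this is Pinot's actual code or file format; `build_raw_index` and `read_value` are hypothetical names used only for illustration. It shows that the division and modulo pick a chunk and a slot inside the uncompressed chunk, while compression only affects how each chunk is stored.

```python
import struct
import zlib

CHUNK_SIZE = 4  # docs per chunk; hypothetical value chosen for this example


def build_raw_index(values, chunk_size=CHUNK_SIZE):
    """Pack values in docId order into chunks and compress each chunk separately."""
    chunks = []
    for start in range(0, len(values), chunk_size):
        block = values[start:start + chunk_size]
        raw = struct.pack(f"<{len(block)}q", *block)  # 8 bytes per value
        chunks.append(zlib.compress(raw))
    return chunks


def read_value(chunks, doc_id, chunk_size=CHUNK_SIZE):
    """Locate a value by docId: pick the chunk, decompress it, index into it."""
    chunk_id = doc_id // chunk_size      # which compressed chunk holds the doc
    chunk_offset = doc_id % chunk_size   # slot inside the uncompressed chunk
    raw = zlib.decompress(chunks[chunk_id])
    return struct.unpack_from("<q", raw, chunk_offset * 8)[0]


values = [70, 10, 50, 30, 90, 20, 80, 40, 60]  # unsorted column, docId = list index
chunks = build_raw_index(values)
print(read_value(chunks, 5))  # docId 5 lives in chunk 1 at offset 1
```

Reading a run of neighbouring docIds touches only one or two neighbouring chunks, which is the locality benefit described above. Variable-length strings would additionally need per-value offsets inside each chunk, which this sketch leaves out.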
n
I think I missed a point here -- the indexed column is always sorted.
m
No, only the sorted column is sorted. And dictionaries are sorted.
n
OK. What are those pointers from colA to colB trying to say?
m
So consider this:
Your use case has queries mostly for a primary column (eg where customerId = xxx).

If you sort on customerId, then you will always pick contiguous docIds for a given query.

Now consider you have a high cardinality string column that you project in the query.

With dictionary, the fwd index will have dictionary ids, that may point to different disk blocks.

Without dictionary for this high cardinality column, the contiguous docIds will correspond to contiguous disk blocks.
Hopefully that makes sense?
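The scenario above can be sketched roughly as follows, assuming a tiny in-memory table where the list index plays the role of docId. The column names, the sample values, and the helper `doc_id_range` are all made up for illustration and are not Pinot APIs.

```python
from bisect import bisect_left, bisect_right

# Rows are stored sorted by customer_id; the list index stands in for docId.
customer_id = [101, 101, 102, 102, 102, 103]
# A high-cardinality string column stored raw, in the same docId order.
url_raw = ["/q", "/b", "/z", "/a", "/m", "/c"]


def doc_id_range(sorted_col, value):
    """Return the half-open docId range [start, end) matching value in a sorted column."""
    return bisect_left(sorted_col, value), bisect_right(sorted_col, value)


# WHERE customerId = 102 resolves to one contiguous docId range...
start, end = doc_id_range(customer_id, 102)
# ...so projecting the raw column is a single contiguous slice.
projected = url_raw[start:end]
print(start, end, projected)
```

Because the filter yields one contiguous docId range, the projection reads one contiguous run of the raw column rather than following dictionary ids to arbitrary positions.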
n
OK, I think I understand it. I have a question about "sort on customerId": do we mean that all the columns are stored in the same order as customerId? How do we configure the records to be sorted by one column on disk?
m
Yes, that is implicit. A docId represents a row in the table and has to match across columns. Nothing special needs to be done for that.
n
Is docId a theoretical auto-incrementing UUID in Pinot, or a primary key we actually specify? I don't see that Pinot has a concept of a primary key.
Because "A docId represents a row in the table and has to match across columns", I think that for a column-oriented DB, every column is sorted by this docId and compressed by default. That is how the data is laid out on disk.
m
docId is just a contiguous integer (0, 1, 2, 3...) in the scope of a Pinot segment
"I think that for a column-oriented DB, every column is sorted by this docId and compressed by default. That is how the data is laid out on disk."
Hmm, then how do you identify a row across columns? If you sort each column independently, you will lose track of which value in colA corresponds to which value in colB. I am not sure what other column-oriented DBs do, but Pinot does not do this.
n
By sorted, I just mean laid out in docId order.
I think we mean the same thing.
m
Yes, seems so
n
Wait, "dictionaries are sorted": do you mean the docIds are sorted according to the indexed column?
m
dictionary is separate from docId
n
I am sorry, it is not.
OK, back to the raw value forward index: if the data on disk is already in docId order, what is the point of it?
m
From docId you get dictionaryId
The actual data for that dictionary id can be anywhere on disk
Look at the forward index section to understand docId -> dictId -> rawData
n
Thanks, I understand the dictionary-encoded forward index.
I am just not sure why we need to specify the "raw value forward index", because the data on disk is already laid out that way.
m
It is not
The dictionary of a column is generated by "sorting values of the column". dictId = 0 is the first sorted value and so on.
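As a rough illustration of the docId -> dictId -> rawData chain, here is a sketch that builds a dictionary by sorting the distinct values of a column. The function names and sample values are invented for the example, and real on-disk layout details are ignored.

```python
def build_dictionary_index(values):
    """Build a sorted dictionary plus a forward index of dictIds in docId order."""
    dictionary = sorted(set(values))  # dictId 0 is the smallest value
    dict_id_of = {value: dict_id for dict_id, value in enumerate(dictionary)}
    forward_index = [dict_id_of[value] for value in values]
    return dictionary, forward_index


def lookup(dictionary, forward_index, doc_id):
    """Resolve docId -> dictId -> actual value."""
    return dictionary[forward_index[doc_id]]


column = ["/q", "/b", "/z", "/a", "/m", "/c"]  # values in docId order
dictionary, forward_index = build_dictionary_index(column)

print(dictionary)            # values in sorted order
print(forward_index)         # dictIds in docId order
print(forward_index[0:3])    # neighbouring docIds, scattered dictIds
```

The forward index itself is stored in docId order, but the values it points to live in sorted order inside the dictionary, so docIds 0, 1, 2 here resolve to dictIds 4, 1, 5. That extra hop to scattered dictionary positions is what the raw value forward index avoids.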
n
Yes.
Do you mean that, in some cases, the values of the same column are not stored next to each other on disk?